Characterization of the sesame (Sesamum indicum L.) global transcriptome using Illumina paired-end sequencing and development of EST-SSR markers
- Wenliang Wei†1,
- Xiaoqiong Qi†1,
- Linhai Wang1,
- Yanxin Zhang1,
- Wei Hua1,
- Donghua Li1,
- Haixia Lv1 and
- Xiurong Zhang1
© Wei et al; licensee BioMed Central Ltd. 2011
Received: 30 January 2011
Accepted: 19 September 2011
Published: 19 September 2011
Sesame is an important oil crop, but limited transcriptomic and genomic data are currently available. This information is essential to clarify the fatty acid and lignan biosynthesis molecular mechanism. In addition, a shortage of sesame molecular markers limits the efficiency and accuracy of genetic breeding. High-throughput transcriptomic sequencing is essential to generate a large transcriptome sequence dataset for gene discovery and molecular marker development.
Sesame transcriptomes from five tissues were sequenced using Illumina paired-end sequencing technology. The cleaned raw reads were assembled into a total of 86,222 unigenes with an average length of 629 bp. Of the unigenes, 46,584 (54.03%) had significant similarity with proteins in the NCBI nonredundant protein database and Swiss-Prot database (E-value < 10⁻⁵). Of these annotated unigenes, 10,805 and 27,588 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. In total, 22,003 (25.52%) unigenes were mapped onto 119 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG). Furthermore, 44,750 unigenes showed homology to 15,460 Arabidopsis genes based on BLASTx analysis against The Arabidopsis Information Resource (TAIR, Version 10) and revealed relatively high gene coverage. In total, 7,702 unigenes were converted into SSR markers (EST-SSR). Dinucleotide SSRs were the dominant repeat motif (67.07%, 5,166), followed by trinucleotide (24.89%, 1,917), tetranucleotide (4.31%, 332), hexanucleotide (2.62%, 202), and pentanucleotide (1.10%, 85) SSRs. AG/CT (46.29%) was the dominant repeat motif, followed by AC/GT (16.07%), AT/AT (10.53%), AAG/CTT (6.23%), and AGG/CCT (3.39%). Fifty EST-SSRs were randomly selected to validate amplification and to determine the degree of polymorphism in the genomic DNA pools. Forty primer pairs successfully amplified DNA fragments and detected significant amounts of polymorphism among 24 sesame accessions.
This study demonstrates that Illumina paired-end sequencing is a fast and cost-effective approach to gene discovery and molecular marker development in non-model organisms. Our results provide a comprehensive sequence resource for sesame research.
Sesame (Sesamum indicum L.), a member of the Pedaliaceae, is a diploid (2n = 26) dicotyledon and one of the oldest oil seed crops, growing widely in tropical and subtropical areas [1, 2]. Sesame seeds are an important source of oil (44-58%), protein (18-25%), and carbohydrates (13.5%), and are traditionally consumed directly. They are used as active ingredients in antiseptics, bactericides, viricides, disinfectants, moth repellants, and antitubercular agents because they contain natural antioxidants such as sesamin and sesamolin. Among the primary edible oils, sesame oil has the highest antioxidant content and contains abundant fatty acids such as oleic acid (43%), linoleic acid (35%), palmitic acid (11%), and stearic acid (7%). In addition, sesame oil is important in the food industry because of its distinct flavor. These characteristics have stimulated interest in the biochemical and physiological composition of sesame oil.
Previous studies on sesame have mainly focused on quantitative genetics, traditional genetic breeding, and genetic relationships and diversity among sesame germplasm collections [9, 10]. Although much effort has been devoted to cloning key genes and characterizing fatty acid elongation and unsaturated fatty acid biosynthesis in sesame [11–13], the molecular mechanisms behind fatty acid biosynthesis and metabolism remain unclear. Publicly available datasets are of limited use for future sesame research, such as elucidating the molecular mechanisms of specific traits and understanding the complexity of the transcriptome, gene expression regulation, and gene networks. Progress in novel gene discovery and molecular breeding in sesame has been limited by the lack of genomic information. For example, only 3,328 expressed sequence tag (EST) sequences from sesame had been deposited in the dbEST GenBank database as of January 2011.
Molecular markers play an important role in many aspects of plant breeding, such as identification of the genes responsible for desirable traits. Molecular markers have been widely used to map important genes and assist with the breeding of oil crops. However, in sesame, only 10 genomic simple sequence repeat (SSR) and 44 EST-SSR markers have been developed. Genetic relationships and diversity among germplasm collections have been investigated mostly using AFLP, ISSR, and RAPD markers. In sesame, marker-assisted selection and molecular breeding lag behind other crops owing to a lack of effective molecular markers. Thus, a rapid and cost-effective approach to develop molecular markers for sesame is required. Compared with other types of molecular markers, SSRs have many advantages, such as simplicity, effectiveness, abundance, hypervariability, reproducibility, codominant inheritance, and extensive genomic coverage. Based on the original sequences used to identify simple repeats, SSRs can be divided into genomic SSRs and EST-SSRs. Traditional methods to isolate and identify genomic SSRs are costly, labor-intensive, and time-consuming [17, 18]. In addition, the interspecific transferability of genomic SSRs is limited because of either a disappearance of the repeat region or degeneration of the primer binding sites. In contrast, EST-SSRs are derived from expressed sequences, which are more evolutionarily conserved than noncoding sequences; therefore, EST-SSR markers have a relatively high transferability. With the increasing number of ESTs deposited in public databases, an expanding number of EST-SSRs have been developed, and the polymorphism and transferability of EST-SSRs have been evaluated in many plant species [20–30].
The transcriptome is the complete set and quantity of transcripts in a cell at a specific developmental stage or under a physiological condition. The transcriptome provides information on gene expression, gene regulation, and amino acid content of proteins. Therefore, transcriptome analysis is essential to interpret the functional elements of the genome and reveal the molecular constituents of cells and tissues. Transcriptome or EST sequencing is an efficient way to generate functional genomic-level data for non-model organisms. Large collections of EST sequences are invaluable for gene annotation and discovery [31, 32], comparative genomics, development of molecular markers [34, 35], and population genomics studies of genetic variation associated with adaptive traits. Recently, an increasing number of EST datasets have become available for model and non-model organisms, but relatively few ESTs are currently available for sesame.
Numerous technologies have been developed to analyze and quantify the transcriptome. Initially, a traditional sequencing method was used, but this approach is costly, time-consuming, and sensitive to cloning biases since it involves cDNA library construction, cloning, and labor-intensive Sanger sequencing. Because of the deep coverage and single base-pair resolution provided by next-generation sequencing instruments, RNA sequencing (RNA-seq) is an efficient method to analyze transcriptome data. Theoretically, any high-throughput sequencing technology can be used for RNA-seq, such as the Illumina Genome Analyzer, Applied Biosystems' SOLiD, and Roche 454 Life Sciences system. Because 454 pyrosequencing produces longer reads than the other two platforms [37–39], the 454 system is usually adopted for non-model organisms to create a transcriptome database, and a short-read-based technology such as the Solexa platform has been used for resequencing. Recent algorithmic and experimental (e.g., Illumina/Solexa mate-pair and short-read paired-end libraries) advances are likely to increase the applicability of Illumina sequencing and de novo assembly, which has been successfully and increasingly used for model [40, 42–44] and non-model organisms [39, 45–47]. These technologies are efficient, inexpensive, and reliable for genome and transcriptome sequencing, and suitable for non-model organisms such as sesame.
In this study, we sampled the pooled transcriptomes of roots, leaves, shoot tips, flowers, and the developing seeds of sesame using Illumina paired-end sequencing technology to generate a large-scale EST database and develop a set of EST-SSRs. To our knowledge, this study is the first to characterize the complete transcriptome of sesame by analyzing large-scale transcript sequences using an Illumina paired-end sequencing strategy. These EST datasets will serve as a valuable resource for novel gene discovery and marker-assisted selective breeding in sesame.
Illumina paired-end sequencing and de novo assembly
To obtain a global overview of the sesame transcriptome and gene activity at nucleotide resolution, RNA was extracted from five different sesame tissues including the roots, leaves, flowers, developing seeds, and shoot tips, and equally mixed. To minimize systematic bias from transcriptome sampling and Illumina sequencing, and to enhance the accuracy of detecting low-abundance transcripts, three cDNA libraries from the same pooled RNA sample were constructed and sequenced separately using an Illumina HiSeq2000 genome analyzer.
Using a paired-end sequencing strategy, contigs from the same transcript can be identified and the distances between these contigs evaluated. SOAPdenovo allowed us to map the reads back to the contigs and connect the contigs into scaffolds, using 'N' to represent unknown sequences between each pair of contigs. Contigs in the three libraries were assembled into 109,263, 103,440, and 97,951 scaffolds with average lengths of 408 bp, 412 bp, and 406 bp, and with median lengths of 610 bp, 592 bp, and 576 bp in libraries 1, 2, and 3, respectively. The distribution of scaffolds is shown in Figure 1. Although 83.18%, 73.89%, and 72.84% of the scaffolds in libraries 1, 2, and 3, respectively, contained no gaps (Additional file 1), gaps totaling 0.64 Mb, 0.91 Mb, and 0.92 Mb (1.54%, 2.14%, and 2.32% of the total scaffold sequence in the respective libraries) remained unclosed.
To shorten the remaining gaps, we gathered the paired-end reads that had one end mapped to a unique contig and the other end located in a gap region, and used them to fill the small gaps within the scaffolds. Sequences that contained the fewest Ns and could not be extended at either end were defined as unigenes. At this point, more than half of the gaps were filled. For example, in library 1, only 0.20 Mb of gaps (0.49% of the total unigene sequences) remained unclosed (Additional file 1), while in libraries 2 and 3, 0.43 Mb and 0.45 Mb of gaps (1.06% and 1.21% of the total unigene sequences), respectively, remained unclosed. The de novo assembly in libraries 1, 2, and 3 yielded 84,546 unigenes with an average length of 490 bp, 82,709 with an average length of 484 bp, and 78,235 with an average length of 477 bp, respectively. The respective median unigene lengths in the three libraries were 671 bp, 642 bp, and 624 bp (Figure 1).
The contig, scaffold, and unigene size distributions for the three libraries were consistent (Figure 1), implying that the Illumina sequencing solution was reproducible and reliable. Therefore, unigenes from the three libraries were pooled and assembled into nonredundant unigenes for further analysis. In total, 86,222 nonredundant unigenes with a total length of 54.25 Mb, ranging from 200 bp to 12,298 bp, with an average length of 629 bp and a median length of 947 bp, were obtained. The length of 53,969 (62.59%) nonredundant unigenes ranged from 200 to 500 bp, 17,453 (20.24%) ranged from 501 to 1,000 bp, and 14,800 (17.16%) were more than 1,000 bp in length (Figure 1).
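The length summary reported above (count, total length, mean, median, and the three size classes) can be recomputed from any list of sequence lengths. The following is a minimal sketch rather than the authors' pipeline; the function name `length_summary` and the bin labels are illustrative assumptions, and the input is assumed to contain only sequences of at least 200 bp, as in the text.

```python
from statistics import mean, median


def length_summary(lengths):
    """Summarise sequence lengths (bp): count, total, mean, median, size bins.

    Assumes every length is >= 200 bp, matching the lower bound in the text.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    bins = {"200-500": 0, "501-1000": 0, ">1000": 0}
    for length in lengths:
        if length <= 500:
            bins["200-500"] += 1
        elif length <= 1000:
            bins["501-1000"] += 1
        else:
            bins[">1000"] += 1
    return {
        "count": len(lengths),
        "total_bp": sum(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "bins": bins,
    }


# Toy input, not data from the study.
summary = length_summary([200, 300, 500, 501, 1000, 1001, 2000])
print(summary)
```

Dividing each bin count by `count` gives the percentage shares quoted for the three size classes.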
Annotation of all nonredundant unigenes
[Table: Distribution of unigenes on Arabidopsis chromosomes; recovered column header: No. of hits]
Table 2. Unigenes involved in fatty acid biosynthesis (descriptions of the Arabidopsis hits; recovered column header: Arabidopsis hit ID)
- chloroplastic acetylcoenzyme A carboxylase 1
- acetyl Co-enzyme a carboxylase biotin carboxylase subunit
- acetyl Co-enzyme a carboxylase carboxyltransferase alpha subunit
- acetyl-CoA carboxylase carboxyl transferase subunit beta
- acetyl-CoA carboxylase 2
- catalytics; transferases; [acyl-carrier-protein] S-malonyltransferases; binding
- 3-ketoacyl-acyl carrier protein synthase III
- fatty acid biosynthesis 1
- 3-ketoacyl-acyl carrier protein synthase I
- 3-ketoacyl-acyl carrier protein synthase III
- NAD(P)-binding Rossmann-fold superfamily protein
- thioesterase superfamily protein
- NAD(P)-binding Rossmann-fold superfamily protein
- fatA acyl-ACP thioesterase
- fatty acyl-ACP thioesterases B
- fatty acid desaturase 2
Reconstruction of oil accumulation metabolic pathways
According to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, 22,003 unigenes were grouped into 119 cellular metabolic or signaling pathways including cellular growth, differentiation, apoptosis, migration, endocrine, and numerous biosynthesis metabolic pathways (Additional file 2). Because sesame is an important oil crop, previous research has focused mostly on fatty acid and lipid metabolism pathways. The unigenes encoded the majority of enzymes in the fatty acid biosynthesis pathway. Specifically, unigenes encoded enzymes for the biosynthesis of oleic acid, stearic acid (FatA), and palmitic acid (FatA and FatB), the main constituents of sesame seed oil. Additionally, six unigenes (unigene8278, unigene34351, unigene15844, unigene17939, unigene27262, and unigene31569) encoded oleoyl-ACP desaturase (FAD2, EC:1.14.19.-), which catalyzes polyunsaturation of oleoyl-ACP (18:1) to linoleoyl-ACP (18:2). Since oleic and linoleic acids are the major components of sesame oil, FAD2 is a potential biological target to modulate sesame oil composition.
Frequency and distribution of EST-SSRs in the sesame transcriptome
[Table: Frequency of EST-SSRs in sesame]
[Table: Frequency of di- and trinucleotide EST-SSR repeat motifs in sesame]
Identification of polymorphic markers
[Table: Characterization of 40 EST-SSRs among 24 sesame accessions; recovered column headers: Forward primer (5'-3'), Reverse primer (5'-3'), No. of alleles]
Illumina paired-end sequencing
Transcriptome sequencing is an important tool for expression pattern identification and gene discovery. Numerous technologies have been developed to analyze and quantify the transcriptome. For example, traditional EST sequencing methods, such as Sanger sequencing, have made significant contributions to current genomics research and dbEST database construction, but this approach is costly, time-consuming, and sensitive to cloning biases. Next-generation sequencing (NGS) technologies present opportunities for plant genomic analyses with or without a complete genome sequence. Because of the potential for high throughput, accuracy, and the low cost, NGS is widely applied to analyze transcriptomes qualitatively and quantitatively, and has been used successfully for de novo transcriptome sequencing and assembly in many organisms [33, 39, 42–47, 49].
Only 3,328 ESTs from a cDNA library for developing sesame seeds (5-25 days after pollination) have been deposited in the NCBI dbEST database. In the present study, a transcriptome sequencing analysis of mixed RNA from five sesame tissues (root, leaf, flower, developing seed, and shoot tip) was conducted using the Illumina platform. Six Gbp of data were generated and assembled into unigenes. This large number of reads with paired-end information produced much longer unigenes (mean: 629 bp) than those in previous studies [35, 37, 49, 51–53]. This increased transcriptome nucleotide coverage depth facilitated de novo assembly, enhanced the sequencing accuracy, and avoided possible contamination. For example, no transposable element contamination was detected in our database. The unigenes were subjected to BLASTx analysis against the TAIR Version 10 database, and 44,750 unigenes showed homology to 15,460 Arabidopsis genes. Moreover, our database revealed that the unigenes encoded the majority of enzymes involved in fatty acid biosynthesis (Table 2), suggesting that relatively short reads from Illumina paired-end sequencing for a non-model organism can be effectively and accurately assembled.
In our study, 54.03% (46,584 of 86,222) of the sesame unigenes had homologs in the Nr or Swiss-Prot protein databases, whereas in Epimedium sagittatum, whitefly, and sweet potato, only 38.50%, 16.20%, and 46.21% unigenes, respectively, had homologs in the Nr database. The average unigene length in our database was 629 bp, while the length in the above three databases was 246 bp, 266 bp, and 581 bp, respectively. The higher percentage of hits found in our study was partially a result of the increased number of long sequences in our unigene database; the results for whitefly support this conclusion. Homologs in other species were not found for the remaining 45.97% (39,638) of the unique sequences. Specifically, 74.35% of unigenes shorter than 300 bp, and 3.00% of unigenes longer than 1,000 bp, showed no BLAST matches (Figure 2), which suggests that longer contigs were more likely to show BLAST hits in the protein databases. The shorter sequences may lack a characterized protein domain, or they may contain a known protein domain but not show sequence matches due to the short query sequence, resulting in false-negative results. Additionally, only limited genomic and transcriptomic information is currently available for sesame, and consequently, many sesame lineage-specific genes might not be included in current databases.
Many of the sesame unigenes were assigned to GO categories and COG classifications (Figures 3 and 4). Most representative unigenes were mapped to specific pathways, such as metabolism pathways, biosynthesis of secondary metabolites, plant-pathogen interactions, the spliceosome, and starch and sucrose metabolism, using the KEGG database (Additional file 2). Importantly, most of the genes involved in the biosynthesis of fatty acids were identified. Unigenes without BLASTx hits may function as sesame-specific genes.
Our results indicate that high-throughput RNA-seq is an efficient, inexpensive, and reliable platform for transcriptomic analysis in non-model organisms. The large number of sequences generated in this study provides valuable sequence information at the transcriptomic level for novel gene discovery, or for the investigation of sesame molecular mechanisms.
EST-SSR frequency and distribution in the sesame transcriptome
Previously, genetic diversity analysis of sesame germplasm has mostly depended on AFLP, ISSR, and RAPD markers [9, 10, 54–56]. Polymorphic SSR markers play an important role in genetic diversity research, population genetics, linkage mapping, comparative genomics, and association analysis. In the present study, 7,702 perfect microsatellites of at least 12 bp were identified from the sesame EST dataset, and 8.93% of the EST sequences possessed SSRs. The SSR frequency in this study is consistent with the range of frequencies reported for other dicotyledonous species (2.65-16.82%). The EST-SSR frequency depends on several factors, such as genome structure or composition, the algorithm used for SSR detection, and the parameters chosen for the microsatellite search.
Dinucleotide repeats were the most frequent SSR motif type. This finding is consistent with results reported for Arabidopsis, peanut, canola, sugar beet, cabbage, soybean, sunflower, sweet potato, pea, and grape, whereas trinucleotide repeats were the most abundant class of SSRs in cereals such as rice, wheat, and barley. Among the dinucleotide repeats, AG/CT (46.29%) was the most frequent motif in our dataset, whereas CG/CG (0.04%) motifs were very rare. Among the trinucleotide repeats, the AAG/CTT motif was common (6.23%) among the microsatellites. Our results are consistent with those for other plant species [49, 57, 58, 60, 61]. In plants, TC and CTT repeats are typically found in transcribed regions and occur at a high frequency in 5' UTRs; CT microsatellites in 5' UTRs may be involved in antisense transcription and play a role in gene regulation.
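Motif classes such as AG/CT or AAG/CTT group every rotation of a repeat unit together with the rotations of its reverse complement, so that AG, GA, CT, and TC are counted as one class. The sketch below shows one way to derive such a class label; the function names are illustrative assumptions, not part of MISA or of the authors' scripts.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def rotations(motif):
    """All cyclic rotations of a repeat unit, e.g. AG -> [AG, GA]."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def motif_class(motif):
    """Label a repeat unit by its strand-independent class, e.g. GA -> 'AG/CT'."""
    motif = motif.upper()
    # Alphabetically smallest rotation on each strand represents that strand.
    forward = min(rotations(motif))
    reverse = min(rotations(reverse_complement(motif)))
    first, second = sorted((forward, reverse))
    return f"{first}/{second}"


for unit in ("GA", "TC", "AT", "CG", "CTT", "AGG"):
    print(unit, "->", motif_class(unit))
```

Self-complementary units such as AT and CG map to AT/AT and CG/CG, which matches the notation used in the text.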
EST-SSR marker polymorphism
The majority of sesame EST-SSRs generated high-quality amplicons, suggesting that ESTs are suitable for specific primer design. In this study, 45 (90%) of the primer pairs designed from ESTs successfully yielded amplicons. Among the successful primer pairs, 40 of the 45 amplicons were of the expected size. The deviation of five amplicons from the expected size may have been due to the presence of introns [63, 64], large insertions or repeat number variations, a lack of specificity, or assembly errors. The failure of five primer pairs to produce amplicons may have been caused by the location of the primer(s) across splice sites, large introns, chimeric primer(s), or poor-quality sequences. These results suggest that the assembled unigenes were of high quality and that the EST-SSRs identified in our dataset could be used in the future.
Using the EST-SSRs in our dataset, the mean number of alleles per locus (6.55) and the mean He (0.76) and Ho (0.84) were investigated across 24 sesame accessions. The PIC values ranged from 0.46 to 0.82 (mean: 0.70). The difference between He and Ho at all loci may be the result of a very high self-pollination rate within the population. These findings indicated that polymorphism was relatively high, which is corroborated by sesame genomic SSRs. Since we identified 7,702 SSRs in our dataset, more PCR primers could be designed in the future as tools for germplasm polymorphism assessment, quantitative trait loci mapping, and functional gene cloning in sesame.
In this study, a large EST dataset composed of 86,222 unigenes derived from the sesame transcriptome was assembled. These results indicated that Illumina paired-end sequencing is a fast and cost-effective approach to novel gene discovery and molecular marker development in non-model organisms. Based on the generated sequences, 7,702 EST-SSRs were identified and characterized as potential molecular markers. Fifty primer pairs were randomly selected to detect polymorphism among 24 sesame accessions, and 40 (80%) of these primer pairs successfully amplified fragments, revealing abundant polymorphism. The EST-SSR markers developed in this study can be used for construction of high-resolution genetic linkage maps and to perform gene-based association analyses in sesame. To our knowledge, this is the first application of Illumina paired-end sequencing technology to investigate the whole transcriptome of sesame and to assemble RNA-seq reads without a reference genome. The dataset will improve our understanding of the molecular mechanisms of fatty acid biosynthesis, lignan biosynthesis, and other biochemical processes in sesame.
Sample collection and preparation
Sesame cv. Zhongzhi No. 11 was grown at the experimental station of the Oil Crops Research Institute, Chinese Academy of Agricultural Sciences, Wuhan, China. Young roots, leaves, flowers, developing seeds, and shoot tips of plants at anthesis were collected, frozen immediately in liquid nitrogen, and stored at -70°C until use.
RNA extraction and library preparation for transcriptome analysis
Total RNA was isolated using the TRIzol reagent according to the manufacturer's instructions (Invitrogen). The total RNA concentration was quantified using an ultraviolet (UV) spectrophotometer, and RNA quality was assessed on 1.0% denaturing agarose gels. Equal volumes of RNA from each of the five tissues were pooled. The mixed RNA extract was subjected to Solexa sequencing analysis at the Beijing Genomics Institute (BGI; Shenzhen, China). RNA quality and quantity were verified using a NanoDrop 1000 spectrophotometer and an Agilent 2100 Bioanalyzer prior to further processing at BGI. The total RNA was treated with DNase I prior to library construction, and poly-(A) mRNA was purified with Magnetic Oligo (dT) Beads. The mRNA was fragmented by treatment with divalent cations and heat. The cleaved RNA fragments were transcribed into first-strand cDNA using reverse transcriptase and random hexamer primers, followed by second-strand cDNA synthesis using DNA polymerase I and RNaseH. The double-stranded cDNA was then subjected to end-repair using T4 DNA polymerase, the Klenow fragment, and T4 polynucleotide kinase, followed by addition of a single 'A' base using Klenow 3' to 5' exo-polymerase and ligation to an adapter or index adapter using T4 DNA ligase. Adapter-ligated fragments were separated by size on an agarose gel, and the desired range of cDNA fragments (200 ± 25 bp) was excised from the gel. PCR was performed to selectively enrich and amplify the cDNA fragments. After validation with an Agilent 2100 Bioanalyzer and ABI StepOnePlus Real-Time PCR System, the cDNA library was sequenced on a flow cell using an Illumina HiSeq2000 sequencing platform. In total, three replicate cDNA libraries were constructed and sequenced separately to minimize the likelihood of systematic biases and random error in sequencing and to allow for the detection of low-abundance transcripts.
The sequence data were deposited in the NCBI Sequence Read Archive http://www.ncbi.nlm.nih.gov/Traces/sra under accession number SRP006700.
Data filtering and de novo assembly
The cDNA library was sequenced on an Illumina HiSeq2000 sequencing platform. Image deconvolution and quality value calculations were performed using Illumina HCS 1.1 software. The raw reads were cleaned by removing adapter sequences, low-quality sequences (reads with ambiguous bases 'N'), and reads with more than 10% Q < 20 bases. De novo assembly of the clean reads was performed using SOAPdenovo http://soap.genomics.org.cn/soapdenovo.html with the default settings, except for the K-mer value, which was set at a specific value. The best assembly was achieved with K = 29, which was chosen for de Bruijn graph construction. Although a higher K-mer value reduced the number of assembled contigs, it increased the reliability and produced longer contigs. Contigs without ambiguous bases were obtained by conjugating the K-mers in an unambiguous path. Next, the reads were mapped back to the contigs using SOAPdenovo to construct scaffolds with the paired-end information. The program detected contigs from the same transcript as well as the distances between these contigs. SOAPdenovo then connected these contigs, using 'N' to represent the unknown bases between each pair of contigs, thus generating scaffolds. Paired-end reads were used again for scaffold gap filling to obtain sequences that contained the fewest Ns and could not be extended at either end. Such sequences were defined as unigenes. Finally, the overlapping unigenes from the three libraries were assembled into continuous sequences using the overlapping ends of different sequences, and redundant sequences were removed to yield the maximum length nonredundant unigenes using the TIGR Gene Indices Clustering (TGICL) tools. The parameters were set at a similarity of 94% and an overlap length of 100 bp. The assembled unique sequences were deposited in the NCBI Transcriptome Shotgun Assembly database http://www.ncbi.nlm.nih.gov/genbank/TSA.html under accession numbers JL321729-JL346699, JL349641-JL382688, and JL473672-JL478462.
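The read-cleaning criteria described above can be expressed as a small predicate. This is a hedged sketch, not the filtering software used in the study: the function names are hypothetical, adapter removal is not covered, the quality threshold is read as "discard a read if more than 10% of its bases have Q < 20", and the Phred offset used to decode quality strings is an assumption that depends on the FASTQ variant.

```python
def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores.

    The offset is an assumption: 33 for Sanger-style FASTQ, 64 for some
    older Illumina pipelines.
    """
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(sequence, qualities, min_quality=20, max_low_fraction=0.10):
    """Return True if a read survives the two filters described in the text."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    if not sequence:
        return False
    # Filter 1: discard reads containing ambiguous bases.
    if "N" in sequence.upper():
        return False
    # Filter 2: discard reads with more than 10% of bases below Q20.
    low = sum(1 for q in qualities if q < min_quality)
    return low / len(qualities) <= max_low_fraction


# Toy reads, not data from the study.
print(passes_filter("ACGTACGTAC", [30] * 10))             # clean read
print(passes_filter("ACGTACGTAC", [10, 10] + [30] * 8))   # 20% low-quality bases
print(passes_filter("ACGNACGTAC", [30] * 10))             # contains N
```

For paired-end data, a pair would typically be kept only if both mates pass; that policy is also an assumption and is not stated in the text.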
Unigenes were aligned with the NCBI Nr and Swiss-Prot protein databases using BLASTx with an E-value of less than 10⁻⁵. Unigenes that did not have homologs in these databases were scanned using ESTScan. Blast2GO was used to obtain GO annotation of the unigenes based on BLASTx hits against the NCBI Nr database with an E-value threshold of less than 10⁻⁵. WEGO was used for GO functional classification of all unigenes and to plot the distribution of the sesame gene functions. The unigene sequences were also aligned to the COG database to predict and classify functions. Pathway assignments were carried out based on the KEGG database. Additionally, a BLASTx search against the TAIR Version 10 database http://www.arabidopsis.org/ was performed with an E-value threshold of less than 10⁻⁵.
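Applying the E-value threshold amounts to collecting the query sequences that have at least one hit below the cutoff. The sketch below assumes BLAST results in the standard 12-column tab-separated format (query ID in column 1, E-value in column 11); whether the original analysis used this output format is not stated, and the function names and the example IDs are hypothetical.

```python
def annotated_queries(tabular_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit below max_evalue.

    Expects 12-column tab-separated BLAST output; comment lines starting
    with '#' and malformed lines are skipped.
    """
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue
        if float(fields[10]) < max_evalue:
            hits.add(fields[0])
    return hits


def annotation_rate(n_annotated, n_total):
    """Percentage of sequences with at least one significant hit."""
    return 100.0 * n_annotated / n_total


# Hypothetical records for illustration only.
example = [
    "seqA\tprot1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200",
    "seqB\tprot2\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.001\t40",
    "seqA\tprot3\t85.0\t100\t15\t0\t1\t300\t1\t100\t2e-20\t150",
]
print(annotated_queries(example))
print(round(annotation_rate(46584, 86222), 2))
```

The last line reproduces the annotated share from the counts given in the text (46,584 of 86,222 unigenes).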
EST-SSR detection and primer design
Potential SSR markers were detected among the 86,222 unigenes using the MISA tool http://pgrc.ipk-gatersleben.de/misa. The parameters were adjusted for identification of perfect di-, tri-, tetra-, penta-, and hexanucleotide motifs with a minimum of 6, 5, 4, 4, and 4 repeats, respectively. Mononucleotide repeats were ignored since distinguishing genuine mononucleotide repeats from polyadenylation products and single nucleotide stretch errors generated by sequencing was difficult. Primer pairs were designed using BatchPrimer3. The major parameters for primer pair design were set as follows: primer length of 18-23 bases (optimal 20 bases), PCR product size of 100-400 bp (optimal 200 bp), GC content of 40-70% (optimal 50%), and annealing temperatures of 50-60°C (optimal 55°C). Based on these parameters, 50 primer pairs were designed and synthesized for germplasm polymorphism detection in sesame.
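The repeat thresholds above (at least 6, 5, 4, 4, and 4 repeats for di- to hexanucleotide units, mononucleotides ignored) can be illustrated with a simple scanner for perfect repeats. This is a simplified stand-in written for illustration, not MISA itself; it does not handle compound SSRs, and the names `MIN_REPEATS`, `is_lower_order`, and `find_ssrs` are assumptions.

```python
# Minimum repeat count per unit length, taken from the thresholds in the text.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_lower_order(motif):
    """True if the unit is itself a repeat of a shorter unit (AA, ATAT, ...)."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(sequence):
    """Return (start, unit, repeat_count) for each perfect SSR found.

    Start positions are 0-based. Units that reduce to a shorter unit are
    skipped, which also excludes mononucleotide runs.
    """
    seq = sequence.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        i = 0
        while i + unit_len * min_rep <= len(seq):
            motif = seq[i:i + unit_len]
            if set(motif) <= set("ACGT") and not is_lower_order(motif):
                repeats = 1
                # Extend while the next unit-sized window equals the unit.
                while seq.startswith(motif, i + repeats * unit_len):
                    repeats += 1
                if repeats >= min_rep:
                    found.append((i, motif, repeats))
                    i += repeats * unit_len
                    continue
            i += 1
    return sorted(found)


# Toy sequences, not data from the study.
print(find_ssrs("TTAGAGAGAGAGAGCC"))
print(find_ssrs("CTT" * 5))
```

With these thresholds the shortest reportable SSR is 12 bp (six dinucleotide repeats), which is consistent with the minimum SSR length mentioned in the discussion.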
Survey of EST-SSR polymorphism
Twenty-four sesame accessions including Chinese landraces, cultivars, and foreign collections (Additional file 4) were selected for polymorphism investigation with the EST-SSRs. Total DNA was isolated from sesame seedlings using the CTAB method. PCR amplifications were conducted in a final volume of 10 μL containing 50 ng template DNA, 1× PCR buffer, 2.0 mM MgCl2, 2.5 mM dNTPs, 4 μM of each primer, and 0.8 U Taq polymerase (Fermentas). The PCR reaction cycling profile was 94°C for 4 min followed by 35 cycles at 94°C for 40 s, 55°C for 40 s, 72°C for 1 min, and a final extension at 72°C for 10 min. The separation of alleles was performed on a 6% polyacrylamide gel with a 50-bp DNA marker (Promega) to calculate the length of the EST-SSR amplicons. PCR products were mixed with a half volume of loading buffer. The mixture was denatured at 95°C for 4 min before being loaded on the gel. Gels were stained with silver nitrate as previously described. Perfect amplified loci were tested for polymorphism by genotyping 24 sesame accessions. The genetic diversity and mean allele number were calculated using Popgene version 1.32. Polymorphic information content (PIC) was obtained with PIC_CALC and GenAlex6.
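For readers unfamiliar with the diversity statistics named above, the following sketch computes the number of alleles, expected heterozygosity (He), observed heterozygosity (Ho), and PIC for a single locus from diploid genotypes. It uses the common textbook formulas (He = 1 − Σp², and PIC = 1 − Σp² − ΣΣ2p²ᵢp²ⱼ for i < j) without small-sample correction; whether Popgene, PIC_CALC, or GenAlex apply exactly these forms is an assumption, and the function name and genotype encoding are illustrative.

```python
from collections import Counter


def locus_statistics(genotypes):
    """Compute allele count, He, Ho and PIC for one locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid accession.
    """
    if not genotypes:
        raise ValueError("genotypes must not be empty")
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    freqs = [count / total for count in counts.values()]

    # Expected heterozygosity from allele frequencies (uncorrected).
    he = 1.0 - sum(p * p for p in freqs)
    # PIC subtracts the cross term over all distinct allele pairs.
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    pic = he - cross
    # Observed heterozygosity: share of accessions carrying two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    return {"alleles": len(counts), "He": he, "Ho": ho, "PIC": pic}


# Toy genotypes (allele sizes in bp), not data from the study.
print(locus_statistics([(100, 100), (100, 104), (104, 104), (100, 104)]))
```

Averaging these per-locus values over all scored loci gives summary figures of the kind reported for the 24 accessions.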
The authors are grateful to Dr. Yi Huang from OCRI-CAAS for assistance with manuscript editing. We thank the anonymous referees and the editor for their comments and suggestions that helped improve the manuscript. This work was supported by the National Basic Research Program of China (973 Program) (no. 2011CB109304), the National Natural Science Foundation of China (no. 30871552), and China Agriculture Research System (no. CARS-15).
- Ashri A: Sesame breeding. Plant Breeding Reviews. Volume 16. Edited by: Janick J. 2010, Oxford: John Wiley & Sons
- Bedigian D, Harlan J: Evidence for cultivation of sesame in the ancient world. Econ Bot. 1986, 40 (2): 137-154. 10.1007/BF02859136.
- Bedigian D, Seigler DS, Harlan JR: Sesamin, sesamolin and the origin of sesame. Biochem Syst Ecol. 1985, 13 (2): 133-139. 10.1016/0305-1978(85)90071-7.
- Fukuda Y, Nagata M, Osawa T, Namiki M: Contribution of lignan analogues to antioxidative activity of refined unroasted sesame seed oil. J Am Oil Chem Soc. 1986, 63 (8): 1027-1031. 10.1007/BF02673792.
- Cheung SC, Szeto YT, Benzie IF: Antioxidant protection of edible oils. Plant Foods Hum Nutr. 2007, 62 (1): 39-42. 10.1007/s11130-006-0040-6.
- Chung CH, Yee YJ, Kim DH, Kim HK, Chung DS: Changes of lipid, protein, RNA and fatty acid composition in developing sesame (Sesamum indicum L.) seeds. Plant Sci. 1995, 109 (2): 237-243. 10.1016/0168-9452(95)04160-V.
- Wei L-B, Zhang H-Y, Zheng Y-Z, Miao H-M, Zhang T-Z, Guo W-Z: A genetic linkage map construction for sesame (Sesamum indicum L.). Genes & Genomics. 2009, 31 (2): 199-208. 10.1007/BF03191152.
- Were BAi: Genetic improvement of oil quality in sesame (Sesamum indicum L.): assembling tools. PhD thesis. 2006, Swedish University of Agricultural Sciences, Department of Crop Science
- Laurentin H, Karlovsky P: Genetic relationship and diversity in a sesame (Sesamum indicum L.) germplasm collection using amplified fragment length polymorphism (AFLP). BMC Genetics. 2006, 7 (1): 10.
- Ercan AG, Taskin M, Turgut K: Analysis of genetic diversity in Turkish sesame (Sesamum indicum L.) populations using RAPD markers. Genet Resour Crop Ev. 2004, 51 (6): 599-607.
- Yukawa Y, Takaiwa F, Shoji K, Masuda K, Yamada K: Structure and expression of two seed-specific cDNA clones encoding stearoyl-acyl carrier protein desaturase from sesame, Sesamum indicum L. Plant Cell Physiol. 1996, 37 (2): 201-205.
- Jin U-H, Lee J-W, Chung Y-S, Lee J-H, Yi Y-B, Kim Y-K, Hyung N-I, Pyee J-H, Chung C-H: Characterization and temporal expression of a ω-6 fatty acid desaturase cDNA from sesame (Sesamum indicum L.) seeds. Plant Sci. 2001, 161 (5): 935-941. 10.1016/S0168-9452(01)00489-7.
- Kim M, Kim H, Shin J, Chung C-H, Ohlrogge J, Suh M: Seed-specific expression of sesame microsomal oleic acid desaturase is controlled by combinatorial properties between negative cis-regulatory elements in the SeFAD2 promoter and enhancers in the 5'-UTR intron. Mol Genet Genomics. 2006, 276 (4): 351-368. 10.1007/s00438-006-0148-2.
- Dixit A, Jin M-H, Chung J-W, Yu J-W, Chung H-K, Ma K-H, Park Y-J, Cho E-G: Development of polymorphic microsatellite markers in sesame (Sesamum indicum L.). Mol Ecol Notes. 2005, 5 (4): 736-738. 10.1111/j.1471-8286.2005.01048.x.
- Wei L-B, Zhang H-Y, Zheng Y-Z, Guo W-Z, Zhang T-Z: Development and utilization of EST-derived microsatellites in sesame (Sesamum indicum L.). Acta Biochim Biophys Sin (Shanghai). 2008, 34 (12): 2077-2084.
- Powell W, Morgante M, Andre C, Hanafey M, Vogel J, Tingey S, Rafalski A: The comparison of RFLP, RAPD, AFLP and SSR (microsatellite) markers for germplasm analysis. Mol Breed. 1996, 2 (3): 225-238. 10.1007/BF00564200.
- Zane L, Bargelloni L, Patarnello T: Strategies for microsatellite isolation: a review. Mol Ecol. 2002, 11 (1): 1-16. 10.1046/j.0962-1083.2001.01418.x.
- Squirrell J, Hollingsworth PM, Woodhead M, Russell J, Lowe AJ, Gibby M, Powell W: How much effort is required to isolate nuclear microsatellites from plants?. Mol Ecol. 2003, 12 (6): 1339-1348. 10.1046/j.1365-294X.2003.01825.x.
- Rungis D, Berube Y, Zhang J, Ralph S, Ritland CE, Ellis BE, Douglas C, Bohlmann J, Ritland K: Robust simple sequence repeat markers for spruce (Picea spp.) from expressed sequence tags. Theor Appl Genet. 2004, 109 (6): 1283-1294. 10.1007/s00122-004-1742-5.
- Scott KD, Eggler P, Seaton G, Rossetto M, Ablett EM, Lee LS, Henry RJ: Analysis of SSRs derived from grape ESTs. Theor Appl Genet. 2000, 100 (5): 723-726. 10.1007/s001220051344.
- Luro FL, Costantino G, Terol J, Argout X, Allario T, Wincker P, Talon M, Ollitrault P, Morillon R: Transferability of the EST-SSRs developed on Nules clementine (Citrus clementina Hort ex Tan) to other Citrus species and their effectiveness for genetic mapping. BMC Genomics. 2008, 9: 287. 10.1186/1471-2164-9-287.
- Aggarwal RK, Hendre PS, Varshney RK, Bhat PR, Krishnakumar V, Singh L: Identification, characterization and utilization of EST-derived genic microsatellite markers for genome analyses of coffee and related species. Theor Appl Genet. 2007, 114 (2): 359-372. 10.1007/s00122-006-0440-x.
- Poncet V, Rondeau M, Tranchant C, Cayrel A, Hamon S, de Kochko A, Hamon P: SSR mining in coffee tree EST databases: potential use of EST-SSRs as markers for the Coffea genus. Mol Genet Genomics. 2006, 276 (5): 436-449. 10.1007/s00438-006-0153-5.
- Eujayl I, Sledge MK, Wang L, May GD, Chekhovskiy K, Zwonitzer JC, Mian MA: Medicago truncatula EST-SSRs reveal cross-species genetic markers for Medicago spp. Theor Appl Genet. 2004, 108 (3): 414-422. 10.1007/s00122-003-1450-6.
- Peakall R, Gilmore S, Keys W, Morgante M, Rafalski A: Cross-species amplification of soybean (Glycine max) simple sequence repeats (SSRs) within the genus and other legume genera: implications for the transferability of SSRs in plants. Mol Biol Evol. 1998, 15 (10): 1275-1287.
- Thiel T, Michalek W, Varshney RK, Graner A: Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet. 2003, 106 (3): 411-422.
- Zhang LY, Bernard M, Leroy P, Feuillet C, Sourdille P: High transferability of bread wheat EST-derived SSRs to other cereals. Theor Appl Genet. 2005, 111 (4): 677-687. 10.1007/s00122-005-2041-5.
- Cho YG, Ishii T, Temnykh S, Chen X, Lipovich L, McCouch SR, Park WD, Ayres N, Cartinhour S: Diversity of microsatellites derived from genomic libraries and GenBank sequences in rice (Oryza sativa L.). Theor Appl Genet. 2000, 100 (5): 713-722. 10.1007/s001220051343.
- Cordeiro GM, Casu R, McIntyre CL, Manners JM, Henry RJ: Microsatellite markers from sugarcane (Saccharum spp.) ESTs cross transferable to erianthus and sorghum. Plant Sci. 2001, 160 (6): 1115-1123. 10.1016/S0168-9452(01)00365-X.
- Liewlaksaneeyanawin C, Ritland CE, El-Kassaby YA, Ritland K: Single-copy, species-transferable microsatellite markers developed from loblolly pine ESTs. Theor Appl Genet. 2004, 109 (2): 361-369.
- Bouck A, Vision T: The molecular ecologist's guide to expressed sequence tags. Mol Ecol. 2007, 16 (5): 907-924.
- Emrich SJ, Barbazuk WB, Li L, Schnable PS: Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 2006, 16 (12): 1-5.
- Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol. 2008, 17 (7): 1636-1647. 10.1111/j.1365-294X.2008.03666.x.
- Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS: SNP discovery via 454 transcriptome sequencing. Plant J. 2007, 51 (5): 910-918. 10.1111/j.1365-313X.2007.03193.x.
- Novaes E, Drost DR, Farmerie WG, Pappas GJ, Grattapaglia D, Sederoff RR, Kirst M: High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics. 2008, 9: 312. 10.1186/1471-2164-9-312.
- Namroud MC, Beaulieu J, Juge N, Laroche J, Bousquet J: Scanning the genome for gene single nucleotide polymorphisms involved in adaptive population differentiation in white spruce. Mol Ecol. 2008, 17 (16): 3599-3613. 10.1111/j.1365-294X.2008.03840.x.
- Parchman T, Geist K, Grahnen J, Benkman C, Buerkle CA: Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics. 2010, 11 (1): 180. 10.1186/1471-2164-11-180.
- Sun C, Li Y, Wu Q, Luo H, Sun Y, Song J, Lui E, Chen S: De novo sequencing and analysis of the American ginseng root transcriptome using a GS FLX Titanium platform to discover putative genes involved in ginsenoside biosynthesis. BMC Genomics. 2010, 11 (1): 262. 10.1186/1471-2164-11-262.
- Collins LJ, Biggs PJ, Voelckel C, Joly S: An approach to transcriptome analysis of non-model organisms using short-read sequences. Genome Inform. 2008, 21: 3-14.
- Trick M, Long Y, Meng J, Bancroft I: Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing. Plant Biotechnol J. 2009, 7 (4): 334-346. 10.1111/j.1467-7652.2008.00396.x.
- Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS de novo assembly of whole-genome shotgun microreads. Genome Res. 2008, 18 (5): 810-820. 10.1101/gr.7337908.
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20 (2): 265-272. 10.1101/gr.097261.109.
- Hegedus Z, Zakrzewska A, Agoston VC, Ordas A, Racz P, Mink M, Spaink HP, Meijer AH: Deep sequencing of the zebrafish transcriptome response to mycobacterium infection. Mol Immunol. 2009, 46 (15): 2918-2930. 10.1016/j.molimm.2009.07.002.
- Wang B, Guo G, Wang C, Lin Y, Wang X, Zhao M, Guo Y, He M, Zhang Y, Pan L: Survey of the transcriptome of Aspergillus oryzae via massively parallel mRNA sequencing. Nucleic Acids Res. 2010, 38 (15): 5075-5087. 10.1093/nar/gkq256.
- Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, et al: The sequence and de novo assembly of the giant panda genome. Nature. 2010, 463 (7279): 311-317. 10.1038/nature08696.
- Wang X-W, Luan J-B, Li J-M, Bao Y-Y, Zhang C-X, Liu S-S: De novo characterization of a whitefly transcriptome and analysis of its gene expression during development. BMC Genomics. 2010, 11 (1): 400. 10.1186/1471-2164-11-400.
- Wu T, Qin Z, Zhou X, Feng Z, Du Y: Transcriptome profile analysis of floral sex determination in cucumber. J Plant Physiol. 2010, 167 (11): 905-913. 10.1016/j.jplph.2010.02.004.
- Iseli C, Jongeneel CV, Bucher P: ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol. 1999, 138-148.
- Zeng S, Xiao G, Guo J, Fei Z, Xu Y, Roe B, Wang Y: Development of a EST dataset and characterization of EST-SSRs in a traditional Chinese medicinal plant, Epimedium sagittatum (Sieb. Et Zucc.) Maxim. BMC Genomics. 2010, 11 (1): 94. 10.1186/1471-2164-11-94.
- Chung Suh M, Jung Kim M, Hur C-G, Myung Bae J, In Park Y, Chung C-H, Kang C-W, Ohlrogge JB: Comparative analysis of expressed sequence tags from Sesamum indicum and Arabidopsis thaliana developing seeds. Plant Mol Biol. 2003, 52 (6): 1107-1123.
- Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol. 2008, 17 (7): 1636-1647. 10.1111/j.1365-294X.2008.03666.x.
- Wang Z, Fang B, Chen J, Zhang X, Luo Z, Huang L, Chen X, Li Y: De novo assembly and characterization of root transcriptome using Illumina paired-end sequencing and development of cSSR markers in sweetpotato (Ipomoea batatas). BMC Genomics. 2010, 11 (1): 726. 10.1186/1471-2164-11-726.
- Meyer E, Aglyamova GV, Wang S, Buchanan-Carter J, Abrego D, Colbourne JK, Willis BL, Matz MV: Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics. 2009, 10: 219. 10.1186/1471-2164-10-219.
- Laurentin H, Ratzinger A, Karlovsky P: Relationship between metabolic and genomic diversity in sesame (Sesamum indicum L.). BMC Genomics. 2008, 9 (1): 250. 10.1186/1471-2164-9-250.
- Kim DH, Zur G, Danin-Poleg Y, Lee SW, Shim KB, Kang CW, Kashi Y: Genetic relationships of sesame germplasm collection as revealed by inter-simple sequence repeats. Plant Breed. 2002, 121 (3): 259-262. 10.1046/j.1439-0523.2002.00700.x.
- Bisht IS, Mahajan RK, Loknathan TR, Agrawal RC: Diversity in Indian sesame collection and stratification of germplasm accessions in different diversity groups. Genet Resour Crop Ev. 1998, 45 (4): 325-335. 10.1023/A:1008652420477.
- Kumpatla SP, Mukhopadhyay S: Mining and survey of simple sequence repeats in expressed sequence tags of dicotyledonous species. Genome. 2005, 48 (6): 985-998. 10.1139/g05-060.
- Toth G, Gaspari Z, Jurka J: Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 2000, 10 (7): 967-981. 10.1101/gr.10.7.967.
- La Rota M, Kantety R, Yu J-K, Sorrells M: Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genomics. 2005, 6 (1): 23. 10.1186/1471-2164-6-23.
- Morgante M, Hanafey M, Powell W: Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat Genet. 2002, 30 (2): 194-200. 10.1038/ng822.
- Kantety RV, La Rota M, Matthews DE, Sorrells ME: Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat. Plant Mol Biol. 2002, 48: 501-510. 10.1023/A:1014875206165.
- Martienssen RA, Colot V: DNA methylation and epigenetic inheritance in plants and filamentous fungi. Science. 2001, 293 (5532): 1070-1074. 10.1126/science.293.5532.1070.
- Saha M, Mian M, Eujayl I, Zwonitzer J, Wang L, May G: Tall fescue EST-SSR markers with transferability across several grass species. Theor Appl Genet. 2004, 109 (4): 783-791. 10.1007/s00122-004-1681-1.
- Varshney RK, Graner A, Sorrells ME: Genic microsatellite markers in plants: features and applications. Trends Biotechnol. 2005, 23 (1): 48-55. 10.1016/j.tibtech.2004.11.005.
- Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005, 21 (18): 3674-3676. 10.1093/bioinformatics/bti610.
- Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, Wang J, Li S, Li R, Bolund L: WEGO: a web tool for plotting GO annotations. Nucleic Acids Res. 2006, 34 (Web Server): W293-297. 10.1093/nar/gkl031.
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28 (1): 27-30. 10.1093/nar/28.1.27.
- Kortt AA, Caldwell JB, Lilley GG, Higgins TJV: Amino acid and cDNA sequences of a methionine-rich 2S protein from sunflower seed (Helianthus annuus L.). Eur J Biochem. 1991, 195 (2): 329-334. 10.1111/j.1432-1033.1991.tb15710.x.
- Porebski S, Bailey L, Baum B: Modification of a CTAB DNA extraction protocol for plants containing high polysaccharide and polyphenol components. Plant Mol Biol Rep. 1997, 15 (1): 8-15. 10.1007/BF02772108.
- Bassam BJ, Caetano-Anolles G, Gresshoff PM: Fast and sensitive silver staining of DNA in polyacrylamide gels. Anal Biochem. 1991, 196 (1): 80-83. 10.1016/0003-2697(91)90120-I.
- Yeh FC, Boyle TJB: Population genetic analysis of co-dominant and dominant markers and quantitative traits. Belg J Bot. 1997, 129: 157-
- Peakall ROD, Smouse PE: GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Mol Ecol Notes. 2006, 6 (1): 288-295. 10.1111/j.1471-8286.2005.01155.x.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.