- Research article
- Open Access
Transcriptome sequencing of field pea and faba bean for discovery and validation of SSR genetic markers
BMC Genomics volume 13, Article number: 104 (2012)
Field pea (Pisum sativum L.) and faba bean (Vicia faba L.) are cool-season grain legume species that provide rich sources of food for humans and fodder for livestock. To date, both species have been relative 'genomic orphans' due to limited availability of genetic and genomic information. A significant enrichment of genomic resources is consequently required in order to understand the genetic architecture of important agronomic traits, and to support germplasm enhancement, genetic diversity, population structure and demographic studies.
cDNA samples obtained from various tissue types of specific field pea and faba bean genotypes were sequenced using 454 Roche GS FLX Titanium technology. A total of 720,324 and 304,680 reads for field pea and faba bean, respectively, were de novo assembled to generate sets of 70,682 and 60,440 unigenes. Consensus sequences were compared against the genome of the model legume species Medicago truncatula Gaertn., as well as that of the more distantly related, but better-characterised genome of Arabidopsis thaliana L.. In comparison to M. truncatula coding sequences, 11,737 and 10,179 unique hits were obtained from field pea and faba bean. Totals of 22,057 field pea and 18,052 faba bean unigenes were subsequently annotated from GenBank. Comparison to the genome of soybean (Glycine max L.) resulted in 19,451 unique hits for field pea and 16,497 unique hits for faba bean, corresponding to c. 35% and 30% of the known gene space, respectively. Simple sequence repeat (SSR)-containing expressed sequence tags (ESTs) were identified from consensus sequences, and totals of 2,397 and 802 primer pairs were designed for field pea and faba bean. Subsets of 96 EST-SSR markers were screened for validation across modest panels of field pea and faba bean cultivars, as well as related non-domesticated species. For field pea, 86 primer pairs successfully obtained amplification products from one or more template genotypes, of which 59% revealed polymorphism between 6 genotypes. In the case of faba bean, 81 primer pairs displayed successful amplification, of which 48% detected polymorphism.
The generation of EST datasets for field pea and faba bean has permitted effective unigene identification and functional sequence annotation. EST-SSR loci were detected at incidences of 14-17%, permitting design of comprehensive sets of primer pairs. The subsets from these primer pairs proved highly useful for polymorphism detection within Pisum and Vicia germplasm.
The Fabaceae (Leguminosae) is the third largest angiosperm family, containing c. 18,000 species attributed to 650 genera [1–3]. Legumes provide major benefits to cropping systems and the environment, due to the ability to perform symbiotic nitrogen fixation. In comparison to cereals, for which a broad range of genetic and genomic resources are available, genomic databases for legumes are generally still underdeveloped. However, recent advances in sequencing and genotyping technologies offer the opportunity to rapidly ameliorate the status of given species at relatively low cost . Major efforts are currently being directed towards the development of species-specific genomic tools and datasets. As an example, the whole genome sequence of soybean, a warm-season grain legume, has recently been determined http://www.phytozome.net/soybean.
Cool-season food legumes within the Hologalegina clade of the Fabaceae sub-family Papilionoideae, which includes lentil, chickpea, field pea and faba bean (pulses), are important food and fodder crops, especially in developing countries such as those of the Indian sub-continent . These species are important components of farming systems across Western Asia, the Middle East, North Africa, the Indian sub-continent, North America and Australia. In Australia, pulses are sown over c. 2 million hectares and produce c. 2.5 million tonnes of grain with a commodity value of over AU$ 675 million . Despite close phylogenetic relationships, pulse species vary considerably in aspects of biology such as genome size, fundamental chromosome number, ploidy level, and degree of reproductive self-compatibility. The genome size of chickpea is relatively small (c. 700 Mb), but pulses of the Vicieae tribe (lentil, pea and faba bean) exhibit much larger genome sizes (in the range from 4-13 Gb). Recently, generation of large-scale lentil transcriptome data by our group has substantially increased the volume of publicly available genomic data for this species . Similar strategies have been pursued for field pea and faba bean in the current study.
Field pea, which is the third most globally important grain legume crop (at 5.5 million hectares per year) after soybean and common bean (Phaseolus vulgaris L.), is a self-pollinating diploid (2n = 2x = 14) species with a genome size of c. 5 Gbp . Various studies have been performed to determine the genetic basis of multiple phenotypic traits in field pea [9–11] and to quantify diversity between different pea cultivars [12–16]. Recently, a comprehensive transcriptome analysis of field pea has been performed using second-generation sequencing technologies  that will contribute significantly to the enrichment of genomics resources for field pea. In contrast, faba bean has not been widely adopted on a global basis. In terms of cultivation area, this species ranks fourth among the cool-season food legumes (at 2.6 million hectares per year) after field pea, chickpea and lentil http://faostat.fao.org. Faba bean has been traditionally cultivated in the Mediterranean basin, the Nile valley, Ethiopia, Central and East Asia, Latin America, Northern Europe, North America and Australia . Faba bean is a diploid taxon (2n = 2x = 12), and exhibits facultative cross-pollination at frequencies ranging from 4-84%. The nuclear genome size of faba bean is one of the largest yet described among crop legumes, at c. 13 Gb. Formal genetic analysis of faba bean, such as through genetic linkage mapping and identification of quantitative trait loci (QTLs), has so far been hindered by these aspects of biology .
Conventional breeding methods based on phenotypic assessment are currently in use for breeding line selection in field pea and faba bean. Such methods are logistically demanding and time-consuming, especially for traits that require specific biotic or abiotic challenges, such as resistance to individual diseases. In addition to this, when breeding for types eaten as immature seed, quality testing adds considerable complexity to the relevant programs. There is consequently a major requirement for species-specific molecular genetic markers and derived linkage maps for field pea and faba bean, to enable germplasm advancement through genomics-assisted selection.
Current publicly available genetic and genomic tools for field pea and faba bean are limited in extent [20–23], comprising 18,552 and 5,253 ESTs, respectively that are available in Genbank. In addition to this, a recently sequenced Pisum sativum transcriptome generated a total of 81,449 unigenes that are also available for download as a fully annotated fasta format . Second-generation DNA sequencing systems such as the Roche 454 massively-parallel pyrosequencing platform are capable of rapidly producing species-specific genomic resources to address these short-comings. This system can generate 4-6 × 108 bp from each run, with individual read lengths of 400-500 bp , and is suitable for de novo sequencing of small genomes , whole genome resequencing , SNP detection , and in particular, sequencing of transcriptomes .
ESTs obtained from the latter activity provide valuable resources for gene discovery, large-scale expression analysis, improved genome annotation, elucidation of phylogenetic relationships and facilitation of breeding programs for both plants and animals through provision of SSR and single nucleotide polymorphism (SNP) genetic markers . SSR loci have been widely used for improvement of a range of crop species . Only a limited number of SSRs are available in public domain for field pea and faba bean, creating an incentive for further discovery and validation. In comparison with genomic DNA-derived SSRs, those located in ESTs are functionally associated with genic regions, and support potential diagnostic genetic marker development [31–34].
This study describes the development, de novo assembly and gene annotation of a transcriptome dataset derived from cDNA samples obtained from several tissues at various stages of development of multiple field pea and faba bean genotypes. Clustering and annotation to generate a unigene set has permitted computational identification of SSR loci, and the design and evaluation of a set of EST-SSR marker-directed primer pairs.
Materials and methods
Seeds of field pea were obtained from the Australian Temperate Field Crops Collection (ATFCC) held at the Department of Primary Industries, Horsham, Victoria, Australia. Faba bean seeds were obtained from the Australian faba bean breeding program at The University of Adelaide, South Australia, Australia. Three to four seeds from each variety of field pea (Parafield, Yarrum, Kaspa, 96-286*) and faba bean (Icarus, Ascot) were selected based on the criteria of genetic diversity and significant agronomic variation, and were sown into commercial potting mix. These genotypes were also potential parents for the genetic mapping populations of field pea and faba bean, to be used to dissect various traits of interest. Germinated plantlets were grown to maturity under glasshouse conditions with natural light at the Department of Primary Industries, Bundoora, Victoria, Australia. Selected plant tissues were harvested for RNA isolation from plants at various stages of development, including leaf (young and mature), stem, flowers, immature pods, mature pods and immature seeds. A total of 4-8 seeds were also germinated in Petri dishes in order to provide material for harvest of seedling root and shoot samples. All of the vegetative plant tissues (leaf and stem) were pooled for RNA isolation and designated LS (leaf/stem) tissue. All of the reproductive organs including flowers, immature pods, mature pods and immature seeds were also pooled for RNA isolation and designated FS (flower/seed) tissue. The seedling-derived root (RG) and shoot (SG) samples were used separately for RNA isolation.
RNA isolation and cDNA preparation
Total RNA isolation and cDNA synthesis were performed as described in an equivalent study performed for lentil .
EST sequence generation, assembly and annotation
cDNAs obtained from the four distinct RNA pools (LS, FS, RG and SG) were combined in equimolar ratio before proceeding to GS FLX library preparation. Approximately 5 μg of bulked cDNA was sheared by nebulisation at 206 kPa for 2-4 min. The GS FLX Titanium shotgun libraries were constructed following manufacturer's instructions (Roche Diagnostics, Castle Hill, NSW, Australia). The ssDNA libraries were quantified using real-time quantitative PCR. Finally, emulsion (em) PCR was performed using the Lib-L emPCR protocol (Roche Diagnostics, Castle Hill, NSW, Australia). The enriched beads obtained as a result of em-PCR were loaded onto picotitre plates for sequencing. All of the pooled cDNA libraries obtained from different genotypes of field pea and faba bean were separately sequenced on individual quarters of picotitre plates.
All sequence reads generated from different genotypes were de novo assembled using the Next Gene software (Softgenetics, State College, Pennsylvania, USA). The adaptor and primer sequences were removed prior to the assembly using the 'trimming' function (trim sequences with 100% similarity to the primer/adaptor sequence). De novo assembly was performed using the Greedy algorithm and error correction condensation. The Greedy algorithm searches for maximum overlap between reads and extends the overlap to form large contigs and is recommended for 454 reads or reads with average read length > 70 bp. The error correction condensation tool functions by dividing sequence reads in which homopolymers are found and at least 16 bases intervene between the homopolymer runs. These shorter reads were termed keywords, and comparison of keywords between reads allowed the correct determination of the bases at the end of each keyword. Sequence reads that contain variations of low frequency were then corrected.
Assembled contig outputs were deposited in the Transcriptome Shotgun Assembly (TSA) of GenBank (field pea; JR950756-JR964200 and faba bean; JR964201-JR970413). Contigs and singletons were compared against the M. truncatula (Mt 3.0), A. thaliana (TAIR 9 CDS [coding sequences]), G. max (Glyma 1.0) and P. sativum transcriptome databases using BLASTN  with a threshold E value of 10-10. Both field pea and faba bean unigene sets were also BLASTN analysed against respective EST and nucleotide sequences publicly available in GenBank. BLASTN analysis was also performed in the non-redundant database of GenBank using the tBLASTX algorithm to derive putative annotations of the unigene set. Gene ontology (GO) terms were assigned to unigenes that showed hits against the Arabidopsis thaliana database using the 'Gene Ontology at TAIR' tool.
Discovery of EST-SSRs, primer design and marker validation
Detection of EST-SSR loci and primer pair design was performed using the Batch Primer3 software http://probes.pw.usda.gov/cgi-bin/batchprimer3/batchprimer3.cgi. The parameters were designed for identification of perfect di-, tri-, tetra-, penta-, and hexanucleotide motifs with minimum of repeat numbers of 6, 4, 3, 3, and 3, respectively. Primer design parameters were set as follows: length range = 18 to 23 nucleotides with 21 as optimum; PCR product size range = 100 to 400 bp; optimum annealing temperature = 55°C; and GC content 40-60%, with 50% as optimum.
Genomic DNA was extracted from target plant genotypes for EST-SSR marker validation using the DNeasy® 96 Plant Kit (QIAGEN), following the manufacturer's instructions. Frozen leaf tissue from each genotype was used for each extraction and ground using a Mixer Mill 300 (Retsch®, Rheinische Straße, Haan, Germany). DNA was resuspended in 50 μl of water and dilutions were performed to obtain a final concentration of 10 ng/μl, followed by storage at -20°C. A collection of randomly selected EST-SSR primer pairs were validated experimentally, forward primers being synthesised with addition of a bacteriophage M13-matching sequence, to enable fluorescent tail addition through the PCR amplification process
. PCR conditions included a hot-start at 95°C for 10 minutes, followed by 10 cycles of 94°C for 30 s, 60-50°C for 30 s and 72°C for 30 s, followed by 25 cycles of 94°C for 30 s, 50°C for 30 s and 72°C for 30 s and a final elongation step of 72°C for 10 min. PCR products were separated using an ABI3730xl (Applied Biosystems, Foster City, California, USA) according to manufacturer's instructions with the addition of the ABI GeneScan LIZ500 size standard and amplification product sizes were determined using the GeneMapper® v3.7 software (Applied Biosystems).
EST sequencing and de novo assembly
A total of 720,324 and 304,680 reads were generated from a range of sampled tissues from 4 field pea genotypes and 2 faba bean genotypes, respectively, using the GS FLX Titanium chemistry. In addition to adaptor/primer sequence trimming, strings of 30-40 nucleotides from both the 5'- and 3'-termini of each sequence read were removed in order to generate high confidence data. Table 1 summarises the sequence output data for each species. After clustering and assembly, a total of 13,602 contigs and 86,476 singletons were obtained from field pea, representing a total of 100,078 unigenes (Additional files 1 and 2). In case of faba bean, a total of 86,027 of unigenes were obtained, comprising 6,370 contigs and 79,657 singletons (Additional files 3 and 4). The unigene sets were then further assessed for quality based on read length, and any remnant sequences less than 100 bp were excluded from further analysis, leaving a total of 13,583 contigs and 57,099 singletons (field pea) and 6,351 contigs and 54,089 singletons (faba bean). In field pea, the length of contigs ranged from 100 bp to 6587 bp, with an average of 719 bp, while for faba bean, contig length ranged from 104 bp to 3923 bp with an average of 615 bp. Average contig coverage was 13.8 fold (ranging from 1.20-fold to 21846.96-fold) for field pea and 8.9 fold (ranging from 1.26 fold to 2884.64 fold) for faba bean. The number of reads per contig for field pea varied between 2 and 57,215, with an average of 41, and the corresponding values for faba bean were between 2 and 16,713 with an average of 25 (Table 2). Distributions of read length and number of reads per contig are shown in Figure 1. The number of contigs with read length less than 200 bp was minimal (1% in field pea and 2.2% in faba bean). Most of the contigs were longer than 0.5 kb (62.7% in field pea and 53.7% in faba bean). In both species, the majority of the contigs were derived from less than 10 reads (Figure 1C, D). A total of 5.7% field pea contigs and 2.9% faba bean contigs were composed of more than 100 reads.
The length of singletons varied from 100-540 bp (field pea) and 100-537 bp (faba bean). For field pea, the largest proportion of the singletons (21.6%) varied from 301-350 bp, while for faba bean, the majority of singletons (17%) varied from 201-250 bp (Figure 2).
Since M. truncatula is the model legume species that is most closely related to field pea and faba bean, consensus sequences from all contigs and singletons were preferentially compared to Medicago coding sequences. In case of field pea, a total of 11,737 unique matches were obtained (6,224 contigs and 5,513 singletons) (Additional file 5). The unigene set was also compared against the nr database of GenBank. A total of 9,101 contigs and 13,194 singletons (22,295 unigenes) obtained matches at E < 10-10. Any query sequences that revealed a highest-ranking match against a non-plant species were removed from the list, leaving a total of 22,057 unique hits (Additional file 6 sheet 'final'). Finally, all of the consensus sequences were compared against the A. thaliana database. A total of 6,156 unique matches were obtained, consisting of 3,668 contigs and 2,488 singletons (Additional file 7).
The faba bean unigene set was also compared with the M. truncatula genome and a total of 10,179 hits were obtained (3,246 contigs and 6,933 singletons) at E < 10-10 (Additional file 8). The unigene set was subsequently compared to the nr database of GenBank, resulting in 18,244 unique hits composed of 4,508 contigs and 13,736 singletons. Any sequence that matched a non-plant database entry was removed from the list, resulting in 18,052 unique hits (4,668 contigs and 13,584 singletons) (Additional file 9, sheet 'final'). The unigene set was also compared to the A. thaliana database at a threshold value of E < 10-10 (Additional file 10), and a total of 4,883 hits were obtained, consisting of 1,948 contigs and 2,935 singletons. Finally, the field pea and faba bean unigene sets were also compared against the G. max EST sequence database that identified 19,451 unique matches for field pea and 16,497 for faba bean (Additional file 11). 'The contigs and singletons obtained from field pea in the current study were also compared against the unigene set generated from transcriptome analysis of field pea performed by Franssen et al. (2011) and as a result, a total of 45,161 overlapping hits were identified (10,832 contigs [24%] and 34,329 singletons [76%]) (Additional file 12). In some instances, more than one contig revealed hits to the same gene, which may be due to origin of more than one contig or singleton from a single gene due either to non-overlapping sequence reads or high levels of sequence error in a single read. This process has also demonstrated the benefits obtained from comparison between two complementary studies.
All of the ESTs and nucleotide sequences currently available in GenBank for field pea and faba bean were also downloaded on the local server to perform BLASTN searches against field pea and faba bean contigs and singletons obtained from the current study. In case of field pea, a total of 2,764 EST and 77,431 nucleotide sequences obtained from Genbank showed significant hits against unigene set generated in the current study (corresponding to 2,244 and 31,624 unique hits, respectively) (Additional file 13, sheets 1-2). For faba bean, a total of 549 ESTs (222 unique matches against faba bean unigene set) and 3,684 nucleotides (1,277 unique matches against faba bean unigene set) were found be common between Genbank and transcriptome data generated from the current study (Additional file 13, sheets 3-4).
All unique matches obtained from field pea and faba bean contigs by comparison against the A. thaliana database were annotated and GO terms were further assigned. For field pea, a total of 22,068 gene counts and 30,739 annotation counts were obtained, while for faba bean, these corresponding values were 11,869 gene counts and 17,075 annotation counts. Proportions of each unigene set attributed to major functional categories were determined (Figures 3, 4, 5, 6, 7, 8). In case of field pea, the intracellular component category of the cellular component classification class contributed the largest proportion of all annotations (19%), followed by the cytoplasmic component (15%), chloroplast component (11%), membrane component (11%), nuclear component and plasma membrane component (7%) categories. Other components such as plastid, cytosol, mitochondria, ER, golgi apparatus, cell wall, ribosome and extracellular components were represented at proportions less than 5% of total (Figure 3). Among the molecular function classification class, the enzyme activity, binding activity, hydrolase activity, transferase activity, molecular function and nucleotide binding categories included the majority of detected matches (Figure 4). In the biological processes classification class, cellular (26%) and metabolic processes (22%) constituted the major categories, followed by protein metabolism (9%) and unknown biological processes (7%), (Figure 5). Similar results were obtained for faba bean. In the cellular component classification class, the major contributors were intracellular and cytoplasmic components (20% and 16% respectively) (Figure 6). The enzyme activity (16%), binding activity (14%) and unknown molecular functions (10%) categories contributed the most in molecular function classification class (Figure 7) while among the biological processes classification class, cellular and metabolic processes (25% and 23% respectively) constituted the major categories (Figure 8).
In field pea, EST-SSR discovery was performed based on analysis of assembled contig templates, of which 2,345 (17%) contained at least one repetitive motif. A total of 2,932 distinct loci were identified, 588 template contigs containing at least two SSR loci eligible for primer pair design. A total of 2,397 SSR primer pairs were designed from these 2,932 distinct loci (Additional file 14, sheet Fieldpea). In the case of faba bean, a total of 1,097 distinct loci were identified in 914 of 6,351 assembled contigs (14%), from which 802 SSR primer pairs were designed (Additional file 14 sheet Fababean). Incidences of different repeat types were determined (Table 3), the most abundant being trinucleotide arrays for both field pea (1,383; 57.7%) and faba bean (495; 61.7%). Frequencies for each array type according to repeat unit number were also evaluated (Table 3), the most common class being n = 4 (43.3% for field pea and 48.6% for faba bean).
Validation of EST-SSR assays
A subset of 96 EST-SSR primer pairs each from field pea and faba bean data sets were selected for validation of marker assay performance. For field pea, a total of 86 (90%) successfully obtained amplification products from one or more template genotypes, of which 40 (46.5%) revealed polymorphism between 5 genotypes of field pea. Inclusion of a template sample from the non-domesticated species PS3689 (wild type landrace accession of Pisum sativum from Afghanistan) permitted polymorphism detection by 11 additional primer pairs (an increase to 59.3% of total) (Additional file 15, sheet Fieldpea). For faba bean, 81 primer pairs (84%) exhibited successful amplification, of which 24 detected polymorphic (29.6%) between cultivated V. faba genotypes (Icarus and Ascot). When the non-domesticated V. faba genotype ACC118 was included in the analysis, polymorphism rate increased to 48% (Additional file 15, sheet Fababean).
EST assembly and gene annotation
The increasing capacity of DNA sequencing technologies has permitted substantial increases in genomic resource availability for several legume crops that had been previously underdeveloped. Recently, large-scale transcriptome characterisation using the GS FLX platform has been performed for both lentil and pigeonpea [8, 37]. This technology can deliver large amounts of data at considerably lower costs as compared to traditional sequencing methods, and so provides an effective means to expedite analysis of less-studied species . In the present study, equivalent approaches have been applied to the two Vicieae species, field pea and faba bean, in order to develop a transcribed sequence database and to identify and validate EST-SSRs.
GS FLX sequencing has been shown to ineffectively process homopolymer regions that are longer than 8 bp in length . Therefore, poly(A) tails at mRNA termini may present major challenges, and result in under-representation of the 3'-ends of transcripts. In the present study, the problem was resolved through use of a modified primer with an interrupted polyd(T) tail. This contributed to an increase in the output of the total number of sequenced fragments by c. 6% (data not shown). A number of other transcriptome studies have used the same approach to overcome the homopolymer sequencing problems [39, 40].
Prior to sequencing, normalisation of the cDNA samples obtained from leaf and stem tissues was performed in order to increase the sequencing efficiency of rare transcripts. The normalisation process helps to reduce over -sampling of abundant transcripts that are presentin high quantities, hence increasing confidence of detecting a larger proportion of rare transcripts. Preliminary experiments indicated that normalisation of leaf/stem cDNA could increase the possibility of detecting rare transcripts by c. 10% (unpublished data). Similar approaches have been applied to detect rare transcripts in lentil, M. truncatula, Artemisia annua and greenhouse whitefly [8, 41–43].
The average contig lengths for the target species in this study are comparable to those observed in other studies (Pisum sativum, 454 bp , Pinus contorta, 500 bp ; lentil, 770 bp ; sweet potato, 790 bp ; mungbean, 843 bp ). A large proportion of the reads assembled into contigs in case of field pea (87%), which is comparable to the values observed in other studies (Glanville fritillary butterfly, 91% ; Eucalyptus grandis, 88% ; Acropora millepora larvae, 90% ). In contrast, a relatively smaller proportion (65%) of reads from faba bean assembled into contigs, resulting in lower length and depth as compared to the data derived from field pea. This may be due to the fact that the sequencing output for faba bean was comparatively smaller than that of field pea. Similar results have been observed in other studies . As a result of de novo assembly, a large number of singletons were obtained both for field pea (86,476) and faba bean (79,657), also as observed for other species [17, 42, 44, 48]. Although some singletons may arise as contaminating sequences or artefacts, the majority probably originate from transcripts expressed at low levels, and were consequently retained in the dataset. Many singleton sequences (15% for field pea and 17% for faba bean) exhibited high read quality due to matching of protein-encoding genes in the existing genic databases, and hence provide valuable sources of information. The remaining singletons could have resulted from various reasons such as incompleteness of known databases, sequencing errors, short read lengths leading to difficulty in assembly etc. [8, 31].
BLAST searches against databases of model plant species provided annotation data for field pea and faba bean ESTs, with totals of 22,057 and 18,052 unique hits, respectively. These values are very close to the estimated number of total genes (c. 25,000) present in a typical diploid plant genome, based on data from rice (Oryza sativa L.), sorghum (Sorghum bicolor L.), A. thaliana and Brachpodium distachyon[49, 50]. On this basis, the sequences annotated in this study are likely to represent c. 88% and c. 72% of the gene complements of field pea and faba bean, respectively. Such estimates are also supported by comparison with the M. truncatula genome, from which a total of 11,737 unique hits obtained from field pea represented c. 49% of the known gene space, and 10,179 unique hits from faba bean represented c. 41% of the known gene space. Comparisons were also made to G. max, which is more distantly related to the Vicieae tribe species than M. truncatula, being located outside the Hologalegina clade, A total of 19,451 unique hits from field pea and 16, 497 from faba bean represent c. 35% and 30% of the known gene space respectively, based on total of predicted 55,787 protein-coding loci in the palaeopolyploid genome of soybean. In comparison to the genome of A. thaliana, which is more distantly related to both model and crop legume species within the dicotyledonous plants, the corresponding values were c. 25% for field pea and c. 20% for faba bean.
Marker discovery and validation
One major advantage of second-generation DNA sequencing technologies is the capacity for computational interrogation of transcriptome data in order to develop large numbers of gene-based genetic markers such as SSRs and SNPs, of which few are currently available in the public domain for either field pea or faba bean. The EST-SSR primer pair sets generated in the current study will prove directly useful for the target species, and due to likely primer site conservation, may also be readily transferable to closely related species . The transcriptome data generated in the current study, being derived from distinct genotypes, may potentially be also used for the detection of SNP markers in field pea and faba bean, to further enrich the available genomic resources for these two species.
The relative proportions of SSR array types in field pea and faba bean were similar to those observed in other plant species [8, 52–54]. In theory, the frequencies of di-, tri-, tetra-, penta-, and hexanucleotide repeats should progressively decrease, based on the relative probability of replication slippage events. However, trinucleotide repeat units were predominant, followed by tetra-, di-, hexa-, and pentanucleotide repeat units. This observation is quite common for EST-derived SSRs, as trinucleotide expansions (or multiples thereof) within translated regions are capable of maintaining reading frame and hence generating a homopolymeric amino acid run within a partially or fully active protein.
The validation results for sub-sets of EST-SSR markers demonstrated that inclusion of non-domesticated genotypes in the study increased rates of polymorphism detection, consistent with the results of similar studies [8, 55]. EST-SSRs generated in the present study will consequently provide a valuable tool for the understanding of global genetic diversity among both non-domesticated and cultivated pea and faba bean germplasm, as well as for dissection of the genetic control of important agronomic traits.
In the current study, the generation of EST-datasets for field pea and faba bean has been described. Unigene sets obtained from field pea and faba bean were annotated against different genomic databases including those of M. truncatula, A. thaliana, G. max, and the nr database from GenBank. Furthermore, the EST dataset was used for design of EST-SSRs, subsets of which were validated across a number of cultivated and wild genotypes of pea and faba bean, indicating effectiveness of polymorphism detection and cross transferability.
Zhu H, Choi H-K, Cook DR, Shoemaker RC: Bridging model and crop legumes through comparative genomics. Plant Physiol. 2005, 137: 1189-1196. 10.1104/pp.104.058891.
Lavin M, Herendeen PS, Wojciechowski MF: Evolutionary rates analysis of Leguminosae implicates a rapid diversification of lineages during the tertiary. Syst Biol. 2005, 54: 574-594.
Legumes of the world. Edited by: Lewis G, Schrire B, Mackinder B, Lock M. 2005, Kew Publishing
Cannon SB, Sterck L, Rombauts S, Sato S, Cheung F, Gouzy J, Wang X, Mudge J, Vasdewani J, Schiex T, Spannagl M, Monaghan E, Nicholson C, Humphray SJ, Schoof H, Mayer KFX, Rogers J, Quétier F, Oldroyd GE, Debellé F, Cook DR, Retzel EF, Roe BA, Town CD, Tabata S, Peer YV, Young ND: Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. PNAS. 2006, 103: 14959-14964. 10.1073/pnas.0603228103.
Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du J, Tian Z, Zhu L, Gill N, Joshi T, Libault M, Sethuraman A, Zhang XC, Shinozaki K, Nguyen HT, Wing RA, Cregan P, Specht J, Grimwood J, Rokhsar D, Stacey G, Shoemaker RC, Jackson SA: Genome sequence of the palaeopolyploid soybean. Nature. 2010, 463: 178-183. 10.1038/nature08670.
Gepts P, Beavis WD, Brummer EC, Shoemaker RC, Stalker HT, Weeden NF, Young ND: Legumes as a Model Plant Family. Genomics for Food and Feed Report of the Cross-Legume Advances through Genomics Conference. Plant Physiol. 2005, 137: 1228-1235. 10.1104/pp.105.060871.
Gibson G: Pulse Australia-General Crop Information 2009, Pulse Australia. [http://www.pulseaus.com.au/crop_information.aspx]
Kaur SK, Cogan NOI, Pembleton LW, Shinozuka M, Savin KW, Materne M, Forster JW: Transcriptome sequencing of lentil based on second-generation technology permits large-scale unigene assembly and SSR marker discovery. BMC Genomics. 2011, 12: 265-10.1186/1471-2164-12-265.
Timmerman-Vaughan GM, Mills A, Whitfield C, Frew T, Butler R, Murray S, Lakeman M, McCallum J, Russell A, Wilson D: Linkage mapping of QTL for seed yield, yield components and development traits in pea. Crop Sci. 2005, 45: 1336-1344. 10.2135/cropsci2004.0436.
Gomez-Roldan V, Fermas S, Brewer PB, Puech-Pagès V, Dun EA, Pillot J-P, Letisse F, Matusova R, Danoun S, Portais J-C, Bouwmeester H, Bécard G, Beveridge CA, Rameau C, Rochange SF: Strigolactone inhibition of shoot branching. Nature. 2008, 455: 189-194. 10.1038/nature07271.
Hecht V, Knowles CL, Schoor JKV, Liew LC, Jones SE, Lambert MJM, Weller JL: Pea LATE BLOOMER1 is a GIGANTEA ortholog with roles in photoperiodic flowering, deetiolation, and transcriptional regulation of circadian clock gene homologs. Plant Physiol. 2007, 144: 648-661. 10.1104/pp.107.096818.
Burstin J, Deniot G, Potier J, Weinachter C, Aubert G, Baranger A: Microstaellite polymorphism in Pisum sativum. Plant Breed. 2001, 120: 311-317. 10.1046/j.1439-0523.2001.00608.x.
Tar'an B, Zhang C, Warkentin T, Tullu A, Vandenberg A: Genetic diversity among varieties and wild species accessions of pea (Pisum sativum L.) based on molecular markers, and morphological and physiological characters. Genome. 2005, 48: 257-272. 10.1139/g04-114.
Smýkal P, Horáèek J, Dostálová R, Hýbl M: Variety discrimination in pea (Pisum sativum L.) by molecular, biochemical and morphological markers. J Appl Genet. 2008, 49: 155-166. 10.1007/BF03195609.
Zong X, Redden RJ, Liu Q, Wang S, Guan J, Liu J, Xu Y, Liu X, Gu J, Yan L, Ades P, Ford R: Analysis of a diverse global Pisum sp. collection and comparison to a Chinese local P. sativum collection with microsatellite markers. Theor Appl Genet. 2009, 118: 193-204. 10.1007/s00122-008-0887-z.
Jing R, Vershinin A, Grzebyta J, Shaw P, Smýkal P, Marshall D, Ambrose MJ, Ellis N, Flavell AJ: The genetic diversity and evolution of field pea (Pisum) studied by high throughput retrotransposon based insertion polymorphism (RBIP) marker analysis. BMC Evol Biol. 2010, 10: 44-10.1186/1471-2148-10-44.
Franssen SU, Shrestha RP, Bräutigam A, Bornberg-Bauer E, Weber APM: Comprehensive transcriptome analysis of the highly complex Pisum sativum genome using next generation sequencing. BMC Genomics. 2011, 12: 227-10.1186/1471-2164-12-227.
Torres AM, Avila CM, Gutierrez N, Palomino C, Moreno MT, Cubero JI: Marker-assisted selection in faba bean (Vicia faba L.). Field Crops Res. 2010, 115: 243-252. 10.1016/j.fcr.2008.12.002.
Ellwood SR, Phan HTT, Jordan M, Hane J, Torres AM, Avila CM, Cruz-Izquierdo S, Oliver RP: Construction of a comparative genetic map in faba bean (Vicia faba L.); conservation of genome structure with Lens culinaris. BMC Genomics. 2008, 9: 380-10.1186/1471-2164-9-380.
Dalmais M, Schmidt J, Signor CL, Moussy F, Burstin J, Savois V, Aubert G, Brunaud V, Oliveira Y, Guichard C, Thompson R, Bendahmane A: UTILLdb, a Pisum sativum in silico forward and reverse genetics tool. Genome Biol. 2008, 9: R43-10.1186/gb-2008-9-2-r43.
Coyne CJ, McClendon MT, Walling JG, Timmerman-Vaughan GM, Murray S, Meksem K, Lightfoot DA, Shultz JL, Keller KE, Martin RR, Inglis DA, Rajesh PN, McPhee KE, Weeden NF, Grusak MA, Li CM, Storlie EW: Construction and characterization of two bacterial artificial chromosome libraries of pea (Pisum sativum L.) for the isolation of economically important genes. Genome. 2007, 50: 871-875. 10.1139/G07-063.
Hofer J, Turner L, Moreau C, Ambrose M, Isaac P, Butcher S, Weller J, Dupin A, Dalmais M, Signor CL, Bendahmane A, Ellis N: Tendril-less regulates tendril formation in pea leaves. Plant Cell. 2009, 21: 420-428. 10.1105/tpc.108.064071.
Hellens RP, Moreau C, Lin-Wang K, Schwinn KE, Thomson SJ, Fiers MWEJ, Frew TJ, Murray SR, Hofer JMI, Jacobs JME, Davies KM, Allan AC, Bendahmane A, Coyne CJ, Timmerman-Vaughan GM, Ellis THN: Identification of Mendel's white flower character. PLoS One. 2010, 10: e13230-
Moe KT, Chung J-W, Cho Y-I, Moon J-K, Ku J-H, Jung J-K, Lee J, Park Y-J: Sequence information on simple sequence repeats and single nucleotide polymorphisms through transcriptome analysis of Mungbean. J Integr Plant Biol. 2011, 53: 63-73. 10.1111/j.1744-7909.2010.01012.x.
Thomson NR, Holden MTG, Carder C, Lennard N, Lockey SJ, Marsh P, Skipp P, O'Connor CD, Goodhead I, Norbertzcak H, Harris B, Ormond D, Rance1 R, Quail MA, Parkhill J, Stephens RS, Clarke IN: Chlamydia trachomatis: Genome sequence analysis of lymphogranuloma venereum isolates. Genome Res. 2008, 18: 161-171.
Bentley DR: Whole-genome re-sequencing. Curr Opin Genet Dev. 2006, 16: 545-552. 10.1016/j.gde.2006.10.009.
Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS: SNP discovery via 454 transcriptome sequencing. Plant J. 2007, 51: 910-918. 10.1111/j.1365-313X.2007.03193.x.
Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol. 2008, 17: 1636-1647. 10.1111/j.1365-294X.2008.03666.x.
Blanca J, Canizares J, Roig C, Ziarsolo P, Nuez F, Pico B: Transcriptome characterization and high throughput SSRs and SNPs discovery in Cucurbita pepo (Cucurbitaceae). BMC Genomics. 2011, 12: 104-10.1186/1471-2164-12-104.
Park YJ, Lee JK, Kim NS: Simple sequence repeat polymorphisms (SSRPs) for evaluation of molecular diversity and germplasm classification of minor crops. Molecules. 2009, 14: 4546-4569. 10.3390/molecules14114546.
Zeng S, Xiao G, Guo J, Fei Z, Xu Y, Roe BA, Wang Y: Development of a EST dataset and characterization of EST-SSRs in a traditional Chinese medicinal plant, Epimedium sagittatum (Sieb. Et Zucc.) Maxim. BMC Genomics. 2010, 11: 94-10.1186/1471-2164-11-94.
Rafalski JA: Novel genetic mapping tools in plants: SNPs and LD-based approaches. Plant Sci. 2002, 162: 329-333. 10.1016/S0168-9452(01)00587-8.
Loridon K, McPhee K, Morin J, Dubreuil P, Pilet-Nayel ML, Aubert G, Rameau C, Baranger A, Coyne C, Lejeune-Hénaut I, Burstin J: Microsatellite marker polymorphism and mapping in pea (Pisum sativum L.). Theor Appl Genet. 2005, 111: 1022-1031. 10.1007/s00122-005-0014-3.
Hougaard BK, Madsen LH, Sandal N, Moretzsohn MC, Fredslund J, Schauser L, Nielsen AM, Rohde T, Sato S, Tabata S, Bertioli DJ, Stougaard J: Legume anchor markers link syntenic regions between Phaseolus vulgaris, Lotus japonicus, Medicago truncatula and Arachi s. Genetics. 2008, 179: 2299-2312. 10.1534/genetics.108.090084.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
Schuelke M: An economic method for the fluorescent labelling of PCR fragments. Nat Biotechnol. 2000, 18: 233-234. 10.1038/72708.
Dutta S, Kumawat G, Singh BP, Gupta DK, Singh S, Dogra V, Gaikwad K, Sharma TR, Raje RS, Bandhopadhya TK, Datta S, Singh MN, Bashasab F, Kulwal P, Wanjari KB, Varshney RK, Cook DR, Singh NK: Devlopment of genic-SSR markers by deep transcriptome sequencing in pigeonpea [Cajanus cajan (L.) Millspaugh]. BMC Plant Biology. 2011, 11: 17-10.1186/1471-2229-11-17.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.
Beldade P, Rudd S, Gruber JD, Long AD: A wing expressed sequence tag resource for Bicyclus anynana butterflies, an evo-devo model. BMC Genomics. 2006, 7: 130-10.1186/1471-2164-7-130.
Sun C, Li Y, Wu Q, Luo H, Sun Y, Song J, Lui EMK, Chen S: De novo sequencing and analysis of the American ginseng root transcriptome using a GS FLX Titanium platform to discover putative genes involved in ginsenoside biosynthesis. BMC Genomics. 2010, 11: 262-10.1186/1471-2164-11-262.
Cheung F, Haas BJ, Goldberg SM, May GD, Xiao Y, Town CD: Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology. BMC Genomics. 2006, 7: 272-10.1186/1471-2164-7-272.
Wang W, Wang Y, Zhang Q, Qi Y, Guo D: Global characterization of Artemisia annua glandular trichome transcriptome using 454 pyrosequencing. BMC Genomics. 2009, 10: 465-10.1186/1471-2164-10-465.
Karatolos N, Pauchet Y, Wilkinson P, Chauhan R, Denholm I, Gorman K, Nelson DR, Bass C, ffrench-Constant RH, Williamson MS: Pyrosequencing the transcriptome of the greenhouse whitefly, Trialeurodes vaporariorum reveals multiple transcripts encoding insecticide targets and detoxifying enzymes. BMC Genomics. 2011, 12: 56-10.1186/1471-2164-12-56.
Parchman TL, Geist KS, Grahnen JA, Benkman CW, Buerkle CA: Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics. 2010, 11: 180-10.1186/1471-2164-11-180.
Schafleitner R, Tincopa LR, Palomino O, Rossel G, Robles RF, Alagon R, Rivera C, Quispe C, Rojas L, Pacheco JA, Solis J, Cerna D, Kim JY, Hou J, Simon R: A sweetpotato gene index established by de novo assembly of pyrosequencing and Sanger sequences and mining for gene-based microsatellite markers. BMC Genomics. 2010, 11: 604-10.1186/1471-2164-11-604.
Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol. 2008, 17: 1636-1647. 10.1111/j.1365-294X.2008.03666.x.
Novaes E, Drost DR, Farmerie WG, Pappas GJ, Grattapaglia D, Sedero RR, Kirst M: High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics. 2008, 9: 312-10.1186/1471-2164-9-312.
Meyer E, Aglyamova GV, Wang S, Buchanan-Carter J, Abrego D, Colbourne JK, Willis BL, Matz MV: Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics. 2009, 10: 219-10.1186/1471-2164-10-219.
Vogel : Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature. 2009, 463: 763-768.
Bevan M, Walsh S: The Arabidopsis genome: A foundation for plant research. Genome Res. 2005, 15: 1632-1642. 10.1101/gr.3723405.
Barbara T, Palma-Silva C, Paggi GM, Bered F, Fay MF, Lexer C: Cross-species transfer of nuclear microsatellite markers: potential and limitations. Mol Ecol. 2007, 16: 3759-3767. 10.1111/j.1365-294X.2007.03439.x.
Kumpatla SP, Mukhopadhyay S: Mining and survey of simple sequence repeats in expressed sequence tags of dicotyledonous species. Genome. 2005, 48: 985-998. 10.1139/g05-060.
Eujayl I, Sledge MK, Wang L, May GD, Chekhovskiy K, Zwonitzer JC, Mian MAR: Medicago truncatula EST-SSRs reveal cross-species genetic markers for Medicago spp. Theor Appl Genet. 2004, 108: 414-422. 10.1007/s00122-003-1450-6.
Luro FL, Costantino G, Terol J, Argout X, Allario T, Wincker P, Talon M, Ollitrault P, Morillon R: Transferability of the EST-SSRs developed on Nules clementine (Citrus clementina Hort ex Tan) to other Citrus species and their effectiveness for genetic mapping. BMC Genomics. 2008, 9: 287-10.1186/1471-2164-9-287.
Castillo A, Budak H, Varshney RK, Dorado G, Graner A, Hernandez P: Transferability and polymorphism of barley EST-SSR markers used for phylogenetic analysis in Hordeum chilense. BMC Plant Biol. 2008, 8: 97-10.1186/1471-2229-8-97.
This work was supported by funding from the Victorian Department of Primary Industries and the Grains Research and Development Corporation, Australia.
SK and LP contributed equally to the GS FLX sequencing, EST dataset analysis, EST-SSR marker assay design and interpretation of SSR genotyping data. SK and JF drafted the manuscript. KS assisted the sequence contig annotation process. NC contributed to data analysis, interpretation and assisted in drafting the manuscript. NC, SK, JF, TL, JP and MM co-conceptualised and coordinated the project. MM assisted in drafting the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: . The data represents the consensus sequences of 13,602 assembled contigs generated as a result of de novo assembly of field pea ESTs. (TXT 10 MB)
Additional file 2: . The data represents the sequence information on all the singletons generated from de novo assembly of field pea ESTs. (TXT 18 MB)
Additional file 3: . The data represents the consensus sequences of 6,370 assembled contigs generated as a result of de novo assembly of faba bean ESTs. (TXT 4 MB)
Additional file 4: . The data represents the sequence information on all the singletons generated from de novo assembly of faba bean ESTs. (TXT 16 MB)
Additional file 5: Medicago truncatula genome. This file contains the BLAST results obtained as a result of comparison of field pea unigene set against the M. truncatula genome at an E value < 10-10. (XLS 2 MB)
Additional file 6: . This file contains the BLAST results obtained as a result of comparison of field pea unigene set against the GenBank nr database at an E value < 10-10. (XLS 9 MB)
Additional file 7: Arabidopsis thaliana genome. This file contains the BLAST results obtained as a result of comparison of field pea unigene set against the A. thaliana genome at an E value < 10-10. (XLS 2 MB)
Additional file 8: Medicago truncatula genome. This file contains the BLAST results obtained as a result of comparison of faba bean unigene set against the M. truncatula genome at an E value < 10-10.(XLS 2 MB)
Additional file 9: . This file contains the BLAST results obtained as a result of comparison of faba bean unigene set against the GenBank nr database at an E value < 10-10. (XLS 7 MB)
Additional file 10: Arabidopsis thaliana genome. This file contains the BLAST results obtained as a result of comparison of faba bean unigene set against the A. thaliana genome at an E value < 10-10. (XLS 1 MB)
Additional file 11: Glycine max genome. This file contains the BLAST results obtained as a result of comparison of field pea and faba bean unigene sets against G. max genome at an E value < 10-10. (XLS 7 MB)
Additional file 12: Pisum sativum transcriptome dataset from Franssen et al 2011. This file contains the BLAST results obtained as a result of comparison of field pea and faba bean unigene sets against P. sativum transcriptome dataset at an E value < 10-10. (XLS 10 MB)
BLASTN of field pea and faba bean contigs and singltones against GenBank EST and nucleotide (nt) data
Additional file 13: . This file contains the BLASTN results obtained as a result of comparison of field pea and faba bean contigs and singletons set against the EST and nucleotide (nt) databse of GenBank at an E value < 10-10. (XLSX 7 MB)
Additional file 14: Sequence information of all of the SSR primer pairs identified and designed using BatchPrimer3 from field pea and faba bean ESTs. This file contains all of the information (sequence information, orientation, sequence length, expected product length, Tm, GC content and SSR motif length) on SSR primer pairs designed using BatchPrimer 3. (XLS 2 MB)
Additional file 15: Characterisation of a sub-sets of EST-SSRs on wild and cultivated genotypes of field pea and faba bean. This file represents the data on number and size of alleles amplified from screening of subsets of EST-SSRprimer pairs on different genotypes of field pea and faba bean. (XLS 37 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Kaur, S., Pembleton, L.W., Cogan, N.O. et al. Transcriptome sequencing of field pea and faba bean for discovery and validation of SSR genetic markers. BMC Genomics 13, 104 (2012). https://doi.org/10.1186/1471-2164-13-104
- Faba Bean
- Simple Sequence Repeat Locus
- Simple Sequence Repeat Prime Pair
- Rare Transcript
- Unique Match