Gene-based microsatellite development for mapping and phylogeny studies in eggplant

Background Eggplant (Solanum melongena L.) is a member of the Solanaceae family. In spite of its widespread cultivation and nutritional and economic importance, its genome has not as yet been extensively investigated. Few analyses have been carried out to determine the genetic diversity of eggplant at the DNA level, and linkage relationships have not been well characterised. As for the other Solanaceae crop species (potato, tomato and pepper), the level of intra-specific polymorphism appears to be rather limited, and so it is important that an effort is made to develop more informative DNA markers to make progress in understanding the genetics of eggplant and to advance its breeding. The aim of the present work was to develop a set of functional microsatellite (SSR) markers, via an in silico analysis of publicly available DNA sequence. Results From >3,300 genic DNA sequences, 50 SSR-containing candidates suitable for primer design were recovered. Of these, 39 were functional, and were then applied to a panel of 44 accessions, of which 38 were cultivated eggplant varieties, and six were from related Solanum species. The usefulness of the SSR assays for diversity analysis and taxonomic discrimination was demonstrated by constructing a phylogeny based on SSR polymorphisms, and by showing that most were also functional when tested with template from tomato, pepper and potato. As a result of BLASTN analyses, several eggplant SSRs were found to have homologous counterparts in the phylogenetically related species, which carry microsatellite motifs in the same position. Conclusion The set of eggplant EST-SSR markers was informative for phylogenetic analysis and genetic mapping. Since EST-SSRs lie within expressed sequence, they have the potential to serve as perfect markers for genes determining variation in phenotype. Their high level of transferability to other Solanaceae species can be used to provide anchoring points for the integration of genetic maps across species.


Background
The eggplant (Solanum melongena L.), also known as aubergine or brinjal, belongs to the Solanaceae, but unlike most of the solanaceous crop species, it is endemic to the Old, not the New World. Its progenitor is presumed to have been the African species S. incanum [1], but its centre of domestication and genetic diversity lies in the Indo-Burma region, where it has been grown for at least 1,500 years [2]. Despite its economic and nutritional importance, its genome has been little studied, in contrast to those of the other cultivated solanaceous crops tomato, potato and pepper, in which high density genetic linkage maps have been established [3][4][5][6]. The literature contains only a few reports describing RAPD [7], AFLP [8,9] and SSR [10,11] genotyping, a genetic map constructed with AFLP and RAPD markers [12] and a comparative genetic map, based on tomato sequences [13].
Microsatellites (SSRs) are short tandem repeats of simple (1-6 nt) motifs, and their value for genetic analysis lies in their multi-allelism, codominant inheritance, relative abundance, genome coverage and suitability for high-throughput PCR-based platforms [14]. It was long assumed that SSRs were primarily associated with noncoding DNA, but it has now become clear that they are also abundant in the single and low-copy fraction of the genome [15,16]. These latter SSRs are commonly referred to as "genic SSRs" or "EST-SSRs" and are present in 1 to 5% of the expressed plant DNA sequence deposited in public databases. With the increasing volume of publicly available unigene and cDNA sequences emerging from large-scale EST sequencing projects, the conventional need to generate enriched genomic libraries and to perform the necessary sequencing can now be largely bypassed [17]. Genic SSRs tend to be more readily transferable between (related) species or genera than genomic ones, since coding sequence is better conserved than noncoding sequence; however, they do tend to be less informative than conventional SSRs, particularly in the context of related genotypes [18,19]. On the other hand, they provide a powerful means to link the genetic maps of related species, and since many of them are located within genes of known or at least putative function, any allelic variation present can be exploited to generate perfect markers [20].
We present here our progress in the development and preliminary characterization of a set of eggplant SSR markers, derived from public database sequence, along with an evaluation of their experimental and in silico transferability among other solanaceous species.

SSR motif frequency and distribution
At the time surveyed, the Solanaceae Genomics Network database (SGN; http://www.sgn.cornell.edu) contained 3,181 eggplant ESTs, ordered into 1,841 unigenes (617 contigs and 1,224 singlets). An additional 176 sequences were retrieved from the EMBL sequence database (http://www.ebi.ac.uk/embl). The non-redundant sequence pool contained 1,864 sequences representing 743,527 bp of genomic sequence. Within these, 64 contained one or more SSR (70 in total, including 20 mono-, 11 di-, 36 tri-, one tetra- and two hexanucleotide motifs). One sequence contained three SSRs, while ten SSRs were of the compound type (SSRs containing stretches of two or more different repeats). The mean separation between two SSRs was ~10.6 kb, equivalent to one SSR per 29 sequences. This distance is somewhat greater than that estimated for several monocotyledonous [15,21] and dicotyledonous [22] species, perhaps because of the more stringent search criteria and the smaller sequence dataset.
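The density figures quoted above can be reproduced from the counts given in the text. The following is a minimal sketch; the variable names are illustrative, and the reading that "one SSR per 29 sequences" refers to the 64 SSR-containing sequences (rather than the 70 individual SSRs) is an assumption based on the arithmetic.

```python
# Counts quoted in the text for the non-redundant eggplant sequence pool.
total_bp = 743_527        # total length of the non-redundant pool (bp)
n_sequences = 1_864       # non-redundant sequences
n_ssr = 70                # SSRs detected
n_ssr_sequences = 64      # sequences carrying at least one SSR

# Mean spacing between SSRs, in kb.
kb_per_ssr = total_bp / n_ssr / 1000

# Number of sequences screened per SSR-containing sequence (assumed reading).
sequences_per_hit = n_sequences / n_ssr_sequences

print(f"one SSR every {kb_per_ssr:.1f} kb")
print(f"one SSR-containing sequence per {sequences_per_hit:.0f} sequences")
```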
The properties of the 70 SSR loci identified are summarised in Table 1, classified on the basis of repeat motif and the number of repeat units. Trinucleotides were the most frequent (51.4%), followed by mono- (28.6%) and dinucleotides (15.7%). Tetra- and hexanucleotides were rare. Although trinucleotide motifs are less frequent in genomic libraries, they represent the most common class in expressed sequence [18,23,24], since variation in their repeat number does not normally affect downstream peptide sequence, unlike variation in mono-, di- or tetranucleotide motifs, which generates frameshift mutations and is therefore more likely to be selected against [25]. All ten possible trinucleotide motifs were recovered, with AAG/CTT the most frequent (30.6%), as has been seen in other Solanaceae species [26,27] and more generally within plant sequence databases [16,28]. CCG/CGG and AGG/CCT are the most common monocotyledonous EST-SSR motifs [18,24,29] and are under-represented in dicotyledonous species, as they were in the present dataset. Kantety et al. [30] have observed that AG/CT predominates among the dinucleotide motifs, presumably reflecting the high frequency of Ala (AGA) and Leu (GAG) (respectively, 8% and 10%) in polypeptides [31]. These motifs represented 45.5% of the eggplant dinucleotide SSRs. The second most abundant motif (36.4%) was AT/AT, which is also well represented among plant EST sequences [32,33]. Most of the mononucleotide repeats (19/20) were A/T.
The total length of the 64 microsatellite-containing sequences was 31,909 bp. Of this, 16,862 bp represented untranslated regions (UTRs) and 15,047 bp represented protein-coding regions. SSRs were non-randomly distributed among coding regions and UTRs. All of the mononucleotide repeats and the majority of the dinucleotide repeats (91%) were associated with UTRs. Mononucleotide repeats were evenly distributed among 5' and 3' UTRs, while dinucleotide repeats were preferentially associated with 5' UTRs. Triplet repeats were significantly over-represented in coding regions (75%) and, within non-coding regions, were more than threefold more frequent in 5' UTRs. This dominance of trinucleotide over other SSRs in coding regions can be explained by the fact that they do not perturb the reading frame.
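The statement that there are ten possible trinucleotide motifs follows from treating a motif, its rotations and its reverse complement as one class (so that AAG, AGA, GAA, CTT, TTC and TCT all count as AAG/CTT). The sketch below illustrates that grouping; the function names are illustrative and are not taken from any software used in the study.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(motif: str) -> str:
    """Smallest string among all rotations of the motif and of its reverse complement."""
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    forms = [
        strand[i:] + strand[:i]
        for strand in (motif, reverse_complement)
        for i in range(len(strand))
    ]
    return min(forms)


def is_primitive(motif: str) -> bool:
    """False when the motif is itself a repeat of a shorter unit (e.g. AAA, ATAT)."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def motif_classes(k: int) -> list[str]:
    """All distinct SSR motif classes of unit length k."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({canonical(m) for m in motifs if is_primitive(m)})


print(len(motif_classes(3)), motif_classes(3))
print(len(motif_classes(2)), motif_classes(2))
```

Under this grouping the dinucleotide classes reduce to four (AC/GT, AG/CT, AT/AT and CG/CG), which is why AT is written as its own reverse complement in the text.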

SSR assays and their informativeness
Of the 64 sequences containing one or more SSR, 50 (78%) were amenable to primer design. The markers targeted by EEMS01 to EEMS50 comprised 15 mono-, five di-, 24 tri- and two hexanucleotide simple repeats, together with two di- and two trinucleotide compound loci. The remaining sequences contained either too little flanking sequence, or the sequences themselves were refractory for primer design. Thus, primers amplifying non-redundant loci were designed from about 1.4% of the initial number of database sequences, a success rate comparable to that experienced in other species [23,26,27]. Amplicons were generated from genomic DNA template from 39 (78.0%) of the 50 loci. Failure to amplify can be due to a variety of causes, including the positioning of primers across a splicing site, or a chimeric origin of the cDNA clones. In all, 31 (79.5%) of the 39 assays were informative across the whole genotype panel (Table 2), but only 11 (28.2%) were informative among the sample of cultivated eggplant. The majority of the trinucleotide-containing SSRs were informative between species, but few generated any polymorphism among the cultivated set, while the dinucleotide SSRs identified both inter- and intra-specific polymorphism. Similar results have been reported for eggplant by Nunome et al. [10,11], who reported that 57% of trinucleotide SSRs were informative at the inter-, but only 14% at the intraspecific level, while, for the dinucleotide SSRs, the respective frequencies were 78% and 70%. The repeat type, primer sequence and PIC (polymorphism information content) of the successful markers are given in Table 3.
Generally, amplicon size was in agreement with expectation, although EEMS 26, 31, 39 and 41 all amplified a product at least 100 bp larger than expected, presumably because the amplicon included an intron. EEMS12 produced an amplicon smaller than expected, perhaps because of the presence of a deletion within the genomic sequence, poor priming specificity amplifying a non-target member of a gene family, or minor sequence variation between the amplified copy and the consensus sequence [34]. A total of 116 alleles was amplified from the full genotype panel, with the number of alleles per locus varying between 1 and 9 (mean 3.1) (Table [...]

[Table 1: Occurrence of non-redundant SSRs in a set of 3,357 Solanum melongena sequences. Columns: SSR motif; number of repeats (4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, >15); total. Overall total: 70 SSRs.]

[...] between PICm and SSR length was 0.6 (p = 0.0001), in agreement with the general trend for long SSRs to be more informative than shorter ones [35]. Trinucleotide motif SSRs were less informative than the dinucleotide types (PICs of 0.16 and 0.26 respectively). The former are typically associated with a low level of variability [18,36]. The overall level of intraspecific polymorphism uncovered (28.2%) is typical [37][38][39], and compares poorly with the rate achievable by genomic SSR assays [37,40,41].
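For readers unfamiliar with PIC, the following minimal sketch shows how a per-locus value can be computed from allele calls. The formula used for the values reported here is not stated in this section; the sketch assumes the widely used simplified form PIC = 1 - Σp_i², where p_i is the frequency of the i-th allele, and the allele sizes in the example are hypothetical.

```python
from collections import Counter


def pic(alleles):
    """Simplified PIC (1 - sum of squared allele frequencies) for one locus.

    `alleles` is a sequence with one allele call per accession,
    e.g. amplicon sizes in bp. Assumed formula, see the lead-in.
    """
    counts = Counter(alleles)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical amplicon sizes (bp) scored at one locus in eight accessions.
locus_calls = [150, 150, 150, 153, 153, 156, 150, 150]
print(f"PIC = {pic(locus_calls):.2f}")
```

A monomorphic locus gives a PIC of 0, and the value rises as more alleles occur at more even frequencies, which is why loci with many repeat-number variants tend to score higher.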

Genetic diversity revealed by SSR markers
Thiel et al. [24] have stressed the limitations surrounding the application of SSR markers for diversity studies, emphasising the possibility of homoplasy (identical allele sizes may not be identical by descent), and have pointed out that allele size differences can also be generated by indel events, as well as by variation in the SSR repeat number. However, the genetic relationships between the accessions of the full genotype panel as displayed by genetic similarity at the SSR level were in good agreement with prior taxonomic classification based on both genomic [9,11] and plastidial markers [42,43]. Thus the cultivated eggplants clustered with an average genetic similarity of 82% (Figure 1). Within each of three pairs of cultivars ('Tina' and 'Dourga'; 'Sita 07' and 'Violetta di Firenze'; 'Mostruosa di New York' and '305 E40') and one group of three ('Mirabelle', 'DR2' and 'Lunga violetta napoletana'), the accessions were indistinguishable from one another. The cluster closest to the cultivated group contained both S. viarum and S. sodomaeum, with a mean genetic differentiation of ~50% from the cultivated germplasm. The S. torvum accession was more distant (mean genetic similarity 39%). The third cluster contained the remaining species S. sisymbrifolium, S. aethiopicum and S. integrifolium, which shared a mean genetic similarity of 56%.
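The pairwise similarities underlying such a dendrogram can be illustrated with Nei and Li's coefficient, S = 2n_ab / (n_a + n_b), where n_ab is the number of alleles shared by two accessions and n_a and n_b are the numbers of alleles scored in each. The sketch below assumes this is the coefficient used (as the Figure 1 legend indicates); the allele labels are hypothetical.

```python
def nei_li_similarity(alleles_a: set, alleles_b: set) -> float:
    """Nei and Li (Dice) similarity between two allele presence sets."""
    if not alleles_a and not alleles_b:
        raise ValueError("at least one accession must carry a scored allele")
    shared = len(alleles_a & alleles_b)
    return 2 * shared / (len(alleles_a) + len(alleles_b))


# Hypothetical allele sets, written as "locus:size" labels.
accession_1 = {"EEMS15:210", "EEMS21:180", "EEMS24:155"}
accession_2 = {"EEMS15:210", "EEMS21:183", "EEMS24:155"}

print(f"S = {nei_li_similarity(accession_1, accession_2):.2f}")
```

Accessions with identical allele sets score 1.0, which corresponds to the indistinguishable cultivar groups noted above.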
The EEMS primers were also applied to amplify template from potato, tomato and pepper, which all belong to the Solanaceae. To minimise non-specific amplification, the same stringency level for PCR was applied as with eggplant template. About 54% of the primer pairs (21 of 39) generated a detectable amplicon from at least one of the three species; ten of the 21 amplified all three templates, seven amplified potato and tomato but not pepper DNA, two amplified tomato and pepper but not potato, and one each amplified only potato or only tomato.
The principal co-ordinate analysis (PCO) illustrates the genetic relationships between the members of the genotype panel (Figure 2). The first three principal co-ordinates accounted for ~54% of the overall genetic variation, contributing 34.2%, 10.3% and 9.4%, respectively. The first co-ordinate distinguished the cultivated forms from the allied genotypes, while the second allowed the related genotypes to be separated from one another.

BLAST analyses
Of the 39 functional SSR markers, all but EEMS45 were developed from anonymous eggplant unigene sequences, 25 of which share significant homology to Arabidopsis thaliana proteins of unknown function. EEMS45 lies within a chloroplast phosphate transporter gene (Table 4). Using the source eggplant sequences as a BLASTN query (the target database is described in the Methods section), 24 (61.5%) of the markers identified highly conserved orthologs, with a frequency negatively correlated with phylogenetic distance from eggplant [44]. EEMS15, EEMS21, EEMS24, EEMS39, and EEMS45 had homologous counterparts with known function. Sequences containing homologous microsatellite motifs in conserved positions were found in 15 potato, 10 tomato and 1 pepper orthologs (Table 4). Contrasting results are reported in the literature on the transferability of microsatellite markers across members of the Solanaceae [26,45,46]. The high level of transferability between the seven Solanum spp. mirrors the experience in other groups of plants [47]; furthermore, we detected a low level of intraspecific polymorphism, which seems to confirm the conclusion that EST-SSRs are highly conserved across species [48].

Conclusion
In eggplant, as in pepper and tomato [3,49,50], the level of intraspecific DNA marker polymorphism is rather limited. Nunome et al. [11] constructed a genetic map in eggplant based on RAPD and AFLP markers, but only 8.3% of the RAPD primers were informative, and even the AFLP primer combinations were only able to deliver a mean of 2.4 polymorphisms each. We have shown that an in silico analysis of the publicly available eggplant DNA sequence, limited in quantity as it is, has enabled the development of a set of functional SSR markers. Because these sequences are derived from the expressed portion of the genome, they are relevant for assaying functional diversity in populations or germplasm collections. Most of the EEMS SSRs are readily transferable to related species, and so can be exploited as anchor markers for comparative mapping and evolutionary studies.

Mining of SSR-containing sequences and primer design
In all, 3,357 eggplant sequences were retrieved from the SGN and EMBL nucleotide databases, using the Sequence Retrieval System (SRS6, http://srs.ebi.ac.uk/). A standalone nucleotide database was built for local BLAST2 searches [51]. PolyA and polyT tracts were removed, by applying the criterion that no 50 bp window contain a run of ten A's or ten T's. ClustalW [52] alignment was used to eliminate redundancy, by setting the following two criteria: (i) where a cluster contained two or more identical sequences, the longest was retained, and (ii) where the members of a cluster fell into recognisable sub-groups, only one member of each sub-group was retained. Sequences composed entirely of SSR motif (i.e., lacking any flanking sequence) were discarded, since their uniqueness could not be established, and in any case, primer design is not possible. SSR-containing sequences were identified using MISA software [24], a Perl script which allows both perfect and compound SSRs to be detected. A sequence was considered an SSR where a motif was repeated at least 12 times (1 nt motif), seven times (2 nt) or five times (3-6 nt), allowing for only one mismatch. For compound repeats, the maximum default interruption (spacer) length was set at 100 bp.
Figure 1: UPGMA dendrogram. Analysis of the 44 genotype set, based on 116 EST-SSR alleles. Sample codes are described in Table 2. Scale: Nei and Li's similarity coefficient.
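The repeat-number thresholds given in this section (12 for 1 nt motifs, seven for 2 nt and five for 3-6 nt) can be illustrated with a short regular-expression search. This is a simplified sketch and not the MISA implementation: it finds perfect repeats only, so the single permitted mismatch and the joining of compound SSRs across a spacer are not handled, and the function names are illustrative.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 12, 2: 7, 3: 5, 4: 5, 5: 5, 6: 5}


def _is_primitive(motif):
    """False when the motif is a repeat of a shorter unit (e.g. AAA, ATAT)."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_perfect_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples; start is 0-based."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_repeats in MIN_REPEATS.items():
        # One captured unit followed by at least (min_repeats - 1) copies of it.
        pattern = re.compile(
            r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1)
        )
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip e.g. "AA" or "ATAT": such runs are reported under the
            # shorter unit, whose threshold they always satisfy.
            if _is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, repeats))
    return sorted(hits)


# Hypothetical test sequence with an (AAG)6 and an (A)12 tract.
example = "GATTACA" + "AAG" * 6 + "CCGT" + "A" * 12 + "GC"
for start, motif, repeats in find_perfect_ssrs(example):
    print(start, motif, repeats)
```

The search is run once per motif length so that each class can carry its own threshold; a tract of six AG units, for example, falls below the dinucleotide threshold and is not reported.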
Primer pairs were designed from the flanking sequences, using PRIMER3 software [53] in batch mode via the p3_in.pl and p3_out.pl Perl5 scripts within the MISA package. The target amplicon size was set as 100-300 bp, the optimal annealing temperature as 60°C, and the optimal primer length as 20 bp.
[...] (Table 2). Cross-species transferability was tested against tomato, pepper and potato DNA. DNA was isolated from young leaves using the method described by Doyle and Doyle [54].
[...] amplicons were separated by denaturing 6% polyacrylamide gel electrophoresis on a LI-COR Gene ReadIR 4200 device, as described by Jackson and Matthews [55]. Determination of amplicon size was achieved by including an IRD700-labelled 50-350 bp ladder in each well. The data were collected by e-Seq software (DNA Sequencing and Analysis Software) v3.0.
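The primer design settings described in this section (100-300 bp amplicon, 60°C optimum, 20 bp optimal primer length) can be written as a Primer3 input record in Boulder-IO format. The sketch below is an assumption-laden illustration, not the output of p3_in.pl: the tag names follow the current Primer3 manual and may differ from those of the version used in the study, and the sequence identifier, template and SSR coordinates are hypothetical.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length):
    """Build one Boulder-IO record asking Primer3 for primers flanking an SSR.

    `ssr_start` is interpreted by Primer3 according to its configured
    base index (PRIMER_FIRST_BASE_INDEX); this helper does not set it.
    """
    fields = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("SEQUENCE_TARGET", f"{ssr_start},{ssr_length}"),  # region to flank
        ("PRIMER_PRODUCT_SIZE_RANGE", "100-300"),          # amplicon size (bp)
        ("PRIMER_OPT_TM", "60.0"),                         # optimum Tm (deg C)
        ("PRIMER_OPT_SIZE", "20"),                         # optimum length (nt)
    ]
    lines = [f"{tag}={value}" for tag, value in fields]
    lines.append("=")  # a lone "=" terminates a Boulder-IO record
    return "\n".join(lines) + "\n"


# Hypothetical unigene fragment with an (AAG)6 tract starting at offset 40.
template = "ACGT" * 10 + "AAG" * 6 + "TGCA" * 10
print(primer3_record("example_unigene", template, 40, 18), end="")
```

Records of this kind can be concatenated into a single file for batch processing, one record per SSR-containing sequence.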