Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley

Background: Earlier comparative maps between the genomes of rice (Oryza sativa L.), barley (Hordeum vulgare L.) and wheat (Triticum aestivum L.) were linkage maps based on cDNA-RFLP markers. The low number of polymorphic RFLP markers has limited the development of dense genetic maps in wheat and the number of available anchor points in comparative maps. Higher density comparative maps using PCR-based anchor markers are necessary to better estimate the conservation of colinearity among cereal genomes. The purposes of this study were to characterize the proportion of transcribed DNA sequences containing simple sequence repeats (SSR or microsatellites) by length and motif for wheat, barley and rice and to determine in-silico rice genome locations for primer sets developed for wheat and barley Expressed Sequence Tags.
Results: The proportions of SSR types (di-, tri-, tetra-, and penta-nucleotide repeats) and motifs varied with the length of the SSRs within and among the three species, with trinucleotide SSRs being the most frequent. Distributions of genomic microsatellites (gSSRs), EST-derived microsatellites (EST-SSRs), and transcribed regions in the contiguous sequence of rice chromosome 1 were highly correlated. More than 13,000 primer pairs were developed for use by the cereal research community as potential markers in wheat, barley and rice.
Conclusion: Trinucleotide SSRs were the most common type in each of the species; however, the relative proportions of SSR types and motifs differed among rice, wheat, and barley. Genomic microsatellites were found to be primarily located in gene-rich regions of the rice genome. Microsatellite markers derived from the use of non-redundant EST-SSRs are an economic and efficient alternative to RFLP for comparative mapping in cereals.


Background
The genetic maps of grass species have been constructed using a variety of marker types. Most of the older species-specific molecular maps were constructed with RFLP markers, but PCR-based markers have recently come into wider use because of their accessibility and higher throughput. Conservation of gene content and order has been detected among grass genomes through the use of comparative maps [1,2]. The applications of comparative maps have been discussed many times (see, for example, [3]). However, genetic maps are not always designed with a comparative study in mind. As a result, current maps from different grass species (and in many cases, from within the same species) seldom share enough common (anchor) markers to allow researchers to bridge across maps with adequate resolution. This is especially true when comparing the genome maps of the Triticeae tribe with the maps of rice or maize, which on average share 3 to 4 markers per wheat homoeologous chromosome group. The lack of anchor markers for bridging across species is exacerbated as new maps are constructed using PCR-based markers such as AFLP, genomic microsatellites and single nucleotide polymorphisms (SNPs) rather than the transferable but laborious cDNA-based RFLP markers.
Genomic SSR (gSSR) markers are biased towards genome specificity [4,5] and generally do not transfer to other species, making them less useful for the generation of comparative maps. For comparative mapping, markers must identify orthologous loci and be polymorphic in two or more species [6].
Recently, several researchers [6-17] have addressed the lack of transferability of gSSRs to other genomes by limiting primer design to transcribed regions, which are expected to have higher levels of conservation across related organisms. Public EST sequence databases from the Poaceae family can be scanned for the presence of SSRs both in protein-coding regions and in untranslated regions of genes (5' or 3' UTRs).
When compared to gSSRs, EST-derived SSRs (EST-SSRs) were less polymorphic in a study in hexaploid wheat [9], with only 25% polymorphism, but the successful markers were of high quality and were also polymorphic in durum wheat. Thiel [13] reported a higher level of polymorphism in barley (42%). Lower polymorphism means that primers must be designed and tested for a larger set of candidate markers, but the ease and speed of finding SSRs among freely available EST sequence data offsets this extra effort. This approach is only feasible in species for which there have been EST sequencing projects.
We used the rice genome sequence generated by the International Rice Genome Sequence Project (IRGSP) [18] to identify gSSRs that can potentially serve as sources of markers for mapping. In an equivalent experiment, nonredundant sets of transcript sequences from rice, wheat and barley (obtained from the TIGR gene-index databases [19]) were scanned, and those transcripts containing SSRs were collected and mapped in-silico to the rice genome. SSR-containing transcripts derived from different species, sharing a pre-determined threshold of similarity and matching the same location in rice, were considered putative orthologs that may be used as anchors in comparative mapping studies. This paper describes a methodology for developing EST-SSR markers from wheat, barley and rice, both for building independent species maps and as homologous anchor markers for comparative maps. Over 13,000 untested PCR primer pairs for EST-SSRs were generated from the three gene indices and made available to the research community interested in grass genomes. Researchers are encouraged to evaluate a subset of primer pairs and send feedback regarding their utility to the GrainGenes [20] database for posting. The list (along with all other materials: scripts, program source code and database schemas) is available from the additional files as well as from the Triticeae EST-SSR Coordination webpage in GrainGenes [21].

Frequency of microsatellite types and motifs
Based on combinations of all four nucleotides, the canonical set of SSR motifs is represented by four different duplets (AC, AG, AT, CG), 10 different triplets, 33 different quadruplets and 102 different quintuplet motifs. In the source sequences, all these basic nucleotide motifs can be represented in variant forms of the same basic set or by their reverse complements but to keep a consistency in the database for estimating frequencies, they were transformed into the canonical motifs. Reverse complements and variants would include, for example, CT for AG and GAG for AGG. Sets of unigene sequences such as the TIGR gene indices have the advantage of built-in elimination of redundant SSR counts allowing for more precise estimates of EST-SSR frequency. The rice genomic SSR (gSSR) counts were processed a-posteriori to eliminate redundancy due to BAC/PAC clone overlaps (see methods). Mononucleotide repeats are common in genomic DNA and some are known to be polymorphic but these were deliberately avoided in the unigene database, because they are usually added by the RNA polymerase and are not present in the template DNA (e.g. poly A tails). Table 1 summarizes the frequencies of SSRs in the TIGR gene indices and the rice genome, grouped by SSR type (di-, tri-, tetra-, and penta-nucleotides) and by several minimum acceptable microsatellite lengths starting at 12 bp or longer. Counts were cumulative, meaning that the totals for SSRs with 12 or more nucleotides included the number of longer SSRs displayed in columns to the right in this table (Table 1). If microsatellite expansion were motif-sequence and location independent, under the null hypothesis one would expect that all types of SSRs would expand at the same rate, and thus, the proportions of SSR types would remain equal from short to long SSRs. 
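The reduction of observed motifs to a canonical set can be illustrated with a short sketch. It assumes, as one plausible convention, that the canonical form of a motif is the alphabetically smallest string among all cyclic shifts of the motif and of its reverse complement; the function names are hypothetical and are not taken from the scripts distributed with this paper. Under that assumption the enumeration reproduces the counts quoted above (4, 10, 33 and 102 motifs for unit sizes 2 to 5).

```python
from itertools import product

# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(motif):
    """Return every cyclic shift of the motif (e.g. AGG, GGA, GAG)."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical_motif(motif):
    """Assumed convention: smallest string among all cyclic shifts of
    the motif and of its reverse complement."""
    motif = motif.upper()
    return min(rotations(motif) + rotations(reverse_complement(motif)))


def is_primitive(motif):
    """False when the motif is a repeat of a shorter unit (e.g. ATAT, AAA)."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def canonical_set(unit_size):
    """Enumerate the canonical motifs for a given repeat unit size."""
    return sorted(
        {
            canonical_motif("".join(p))
            for p in product("ACGT", repeat=unit_size)
            if is_primitive("".join(p))
        }
    )
```

For example, `canonical_motif("CT")` gives `AG` and `canonical_motif("GAG")` gives `AGG`, matching the two transformations mentioned in the text.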
However, the proportions of SSR types changed with the length of the SSRs (as did the proportions of motifs, Table 2) and the different SSR types (and motifs) seemed to expand or contract at different rates in the genome of rice, and at different rates when compared to the unigenes of the other two cereals (Tables 1 and 2). SSRs of the trinucleotide type were the most frequent overall. The relatively higher proportion of short trinucleotide-based SSRs in rice unigenes was apparent when compared to the same length categories in wheat or barley unigenes and when compared to the rice gSSRs. The proportion of dinucleotide repeats was greater among genomic microsatellites than among EST-SSRs, and this proportion increased with longer SSRs, overtaking trinucleotide repeats in all datasets when the minimum length was set to 20 bp in gSSRs and 30 bp in EST-SSRs. For instance, in the rice genome 72% of the SSRs longer than 30 bp were of the dinucleotide type; thus, it appears that the rice genomic sequence is relatively richer in dinucleotide SSRs than the gene indices from rice (a subset of the genome) and the other two species.

ESTs are a rich source of SSRs
The abundance of SSRs (perfect and imperfect) in the unigenes ranged from one in every 100 to one in every two unigenes, depending on the minimum length (Table 1). When all SSRs with a minimum length of 12 bp were tabulated, 50%, 36% and 31% of rice, barley and wheat unigenes, respectively, had at least one SSR. When the minimum length was raised to 16 bp, the proportions were reduced to 13%, 10% and 8%, respectively. Rice unigenes had a higher frequency of SSRs than did barley and wheat for most minimum lengths, but not for SSRs longer than 20 bp, where the relative abundance was similar in all three gene indices. The nearly two-fold difference at a minimum length of 18 bp was mostly due to the high abundance of trinucleotide SSRs in rice unigenes relative to wheat and barley. The abundance of trinucleotide repeats decreased by about one half for each repeat unit added to the series. The decline in abundance was steeper for tetranucleotide and pentanucleotide repeats but was less than one half for dinucleotide repeats, which, at lengths greater than or equal to 30 bp, became the predominant type in all datasets. At 30 bp or longer, the AT motif was most common among gSSRs, while AG was more numerous among EST-SSR motifs (Table 2).
In absolute numbers, wheat unigenes contained a larger number of SSRs for all repeat length categories, followed by rice and barley unigenes. This is probably because wheat has more than twice as many unigenes as rice or barley, with 109,782 for wheat, 51,569 for rice and 48,159 for barley. The larger number of unigenes in hexaploid wheat may result from divergence of the genes in the three genomes, but also from a relatively larger EST dataset, i.e., more ESTs have been sequenced for wheat, with a sequence redundancy of 3.8×, versus 6× and 2.7× for barley and rice, respectively (see methods).
The number of the ten most frequent motifs was tabulated for different minimal SSR lengths (Table 2). The relative proportions of motifs fluctuated with different length constraints as well as source species. At a minimum SSR length of 12 bp, CCG was predominant in all datasets, but AT and AG were more frequent in the higher range of minimal SSR lengths. Among dinucleotides in the rice EST-SSRs, AG and AT were the most common, but AG and AC were more common in wheat and barley EST-SSRs. Besides CCG, other frequent trinucleotide motifs were AGG and AGC. The trinucleotide (CCG)n microsatellite was present in both coding regions and UTRs. In coding regions, this triplet has the potential to code for the amino acids proline (CCG), arginine (CGG), alanine (GCC) or glycine (GGC). Among these, expansion of the motif leading to additions of proline could have the strongest effects on protein structure, while alanine and glycine would have relatively small effects.
Density of gSSRs equal to or longer than 12 bp in rice chromosome 1 (R1) ranged from 1 gSSR in 2.8 kbp near the centromere to 1 gSSR per 1.1 kbp in the distal regions (Figure 1A). For a more stringent subset of gSSRs (≥ 16 bp tetranucleotides, ≥ 18 bp dinucleotides and trinucleotides, and ≥ 20 bp pentanucleotides), the density ranged from 1 gSSR in 10 kbp around the centromere to 1 gSSR in 3.8 kbp in the densest region of the short arm (Figure 1A).
Comparing the density of gSSRs with (a) the density of all rice unigenes mapped to R1 and (b) the density of EST-SSRs (a subset of the unigenes) mapped to R1 provided an estimate of the relationship between gSSRs and gene regions in the rice genome. There was a striking resemblance between the plots for the density of gSSRs and the density of unigene-derived EST-SSRs in R1 pseudomolecules (Figure 1A). The similarity in density patterns was less apparent but still present between gSSRs and R1 matches to all rice unigenes (Figure 1B), and these densities were significantly correlated (r = 0.45, p ≤ 1E-5; Figure 2).
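The density comparison can be sketched as two steps: counting features in fixed windows along the chromosome, then computing a Pearson correlation between the two count series. The sketch below is illustrative only. The 500 kbp window matches the unit in which densities are reported in this paper, but the function names, the 0-based coordinates and the use of a plain Pearson coefficient are assumptions, not a description of the scripts actually used.

```python
from math import sqrt


def window_counts(positions, chrom_length, window=500_000):
    """Count features per window along a chromosome.

    positions: 0-based start coordinates (assumed convention).
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

In use, one would call `window_counts` once for the gSSR coordinates and once for the unigene match coordinates on the same chromosome, then pass the two lists to `pearson_r`.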
A decomposition of the set of stringent gSSRs (see the methods section for criteria defining the "stringent gSSRs") by types in R1 ( Figure 1C) showed that the relative proportions of the pentanucleotide gSSRs ≥ 20 bp were consistently lower over the majority of the chromosome, while the proportion of the other three types of microsatellites was higher, indicating that pentanucleotides are only a small component in the non-homogeneous distribution of the stringent subset.

Development of primers for cereal EST-SSRs
We designed primer pairs for 5,425 wheat, 3,036 barley and 4,726 rice EST-SSRs conforming to the stringent restrictions described in the methods. The average product size expected from the set of designed primers was 217 bp for rice EST-SSRs, 213 bp for wheat and 218.9 bp for barley.
Of those EST-SSRs, 42% of the wheat and 56% of the barley EST-SSRs were mapped in-silico to the rice genome. Additional file 1 contains the list of primer pairs, which can be downloaded for testing.

Are microsatellites preferentially associated with gene-rich DNA in rice?
Morgante and colleagues [22] reported that in plants, gSSRs were preferentially associated with non-repetitive DNA such as the gene-rich regions. They found a highly significant, positive, linear relationship (r² = 0.94, p < 0.006) between genomic microsatellite frequency and the percentage of single copy DNA in several plant species with a wide range of genome sizes. Estimates of repetitive and non-repetitive single-copy DNA fractions were based on reviews of the literature describing renaturation kinetics experiments for each of the species. Plant species that have gone through genome expansion due to retrotransposon amplification, such as maize and wheat, had a lower genomic microsatellite frequency indicating that SSR frequency is not a function of overall genome size but rather the relative proportion of single-copy DNA.
In this study, the best similarity matches between rice unigene sequences and the genomic sequence of rice chromosome 1 were used to estimate the density of transcribed regions along R1. This estimate was compared to both the density of gSSRs and the density of EST-SSRs (the latter group being the intersection of the set of gSSRs and the set of transcribed regions, or unigenes). The density pattern of transcribed regions (unigenes in R1) and of SSRs within transcribed regions (unigenes in R1 with SSRs or EST-SSRs) followed closely the density pattern of gSSRs in rice chromosome 1 (r = 0.45, p < 1 × 10⁻⁵ and r = 0.62, p < 1 × 10⁻¹⁰, respectively) (Figures 1A, 1B and 2).
The density (counts per 500 Kbp) of gSSRs along R1 was higher than both the density of transcribed regions and the density of EST-SSRs. A large number of gSSRs that are not already included in the set of EST-SSRs could still be associated with genes because, as Figure 1B   Micropon family of MITEs [23,24] which are associated with gene-rich regions.
Other reports have documented a role for SSRs that are associated with genes in the control of gene expression. For example, several human diseases have been linked with events of triplet expansions in the past [25]. Chromatin remodeling and gene silencing via histone-deacetylation/cytosine-methylation are among the putative functions of SSRs in the vicinity of genes, especially if GC rich. Coffee [26] showed that histone deacetylation (and methylation of CpG bases) leading to lower expression at the FMR1 locus in fragile X was a consequence of CCG repeat expansion. In another example, the expansion of a (AGC)n SSR in the 3' UTR of the myotonic dystrophy (DM) protein kinase gene could potentially affect the expression due to changes in local chromatin structure [27]. It has been found that DM patients have a reduced or complete loss of a nuclease-hypersensitive site in the region of the gene. Further analysis showed that the majority of DM protein kinase transcripts from cells carrying the repeat expansion also lacked the last two exons of a normal transcript, showing that the repeat expansions affected the splicing at the 3' end.
In rice, although a single-base mutation breaking an intron splice site is more directly responsible for the difference in phenotypes of the waxy gene, polymorphism due to SSR expansion has been associated with variation of expression levels in different japonica and indica varieties [28,29]. The effect that microsatellites might have on gene expression in plants may be observed as natural phenotypic variation.

A strategy to exploit the EST database for microsatellite markers
One strategy to better exploit a database of EST-SSRs in order to find polymorphic markers is to first sample the longest SSRs (≥ 30 bp), favoring dinucleotide repeats, then follow with trinucleotide, tetranucleotide and pentanucleotide repeats [23]. After exhausting the longest SSRs, one would then proceed with another cycle to select shorter SSRs. Short trinucleotide-based microsatellites such as (CCG)n, the most abundant group overall (Table 2), are more likely to derive from coding regions, thus reducing the chances for finding polymorphism [13]. This strategy is based on the following observations from our results and the literature:
1) Dinucleotides are a better source of polymorphic markers than the other types [30].
2) Longer SSRs generally have a higher tendency to be polymorphic [23].
3) SSRs deriving from UTRs have the potential for a higher polymorphism than those derived from coding regions, which are constrained by purifying selection [30].
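The sampling order described above can be expressed as a sort key. The following is a minimal sketch under stated assumptions: the record fields (`id`, `unit_size`, `length`) and the function name are hypothetical, and breaking ties by descending length within each SSR type is our own choice rather than something prescribed in the text.

```python
# Preference among SSR types: dinucleotides first, pentanucleotides last.
TYPE_RANK = {2: 0, 3: 1, 4: 2, 5: 3}


def screening_order(candidates, long_cutoff=30):
    """Order EST-SSR candidates for primer testing.

    Long SSRs (>= long_cutoff bp) come first; within each length class
    the candidates are ordered by SSR type, then by descending length
    (the tie-break is an assumption).
    """

    def key(candidate):
        is_short = candidate["length"] < long_cutoff
        return (is_short, TYPE_RANK[candidate["unit_size"]], -candidate["length"])

    return sorted(candidates, key=key)
```

A second cycle over shorter SSRs then falls out of the same ordering, because every candidate below the cutoff sorts after every candidate at or above it.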
One percent of unigenes from the three species examined in this study have SSRs starting with a minimum of 30 bp (Table 1). Overall, 880 wheat unigenes, 530 rice unigenes and 340 barley unigenes contain at least one of these long microsatellites and primer pairs were successfully designed for 451, 276 and 148 of these long EST-SSRs in wheat, rice and barley, respectively. At this minimum length, trinucleotide repeats were not the most frequent. Nearly 50% of the EST-SSRs longer than 30 bp were based on dinucleotide repeats (Table 1), with (AG)n being the most common motif (Table 2). Yet, the frequency of SSR types among those for which primers could be designed did not follow this pattern. Trinucleotide repeats were still the most common type in this group, followed by dinucleotides. This was due to the fact that dinucleotides are found preferentially in the UTRs of transcripts, and their sequences had fewer surrounding bases to anchor acceptable primers.
After relaxing the microsatellite length constraint to a minimum of 20 bp, the overall number of SSRs in the unigenes increased to around 5%. An additional 3,195, 2,703 and 1,991 EST-SSRs in wheat, rice and barley become available for primer design. Acceptable primer pairs were designed for 2,622 wheat, 2,183 rice and 1,476 barley EST-SSRs in this category (which included the set mentioned previously).
Figure 2. Linear regression of the density of genes in rice chromosome 1 (roughly estimated by the matches of OsGI sequences to this chromosome: Osgi_vs_chr1) on the density of gSSRs (≥ 12 bp) in rice chromosome 1.
The set of EST-SSRs with acceptable primer pairs (Additional file 1) were selected from among all dinucleotide and trinucleotide EST-SSRs with a minimum of 18 bp, tetranucleotide EST-SSRs longer than 16 bp and pentanucleotide EST-SSRs 20 bp or longer. However, from 5,424 wheat unigenes associated with microsatellites and having a set of PCR primers in our database, only 2,323 had a best match in the rice BAC/PACs with our stringency settings. The rest (57%) are not anchored to the rice genome but still have potential to provide polymorphic wheat microsatellite markers. The same applies to 44% of the barley EST-SSRs with acceptable primers. The reasons for a large number of wheat/barley unigenes without matches to rice genomic sequence include not having the complete sequence of the rice genome available (the majority of clones were still in sequencing phase 2, with gaps) and having a relatively high stringency setting for filtering wheat and barley sequence comparisons to rice. In previous comparisons between wheat EST unigenes and the same version of the rice genome sequence draft [31,32], we found that 40% of the unigenes did not significantly match a sequence in the rice genome.

Conclusion
The relative proportions of di-, tri-, tetra-, and pentanucleotide repeats and motifs varied widely depending on length and were not consistent among the species examined. We have shown that ESTs are a good source of SSRs that can be exploited to develop microsatellite markers for wheat, barley and rice. The advantage of this approach is that the sequences are already available, resulting in a lower cost than designing and testing microsatellites from anonymous genomic libraries, even if the polymorphism rate for EST-derived markers is lower.
EST-SSRs are useful for enhancing individual species maps, but can also be used as anchor probes for creating links between maps in comparative studies when designed from sets of orthologous genes, as demonstrated by Yu et al. [33]. The annotation and/or the sequence similarity between putative orthologous genes from two related species can provide the basis for their use in comparative maps. More than 13,000 primer pairs were designed to amplify fragments from a stringent subset of EST-SSRs in wheat, rice and barley and are available to the public for testing.
Using a different methodology, our results substantiated the report by Morgante et al. [22] suggesting that microsatellites are predominantly found in the vicinity of genes. In some instances, their presence in the vicinity of genes may implicate a regulating function by mechanisms involving chromatin remodeling and DNA methylation.
[35]. For the rest of the genome, accession numbers for the individual BAC/PAC clones in the tiling path were used to download the corresponding sequence from NCBI GenBank [36]. The tiling path for chromosomes 2 to 12 was used to facilitate the posterior ordering of clones.

Scanning of the rice genome and the non-redundant EST datasets for SSRs
The TIGR gene indices and the genome of rice were scanned with a modified version of Sputnik [22] available from the University of Delaware [37] to find all perfect and imperfect SSRs having 2 to 5 nucleotides in the basic repeat unit and at least 12 bp in total length. For imperfect SSRs, up to 10% sequence deviation from a perfect SSR was included. We modified the way the program handles input sequences in the NCBI FASTA format and the format of the program's output, making it easier to export to relational databases. No changes were made to the underlying algorithms written by C. Abajian and modified by Morgante's group. The version of Sputnik used to generate the microsatellite data for this report can be obtained from the GrainGenes EST-SSR coordination webpage [21], or by downloading the additional file 2.
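To make the scanning criteria concrete, the sketch below finds perfect SSRs only, using regular expressions. It is a deliberate simplification and not the Sputnik algorithm: it does not score or tolerate imperfect repeats, and it reports the repeat unit as it appears in the sequence rather than in canonical form. The function names and the 0-based, end-exclusive coordinates are assumptions made for illustration.

```python
import re


def _is_primitive(unit):
    """False when the unit is itself a repeat of a shorter unit."""
    n = len(unit)
    return not any(
        n % k == 0 and unit[:k] * (n // k) == unit for k in range(1, n)
    )


def find_perfect_ssrs(seq, min_len=12, unit_sizes=(2, 3, 4, 5)):
    """Return (start, end, unit) for perfect SSRs of at least min_len bp.

    Coordinates are 0-based with an exclusive end. Only whole repeat
    units are counted, so reported lengths are multiples of the unit size.
    """
    seq = seq.upper()
    hits = []
    for k in unit_sizes:
        min_units = -(-min_len // k)  # smallest unit count reaching min_len
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_units - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip e.g. AGAG reported as a tetranucleotide; the
            # dinucleotide pass already covers that repeat.
            if _is_primitive(unit):
                hits.append((match.start(), match.end(), unit))
    return sorted(hits)
```

A scanner that also accepts imperfect SSRs, as Sputnik does, would need an alignment-style score with a mismatch penalty instead of an exact back-reference.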
In order to avoid counting the same microsatellites several times in the rice genome because of the redundancy created by overlapping regions between contiguous BAC/PACs in chromosomes 2 to 12, the gSSRs were annotated as redundant or not, according to their location in the tiling path. When an SSR was located in a region of a BAC/PAC that overlapped with a neighboring clone (based on the tiling path information as well as MegaBLAST [38] pairwise alignments), only the SSRs belonging to the overlapped region of the top (northern) clone were counted, while those present in the overlapped region of the bottom (southern) clone were ignored. All SSRs found in unique, non-overlapping regions of rice clones were counted. A Perl script that performed queries and updates to the SQL database (via the DBI Perl module) scanned the tables of genomic microsatellites and flagged them according to the procedure explained above. As a result, 24% of the genomic microsatellites (of length ≥ 12 bp) found in the rice BAC/PAC clones were ignored, as they were duplicates due to clone overlaps. Table 1 shows the counts and relative proportions of SSRs found in the four datasets (rice, wheat and barley gene indices as well as the non-redundant rice genomic sequence) for dinucleotides, trinucleotides, tetranucleotides and pentanucleotides at different minimum microsatellite lengths (greater than or equal to 12 bp). Table 2, on the other hand, shows the relative proportions of the ten most common motifs for each dataset when subject to different constraints for minimum microsatellite lengths.
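The flagging rule can be sketched as follows. This is a simplified illustration, not the Perl/SQL procedure itself: it assumes that each clone's overlap with the preceding (northern) clone in the tiling path is known as a number of bases at the start of the clone, that both clones are in the same orientation, and that SSR coordinates are 0-based within their clone. The data layout and names are hypothetical.

```python
def flag_redundant(ssrs, overlap_with_previous):
    """Mark SSRs that fall in the part of a clone already covered by
    the preceding clone in the tiling path.

    ssrs: list of dicts with 'clone' and 'start' (0-based, in clone
          coordinates; assumed layout).
    overlap_with_previous: dict mapping clone name to the number of
          leading bases shared with the preceding clone.
    """
    flagged = []
    for ssr in ssrs:
        overlap = overlap_with_previous.get(ssr["clone"], 0)
        # Copies in the northern clone are kept; the southern copy,
        # lying inside the shared leading region, is marked redundant.
        flagged.append({**ssr, "redundant": ssr["start"] < overlap})
    return flagged
```

The share of redundant gSSRs is then simply the number of flagged records divided by the total number of records.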

In-silico mapping of grass non-redundant EST-SSRs
The set of EST unigenes associated with SSRs from wheat and barley was matched against the sequence of the rice genome to provide a putative map location in rice. Only the best hit was recorded for any given EST unigene. The similarity threshold was set at an E-value < 1 × 10⁻¹⁰ and at least 80% similarity over a minimum alignment length of 100 bp. Rice EST unigenes were matched with the same criteria except for a higher similarity threshold of 95%. The inferred location in the rice genome for rice EST unigenes was used to estimate the proportion of rice gSSRs that were associated with regions containing genes.
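The filtering step can be sketched as below. The thresholds are those stated above; everything else is an assumption made for illustration: the hit records are plain dictionaries with hypothetical field names, "similarity" is interpreted as percent identity, and the best hit is taken to be the one with the lowest E-value, with the higher bit score breaking ties.

```python
def best_hits(hits, max_evalue=1e-10, min_identity=80.0, min_align_len=100):
    """Keep, for each query unigene, the single best hit that passes
    the thresholds.

    hits: iterable of dicts with 'query', 'subject', 'pident',
          'length', 'evalue' and 'bitscore' (assumed field names).
    """
    best = {}
    for hit in hits:
        passes = (
            hit["evalue"] < max_evalue
            and hit["pident"] >= min_identity
            and hit["length"] >= min_align_len
        )
        if not passes:
            continue
        current = best.get(hit["query"])
        rank = (hit["evalue"], -hit["bitscore"])
        if current is None or rank < (current["evalue"], -current["bitscore"]):
            best[hit["query"]] = hit
    return best
```

For the rice unigenes, the same function would be called with `min_identity=95.0`.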

PCR primer design
We developed a Perl script (see Additional file 3) that automatically queries the database of EST-SSRs to design primers in batch, based on what was learned in previous experiments and on recommendations found in the literature to maximize the chance of selecting polymorphic microsatellite markers. The script used the BioPerl module [39] to control the Primer3 core program [40,41], feeding each of the SSR source sequences and specifying the target regions to be amplified via PCR.
EST-SSRs were selected for primer design when conforming to the following more stringent restrictions (referred to as the set of stringent SSRs or gSSR stringent):
a) The SSRs are dinucleotides or trinucleotides with a length of 18 bp or more, tetranucleotides of 16 bp or more, or pentanucleotides of 20 bp or more.
b) The imperfect SSRs have less than 10% mismatches or gaps relative to a perfect SSR of the same length and motif.
c) There is a minimum of 50 bp surrounding the SSR edges in the source sequence to allow for possible primer design.
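Criteria a) to c) can be combined into a single predicate, sketched below. The record layout and names are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and criterion c) is interpreted here as requiring at least 50 bp of flanking sequence on each side of the SSR, which is our reading rather than an explicit statement in the text.

```python
# Minimum SSR length (bp) for each repeat unit size, from criterion a).
MIN_LENGTH_BY_UNIT = {2: 18, 3: 18, 4: 16, 5: 20}


def passes_stringent_filter(ssr, seq_length, max_deviation=0.10, min_flank=50):
    """Return True when an SSR meets criteria a), b) and c).

    ssr: dict with 'unit_size', 'start', 'end' (0-based, end exclusive)
         and 'mismatches' (mismatches or gaps relative to a perfect
         SSR); the field names are assumptions.
    seq_length: length of the source sequence in bp.
    """
    min_len = MIN_LENGTH_BY_UNIT.get(ssr["unit_size"])
    if min_len is None:
        return False
    length = ssr["end"] - ssr["start"]
    if length < min_len:  # criterion a)
        return False
    if ssr["mismatches"] / length >= max_deviation:  # criterion b)
        return False
    left_flank = ssr["start"]
    right_flank = seq_length - ssr["end"]
    return left_flank >= min_flank and right_flank >= min_flank  # criterion c)
```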
The parameters used for the Primer3 program specified an optimal Tm of 60°C, with a minimum and maximum of 57°C and 65°C, respectively, and a GC content of 30% to 70%, with a low chance of primer-dimer or hairpin formation.
The range for PCR product length was set to be between 100 and 300 bp.
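For readers who want to reproduce comparable settings, the sketch below builds one Primer3 input record in Boulder-IO format from Python. It is not the BioPerl-based script of Additional file 3. The tag names follow the current Primer3 manual and may differ from those accepted by the Primer3 release used for this study; the thresholds for dimer and hairpin formation are not given numerically in the text and are therefore left at their defaults; the function name and the 0-based target start are assumptions.

```python
# Settings stated in the text, expressed with current Primer3 tag names.
PRIMER3_SETTINGS = {
    "PRIMER_OPT_TM": "60.0",
    "PRIMER_MIN_TM": "57.0",
    "PRIMER_MAX_TM": "65.0",
    "PRIMER_MIN_GC": "30.0",
    "PRIMER_MAX_GC": "70.0",
    "PRIMER_PRODUCT_SIZE_RANGE": "100-300",
}


def boulder_record(seq_id, template, ssr_start, ssr_length):
    """Build one Boulder-IO record asking Primer3 for primers that
    flank the SSR.

    ssr_start is assumed to be 0-based; SEQUENCE_TARGET takes the
    form "<start>,<length>".
    """
    lines = [
        "SEQUENCE_ID=%s" % seq_id,
        "SEQUENCE_TEMPLATE=%s" % template,
        "SEQUENCE_TARGET=%d,%d" % (ssr_start, ssr_length),
    ]
    lines += ["%s=%s" % (tag, value) for tag, value in PRIMER3_SETTINGS.items()]
    lines.append("=")  # a line holding only "=" terminates the record
    return "\n".join(lines) + "\n"
```

Records for many EST-SSRs can be concatenated and passed to the Primer3 core program in a single run, which is the batch style of processing described above.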