Patterned sequence in the transcriptome of vascular plants

Background Microsatellites (repeated subsequences based on motifs of one to six nucleotides) are widely used as codominant genetic markers because of their frequent polymorphism and relative selective neutrality. Minisatellites are repeats of motifs having seven or more nucleotides. The large number of EST sequences now available in public databases offers an opportunity to compare microsatellite and minisatellite properties and evaluate their evolution over a broad range of plant taxa. Results Repeated motifs from one to 250 nucleotides long were identified in 6793306 expressed sequence tags (ESTs) from 88 genera of vascular plants, using a custom data-processing pipeline that allowed limited variation among repeats. The pipeline processed trimmed but otherwise unfiltered sequence and output nonredundant loci of at least 15 nucleotides, with degree of polymorphism and PCR primers wherever possible. Motifs that were an integral multiple of three in length were more abundant and richer in G/C than other motifs. From 80 to 85% of minisatellite motifs represented repeats within proteins, up to the 228-nucleotide repeat of ubiquitin, but not all of these repeats preserved reading frame. The remaining 15 to 20% of minisatellite motifs were associated with transcribed repetitive elements, e.g., retrotransposons. Relative microsatellite motif frequencies did not correlate tightly to phylogenetic relationship. Evolution of increased microsatellite and EST GC content was evident within the grasses. Microsatellites were less frequent in the transcriptome of genera with large genomes, but there was no evidence for greater dilution of the transcriptome with transposable element transcripts in these genera. Conclusion The relatively low correlation of microsatellite spectrum to phylogeny suggests that repeat loci evolve more rapidly than the surrounding sequence, although tissue specificity of the different EST libraries is a complicating factor. In-frame motifs are more abundant and higher in GC than frame-shifting motifs, but most EST minisatellite loci appear to represent repeats in translated sequence, regardless of whether reading frame is preserved. Motifs of four to six nucleotides are as polymorphic in EST collections as the commonly used motifs of two and three nucleotides, and they can be exploited as genetic markers with little additional effort.


Background
Microsatellites have been defined as tandem direct repeats of motifs (subsequences) of one to six bases [1], and minisatellites as tandem direct repeats of longer motifs. Adja-cent repeats comprise a microsatellite locus, which may be simple if there is only one motif, or compound, if there are two or more adjacent blocks of repeated motifs separated by less than some maximum gap of nonrepeated sequence [2]. Microsatellites have been classified as perfect, having no variation among the repeated motifs within the locus, or imperfect, with less than some tolerated percentage (e.g., 10%) of deviant nucleotides over the entire locus. The minimum microsatellite locus has spanned 12 to 20 bases in a number of studies [3][4][5][6][7], although polymorphism has been found to be higher in those at least 20 nucleotides long [8].
Microsatellites are frequently polymorphic within populations or species, and therefore have been widely used as markers in genetic mapping, fingerprinting, and population genetics. New alleles are thought to arise facilely by replication slippage or unequal crossing over at meiosis [9], or by gene conversion [10,11]. To some extent, the same usefulness and evolutionary mechanisms apply to minisatellites. Until relatively recently, the software tools for identifying short, tandemly repeated subsequences have been directed at microsatellites [12][13][14][15], although one tool, the Tandem Repeats Finder [16], can currently detect repeated subsequences of any motif length less than 2000 nucleotides. Meanwhile, studies of longer repetitive elements have concentrated on sequences of a few hundred to a few thousand base pairs, e.g., transposable elements, that (with the exception of 5S and 18S-28S ribosomal RNA genes) are generally dispersed in the genome and do not form large tandem arrays with varying copy number. Repetitive subsequences between these length classes remain less analyzed.
Collections of ESTs (expressed sequence tags derived by reverse transcription of messenger RNA) have proven valuable as sources of microsatellite markers. In part, this is because there are many distinct EST sequences in databases [17], microsatellites occur preferentially in nonrepetitive DNA [18] (but see [19] for a contrasting view), and because most EST sequence represents expressed genes that occur singly or in small families in the genome. It also accords with the relatively conserved nature of expressed sequence in relation to the chaotic melange of rapidly evolving repetitive sequences that comprise the bulk of plant genomes [20]; therefore, microsatellite markers derived from the ESTs of one species are likely to be useful in related species [3]. The diminished transferability of microsatellite markers among more distantly related species raises the question of the rate at which such sequences and their flanking sequences evolve, and whether the distribution of microsatellite types carries any phylogenetic information.
Before I was aware of the capacity of Tandem Repeats Finder [16] to identify repeats of minisatellite motifs, I wrote a generic repeat finder in C and several Perl scripts to identify distinct microsatellite loci, group sufficiently close loci into compound loci, condense the collection of microsatellite-bearing sequences to a nonredundant EST set by means of phrap [21], identify likely polymorphisms within the collection, and select plausible PCR primers with primer3 [22]. The resulting pipeline processes any FASTA-formatted sequence collection, which can be highly redundant, and outputs a list of relatively nonredundant microsatellite loci with position, motif, degree of observed polymorphism in the collection, and primers. The findings of this pipeline are here reported and statistically characterized for 88 genera of vascular plants, each represented by a collection of at least 3000 EST sequences in GenBank as of 15 October 2005. Also reported are the results of a maximum-likelihood phylogenetic analysis of the motif frequencies, to express the similarity of these motif frequencies among genera and to relate similarity in microsatellite spectrum to more conventional, broadly based measures of their phylogeny. These findings can be used to test the hypotheses that repeated subsequences are subject to natural selection and that their evolution reflects the evolution of their host genomes.

Scope
This study used a custom pipeline to identify repeated subsequences from 1 to 250 bases long in 6793306 sequences and 3681160063 nucleotides from 88 genera of vascular plants. The phrap step reduced the total number of nucleotides to 836433533 in the nonredundant set, for a 77.3% reduction of total sequence length. The genera ranged from Selaginella to members of the Poaceae and Asteraceae, and gymnosperms, Poaceae, and Solanaceae were comparatively well represented, with at least five genera each. The analysis was repeated three times for each genus, allowing for zero, one, or two singlenucleotide mismatches over all copies of the motif in the repeat locus. In this way, the analysis considered perfect microsatellites, and imperfect microsatellites with a 10% tolerance at minimum locus lengths of 15 and 20 nucleotides. Microsatellites are reported as simple, if they occur in isolation, and compound, if they occur within five bases of another microsatellite. All frequencies are reported from nonredundant sequence as recognized by phrap within the pipeline. Figure 1 shows on a log scale the abundance of microsatellites and minisatellites per megabase in relation to motif length for zero, one, or two allowed mismatched nucleotides. Major shifts in abundance of short motifs reflect the differing threshold values for a minimal microsatellite as the number of allowed mismatches varies. Above decamer motifs there appear periodic peaks wherever the motif length modulus three is zero, i.e., where the minisatellite cannot change the reading frame of the message. This pattern holds even though the abundance of loci with two allowed mismatches falls short of that for zero or one allowed mismatch. The pattern breaks down for motifs longer than 150 bases, largely because the distribution by length becomes discontinuous. These zero values (actually negative infinities) appear at 0.001 in Figure 1. The peak at 228 bases was real and appeared in 13 genera, including dicots, monocots, and a gymnosperm. All 27 instances were related to one another at the peptide level, at a maximum e-value of 7e-21 in a mutual tblastx search. A tblastx of an instance from Triticum aestivum revealed no relationship to repetitive elements in RepBase [23], but blastn versus GenBank hit numerous ubiquitin sequences at 1e-69.

Frequency by motif length
The observed motif-length frequencies were compared to those expected from Equations 1 and 2 (given in Materials and Methods) for the pooled sequences of 12 taxonomically divergent genera: Aquilegia, Brassica, Citrus, Euphorbia, Festuca, Gossypium, Helianthus, Ipomoea, Lotus, Mesembryanthemum, Nicotiana, and Prunus. These genera had relatively large EST collections that were still small enough to be processed by phrap without subdivision. As shown in Supplementary Table 1 (see Additional File 1), for most motif lengths the observed repeat frequencies greatly exceeded those expected with random sequence having the same length distribution. Overall, the observed excess was greatest with zero allowed mismatches and least with two. With one allowed mismatch, the observed frequencies of motifs of eight to 11 nucleotides differed relatively little from the frequencies expected from random sequence. Because of the longer minimum locus length with two mismatches, the observed frequency of motifs from 10 through 12 nucleotides was close to the expectation with random sequence. The expected frequencies lacked the observed peaks for motif lengths that were integral multiples of 3 and were many orders of magnitude lower than the observed frequencies for motifs longer than 20 nucleotides.

Frequency by motif type
Supplementary Table 2 (see Additional File 2) presents per-megabase frequencies of all 501 canonical single-base through six-base motifs in perfect microsatellites for all 88 genera. Its sister tables, Supplementary Tables 3 and 4 (see Additional Files 3 and 4), respectively present the same data for microsatellites with one or two allowed mismatches within the microsatellite locus. These frequencies are itemized for simple, compound, and total loci, and for the fraction of simple loci to simple plus compound loci, as listed in the "Type" column of each table and defined fully in the Abbreviations. The "Type" column of Supplementary Tables 2, 3, and 4 also designates lines for the repeat count and frequency of polymorphism for each canonical motif, itemized for simple loci and submotifs in compound loci, as coded in the Abbreviations. All submotifs within a compound locus were counted as polymorphic if any one of them was, since polymorphism was scored by locus. In Supplementary Tables 2, 3, and 4, the third column is the mean frequency of the respective motif and quantity over all 88 genera. Table 1 combines these mean frequencies over the six length classes to provide a collective view of behavior in regard to motif length. Table 1 shows that trinucleotides are the most abundant motif in simple perfect microsatellites, followed consecutively by dinucleotides, hexanucleotides, pentanucleotides, mononucleotides, and tetranucleotides. In contrast, tetranucleotide motifs are the most common in compound perfect microsatellites, followed by trinucleotides, pentanucleotides, hexanucleotides, dinucleotides, and mononucleotides. The total frequency is dominated by the compound microsatellites and follows the same ranking. The fraction of simple to total perfect microsatellites is highest for mononucleotide and trinucleotide motifs, followed by dinucleotides, hexanucleotides, pentanucleotides, and tetranucleotides. The mean repeat count in simple and compound microsatellites was highest for mononucleotide motifs and decreased with increasing motif length, as expected from the minimum length thresholds necessary to complete a locus of at least 15 nucleotides. The mean repeat count was also higher in simple than in compound microsatellites for the same motif, also as expected from the threshold lengths to Abundance of microsatellites and minisatellites versus motif length, with zero, one, or two allowed mismatched repeat positions Figure 1 Abundance of microsatellites and minisatellites versus motif length, with zero, one, or two allowed mismatched repeat positions. declare a microsatellite. Hexanucleotide motifs were the most likely to be polymorphic in simple microsatellites; tetranucleotide, trinucleotide, pentanucleotide, dinucleotide, and mononucleotide motifs were consecutively less likely to be polymorphic. In compound microsatellites, polymorphism was most likely in hexanucleotide motifs, then trinucleotide, pentanucleotide, tetranucleotide, dinucleotide, and mononucleotide motifs, and the frequencies were lower than in simple microsatellites. Overall, perfect simple microsatellites were significantly more polymorphic than perfect compound microsatellites by 0.176 to 0.137 (p = 1.03e-13 in a paired t-test).
Increased microsatellite frequencies were expected with one allowed mismatch, because the minimum locus size was 15 nucleotides as in perfect microsatellites. The observed increase ranged from 1.1 to 2.4-fold for simple microsatellites, being greatest for pentanucleotide motifs and least for dinucleotide motifs. For compound microsatellites, the increase was from 1.2 to 14.7-fold, except for a 72% reduction for tetranucleotide motifs, which possibly were being recognized at other motif lengths. Again, the increase was greatest for pentanucleotide motifs. Allowing a mismatch also changed the ranking of frequencies by motif length; with one mismatch, the ranking ran trinucleotides, pentanucleotides, dinucleotides, mononucleotides, hexanucleotides, and tetranucleotides in simple microsatellites, and pentanucleotides, hexanucleotides, trinucleotides, tetranucleotides, dinucleotides, and mononucleotides in compound microsatellites. The frequencies of total microsatellites were ranked in the same order as the frequencies of compound microsatellites. The ratio of simple to total microsatellites increased with one mismatch for dinucleotide, trinucleotide, and tetranucleotide motifs, and decreased with mononucleotide, pentanucleotide, and hexanucleotide motifs. The decrease for hexanucleotides was nearly 90%. The ratio was highest for trinucleotide motifs, then mononucleotide, dinucleotide, tetranucleotide, hexanucleotide, and pentanucleotide motifs. Allowing one mismatch did not greatly change the mean repeat count per locus for simple microsatellites, which again tracked the minimum length criteria for different motif lengths. Allowing a single mismatch did increase the mean repeat count for motifs of four or fewer nucleotides in compound microsatellites. A single mismatch had little effect on the fraction of polymorphic loci, although this fraction increased about 0.015 for motifs of four or fewer nucleotides in compound microsatellites, and trinucleotides replaced hexanucleotides as the class with highest polymorphism in compound microsatellites.
Allowing two mismatches and increasing the minimum locus length from 15 to 20 nucleotides reduced the fre- quencies of both simple and compound microsatellites up to 150-fold from their values with one mismatch, and reversed their relationship so that simples were 83% (dinucleotides) to 93% (pentanucleotides) of the total. In simple microsatellites, trinucleotide motifs were most abundant, then dinucleotide, hexanucleotide, pentanucleotide, mononucleotide, and tetranucleotide. The ranking was the same for compound and total microsatellites, except that mononucleotide motifs ranked ahead of pentanucleotide motifs. In accordance with the greater length criteria for a minimal microsatellite with two mismatches, the repeat counts were higher for all motif lengths for both simple and compound microsatellites, and fell with increasing motif length. In simple microsatellites, hexanucleotide motifs were the most likely to be polymorphic, then tetranucleotides, pentanucleotides, trinucleotides, dinucleotides, and mononucleotides. In compound loci, the order was trinucleotides, pentanucleotides, tetranucleotides, dinucleotides, hexanucleotides, and mononucleotides. Mononucleotide motifs were least likely to be polymorphic in simple and compound loci over all three numbers of allowed mismatches.
Mean frequencies of eight microsatellite characteristics appear in the third column of Supplementary Tables 2, 3, and 4 (see Additional Files 2, 3, and 4). These frequencies were ranked in descending order within motif length classes to produce Supplementary Tables 5, 6, and 7 (see Additional Files 5, 6, and 7), respectively for zero, one, or two allowed mismatches. Supplementary Tables 8, 9, and 10 (see Additional Files 8, 9, and 10) list in descending order the numbers of genera for which each motif was top ranked, again respectively for zero, one, or two allowed mismatches. Every canonical motif appears in these tables, but 16 hexanucleotide motifs were less than 0.0005 in total frequency with two allowed mismatches: AAACGT, AACGTT, AACTAT, AACTCT, AAGCGT, AAGCTT, AAGTAT, ACCGGT, ACCTAT, ACGCGT, ACG-TAG, ACGTAT, AGCGCT, AGGCCT, ATATCG, and ATGCGC.
There was a single clearly most abundant motif for mononucleotides (A), dinucleotides (AG), and trinucleotides (AAG). The most abundant motifs in tetranucleotide, pentanucleotide, and hexanucleotide microsatellites were AAAG, AAAAG, AAAAAG, or motifs with A and two or three Gs, but these were relatively less abundant than the top-ranked loci with shorter motifs. PolyA motifs ending in a single T or C were also abundant overall, but other runs of a single letter were not particularly common. Motif rankings diverged between simple and compound loci with increasing motif length. Simple loci were typically from 5 to 20% of all loci with zero or one mismatch, but at least 80% of loci were simple with two allowed mismatches. Table 2 gives rank correlation coefficients (Spearman's rho, calculated using R [24]) for seven types of comparisons at each mismatch level and motif length from three through six. The comparisons are designated according to the same codes as the "Type" column in Supplementary  Table 2. The null hypothesis was no correlation at all. Although R could not calculate exact p-values in the presence of ties in rank, it did produce approximate p-values that were accepted for the determination of significance. These p-values for the combined motif lengths usually tracked the p-values for hexanucleotide motifs, which contributed 350 out of the combined 501 data values. Motif frequency was strongly, positively correlated between compound and simple loci, although the correlation weakened for hexanucleotide motifs. Repeat count was less strongly correlated between compound and simple loci, although the correlation was highly significant for hexanucleotide and collective repeats, and for tetranucleotide motifs with one or two mismatches. Polymorphism was uncorrelated between simple and compound loci with zero or two mismatches; the correlations with one mismatch were low to moderate, but mostly significant. Repeat count was not significantly correlated to frequency of simple loci, except in hexanucleotide motifs and the collective motifs that they dominated. Repeat count was more correlated to frequency for compound loci, where pentanucleotides also reached significant values with one or two mismatches. Polymorphism was significantly correlated to repeat count in simple and compound loci with hexanucleotide motifs and collectively, but not for shorter motifs, and the correlations for trinucleotide motifs were mostly somewhat negative.

Correlations among microsatellite properties
Values of the eight traits defined in the Abbreviations are listed versus motif length and GC content for perfect microsatellites in Table 3, for one mismatch in Table 4, and for two mismatches in Table 5. Monoclinal trends with respect to GC content were uncommon; examples include repeat count in simple trinucleotide loci, which declined with increasing GC over all three mismatch levels, and total frequency in tetranucleotide loci with zero or one mismatches. A simple peak or dip sometimes was observed, e.g., the total frequency of hexanucleotide loci with 2/3 GC was minimal with zero mismatches, and the polymorphism of simple pentanucleotide loci with 3/5 GC was maximal with zero or one mismatch. More generally, there were two peaks or dips in the pattern. The patterns were fitted to a quadratic model of the form y = a + bx + cx 2 , using function glm() in R, to produce Supplementary Table 11 (see Additional File 11). The significance of coefficients in this table is based on t-values, and 151 of the 288 reported coefficients differed significantly from zero at p <= 0.05. There were 54 coefficients that were significant at p <= 0.01, and 50 more that were sig-    nificant at p <= 0.001. The coefficients were particularly likely to be significant for pentanucleotide motifs with a single mismatch and for intercepts in general. The coefficients of the linear and quadratic terms were more likely to be significant for abundance than for the other traits, and usually the linear coefficient was negative and the quadratic coefficient was positive for abundance. The relative abundance of simple and compound loci did not respond much to GC content, and neither did the repeat count at any number of mismatches. Polymorphism frequently had a significantly negative quadratic coefficient, especially with zero or one mismatch in pentanucleotide and hexanucleotide motifs, meaning that the GC-rich motifs were overall less polymorphic.
The GC content of microsatellites was correlated with GC content of the entire transcriptome of each genus, but the correlation diminished for short motifs with two mismatches, where the recognized microsatellite loci were longer. Over all perfect microsatellites, Pearson's correlation coefficient was 0.960; with one allowed mismatch, it was 0.984; and for two mismatches, it was 0.971. The respective values for loci or subloci based on two-to sixbase motifs were 0.953, 0.978, and 0.865, for zero, one, or two allowed mismatches. Microsatellite GC content outpaced increases in transcriptome GC content; the GC content of microsatellites in GC-rich transcriptomes exceeded the transcriptome's GC content, and the GC content of microsatellites in GC-poor transcriptomes fell short of the transcriptome's GC content. For perfect microsatellites, the regression slopes of microsatellite to whole-transcriptomic GC content (y = a + bx by glm in R) were 2.06 for two-through six-base motifs, and 1.96 for all motifs. These slopes respectively became 1.85 and 1.71 for one mismatch, and 2.54 and 1.79 for two mismatches. All of these values differed from zero at p-values less than 1e-16, as did all their intercepts. Fitting a quadratic model (y = a + bx + cx2) gave positive intercepts that did not differ significantly from zero, negative b that did not differ significantly from zero, and positive c between 1.72 and 4.54 that was not significant for short motifs with two mismatches, or significant at 0.082 for all motifs in perfect microsatellites, down to 0.005 for short motifs with a single allowed mismatch. The significant values of c were between 1.72 and 2.61.
The GC content of microsatellites and minisatellites did not vary in proportion to motif length between 1 and 250 bases; the regression coefficients for a linear model (y = a + bx) were 6.26e-5, 3.13e-5, and 4.25e-5 for zero, one, or two mismatches, none of which differed significantly from zero. However, the slope between 1 and 90 base pairs was significantly negative at p < 0.0005 for motif lengths integrally divisible by 3 in all three mismatch classes. Also, GC content spiked at integral multiples of three ( Fig. 2) for motif lengths up to 90. Motifs that frameshifted their message had a significantly lower GC content that those that preserved the reading frame, at p-values less than 6e-12 in t-tests, regardless of mismatch level. Motifs that shifted their reading frame by one nucleotide did not differ significantly in GC content from those that shifted their reading frame by two nucleotides.

Comparative microsatellite frequencies among taxa
Top ranking was well distributed among genera for the 149 dinucleotide through pentanucleotide motifs. For perfect microsatellites, 42 genera ranked first in frequency of at least one motif. For one allowed mismatching nucleotide, 26 genera ranked first, and for two allowed mismatching nucleotides, 50 genera ranked first. In contrast, one genus, Pisum, ranked last (0) for 69 of these motifs, in accordance with its smallest nonredundant sample size of 1076 nonredundant sequences.
Similarity was tested by mean Euclidean distance within groups of related genera for total frequency in simple plus compound loci, repeat count in simple loci, and frequency of polymorphism in simple loci. This mean distance was compared to the distribution of 20000 mean distances in groups of the same size for the same character and number of allowed mismatches, drawn at random without replacement from the set of all 88 genera. The distance is based on the 499 canonical dinucleotide through hexanucleotide motifs; A and C were excluded, because either can appear spuriously in low-resolution sequence, and A can represent untrimmed polyA tails, especially any polyA trapped in chimeric clones. The results appear in GC content versus motif length for tandemly repeated sub-sequences  Tables 6, 7, and 8 for 18 groups ranging in diversity from the grass subfamilies Pooideae and Panicoideae to the three included non-seed plants. Genera were significantly close together for total frequency in the conifers, Solanaceae, and Asteraceae, at all three allowed numbers of mismatches. Otherwise, total motif frequencies for genera within the indicated families and orders were not significantly closer than expected by chance. Also, there were no significant differences of observed mean distances from random expectation for repeat count in these three tables.
Mean distances for polymorphism were significantly close only for the three lower vascular plants with zero or two mismatches, meaning that their average degree of polymorphism for each motif was significantly similar.
A second measure of intergeneric similarity is mean rank difference in particular motif frequencies, which is presented respectively for zero, one, and two mismatches in Supplementary Tables 12, 13, and 14 (see Additional Files 12, 13, and 14). Here the null hypothesis is that the frequency of the given motif for the same number of genera was drawn at random from the entire set of 88 genera. The three tables present only those motifs with a 0.01 or lower probability of being ranked as they were by chance. Since only the 149 canonical motifs of two through five nucleotides were used, about 1.5 motifs are expected to be significant at p <= 0.01 for each group. With zero allowed mismatches, four of 18 natural groupings (Malphigiales, Lamiales, Asparagales, and "other monocots") did not exceed two significant motifs, versus two groups (Malphi- giales and Asparagales) with one allowed mismatch and nine groups ("lower vascular plants", "other gymnosperms", Caryophyllales, Rosaceae, Malphigiales, Lamiales, Asteraceae, Asparagales, and "other monocots") with two allowed mismatches. There were fewest similarly ranked motifs within families and orders when two mismatches were allowed, and most when one mismatch was allowed. In part, rank consistency of a family or order depended on its values versus the mean for all 88 genera; families with very high or very low frequencies were more likely to exhibit a consistent ranking among the sampled genera. Since GC-rich motifs were especially abundant in Poaceae, the Poaceae had numerous GC-rich motifs that ranked more consistently than expected by chance. However, the Solanaceae ranked tightly for 5 to 16 motifs in spite of their low to middling frequencies. The Asteraceae, which like the Solanaceae exhibited a significantly low mean Euclidean distance among its genera, did not cohere as well in rankings, even though it had 52% of its rankings in the top or bottom 25% of the distribution versus 29% for the Solanaceae.

Phylogenetic analysis of microsatellite frequencies
The program contml (continuous character maximum likelihood) in phylip 3.65 [25] used the 149 canonical two-base through five-base motif frequencies to construct three maximum-likelihood trees (Figs. 3, 4, and 5), one per level of allowed mismatches, that provide a convenient basis to consider intergeneric similarities. Each maximum-likelihood tree was the most likely tree to result from 300 trials with different input orders of the genera and global branch rearrangements of the initially formed tree. While it is not probable that a globally most likely tree was ever found, the greatest log-likelihoods were similar among runs within each number of allowed mismatches. However, the maximum found natural-loglikelihood differed greatly among the number of allowed mismatches (-25793.86 with none, -33852.89 with one, and -6671.66 with two), as did the mean Euclidean distance among all 88 genera (189.34 with none, 326.85 with one, and 55.18 with two). The mean Euclidean distances for total frequency within groups of related genera, as recognized in Tables 6 and 8, were significantly lower (p < 1e-06) when two mismatches were allowed. The longer motifs appeared to be more uniform among all vascular plants, and the recognized sequences were more random with two allowed mismatches. However, it is difficult to conclude from the likelihoods or mean distances that any one level of mismatches produced a more realistic phylogeny than did the others.
The three trees group together various, often quite unrelated, genera. Most of these groups are clades (terminal branches) within the tree, but others are merely close to some more basal node. The mean Euclidean distance among their members for dinucleotide through hexanucleotide motifs appears after the taxonomic groupings in Tables 6, 7, and 8. The mean rank differences among their members for dinucleotide through pentanucleotide motifs follow those of the taxonomic groupings in Supplementary Tables 12, 13, and 14. P-values are given for the null hypothesis that the observed ranking resulted from random sampling of the same number of genera from all 88 genera. Since there were 149 motifs involved, only those motifs that cohered at p <= 0.01 are presented. Supplementary Tables 12, 13, and 14 also include the mean ranking for each significant motif, from 1 for the lowest total frequency to 88 for the highest. Thus Supple- Maximum likelihood tree based on perfect microsatellites with motifs from two through five bases long Figure 3 Maximum likelihood tree based on perfect microsatellites with motifs from two through five bases long. The names have been truncated to 10 letters. Maximum likelihood tree based on microsatellites with one allowed mismatch and motifs of two through five bases Figure 4 Maximum likelihood tree based on microsatellites with one allowed mismatch and motifs of two through five bases. Maximum likelihood tree based on microsatellites with two allowed mismatches and motifs of two through five bases Figure 5 Maximum likelihood tree based on microsatellites with two allowed mismatches and motifs of two through five bases.  mentary Tables 12, 13, and 14 specify the motifs that most likely determined the placement of the group members in their respective maximum likelihood trees.
The most prominent clade in all three trees consisted of 11 to 13 out of 15 total genera of grasses (Poaceae), although Cynodon and one to three other genera fell outside of it in each case. Within the grass clade, members of the subfamilies Pooideae and Panicoideae freely interdigitated, although Lolium consistently associated with Festuca, and four genera of the pooid tribe Triticeae grouped together for perfect microsatellites. This clade exhibited increasing GC content distally, which is evident from the generally higher ranking of individual GC-rich motifs for the distal group 15 over group 16 in Supplementary Another durable grouping included Allium and most of the gymnosperms (groups 2 and 3 in Supplementary  Table 12, groups 1 and 2 in Supplementary Table 13), which ranked low for various motifs. The ferns Adiantum and Ceratopteris also ranked low for their most consistent motifs. Glycine, Populus, and Vitis grouped together in all three mismatch levels, with high rankings for AAAAT, AAAT, AAAC, and similar motifs with runs of A. Citrus, Poncirus, and Mimulus also clustered together at each mismatch level; Tamarix associated with these three with zero or two allowed mismatches, and Lycoris associated with these three with zero or one mismatch. Acorus, Ananas, and Eucalyptus associated for zero or two mismatches, but went to separate branches with one mismatch. Thellungiella grouped with Brassica with zero mismatches, Arabidopsis with two mismatches, and neither with one mismatch. The main message of Supplementary Tables 12, 13, and 14, is that there are few prominent groups that remain consistent over numbers of mismatches or minimum locus length. In particular, the Solanaceae, which cluster close together in terms of Euclidean distance, did not cluster in any of the trees, except that Solanum grouped with Lycopersicon with zero or two allowed mismatches. Similarly, the Asteraceae resided in four groups for perfect microsatellites, two groups for one allowed mismatch, and five groups for two allowed mismatches, even though their genera were significantly close together by Euclidean distance. Furthermore, there was no particular correlation of GC-content or GC-rich motif frequencies to higher taxa. Apart from the Poaceae, the microsatellites of monocots were not particularly GC-rich, and one monocot (Lycoris) had the second most AT-rich microsatellites of all 88 genera for zero or one allowed mismatch, even though its EST collection ranked 21 st in AT content for zero mismatches.

Microsatellite frequency in relation to genome size
The total per-megabase frequency of perfect microsatellites was inversely correlated with genome size. This was evident anecdotally from the lowest ranking genera in microsatellite frequency (Table 9), all of which (where known) exceeded the median 2c value of 2.56 pg for EST sequence donors in the 78 genera where DNA content was available [26][27][28]. The negative correlation was also evident by linear regression of microsatellite frequency against 1c DNA content over these same genera. When all 78 genera were considered together, the regression slope was -21.77 loci per picogram of 1c DNA content (p = 0.036), whereas without the Poaceae the slope was -20.43 (p = 0.008). The correlation was not significant within the Poaceae, although the slope was still negative (-10.71, p = 0.90), because Zea has a smaller genome and a lower microsatellite frequency than the pooid genera Triticum, Hordeum, Secale, Avena, Festuca, and Lolium. Nevertheless, Zea has arguably the largest monoploid genome in the panicoid grasses, and it has likely expanded recently during its evolution, apart from its doubled chromosome number [29]. However, this negative correlation almost disappeared when gymnosperms were excluded from the sample (slope = -6.44, p = 0.69), indicating a more direct relationship to phylogeny than to genome size per se. When total perfect microsatellite frequencies were itemized by motif length and regressed versus 1c DNA content over all 78 available genera, the resulting slopes were significantly negative for dinucleotide (-5.04, p = 0.00096) and trinucleotide (-7.62, p = 0.0064) motifs, but not for tetranucleotide motifs (-2.88, p = 0.42). Within the Poaceae, the slopes were negative but not significantly so (p > 0. 30), and without the Poaceae, the slopes became even more significant for di-and trinucleotide motifs, and remained nonsignificant (p = 0.37) for tetranucleotide motifs. Finally, without the gymnosperms, the slopes were negative but not significant for all three motif lengths.
To test the possibility that the transcriptomes of these large-genomed genera are being diluted by transcription of transposable elements, which might be more abundant in large genomes, the phrap-generated singlets and contigs of Allium, Picea, Pinus, Arabidopsis, Glycine, and Oryza, were subjected to tblastx versus RepBase 3.11 at a cut-off e-value of 1e-10. The respective 2c values of these genera are 33.5, 40.4, 44.2, 0.32, 1.14, and 1.00 pg [26,28], when tetraploidy is accounted for in Glycine. The percentage of sequences matching repetitive elements was 16.9 in Oryza, 3.9 in Allium and Glycine, 3.5 in Pinus, 2.8 in Picea, and 2.7 in Arabidopsis, which did not fit the expectation that the three large genomes would have the three highest percentages.

Sources of long motifs
There are three interesting sources of repeated long motifs: repeated domains in proteins, tandem repeats in transcribed repetitive elements such as retrotransposons, and exon repetition resulting from mRNA editing. Several tactics were tried to quantify these sources. The first was to tblastx the 18374 repeated motifs of 30 or more nucleotides, derived from all 88 genera, against RepBase 3.11. These hits were frequent (0.298 of the sequences hit at an e-value less than 1, and the mean was 7.57 matched amino acids per hit) but unimpressive, mostly with e-values greater than 0.01. This was because the motifs were generally short and incompletely matched, representing a nonconservative subsequence in the repetitive element. This led to the second tactic, Monte Carlo simulation of sequences that matched the real repeated motifs in length, followed by tblastx against RepBase. The simulation took independent account of codon usage in each genus (which will be presented elsewhere), and 0.252 of the simulated motifs hit at an e-value less than 1, with a mean of 7.85 matched amino acids per hit. Although these are close to the observed values for the real motifs, a onesided, paired-data t-test gave t = 4.742 for hits per query and t = -2.952 for identical amino acids per hit, both of which were significant at p = 0.002 with 261 degrees of freedom (88 genera, 3 allowed mismatches, and two com-binations that gave no long motifs). Thus there were significantly more hits in the real data, but the hits were shorter than in the simulation, and hits to random sequence could account for about 85 percent of the observed frequency.
The third tactic was tblastx of the whole sequences that contained the long motifs against RepBase. Because of the longer query sequences, the cutoff e-value was decreased to 0.01, but still 8850 sequences hit out of 18374 distinct long-motif-bearing sequences. Furthermore, there were a few close matches to repetitive elements: 3211 (0.175 of the total) hit at 1e-10, 644 (0.035 of the total) hit at 1e-20, 141 (0.008) hit at 1e-30, 38 (0.002) hit at 1e-60, and the closest hit was to a CINFUL2A_I element at 1e-144. At an e-value of 1e-10, the most frequently hit elements were ENSPM4_OS (1012 hits out of 3211 total), ENSPM7_OS (198 hits), MERMITE18D (633 hits), TREP60 (424 hits), SZ-6IN (225 hits), MUDRN3_OS (67 hits), XILON1_ZM_LTR (64 hits), and CACTA-L (59 hits). There were 20 hits to CINFUL2A_I at 1e-10, and 14 of them were closer than 1e-60. Other slowly evolving elements that were hit at 1e-60 or less included HUCK1-LTR_ZM (7 hits) and ZEON1_ZM (3 hits). Thus tblastx of the whole sequences also suggests that 15 to 20% of the long motifs are direct repeats within transcribed repetitive elements.
High copy numbers of long motifs were particularly well represented in Populus, Pinus, Lycopersicon, Solanum, and Helianthus. Many of the longest repeats were reported as parts of compound microsatellites having several submotifs of equal length. Two examples of such loci are presented in Fig. 6. Both examples were imperfect simple repeats of a long motif, with more mismatches over repeats than were allowed in this study. The appearance of a compound locus resulted from the interruption of the locus by more than two mismatches, often resulting in the same repeated subsequence being recognized in more than one frame at different positions within the locus.
With blastx unfiltered for low-complexity sequence (-F F), the first sequence in Fig. 6 (CF446675.1 from Allium cepa) hits a number of cell-wall or membrane proteins from vascular plants and algae. In containing domains of proline, serine, and lysine, it is similar to CAA84230, an extensin from maize pollen, although the closest matches (1e-61) are to algal pherophorin proteins. The second sequence, BE430767.1 from Triticum aestivum, is a high-molecularweight glutenin; the closest blastx hits are around 1e-80.
In a blastx search of all 77 sequences from all 88 genera with at least six copies of a motif of at least 18 nucleotides, 63 sequences hit proteins in GenBank's nr database at evalues less than 1e-08. However, 30 of the hits were to the same entry, XP_344805.1, a putative protein from rat. These 30 sequences, which contained a 23-mer that was Two examples of imperfect minisatellite loci Figure 6 Two examples of imperfect minisatellite loci.

Discussion
The 77% reduction of total sequence length from redundant to nonredundant datasets in this study compares to an 85% reduction in number of ESTs from redundant to nonredundant datasets in [3]. Since the current study attempted to conserve deviant repeat loci as much as possible, it is likely that limited redundancy remains in the repeat loci reported here. This residual redundancy could distort the repeat frequencies somewhat from their values in the coding genomic sequence. There is a conflict between the goals of mining every microsatellite from an EST collection, and accurately extracting the motif frequencies from the coding sequence.
The slowest step of the pipeline is phrap, followed by the initial repeat finder. The run time of the phrap step is strongly influenced by the amount and pattern of redundancy within the dataset. The run time of the repeat finder increases quadratically with the number of motif lengths sought and linearly with the length of examined sequence; this time is small for motifs of less than 10 nucleotides. Although the repeat finder's algorithm is not particularly efficient, it is effective, and larger datasets can be divided and run in parallel.
The decision to allow the independent reporting of short and long repeats from the same subsequence, so that embedded classical microsatellites would not be overlooked in longer repetitive sequences, inflated the reported count of motifs that are part of very long motifs, probably by several percent.
The necessity to repeat phrap on large datasets probably merges distinct but similar sequences excessively, leading to a loss of microsatellites but also a canceling loss of "nonredundant" sequence to host them. The effect is not consistent, but it might be responsible for the low reported microsatellite frequencies in Zea.
Low-quality sequence was not completely excluded from the analysis. Particularly bothersome are relatively short tracts of low-quality sequence within otherwise accurate sequence, since these can appear as tracts of a single nucleotide and inflate the count of microsatellite motifs like AAAC, AAAG, and AAAT. Any rules-based removal of such subsequences would likely also remove legitimate instances of motifs that contain runs of a single nucleotide.
The observed satellite frequencies exceed those expected by chance in random sequence at every motif length and mismatch level, but the difference was least (and probably not significant) for eight-through 11-base motifs with one or two allowed mismatches. A more realistic model of random codon order, taking into account the observed codon preference of the sequences, probably would yield higher expected repeat frequencies. Such a model could provide testable predictions of relative motif frequencies to be expected in random sequence, that could be compared to the ranking of observed frequencies.
The low microsatellite frequency in Allium and conifers, which have very large diploid genomes with 2c > 20 pg, runs counter to the intuitive expectation that large genomes should have more and longer microsatellites [30]. However, studies in a grasshopper [30], fungi [31], and vertebrates [32] have demonstrated that the frequency and length of genomic microsatellite loci are not correlated to 2c DNA content. The loss of significantly negative correlation upon excluding gymnosperms accords with a relationship to phylogeny more than to genome size in itself, and thus plants probably resemble animals in their relationship of microsatellite frequency to genome size. An open question is how much the transcriptome is diluted by the individually infrequent transcription of a multitude of microsatellite-poor repetitive elements in large genomes. The tblastx results for six genera here indicate no correlation of transposable-element transcription to genome size. Instead, the transposable elements of Oryza might only be better known and more represented in RepBase than the rest, even though full genomic sequence is available for both Oryza and Arabidopsis.
Polymorphism is as frequent for pentanucleotide and hexanucleotide motifs as it is for the classical di-and trinucleotide microsatellites, although the repeat counts are lower. Such loci ought to be considered as potential genetic markers.
The pipeline offers primers to amplify most of the detected repeats, but the success of amplification depends upon the nonredundancy of the primer sequences in the genome, the distribution of primers among exons, and the success of phrap in reconstructing consensus EST sequences. The EST sequences provide little information about primer redundancy in the genome, since they do not include nontranscribed sequence, which comprises most of the genome, and since many genes belong to multigene families of varying size and complexity. The mature sequences also do not provide the locations of splice sites, and amplification will fail if a suggested primer straddles a splice site or if suggested primers are separated by too large an intron. Finally, there is always a chance that phrap has misassembled similar sequences, such that a primer pair suggested from a contig has no identical spacing or even identity in the genome.
Contamination of the EST databases with ribosomal RNA seems to be minimal, or the length of individual rRNA genes exceeds the 250-base cutoff used in this study. The amount of contamination with untranscribed genomic sequence remains unknown in these collections.
Microsatellite sequences often evolve faster than the taxa that bear them, although closest relatives usually retain similar microsatellite frequencies. This holds not only for frequency but also for polymorphism and especially mean repeat count, and it is consistent with the limited transferability of microsatellites among less related species. Repeated subsequences are probably subject to natural selection, like almost anything else that is transcribed.
Several other studies have reported motif frequencies from individual genera that were analyzed here. Supplementary Most or all of the rest can be attributed to repeated subsequences within genes, such as those that encode cell-wall or seed-storage proteins. O'Dushlaine et al. [34] found that about six percent of human proteins manifest tandem-repeat polymorphisms. The repeated subsequences do not necessarily preserve reading frame, although there is a substantial excess of those that do. Repeats that disrupt reading frame have been invoked in the evolution of nine types of unstructured proteins in animals [35], including the C-terminal domain of eukaryotic RNA polymerase II. O'Dushlaine et al. [34] suggest that up to one percent of human protein unigenes contain frameshifting repeats, and that such proteins tend to be involved in protein-protein interactions. The present study could not substantiate any examples of exon repetition arising upon RNA splicing, even though specific instances of exon repetition and exon scrambling have been demonstrated to result from intermolecular RNA splicing in human beings [36] and five other animal species [37]. Also, the mean GC content is higher for motifs that retain reading frame, and this accords with a higher GC content within coding sequence than elsewhere.
Inukai [38] investigated minisatellite loci in 5.3 megabases of genomic sequence in Oryza sativa. The loci had at least three copies of a motif at least 10 nucleotides long. Based on blast searches within the sampled genomic sequence, 183 loci were considered multilocus, i.e., occurring within similar sequence at dispersed positions in the genome, and 37 were single-locus. The loci were ascribed to transposable elements if their flanking sequence met criteria for long-terminal repeats, inverted repeats, or target-site duplication, or if they matched database entries for rice transposable elements. Of the 41 single-locus minisatellites, eight (19.5%) were embedded in transposable elements, comparable to the 15 to 20% of minisatellites estimated to belong to repetitive elements in the current study. While all 183 of Inukai's multilocus minisatellites belonged to transposable elements, it is unlikely that these elements were transcribed at a significant frequency, so one would not expect them to dominate the minisatellites within the transcriptome. The En/Spm family of class II transposable elements was one of the most common sources of minisatellites in Inukai's study, and it was the leading repetitive-element source of minisatellites in the current study. Also, minisatellite GC content in his study behaved the same as in the current study, exceeding the GC content of the flanking sequence when that GC content was high, and falling short when that GC content was low.
Tissue specificity is a generally underestimated source of microsatellite frequency variation in EST collections. Bilgen et al. [39] demonstrated this for ESTs drawn from 11 different tissues or developmental stages in Arabidopsis thaliana, where the relative frequency of the leading dinucleotide motif, AG, varied from 31.6% in seedlings to 55.5% in developing seeds. The current study did not try to select subsets of EST collections on the basis of annotation, so it cannot comment on the degree to which the phylogenetic analysis of motif frequencies was perturbed by tissue representation in the collections. It is potentially quite significant. Also, transcriptional slippage is a potential source of spurious microsatellite alleles in cDNA libraries [34], but the measured frequency of such slippage is only about 0.02% of all mRNA molecules in ubiquitin-B and amyloid precursor protein in human [40]. Therefore, transcription slippage has likely occurred at some very low frequency in the founding mRNA of the EST datasets examined here.
Finally, microsatellites stand in the eye of the beholder, i.e., the threshold criteria for patterned sequence are rather arbitrary and there is a continuum from perfect microsatellites to random sequence. Although a mini-mum locus length of 15 nucleotides is appropriate for short microsatellite motifs, a minimum of 20 or more is less likely to pick up randomly occurring doublings of longer motifs. Between 10 and 20% of patternedsequence loci are polymorphic within EST collections for motif lengths from two to six nucleotides, suggesting that the longer motifs should also be sought out and used wherever di-and tri-nucleotide motifs are used now. Frequently the polymorphism results from indels in the flanking sequence, in addition to or instead of variation in repeat count in the microsatellite itself. Patterned sequence appears to be subject to natural selection and to evolve more rapidly than the transcriptome at large.

Data source
DNA sequences were extracted from gbest flat files in Gen-Bank release 150, dated 15 October 2005 [41]. For each accession, the genus was recognized as the third word (tract of non-whitespace text) in the line that contained the string "ORGANISM". Sequences were counted by genus, and sequences were written to separate flat files for all 88 genera that were represented by at least 3000 sequences.

Removal of vector and low-quality sequence
Each genus was processed independently. Its sequence file was first searched for vector sequences by blastn [42] against UniVec [43]. A Perl 5.8 script then removed the vector-matching subsequences, except for the telomeric repeats of YAC vectors. A second Perl script trimmed off tracts of at least 10 consecutive A's or T's within the first 60 nucleotides from either end of the vector-trimmed sequences, plus any sequence distal to these tracts.

Recognition and processing of repeated motifs
A generic repeat finder, written in C, searched for repeated motifs from 1 to 250 nucleotides long in the trimmed sequences, without regard for redundancy in the motif. A Perl script then recognized motifs that were multiples of shorter motifs, and purged the list of repeats to retain only the shorter motifs. This script allowed the recognition of overlapping motifs provided that one was sufficiently longer than the other, and was not a multiple of it. In this way, classical microsatellites were sometimes recognized within longer minisatellites. Another Perl script then identified all instances where different blocks of repeats occurred within five nucleotides of one another, thereby comprising a compound locus. From this script emerged the list of all simple and compound sites of patterned sequence within the set of ESTs. To eliminate duplicate and nearly duplicate sequences, phrap [21] was applied to only the sequences that contained patterned sequence. Phrap produced a list of singlets and a list of contigs. The contigs were parsed for the longest contributing sequence that contained patterned sequence at a particular position in the contig, where position was a window of nine nucleotides centered at the midpoint of a patterned sequence. Therefore, primers were always chosen from a real sequence, even if the contig was misassembled. The singlets and chosen contig-contributors were output as a list of motifs, positions, and host-sequence identities. This list, and the sequences that it specified, were reformatted for input to primer3 [22], which returned wherever possible six pairs of primers per locus. The primers were chosen to position the microsatellite locus off center in the expected amplification product, closest to a specifiable ratio of short-side length to long-side length. This improved the chances of successful verification on agarose gels when a motif-specific third primer is included. Finally, the chosen primers were compared to all the sequences in the nonredundant set resulting from phrap, counting matches to indicate the likely frequency of the amplified sequence in the genome.
The phrap program could not successfully process large datasets, even under its manyreads option, because it exhausted system memory. In these cases, the vector-polyA-trimmed input file was split up so that no piece had more than about 150000 sequences, and the entire pipeline ran independently on the pieces. A Perl script then identified the contributing sequences for each piece, and the pipeline was rerun on only these contributing sequences. For the largest datasets, e.g., Triticum and Zea, it was necessary to undertake this strategy twice, so as to maintain datasets of manageable size throughout the processing.

Statistical analysis
Other Perl scripts extracted canonical motifs from the pipeline's output and tabulated repeat counts, GC content, frequency of polymorphism, and so on. The canonical form of a motif was the lexicographically least form obtained upon all possible rotations of the motif and its reverse complement. Thus TCA reduced to ATC, and TC reduced to AG. The results were grouped by canonical motif, even though the different contributing motifs might have independent biological functions that depend upon reading frame. Statistical significance was tested by empirical distributions of 20000 to 50000 randomly drawn samples of the same size, wherever a parametric distribution was unavailable, as in the case of mean Euclidean distances among genera or mean motif rankings.
Microsatellite frequency per megabase was estimated by dividing counts of post-phrap loci by numbers of nucleotides in non-redundant trimmed sequences, which were obtained by using phrap independently on all the trimmed sequences. As above, the larger datasets required division and reprocessing of subsets through phrap.

Repeat frequency in random sequence
The observed microsatellite frequency was compared to that expected by chance in random sequences of the same length as the observed sequences. If the bases occur at equal frequency (0.25), the expected frequency of tandem repeats of length m with k mismatching nucleotides in a sequence of length n is: F = (n -2m +1) * 4 -m * 3 k * m!/(k! * (m -k)!) (1) The sum of F for k = 0, 1, and 2 gives the expected frequency of tandem repeats given two allowed mismatches, and the sum of such sums over all sequence lengths in the EST collection gives the expected frequency of repeats of motif length m. Similarly, the expected frequency of triple tandem repeats is F = (n -3m + 1) * 4 -2m * 3 k * (2m)!/(k! * (2m -k)!), whose sum for k = 0, 1, and 2 is approximately the expected frequency of triple tandem repeats with two allowed mismatches. The sum is a bit in excess because two mismatches in the second repeat disqualify the repeat block as a triple tandem repeat. The comparison for double tandem repeats was done for motif lengths where the minimum locus contained two repeats, i.e., where the motif was at least eight nucleotides long. Below that, the triple tandem repeats formula was used for motifs from five through seven nucleotides long. The expected ratio of double to triple repeats is approximately 4 m .

Detection of repetitive elements
To test the possibility that repeated motifs were part of repetitive sequences, particularly retrotransposons that were being transcribed at individually low but collectively substantial frequency, motifs of at least 30 bases were subjected to tblastx [42] against RepBase 3.11 [23]. Also subjected to tblastx against RepBase was a parallel set of synthetic sequences, constructed at random to the same lengths as the real motifs in accordance with codon usage in each genus. This synthetic set relied on random deviates uniformly distributed in [0,1), produced by the Mersenne Twister generator [44] as function runif() in R [24]. This set allowed an appraisal of the biological significance of hits of the real repeated motifs to various repetitive elements. Also, the real sequences that contained 30-base or longer repeated motifs were subjected to tblastx against RepBase, to identify long but diffuse hits that might have arisen from the rapid evolution of repetitive elements.
Abbreviations compfreq: frequency of compound (multi-motif) microsatellites containing this motif per megabase of EST sequence, after trimming of vector and poly-A tails