Genome-wide analysis of tandem repeats in Daphnia pulex - a comparative approach
© Mayer et al. 2010
Received: 29 September 2009
Accepted: 30 April 2010
Published: 30 April 2010
DNA tandem repeats (TRs) are not just popular molecular markers, but are also important genomic elements from an evolutionary and functional perspective. For various genomes, the densities of short TR types were shown to differ strongly among different taxa and genomic regions. In this study we analysed the TR characteristics in the genomes of Daphnia pulex and 11 other eukaryotic species. Characteristics of TRs in different genomic regions and among different strands are compared in detail for D. pulex and the two model insects Apis mellifera and Drosophila melanogaster.
Profound differences in TR characteristics were found among all 12 genomes compared in this study. In D. pulex, the genomic density of TRs was low compared to the arthropod species D. melanogaster and A. mellifera. For these three species, very few common features in repeat type usage, density distribution, and length characteristics were observed in the genomes and in different genomic regions. In introns and coding regions an unexpectedly high strandedness was observed for several repeat motifs. In D. pulex, the density of TRs was highest in introns, a rare feature in animals. In coding regions, the density of TRs with unit sizes 7-50 bp was more than three times as high as that of 1-6 bp repeats.
TRs in the genome of D. pulex show several notable features, which distinguish it from the other genomes. Altogether, the highly non-random distribution of TRs among genomes, genomic regions and even among different DNA strands raises many questions concerning their functional and evolutionary importance. The high density of TRs with a unit size longer than 6 bp found in non-coding and coding regions underscores the importance of including longer TR units in comparative analyses.
The planktonic microcrustacean Daphnia pulex is a key species in lake ecosystems and forms an important link between the primary producers and the carnivores. It is among the best-studied animals in ecological, toxicological, and evolutionary research [1–4]. With the availability of the v1.1 draft genome sequence assembly for D. pulex it is now possible to analyse the genome in a comparative context.
Tandem repeats (TRs) are characteristic features of eukaryotic and prokaryotic genomes [5–13]. Traditionally, they are categorized according to their unit size into microsatellites (short tandem repeats, STRs, 1-6 bp (1-10 in some publications) repeat unit size), minisatellites (10 to approximately 100 bp repeat unit size), and longer satellite DNA (repeat units of >100 bp). Typically, STRs contribute between 0.5 - 3% to the total genome size.
TR loci in general, and micro- and minisatellite loci in particular, are often highly dynamic genomic regions with a high rate of length-altering mutations [14, 15]. Therefore, they are frequently used as informative molecular markers in population genetic, forensic, and molecular ecological studies [6, 16–22]. Due to their high abundance in genomes, microsatellites (STRs) are useful markers for genome mapping studies [23–26].
In contrast to the early view that TRs are mostly non-functional "junk DNA", the picture has emerged in recent years that a high proportion of TRs could have either functional or evolutionary significance [27–34]: TRs frequently occur within or in the proximity of genes, i.e., either in the untranslated regions (UTRs) up- and downstream of open reading frames, within introns, or in coding regions (CDS). Recent evidence suggests that TRs in introns, UTRs, and CDS regions can play a significant role in regulating gene expression and modulating gene function [32, 35, 36]. Highly variable TR loci were shown to be important for rapid phenotypic differentiation [37, 38]. They can act as "evolutionary tuning knobs" which allow fast genetic adaptations on ecological timescales [ for review, see also ]. Furthermore, TRs can be of profound structural as well as evolutionary importance, since genomic regions with a high density of TRs, e.g., telomeric, centromeric, and heterochromatic regions, often have specific properties such as alternative DNA structure and packaging. The structure of DNA can, in turn, influence the level of gene expression in these genomic regions [28, 33, 34, 37, 40]. Altogether, the analysis of the TR content of genomes is important for an understanding of genome evolution and organisation as well as gene expression and function.
With the rapid accumulation of whole genome sequence data in the last decade, several studies revealed that STR densities, usage of repeat types, length characteristics, and typical imperfection vary fundamentally between taxonomic groups [9, 11, 41–44] and even among closely related species [45–48]. In addition, strong differences of STR characteristics among different genomic regions have been described [9, 12, 43, 44, 49]. The often taxon-specific accumulated occurrence of certain repeat types in different genomic regions can hint at a functional importance of these elements. These characteristics are interesting from a comparative genomics as well as an evolutionary genomics point of view [9, 11, 12, 43, 44, 50, 51].
Several studies have been conducted in the past to compare the characteristics of microsatellites (1-6 bp or 1-10 bp) among different taxa and different genomic regions, e.g. [9, 44]. In these studies, however, the characteristics of TRs with a unit size >6 bp or >10 bp have been neglected. It has sometimes been argued that repeats with a unit size above 10 bp are generally rare in genomes, a presumption that has never been systematically tested. Furthermore, most studies are restricted to perfect TRs, with the main advantage that this significantly simplifies their identification. A disadvantage of this approach is that imperfections are a taxon-dependent natural feature of TRs and should therefore be included in an analysis rather than neglected. Even more importantly, TRs with long units tend to be more imperfect [10, 52], so that a meaningful survey that includes repeats with a unit size above 10 bp has to include imperfect repeats.
Studies on characteristics of microsatellites can also be categorized according to whether they use the TR coverage of a sequence (in this paper referred to as the density, see Methods) or a number count of TRs per sequence length as the main characteristic of TRs. We recommend the use of a TR density (as in ) instead of number counts, since the latter do not represent the true TR content of a sequence. For example, the number count of a single perfect, 10000 bp long repeat, which might cover 20% of a sequence, is the same as that of a 20 bp repeat that only covers 0.04% of the same sequence. Depending on the number of mismatches, indels or sequencing errors, as well as the allowed degree of imperfection, the same 10000 bp repeat can be counted as a single satellite or as several separate satellites. Hence, TR densities have the clear advantage that they show a much smaller dependence on the allowed degree of imperfection of a satellite than number counts.
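The contrast between number counts and densities can be illustrated with a small sketch. The function name, the 50000 bp sequence length (chosen so that a 10000 bp repeat covers 20%), and the way the long repeat is split into four fragments are illustrative assumptions and are not taken from the study.

```python
def tr_count_and_density(repeat_lengths, sequence_length):
    """Return (count per Mbp, density in bp per Mbp) for a list of repeat lengths."""
    count_per_mbp = len(repeat_lengths) / sequence_length * 1e6
    density_bp_per_mbp = sum(repeat_lengths) / sequence_length * 1e6
    return count_per_mbp, density_bp_per_mbp


SEQ_LEN = 50_000  # hypothetical sequence length: 10000 bp then covers 20%

# One long repeat and one short repeat give the same count ...
one_long = tr_count_and_density([10_000], SEQ_LEN)
one_short = tr_count_and_density([20], SEQ_LEN)

# ... while a stricter imperfection limit might report the long repeat as
# four fragments (hypothetical split, three interrupting bases excluded).
fragmented = tr_count_and_density([2_500, 2_499, 2_499, 2_499], SEQ_LEN)

print(one_long, one_short, fragmented)
```

In this toy case the count quadruples when the repeat is fragmented, whereas the density changes by well under 0.1%.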
Table 1. List of species genomes analysed in the present study together with basic information on the genome assembly. Recoverable column headers: genome size [Mb]; genome assembly version. Recoverable entry: Dappu v1.1 (dpulex_jgi060905).
The twelve sequenced genomes analysed in the present study are listed in Table 1. This list also contains the size, the CG-content, the assembly versions, and the download reference of the studied genomes. The size refers to the number of base pairs in the haploid genome. It reflects the current state of the genome build and includes known nucleotides as well as unknown nucleotides (Ns). CG-content and genome size were determined with a self-written program. For D. melanogaster, the analysis of TRs in the complete genome includes the Het (heterochromatic), U and Uextra sequence files. Similarly, for A. mellifera, we included scaffolds in the file GroupUn_20060310.fa.
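The self-written program is not described further. As a minimal sketch of how such figures might be obtained, the following hypothetical function sums the sequence lengths and computes the CG-content. It assumes that the CG-content is taken over known nucleotides (A, C, G, T) only, which the text does not state.

```python
def genome_stats(sequences):
    """Summarise an iterable of nucleotide strings (e.g. scaffolds).

    Assumption: only A, C, G and T count as known nucleotides; everything
    else (N and other ambiguity codes) is treated as unknown.
    """
    total = known = gc = 0
    for seq in sequences:
        s = seq.upper()
        total += len(s)
        known += sum(s.count(base) for base in "ACGT")
        gc += s.count("G") + s.count("C")
    return {
        "size": total,              # known plus unknown nucleotides
        "known": known,
        "unknown": total - known,
        "gc_content": gc / known if known else 0.0,
    }


stats = genome_stats(["ACGTNN", "ggcc"])
print(stats)
```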
For the D. pulex genome we obtained the most recent 'frozen gene catalogue' of the v1.1 draft genome sequence assembly from January 29th 2008 in the generic GFF (General Feature Format) from Andrea Aerts (DOE Joint Genome Institute), which in similar form is available from http://genome.jgi-psf.org/Dappu1/Dappu1.home.html. This catalogue contains the predicted and to some extent still putative gene locations. For each gene model, it provides the predicted locations of exons, and for most genes also the locations of coding regions, start and stop codons. Since the catalogue often contains multiple or alternative gene models at the same locus as well as duplicate or overlapping features of the same type within the same gene model, a C++ program was written by CM to remove multiple gene models in order to avoid an overrepresentation of these loci in the analysis. To be more precise, if two predicted gene models overlapped and if both genes were found in the same reading direction, the longer of the two gene models was removed. Similarly, if two exons or two coding (CDS) features of the same gene overlapped, the longer of the two features was removed. Introns and intergenic regions were identified by the locations of exons that are associated to the same gene model. If available, the start and stop codon positions within exons of a gene were used to infer the locations of 5' and 3'UTR. This information on the positions of different genomic regions was finally used to split the genome sequences into six sequence files, each containing the sequence fragments associated to exons, introns, 5'UTRs, 3'UTRs, CDS, or intergenic regions. Since the TR characteristics of exons are just a combination of the TR characteristics of CDS and UTR regions, they have not been included in the present analysis.
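The filtering and intron inference described above can be sketched as follows. This is not the C++ program used in the study: the `Feature` type, the function names, and the shortest-first strategy for resolving chains of overlaps are assumptions made for illustration.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps(a, b):
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def remove_longer_overlapping(features):
    """Drop the longer feature of every overlapping same-strand pair.

    Assumption: features are visited shortest first, so a feature is dropped
    as soon as it overlaps a shorter feature that has already been kept.
    """
    kept = []
    by_length = sorted(features, key=lambda f: (f.end - f.start + 1, f.seqid, f.start))
    for f in by_length:
        if not any(overlaps(f, k) for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: (f.seqid, f.start))


def introns_from_exons(exons):
    """Infer introns as the gaps between consecutive exons of one gene model."""
    ordered = sorted(exons, key=lambda e: e.start)
    return [Feature(a.seqid, a.end + 1, b.start - 1, a.strand)
            for a, b in zip(ordered, ordered[1:])
            if b.start - a.end > 1]


genes = [Feature("s1", 100, 500, "+"), Feature("s1", 400, 450, "+"),
         Feature("s1", 420, 900, "-")]
filtered = remove_longer_overlapping(genes)

exons = [Feature("s1", 1, 10, "+"), Feature("s1", 21, 30, "+"),
         Feature("s1", 31, 40, "+")]
introns = introns_from_exons(exons)
print(filtered, introns)
```

In the example, the longer plus-strand gene model is removed, while the overlapping minus-strand model is kept because it lies in the opposite reading direction.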
For A. mellifera we used the same procedure as for D. pulex. A GFF file with annotation information was obtained from http://genomes.arc.georgetown.edu/Amel_abinitio_on_assembly2.gff. Unfortunately, the annotated features have so far not been officially mapped on assembly version 4.0, so the TR analysis of genomic regions had to be performed with assembly version 2.0.
For the D. melanogaster genome, separate sequence files for the six different features of interest can readily be downloaded from ftp://ftp.flybase.net/genomes. Since these files also contain multiply or alternatively annotated features, a C++ program written by CM was again used to consistently remove the longer of two overlapping features if both were of the same feature type and annotated in the same reading direction. The separate sequence files for different genomic regions do not include the sequence fragments found in the Het (heterochromatic), U and Uextra sequence files of the current assembly, since these regions have not yet been annotated.
For the 5'UTRs, 3'UTRs, introns, and CDS regions of the three genomes we always extracted and analysed the sense strand of the corresponding gene. This provides the opportunity to identify differences in the repeat characteristics of the sense and anti-sense strands, i.e. to search for a so-called strandedness.
For a given TR unit, the associated repeat type is defined as follows: All TRs with units that differ from the given repeat unit only by circular permutations and/or the reverse complement are associated with the same repeat type. Clearly, there are always several repeat units that belong to the same repeat type. We follow the convention of representing a repeat type by the unit that comes first in an alphabetical ordering of all units associated with it. This convention allows us to count and identify repeat units without reference to the repeat unit phase or strand. To give an example, the repeat type represented by the unit AAG incorporates all TRs with units AAG, AGA, GAA, TTC, TCT, and CTT. Furthermore, the term repeat motif is used instead of the term repeat type when we aim at distinguishing between sense and anti-sense strand repeat characteristics, but not the repeat phase. Hence, on the level of repeat motifs, AAG, AGA, GAA are all represented by AAG, but are distinguished from the repeat motif CTT, which also represents TTC and TCT. Finally, the terms repeat type and repeat motif are distinguished from the term repeat class, which we use to denote the collection of all repeats with the same repeat unit size (e.g. mono-, di-, trinucleotide repeats).
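The two normalisations defined above can be sketched in a few lines. The function names are illustrative and not taken from the software used in the study. The sketch assumes units consisting only of A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(unit):
    """All circular permutations of a repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def reverse_complement(unit):
    return unit.translate(_COMPLEMENT)[::-1]


def repeat_motif(unit):
    """Strand-specific representative: alphabetically first circular permutation."""
    return min(rotations(unit.upper()))


def repeat_type(unit):
    """Strand-independent representative: alphabetically first unit among all
    circular permutations of the unit and of its reverse complement."""
    unit = unit.upper()
    return min(rotations(unit) + rotations(reverse_complement(unit)))


for u in ("AAG", "AGA", "GAA", "TTC", "TCT", "CTT"):
    print(u, repeat_motif(u), repeat_type(u))
```

All six units map to the repeat type AAG, while the motif function separates the sense-strand group (AAG) from the anti-sense group (CTT).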
An important property of one or a set of TR types is their density within a nucleotide sequence. It is defined as the number of base pairs found within repeats of a given set of repeat types divided by the total number of base pairs in the sequence. Repeat type densities are measured in base pairs per megabase pair (bp/Mbp). The density can be envisaged as the coverage of the sequence with the specified repeat types. Since in several genomes, including D. pulex, the number of unknown nucleotides (Ns) contributes significantly to the total size, all TR densities computed in this work were corrected for the number of Ns. It is important to distinguish repeat densities from densities based on number counts of repeats (measured in counts/Mbp) that are sometimes used in publications, e.g. [44, 47, 51].
The characteristics of perfect and imperfect TRs strongly depend on the properties individual satellites have to fulfil to be included in the analysis. For perfect TRs this is the minimum repeat length or its associated alignment score, which in TR search programs is often defined as a function of the unit size. Changing the minimum unit size has an effect not only on the total density of different TR types, but also on relative densities, since the length distributions of different repeat types usually differ strongly. For imperfect TRs it is additionally necessary to restrict or penalize their imperfection, e.g. with a mismatch and gap penalty. Furthermore, an optimality criterion has to be specified that determines which of two alternative alignments of a putative TR locus with its perfect counterparts is to be preferred.
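A minimal sketch of the density computation follows. The text states only that densities were corrected for the number of Ns. The sketch assumes that this means excluding Ns from the sequence length in the denominator, and the function name and example numbers are hypothetical.

```python
def tr_density(repeat_bp, sequence_length, n_count=0):
    """TR density in bp/Mbp.

    Assumption: the correction for unknown nucleotides removes the Ns from
    the sequence length before the repeat coverage is computed.
    """
    effective_length = sequence_length - n_count
    if effective_length <= 0:
        raise ValueError("sequence contains no known nucleotides")
    return repeat_bp / effective_length * 1_000_000


# Hypothetical scaffold: 1 Mbp in total, a quarter of it Ns, 1500 bp in repeats.
uncorrected = tr_density(1_500, 1_000_000)
corrected = tr_density(1_500, 1_000_000, n_count=250_000)
print(uncorrected, corrected)
```

Under this assumption, a large share of Ns would otherwise deflate the reported density, here from about 2000 to 1500 bp/Mbp.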
In the present work, TRs were detected using Phobos, versions 3.2.6 and 3.3.0. Phobos is a highly accurate TR search tool that is able to identify perfect and imperfect TRs in a unit size range from 1 bp to >5000 bp without using a pre-specified motif library. The optimality criterion Phobos uses is the alignment score of the repeat region with a perfect repeat counterpart. This means that each putative TR is extended in both directions as far as possible, by including gaps and mismatches, if this leads to a higher alignment score (see the Phobos manual for details). For the present analyses, the alignment scores for match, mismatch, gap and N positions were 1, -5, -5, and 0, respectively. In every TR the first repeat unit was not scored. A maximum of four successive Ns was allowed. For a TR to be considered in the analysis it was required to have a minimum repeat alignment score of 12 if its unit size was less than or equal to 12 bp, or a score of at least the unit size for unit sizes above 12 bp. As a consequence, mono-, di-, and trinucleotide repeats were required to have a minimum length of 13, 14, and 15 bp, respectively, to achieve the minimum score. For repeat units above 12 bp a perfect repeat had to be at least two units long, an imperfect repeat even longer, to achieve the minimum score. For this study, imperfect TRs were analysed in two size ranges: 1-50 bp and 1-4000 bp. For both size ranges a recursion depth of five was used. For the size range 1-50 bp the maximum score reduction was unlimited; for the size range 1-4000 bp the maximum score reduction was set to 30 to accelerate the computation while preserving good accuracy. For details concerning the search strategy of Phobos and its scoring scheme the reader is referred to the Phobos manual.
Phobos has been used for this analysis since it is more accurate in the unit size range 1-50 bp than other TR search tools. Besides searching for imperfect repeats, Phobos is also able to identify whether alternative alignments exist for a TR. For example, the (ACACAT)N repeat can be viewed as an imperfect dinucleotide or a perfect hexanucleotide repeat. In this discipline, the Tandem Repeats Finder (TRF) is the only alternative. While it is the state of the art in the detection of imperfect repeats with long unit sizes, it is based on a probabilistic search algorithm. In particular, it is less accurate when detecting TRs with a short unit size and a small number of copies. In contrast, Phobos uses an exact (non-probabilistic) search algorithm, which is necessary for a meaningful statistical analysis of TR characteristics. The search parameters used in this analysis are compared to the default search parameters of the TRF program in Additional file 1. TR characteristics such as the density and mean length of repeat types were computed using the program Sat-Stat, version 1.3.1, developed by CM.
In principle, results can be compared to available TR databases [56–60]. However, due to the differences in search parameters and problems related to probabilistic searches, such a comparison makes sense in few cases only and has therefore not been performed in this study.
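The minimum lengths quoted above follow from the scoring scheme, as the following sketch shows. It only reproduces the arithmetic of the stated thresholds and is not part of Phobos. The function names are illustrative, and the treatment of mismatches assumes that each mismatch replaces exactly one otherwise matching, scored position.

```python
MATCH, MISMATCH = 1, -5  # scores stated in the text


def min_score(unit_size):
    """Minimum alignment score: 12 for units up to 12 bp, else the unit size."""
    return 12 if unit_size <= 12 else unit_size


def min_perfect_length(unit_size):
    """Shortest perfect repeat reaching the minimum score.

    The first unit is not scored, so a perfect repeat of length L with
    unit size u scores (L - u) * MATCH.
    """
    return unit_size + min_score(unit_size) // MATCH


def min_length_with_mismatches(unit_size, mismatches):
    """Shortest ungapped repeat with the given number of mismatches.

    Assumption: every mismatch costs MISMATCH and also loses one MATCH.
    """
    return min_perfect_length(unit_size) + mismatches * (MATCH - MISMATCH)


print([min_perfect_length(u) for u in (1, 2, 3)])   # mono-, di-, trinucleotide
print(min_perfect_length(13), min_length_with_mismatches(1, 1))
```

For unit sizes above 12 bp the minimum perfect length equals two units, in line with the description above.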
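The (ACACAT)N example can be made concrete with a deliberately simplified score. The function below is not the Phobos algorithm: it is a hypothetical ungapped, fixed-phase comparison against a perfect repeat. It uses the match and mismatch scores stated for this study and leaves the first unit unscored.

```python
def ungapped_score(sequence, unit, match=1, mismatch=-5):
    """Score a sequence against a perfect repeat of `unit` in the same phase.

    Simplifications (assumptions): no gaps, no local trimming of the
    alignment, and the first repeat unit is not scored.
    """
    u = len(unit)
    score = 0
    for i in range(u, len(sequence)):
        score += match if sequence[i] == unit[i % u] else mismatch
    return score


locus = "ACACAT" * 4                          # 24 bp toy locus
as_hexa = ungapped_score(locus, "ACACAT")     # perfect hexanucleotide repeat
as_di = ungapped_score(locus, "AC")           # imperfect dinucleotide repeat
print(as_hexa, as_di)
```

With these scores the hexanucleotide interpretation (18) is clearly preferred over the dinucleotide one (-2), because every sixth position is a mismatch in the dinucleotide alignment.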
Main characteristics of STRs in the genome of Daphnia pulex and 11 other taxa.
A comparison of genome sizes and mean lengths of imperfect STRs of all 12 genomes is shown in Figure 1b. Even though the mean repeat length depends crucially on the search parameters for TRs, general trends can be seen in this comparison: STRs are shortest in D. pulex (average length 19.48 bp) and longest in M. musculus (average length 38.3 bp), see Figure 1b and Table 2. No significant correlation between genome size and mean length of STRs was found (Pearson correlation coefficient: R = 0.489, P = 0.107).
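The text does not state which significance test was applied to the correlation coefficient. As a plausibility check, the sketch below assumes the usual t-test for a Pearson coefficient with n - 2 degrees of freedom (n = 12 genomes) and evaluates the two-sided P value by numerical integration of the Student t density. The function names are illustrative.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def pearson_two_sided_p(r, n, steps=2000):
    """Return (t statistic, two-sided P) for a Pearson coefficient r (|r| < 1).

    The central probability mass on [0, t] is integrated with the composite
    Simpson rule; `steps` must be even.
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return t, 1 - 2 * area


t_stat, p_value = pearson_two_sided_p(0.489, 12)
print(round(t_stat, 2), round(p_value, 3))
```

Under this assumption the computed P value agrees with the reported P = 0.107 to within rounding.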
Whereas for the three vertebrate species a high TR density is correlated with a high value of the mean repeat length, no similar correlation can be observed for the three arthropods. While A. mellifera has an STR density of almost twice the value of D. melanogaster, the STRs are on average 20% longer in D. melanogaster than in A. mellifera. In Additional file 2, we present separate analyses of perfect and truly imperfect TRs. Most notable is that C. elegans, despite its low density of truly imperfect repeats, has on average very long imperfect TRs.
Comparing the relative densities of STR classes among the three arthropod species, we find that trinucleotide repeats are strongly overrepresented in D. pulex, contributing 30% to all STRs (Figure 2). The proportions of mono-, tetra-, penta-, and hexanucleotide repeats are almost identical in D. pulex and A. mellifera. With the exception of similar tetranucleotide densities there are no common features among D. pulex and the other two arthropod species.
Tandem repeat types of mono- to trinucleotide repeats for the genome of D. pulex and eleven other taxa.
Repeat characteristics of TR classes with a unit size of 1 to 50 bp for Daphnia pulex, Drosophila melanogaster, and Apis mellifera. Recoverable column headers: repeat class [bp]; for each of the three species, mean TR length [bp] and longest TR [bp].
Similar to the repeat densities, strong differences between the mean lengths of TRs with respect to the unit size are observed for the three arthropod species (Figure 4, Table 4). Since the minimum length of TRs is twice the unit size, it is expected to see a trend toward longer repeats for an increasing unit size. Roughly, this trend can be confirmed for D. pulex and A. mellifera, whereas for D. melanogaster a trend can only be seen when not taking into account some of the repeat classes with extraordinarily long repeats. In D. pulex and A. mellifera, all mean repeat lengths are shorter than 254 bp in the unit size range 1-50 bp. D. pulex shows a notable peak for the mean repeat lengths of 17 bp repeats, a repeat class that is discussed in detail below. Among the smaller peaks in the mean repeat length spectrum of D. pulex there is a trend towards peaks that correspond to repeat classes that are multiples of three base pairs (Figure 4, Additional file 4).
In contrast, D. melanogaster has mean repeat length peaks above 500 bp for several repeat classes. This explains why the genomic density of TRs found in D. melanogaster is twice as high as in D. pulex even though the total number of TRs is lower (Table 4). A maximum mean repeat length of 2057 bp is found for the 46 bp repeat class, which consists of 12 repeats ranging in length from 355 bp to 11248 bp. It should be mentioned at this point that the high densities of longer repeat classes in D. melanogaster are concentrated in the heterochromatic regions of this genome. The sequencing and assembly of these regions were so difficult that they were carried out in a separate Heterochromatin Genome Project [61, 62]. See also the discussion below.
Characteristics of the CDS, introns, and intergenic regions of D. pulex, D. melanogaster, and A. mellifera.
Characteristics of the TRs found in the CDS regions, introns, and intergenic regions of D. pulex, D. melanogaster, and A. mellifera.
Several repeat classes are more dense in CDS regions than in other regions, e.g. the densities of the 24 bp repeat class in D. pulex, the 39 bp repeat class of D. melanogaster, and the 6, 10, 15, 16, 18, 21, 30, 36 bp repeat classes of A. mellifera are significantly higher in CDS regions than in all other regions. In a separate analysis conducted only for D. pulex, we searched for TRs in the size range 1-4000 bp in CDS regions. The results show repeat densities above 100 bp/Mbp also for the 51, 52, 60, 75, 108, and the 276 bp repeat classes. A list of all TRs found in CDS regions of D. pulex is given in Additional file 6.
In introns of D. pulex and D. melanogaster the proportion of STRs is higher than in the other genomic regions, whereas in A. mellifera, with a general trend to shorter repeat units, this cannot be observed. In D. pulex, the repeat classes with a unit size of 1-5 bp and 7-8 bp show by far the highest densities in introns as compared to other genomic regions (Additional file 5). Most dominant are trinucleotide repeats, which are more dense in introns of D. pulex than in introns of D. melanogaster and A. mellifera. A notable feature in introns of D. melanogaster is the relatively high density of the 31 bp repeat class. The intergenic regions of D. pulex and D. melanogaster show high densities for several longer repeat classes which are rare or absent in other regions (Figure 6, Additional file 5). In D. pulex, e.g., the 17 bp repeat class shows a high repeat density only in intergenic regions, whereas in the other two arthropods it is relatively rare in all genomic regions. Repeat classes with a particularly high density in intergenic regions can be found in Additional file 5. Concerning the UTRs in D. pulex, the TR statistics have to be treated with caution for repeat classes longer than 3 bp, since only a small proportion of genes have well-annotated UTRs, so that the total number of TRs found in 5' and 3'UTRs (135 and 653) is low. For example, the inflated density of the 24 bp repeat class in 5'UTRs of D. pulex is based on just a single 272 bp long repeat. As a general result, TRs with short units dominate in UTRs.
Mean lengths of the TR classes in the different genomic regions are more heterogeneous in D. melanogaster than in D. pulex and A. mellifera. This is not just the case for intergenic regions including the heterochromatin, but also in introns (e.g. the 31 bp repeat class) and CDS regions (e.g. 39 bp and 48 bp repeat classes), see Figure 6.
For D. pulex, D. melanogaster, and A. mellifera, repeat motif usage shows only few common features among the genomes and different genomic regions. Common features of all three genomes are a relatively high density of poly-A/T repeats in introns and intergenic regions, low densities of CG repeats in all regions, and higher densities of AAC and AGC repeats in CDS regions than in introns and intergenic regions. Repeat motifs that are more dense in introns than in CDS and intergenic regions of all three genomes are poly-T, AT and GT (Additional file 7). Several repeat motifs show a strong strandedness in the CDS regions of all three genomes. Most notable are the repeat motifs AAC and AAG, which have much higher densities than their reverse complements GTT and CTT. A weaker trend in the same direction is observed for AAT versus ATT repeats. Strandedness also occurs in introns of D. pulex, where poly-T repeats have much higher densities than poly-A repeats. Other motif pairs with considerably different densities on the sense strand in introns are ATT versus AAT, CT versus AG, GT versus AC, and ATTT versus AAAT. In all these examples T-rich motifs are preferred on the sense strand.
Restricting the search for common features to D. pulex and D. melanogaster one finds that CCG/CGG repeats are predominantly found in CDS regions, whereas AT repeats show their highest densities in 3'UTRs (data not available for A. mellifera), see Additional file 7. The absolute densities of the AT repeat type in 3'UTRs, however, differ significantly with values of 220.5 and 2663.6 bp/Mbp in D. pulex and D. melanogaster, respectively. In both genomes, the dominant repeat motif in CDS regions is AGC, with a particularly high density of 1658.9 bp/Mbp in CDS regions of D. melanogaster.
Curiously, for both genomes (D. pulex and D. melanogaster), the repeat motif AGC shows much higher densities on the sense strand of CDS regions than its reverse complement, the repeat motif CTG (340.7 bp/Mbp versus 74.7 bp/Mbp and 1658.9 bp/Mbp versus 26.9 bp/Mbp, see Additional file 7). In introns of D. pulex, a strandedness for this motif is not present, whereas in introns of D. melanogaster it is much less pronounced. In contrast to D. pulex and D. melanogaster, the repeat motif AGC has only a moderate density in all regions of A. mellifera. Conversely, the dominant repeat motif in CDS regions of A. mellifera, ATG, is very rare in the other two genomes. Also this repeat motif shows a considerable strandedness in CDS regions of A. mellifera. Other repeat motifs with a high density in CDS regions of A. mellifera, but with low densities in the other genomes are ACT and AGT. Also notable is the high density of the dinucleotide (and thus reading frame incompatible) repeat motif CT (435.8 bp/Mbp) in CDS regions of A. mellifera and the strong discrepancy to the low density of its reverse complement AG (20.3 bp/Mbp). As mentioned before, short units are dominant in introns of all three genomes. Dominant repeat motifs in introns of D. pulex are poly-T followed by CT and CTT. Among tetranucleotide repeats, the motifs CTTT and ATTT show the highest densities. All these motifs have higher densities than their reverse complements. In introns of D. melanogaster, dominant repeat motifs are poly-A followed by poly-T and AT, with only a small strandedness of poly-A versus poly-T repeats. Densities in introns of A. mellifera are high for several repeat motifs. Most notable are the motifs AT followed by poly-A, poly-T, CT, AG, and AAT. The density of AT repeats in introns of A. mellifera (4069.0 bp/Mbp) constitutes the highest repeat motif density among the three genomes and their genomic regions. 
A notable strandedness is observed for the poly-A versus poly-T and for AAT versus ATT repeat motifs. In CDS regions of A. mellifera a high strandedness is also found for the AAGCAG motif (1480 bp/Mbp) versus CTGCTT (0.00 bp/Mbp). In introns, the two motifs still have the respective densities of 46.3 bp/Mbp versus 0.00 bp/Mbp.
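One simple way to quantify the strandedness reported above is the ratio of the sense-strand density of a motif to that of its reverse complement. The ratio is an illustrative measure chosen for this sketch and is not a statistic defined in the study. The densities are the CDS values (in bp/Mbp) quoted in the text.

```python
import math


def strandedness_ratio(motif_density, revcomp_density):
    """Ratio of a motif's density to that of its reverse-complement motif.

    Hypothetical measure: returns infinity if only the first motif occurs
    and NaN if neither motif occurs.
    """
    if revcomp_density == 0:
        return math.inf if motif_density > 0 else math.nan
    return motif_density / revcomp_density


# (species, motif, reverse-complement motif): densities in CDS regions [bp/Mbp]
reported = {
    ("D. pulex", "AGC", "CTG"): (340.7, 74.7),
    ("D. melanogaster", "AGC", "CTG"): (1658.9, 26.9),
    ("A. mellifera", "CT", "AG"): (435.8, 20.3),
    ("A. mellifera", "AAGCAG", "CTGCTT"): (1480.0, 0.0),
}

ratios = {key: strandedness_ratio(*pair) for key, pair in reported.items()}
for key, ratio in ratios.items():
    print(key, round(ratio, 1))
```

On this measure the AGC versus CTG asymmetry is roughly 4.6-fold in D. pulex and more than 60-fold in D. melanogaster.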
Concerning the mean perfection of TR motifs in different genomic regions (see table in Additional file 7, page 10 for details) we could not find many general trends. In different genomic regions of D. pulex, the mean perfection in the size range 1-50 bp was 98.36% in CDS regions, 99.09% in intergenic regions, and 99.31% in introns (the mean values are not shown in the above-mentioned table). For A. mellifera we found on average lower repeat perfections of 97.35% in CDS regions, 98.57% in intergenic regions, and 98.52% in introns. For D. melanogaster, mean repeat perfections are 97.35% in CDS regions, 98.55% in intergenic regions and 98.68% in introns. So in all three genomes, the mean repeat perfection is lowest in CDS regions. Differences in repeat perfection among introns and intergenic regions are small.
Strong differences among the three genomes are found for several repeat motifs: poly-C and poly-G densities are particularly low in A. mellifera; AT repeat densities are 20 and 30 times higher in intergenic regions and introns of A. mellifera as compared to D. pulex; and AnG (n = 1 to 5) and ACG densities are much higher in D. pulex and A. mellifera than in D. melanogaster. For instance, AAG repeat densities are about 40 times higher in introns and intergenic regions of D. pulex than in the same regions of D. melanogaster. Potentially interesting are TRs in CDS regions where the unit size is not directly compatible with the reading frame. As mentioned above, 10-mer repeats (and multiples of 10) have significant densities in CDS regions of D. pulex. Most notable are the repeat types AACCTTGGCG (Dappu-343799, Dappu-344050, Dappu-343482, Dappu-279322, Dappu-280555), ACGCCAGAGC (Dappu-264024, Dappu-264706, Dappu-275708), and ACGCCAGTGC (Dappu-267284, Dappu-267285, Dappu-275706, Dappu-275708, Dappu-277192). These three repeat types are completely absent in D. melanogaster and A. mellifera. Repeat motif usage in UTRs was only compared if the number of satellites in these regions was sufficiently high. All TR characteristics including the number counts are listed in Additional file 7. As a general result, repeat type usage is very heterogeneous on a genomic level as well as among different genomic regions. Within a given TR class there are usually only a few TR motifs which contribute to the density of the repeat class (Figure 7, Additional file 7).
Mean lengths of mono- to trinucleotide repeat types in different genomic regions of D. pulex show a relatively homogeneous length distribution, in contrast to the heterogeneous densities (Figure 7, Additional file 5). Peaks in average repeat length in the UTRs (see Additional file 5 and 7) must be regarded with caution due to small sample sizes (see above). In D. melanogaster and A. mellifera, TRs are generally longer than in D. pulex.
The repeat class in D. pulex with the highest repeat density and a unit size longer than three base pairs is the 17 bp repeat class (Table 4). There are several notable aspects of these repeats: first of all, the true genomic density of 17 nucleotide repeats is likely to be underestimated in the current assembly since several scaffolds start or end with a 17-nucleotide repeat. For instance, the longest imperfect repeat found in D. pulex with a total length of 3259 bp is a 17 nucleotide repeat located at the end of scaffold 66. Three very similar repeat types, (AAAAGTTCAACTTTATG with 273.0 bp/Mbp, mean length 318.5 bp, AAAAGTAGAACTTTTCT with 209.8 bp/Mbp, mean length 739.62 bp, AAAAGTTCTACTTTGAC with 88.9 bp/Mbp, mean length 705.3 bp) contribute 88% to the total repeat density of 17 bp repeats. (Further repeat types were found that are similar to these three.) A striking characteristic of these repeat types is their high similarity to their reverse complement. The two repeat types with the highest density have only 5 non-matching positions when aligned to their reverse complement. This might hint at a functional role or structural importance of these repeats - see discussion. The mean length of all imperfect 17-mer nucleotide repeats is 270 bp, which is the highest value for repeats with a unit shorter than 46 bp in D. pulex. Repeats of the 17 bp repeat class are mostly found in intergenic regions with a density of 1039.4 bp/Mbp and mean length of 295.0 bp.
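The similarity of a repeat unit to its own reverse complement can be explored with a short sketch. The text does not specify how the alignment was made. The function below assumes an ungapped comparison over all circular shifts of the reverse complement, and its name is illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(unit):
    return unit.translate(_COMPLEMENT)[::-1]


def min_mismatches_to_reverse_complement(unit):
    """Smallest number of mismatching positions between a unit and any
    circular shift of its reverse complement (ungapped; an assumption)."""
    rc = reverse_complement(unit)
    best = len(unit)
    for shift in range(len(unit)):
        rotated = rc[shift:] + rc[:shift]
        mismatches = sum(a != b for a, b in zip(unit, rotated))
        best = min(best, mismatches)
    return best


for unit in ("AAAAGTTCAACTTTATG", "AAAAGTAGAACTTTTCT"):
    print(unit, min_mismatches_to_reverse_complement(unit))
```

For the two highest-density units quoted above, a circular shift with five mismatching positions exists, so the function returns at most 5 for each of them.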
Tandem repeats, together with interspersed repeats, are key features of eukaryotic genomes and important for the understanding of genome evolution. For the newly sequenced crustacean D. pulex we have analysed the characteristics of TRs and compared them to the TR characteristics of 11 other genomes from very different evolutionary lineages. A particular focus was on the comparison with the genomes of A. mellifera and the model insect D. melanogaster because of their shared ancestry with Daphnia within the Pancrustacea; despite their large evolutionary divergence, they served best to help annotate the D. pulex genome.
A general problem of TR analyses is that the detection criteria, the allowed degree of imperfection, the optimality criterion as well as the accuracy of the search algorithm can significantly influence the characteristics of TRs found in a search [65, 66]. Therefore, a direct comparison of TR characteristics of different genomes is only possible if the analyses were carried out with the same search tool using the same search parameters. Despite differences in the detection criteria, the TR type densities for Homo sapiens obtained in this study and by Subramanian et al.  agree well in terms of absolute and relative densities (see Table 3 in this paper and Figures 3, 4 and 5 in ), supporting the view that general trends can well be independent of the search criteria. While Subramanian et al.  also used TR densities as the main characteristics, many studies rely on number counts. This type of data is difficult to compare to analyses using TR densities. Hence, in this paper we have compared our results mainly with those in Tóth et al. , since their detection criteria (perfect STRs, minimum length 13 bp), main characteristics (TR densities) and the compared taxa still come closest to those used in the present analysis. All comparisons drawn here have been confirmed (in a separate analysis) to hold true also when using the same search parameters as in .
Our analyses show that TRs contribute considerably to all genomes analysed in this study, which is consistent with earlier results ([5, 9, 11, 12, 51, 67] and many others). No TR characteristics were found that are common to all of the 12 genomes, except for a relatively low density of ACT repeats, which has already been reported in Tóth et al. . The dominance of taxon rather than group specific characteristics has also been reported in [44, 51] when comparing number counts of satellites. As a general trend, Tóth and collaborators  also observed an underrepresentation of ACG repeats in most taxa. Our data support this trend with the striking exception of O. lucimarinus, where ACG repeats constitute the highest individual trinucleotide repeat type density in this study (Table 3). Curiously, the high absolute and relative di- and trinucleotide repeat densities found in O. lucimarinus are exclusively based on the high densities of the CG, ACG, and CCG repeat types that are uncommon in all other taxa in this study (see discussion below). The high CG-content of these three dominant repeat types is consistent with the high CG-content (60%) of the genome of O. lucimarinus.
Even within evolutionary lineages, common features of TR characteristics are rare. Notable are the clear dominance of poly-A over poly-C repeat types in all genomes except for the diatom and the green alga, the almost complete absence of mononucleotide repeats in the diatom and the green alga, and the almost complete absence of ACG repeats in vertebrates (Figure 2 and Table 3). Our data also support the result of Tóth et al.  that the relatively high proportion of tetranucleotide over trinucleotide repeat densities in vertebrates could not be found in any other taxonomic group. To establish these features as lineage specific, still more taxa need to be analysed. Besides these few cases of group specific similarities, this study reveals a high level of dissimilarity in genomic repeat class and repeat type densities among all taxonomic groups. Among the fungi, for example, the genomes of N. crassa and S. cerevisiae show no lineage specific similarities. In contrast to Tóth et al. , where AT and AAT repeats were the dominant di- and trinucleotide repeat types in genomes of fungi, N. crassa has a more than 2.6 times higher density of AC than AT repeats and a more than 3 times higher density of AAC than AAT repeats in this study. Also the three arthropod species, D. pulex, D. melanogaster, and A. mellifera, show no remarkable similarities among mono- to hexanucleotide repeat class (Figure 2) or mono- to trinucleotide repeat type densities (Additional file 7). Several common features of arthropods that have been found in  cannot be confirmed in the present analysis: whereas these authors found dinucleotide TRs to constitute the dominant repeat class in arthropods, this is not the case in the present study for D. pulex, where the density of trinucleotide repeats exceeds the density of dinucleotide repeats by 40%.
Furthermore, in  AC was the dominant dinucleotide and AAC and AGC the dominant trinucleotide repeat types in arthropods, which is not the case for the genomes of A. mellifera and D. pulex. Most strikingly, the AC, AAC, and AGC repeat type densities are particularly low in A. mellifera, a genome for which an atypical repeat type usage, as compared to other arthropods, has already been mentioned in . A. mellifera also stands out as the taxon with the highest density of mononucleotide repeats in this study, whereas in  this repeat class was found to be densest in primates. The finding in  that penta- and hexanucleotide repeats were "invariably more frequent than tetranucleotide repeats in all non-vertebrate taxa" cannot be confirmed in the present study.
Going beyond the scope of previous TR analyses ([9, 11, 43, 44] and others), we compared characteristics of TRs with unit sizes in the range 1-50 bp. Our results reveal that imperfect TRs with unit sizes larger than 6 bp contribute significantly to the TR content of all genomes analysed. For example, the model nematode C. elegans was commonly thought to have a very low density of genomic TRs , which is true for the unit size range 1-5 bp, but not for the size range 6-50 bp (Additional file 2, see also Figure 3). This finding leads to a completely new picture of the TR content of this organism.
Concerning the mean lengths of STRs, this study showed that the genome of D. pulex is characterized by shorter STRs than the other genomes. Furthermore, among the STRs, perfect repeats have a higher density than imperfect repeats. Neglecting the still unknown contribution of unequal crossing-over to length altering mutations of STRs, their equilibrium lengths are the result of slippage events extending STRs and point mutations breaking perfect TRs into shorter repeats [41, 46, 69, 70]. The dominance of relatively short STRs in the genome of D. pulex indicates that the 'life cycle' of a typical TR is comparatively short, i.e. the frequency of interrupting point mutations is relatively high compared to extending slippage mutations. Furthermore, it has been discussed in the literature whether the typical length of TRs is inversely correlated to the effective population size (see e.g. ). Since large population sizes are a feature of D. pulex, our results are not in conflict with this conjecture.
Another interesting point is the typical perfection of TRs. Perfect TRs are believed to be subject to more length altering mutations than imperfect repeats, since a higher similarity of sequence segments increases the chance of slippage and homologous crossing-over events. Since the STRs found in D. pulex but also those in A. mellifera are predominantly perfect, we expect an increased number of length altering mutations in these two genomes. The mutability of STRs in D. pulex has been studied in detail by another group of the Daphnia Genomics Consortium, which compares the rate and spectrum of microsatellite mutations in D. pulex and C. elegans . In view of this remark it is interesting that TRs in the size range 1-50 bp are on average more imperfect in CDS regions of all three arthropod genomes as compared to introns and intergenic regions.
A direct comparison of TRs with a unit size of 1-50 bp among the three arthropods shows remarkable differences. The dominant repeat classes (highest to lower densities) are the 2, 1, 3, 4, 5, and 10 bp repeat classes of A. mellifera, the 3, 2, 1, 17, 4, and 10 bp repeat classes in D. pulex and the 11, 5, 12, 2, 1, and 3 bp repeat classes in D. melanogaster. This highlights the trend towards shorter motifs in A. mellifera in contrast to the trend towards longer motifs in D. melanogaster. The relative dominance of 3 bp repeats in D. pulex likely reflects the great number of genes (>30000; Daphnia Genomics Consortium unpublished data) in this comparatively small genome. This same paper also states that D. pulex is one of the organisms most tightly packed with genes. Similar to the repeat densities, the mean lengths of TRs show remarkable differences among the three arthropods. An elevated mean length of TRs in a repeat class can hint at telomeric and centromeric repeats. In D. pulex, candidates for telomeric and centromeric repeats are found in the 17, 24, and 10 bp repeat classes. Since the long 17 bp repeats are usually located at the beginning or end of scaffolds, their true density is likely to be underestimated. Interestingly, just three very similar repeat types contribute 87% of the density to this repeat class. It is worth noting that the two repeat types with the highest density have only 5 non-matching positions when aligned to their reverse complement, which could lead to the formation of alternative secondary structures, see e.g. [33, 72].
As mentioned above, the CG, ACG and CCG repeat types are rare in all taxa except for O. lucimarinus, where the densities of these repeats are particularly high. Usually, the low densities of these motifs are explained by the high mutability of methylated CpG dinucleotides (as well as CpNpG trinucleotides in plants, where N can be any nucleotide), which efficiently disrupts CpG rich domains on short timescales. Since CCG repeat densities are also low in several organisms that do not methylate (C. elegans, Drosophila and yeast), Tóth et al.  argue in favour of other mechanisms, which lead to low CCG repeat densities, particularly in introns. According to our data, CpG and CpNpG mutations must certainly be suppressed in TR regions of O. lucimarinus. Furthermore, mechanisms which act against CpG-rich repeats in other species are not in effect in this genome. The particularly high densities of CG, ACG, and CCG as compared to all other mono- to trinucleotide repeat types in O. lucimarinus even raise the question whether CpG-rich repeats are simply favoured for unknown reasons, or whether they are prone to particularly high growth rates if their occurrence is not suppressed.
Interesting in this respect is a direct comparison of the densities of the ACG and AGC repeat types, which have identical nucleotide content on the same strand, but which differ in the occurrence of the CpG dinucleotide. The density ratio of AGC to ACG repeats ranges from high values in the vertebrates with a value of 63.4 in H. sapiens to 0.0040 in O. lucimarinus (Table 3). Even among the three arthropod species, this density ratio differs considerably: D. pulex (3.3), A. mellifera (0.28), and D. melanogaster (18.5). Interestingly, A. mellifera and O. lucimarinus are the only two species for which the density of ACG repeats is higher than the density of AGC repeats. Among the three arthropods, A. mellifera has the highest content of CpG containing TRs despite its lowest value for the genomic CG-content (34.9%) in this study. Consistent with this observation, a CpG content higher than in other arthropods and higher than expected from mononucleotide frequencies has been found previously, even though A. mellifera methylates CpG dinucleotides .
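The contrast between ACG and AGC repeats can be illustrated with a short check. The sketch below is not taken from the original analysis; `contains_cpg` is a hypothetical helper that tests whether a tandemly repeated unit contains a CpG dinucleotide, including across the junction between two copies. The density ratios are the values quoted in the text for five of the species.

```python
def contains_cpg(motif: str) -> bool:
    """True if the tandemly repeated motif contains a CpG dinucleotide.

    The motif is doubled so that the junction between two copies is
    checked as well. CpG is its own reverse complement, so inspecting
    one strand is sufficient.
    """
    return "CG" in (motif.upper() * 2)


# AGC:ACG density ratios as quoted in the text (from Table 3 of the paper).
AGC_TO_ACG_RATIO = {
    "H. sapiens": 63.4,
    "D. melanogaster": 18.5,
    "D. pulex": 3.3,
    "A. mellifera": 0.28,
    "O. lucimarinus": 0.0040,
}

# Species in which ACG repeats are denser than AGC repeats (ratio below 1).
acg_dominant = sorted(sp for sp, ratio in AGC_TO_ACG_RATIO.items() if ratio < 1.0)

print(sorted("ACG") == sorted("AGC"))            # same nucleotide content
print(contains_cpg("ACG"), contains_cpg("AGC"))  # only ACG carries a CpG
print(acg_dominant)
```

Both units have the same base composition, but only the ACG repeat contains a CpG. Among the listed ratios, only A. mellifera and O. lucimarinus fall below 1.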
In D. pulex, the densities of AnX (n = 1 to 10) repeat types are significantly overrepresented, a feature that has also been observed for other, distantly related species (H. sapiens , A. thaliana ). Lawson and Zhang  have argued that these repeats could have evolved from mutations in poly-A repeats.
Several recent studies have shown that TRs are not just "junk DNA" but play an important role in genome organization, gene regulation and altering gene function. They have gained particular interest due to their potential for rapid adaptations, and several authors regard them as hotspots for the evolutionary success of species [28, 34, 36–39].
In D. pulex, STRs are predominantly found in introns with a clear preference for a small number of repeat types (AC, AG, AAG, AGC). Interestingly, all mono- to trinucleotide repeat types are densest in introns, with the exception of AT and CCG repeat types. A predominance of STRs in introns has not been reported for many genomes before, except e.g. for fungi in . In D. melanogaster, STRs have highest densities in 3'UTR with a preference for AG, AT, AAC, and AGC repeats. Common to the D. pulex and D. melanogaster genome is the dominance of AC repeats in introns, AT repeats in 3'UTR, and CCG repeats in coding regions. Relatively high densities of CCG repeats in CDS regions and low densities in introns had also been reported for vertebrates and arthropods . All these features are in contradiction to a model of neutral evolution of different TR types, see also [9, 34]. They suggest differential selection to prevail in different genomes and genomic regions, which in turn hints at an evolutionary or functional importance of TRs.
Concerning the density of different repeat classes in different genomic regions of D. pulex, the following observations are of particular interest: (i) The densities of the repeat classes 1-5, 7-8 bp are higher in introns than in CDS and intergenic regions. (ii) The densities of TRs with a unit size above 8 bp are much lower in introns than in the other regions. (iii) The densities of almost all repeat classes with a unit size longer than 10 bp that are a multiple of three are higher in CDS regions than in introns and even intergenic regions. (iv) The high density of trinucleotide repeats in introns raises the question of how well introns have been annotated. Furthermore, it would be interesting to determine DNA transfer rates between CDS regions and introns caused by mutations. This process could also be the reason for higher trinucleotide densities in introns. Observation (i) could be explained by a preference for TRs in introns that are more variable or that have higher repeat copy numbers, both of which could be important for regulatory elements. Observation (ii) could indicate that TRs with longer motifs are not beneficial in introns. Alternatively, the restricted size of introns could be the limiting factor for TRs with longer motifs. Observation (iii), however, shows that the size of genomic features does not provide a good indication of the expected motif sizes of TRs. While introns and CDS regions have about the same size in D. pulex (see Table 5), observations (i) to (iii) show opposite preferences for the motif size of TRs in these two regions. The tendency toward longer repeat motifs in coding regions is presumably caused by tandemly repeated amino acid sequences, in particular for the motif PPR (proline - proline - glycine), and suggests strong protein domain level selection. Most interestingly, the absolute density of TRs with a unit size of 7-50 bp in CDS regions of D. pulex is higher than in CDS regions of D. melanogaster, despite the strong tendency towards longer repeat units in all other regions of D. melanogaster.
An interesting observation of our analysis is the strandedness found for some repeat motifs in CDS regions and introns. The fact that some motifs are favoured on a particular strand hints at a selective advantage that remains to be studied in more detail.
The overall strong differences in TR characteristics among genomes and genomic regions raise many questions. For the extreme outlier with respect to repeat type usage, O. lucimarinus, we found that the most dominant repeats have a high CG content, which correlates with the high CG content of the complete genome. It would certainly be interesting to study this putative correlation in a separate study. An observation of Riley et al. [33, 72] should be noted at this point. They have found that for repeats with putative regulatory function, the existence of the repeat and its overall structure is more important than the detailed base composition. This would allow organisms to have different repeat motifs with their preferred base composition at segments of the genome that are important for regulation.
The question arises whether TRs can be used to detect problems or inconsistencies in the current annotation of genomes. For this reason we had a closer look at selected TRs occurring in coding regions of D. pulex (from Additional file 6). Only a small proportion of these annotated genes show clearly low support, but the support decreased for annotated genes that host multiple TRs, such as Dappu-243907 and Dappu-318831. Furthermore, we had a look at gene models that host TRs with a motif size that is not a multiple of three, e.g. the relatively dense 10 and 20 bp repeat classes. Among these gene models, several were found for which the TR has almost the same size as the CDS element. Interesting examples with almost identical repeat units are found in the following annotated genes (parentheses contain the length of the CDS element, the length of the TR, and the repeat unit): Dappu-264024 (1075 bp, 1033 bp, ACGCCAGAGC), Dappu-264706 (165 bp, 113 bp, ACGCCAGAGC), Dappu-267284 (414 bp, 395 bp, ACGCCAGTGC), Dappu-267285 (460 bp, 459 bp, ACGCCAGTGC), and Dappu-265168 (738 bp, 473 bp, AATGC ACGCCAGTGC ACGCC). The numbers show that these CDS elements consist almost exclusively of the repeat pattern. The unit ACGCCA is indeed found in several other TRs in CDS regions of D. pulex. We found that the mean perfection of these 10-mer repeats (97.4%) is only marginally lower than that of 9-mer repeats (98.8%) or that of trinucleotide repeats (99.1%), so their imperfection should not be taken as an indication of a potential invariability of these 10-mer repeats in CDS regions. Another problematic finding is the high repeat content in exons of D. melanogaster of two very similar repeat types with the units AAACCAACTGAGGGAACGAGTGCCAAGCCTACAACTTTG (195.4 bp/Mbp) and AAACCAACTGAGGGAACTACGGCGAAGCCTACAACTTTG (109.1 bp/Mbp), with no contribution of these repeat types to either CDS or UTRs, hinting at a problem in the annotation where these repeats occur.
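How much of each CDS element is occupied by the repeat can be computed directly from the lengths given above. The following sketch is an illustration under the assumption that each TR lies entirely within the listed CDS element, which the text suggests but does not state explicitly; the helper `tr_coverage` is hypothetical.

```python
# (CDS length in bp, TR length in bp, repeat unit) as quoted in the text.
CDS_REPEATS = {
    "Dappu-264024": (1075, 1033, "ACGCCAGAGC"),
    "Dappu-264706": (165, 113, "ACGCCAGAGC"),
    "Dappu-267284": (414, 395, "ACGCCAGTGC"),
    "Dappu-267285": (460, 459, "ACGCCAGTGC"),
    "Dappu-265168": (738, 473, "AATGC ACGCCAGTGC ACGCC"),
}


def tr_coverage(cds_length: int, tr_length: int) -> float:
    """Fraction of a CDS element covered by a TR (assumes the TR lies inside it)."""
    if cds_length <= 0 or not 0 <= tr_length <= cds_length:
        raise ValueError("lengths must satisfy 0 <= tr_length <= cds_length, cds_length > 0")
    return tr_length / cds_length


for gene, (cds_len, tr_len, unit) in CDS_REPEATS.items():
    unit_size = len(unit.replace(" ", ""))  # spaces only group the unit visually
    coverage = 100 * tr_coverage(cds_len, tr_len)
    in_frame = unit_size % 3 == 0
    print(f"{gene}: {coverage:.1f}% covered, unit {unit_size} bp, multiple of three: {in_frame}")
```

With these numbers the coverage ranges from about 64% (Dappu-265168) to 99.8% (Dappu-267285), and none of the unit sizes (10 or 20 bp) is a multiple of three.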
For the characteristics of TRs analysed in the present work we have not given any error margins, not because we believe that our results are exact, but because an estimate of error margins is hardly feasible. While a minor source of uncertainty might be introduced by the TR search algorithm, the main source of error is the incomplete nature of most genome assemblies (see Table 1). The genomic sequences of the current assemblies of D. pulex, A. mellifera, D. melanogaster, and H. sapiens, for instance, contain 19.6%, 15.6%, 3.8%, and 7.2% unknown nucleotides (Ns), respectively (Table 1). But even the apparently low number of Ns in the latter two organisms might be too optimistic, which is phrased in  as follows: "... a telomere-to-telomere DNA sequence is not yet available for complex metazoans, including humans. The missing genomic "dark matter" is the heterochromatin, which is generally defined as repeat-rich regions concentrated in the centric and telomeric regions of chromosomes. Centric heterochromatin makes up at least 20% of human and 30% of fly genomes, respectively; thus, even for well-studied organisms such as D. melanogaster, fundamental questions about gene number and global genome structure remain unanswered."
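The effect of unknown nucleotides on a density estimate can be sketched as follows. This is an illustration only: the section does not state whether the reported densities (bp/Mbp) are normalised by the total sequence length or by the length without Ns, so both variants are shown, and the function names and the toy sequence are hypothetical.

```python
def unknown_fraction(sequence: str) -> float:
    """Fraction of unknown nucleotides (N) in a sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sequence.upper().count("N") / len(sequence)


def tr_density(tr_bp: int, sequence: str, exclude_unknown: bool = False) -> float:
    """TR density in bp/Mbp.

    tr_bp is the number of base pairs covered by TRs. If exclude_unknown
    is True, Ns are not counted in the reference length (an assumption,
    not necessarily the normalisation used in the study).
    """
    seq = sequence.upper()
    length = len(seq) - seq.count("N") if exclude_unknown else len(seq)
    if length <= 0:
        raise ValueError("reference length must be positive")
    return tr_bp * 1_000_000 / length


toy = "ACGT" * 200 + "N" * 200  # 1000 bp toy sequence with 20% Ns
print(unknown_fraction(toy))
print(tr_density(50, toy))
print(tr_density(50, toy, exclude_unknown=True))
```

In the toy example, excluding the 20% unknown positions raises the density estimate by a factor of 1.25, which illustrates how strongly the choice of reference length can matter for assemblies with many Ns.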
For obvious reasons, most genome projects focus on sequencing easily accessible coding regions and leave aside highly repetitive regions, which are difficult to sequence and assemble. As a consequence, TR densities will be lower in sequenced than in unsequenced genomic regions, and error margins for TR densities cannot be assessed statistically, but depend on mostly unknown systematic errors of the current assembly. The implication for the present work is that TR densities are likely to be underestimated for all genomes analysed. Among the three arthropods, D. melanogaster is the best-studied organism and the only one with an exclusive Heterochromatin Genome Project [61, 62]. For D. pulex and A. mellifera, heterochromatic regions have not yet been sequenced with the same effort. However, the contribution of heterochromatin in A. mellifera is estimated to be about 3% [73, 74], whereas in D. melanogaster the contribution is about 30%, without clear boundaries between euchromatin and heterochromatin . These differences in sequencing status and different sizes of heterochromatic regions could lead to a bias of yet unknown direction.
Altogether, it is expected that this bias will not affect the generally robust trends we found in our analyses for the following reasons: in D. melanogaster, the trend towards longer repeat units appeared already in the first assemblies, while this has not been observed in A. mellifera. In this context it is interesting to note that the total density of STRs is still higher in A. mellifera than in D. melanogaster. In D. pulex, no reliable estimate of the contribution of heterochromatin is known. Our study indicates a trend to slightly higher contributions than in A. mellifera, but considerably lower contributions than in D. melanogaster.
The newly sequenced genome of Daphnia pulex shows several interesting characteristics of TRs which distinguish it from the other model arthropods D. melanogaster and A. mellifera. The density of TRs is much lower than in the two other arthropods. The mean length of STRs was shortest among all genomes in this study. From a functional perspective it is interesting that STRs are by far densest in introns and that the contribution of TRs with units longer than 6 bp in CDS regions of D. pulex is even higher than in D. melanogaster. The finding of a strong strand bias in repeat motif usage (strandedness) underpins the functional relevance of several repeats. A notable feature of D. pulex is the high density of 17 bp repeats presumably associated with heterochromatin regions.
Comparing the 12 genomes, our results reveal an astonishing level of differences in TR characteristics among different genomes and different genomic regions, which even exceeds the level of differences found in previous studies. Extreme "outliers" concerning densities and repeat type usage (O. lucimarinus) even lead us to the conjecture that nature has not imposed general limitations concerning repeat type usage and densities of TRs in genomes. In view of several general and lineage specific TR characteristics that have been refuted in this analysis, and in view of the still small number of taxa that have been compared, the existence of common TR characteristics in major lineages becomes doubtful.
Altogether, this study demonstrates the need to analyse not only short TRs but also TRs with longer units, which contribute significantly to all genomes analysed in this study. Restricting an analysis to STRs leaves a large number of genomic TRs unnoticed that may play an important evolutionary (functional or structural) role.
STR: short tandem repeat
The authors want to thank Dr. Bánk Beszteri and Prof. Dr. Stephan Frickenhaus (Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany) for giving us the opportunity to use the AWI computer clusters for a large part of the present analysis and for their valuable technical assistance. Furthermore, we are very grateful to Dr. Andrea Aerts from the DOE Joint Genome Institute for providing us with the most recent FrozenGeneCatalog from Jan. 29th 2008 in the generic "Gene Feature Format" containing all information needed for our comparison of different genomic regions. We also want to thank the steering committee of the Daphnia Genomics Consortium for the great infrastructure they have provided for all contributors to this genome project.
The sequencing and portions of the analyses of the Daphnia pulex genome were performed at the DOE Joint Genome Institute under the auspices of the U.S. Department of Energy's Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48, Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231, Los Alamos National Laboratory under Contract No. W-7405-ENG-36 and in collaboration with the Daphnia Genomics Consortium (DGC) http://daphnia.cgb.indiana.edu. Additional analyses were performed by wFleaBase, developed at the Genome Informatics Lab of Indiana University with support to Don Gilbert from the National Science Foundation and the National Institutes of Health. Coordination infrastructure for the DGC is provided by The Center for Genomics and Bioinformatics at Indiana University, which is supported in part by the METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, Inc. Our work benefits from, and contributes to the Daphnia Genomics Consortium. We also want to thank two anonymous reviewers for constructive and helpful comments on this manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.