Eukaryote DIRS1-like retrotransposons: an overview

Background: DIRS1-like elements compose one superfamily of tyrosine recombinase-encoding retrotransposons. They have been previously reported in only a few diverse eukaryote species, describing a patchy distribution, and little is known about their origin and dynamics. Recently, we have shown that these retrotransposons are common among decapods, which calls into question the distribution of DIRS1-like retrotransposons among eukaryotes.
Results: To determine the distribution of DIRS1-like retrotransposons, we developed a new computational tool, ReDoSt, which allows us to identify well-conserved DIRS1-like elements. By screening 274 completely sequenced genomes, we identified more than 4000 DIRS1-like copies distributed among 30 diverse species which can be clustered into roughly 300 families. While the diversity in most species appears restricted to a low copy number, a few bursts of transposition are strongly suggested in certain species, such as Danio rerio and Saccoglossus kowalevskii.
Conclusion: In this study, we report 14 new species and 8 new higher taxa that were not previously known to harbor DIRS1-like retrotransposons. Now reported in 61 species, these elements appear widely distributed among eukaryotes, even if they remain undetected in streptophytes and mammals. Especially in unikonts, a broad range of taxa from Cnidaria to Sauropsida harbors such elements. Both the distribution and the similarities between the DIRS1-like element phylogeny and conventional phylogenies of the host species suggest that DIRS1-like retrotransposons emerged early during the radiation of eukaryotes.


Background
The tyrosine recombinase (YR)-encoding elements constitute one of the major groups of retrotransposons [1,2]. These elements encode a YR that is required for the mechanism of integration into the genome [3], distinguishing them from other retrotransposons (i.e., LTR retrotransposons, LINEs, SINEs and Penelope) [4]. DIRS1-like retrotransposons belong to the YR-encoding element superfamilies [5], whose constituents exhibit a unique structure made up of three ORFs and uncommon repeats (Figure 1). The first ORF encodes a putative GAG protein, the second the YR, and the third a pol region composed of three distinct domains: a reverse transcriptase (RT), an RNase H (RH), and a methyltransferase (MT). The function of the latter remains unknown. Depending on the element considered, there may be considerable overlap between the pol and the YR regions (Figure 1). The catalytic tyrosine recombinase domain is encoded by the non-overlapping 3'-end of the YR ORF. Many phylogenetic analyses have shown that the RT/RH domains of DIRS1-like retrotransposons are closely related to those of Ty3/Gypsy LTR retrotransposons, suggesting that all these elements diverged from an ancient GAG-pol form of retrotransposon [5-7]. DIRS1-like elements are bounded by Inverted Terminal Repeats (ITRs) and harbor two Internal Complementary Regions (ICRs). The two ICRs located at the 3'-end of the element appear to overlap on a 3-bp motif called the circular junction. Just as the left ICR is inverse-complementary to the beginning of the left ITR, the right ICR is inverse-complementary to the end of the right ITR, but the right ICR also appears complementary to an extension of the right ITR that is called the right Extension (rE) [1]. Given these unusual features, an integration model has been proposed [3,5] in which the ITRs' extremities match with their respective ICR. The junction of the two ITRs results in the formation of a rolling-circle intermediate of the element. The element integration then occurs by recombination between the 3-bp ITR junction sequence (complementary to the circular junction) and an identical sequence in the genome, which does not produce any target site duplications. Their unique structure distinguishes DIRS1-like retrotransposons from the other members of the YR-encoding elements, also known as the DIRS order [2], which also includes the Ngaro, Viper and PAT elements. The Ngaro and Viper retrotransposons are devoid of the MT domain and do not usually harbor ORF overlaps [6,8]. Elements from the PAT superfamily, the sister group of DIRS1-like retrotransposons, differ most prominently in their repeats. The PAT retrotransposons (PAT-like elements, TOC elements and kangaroo) are bounded by "Split" Direct Repeats (SDRs) and can contain tyrosine recombinase-encoding regions in an inverted orientation [5].
Transposable elements have been found in all eukaryotic species investigated thus far [2]. However, depending on the superfamily or family of elements studied, they show different distributions among eukaryotes. For example, the Ty1/Copia, Ty3/Gypsy, LINE and SINE retrotransposons and the Tc1/Mariner transposons have been detected almost ubiquitously [2,7,9-11]. The Penelope retrotransposons are also abundant in many animal species, but seem to be rare among plants, protists and fungi [12]. In contrast, the Maverick transposons (also called Polintons) show a highly patchy distribution in diverse eukaryote species and have not been found in plants [13,14]. Until recently, bibliographic data and automatic annotations had revealed the presence of DIRS1-like retrotransposons in only 43 diverse eukaryote organisms (Table 1), mostly with a low diversity per species (up to four families in Strongylocentrotus purpuratus and three families in Danio rerio [1,5]), with the notable exception of Xenopus tropicalis (73 families deposited in Repbase [15]). They were not described in several well-studied groups (e.g., plants and mammals), and are absent from model organisms such as Saccharomyces cerevisiae and Drosophila melanogaster. However, DIRS1-like retrotransposons appear widely distributed among decapod crustaceans [16]. These elements were previously detected using PCR approaches in 16 decapod species, including some shrimps, lobsters, crabs and galatheid crabs. The wide distribution among decapods and the continuous identification of elements in new species with the emergence of large-scale genome sequencing call into question their supposedly patchy distribution among eukaryote species.
We aim to determine the distribution of DIRS1-like retrotransposons among eukaryotes using an in silico approach. In the post-genome era, several automatic annotation tools have been developed to detect the presence of particular types of transposable elements in genomes. The conventional approaches are based on similarity searching using the RepeatMasker program [17]. However, transposable elements often correspond to ancient genome components. Many copies, even within the same family, appear fragmented and divergent in nucleotide sequence because of point mutations, rearrangements, and insertions or deletions (indels). Similarity searching-based programs are efficient in identifying copies closely related to those previously reported in the library, but they are often inefficient in detecting very divergent copies or unknown elements [18]. Other in silico approaches have been developed to detect particular types of elements. These programs, such as LTRharvest [19], are not based upon similarity searching but on specific signature searches (e.g., the nature of the termini and the presence of target site duplications). While some programs have been developed to detect LTR retrotransposons or transposons, none have been developed for DIRS1-like retrotransposons. Such a program would probably be inefficient in identifying divergent DIRS1-like retrotransposons because the training dataset that is currently available for these elements remains too limited (for example, only 18 reference elements with detectable ITRs). Some de novo approaches that detect more divergent transposable elements, such as RECON [20], have been developed to exhaustively report the content of repeated sequences within genomes. To identify a specific type of element, however, this report must then be analyzed further, for example by similarity searching. For the same reasons as those given for similarity searching-based methods, such approaches appear inappropriate for studying the distribution of the DIRS1-like retrotransposons.
We present here a new computational approach, which we call ReDoSt, specifically dedicated to the identification of DIRS1-like retrotransposons in genomes. Our method is based on both the detection of the structure of these elements and on sequence similarity searches performed using alignment profiles designed on coding domains. It has the advantages of not depending on the element copy number and of avoiding any preconception about the ITRs (length or sequence identity). With our method we analyzed 274 completely sequenced genomes, which allowed for a high coverage of eukaryotic diversity, especially plants and unikonts.
We have identified more than 4000 element copies that can be clustered into approximately 300 new families. We report the first DIRS1-like element copy number estimate among many genomes and we evaluate the diversity within the DIRS1-like superfamily. Their distribution appears wider than it was previously thought, especially in unikont species. Sequence analyses confirmed the presence of well-conserved DIRS1-like retrotransposons in 28 species, including at least 14 species that were not previously known to host such elements, and allowed us to define a more precise structure of the DIRS1-like retrotransposons, especially in their terminal repeats.

Results and Discussion
Identification of putative DIRS1-like retrotransposons in eukaryote genomes
To study the distribution of DIRS1-like retrotransposons among genomes, we developed a new computational tool that we call ReDoSt (Retrotransposon Domain and Structure). The element detection is mainly based on independent similarity searches against co-oriented and well-ordered RT-, MT- and YR-encoding domains within a single 10-kb genomic fragment (see Methods). So, the DIRS1-like copies detected with ReDoSt may be considered as well-conserved (i.e., with the simultaneous recognizable presence of these three characteristic domains), which suggests that they may still be active, or have moved only recently. Thus, relics and highly degenerate elements are not considered here.
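The domain-based criterion described above can be illustrated with a minimal sketch. This is not the ReDoSt implementation (which is described in Methods): the `Hit` record, the function name `candidate_elements`, and the assumed RT, MT, YR order along the element are illustrative choices. Only the 10-kb window and the requirement for co-oriented, ordered domains come from the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hit:
    """One similarity-search hit on a contig (hypothetical record layout)."""
    domain: str  # "RT", "MT" or "YR"
    start: int   # 0-based start coordinate on the contig
    end: int     # end coordinate (exclusive)
    strand: str  # "+" or "-"


def candidate_elements(hits, order=("RT", "MT", "YR"), max_span=10_000):
    """Return hit triples that are co-oriented, ordered, and within max_span bp.

    The domain order is an assumption taken from the order in which the
    domains are listed in the text.
    """
    by_domain = [[h for h in hits if h.domain == d] for d in order]
    candidates = []
    for combo in product(*by_domain):
        # All three domains must lie on the same strand.
        if len({h.strand for h in combo}) != 1:
            continue
        # The whole triple must fit in a single genomic window.
        if max(h.end for h in combo) - min(h.start for h in combo) > max_span:
            continue
        # On the minus strand the element reads right to left on the contig.
        along_contig = combo if combo[0].strand == "+" else combo[::-1]
        starts = [h.start for h in along_contig]
        if starts == sorted(starts) and len(set(starts)) == len(starts):
            candidates.append(combo)
    return candidates


hits = [
    Hit("RT", 1_000, 1_370, "+"),
    Hit("MT", 2_500, 2_810, "+"),
    Hit("YR", 3_000, 3_920, "+"),
    Hit("YR", 20_000, 20_920, "+"),  # too far away from the RT hit
    Hit("MT", 2_600, 2_900, "-"),    # wrong orientation
]
for triple in candidate_elements(hits):
    print([(h.domain, h.start, h.end, h.strand) for h in triple])
```

In this toy input only one triple passes all three filters. Ordering is tested on start coordinates only, so that partially overlapping hits are tolerated; a stricter non-overlap rule would be an equally plausible choice.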

Table 1 (excerpt). Higher taxon: species [reference]
Hexapoda: Harpegnathos saltator [27]; Nasonia vitripennis (GenBank, 2007); Solenopsis invicta [28]; Tribolium castaneum [21]
Mucoromycotina: Phycomyces blakesleeanus [49]; Rhizopus oryzae [8]
Sauropsida: Gopherus agassizii [5]
Urochordata: Oikopleura dioica [50]
All the detected DIRS1-like elements, even in partial sequences, are reported here.
Some of the taxa represented are characterized for the first time as harboring DIRS1-like retrotransposons. For example, we observed the first DIRS1-like elements in Mollusca (Aplysia californica and Lottia gigantea). Interestingly, DIRS1-like retrotransposons can be detected in all the species tested in two higher taxa, Actinopterygii and Mucoromycotina. ReDoSt was able to detect DIRS1-like elements in all species already described in the literature except those harbored in the honey bee Apis mellifera genome. This discrepancy is due to the fact that this genome contains only remnant fragments of DIRS1-like elements, which ReDoSt is unable to detect [21]. As expected, the identified elements seem to be well-conserved. The length of the three detected domains appears highly constrained within the elements of a given genome. For example, in the Sauropsida Anolis carolinensis genome, almost all RT-, MT- and YR-encoding fragments have a length ranging from 360 to 380 bp, 300 to 320 bp, and 900 to 940 bp, respectively (Additional File 1). This pattern is present in most genomes, with the notable exception of Saccoglossus kowalevskii, whose domain lengths vary considerably (Additional File 1), possibly because of multiple large fragment deletions.
Considering the distribution of the 4310 copies detected in 32 eukaryotes, the copy number per genome appears highly variable (Table 2), even within some of the higher taxa examined. In Actinopterygii, the low copy numbers detected in Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis and Gasterosteus aculeatus (6, 7, 8, and 21 copies, respectively) contrast with the 2091 copies identified in D. rerio. Conversely, in Mucoromycotina, Mucor circinelloides has ten times fewer copies than other related species. The copy number per genome is usually relatively low, as illustrated by the fact that half of the species harbor fewer than 8 copies. Twelve species show between 10 and 60 copies and only 5 species harbor more than 100 copies (D. pulex, S. kowalevskii, X. tropicalis, A. carolinensis and D. rerio). This suggests that recent element activity is relatively low, resulting either from the inactivation of most genomic copies or from a strong regulation of the copy number. The loss of elements in some higher taxa or species could be facilitated by this low copy number. However, the relatively low copy number observed in genomes is necessarily a conservative estimate, since only well-conserved copies are considered based on the three coding domains studied. For example, similarity searches on Acanthamoeba sp. allowed us to reveal 29 more degenerate sequences related to the unique element detected using ReDoSt (data not shown).
To our knowledge, the copy number has only been previously estimated in two genomes: the slime mold Dictyostelium discoideum and the crustacean Daphnia pulex. In D. discoideum, the previous copy number estimation of DIRS1-like retrotransposons suggested 40 full-size elements and around 200 incomplete copies [22]. Our detection tool results in the identification of 16 well-conserved copies. This result seems consistent with the previous estimation considering the difference in the methods used. The previous analysis estimated the copy number with quantitative Southern-blot experiments using the complete DIRS1-like sequence as a probe. For this reason it may detect more altered elements than our tool does. This is especially the case with the nested elements [23] that amplify the signal in Southern blots but are by default considered to be a unique copy by in silico ReDoSt analysis (see Methods). In D. pulex, the DIRS1-like copy number has been previously estimated at 218 [24], including only 19 intact copies (i.e., uncorrupted sequences and conserved ITRs) [25]. This estimation also seems consistent with our results (100 copies detected), as ReDoSt identifies well-conserved elements but is not limited to intact copies.

The diversity of DIRS1-like retrotransposons
To study the diversity of the DIRS1-like elements, we use the MCL program to cluster into families all the sequences that were detected with ReDoSt as well as reference elements. The parameter values used to cluster in the MCL program were empirically estimated to discriminate each of the DIRS1-like families previously described (e.g., DrDIRS1, DrDIRS2 and DrDIRS3 in D. rerio). Based on the sequence identity, the clusters obtained on the reverse transcriptase-encoding sequences using the MCL program are considered to correspond to different DIRS1-like families. For example, the sequence identities among the largest cluster in A. carolinensis (319 sequences) range from 57% to 100%, with an average sequence identity of 81%. Such a relatively high nucleotide sequence divergence is similar to those observed in reverse transcriptases encoded by non-LTR retrotransposons and in some DNA transposases. The cluster number obtained in each genome reflects the diversity of DIRS1-like elements.
A total of 287 families were found distributed unevenly among the genomes of the 32 species examined (Table 2). Most of the families seem restricted to only one species, with the notable exception of Mucoromycotina species, for which several interspecific families are obtained. Some species show very low element diversity in comparison to their copy number. For example, all 16 copies detected in D. discoideum grouped into a single family. On the other hand, a few species show very high element diversity. For example, S. purpuratus harbors 4 copies distributed among 4 families. Likewise, the 14 copies of B. floridae are split into 11 families. The distribution of copy number per family shows two major profiles according to species (Figure 2 and Additional File 2). Comparing the two vertebrate species X. tropicalis and A. carolinensis, both of which harbor high copy and family numbers, the Western clawed frog contains families almost equal in size whereas the lizard contains two families that together include 64% of the copies. The two fungi Rhizopus oryzae and Allomyces macrogynus have only about 20 copies each, which are well distributed among families in R. oryzae, while half of the copies of A. macrogynus belong to one family. Finally, in D. rerio, which harbors the highest copy number, 96% of the 2091 copies belong to just three families (1157 and 767 copies for DrDIRS1 and DrDIRS2, respectively). Such a distribution, with a high copy number restricted to a few families, could be related to bursts of transposition. Bursts of DIRS1-like element activity are also suspected in S. kowalevskii (the SkoDIRS1 family alone accounts for 175 of the 240 copies identified) and in A. carolinensis (the AcDIRS1 and AcDIRS2 families together harbor more than 60% of the different copies).
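The concentration of copies in a few families can be checked with simple arithmetic on the counts quoted above. The following is a minimal sketch; the helper name `family_share` is illustrative. Because the size of the third D. rerio family is not given in this paragraph, only the combined share of DrDIRS1 and DrDIRS2 is computed.

```python
def family_share(family_sizes, total_copies):
    """Fraction of all copies in a genome that belong to the listed families."""
    return sum(family_sizes) / total_copies


# D. rerio: DrDIRS1 (1157) and DrDIRS2 (767) among 2091 copies.
print(f"D. rerio, two largest families: {family_share([1157, 767], 2091):.1%}")
# S. kowalevskii: SkoDIRS1 (175) among 240 copies.
print(f"S. kowalevskii, SkoDIRS1: {family_share([175], 240):.1%}")
```

With these counts the two largest D. rerio families account for about 92% of the copies, and SkoDIRS1 for about 73% of the S. kowalevskii copies, consistent with the burst-like profiles described in the text.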

Phylogenetic analysis of DIRS1-like retrotransposons
To infer the relationships among the various members of the DIRS1-like superfamily, we constructed a phylogenetic tree (Figure 3) based on an alignment of amino acid pol region sequences (214 sites). This phylogenetic tree contains 114 sequences, including a representative sequence of each family that has at least one uncorrupted copy, 23 DIRS1-like or PAT-like reference elements and 4 Ty3/Gypsy elements used as outgroups. Preliminary analysis of the three genomes that present high family numbers (42 families in A. carolinensis, 39 in D. pulex, and 81 in X. tropicalis) has shown that all of the elements from a given species cluster together into a monophyletic group (data not shown). For these species, only representative elements from the 4 or 5 largest families were included in the phylogenetic analysis. In contrast to previous analyses on much smaller datasets, the monophyly of DIRS1-like elements is not supported in the present study (bootstrap support lower than 75%). Such a pattern could be an artifact of a dataset that is too large and includes divergent elements. Alternatively, it might suggest that the PAT elements belong to the DIRS1-like superfamily, representing a peculiar group because of their structure. Many well-supported groups can be identified within the DIRS1-like elements. In many cases, the elements from a given species form a monophyletic group (e.g., elements from Nasonia vitripennis, D. pulex or A. carolinensis). However, some species harbor elements from two or three different groups (e.g., two and three element groups in A. californica and L. gigantea, respectively). In the same way, each group usually integrates elements from the same species or from a few closely related ones. For example, all the elements identified in fishes belong to one group called DrDIRS1 [21]. Likewise, the fungi group 1 comprises most of the elements identified in fungi, a result that confirms the close relationships between most fungal DIRS1-like elements revealed by the MCL analysis. Although the relationships among the different DIRS1-like groups are difficult to resolve, the monophyletic groups comprise only elements from a single species or from related species, and the tree topology shows no clear evidence of horizontal transfer.

Discriminating the PAT-like sequences included in the final dataset
The PAT-like retrotransposons are the sister group of DIRS1-like elements and show a similar structure with the exception of their termini [6]. To discriminate the putative PAT-like elements retained by ReDoSt, 5 PAT-like reference sequences were included during the clustering process and the phylogenetic analysis (Figure 3). This allowed us to determine that 11 families correspond to PAT-like retrotransposons (Table 2). This includes 6 families from the chlorophytes (Chlamydomonas reinhardtii and Volvox carteri), 3 families from the nematodes (Caenorhabditis briggsae and Pristionchus pacificus), one family from L. gigantea, and one shared by Nematostella vectensis and S. kowalevskii.
The presence of DIRS1-like retrotransposons is confirmed in 25 species, but still remains uncertain in Emiliania huxleyi, Petromyzon marinus, Naegleria gruberi, P. pacificus, V. carteri and C. reinhardtii. Elements from these species do not cluster with any reference elements and their sequences harbor too many frameshifts or indels to be included in our phylogenetic analysis. For these elements, we checked the presence of DIRS1-like elements by similarity searches using the TBLASTX program [26] against the Repbase database, which we had previously re-annotated for the DIRS1-like and PAT elements (data not shown). A family was assigned to the DIRS1-like element superfamily under two conditions: (i) an E-value lower than 1e-20 with at least one DIRS1-like reference element; and (ii) a minimum difference between the best E-values obtained with DIRS1-like and PAT reference elements of 1e-10. Under these criteria, the presence of DIRS1-like retrotransposons could be confirmed in V. carteri, P. marinus and N. gruberi, but remains uncertain in C. reinhardtii and E. huxleyi, whereas the element detected in P. pacificus appears to be a PAT-like retrotransposon. So, 30 of the 32 species revealed by ReDoSt are now considered as harboring DIRS1-like retrotransposons and the two remaining species in fact possess only PAT elements.
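The two assignment conditions can be written as a small decision function. This is a sketch under stated assumptions, not the procedure actually used: the function name and parameters are illustrative, and the "minimum difference of 1e-10" between the best E-values is interpreted here as a factor of at least ten orders of magnitude in favor of the DIRS1-like reference, which is one possible reading of the criterion.

```python
def is_dirs1_like(best_dirs1_evalue, best_pat_evalue=None,
                  max_evalue=1e-20, min_fold=1e10):
    """Apply the two E-value conditions to the best hits of one family.

    best_pat_evalue is None when no PAT reference element was hit.
    Interpreting condition (ii) as a fold difference is an assumption.
    """
    # (i) a strong hit against at least one DIRS1-like reference element
    if best_dirs1_evalue >= max_evalue:
        return False
    # (ii) the DIRS1-like hit must be clearly better than the best PAT hit
    if best_pat_evalue is None:
        return True
    return best_dirs1_evalue * min_fold <= best_pat_evalue


print(is_dirs1_like(1e-35, 1e-20))  # strong and clearly better than PAT
print(is_dirs1_like(1e-25, 1e-22))  # strong, but too close to the PAT hit
print(is_dirs1_like(1e-5))          # too weak to satisfy condition (i)
```

A family that fails these conditions is simply left unassigned here; deciding that it is PAT-like, as for the P. pacificus element, would require a separate rule that the text does not spell out.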
Distribution of DIRS1-like elements among eukaryotes
DIRS1-like retrotransposons are now described in 61 diverse eukaryote species (Figure 4), including 14 species in 8 higher taxa newly characterized using ReDoSt: annelids, blastocladiomycetes, cephalochordates, chlorophytes, heteroloboseans, molluscs, petromyzontids and sauropsids. The DIRS1-like element distribution does not seem to be as patchy as previously described. Sixteen of the 28 unikont groups tested revealed the presence of these elements, indicating a wide distribution. This distribution may prove to be even wider in the near future, since seven of the unikont groups apparently devoid of DIRS1-like elements are currently represented by only one or two completely sequenced genomes. Conversely, four other unikont groups seem to be clearly devoid of DIRS1-like elements. Despite a high number of completely sequenced genomes and diverse taxa tested, no well-conserved copies could be identified in any ascomycetes (75 species), basidiomycetes (16 species), nematodes (12 species) or mammals (37 species). A specific loss of DIRS1-like elements in Mammalia during evolution is the most probable cause of their absence when one takes into consideration their wide distribution in Unikonta, especially Deuterostomia. Outside of unikonts, DIRS1-like retrotransposons appear infrequently, observed in only three groups, even though most groups are represented by relatively few species.
Various distributional patterns can currently be observed among eukaryotes. On a large phylogenetic scale, we make two observations: (i) a wide distribution of DIRS1-like elements among groups such as deuterostomes, with the detection of copies in a wide range of higher taxa from Echinodermata to Sauropsida; and (ii) a broad distribution of the DIRS1-like elements in certain taxa despite a lack of detection in closely related taxa. In fungi, all three Mucoromycotina genomes were found to harbor DIRS1-like elements, whereas none could be detected in Ascomycota and Basidiomycota. On a smaller phylogenetic scale (i.e., within a higher taxon), the distribution again appears to be taxon-dependent, with three distinguishable patterns. First, as described above, some groups seem to possess no DIRS1-like retrotransposons (e.g., mammals and streptophytes). Second, a broad distribution of DIRS1-like elements was observed in some groups, such as Actinopterygii and Mucoromycotina (detection in all 5 and 3 genomes tested, respectively). Finally, a sparser distribution of DIRS1-like elements was observed in yet other groups. Only 3 of the 22 hexapod species tested harbor well-conserved elements. However, this heterogeneous distribution could result in part from a sampling bias. We observed a lack of elements in some overrepresented taxa, such as Diptera (absence of detection in 16 Drosophila species tested), and an abundance in others, such as Hymenoptera (in three wasp and five ant species). Indeed, we used ReDoSt to analyze the recently released ant genomes, all of which harbor DIRS1-like elements. Five copies were found in Camponotus floridanus, 22 in Pogonomyrmex barbatus, 37 in Harpegnathos saltator, 41 in Linepithema humile, and 57 in Solenopsis invicta [27-31].
Figure 3. Rooted phylogenetic tree based on the pol amino acid sequences of the DIRS1-like families identified. Distances are calculated with the JTT model plus a gamma distribution correction for amino acids. The tree is constructed using the Neighbor Joining method and the pairwise deletion of gaps option included in the MEGA5.0 software. When possible, one representative copy sequence that required only minor corrections was integrated into our analysis for each family. Reference elements are labeled with an asterisk and clusters that correspond to an element annotated in this study are written in bold italics. If a reference element was included in a family, this sequence was chosen to represent the family. In the cases of Anolis carolinensis, Daphnia pulex and Xenopus tropicalis, species that show a high family number, only four or five of their most abundant families were integrated. Ty3/Gypsy element sequences were used as outgroups according to the close relationships of their reverse transcriptase and RNase H domains with those of DIRS1-like and PAT retrotransposons. Support for individual groups was evaluated with non-parametric bootstrapping using 100 replicates. Only bootstrap node values over 50% are represented.
The previous view that DIRS1-like retrotransposons are uncommon among eukaryotes appears to be strongly biased, considering that ascomycetes, mammals and green plants, which are devoid of elements, represent more than 55% of the sequenced genomes. DIRS1-like elements do not appear as ubiquitous as Ty1/Copia and Ty3/Gypsy retrotransposons, but their distribution among eukaryotes appears more comparable to the Penelope element distribution [12,13]. Despite their loss in several lineages, the phylogenetic analysis and the distribution of DIRS1-like elements in a very broad range of unikonts indicate that their genomic invasion occurred early in unikont evolution: at least prior to the Bilateria radiation, and probably earlier if we take into account the presence of DIRS1-like retrotransposons in Amoebozoa and Fungi (Figure 4). This primary invasion could be found to have occurred even earlier in evolution if the presence of DIRS1-like elements is confirmed in Excavata, Plantae and Chromalveolata. Though our results unequivocally indicate the presence of DIRS1-like elements in Unikonta, we must be cautious in our estimation of their real distribution in Excavata and Plantae, because most of the copies identified in these taxa harbor too many indels and frameshifts for their repeated sequence structures to be studied and for them to be included in the phylogenetic analysis. The presence of DIRS1-like elements in these species is only supported by similarity search analyses.
The absence of DIRS1-like elements in several groups may reflect their differential success in adapting to different host species and/or a propensity for stochastic loss during evolution. Nevertheless, this absence has to be confirmed in the future by investigations of deleted DIRS1-like copies in these genomes. The detection of deleted copies in an apparently "unoccupied" species would be evidence of the previous existence of well-conserved DIRS1-like elements.

In-depth characterization of new DIRS1-like elements
To describe the diversity within the DIRS1-like superfamily, we detailed the structure of 28 new elements, most of which represent high copy number families or species newly characterized for the presence of such retrotransposons (e.g., A. californica and L. gigantea). Several features of DIRS1-like retrotransposons are presented in Table 3, such as their length, the presence of a long ORF overlap, and the structure of their repeats. The length of DIRS1-like retrotransposons varies among the 28 elements from 3974 bp in AcasDIRS1 (Acanthamoeba sp.) to 6283 bp in SkoDIRS2 (S. kowalevskii), with an average length of 5160 bp. In-depth annotation including the positions of the repeated sequences and several conserved motifs is provided in Additional File 3. The pol motifs seem to be highly conserved, especially the 'YL/IDD' motif that is conserved in 25 of the 28 annotated elements. The 'HSTR' tyrosine recombinase motif appears more variable (only harbored by 13 of the 28 elements). For example, AmDIRS2 and MciDIRS1 harbor an 'SDLK' and an 'LCPV' sequence, respectively. This suggests that the catalytic tyrosine recombinase-encoding domain sequence could be less constrained than the pol sequence. Twenty-three of the elements begin and end with a trinucleotide NTT, the most frequent being ATT (Table 3). Only AmDIRS1 from A. macrogynus begins and ends with an uncommon GC-rich motif. In almost all elements, this trinucleotide appears complementary to the 3-bp circular junction. Evidence of long ORF overlaps was found in half of the 28 DIRS1-like elements, and seems to depend on the host species (e.g., evidence in the five elements from Fungi and none in Mollusca).
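The relationship between the terminal trinucleotides and the circular junction can be expressed as a short consistency check. The sketch below is illustrative only: the function names and the toy sequence are hypothetical, and "complementary" is assumed to mean reverse-complementary when both sequences are read 5' to 3' on the same strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def terminal_trinucleotides(element):
    """First and last three nucleotides of an element sequence."""
    seq = element.upper()
    return seq[:3], seq[-3:]


def termini_match_junction(element, circular_junction):
    """True if both termini equal the reverse complement of the junction."""
    left, right = terminal_trinucleotides(element)
    return left == right == reverse_complement(circular_junction)


toy_element = "ATT" + "GCGC" * 5 + "ATT"  # made-up sequence, not a real element
print(terminal_trinucleotides(toy_element))
print(termini_match_junction(toy_element, "AAT"))
```

For real elements the junction would first have to be located as the 3-bp overlap of the two ICRs, a step that this sketch does not attempt.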
Previous studies have outlined the structure of DIRS1like retrotransposons, especially the nature of their termini, which complement the Internal Complementary Regions (ICRs), and the presence of a right Extension sequence (rE) [3]. Looking in detail at the repeated sequences "lITR-lICR-rICR-rITR-rE" in these elements allowed us to reveal a rather more complex structure ( Figure 5). Whereas previous studies only allowed the description of a rE sequence, we have characterized an equivalent left Extension sequence (lE) at the 5'-end of some elements, which is only complementary to the left ICR. The identification of this additional lE sequence does not challenge the replication model that proposes a rolling-circle intermediate. This intermediate is produced by the 3-bp circular junction that corresponds to the overlap of the two ICRs complementary to the 5'and 3'-ends of the element [3][4][5]. All elements harbor at least one extension, and, like DIRS1, most elements contain only a rE. The lE region has only been detected in fungi and amoebozoa species. Two elements show only a lE (AcasDIRS1, AmDIRS1) and four other elements harbor the two extensions (e.g., AmDIRS2, MciDIRS1). We hereby propose to redefine the fine structure of the DIRS1-like element's termini (Figures 5  and 6). In this study we call the left and right termini (lTer and rTer) the assembly of the two components: the ITRs and their respective potential extension (lE or rE). The lE and rE regions are considered the external sequences of the termini that are only complementary to their respective ICR sequences (theoretically 100% sequence identity). The ITRs are defined as the parts of these terminal sequences that are mostly complementary to each other. On a smaller scale, two parts can be distinguished within these ITRs ( Figure 6). In the conserved ITR part, the two ITRs are strictly complementary to each other. 
In the divergent ITR part, the two ITR sequences are mostly constrained by their respective ICR and remain only partially complementary to each other, with a sequence identity that varies from 50% to 85%. ITR length appears highly variable among the different elements, ranging from 66 bp (LgDIRS1) to 316 bp (DIRS1-2). Likewise, the length of the ICRs varies between 85 bp for the sum of the two ICRs in AcasDIRS1 and 130 bp in AcaDIRS1. The right extensions vary from 9 bp to 75 bp (being apparently shorter in the presence of a lE). In most cases, the sizes of the various repeats are conserved among the different elements from the same species (e.g., among AcDIRS1 and AcDIRS2, or AcaDIRS1 and AcaDIRS2). The conserved ITR usually represents the largest part of the ITR, ranging from 31 bp to 304 bp (Table 3), whereas the divergent ITR is often small, ranging from 9 bp to 36 bp. However, in some elements from molluscs both parts have about the same size. Interestingly, the boundary between these two ITR parts is composed of a short sequence of at least 10 nucleotides that is conserved in the two ITRs and the two ICRs (Figure 6), which may be involved in the formation of the circular intermediate of the element before its integration.
Table legend: The element size and the trinucleotide sequences beginning and ending the element, which are complementary to the circular junction, are given for each manually annotated DIRS1-like element. Evidence of long ORF overlap is also indicated. The lengths of the different parts of the termini (the divergent and the conserved ITRs, lE and rE) as well as those of the ICRs are reported. nd: not determined because CaspDIRS1 corresponds to a chimeric sequence. Each newly identified element has been submitted to Repbase.
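The distinction between the conserved and the divergent ITR parts can be illustrated with a short Python sketch that measures how complementary two termini are. It assumes that "complementary" means reverse-complementary on the same strand, as expected for inverted repeats, and that the two sequences have already been trimmed to the same ungapped length. The function names and the toy sequences are hypothetical and are not taken from the annotated elements.

```python
# Minimal sketch: identity between one ITR and the reverse complement of the other.
# All names and sequences are illustrative assumptions, not data from the study.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def itr_complementarity(left_itr, right_itr):
    """Identity between the left ITR and the reverse complement of the right ITR."""
    return percent_identity(left_itr, reverse_complement(right_itr))
```

Under these assumptions, a conserved ITR part would score 100%, whereas a divergent part would fall in the 50% to 85% range reported above.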

Conclusions
In this study, we developed a new computational tool, ReDoSt, allowing us to describe more precisely the distribution of DIRS1-like retrotransposons as well as their diversity among eukaryote genomes. These elements appear more continuously distributed than previously thought, with 8 new higher taxa (e.g. Mollusca) and 14 new eukaryote species found to harbor them, giving a total of 61 species containing DIRS1-like elements in their genome. The current understanding of the distribution of DIRS1-like elements in Eukaryota, and especially Unikonta, suggests the presence of DIRS1-like elements in the last common ancestor of eukaryotes. Whereas some higher taxa seem clearly devoid of well-conserved DIRS1-like retrotransposons (e.g., ascomycetes, mammals and streptophytes), these elements appear highly conserved in some other higher taxa, such as Actinopterygii and Mucoromycotina. Now that a large diversity of elements within the DIRS1-like superfamily (around 300 different families) has been characterized, it is possible to screen sequence datasets for the presence of DIRS1-like elements using more conventional approaches like RepeatMasker. This large diversity allowed us to study the phylogenetic relationships within the DIRS1-like superfamily, in which the different groups appear related to the host species. All of the elements included in the phylogenetic analysis as well as the subset of 28 annotated elements were used to define two new alignment profiles for each of the three characteristic domains of the DIRS1-like retrotransposons: reverse transcriptase, methyltransferase and tyrosine recombinase. These profiles could be used in further studies or in future automatic annotation of transposable elements within genomes (Additional file 4).

Data collection
The 274

Identification of DIRS1-like retrotransposons
We propose a new computational tool for DIRS1-like retrotransposon identification, ReDoSt (Additional file 4, updates available at http://wwwabi.snv.jussieu.fr/public/ReDoSt/), based on both similarity searches of domains and their organization in the element structure. The similarity searches were performed using the RPS-BLAST and PSI-BLAST programs [32] with an E-value cutoff of 0.01 and specific alignment profiles for each domain. This method, in comparison with BLAST or RepeatMasker approaches, may be more permissive and thus allow for the identification of more divergent elements. For example, using this method we identified 21 DIRS1-like copies in the A. macrogynus genome, whereas only 16 well-conserved elements (i.e. simultaneous detection of the RT, MT and YR domains) were detected using RepeatProteinMask and the RepeatPeps library (included in the RepeatMasker package). We used three different profiles whose positions within the element are shown in Figure 1.
Our automatic detection tool is composed of six main steps (Figure 7): (1) identification of all putative reverse transcriptase-encoding fragments within the genome; (2) extraction of each genomic hit with 5-kb flanking sequence on both sides, because all DIRS1-like elements described to date are less than 6 kb in length; (3) tyrosine recombinase-encoding domain search and (4) methyltransferase-encoding domain search within each genomic fragment retained; (5) after obtaining the 10-kb contigs that harbor the three characteristic domains (RT, YR and MT) of DIRS1-like retrotransposons, a check of the co-orientation and the order of these domains to discriminate other types of YR-encoding retrotransposons (e.g., Ngaro and PAT elements); (6) finally, fragments that harbor at least two occurrences of the same domain were set aside for copy number estimation, sequence alignments, and supplementary investigations required to determine from which rearrangements (duplications or insertions) they are derived. Such a fragment was then considered a single copy of a DIRS1-like element in copy number estimation.
Figure 7 ReDoSt pipeline developed in this study for the identification of DIRS1-like retrotransposons. The input is a genome in FASTA format. To assess the efficiency of each step of the pipeline, we detailed the number of fragments retained after each step for the genome of the fungus Rhizopus oryzae.
We repeatedly observed a bottleneck between the first and fourth steps for all of the genomes tested (see the R. oryzae example in Figure 7). We chose to be less stringent in the first step by using an alignment profile designed from a large diversity of elements, one third of which are other types of tyrosine recombinase-encoding retrotransposons, together with one Gypsy element. As a consequence, many of the reverse transcriptase-encoding fragments identified may belong to other retrotransposon superfamilies. Analyses were performed on an iDataPlex Linux system (CPU 2.53 GHz, 3 GB memory).
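Steps 2, 5 and 6 of the pipeline can be sketched in Python as follows. This is a minimal illustration and not the ReDoSt code itself: the `Hit` record, the function names, the 0-based half-open coordinates and the domain order supplied by the caller are all assumptions made for the example. Because the order check is ambiguous when a domain occurs twice, the sketch tests for duplicated domains first, which differs slightly from the step order given above.

```python
from collections import Counter, namedtuple

# Hypothetical record for one domain hit located on an extracted fragment.
Hit = namedtuple("Hit", "domain start end strand")

FLANK = 5000  # 5-kb flanking sequence on each side of the RT hit (step 2)


def extraction_window(hit_start, hit_end, contig_length, flank=FLANK):
    """Step 2: coordinates of the RT hit plus its flanks, clipped to the contig."""
    return max(0, hit_start - flank), min(contig_length, hit_end + flank)


def classify_fragment(hits, expected_order):
    """Steps 5 and 6 for the hits of one fragment.

    Returns 'set_aside' (a domain occurs at least twice), 'retained'
    (all domains present once, co-oriented, in the expected order)
    or 'rejected'. `expected_order` is supplied by the caller (assumption).
    """
    wanted = [h for h in hits if h.domain in expected_order]
    counts = Counter(h.domain for h in wanted)
    if any(counts[d] == 0 for d in expected_order):
        return "rejected"  # one characteristic domain is missing
    if any(counts[d] > 1 for d in expected_order):
        return "set_aside"  # step 6: duplicated domain, examined separately
    strands = {h.strand for h in wanted}
    if len(strands) != 1:
        return "rejected"  # step 5: domains are not co-oriented
    observed = [h.domain for h in sorted(wanted, key=lambda h: h.start)]
    if strands == {"-"}:
        observed.reverse()  # read minus-strand elements 5' to 3'
    return "retained" if observed == list(expected_order) else "rejected"
```

For instance, `classify_fragment(hits, ("RT", "MT", "YR"))` would retain a fragment whose three domains lie on the same strand in that order; the order used here is only an assumption for illustration.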

Sequence analysis
Families of DIRS1-like elements were identified by clustering all the nucleotide reverse transcriptase-encoding fragment sequences detected in the 32 species with the MCL program (http://www.micans.org/mcl/, [33]). Reference elements previously described and/or deposited in Repbase version 14.06 were also added to the dataset. This method was used in previous studies on IS transposons [34,35]. An E-value cutoff of 0.01 was used for the initial BLASTN search, and an inflation factor of 1.2 was used to cluster the sequences. These values are effective at least in splitting elements of different previously defined DIRS1-like families (e.g., DrDIRS1, DrDIRS2 and DrDIRS3 in D. rerio [5]). Because clustering results can depend on the dataset used, we tested two different approaches: an independent clustering of the elements within each tested genome and a global clustering of all elements from all species. Similar results were obtained regardless of the approach used (data not shown), suggesting that the clusters obtained are well supported.
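To make the role of the inflation factor concrete, the following Python sketch reimplements the core of Markov clustering (alternating expansion and inflation of a column-stochastic matrix) on a small similarity matrix. It is a toy reimplementation for illustration only and not the MCL program used in the study: the function names, the self-loop weighting, the convergence tolerance and the input matrix are assumptions, and a real analysis would derive the edge weights from the BLASTN hits.

```python
def _normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl_clusters(weights, inflation=1.2, max_iter=200, tol=1e-9):
    """Cluster nodes from a symmetric matrix of non-negative similarities.

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(weights)
    if n == 0:
        return []
    m = [[float(weights[i][j]) for j in range(n)] for i in range(n)]
    # Add a self-loop to every node (assumption: weight of its strongest edge).
    for j in range(n):
        off_diagonal = [m[i][j] for i in range(n) if i != j]
        m[j][j] = max(off_diagonal, default=0.0) or 1.0
    m = _normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: one matrix squaring spreads flow along longer paths.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then renormalization, favors strong flows.
        inflated = _normalize_columns([[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each non-empty row lists the nodes attracted to that row's node.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)
```

A low inflation value such as 1.2 sharpens the flow only mildly at each iteration and therefore tends to give coarse clusters, while larger values tend to split the graph more finely.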
To perform the element annotation, we preferentially selected elements from species in which DIRS1-like retrotransposons were not previously reported or from families showing high copy number. The repetitive structures (ICRs and ITRs) were detected using UGENE (http://ugene.unipro.ru/index.html). When several copies of a family were available for one species, the boundaries of the ITRs were analyzed manually and the flanking regions were identified in multiple nucleotide sequence alignments built with MUSCLE [36]. To check for the presence of ORF overlaps, we used the ORF Finder tool (http://www.ncbi.nlm.nih.gov/projects/gorf/).
For phylogenetic analyses, one sequence from each family was included, chosen so that its pol sequence required no or only minor corrections (no large indels or multiple frameshifts). The multiple alignments of the amino acid pol sequences were performed with MUSCLE and ambiguously aligned sites were removed using Gblocks [37]. Phylogenetic analyses were conducted using the neighbor-joining (NJ) method and the pairwise deletion option of the MEGA5.0 software [38]. The best-fit model, the JTT model [39] with gamma distribution, was selected with Topali2 software [40] and support for individual groups was evaluated with non-parametric bootstrapping [41] using 100 replicates.
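The non-parametric bootstrap mentioned above resamples alignment columns with replacement and rebuilds a tree from each pseudo-alignment. The resampling step alone can be sketched in Python as follows. The function name, the dictionary-based alignment format and the fixed random seed are assumptions made for the example, and tree reconstruction itself (performed with MEGA in the study) is not shown.

```python
import random


def bootstrap_replicates(alignment, replicates=100, seed=0):
    """Yield pseudo-alignments built by resampling columns with replacement.

    `alignment` maps a sequence name to its aligned sequence; all sequences
    must have the same length. Each column is kept intact across sequences.
    """
    names = list(alignment)
    if not names:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)  # fixed seed so that the example is reproducible
    for _ in range(replicates):
        columns = [rng.randrange(length) for _ in range(length)]
        yield {name: "".join(alignment[name][c] for c in columns) for name in names}
```

The support for a group is then the percentage of replicate trees in which that group is recovered.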

Description of additional data files
The following additional data are available with the online version of this paper. Additional data file 1 contains two histograms representing the distribution of the domain sizes for the elements detected in A. carolinensis and S. kowalevskii. Additional data file 2 contains histograms of the distribution of family size in several species. Additional data file 3 provides a table listing features of the 28 annotated DIRS1-like elements. Additional data file 4 is a mini-website providing access to the ReDoSt pipeline, to the different alignment profiles and to the DIRS1-like sequences used to design them. Additional data file 5 is a list reporting the data source for all species tested.

Additional material
Additional file 1: Domain size distributions for the elements detected in A. carolinensis (A) and S. kowalevskii (B). The histogram represents the number of element domains detected (y-axis) as a function of their length (x-axis). The reverse transcriptase fragments are represented in blue, the methyltransferase fragments in red, and the tyrosine recombinase fragments in yellow.
Additional file 2: Distribution of family size. Families are arranged along a gradient of decreasing size. For each species, mean family size and standard deviation are given. X-axis: family rank, Y-axis: number of elements in the family.
Additional file 3: Annotation of the 28 DIRS1-like elements described. For each element, positions of the repeated sequences within elements, the tyrosine recombinase and pol conserved motifs (reverse transcriptase (RT), RNase H (RH) and methyltransferase domains), and the end of the putative pol region are reported. The position of each element within the genome sequences is also provided.
Additional file 4: ReDoSt pipeline and alignment profiles used in this study.
Additional file 5: List of all species tested. For each species, the acronym used during the study and the data source website are indicated.