Analysis of a comprehensive dataset of diversity generating retroelements generated by the program DiGReF
© Schillinger et al.; licensee BioMed Central Ltd. 2012
Received: 16 May 2012
Accepted: 18 August 2012
Published: 28 August 2012
Diversity Generating Retroelements (DGRs) are genetic cassettes that can introduce tremendous diversity into a short, defined region of the genome. They achieve hypermutation through replacement of the variable region with a strongly mutated cDNA copy generated by the element-encoded reverse transcriptase. In contrast to “selfish” retroelements such as group II introns and retrotransposons, DGRs impart an advantage to their host by increasing its adaptive potential. DGRs were discovered in a bacteriophage, but since then additional examples have been identified in some bacterial genomes.
Here we present the program DiGReF that allowed us to comprehensively screen available databases for DGRs. We identified 155 DGRs which are found in all major classes of bacteria, though exhibiting sporadic distribution across species. Phylogenetic analysis and sequence comparison showed that DGRs move between genomes by associating with various mobile elements such as phages, transposons and plasmids. The DGR cassettes exhibit high flexibility in the arrangement of their components and easily acquire additional paralogous target genes. Surprisingly, the genomic data alone provide new insights into the molecular mechanism of DGRs. Most notably, our data suggest that the template RNA is transcribed separately from the rest of the element.
DiGReF is a valuable tool to detect DGRs in genome data. Its output allows comprehensive analysis of various aspects of DGR biology, thus deepening our understanding of the role DGRs play in prokaryotic genome plasticity, from the global down to the molecular level.
Keywords: DGR; Diversity-generating retroelement; Targeted mutagenesis; Prokaryote evolution; Horizontal gene transfer; Reverse transcriptase; DiGReF
Living organisms utilize many mechanisms to ensure fidelity of replication and to reduce the mutation rate. However, in some circumstances, an increased mutation rate can be beneficial. In particular, pathogenic organisms are often subjected to selection for diversity to overcome host defenses and/or increase host range. For example, mutator mutants lose the mismatch repair system, which affects the entire genome. Alternatively, changes in the copy number of simple repeats at bacterial contingency loci can generate high frequencies of mutations in particular genes, but result in a limited range of potential mutations. Diversity generating retroelements (DGRs) can generate a much greater range of localized diversity. The first DGR was discovered in a Bordetella phage, where it affects tail fibers and, thus, host range. Since then, DGRs have been discovered in a variety of phage and bacterial systems[4–6].
The DGR characteristics described above have been mainly derived from investigation of a single element, the Bordetella phage DGR, and supported by sequence comparison with a small number of related elements. However, a systematic and comprehensive assessment of the prevalence, distribution, and structure of these retroelements has been lacking. In this paper, we present a Perl program that identified 155 potential DGRs in public DNA sequence databases, the largest set described so far. Having subjected this dataset to careful quality control, we used it to examine several aspects of DGR mechanism and evolution. We found that DGR cassettes have a rather homogenous length of 2–5 kb, but are highly tolerant to permutations of their components and expansion with up to three additional VRs. TR and VR can even be on a different DNA strand to the corresponding RT gene. Thus, unlike in group II introns and retrotransposons, the RT mRNA and the template RNA are not necessarily the same molecule. DGR RTs, though highly divergent, form a phylogenetic clade that is characterized by a (I/V/L)GxxxSQ motif in RT domain 4. This motif seems largely necessary and sufficient to predict DGR association and may explain the observed restriction of mutagenesis to adenine bases. DGRs can be found in all major classes of bacteria, but exhibit sporadic phylogenetic distribution. Several lines of evidence point to horizontal gene transfer as the main propagation mechanism of DGRs. However, DGRs do not use a single vector for their dispersal, but “hitchhike” with various mobile elements, e.g. phages, transposons and plasmids.
Results and discussion
DiGReF reliably identifies potential DGRs
The sequences in the NCBI nr protein database were subjected to a psi-blast search to identify sequences that potentially encode RT enzymes, which yielded 2651 hits. DNA sequences 5000 bp up- and downstream of each RT were extracted and analyzed with the DGR-finder program DiGReF (Additional file 1). The algorithm slides a window (default size 50 nucleotides) along the extracted sequence and searches the complete sequence for repeats of the window sequence. To account for the characteristics of DGRs, all non-A bases in the window have to match exactly, whereas the adenines in the window do not have to match. When such a hit is found, it is extended to yield the maximum-length sequence in which all non-A bases match. The program designates the sequence derived from the search window as the template repeat (TR), and the mutated repeat as the variable region (VR).
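The matching rule described above can be illustrated with a short Python sketch. It is not the DiGReF source (which is written in Perl and supplied as Additional file 1); the function names, the brute-force scan and the non-overlap check are assumptions made here for illustration only.

```python
def a_tolerant_match(template_base: str, target_base: str) -> bool:
    """An adenine in the template may face any base; all other bases must be identical."""
    return template_base == "A" or template_base == target_base


def find_repeats(seq: str, window: int = 50) -> list:
    """Return sorted (tr_start, vr_start, length) tuples of maximal A-tolerant repeats.

    Coordinates are 0-based. The scan is quadratic in the sequence length,
    which is acceptable for an illustration but slow for long inputs.
    """
    seq = seq.upper()
    n = len(seq)
    hits = set()
    for i in range(n - window + 1):            # candidate TR window
        for j in range(n - window + 1):        # candidate VR position
            if abs(i - j) < window:            # skip self-hits and overlapping windows
                continue
            if not all(a_tolerant_match(seq[i + k], seq[j + k]) for k in range(window)):
                continue
            tr, vr, length = i, j, window
            # Extend to the left while the A-tolerant criterion still holds.
            while tr > 0 and vr > 0 and a_tolerant_match(seq[tr - 1], seq[vr - 1]):
                tr, vr, length = tr - 1, vr - 1, length + 1
            # Extend to the right in the same way.
            while (tr + length < n and vr + length < n
                   and a_tolerant_match(seq[tr + length], seq[vr + length])):
                length += 1
            hits.add((tr, vr, length))
    return sorted(hits)


# Toy example with a 12 nt window (hypothetical sequences, not taken from the dataset).
toy_tr = "GCATGACTAGCA"
toy_vr = "GCTTGCCTGGCA"   # differs from toy_tr only at three of its four adenines
toy_seq = "TT" + toy_tr + "CCCCGGGG" + toy_vr + "GG"
toy_hits = find_repeats(toy_seq, window=12)
```

Note that the rule is asymmetric: the pair is reported with the adenine-containing repeat as TR, whereas the reverse assignment fails because the VR carries non-A bases where the TR has adenines.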
To eliminate artifact hits such as low complexity repeats or sequences that are a result of recent gene duplication events, only repeats with at least 10 adenines in the TR and at least 7 A → B substitutions in the VR (B = G, C or T) were considered. TR sequences with fewer than 10 potential mutation sites would only be able to provide a diversity of at most 2.6 × 10⁵ possible VR sequences (4⁹ for nine adenines, each of which can be replaced by any of the four bases). Due to the logarithmic correlation between repertoire size and the probability of finding a protein of the desired properties in a repertoire [9, 10], repeats with low diversification potential are more likely artifacts (e.g. group II intron RTs associated with random repeat-like sequences) than efficient DGRs. Manual inspection of samples confirmed this assumption.
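The filter thresholds and the diversity estimate can be made concrete with the following Python sketch. It assumes an already aligned, equal-length TR/VR pair and that each adenine site can hold any of the four bases; the function names are hypothetical and do not come from DiGReF.

```python
def count_adenine_sites(tr: str, vr: str) -> tuple:
    """Return (adenines in TR, A->B substitutions in VR) for an aligned TR/VR pair."""
    if len(tr) != len(vr):
        raise ValueError("TR and VR must be aligned to the same length")
    tr, vr = tr.upper(), vr.upper()
    adenines = tr.count("A")
    substitutions = sum(1 for t, v in zip(tr, vr) if t == "A" and v != "A")
    return adenines, substitutions


def passes_filter(tr: str, vr: str, min_adenines: int = 10, min_substitutions: int = 7) -> bool:
    """Keep a repeat only if it meets both default thresholds stated in the text."""
    adenines, substitutions = count_adenine_sites(tr, vr)
    return adenines >= min_adenines and substitutions >= min_substitutions


def max_vr_diversity(n_adenines: int) -> int:
    """Upper bound on distinct VR nucleotide sequences: four possible bases per adenine site."""
    return 4 ** n_adenines


# Nine adenines give 4**9 = 262144 sequences, i.e. about 2.6 x 10^5.
diversity_below_cutoff = max_vr_diversity(9)
```

Lowering `min_substitutions` to 5 corresponds to the more sensitive but less stringent setting discussed in the text.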
With these criteria, 155 of the 2651 RT hits could be identified as containing DGR-like repeat structures, 126 of which had not been previously described (Additional file 2A). VR/TR hits were overwhelmingly found associated with RTs that are highly similar to known DGR RTs. To explore more distantly related RT sequences, we performed further iterations of psi-blast with newly found DGR RTs that were more dissimilar to the “standard” DGR RTs. However, we did not detect additional DGRs, and we are thus confident that we have reached saturation and that our dataset is comprehensive. This strategy also served as a test to assess the possibility that VR/TR-like repeats are widely abundant in genomic sequences and thus also often found in the vicinity of RT genes by chance. Using a cut-off of seven A substitutions, we did not find such fortuitous repeats. However, lowering the cut-off to five A exchanges resulted in 41 additional hits, which upon manual inspection seemed to be mostly false positives (Additional file 2B). Still, six of these hits match the known characteristics of DGRs, but are lost in the higher cut-off setting. Depending on the objective of the user, it is thus possible to emphasize either the detection sensitivity or the stringency of the DiGReF program by adjusting the cut-off values. In this paper, we wanted to avoid as many false positive hits as possible and thus carried out further analyses with a cut-off of seven A substitutions. The few false positives and false negatives that remained are discussed later in the text.
In addition to the coordinates and sequences of the VR/TR pairs, the program also delivers an alignment of the repeats, statistical data on the adenine exchanges, and an annotation file that can be opened in a sequence viewer such as Artemis to visualize the DGR structure (Additional file 3). Due to its modular nature, the software can be easily adapted and expanded to address other questions that might arise while DGRs are being studied in more detail.
DGRs are ubiquitous among prokaryotes
Table: Phylogenetic distribution of DGRs (columns: Sequenced genomes on NCBI [%]; Hits in our dataset [%])
DGRs use different mechanisms to transfer between species
A different transfer strategy was used in the case of the DGRs from Vibrio sp. RC586 (GI 262403399) and Shewanella baltica OS155 (GI 126090247). These DGR cassettes show an overall sequence identity of 93%. In Shewanella, the element is located on plasmid pSbal02, which itself can potentially be exchanged between organisms. Moreover, the DGR is close to a transposase/integrase gene, which may be responsible for mobilization of the whole element. In Vibrio sp. RC586, the DGR is located in the vicinity of Tn7-type sequences, exactly at the position where Tn7 usually carries antibiotic resistance genes. DGRs therefore may be mobilized by transposons and might even co-opt the same integron system that appears to exchange resistance markers in complex transposons.
We could also observe transfer events encompassing DGRs in Bacteroides species. These human gut bacteria are known to have a very plastic genome and a plethora of autonomous and non-autonomous mobile elements that are transferred mostly by conjugation. They also carry numerous DGRs that cluster in two separate clades according to the RT phylogeny (Figure 2). For example, Bacteroides sp. 1_1_14 harbors two DGRs, one each from the two Bacteroidetes clusters in the phylogenetic tree (Additional file 4). One of these DGRs is located on a 42.6 kb fragment that is 99% identical to a B. ovatus 3_8_47FAA sequence, but the flanking sequences display 95–99% identity with the genome from B. thetaiotaomicron VPI-5482. The nature of the 42.6 kb fragment is not clear. A direct blastn query does not result in significant hits, but one of the encoded proteins is homologous to transposition proteins, thus suggesting a conjugative transposon or a transposable phage as shuttle for the DGR element.
These examples suggest that DGRs do not have a dominant mode of interspecies transmission. They can use bacteriophages, plasmids and transposons for dispersal. The selective advantage they provide to the host should help to maintain them in those gene transfer vectors.
DGR reverse transcriptases form a distinct and well-defined clade characterized by an SQ dipeptide
The average length of 378 aa was in line with the 377 aa reported previously as the average DGR RT length. An alignment of all identified DGR RTs showed a clear organization into seven conserved domains (Additional file 6) that had been described before for a smaller subset. Following region 7, we noticed a patch of 20 amino acids that is highly positively charged (often 50–60% R and K residues, Additional file 7). Although there is no distinct pattern discernible in the arrangement of charged residues, this positively charged region appears to be unique to DGR RTs. This C-terminal region is likely involved in nucleic acid binding, for example in template recognition.
Apparent exceptions from the DGR RT clade are often artifacts
Twenty-eight of the input sequences that include the SQ motif did not produce hits with DiGReF (Additional file 2C). Apart from six RTs that are clearly truncated and thus not functional, they are most likely false negatives. They are apparently intact, featuring SQ and YxDD motifs, but are not associated with obvious TR/VR repeats. Many are located on short contigs of less than 5 kb, or very close to the end of a contig. Thus, the complete DGR sequence is not included in the program input, making it impossible to identify VR/TR repeats. In at least two of the remaining sequences (Figure 4D), we could manually identify repeats that include several non-A mutations and therefore cannot be identified by DiGReF. They may be “sloppy” elements that are still active with a reduced A-specificity, but the high number of non-A mutations could also suggest that these hits represent DGR elements that are no longer functional. Selective pressure for high DGR activity might decrease once the target protein is well adapted to its function, thus fixing it in the genome with a normal mutation rate. Decreasing the window size for repeat scanning would allow detection of such “sloppy” repeats, but would also lead to more false positive hits.
The SQ motif may be responsible for RT mediated mutagenesis
Our comprehensive search showed that DGRs are only found within the subset of RTs that cluster with already known DGR RTs. Considering that DGR cassettes are otherwise highly diverse in structural organization, accessory proteins and VR-ORFs (see below), this monophyletic origin means that the RT function in DGRs cannot easily be replaced by another bacterial (group II intron) or viral RT, arguing for the involvement of the RT in diversification. If host factors were responsible for editing the RNA or cDNA, the high plasticity of bacterial genomes would make RT swaps quite likely. The exclusive association with the SQ-clade of RTs prompted us to analyze the most highly conserved regions 4 and 5 for possible structure/function relationships.
The catalytically essential aspartate residues of reverse transcriptases are located in domain 5. In DGR RTs, they are part of a YxDD motif. While the two aspartates are 100% conserved, the tyrosine is replaced by phenylalanine in 14% of the cases. The second position is less conserved: 53% of the DGR RTs have an M at this position, 33% a V, and the remaining entries feature the small non-polar amino acids A or C (Figure 5). In HIV RT, the corresponding M184V mutation has a strong influence on the fidelity of the reverse transcriptase [16, 17]. Therefore we analyzed whether DGRs that carried a V instead of an M in domain 5 displayed an altered mutation pattern, but we did not find significant differences either in the overall mutation rate (mutated adenines per total number of adenines in the TR) or in the distribution of mutated nucleotides (data not shown).
The almost exclusive appearance of the SQ motif in region 4 of DGR RTs suggests a mechanistic connection between these amino acid residues and the function of DGRs. Since the unique feature of DGRs is adenine-specific mutagenesis, we hypothesize that the SQ motif plays a vital role in defining RT fidelity. The crystal structure of HIV RT in complex with a DNA template:primer and a dNTP has suggested that domain 4 (which comprises the QGxxxSP motif) participates in binding and selection of the incoming nucleotide as well as template coordination near the active site. In HIV RT, mutation of Q151 changes the discrimination between rNTPs, dNTPs and ddNTPs, the activity on DNA and RNA templates, and the fidelity of the polymerase[19–21]. P157, which corresponds to the Q in region 4 of DGR RTs, is considered part of the template grip; mutations in this residue also affect nucleotide incorporation patterns[22, 23]. Thus mutations in motif 4, coordinating both template and incoming dNTPs, seem ideally poised to modify RT fidelity. As DGR RTs only have relaxed fidelity at As, changes in the binding pocket may specifically modify the interaction with adenine residues. For example, flipping out the template bases in the active site, a process that has been observed in many polymerases[24, 25] may be disturbed in DGR RTs. The geometry of template adenine coordination may be altered to make the enzyme more welcoming for non-complementary incoming nucleotides. Thus, region 4 may be responsible for misincorporations in the resulting cDNA which could lead to the observed A → B mutations on the VR coding strand.
Nucleotide substitutions are essentially random
It seems likely that host factors or different selectivity of individual RTs might influence the choice of substituted nucleotides. If there were no such selectivity, the average number of changes to each nucleotide in a VR would be proportional to the number of A-residues in the TR. The observed results were compared with these predictions using χ² tests. Significant differences were found for G-residues (p = 0.00008) and T-residues (p = 0.0057), but not for C-residues (p = 0.13). This suggested some influence of host factors or RT specificity on exchange preferences. To investigate this further, we selected a group of 15 DGRs that carry closely related RTs and are found in one genus, the Bacteroides (see Figure 2). We analyzed whether these DGRs exhibit a stronger or more homogeneous substitution bias. However, we found comparable distribution patterns with equally high variability as in the complete dataset (Figure 6B). Even individual VRs within a DGR containing multiple VRs (see below) can have drastically different exchange patterns (data not shown). Thus, our data argue against a strong structural or enzymatic bias for nucleotide selection opposite As in the RNA template. However, the genomic sequences are a snapshot of DGR mutagenesis biased by selective pressure. It is possible that in addition to functional selection bias, differences in %G + C-content and thus in codon usage could lead to different biases in different classes of organisms. Such questions might be partly addressed by examining the effects of the mutations on the codon affected. However, to truly understand the underlying mechanism, individual DGRs will have to be studied experimentally in more detail.
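The exact layout of the χ² tests is not specified in the text. As an illustration only, the Python sketch below counts A → C, A → G and A → T exchanges in an aligned TR/VR pair, computes the mutation rate as mutated adenines per total adenines in the TR, and applies one simple variant of the test: a goodness-of-fit test against the assumption that the three substituting bases are used equally often. The function names and the example counts are hypothetical. For three categories the test has two degrees of freedom, for which the χ² survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def substitution_counts(tr: str, vr: str) -> dict:
    """Count A->C, A->G and A->T exchanges at TR adenine positions of an aligned pair."""
    counts = {"C": 0, "G": 0, "T": 0}
    for t, v in zip(tr.upper(), vr.upper()):
        if t == "A" and v in counts:
            counts[v] += 1
    return counts


def mutation_rate(tr: str, vr: str) -> float:
    """Mutated adenines per total number of adenines in the TR."""
    adenines = tr.upper().count("A")
    mutated = sum(substitution_counts(tr, vr).values())
    return mutated / adenines if adenines else 0.0


def chi_square_equal_use(counts: dict) -> tuple:
    """Goodness-of-fit test against equal use of C, G and T (2 degrees of freedom).

    Returns (statistic, p_value). The p-value uses exp(-x/2), which is the
    exact chi-square survival function for two degrees of freedom only.
    """
    total = sum(counts[base] for base in "CGT")
    if total == 0:
        raise ValueError("no substitutions to test")
    expected = total / 3
    statistic = sum((counts[base] - expected) ** 2 / expected for base in "CGT")
    return statistic, math.exp(-statistic / 2)


# Hypothetical pooled counts, for illustration only (not data from this study).
example_statistic, example_p = chi_square_equal_use({"C": 40, "G": 30, "T": 20})
```

Because genomic VR sequences have already passed through selection, a significant result from such a test cannot by itself distinguish enzymatic bias from functional selection, as noted above.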
Repeat length is limited to ~ 150 bp
By providing automated retrieval of sequence information and assignment of VRs and TRs, DiGReF allows for comprehensive analyses of structural features of DGRs. For example, comparison of VR lengths showed that most VRs lie in the range of 100 ± 50 bp. Upon manual inspection, the shorter TR/VR pairs are often flanked by a non-A mismatch and can usually be extended further, but they never exceed 180 bp (data not shown). The relatively short repeat length is in line with recent experimental evidence which showed that although DGRs tolerate some extra sequence in their TR and can transfer it to the VR, longer additional DNA sequences are quickly purged. The observed restriction of the repeat length could be due to low processivity of the RT or a specific recombination mechanism that favors exchanges of shorter DNA stretches. However, it is also possible that the process is not limited mechanistically, but functionally: if the resulting protein loses activity when larger patches of its sequence are hypermutated, there would be a strong selective pressure to keep the VRs short.
DGR structure is highly variable
DGR structure does not seem to be tightly correlated with RT phylogeny. For example, structural group 2 consists mostly of cyanobacterial sequences with similar RTs, but also includes other elements, e.g. from Haliscomenobacter, which have highly divergent RTs (Additional file 4). Also, we observed that different structural DGR types can persist in parallel within one class of bacteria and sometimes within one organism, e.g. in Candidatus Accumulibacter. The absence of a strict order in the cassette components implies that the spatial arrangement is irrelevant to the DGR mechanism. The RT- and TR-RNAs are most likely separately transcribed, and homing is an independent process.
A significant fraction of DGRs includes multiple VRs
Table: DGRs with multiple VRs (columns: Organism; GI of RT; Number of VRs; Sequence identity [%])
- Chlorobium phaeobacteroides DSM 266
- Rhodomicrobium vannielii ATCC 17100
- uncultured Desulfobacterium sp.
- Pseudomonas fluorescens WH6
- Magnetospirillum magneticum AMB-1
- Nostoc sp. PCC 7120
- Magnetospirillum magneticum AMB-1
- Nostoc punctiforme PCC 73102
- Cyanothece sp. CCY0110
- Cyanothece sp. CCY0110
- Geobacter lovleyi SZ
- Lachnospiraceae bacterium 1_4_56FAA
- Sideroxydans lithotrophicus ES-1
- Ralstonia pickettii 12J
- Burkholderia thailandensis MSMB43
- Burkholderia glumae BGR1§
- Thiorhodospira sibirica ATCC 700588
- Thiorhodococcus drewsii AZ1
- Desulfobacter postgatei 2ac9
- Collimonas fungivorans Ter331§ (sequence identity 47.5–52.3%)
- Marichromatium purpuratum 984
- Pelodictyon phaeoclathratiforme BU-1 (sequence identity 48.0–57.2%)
- Ralstonia solanacearum CMR15§ (sequence identity 41.1–59.6%)
- Pseudogulbenkiania sp. NH8B (sequence identity 45.2–72.1%)
Within each DGR, multiple target ORFs show high protein sequence homology to each other. Also at the nucleotide level, the ORFs fulfill the criteria of paralogy (30% sequence identity over at least 60% of the sequence), so that gene duplication is the likely mechanism of multiple VR-DGR formation. Notably, even the most distant target ORFs display the hallmarks of continuing and independent diversification (i.e. A exchanges without accumulation of B mutations, and different VR sequences in paralogous ORFs).
Duplication of genes is not an uncommon event in nature. In most cases, there is no significant increase in fitness and one of the copies becomes inactive and is finally deleted again. If duplication proves to be advantageous to the host, both open reading frames are kept as paralogs. The paralogous gene can increase host fitness simply by raising the expression level of the encoded protein, but most often it is associated with neofunctionalization or subfunctionalization [29, 30]. This process can be significantly accelerated by combining gene duplication with DGR activity, leading to parallel diversification of a whole protein family and thus a superior means to adapt to environmental demands. However, if all members of a gene family are mutated simultaneously, essential functions might be lost. Consequently, we checked for the presence of additional paralogs in organisms featuring multiple VRs by using one of the respective target proteins as a query for a blastp search. In all but three cases, we found at least one additional paralog without a VR. Thus the diversified genes in multiple VR DGRs are usually part of a bigger gene family and co-exist with more stable counterparts of similar function which act as conserved “ancestor” genes.
Interestingly, our search for paralogous target genes in the complete genomes of the host organisms also unearthed additional ORFs that include perfect variable repeats differing exclusively in A-positions from their corresponding TR. The maximum distance between a DGR RT and additional target ORFs was observed in Pseudogulbenkiania sp. NH8B with > 370 kb. Further examination revealed the presence of a strongly mutated RT gene in the vicinity of these distal target ORFs, suggesting that a DGR underwent duplication and lost one of the RTs because the remaining enzyme was sufficient to support diversification of all VRs. Generally, these additional target ORFs were found on different contigs or further than 5 kb from the RT, so that our program could not automatically identify them. However, the program’s ability to identify DGRs per se does not seem affected by this limitation. This is due to the fact that all DGRs that we have found so far contain a “core” DGR cassette comprising 2–4 kb, which is easily covered by the ~ 11 kb input sequence. In order to obtain a quantitative assessment of DGRs with multiple VRs, it would be necessary to run the program on whole genome data. While the length of the analyzed sequence can be increased in DiGReF, this significantly increases the computation time and was therefore not done in this initial study.
A new structural DGR type features inversions
During our studies, we identified three RTs (Shewanella baltica OS155, GI 126090247; Vibrio sp. RC586, GI 262403399; Photobacterium angustum S14, GI 90580666) that represent a previously unknown structural DGR type. These “inverted” DGRs (Figure 7, Group 4) consist of an RT ORF on one DNA strand, and TR, VR and target ORF on the other DNA strand. Except for the separation of the cassette components on two strands, these elements show all standard features of DGRs such as long repeats (130–139 nt) and a high mutation rate (18–21 A substitutions). Since our program only analyzes the DNA strand coding for the RT, repeats of these “inverted” DGRs cannot be recognized by a standard DiGReF search looking for A-specific mutations. We found them incidentally when we were investigating whether DGRs can only mutate adenine residues. We changed the program to search for repeats with C, G, or T substitutions in the vicinity of RT sequences. For Cs and Gs, we did not find a single hit that matched the search criteria, but for Ts, we found three hits representing the complementary strands of inverted DGRs. Phylogenetically, their RT sequences cluster in one group (Figure 2), suggesting that the inversion was a one-time event that was subsequently distributed to different species via HGT. Though a rare event, the inversion proves that, unlike in retrotransposons for example, the RT mRNA and the template RNA are not required to form a single transcriptional unit. Theoretically, it might even be possible that TR and VR lie on opposite strands. Indeed, the DGR of Pseudogulbenkiania sp. NH8B has four associated VRs, two on the same strand as RT and TR, and two at a greater distance and on the opposite strand. Thus, the analysis of DGR structures has uncovered two mechanistic aspects of DGR-mediated mutagenesis: transcriptional separation of RT and TR expression, and spatial uncoupling of DGR expression and VR targeting.
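The reason why a T-specific search recovers inverted DGRs follows from base pairing: an A-specific mismatch on one strand reads as a T-specific mismatch on the complementary strand. The Python sketch below demonstrates this with a hypothetical TR/VR pair; the function names are assumptions made for illustration and are not part of DiGReF.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def mismatch_template_bases(tr: str, vr: str) -> set:
    """Return the set of TR bases found at positions where aligned TR and VR differ."""
    return {t for t, v in zip(tr.upper(), vr.upper()) if t != v}


# Hypothetical aligned TR/VR pair that differs only at adenines of the TR.
toy_tr = "GCATGACTAGCA"
toy_vr = "GCTTGCCTGGCA"

same_strand = mismatch_template_bases(toy_tr, toy_vr)
opposite_strand = mismatch_template_bases(reverse_complement(toy_tr),
                                          reverse_complement(toy_vr))
```

Under these assumptions, scanning the reverse complement of the extracted region with the standard A-specific criterion would be an equivalent way to detect elements whose TR and VR lie on the strand opposite to the RT gene.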
The program DiGReF is designed to easily and automatically search for DGRs. With this program, we were able to reliably identify all previously described DGR sequences, but in addition, we found over 100 new cassette structures that show the typical features of DGRs. Changing the search parameters allowed us to identify new structural DGR types. Currently, the program is mainly limited by incomplete or misassembled sequence data, but allows facile constant surveillance of newly sequenced genomes for DGRs.
Moreover, the modular nature of DiGReF and its flexible output (e.g. in graphical format) greatly facilitate the downstream analysis of various aspects of DGRs. In this work we have analyzed repeat length, nucleotide substitution patterns, RT phylogenetics, cassette structure and interspecies transfer of DGRs, but the program output also offers the possibility to address other questions pertaining to DGR function. For example, the program can be adapted to extract the target ORFs of DGRs. Although the crucial role of DGRs in phage tropism switching is well understood, the function of these elements in bacteria is still unclear. Many target ORFs are located in the membrane, belong to the FGE-sulfatase superfamily and assume a Clec-type fold[31, 32], but their exact function is unknown. A systematic large scale comparison of the target proteins may provide insights into which proteins are good targets for DGRs and help to define their biological role. Similarly, DiGReF facilitates the search for accessory proteins and allows detailed analysis of integration determinants such as the IMH region (initiator of mutagenic homing) or the hairpin/cruciform structure downstream of the VR that is required for target site recognition in a subset of DGRs. Thus the software will be a valuable tool for obtaining deeper insights into the function of these unique intriguing retroelements.
RT sequence collection
Using eight protein sequences (GenBank GI-no. 186684985, 134299090, 148359926, 113474819, 42527768, 90580666, 149833092, 41179367) representing RTs from previously described DGRs as queries, we performed psi-blast searches with two iterations against the nr protein database (November 2011). For subsequent iterations, the top thirty hits from the first search were used as queries. More than two iterations did not lead to significant changes in the obtained set of reverse transcriptases. After the last iteration, all hits with an E-value lower than 0.005 were pooled (2651 hits) and used for further analysis.
A program (DiGReF) (Additional file 1) was designed to find potential VR and TR sequences. It was written in Perl (ActivePerl) using the BioPerl package. The program retrieves the nucleotide sequences containing RTs from the NCBI GenBank database. A region consisting of the RT-coding sequence and the sequences to either side (default length 5 kb on each side) is searched for potential TR/VR pairs. Sliding windows (default size 50 nt, step size 1 nt) are considered as candidates for TRs and are screened against the whole region containing the RT gene for repeats that match at all non-A bases. Hits are then extended to generate the maximum repeat length in which all non-A bases match. In a filtering step, repeats that contain too few As in the TR (default: fewer than 10) or too few A-specific substitutions in the VR (default: fewer than 7) are discarded from the dataset. Alignments of the potential TR/VR pairs are output to a file. A second program module (Additional file 3) converts the results of DiGReF to the GenBank DNA format with the RT and potential TR/VR pairs shown as features. This file can be opened with a sequence viewer program (e.g. Artemis [11, 35]) to allow simple visual assessment of the relative positions of the RT and the TR/VR pair.
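The size of the searched region can be illustrated with a small Python sketch. It assumes 0-based, half-open coordinates and clips the region at the contig ends; the function name and the example coordinates are hypothetical and are not taken from the Perl program.

```python
def extraction_window(rt_start: int, rt_end: int, contig_length: int,
                      flank: int = 5000) -> tuple:
    """Return (start, end) of the RT gene plus `flank` nt on each side, clipped to the contig.

    Coordinates are 0-based and half-open, i.e. the region is contig[start:end].
    """
    if not (0 <= rt_start < rt_end <= contig_length):
        raise ValueError("RT coordinates must lie within the contig")
    start = max(0, rt_start - flank)
    end = min(contig_length, rt_end + flank)
    return start, end


# An RT gene of 1134 nt (378 codons) in the middle of a long contig:
# the region spans 5000 + 1134 + 5000 = 11134 nt, i.e. roughly 11 kb.
long_start, long_end = extraction_window(20000, 21134, 100000)

# The same gene on a 4 kb contig: the region is truncated to the whole contig,
# so a TR or VR lying beyond the contig end cannot be found.
short_start, short_end = extraction_window(1000, 2134, 4000)
```

The second case corresponds to the false negatives on short contigs described in the Results, where part of the cassette is missing from the program input.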
Multiple alignment of the RT sequences from DGRs and other retroelements was performed using MAFFT at the European Bioinformatics Institute [36, 37] and COBALT. Sequence logos were created using WebLogo [39–41].
Phylogenetic analysis was carried out using MEGA5, RAxML, and PHYLIP. Trees were constructed using the neighbor joining algorithm with the JTT distance matrix and 1000 bootstrap replications were carried out to give a consensus tree. For comparison, a maximum likelihood tree was also constructed with 1000 bootstrap replications, but gave essentially the same clade pattern. 16S rRNA sequences of the organisms were downloaded from the SILVA database [46, 47]. For analysis of distribution of DGRs across prokaryotic classes, the counts for sequenced genomes per class were retrieved from NCBI’s Taxonomy database.
DGR: Diversity generating retroelement
DiGReF: Diversity generating retroelement finder
ORF: Open reading frame
A, C, T, G: Adenine, cytosine, thymine, guanine
B: C, T, or G.
We would like to thank F. Kauff for valuable advice and assistance with the phylogenetic analysis, and A. Solem for critically reading the manuscript. This work has been supported by a grant from the EU-FP7 programme (Marie Curie International Reintegration grant PIRG05-GA-2009-248023) to N.Z.
- Denamur E, Lecointre G, Darlu P, Tenaillon O, Acquaviva C, Sayada C, Sunjevaric I, Rothstein R, Elion J, Taddei F, et al: Evolutionary implications of the frequent horizontal transfer of mismatch repair genes. Cell. 2000, 103: 711-721. 10.1016/S0092-8674(00)00175-6.
- Moxon R, Bayliss C, Hood D: Bacterial contingency loci: the role of simple sequence DNA repeats in bacterial adaptation. Annu Rev Genet. 2006, 40: 307-333. 10.1146/annurev.genet.40.110405.090442.
- Liu M, Deora R, Doulatov SR, Gingery M, Eiserling FA, Preston A, Maskell DJ, Simons RW, Cotter PA, Parkhill J, Miller JF: Reverse transcriptase-mediated tropism switching in Bordetella bacteriophage. Science. 2002, 295: 2091-2094. 10.1126/science.1067467.
- Doulatov S, Hodes A, Dai L, Mandhana N, Liu M, Deora R, Simons RW, Zimmerly S, Miller JF: Tropism switching in Bordetella bacteriophage defines a family of diversity-generating retroelements. Nature. 2004, 431: 476-481. 10.1038/nature02833.
- Simon DM, Zimmerly S: A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 2008, 36: 7219-7229. 10.1093/nar/gkn867.
- Medhekar B, Miller JF: Diversity-generating retroelements. Curr Opin Microbiol. 2007, 10: 388-395. 10.1016/j.mib.2007.06.004.
- Guo H, Tse LV, Barbalat R, Sivaamnuaiphorn S, Xu M, Doulatov S, Miller JF: Diversity-generating retroelement homing regenerates target sequences for repeated rounds of codon rewriting and protein diversification. Mol Cell. 2008, 31: 813-823. 10.1016/j.molcel.2008.07.022.
- Guo H, Tse LV, Nieh AW, Czornyj E, Williams S, Oukil S, Liu VB, Miller JF: Target site recognition by a diversity-generating retroelement. PLoS Genet. 2011, 7: e1002414. 10.1371/journal.pgen.1002414.
- Perelson AS, Oster GF: Theoretical studies of clonal selection: minimal antibody repertoire size and reliability of self-non-self discrimination. J Theor Biol. 1979, 81: 645-670. 10.1016/0022-5193(79)90275-3.
- Griffiths AD, Tawfik DS: Man-made enzymes–from design to in vitro compartmentalisation. Curr Opin Biotechnol. 2000, 11: 338-353. 10.1016/S0958-1669(00)00109-9.
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16: 944-945. 10.1093/bioinformatics/16.10.944.
- Bose M, Barber RD: Prophage Finder: a prophage loci prediction tool for prokaryotic genome sequences. In Silico Biol. 2006, 6: 223-227.
- Craig NL: Tn7. Mobile DNA II. Edited by: Craig NL, Craigie R, Gellert M, Lambowitz AM. 2002, Washington DC: ASM Press, 423-456.
- Nguyen M, Vedantam G: Mobile genetic elements in the genus Bacteroides, and their mechanism(s) of dissemination. Mobile Genetic Elements. 2011, 1: 187-196. 10.4161/mge.1.3.18448.
- Xiong Y, Eickbush TH: Similarity of reverse transcriptase-like sequences of viruses, transposable elements, and mitochondrial introns. Mol Biol Evol. 1988, 5: 675-690.
- Pandey VN, Kaushik N, Rege N, Sarafianos SG, Yadav PN, Modak MJ: Role of methionine 184 of human immunodeficiency virus type-1 reverse transcriptase in the polymerase function and fidelity of DNA synthesis. Biochemistry. 1996, 35: 2168-2179. 10.1021/bi9516642.
- Wainberg MA, Drosopoulos WC, Salomon H, Hsu M, Borkow G, Parniak M, Gu Z, Song Q, Manne J, Islam S, et al: Enhanced fidelity of 3TC-selected mutant HIV-1 reverse transcriptase. Science. 1996, 271: 1282-1285. 10.1126/science.271.5253.1282.
- Huang H, Chopra R, Verdine GL, Harrison SC: Structure of a covalently trapped catalytic complex of HIV-1 reverse transcriptase: implications for drug resistance. Science. 1998, 282: 1669-1675.
- Kaushik N, Harris D, Rege N, Modak MJ, Yadav PN, Pandey VN: Role of glutamine-151 of human immunodeficiency virus type-1 reverse transcriptase in RNA-directed DNA synthesis. Biochemistry. 1997, 36: 14430-14438. 10.1021/bi970645k.
- Kaushik N, Talele TT, Pandey PK, Harris D, Yadav PN, Pandey VN: Role of glutamine 151 of human immunodeficiency virus type-1 reverse transcriptase in substrate selection as assessed by site-directed mutagenesis. Biochemistry. 2000, 39: 2912-2920. 10.1021/bi991376w.
- Singh K, Kaushik N, Jin J, Madhusudanan M, Modak MJ: Role of Q190 of MuLV RT in ddNTP resistance and fidelity of DNA synthesis: a molecular model of interactions with substrates. Protein Eng. 2000, 13: 635-643. 10.1093/protein/13.9.635.
- Klarmann GJ, Smith RA, Schinazi RF, North TW, Preston BD: Site-specific incorporation of nucleoside analogs by HIV-1 reverse transcriptase and the template grip mutant P157S. Template interactions influence substrate recognition at the polymerase active site. J Biol Chem. 2000, 275: 359-366. 10.1074/jbc.275.1.359.
- Smith RA, Klarmann GJ, Stray KM, von Schwedler UK, Schinazi RF, Preston BD, North TW: A new point mutation (P157S) in the reverse transcriptase of human immunodeficiency virus type 1 confers low-level resistance to (−)-beta-2',3'-dideoxy-3'-thiacytidine. Antimicrob Agents Chemother. 1999, 43: 2077-2080.
- Patel PH, Suzuki M, Adman E, Shinkai A, Loeb LA: Prokaryotic DNA polymerase I: evolution, structure, and "base flipping" mechanism for nucleotide selection. J Mol Biol. 2001, 308: 823-837. 10.1006/jmbi.2001.4619.
- Doublie S, Tabor S, Long AM, Richardson CC, Ellenberger T: Crystal structure of a bacteriophage T7 DNA replication complex at 2.2 Å resolution. Nature. 1998, 391: 251-258. 10.1038/34593.
- Overstreet CM, Yuan TZ, Levin AM, Kong C, Coroneus JG, Weiss GA: Self-made phage libraries with heterologous inserts in the Mtd of Bordetella bronchiseptica. Protein Eng Des Sel. 2012, 25: 145-151. 10.1093/protein/gzr068.
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al: The complete genome sequence of Escherichia coli K-12. Science. 1997, 277: 1453-1462. 10.1126/science.277.5331.1453.
- Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science. 2000, 290: 1151-1155. 10.1126/science.290.5494.1151.
- Bratlie MS, Johansen J, Sherman BT, da Huang W, Lempicki RA, Drablos F: Gene duplications in prokaryotes can be associated with environmental adaptation. BMC Genomics. 2010, 11: 588. 10.1186/1471-2164-11-588.
- Hahn MW: Distinguishing among evolutionary models for the maintenance of gene duplicates. J Hered. 2009, 100: 605-617. 10.1093/jhered/esp047.
- Miller JL, Le Coq J, Hodes A, Barbalat R, Miller JF, Ghosh P: Selective ligand recognition by a diversity-generating retroelement variable protein. PLoS Biol. 2008, 6: e131. 10.1371/journal.pbio.0060131.
- Le Coq J, Ghosh P: Conservation of the C-type lectin fold for massive sequence variation in a Treponema diversity-generating retroelement. Proc Natl Acad Sci U S A. 2011, 108: 14649-14653. 10.1073/pnas.1105613108.
- Active Perl. [http://www.activestate.com]
- BioPerl Wiki. [http://www.bioperl.org/wiki/Main_Page]
- Artemis Genome Viewer. [http://www.sanger.ac.uk/resources/software/]
- MAFFT: Multiple Alignment using Fast Fourier Transform. [http://www.ebi.ac.uk/Tools/msa/mafft/]
- Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002, 30: 3059-3066. 10.1093/nar/gkf436.
- Papadopoulos JS, Agarwala R: COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics. 2007, 23: 1073-1079. 10.1093/bioinformatics/btm076.
- WebLogo Sequence Logo Tool. [http://weblogo.berkeley.edu/logo.cgi]
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.
- Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18: 6097-6100. 10.1093/nar/18.20.6097.
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28: 2731-2739. 10.1093/molbev/msr121.
- Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics. 2005, 21: 456-463. 10.1093/bioinformatics/bti191.
- Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5: 164-166.
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8: 275-282.
- Silva rRNA database. [http://www.arb-silva.de/]
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.
- NCBI Taxonomy database. [http://www.ncbi.nlm.nih.gov/taxonomy]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.