Classification and evolutionary history of the single-strand annealing proteins, RecT, Redβ, ERF and RAD52

Background
The DNA single-strand annealing proteins (SSAPs), such as RecT, Redβ, ERF and Rad52, function in RecA-dependent and RecA-independent DNA recombination pathways. Recently, they have been shown to form similar helical quaternary superstructures. However, despite the functional similarities between these diverse SSAPs, their actual evolutionary affinities are poorly understood.

Results
Using sensitive computational sequence analysis, we show that the RecT and Redβ proteins, along with several other bacterial proteins, form a distinct superfamily. The ERF and Rad52 families show no direct evolutionary relationship to these proteins and define novel superfamilies of their own. We identify several previously unknown members of each of these superfamilies and also report, for the first time, bacterial and viral homologs of Rad52. Additionally, we predict the presence of aberrant HhH modules in RAD52 that are likely to be involved in DNA-binding. Using the contextual information obtained from the analysis of gene neighborhoods, we provide evidence of the interaction of the bacterial members of each of these SSAP superfamilies with a similar set of DNA repair/recombination proteins. These include different nucleases or Holliday junction resolvases, the ABC ATPase SbcC and the single-strand-binding protein. We also present evidence of independent assembly of some of the predicted operons encoding SSAPs and in situ displacement of functionally similar genes.

Conclusions
There are three evolutionarily distinct superfamilies of SSAPs, namely the RecT/Redβ, ERF, and RAD52 superfamilies, that have different sequence conservation patterns and predicted folds. All these SSAPs appear to be primarily of bacteriophage origin and have been acquired by numerous phylogenetically distant cellular genomes. They generally occur in predicted operons encoding one or more of a set of conserved DNA recombination proteins that appear to be the principal functional partners of the SSAPs.


Introduction
Homologous DNA recombination is a fundamental process in the biochemistry of DNA repair and replication, which contributes to the generation of the genetic diversity critical for natural selection. An important step in the recombination process is the pairing of homologous double-stranded DNAs followed by the exchange of DNA strands between the paired molecules. Experimental studies have shown that members of the archetypal RecA family of recombinases are central to this reaction in all extant forms of life [1,2]. Studies in Escherichia coli have shown that, although RecA is the principal protein involved in pairing and strand exchange, unrelated proteins with a much more restricted phyletic distribution can also promote similar reactions in a RecA-dependent or RecA-independent manner [3]. These alternative or additional mediators of homologous recombination include the well-characterized prophage RecT, phage λ Redβ and phage P22 ERF proteins [4,5]. Similarly, in yeast and vertebrates, the RAD52 protein is involved in the pairing and strand exchange reaction and can promote recombination in a RAD51 (the eukaryotic RecA homolog)-dependent or -independent manner [6]. The RecT protein works in conjunction with the RecE nuclease [7] and was initially described in genetic studies on the complementation of mutations in the RecBCD pathway of DNA repair [8][9][10]. Biochemically, RecT has been shown to bind single-stranded (ss) DNA 3' overhang regions generated by the RecE nuclease, and to promote strand exchange between homologous DNA partners by assisting the pairing of complementary single-stranded regions [4,10]. The reaction catalyzed by the RecT/RecE system is similar to that described for the phage λ exonuclease (exo/Redα) and the single-strand annealing protein Redβ. The similarity between these two systems is further extended by the observation that the RecT/E system can complement mutations in the λ exo/Redβ system [10,12].
In eukaryotes, the RAD52 protein has been shown to exhibit properties similar to those of the RecT and Redβ proteins: it binds ssDNA and promotes strand exchange via the pairing of complementary single strands [6,13]. In vitro studies on quaternary structures have shown that the single-strand annealing proteins (SSAPs), RecT, Redβ, ERF and RAD52, form similar helical superstructures [14][15][16][17]. This has led to the proposal that RecT, Redβ, ERF and the eukaryotic RAD52 function in an analogous fashion, and are even "structural homologs" [14]. However, no sequence or secondary structural similarities have been noticed between different SSAPs, and current understanding of their evolutionary history and phyletic range remains poor. Here, we describe the results of an in-depth sequence analysis of these proteins and delineate their evolutionary relationships and phyletic horizon in available genomes. We show that, in spite of the functional similarities and the similar quaternary structures, there are three distinct superfamilies of SSAPs, namely the RecT/Redβ, RAD52 and ERF superfamilies, that appear to be evolutionarily unrelated to each other. These superfamilies show a wide distribution in viral and cellular genomes, but appear to have originally evolved in large DNA bacteriophages. Through an analysis of the contextual information provided by the predicted operons in which the SSAPs occur, we predict several previously undetected functional connections of these proteins, which might shed new light on the corresponding DNA repair/recombination pathways.

RecT and Redβ are evolutionarily related and define a widespread family of DNA recombination proteins
Several lines of evidence, including genetic analyses and similarities in biochemistry and quaternary structures, suggest that the E. coli RecT and phage λ Redβ proteins are functionally equivalent as mediators of single-strand exchange in DNA recombination [10,12]. However, no sequence similarity has been detected between these proteins, leaving their actual evolutionary relationships unresolved. In order to gain a better understanding of their functions and origins, we undertook a detailed sequence analysis of these two proteins using iterative sequence profile searches with the PSI-BLAST program, with an inclusion threshold of 0.01, iterated until convergence. Such searches, with Redβ proteins from different lambdoid bacteriophages as queries, retrieved not only other obvious Redβ homologs, but also the RecT protein family. For example, searches initiated with the Redβ homolog, PF161 protein (Genbank gi: 9836834, amino acids 1 to 188) from Borrelia hermsii [18], detected the E. coli RecT protein in the fifth iteration with a significant expectation (e) value (3 × 10⁻³). Subsequent iterations retrieved several more RecT-related proteins from diverse sources. Further, transitive searches with the proteins detected in the above searches resulted in the identification of more divergent homologs, such as a protein termed the 'enterohemolysin associated protein' (EHAP1) from E. coli [19] and its orthologs in Salmonella (Fig. 1). An examination of the pairwise alignments generated by these searches showed that all these proteins shared a characteristic set of residues, including two highly conserved aromatic residues at the N- and C-termini, respectively, and two consecutive acidic residues near the C-terminus. These observations strongly suggested that RecT and Redβ, along with several other proteins, could be unified into a single protein superfamily with a core conserved domain of approximately 200 amino acid residues.
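The transitive search strategy described above, in which each newly detected protein is used in turn as a query, can be sketched as a breadth-first traversal. This is only an illustrative outline, not the procedure actually scripted for this study: the function name, the `max_rounds` limit and the `search` callable (standing in for a converged PSI-BLAST run that returns the identifiers passing the inclusion threshold) are all assumptions.

```python
from collections import deque


def transitive_search(seed, search, max_rounds=3):
    """Collect putative homologs by re-querying with every new hit.

    `search` is a hypothetical callable: it takes one protein identifier
    and returns an iterable of identifiers that passed the inclusion
    threshold (for example, a wrapper around a PSI-BLAST run).
    Returns a dict mapping each identifier to the round in which it was
    first detected (the seed is round 0).
    """
    found = {seed: 0}
    queue = deque([seed])
    while queue:
        query = queue.popleft()
        depth = found[query]
        if depth >= max_rounds:
            # Do not launch further searches from the outermost hits.
            continue
        for hit in search(query):
            if hit not in found:
                found[hit] = depth + 1
                queue.append(hit)
    return found
```

Recording the round of first detection makes it easy to separate hits retrieved directly from the seed from the more divergent ones that are reached only through an intermediate sequence; the latter would normally deserve closer inspection of the alignments.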
A multiple alignment of all members of the RecT/Redβ superfamily was generated using the T_Coffee program followed by adjustments based on the PSI-BLAST search results. This alignment was used to predict their secondary structure using the Jpred and PHD methods; these predictions pointed to an α + β domain with a core of five β-strands and five α-helices (Fig. 1). Some of the strongest conservation is concentrated in the long helices, and the pattern includes some charged or polar residues, suggesting that they are probably exposed and participate in the protein-protein and protein-DNA interactions that are typical of this superfamily (helices 2, 3 and 4 in Fig. 1). The conserved, regularly spaced hydrophobic residues in the RecT/Redβ superfamily are predicted to be buried, allowing these domains to assume a globular structure. Experimental studies have shown that the strand transfer reaction mediated by RecT and its binding to dsDNA are sensitive to Mg²⁺ concentrations, and it was proposed that the levels of free Mg²⁺ could regulate RecT activity [4]. Similarly, Redβ has been shown to promote single-strand annealing in a Mg²⁺-dependent manner [20]. In this context, the conservation of the two C-terminal acidic residues in the majority of members of this superfamily suggests that these might be involved in the coordination of Mg²⁺ and implies that metal ion-dependent conformational switching is likely to be a generic feature of this superfamily.
Phylogenetic analyses of the RecT/Redβ superfamily using the least squares and maximum likelihood methods distinguished three distinct groups, namely the RecT-like, the Redβ-like and the EHAP1-like families (Fig. 2). Previously, the RecT proteins have been known from very few bacteria and Redβ has only been detected in λ and closely related phages. However, we show that the Redβ family is widespread in bacteria, such as Borrelia hermsii, Xylella, Ureaplasma, Listeria, Streptococcus pyogenes and Mesorhizobium loti. The RecT family is predominantly seen in the low-GC Gram-positive bacteria, such as Bacillus, Streptococcus, Lac-

ERF defines a superfamily of SSAPs that are evolutionarily distinct from the RecT/Redβ superfamily
The ERF protein of phage P22 is involved in the circularization of the linear dsDNA phage genome upon entry into the host cell [21][22][23]. Experimental studies have shown that mutations in ERF are complemented by Redβ and that, in vitro, ERF adopts quaternary structures analogous to those of Redβ and RecT [14,17,24,25]. However, in the comprehensive analysis of the RecT/Redβ superfamily, no statistically significant similarity could be detected between these proteins and the ERF proteins. To explore the evolutionary affinities of the ERF domains, we carried out a sequence profile analysis using transitive PSI-BLAST searches, as described above for the RecT case. As a result of these searches, homologs of ERF encoded in several bacterial and phage genomes from diverse taxa were identified.
The alignments generated in these searches consistently pointed to a region of approximately 150 amino acids that is conserved in all these proteins, with a characteristic motif of the form GuXXoYhp+YXhXXhh (where G is glycine, Y is tyrosine, u is a tiny residue, h is a hydrophobic residue, p is a polar residue, o is an alcohol residue, + is a basic residue, and X is any residue; Fig. 3). This suggested that ERF was the prototype of a family of conserved bacterial domains.
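To make the motif notation concrete, the sketch below translates it into a regular expression that can be scanned against a protein sequence. It is an illustration only: the membership of each residue class (tiny, hydrophobic, polar, alcohol, basic) is an assumed, commonly used definition that is not specified in the text, and the function names are hypothetical.

```python
import re

# Assumed residue classes; the text names the classes but not their members.
RESIDUE_CLASSES = {
    "G": "G",           # glycine
    "Y": "Y",           # tyrosine
    "u": "GAS",         # tiny residues
    "h": "ACFILMVWY",   # hydrophobic residues
    "p": "CDEHKNQRST",  # polar residues
    "o": "ST",          # alcohol residues
    "+": "HKR",         # basic residues
    "X": "A-Z",         # any residue
}

ERF_MOTIF = "GuXXoYhp+YXhXXhh"


def motif_to_regex(motif):
    """Turn each motif symbol into a one-residue character class."""
    return re.compile("".join("[%s]" % RESIDUE_CLASSES[s] for s in motif))


def find_motif(sequence, motif=ERF_MOTIF):
    """Return 0-based start positions of non-overlapping motif matches."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

A pattern match of this kind is far less sensitive than the profile searches used in the analysis; it is useful mainly for locating the motif in sequences that have already been identified as ERF homologs.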
Secondary structure prediction based on the multiple alignment of the ERF domain suggests a globular α + β fold with five helices and three or four strands (Fig. 3). The above-mentioned motif that is typical of this family is associated with helix 4 of this domain; given the presence of conserved basic residues, it may be critical for DNA-binding and strand-transfer activity of the ERF-like proteins. Additionally, in the loop between helices 4 and 5 of the ERF domain there is a universally conserved acidic motif of the form DXD. Analogous to the RecT superfamily, this acidic dyad might coordinate a divalent cation and undergo a conformational change dependent on metal-binding. However, the average size of the core domains, the patterns of conserved residues, and the predicted secondary structures of the RecT/Redβ and ERF domains show no correspondence to each other, implying that there is no direct evolutionary link between these protein groups.
ERF homologs are encoded by the genomes of several temperate phages of Gram-positive bacteria and γ-proteobacteria; additionally, we detected members of this superfamily in Listeria and in all the circular plasmids and one linear plasmid of Borrelia burgdorferi (Fig. 4). Thus, like the RecT/Redβ superfamily, the ERF family is likely to have emerged in the temperate phages, and was disseminated to the Borrelia circular plasmids and some bacterial genomes via prophages.

Detection of bacterial homologs of RAD52 and identification of an aberrant HhH domain in these proteins
The baker's yeast protein RAD52 and its paralog RAD59 define a small family of proteins thus far represented in fungi, vertebrates and the early-branching ameboid eukaryote, Entamoeba histolytica. Rad52 functions in conjunction with the RecA ortholog, the RAD51 recombinase, in double-strand break repair and meiotic recombination [6]. RAD52 binds ssDNA during recombination and also shows a quaternary organization similar to those of RecT/Redβ and ERF [16,26]. However, RAD52-like proteins showed no detectable sequence similarity with either the ERF or the RecT/Redβ-like proteins. Sequence searches initiated with the conserved globular region of the eukaryotic RAD52 proteins readily detected their homologs from other eukaryotes and, at convergence, also retrieved from the database certain bacterial proteins, such as DR0423 from Deinococcus and CAC1936 from Clostridium, with borderline statistical significance (e ~ 0.05). These bacterial proteins form a small family that is additionally represented in Salmonella paratyphi A, the temperate bacteriophage ul36 of Lactococcus lactis (ORF252-encoded protein) and a Shiga toxin-converting phage from E. coli. Iterative profile searches initiated with CAC1936 from Clostridium acetobutylicum and its S. paratyphi A ortholog correspondingly retrieved S. cerevisiae RAD52 and its eukaryotic homologs, with borderline e-values at convergence (~0.043). The alignment between these bacterial proteins and the eukaryotic Rad52 homologs was co-linear throughout the entire length of their shared globular region, and the Gibbs sampling procedure detected two motifs of greater than 20 residues, with a probability of chance occurrence in these proteins of less than 10⁻¹⁸ (Fig. 5).
In addition to the similar conservation pattern, separate secondary structure predictions for both the eukaryotic RAD52 family and their potential bacterial homologs showed a complete concordance of the predicted structural elements between RAD52 and the bacterial proteins, strongly suggesting that they all belong to a single homologous superfamily (hereinafter the RAD52 superfamily).

The secondary structure predictions showed that the Rad52 superfamily proteins adopt a structure with interspersed α-helices and β-strands (Fig. 5). Additionally, fold predictions using 3DPSSM (E-value = 0.0085, corresponding to a 90% confidence in the prediction) and the hybrid fold method (Z-score = 19.5) predicted the presence of a potential Helix-hairpin-Helix (HhH) fold in members of the RAD52 superfamily. The HhH domain is a small nucleic acid-binding module comprised of two helices joined by a central loop (hairpin), which functions as the DNA-binding moiety of numerous repair and recombination proteins [27,28]. Two HhH modules are predicted in the core conserved domain of the RAD52 family, the first one bounded by the predicted helices 2 and 3, and the second one bounded by helices 5 and 6 (Fig. 5). Although these predicted HhH modules are very divergent in sequence from the typical versions, the hairpin in both HhH modules of the RAD52 family proteins is bounded by small residues, typically glycine; this conforms to the signature motif characteristic of the classical HhH modules [28,29]. However, in the case of the RAD52 superfamily, the predicted HhH modules appear to have been welded into a large globular superstructure that maintained its evolutionary distinctness over time. The conservation pattern and predicted structural elements of the RAD52 superfamily are distinct from those predicted for the ERF and RecT/Redβ superfamilies (Figs. 1, 3, 5), supporting the lack of a direct evolutionary relationship between these proteins.
The RAD52 superfamily shows a sporadic phyletic distribution and, even in the crown-group eukaryotes, might have been secondarily lost in certain lineages, such as plants, nematodes and insects. The sporadic distribution of this family among phylogenetically distant bacteria, along with its presence in several prophages, suggests that, like the RecT/Redβ and ERF superfamilies, at least the bacterial RAD52-like proteins might be of predominantly phage origin. The core of the eukaryotic recombination system appears to have been inherited from the system present in the common ancestor shared with the archaea [29]. However, RAD52 is thus far absent in all archaeal genomes and is restricted to a single orthologous group in the eukaryotes [29]. Thus, it appears plausible that eukaryotic RAD52 was ultimately derived through lateral transfer either from a bacterial genome or directly from a viral source, at a point at least predating the divergence of the crown-group eukaryotes and Entamoeba.

Contextual information from gene neighborhoods provides details regarding functional interactions of the SSAPs with DNA recombination pathways
The clustering of functionally related genes in prokaryotic genomes into co-transcribed and co-regulated units, operons, often allows functional assignments through the principle of 'guilt by association' [30][31][32]. Generally, genes whose products physically interact to form a complex or are involved in successive steps in a biochemical pathway form operons that are conserved over large evolutionary distances [30]. On previous occasions, we have used gene neighborhoods or operons to predict novel DNA repair complexes and their components [33]. Accordingly, a similar approach was applied to the three families of SSAPs (RecT/Redβ, ERF, Rad52), to shed light on their functional links.
Notably, the genes encoding the three evolutionarily distinct SSAPs co-occurred with similar sets of DNA repair/recombination-related proteins (Figs. 2,4). In at least one case, each of them was found adjacent to the gene for the single-strand-binding protein (SSB), an OB-fold protein that binds ssDNA (Figs. 2,4). This association ties in with the function of the SSAPs in single-strand annealing, suggesting that they closely interact with SSB. It has been suggested in the case of RecT that it may compete with SSB for binding single-strand overhangs and thereby make them available for the annealing process [3]. Similar interactions between other SSAPs and SSB, which probably coats the ssDNA generated by nucleases, appear likely. Genes for SSAPs from all three distinct superfamilies may also occur adjacent to or in the vicinity of genes encoding nucleases or Holliday junction resolvases (HJRs). Genes for RecT/Redβ superfamily proteins are associated with genes encoding a λ-type exonuclease (LE) of the type II restriction enzyme fold, RecE, which also might be a divergent member of this fold, and a nuclease of the Endonuclease VII (EndoVII) fold [7,34] (Fig. 2). The ERF superfamily genes are associated with a RusA superfamily nuclease/HJR and EndoVII fold nucleases (Fig. 4) [34]. Furthermore, the Borrelia plasmids that encode ERF almost always additionally encode a λ-type exonuclease, even if it is not the adjacent gene. In a single instance, in the Gram-positive bacterium Ruminococcus albus, the gene encoding a RAD52 superfamily protein occurs adjacent to a gene for a λ-type exonuclease. These nucleases probably contribute to the repair process in which SSAPs are involved by providing the initial break in the dsDNA and/or by digesting the nicked target to generate ssDNA.
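The kind of tally that underlies such neighborhood comparisons can be sketched as follows. This is a simplified illustration, not the procedure used in this study: it assumes that each genome is available as an ordered list of gene-family labels, and it ignores strand, intergenic distance and operon prediction. The function name, the `is_target` predicate and the `window` parameter are hypothetical.

```python
from collections import Counter


def neighborhood_counts(genomes, is_target, window=2):
    """Count gene families found near target genes.

    `genomes` maps a genome name to the list of gene-family labels in
    chromosomal (or plasmid) order. For every gene for which
    `is_target(label)` is true, each distinct label within `window`
    positions on either side is counted once.
    """
    counts = Counter()
    for gene_order in genomes.values():
        for i, label in enumerate(gene_order):
            if not is_target(label):
                continue
            lo = max(0, i - window)
            hi = min(len(gene_order), i + window + 1)
            neighbors = set(gene_order[lo:i]) | set(gene_order[i + 1:hi])
            counts.update(neighbors)
    return counts
```

Families that recur next to SSAP genes in phylogenetically distant genomes are the ones of interest; a neighbor seen in a single genome carries little weight on its own.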
The RecT and Redβ family proteins often co-occur with the sbcC gene, which encodes an ABC ATPase with a large coiled-coil segment. SbcC proteins are known to cooperate with SbcD, a nuclease of the calcineurin-like phosphoesterase superfamily, to degrade dsDNA in the 3' → 5' direction, generating ssDNA [35,36]. It seems likely that RecT/Redβ proteins, at least in certain cases, function in conjunction with the SbcCD pathway by utilizing the single-stranded regions generated by the SbcCD nuclease. Additionally, several genes whose functions are less clear tend to co-occur with the genes coding for the SSAPs.
These include DNA methyltransferases and the primosomal protein DnaD from low-GC Gram-positive bacteria [37] that co-occur with both ERF and RecT superfamily members (Figs. 2,4). The poorly characterized phage- or prophage-specific genes that are frequently observed in these neighborhoods include ORF15 (Streptococcus thermophilus bacteriophage 7201), ORF86, ORF100a (Staphylococcus aureus temperate phage φSLT) and ORF364 (bacteriophage φ31.1) (Figs. 2,4). Secondary structure predictions indicate a high α-helical content for these proteins. It is likely that these α-helical proteins are phage innovations that could function as adaptors in the recombination pathway, either as accessory protein-protein interacting domains or as DNA-binding domains.

Evidence for convergent operon evolution and in situ non-orthologous displacement of genes in operons encoding SSAPs
A superposition of the gene neighborhood information upon the phylogenetic trees for the SSAP superfamilies provides insights into the evolutionary processes that led to the emergence of the operons that include the SSAP genes. As discussed above, the RecT/Redβ superfamily clearly splits into three distinct families (Fig. 2). The phylogenetic tree shows that SbcC co-occurs with the SSAP once within the Redβ family and once within the RecT family. An examination of the tree and the respective gene neighborhoods suggests that independent juxtaposition of SbcC with Redβ-like and RecT-like genes on two separate occasions is the most parsimonious explanation. The alternative explanation, namely that the gene coding for the common ancestor of Redβ and RecT already co-occurred with the sbcC gene, is far less likely because it would require over 10 independent losses of this apparently functionally advantageous organization in different bacterial and bacteriophage lineages. Likewise, the observation that, in one or more cases, genes encoding each of the SSAPs co-occur in the same predicted operon with SSB or a λ-type exonuclease suggests that similar operon structures may also emerge independently in evolution. Thus, the same or analogous operon organizations may emerge convergently on multiple occasions, probably due to the selective pressure arising from the strong interactions between the SSAPs and their functional partners such as SbcC, SSB and LE.
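The parsimony argument above amounts to comparing the number of events required under two extreme scenarios: the linkage was absent in the ancestor and was only gained, or it was present in the ancestor and was only lost. The sketch below counts those events on a rooted tree written as nested tuples of leaf names. It is a deliberately simplified illustration with hypothetical function names, not the authors' procedure, and it does not consider mixed scenarios with both gains and losses.

```python
def _maximal_uniform_clades(node, has_trait, value):
    """Return (uniform, n): whether every leaf below `node` has state
    `value`, and the number of maximal clades that are uniform for it."""
    if isinstance(node, str):
        match = has_trait[node] == value
        return match, (1 if match else 0)
    results = [_maximal_uniform_clades(c, has_trait, value) for c in node]
    if all(uniform for uniform, _ in results):
        return True, 1  # the whole clade counts as a single event
    return False, sum(n for _, n in results)


def independent_gains(tree, has_trait):
    """Events needed if the trait was absent at the root and never lost."""
    return _maximal_uniform_clades(tree, has_trait, True)[1]


def independent_losses(tree, has_trait):
    """Events needed if the trait was present at the root and never regained."""
    return _maximal_uniform_clades(tree, has_trait, False)[1]
```

When a trait is confined to a few small clades scattered across a large tree, the gains-only count stays low while the losses-only count grows with the number of intervening lineages, which is the pattern invoked for the SbcC linkage.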
The distribution of the gene neighborhoods in which a member of the RecT superfamily occurs next to RecE, mapped onto the phylogenetic tree of the RecT/Redβ superfamily, indicates that the RecE-RecT combination was probably the ancestral state for at least the RecT and EHAP1 families (Fig. 2). This implies that, on at least two occasions, the gene for the λ-type exonuclease displaced the functionally analogous recE gene and became the gene adjacent to recT (Fig. 2). That this displacement might have occurred by in situ insertion of a non-orthologous gene is suggested by the detection, on three separate occasions, of unusual remnants of pre-existing genes. Two RecT/Redβ superfamily members, namely EHAP1 from the enterobacteria and PF161 from Borrelia hermsii, contain a small, C-terminal fragment of the core conserved domain of the ERF superfamily, which is located C-terminal of their bona fide RecT/Redβ domains. These fragments of the ERF protein are closely related to other ERF domains from related organisms and are unlikely to fold into the native conformation characteristic of the full-length ERF domain. For example, the ERF fragment fused to the EHAP1 RecT/Redβ domain is closely related to the P22 phage ERF domain. This suggests that in each of these cases a RecT superfamily gene was inserted in frame into a pre-existing ERF gene, leaving behind only a non-functional fragment of it (Fig. 2). In a very similar case, the bacterial RAD52-like protein from a Shiga toxin-encoding temperate phage is fused to an extreme C-terminal fragment that is nearly identical to the C-terminal-most portion of the P22 ERF protein. In this case, it appears that the pre-existing ERF gene was displaced through the insertion of a bacterial RAD52-like gene.
Interestingly, and in the same vein, the RecT proteins from Bacillus species contain a short C-terminal acidic module that is missing in other RecT proteins, but is highly similar to the C-terminal region of SSBs, particularly those from Gram-positive bacteria (data not shown). This suggests that, at some stage in its evolution, the Bacillus recT gene recombined with the gene coding for SSB, which might even have resulted in a functional replacement of an SSB with an SSAP.
Thus it appears likely that functionally equivalent genes may displace their analogs in operons via insertion into the same position.

Conclusions
We show that functionally similar SSAPs belong to at least three evolutionarily distinct superfamilies. We unify the Redβ and RecT proteins and their homologs, which have not previously been reported as being related at the sequence level, into a single superfamily, supporting the notion that these proteins share a similar mechanism of action. The second superfamily, typified by the ERF proteins, is predominantly found in bacteriophages and is also present on all circular plasmids from Borrelia, suggesting a role in the recombination of these plasmids. The third superfamily, typified by the yeast RAD52 protein and previously detected only in eukaryotes, was shown to include bacterial and phage homologs and to contain a modified HhH domain. By comparing the gene neighborhoods of the SSAPs, we show that the predicted operons that include the SSAP genes evolve according to the "LEGO" principle. In these operons, the SSAP genes are linked to the genes for various DNases and DNA repair-related proteins, such as SSB and SbcC, which implies functional connections between the encoded proteins. Evidence is presented of convergent emergence of similar SSAP-encoding operons in different lineages and of in situ non-orthologous displacement of functionally similar genes in these operons.

Materials and Methods
Sequence searches of the non-redundant (NR) and the unfinished genomes databases were done using the gapped BLAST and PSI-BLAST programs [38]. Iterative PSI-BLAST searches used for in-depth sequence analysis were done with the profile inclusion cutoff expectation value (E value) set at 0.1. Multiple sequence alignments were generated using the T_Coffee program [39] and the output was adjusted using PSI-BLAST search results and secondary structure predictions, which were conducted using the PHD [40,41] and Jpred [42] programs. Fold predictions were done using the 3-D position specific score matrix (3DPSSM) [43] and the Hybrid fold method [44]. Phylogenetic analysis was carried out using the neighbor-joining algorithm, with subsequent local rearrangements using the maximum likelihood algorithm [45]. The robustness of tree topology was assessed with 10,000 Resampling of Estimated Log Likelihoods (RELL) bootstrap replicates. The MOLPHY and Phylip software packages were used for the analyses [46,47].
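For readers unfamiliar with RELL, the idea is to resample the per-site log-likelihoods that were already computed for each candidate tree, rather than re-estimating the trees for every bootstrap replicate. The sketch below illustrates that resampling step only; it is not the MOLPHY implementation, and the function name, its parameters and the input layout (one list of per-site log-likelihoods per tree) are assumptions.

```python
import random


def rell_support(site_loglik, n_replicates=10000, seed=0):
    """Estimate RELL bootstrap support for a set of candidate trees.

    `site_loglik` maps a tree name to its list of per-site
    log-likelihoods; all lists must cover the same alignment sites.
    Returns the fraction of replicates in which each tree has the
    highest resampled total log-likelihood (ties go to the tree
    listed first).
    """
    rng = random.Random(seed)
    names = list(site_loglik)
    n_sites = len(site_loglik[names[0]])
    wins = dict.fromkeys(names, 0)
    for _ in range(n_replicates):
        # Draw alignment sites with replacement, as in a site bootstrap.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = {
            name: sum(site_loglik[name][i] for i in sample) for name in names
        }
        best = max(names, key=totals.__getitem__)
        wins[best] += 1
    return {name: wins[name] / n_replicates for name in names}
```

Because the same resampled sites are applied to every tree within a replicate, the comparison is paired, which is what makes the approach much cheaper than a full bootstrap with tree re-estimation.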