Phylogenomic identification of five new human homologs of the DNA repair enzyme AlkB

Background: A combination of biochemical and bioinformatic analyses led to the discovery of oxidative demethylation, a novel DNA repair mechanism catalyzed by the Escherichia coli AlkB protein and its two human homologs, hABH2 and hABH3. This discovery was based on the prediction made by Aravind and Koonin that AlkB is a member of the 2OG-Fe2+ oxygenase superfamily. Results: In this article, we report the identification and sequence analysis of five human members of the (2OG-Fe2+) oxygenase superfamily, designated here hABH4 through hABH8. These experimentally uncharacterized and poorly annotated genes were not associated with the AlkB family in any database, but are predicted here to be phylogenetically and functionally related to the AlkB family (and specifically to the lineage that groups together hABH2 and hABH3) rather than to any other oxygenase family. Our analysis reveals the history of ABH gene duplications in the evolution of vertebrate genomes. Conclusions: We hypothesize that hABH4-8 could either be back-up enzymes for hABH1-3 or code for novel DNA or RNA repair activities. For example, enzymes that can dealkylate N3-methylpurines or N7-methylpurines in DNA have not been described. Our analysis will guide experimental confirmation of these novel putative human DNA repair enzymes.


Background
The 2-oxoglutarate-Fe2+-dependent (2OG-Fe2+) oxygenase superfamily groups together enzymes that catalyze a variety of reactions, typically involving oxidation of an organic substrate by a dioxygen molecule [1]. Its members include proteins implicated in the hydroxylation of proline and lysine side chains in collagen and in the synthesis of plant hormones, pigments and flavones; enzymes catalyzing oxidative ring expansion in the biosynthesis of penicillin and cephalosporin antibiotics; and proteoglycans associated with the basement membrane in chordates (leprecan and its homologs). X-ray crystallographic studies of isopenicillin N synthase revealed a variant of the "jelly-roll" fold and provided insight into the architecture of the conserved active site and the catalytic mechanism of 2OG-Fe2+ oxygenases [2].
Using protein fold recognition (a bioinformatic approach for determining whether a sequence of interest is related to known protein structures), Aravind and Koonin found that the Escherichia coli AlkB protein, implicated in DNA repair, is also a member of the 2OG-Fe2+ oxygenase superfamily [3]. They proposed that AlkB may catalyze oxidative demethylation of alkylated bases in DNA, while the eukaryotic homologs of AlkB may be involved in RNA demethylation. In accordance with this prediction, it was demonstrated experimentally that bacterial AlkB carries out the oxidative demethylation of N1-methyladenine (m1A) or N3-methylcytosine (m3C) lesions in single-stranded and double-stranded DNA; the reaction requires non-heme Fe2+, O2, and α-ketoglutarate and yields succinate, CO2 and formaldehyde, restoring the unalkylated bases in DNA [4,5]. To date, three human homologs of AlkB have been reported: hABH1, hABH2 and hABH3. Although hABH1 was initially reported by one group to partially complement the sensitivity of an E. coli alkB mutant towards an alkylating agent [6], this result could not be reproduced by others [7]. hABH2 and hABH3 remove m1A, m3C and 1-ethyladenine from alkylated polynucleotides in an α-ketoglutarate-dependent reaction [7]. Interestingly, bacterial AlkB and hABH3 (but not hABH2) were shown to prefer single-stranded nucleic acids and to repair lesions in RNA [8].
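The overall reaction described above can be summarized in a single scheme. The following is only a compact restatement of the cofactors and products listed in the text, written for the m1A lesion; it is not a mechanistic scheme, and α-ketoglutarate is written under its synonym 2-oxoglutarate (2OG).

```latex
\[
\text{m}^{1}\text{A-DNA} + \text{2-oxoglutarate} + \text{O}_2
\;\xrightarrow{\ \text{AlkB},\ \text{Fe}^{2+}\ }\;
\text{A-DNA} + \text{succinate} + \text{CO}_2 + \text{HCHO}
\]
```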
The availability of several completed eukaryotic genome sequences [9,10,11,12,13] allowed us to carry out large-scale searches for novel members of previously identified protein families and to perform "phylogenomic" analyses, i.e. functional predictions via phylogenetic analysis. The premise of phylogenomics [14] is that only analyses carried out on complete sets of proteins obtained from whole genomes allow accurate inference of duplication/speciation patterns, identification of orthologs and paralogs, and mapping of known functions onto the evolutionary tree [15,16]. Thus, the functions of uncharacterized genes can be predicted from their phylogenetic position relative to characterized ones, and by analyzing the rates and patterns of gene evolution (reviews: [14,17,18]). Previous analyses of AlkB and its homologs used smaller databases (fewer complete genomes and less accurately predicted genes in the human genome) that did not allow comprehensive comparisons of eukaryotic protein families on the genome/proteome scale. This prompted us to carry out a genome-wide search for novel eukaryotic members of the (2OG-Fe2+) oxygenase superfamily (with a special focus on potential new close homologs of AlkB) and to perform an evolutionary analysis to predict their molecular function.

Results
We identified 24 members of the 2OG-Fe2+ oxygenase superfamily in the human genome, including 19 previously known members (according to the annotation in the Ensembl database [19]) and 5 novel ones (see Methods for details). According to the phylogenomic analysis, 8 of these 24 human oxygenases grouped with the AlkB lineage. This group included the previously reported hABH1, hABH2, and hABH3, three proteins annotated only as oxygenase homologs, and two proteins that have not been functionally annotated to date. We propose to name these 5 novel putative AlkB family members ABHs (AlkB homologs), numbered 4 through 8 (Fig. 1). We also found orthologs of the newly identified ABHs in other fully sequenced metazoan genomes (Table 1) and in the EST division of the GenBank database (data not shown; links between the individual genes/contigs/transcripts are available via the Ensembl website using the gene identifiers provided in Table 1).
To substantiate the hypothesis that the new hABHs are genuine members of the 2OG-Fe2+ oxygenase superfamily, we confirmed their structural relationship to the jelly-roll fold by secondary structure prediction and protein fold recognition, i.e. the same techniques that contributed to the original identification of AlkB as a member of the 2OG-Fe2+ oxygenase superfamily [3]. All protein structure analyses were carried out via the GeneSilico protein structure prediction meta server interface [20]. The consensus pattern of secondary structures predicted for all ABHs (Fig. 1) agrees very well with that observed in the known crystal structures of genuine β-helix oxygenases (see the "clavaminate synthase-like" superfamily in the SCOP database [21]). The protein fold-recognition analysis also corroborated our prediction: according to the FFAS03 algorithm [22], the sequences of all ABHs are compatible with the structures of 2OG-Fe2+ oxygenases (such as 1mze, 1gp4, 1bk0 in the Protein Data Bank) as well as with other proteins that exhibit the same double-stranded β-helix fold (such as 1lr5 or 2arc). These matches are statistically significant (FFAS score lower than -9.00). Taken together, these results can be regarded as a strong indication that all five novel ABHs (4-8) are indeed true members of the 2OG-Fe2+ oxygenase superfamily.
Figure 1. The multiple sequence alignment of E. coli AlkB and all human members of the AlkB family described previously (ABH1-3) and in this study (ABH4-8). Residues conserved in at least 4 members of the family are highlighted in black; residues with similar physico-chemical character are highlighted in gray. The lengths of loops omitted for clarity of presentation are shown in parentheses. The N- and C-terminal residues are numbered. The predicted catalytic residues are indicated with asterisks (*).
To determine whether the new human ABHs belong to the AlkB family (and are thus likely to be genuine "dealkylases" involved in nucleic acid repair) or to some other 2OG-Fe2+ oxygenase family, we carried out a phylogenetic analysis and scrutinized the multiple sequence alignment of ABHs (Fig. 1) for conserved residues and other features characteristic of AlkB. We found that all hABHs exhibit the catalytic residues (HxD...H...RxxxxxR) predicted for the AlkB family [3]. It is remarkable that all these proteins share the conserved Arg residue in the C-terminal β-strand (R210 in E. coli AlkB), which was found to be characteristic of the AlkB family and absent from the other 2OG-Fe2+ oxygenases. The latter enzymes typically have a bulky aromatic side chain at this position [3].
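As a rough illustration of how a candidate sequence could be screened for the HxD...H...RxxxxxR pattern mentioned above, the following Python sketch uses a regular expression. It is not the procedure used in this study, where residues were identified from the multiple sequence alignment. The function name, the spacer length bounds (`gap1`, `gap2`) and the test sequence are hypothetical and chosen only for illustration.

```python
import re

def find_catalytic_motif(seq, gap1=(20, 150), gap2=(5, 60)):
    """Locate the first HxD...H...RxxxxxR pattern in a protein sequence.

    gap1 and gap2 are (min, max) lengths of the two variable "..." spacers;
    the defaults are illustrative assumptions, not values from the paper.
    Returns 1-based positions of the five key residues, or None.
    """
    pattern = re.compile(
        f"(H).(D).{{{gap1[0]},{gap1[1]}}}?"   # HxD, then first variable spacer
        f"(H).{{{gap2[0]},{gap2[1]}}}?"       # H, then second variable spacer
        "(R).{5}(R)"                          # RxxxxxR
    )
    match = pattern.search(seq.upper())
    if match is None:
        return None
    names = ("H1", "D", "H2", "R1", "R2")
    return {name: match.start(i) + 1 for i, name in enumerate(names, start=1)}

# Synthetic toy sequence (not a real protein), built to contain the pattern.
toy = "MK" + "HAD" + "G" * 30 + "H" + "A" * 10 + "RLLLLLR" + "E"
print(find_catalytic_motif(toy))
```

A pattern match of this kind is only a first filter: the first H.D found in a sequence need not be the true metal-binding pair, so any hit would still have to be checked against the alignment and the predicted secondary structure.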
Furthermore, in the phylogenetic tree of the 2OG-Fe2+ oxygenase superfamily, all ABHs cluster together with the known AlkB family members from Prokaryota and Eukaryota with strong bootstrap support (>90%; data not shown). Interestingly, the novel ABHs (ABH4-8) were found to be more closely related to the hABH2/hABH3 lineage described previously [3] than to the lineage comprising E. coli AlkB and the ABH1 protein (Fig. 2). This result suggests that eukaryotic ABH2-8 have a relatively recent common origin and most likely radiated by a series of duplication events in the metazoan lineage. Our analysis shows that all eight hABHs have orthologs in vertebrate genomes, while the number of ABHs in the genomes of C. elegans and D. melanogaster is more limited. According to the maximum likelihood phylogenetic tree, ABH2&3, ABH5&7 and ABH4&6&8 are in-paralogous (i.e. originated by duplication following the radiation of the main ABH lineages). The mouse orthologs of hABH5, hABH6, and hABH7 lie within known regions of human-mouse synteny. The hABH5 gene maps to the 17p11 locus, and hABH6 and hABH7 map to chromosome 19 (loci 19q13 and 19p13, respectively). In contrast, the hABH4 gene, which maps to locus 7q22, lies outside the regions of human-mouse synteny. There are two separate transcripts in the Ensembl database for the hABH8 gene (locus 11p11), seemingly corresponding to two protein domains: the RRM domain, typically found in RNA-binding proteins, and the oxygenase domain homologous to AlkB. However, in the hABH8 orthologs from other metazoan genomes (including C. elegans and D. melanogaster) these two domains are fused into a single polypeptide. The fusion of the AlkB-like domain with an RNA-binding domain suggests that the ABH8 protein, as previously found for hABH3, may be involved in the repair of alkylation damage in RNA. Whether the human hABH8 protein consists of two independently expressed domains or is encoded as a single polypeptide like its orthologs remains to be determined experimentally.

Conclusions
Based on phylogenomic considerations supported by structure prediction and identification of characteristic conserved residues, we predict that five so far uncharacterized human proteins are members of the AlkB family of nucleic acid repair enzymes. hABH4, hABH5, and hABH6 have been annotated in Ensembl only as members of the large 2OG-Fe2+ oxygenase superfamily, without any specification of the possible substrate or biological function. hABH7 and hABH8 had not been annotated as members of the 2OG-Fe2+ oxygenase superfamily in any database at the time of this analysis (August 2003). We hypothesize that hABH4-8 could either be back-up enzymes for hABH1-3 or code for novel DNA or RNA repair activities. For example, enzymes that can dealkylate N3-methylpurines or N7-methylpurines in DNA have not been described. It is possible that some of the predicted ABHs perform such demethylation reactions. Our prediction therefore has significant potential to facilitate experimental analyses aimed at understanding the molecular function of these proteins. However, it must be remembered that no activity could be reproducibly demonstrated for hABH1, and some of the ABHs (4-8) reported in this work may turn out to be inactive or difficult to assay. Nonetheless, all ABHs have (at least partial) counterparts in the characterized cDNA or EST databases (Table 1), suggesting that they are all expressed in human cells.

Methods
Our re-analysis of (2OG-Fe2+) oxygenases was initiated with the sequence data obtained from the Pfam and the Clusters of Orthologous Groups (COG) databases [23,24]. The Pfam database (release 9.0) contains 857 sequences from various species annotated as members of the (2OG-Fe2+) oxygenase superfamily. We also used the COG 3145 cluster with 23 prokaryotic members of the AlkB subfamily. The multiple sequence alignments (MSA) from Pfam (oxygenases) and COG (AlkB orthologs) were converted to profiles and used to search for oxygenase/AlkB homologs in the following genomic databases with a variant of the RPS-BLAST program [25]:
All bona fide oxygenases/AlkB orthologs and their homologs identified in the genomic databases were aligned using HMMER [26] and subjected to the evolutionary analysis using the SDI/RIO software package [15,16]. Briefly, a maximum likelihood (ML) tree of the bona fide members of the superfamily (sequences from Pfam and COG) was built using Tree-Puzzle [27]. Each newly identified putative oxygenase/AlkB homolog was separately aligned to the genuine oxygenases and its position on the superfamily tree (as well as the statistical support for this position) was determined. The SDI algorithm was used to identify the type of evolutionary relationships (orthology/paralogy) between the tested sequence and the subfamilies of genuine oxygenases, using the established evolutionary tree of species [28] as the reference. Consequently, for each oxygenase candidate it was predicted whether it could have evolved from the known superfamily members by speciation or duplication and what would be the relative timing of its radiation compared to the radiation of other members.
Figure 2. The maximum likelihood phylogenetic tree of the AlkB family. Only representative members (from human and E. coli), corresponding to the sequences in Fig. 1, are shown for clarity of presentation. The position of the root was inferred from the large 2OG-Fe2+ oxygenase superfamily tree (data not shown). The topology of the presented tree received very high p-value support according to the Shimodaira-Hasegawa [31] and "approximately unbiased" [32] tests (0.998 and 0.988, respectively).
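To make the speciation/duplication inference step more concrete, the following Python sketch labels each internal node of a rooted gene tree by mapping it to the smallest species-tree clade that contains all of its species, and calling the node a duplication when one of its children maps to that same clade. This is a simplified illustration in the spirit of such reconciliation methods, not the SDI/RIO implementation used in this study. The nested-tuple tree format, the `gene@species` leaf naming, the function names and the toy trees are all assumptions made for the example.

```python
def clades(tree):
    """Return the set of leaf-sets (frozensets), one per node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found

def lca_clade(species_clades, species_set):
    """Smallest species-tree clade that contains every species in species_set."""
    return min((c for c in species_clades if species_set <= c), key=len)

def infer_events(gene_tree, species_tree):
    """Label each internal gene-tree node (post-order) as duplication or speciation."""
    species_clades = clades(species_tree)
    events = []

    def walk(node):
        if isinstance(node, str):
            species = node.split("@")[1]          # hypothetical "gene@species" naming
            return frozenset([node]), lca_clade(species_clades, frozenset([species]))
        results = [walk(child) for child in node]
        genes = frozenset().union(*(g for g, _ in results))
        child_maps = [m for _, m in results]
        mapped = lca_clade(species_clades, frozenset().union(*child_maps))
        # A node mapping to the same species clade as one of its children
        # cannot be explained by speciation alone, so it is called a duplication.
        kind = "duplication" if mapped in child_maps else "speciation"
        events.append((sorted(genes), kind))
        return genes, mapped

    walk(gene_tree)
    return events

# Toy trees for illustration only (not the trees analyzed in this work).
species_tree = (("human", "mouse"), "fly")
gene_tree = ((("A@human", "A@mouse"), ("B@human", "B@mouse")), "X@fly")
for genes, kind in infer_events(gene_tree, species_tree):
    print(kind, genes)
```

In this toy case the node joining the A and B pairs is labeled a duplication that predates the human-mouse split, so A and B are paralogs, while the human and mouse copies within each pair are orthologs.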
The final evolutionary tree of the oxygenase superfamily was computed using the fast neighbor-joining (NJ) approach [29] (for all sequences analyzed) and the computationally expensive maximum likelihood (ML) approach [30] (for the selected members of each major lineage, including all hABHs). The topologies of the major branches of both trees were virtually identical; therefore only the ML tree is shown, for clarity of presentation. The topology of the final phylogenetic tree was evaluated using the Shimodaira-Hasegawa and "approximately unbiased" tests [31,32].
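One simple way to quantify how similar two tree topologies are is to compare the sets of clades they contain. The Python sketch below does this for rooted trees written as nested tuples over the same leaf labels. How the NJ and ML topologies were actually compared in this study is not specified, so this is only an illustrative assumption, and the function names and toy trees are hypothetical.

```python
def nontrivial_clades(tree):
    """Return (all leaves, leaf-sets of internal nodes excluding the root)."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)   # the root clade is shared by definition
    return all_leaves, found

def shared_clade_fraction(tree_a, tree_b):
    """Fraction of nontrivial clades (over both trees) present in both trees."""
    leaves_a, clades_a = nontrivial_clades(tree_a)
    leaves_b, clades_b = nontrivial_clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaf labels")
    union = clades_a | clades_b
    return len(clades_a & clades_b) / len(union) if union else 1.0

# Toy topologies for illustration only.
tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = ((("a", "c"), "b"), ("d", "e"))
print(shared_clade_fraction(tree_1, tree_1))
print(shared_clade_fraction(tree_1, tree_2))
```

A value of 1.0 means the two trees contain exactly the same groupings; lower values indicate that some groupings appear in only one tree. Trees built on different sequence sets, as with the NJ and ML trees here, would first have to be pruned to their shared leaves.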

Authors' contributions
MAK performed most of the database searches, multiple sequence alignments, and phylogenetic analyses and drafted the first version of the manuscript. ASB initiated the study and participated in the interpretation of the results and the writing of the manuscript. GP performed database searches and multiple sequence alignments to identify eukaryotic orthologs of all eight ABHs and used this information to identify missing exons in human genes. JMB supervised the study, refined the multiple sequence alignment, participated in the phylogenetic analyses, prepared the figures and participated in the discussion of the results and the writing of the manuscript. All authors read and approved the final manuscript.