Feature amplified voting algorithm for functional analysis of protein superfamily
© Chung et al. 2010
Published: 01 December 2010
Identifying the regions associated with protein function is a singularly important task in the post-genomic era. Biological studies often identify functional enzyme residues from amino acid sequences, particularly when related structural information is unavailable. In some protein superfamilies, functional residues are difficult to detect with current alignment tools or evolutionary strategies when phylogenetic relationships do not parallel protein functions. The solution proposed in this study is the Feature Amplified Voting Algorithm with Three-profile alignment (FAVAT). The core concept of FAVAT is to reveal the desired features of a target enzyme or protein by voting on three different property groups aligned by a three-profile alignment method. Functional residues of a target protein can then be retrieved by FAVAT analysis. In this study, the amidohydrolase superfamily was an interesting case for verifying the proposed approach because it contains divergent enzymes and proteins.
The FAVAT was used to identify critical residues of mammalian imidase, a member of the amidohydrolase superfamily. Members of this superfamily were first classified by their functional properties and sources of original organisms. After FAVAT analysis, candidate residues were identified and compared to a bacterial hydantoinase in which the crystal structure (1GKQ) has been fully elucidated. One modified lysine, three histidines and one aspartate were found to participate in the coordination of metal ions in the active site. The FAVAT analysis also redressed the misrecognition of metal coordinator Asp57 by the multiple sequence alignment (MSA) method. Several other amino acid residues known to be related to the function or structure of mammalian imidase were also identified.
The FAVAT is shown to predict functionally important amino acids in the amidohydrolase superfamily. This strategy effectively identifies functionally important residues by analyzing the discrepancy between the sequence and functional properties of related proteins in a superfamily, and it should be applicable to other protein families.
(The software is freely available for download from reference ).
Retrieving useful functional/structural information from a set of amino acid sequences is essential in experimental biological studies. Desired information is often obtainable by analyzing the sequence conservations, functional correlations and related structures that belong to a protein/enzyme family or superfamily. An enzyme superfamily is defined as a group of proteins that share the same structural scaffold and that undergo fundamentally similar chemical reactions. Earlier studies [3–5] adopted various pair-wise alignment and multiple sequence alignment (MSA) methods to detect the conserved residues that reveal functional roles in a set of sequences. Classical sequence comparison tools such as FASTA, BLAST, CLUSTALW, T-COFFEE and MUSCLE can detect similarities in aligned sequences and identify the conserved positions. These positions are essential for further functional analysis. Hierarchical analysis [10–12] is often used to select the most desirable pattern of an alignment. Some protein groups with dissimilar sequences but substantial structural fold similarity (hereinafter referred to as remote homologues) have similar or related biochemical functions. These proteins can be classified into the same superfamily according to their biological properties. Due to their low overall similarity, using alignment methods alone may not reveal the amino acid residues that reflect their physicochemical properties.
In addition to alignment methods, the most common strategy for predicting functional residues from sequences is motif-based sequence analysis [14–17]. However, the motif-based approach often yields excessive false positives, which limits its use for analyzing a protein superfamily. Phylogenomic techniques such as the evolutionary trace method of identifying functionally important residues use evolutionary information to improve accuracy and are particularly useful for large-scale analyses. This method automatically relates the results back to a given structure and identifies key features structurally clustered around substrate and dimer interfaces [19–22]. This tool is useful for analyzing protein or enzyme superfamilies and for extracting functional information from enzyme families or superfamilies when the phylogenetic tree or dendrogram approximates a functional distribution.
This study employs a voting concept to search for functional key residues in an enzyme superfamily. Voting or voting-like concepts are widely used in computing algorithms for various purposes. In computational biological applications, voting concepts are often integrated with neural networks for protein clustering and structure prediction. Some theoretical analyses [24–29] indicate that comparing three sequences is better than comparing two sequences because it increases the alignment power needed to distinguish significant matches. Likewise, aligning three groups provides more information than aligning two groups does. Therefore, we developed a Feature Amplified Voting Algorithm with Three-profile alignment (FAVAT) according to the observed sequence similarity and biochemical properties of proteins in the amidohydrolase superfamily. The FAVAT identifies the key residues by calculating a score for each residue in a rat imidase. The functional residues of a rat imidase were identified and further confirmed by experimental references and available structural information.
In this study, rat imidase was the target sequence, and DRPs (Group II proteins) were classified as ~A proteins. Bacterial hydantoinases (Group III enzymes) were classified as A proteins. Although dihydroorotase, allantoinase and other amidohydrolases (Group IV) were also classified as A proteins, they differ from the Group II enzymes in their functional correlation to the target sequence (rat imidase). Following the above classification, the clustered sequences were subjected to FAVAT analysis, and two sets of scores were obtained for each residue of the target sequence. In the experiments, Groups II, III and IV were aligned using the MUSCLE tool adopted by the National Center for Biotechnology Information (NCBI) for protein database alignment.
Functional annotations of the residues in rat imidase selected by FAVAT
FAVAT score ranking
FAVAT selected residues
Corresponding residues in 1GKQ¹
Corresponding residues in 1KCX¹
MSA predicted residues²
Functional annotation based on 1GKQ
Secondary structure core residue
The possible functions of the imidase amino acids selected by FAVAT were further analyzed using 1GKQ and 1KCX, which are known structures of imidase-related proteins. The former is the crystal structure of a D-hydantoinase that represents an A protein (Group III) in FAVAT analysis. The latter is the crystal structure of a dihydropyrimidinase-related protein (CRMP1) that represents a ~A protein (Group II) in FAVAT analysis. Figure 3 shows their corresponding sequences and secondary structures. Similar β/α core structures were observed in the wiring diagrams of 1GKQ and 1KCX. The significant difference between these structures is that 1GKQ forms a typical (β/α)8 domain, but 1KCX does not. The FAVAT-selected amino acids may reflect both the structural features and the metal requirement that are responsible for the different functions of the A and ~A proteins. The positions corresponding to Ala34 and His459 in 1GKQ (Arg30 and Trp448 in the N-terminal and C-terminal β-sheets, respectively) and in 1KCX (Gln44 and Met465 in the N-terminal and C-terminal β-sheets, respectively) lie in regions that interact with another monomer to form the quaternary structure of both hydantoinase and CRMP1 [34, 35]. Residues His67, His69, Ala134, Lys159, His248 and Asp326 (His59, His61, Ala126, Kcx150, His239 and Asp315 in 1GKQ; His73, Tyr75, Asp139, Gln165, Lys254 and Gly332 in 1KCX) are located in the β/α core region.
Although the metal coordinators of imide-hydrolyzing enzymes in this case study were dispersed along the sequence, almost all the known metal coordinators in 1GKQ were identified by FAVAT except His183 (His192 in rat imidase). This residue is conserved in CRMP1, although CRMP1 lacks metal and amidohydrolytic activity. The role of this histidine needs further study. The major difference between bacterial hydantoinase and mammalian imidase is their metal content. The former contains two metal ions while the latter contains only one metal ion [39, 40]. Fewer metal coordinators may be needed for mammalian imidase, and residue His192 may not be required as a coordinator of metal ions in rat imidase. A mammalian imidase was crystallized recently. The difference between mammalian imidase and non-mammalian imidase is expected to be clarified in the near future.
The FAVAT was developed to predict functionally important amino acids in mammalian imidase. A T-score was given to each residue of the target enzyme by analyzing imidase-related proteins in the amidohydrolase superfamily on the basis of their sequence-function relationships. Of the ten top T-score amino acids selected, six (His67, His69, Lys159, His192, His248 and Asp326) corresponded to metal coordination in D-hydantoinase. The other four amino acids corresponded to positions that were structurally important for forming quaternary structures and secondary structures in 1GKQ. Residue Asp57, which was misrecognized as a metal coordinator in previous MSA analyses, was correctly recognized by FAVAT. This study showed that analyzing the discrepancy between the sequence and functional properties of related proteins in a superfamily is an effective method of identifying functionally important residues. This strategy should be applicable to other protein families, and the authors expect to employ this strategy for analyzing critical residues of viruses in future work.
Hydantoinase activity was first reported in plants and animals [42, 43] to hydrolyze hydantoin derivatives that are not known as physiological metabolites. This enzymatic activity is useful for preparing optically pure amino acids that are precursors for various antibiotics. Due to its industrial application, several hydantoinases have been studied and purified from microorganisms [45, 46]. A dihydropyrimidinase (5,6-dihydropyrimidine amidohydrolase) partially purified from animal livers was shown to hydrolyze the physiological substrate dihydropyrimidine. A detailed study of a homogeneous imide-hydrolyzing enzyme, imidase, which was purified from rat, pig or fish livers [48–51], revealed that it catalyzes the hydrolysis of a wide spectrum of substrates, including dihydropyrimidines, hydantoins and other imides. Although the substrate spectrum of hydantoinase is highly similar to that of imidase, these imide-hydrolyzing enzymes from bacterial and mammalian sources reportedly have relatively low sequence similarity. Some mammals, flies and C. elegans have proteins with high sequence similarity to dihydropyrimidinase (or imidase). These dihydropyrimidinase-related proteins (DRPs) may be involved in cancer and neuronal cell development, but possess no imidase activity [53–55]. Additionally, other enzymes revealed by studies of the evolution of the metabolic pathway are also known to use mechanisms similar to those observed in imidase [56, 57]. These enzymes include dihydroorotase, allantoinase, urease and amidohydrolases, which originate in mammals, plants and fungi. All use distinct substrates that contain similar imide functional groups.
All of the above enzymes can be classified into the amidohydrolase superfamily according to their properties and structures. In this superfamily, some proteins have similar sequences but divergent functions whereas others have similar functions but low sequence similarity. This phenomenon strongly suggests that only a few critical amino acid residues in this superfamily are needed for specific protein functions. Proteins in the amidohydrolase superfamily can be grouped according to their sequence similarity and biochemical properties, and an effective strategy for analyzing these proteins may yield valuable information.
Grouping of imidase-related proteins¹
I. Imidase (target enzyme)
II. Sequence-related proteins²
III. Functionally identical enzymes³
IV. Functionally related enzymes⁴
V. Putative proteins⁵
The FAVAT was performed in two steps. The first step was to align the target enzyme, functionally identical enzymes (A proteins) and sequence-related proteins (~A proteins) using three-profile alignment. The three-profile alignment algorithm, which is based on the dynamic programming three-way alignment approach [62, 63], was designed to align three profiles in a space. In the FAVAT pre-process, each profile can be generated by multiple sequence alignment tools such as T-COFFEE, HMMER and MUSCLE.
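The dynamic programming idea behind three-way alignment can be sketched as follows. This is a simplified illustration, not the three-profile method itself: it aligns three plain sequences rather than three profiles, returns only the optimal score, and uses a hypothetical sum-of-pairs scheme (match +1, mismatch -1, gap -2) chosen here for readability. The function names are assumptions made for this sketch.

```python
from itertools import product

NEG_INF = float("-inf")


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols; the values are illustrative, not from the paper."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def column_score(x, y, z):
    """Sum-of-pairs score of one alignment column holding three symbols."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)


def three_way_score(a, b, c):
    """Optimal sum-of-pairs score for aligning three sequences (cubic time and space)."""
    n1, n2, n3 = len(a), len(b), len(c)
    dp = [[[NEG_INF] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    dp[0][0][0] = 0
    # Each move consumes a residue (1) or inserts a gap (0) in each sequence;
    # the all-gap column is excluded, leaving seven possible moves.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if i == 0 and j == 0 and k == 0:
                    continue
                best = NEG_INF
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    x = a[pi] if di else "-"
                    y = b[pj] if dj else "-"
                    z = c[pk] if dk else "-"
                    best = max(best, dp[pi][pj][pk] + column_score(x, y, z))
                dp[i][j][k] = best
    return dp[n1][n2][n3]


print(three_way_score("AC", "AC", "AC"))
print(three_way_score("AC", "AC", "A"))
```

Filling the three-dimensional table visits every cell once with a constant number of moves, which is where the cubic cost of three-way alignment comes from.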
The next step after the alignment is to determine whether amino acid residues critical for imidase activity exist in the target and A proteins but are absent in ~A proteins. In this second step, a voting score (V-score) is assigned on the basis of this assumption, and the V-scores are then summed over all comparisons. A substitution matrix (BLOSUM62) is used to compute the V-score when each sequence with property A and each sequence with property ~A is compared to the target sequence. The V-score is calculated as follows:
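$$V_k(a,b) = M[t_k, A(a,k)] - M[t_k, {\sim}A(b,k)],$$
where $t_k$ is the k-th residue of the target sequence, $A(a,k)$ and ${\sim}A(b,k)$ are the residues aligned to position k in the a-th A protein and the b-th ~A protein, respectively, and $M$ is the BLOSUM62 substitution matrix.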
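Read together with the algorithm listing, the per-pair V-scores appear to be aggregated into a T-score and then rescaled as shown here. This is a restatement under assumed notation: $q$ and $p$ count the A and ~A proteins as in the listing, $T_{\min}$ and $T_{\max}$ are taken over the non-gap positions of the target, and the hat on the normalized score is introduced here for clarity only.

```latex
T_k = \sum_{a=1}^{q} \sum_{b=1}^{p} V_k(a,b),
\qquad
\hat{T}_k = \frac{T_k - T_{\min}}{T_{\max} - T_{\min}} \times 100
```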
Algorithm FAVAT (t, P, Q)
Input: Target sequence t, a set of proteins P without property A, a set of proteins Q with property A. P has p sequences and Q has q sequences.
Output: The scores corresponding to the residues of t (high T-scores indicate potentially critical residues)
/* Step 1: Do three-profile alignment by the dynamic programming method among t, the proteins P, and the proteins Q. The length of the resulting alignment is m_max. t[k] indicates the k-th residue of t. */
max <- -∞
min <- ∞
for k <- 1 to m_max do
    if t[k] <> '-' then
        /* T-score[k] indicates the potential importance of the k-th residue of t. */
        T-score[k] <- 0
        for i <- 1 to p do
            for j <- 1 to q do
                /* X and Y store the k-th aligned residue of the i-th protein in P and the j-th protein in Q, respectively. */
                (X, Y) <- (P[i][k], Q[j][k])
                /* Step 2: Find V-score[k] based on the BLOSUM62 substitution matrix. */
                V-score[k] <- BLOSUM62(t[k], Y)
                V-score[k] <- V-score[k] + (-1) × BLOSUM62(t[k], X)
                T-score[k] <- T-score[k] + V-score[k]
            end for
        end for
        max <- MAX(max, T-score[k])
        min <- MIN(min, T-score[k])
    end if
end for
for k <- 1 to m_max do
    if t[k] <> '-' then
        T-score[k] <- (T-score[k] - min) / (max - min) × 100 /* normalization */
    end if
end for
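The voting and normalization steps of the listing can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, only the handful of BLOSUM62 entries needed for the toy input are included, the gap score for A and ~A rows is an assumption (the text does not specify one), and a flat score profile is mapped to 0 to avoid division by zero. The toy columns reuse residue correspondences reported in this article (rat imidase His67, His69, Lys159, Asp326 against 1GKQ and 1KCX), treating the carboxylated lysine Kcx150 as lysine; they are hand-picked columns, not a real alignment.

```python
# Excerpt of BLOSUM62 covering only the residue pairs used in the toy example.
BLOSUM62_EXCERPT = {
    ("H", "H"): 8, ("H", "Y"): 2,
    ("K", "K"): 5, ("K", "Q"): 1,
    ("D", "D"): 6, ("D", "G"): -1,
}

GAP_SCORE = -4  # assumption: the source does not say how gaps in A or ~A rows are scored


def blosum62(x, y):
    """Look up a substitution score in the excerpt (the matrix is symmetric)."""
    if x == "-" or y == "-":
        return GAP_SCORE
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def raw_t_scores(target, not_a, a, score=blosum62):
    """Sum V-scores over every (~A, A) sequence pair at each non-gap target column."""
    totals = []
    for k, t_k in enumerate(target):
        if t_k == "-":
            totals.append(None)  # gap in the target: no score for this column
            continue
        total = 0
        for x_seq in not_a:
            for y_seq in a:
                total += score(t_k, y_seq[k]) - score(t_k, x_seq[k])
        totals.append(total)
    return totals


def normalize(totals):
    """Rescale the non-gap scores linearly to the range 0..100."""
    values = [v for v in totals if v is not None]
    if not values:
        return list(totals)
    lo, hi = min(values), max(values)
    if hi == lo:  # assumption: a flat profile maps to 0
        return [None if v is None else 0.0 for v in totals]
    return [None if v is None else (v - lo) / (hi - lo) * 100 for v in totals]


# Toy aligned rows of equal length; column 2 is a gap in the target.
target = "HH-KD"
a_proteins = ["HHAKD"]      # stands in for a Group III (A) row
not_a_proteins = ["HYAQG"]  # stands in for a Group II (~A) row

raw = raw_t_scores(target, not_a_proteins, a_proteins)
t_scores = normalize(raw)
print(raw)
print(t_scores)
```

On this toy input the raw sums at the four non-gap columns are 0, 6, 4 and 7. The histidine conserved in both rows scores lowest, and the aspartate replaced by glycine in the ~A row reaches 100 after normalization.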
The novel feature of the FAVAT algorithm is its use of the sequence and functional properties shared among the target sequence, ~A proteins and A proteins. When voting for reliable critical residue candidates, three relations are considered: the relation between the target sequence and A proteins, the relation between the target sequence and ~A proteins, and the relation between A proteins and ~A proteins. To identify the key residues accurately, alignment tools that take physicochemical properties into account, such as T-COFFEE and HMMER, can be employed in the FAVAT pre-process to align the A and ~A proteins separately into profiles. Appropriate alignments of the A and ~A proteins can enhance the accuracy of the subsequent three-profile alignment of the target sequence, A proteins and ~A proteins. The most important residues can then be found from the resulting alignment using FAVAT. The FAVAT algorithm captures the importance of alignment-based voting through the V-score function. The time complexity of FAVAT is $O(m_{\max}^{3})$, where $m_{\max}$ is the length of the alignment produced by three-profile alignment. To reduce the running time of the three-profile alignment method, this study developed a parallel version implemented with the MPICH library. The time complexity of the parallel version is $O(m_{\max}^{3}/p)$, where p here denotes the number of processors rather than the number of sequences in P. After the voting process, the residue candidates obtain high T-scores. Non-critical candidates can then be eliminated by further study.
We would like to thank the anonymous referees for many constructive comments during the revision. We also would like to thank Ted Knoy for editorial assistance. Part of this work was supported by the National Science Council (NSC) under grants NSC98-2218-E-007-005 and NSC97-2221-E-182-033-MY3. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).
This article has been published as part of BMC Genomics Volume 11 Supplement 3, 2010: The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.