Prediction and classification of aminoacyl tRNA synthetases using PROSITE domains
© Panwar and Raghava. 2010
Received: 15 January 2010
Accepted: 22 September 2010
Published: 22 September 2010
Aminoacyl tRNA synthetases (aaRSs) catalyse the first step of protein synthesis in all organisms. They are responsible for the precise attachment of amino acids to their cognate transfer RNAs. There are twenty different types of aaRSs, unique for each amino acid. These aaRSs have been divided into two classes, each comprising ten enzymes. It is important to predict and classify aaRSs in order to understand protein synthesis.
In this study, all models were developed on a non-redundant dataset containing 117 aaRSs and an equal number of non-aaRSs, in which no two sequences have more than 30% similarity. First, we applied the similarity search technique BLAST and achieved a maximum accuracy of 67.52%. We observed that 62% of tRNA synthetases contain one or more of the following four PROSITE domains: PS50862, PS00178, PS50860 and PS50861. An SVM-based model developed to discriminate between aaRSs and non-aaRSs achieved a maximum MCC of 0.68 with an accuracy of 83.73%, using selected dipeptide composition. A hybrid approach, in which the SVM model was developed using selected dipeptide composition together with information on the four PROSITE domains, achieved a maximum MCC of 0.72 with an accuracy of 85.49%. We further developed an SVM-based model for classifying aaRSs into class-1 and class-2 using selected dipeptide composition, and achieved an MCC of 0.79. We also observed that two domains (PS00178, PS50889) were preferred in class-1 and three domains (PS50862, PS50860, PS50861) in class-2. A hybrid method developed using these domains as descriptors, along with selected dipeptide composition, achieved an MCC of 0.87 with a sensitivity of 94.55% and an accuracy of 93.19%. All models were evaluated using a five-fold cross-validation technique.
We have analyzed protein sequences of aaRSs (class-1 and class-2) and non-aaRSs and identified interesting patterns. The high accuracy achieved by our SVM models using selected dipeptide composition demonstrates that certain types of dipeptide are preferred in aaRSs. We were able to identify PROSITE domains that are preferred in aaRSs and their classes, providing interesting insights into tRNA synthetases. The method developed in this study will be useful for researchers studying aaRS enzymes and tRNA biology. The web server based on this study is available at http://www.imtech.res.in/raghava/icaars/.
Aminoacyl tRNA synthetases (aaRSs) play a central role in protein translation by covalently linking the correct amino acid to its cognate transfer RNA. This covalent linkage is a two-step aminoacylation reaction and ensures the fidelity of translation of the genetic code. In the first step, an amino acid (aa) is activated by ATP, releasing pyrophosphate (PPi), and is converted into the aminoacyl-adenylate (aa-AMP) complex. This complex remains bound to the tRNA synthetase. In the second step, the activated amino acid is transferred onto the 2'-OH or 3'-OH group of the terminal ribose of the corresponding tRNA (aa-tRNA). The aaRSs also perform an editing activity that clears mischarged tRNA. This editing activity is shown by both class-1 (ValRS, IleRS and LeuRS) and class-2 (ThrRS, AlaRS, ProRS and PheRS) tRNA synthetases. Defects in the editing activity of aaRSs can be lethal and may lead to many pathological problems, e.g. neuronal pathologies (encephalopathy, cerebellar ataxia and peripheral neuropathy), autoimmune disorders and disrupted metabolic conditions [4–8].
Studies of tRNA and tRNA synthetases from bacteria, fungi, plants and mammals have shown that there are twenty aminoacyl tRNA synthetases in all organisms and each is specific for a single amino acid. Aminoacyl-tRNA synthetases differ in amino acid sequence length, three-dimensional structure, molecular weight and subunit organization, and have limited sequence homology [10–12]. Based on multiple sequence analysis and the architecture of their catalytic sites, aaRSs are divided into two classes of ten aaRSs each. The structural characteristics of the catalytic domains of all aaRSs reveal that the active sites of class-1 enzymes contain the classical Rossmann dinucleotide-binding fold and two signature peptides, HIGH and KMSKS. The active sites of class-2 aaRSs contain an anti-parallel β-sheet flanked by helices on both sides, and have three signature motifs (motif 1, motif 2 and motif 3) [14, 15]. The catalytic site-based partition of aaRSs into two classes correlates strongly with function. An amino acid is transferred onto the 2'-OH group of the ribose of the last nucleotide of the tRNA by class-1 synthetases (ArgRS, CysRS, GlnRS, IleRS, LeuRS, GluRS, MetRS, TrpRS, TyrRS & ValRS) and onto the 3'-OH group by class-2 synthetases (AlaRS, AsnRS, AspRS, GlyRS, HisRS, LysRS, PheRS, ProRS, SerRS & ThrRS). Two synthetases, PheRS and LysRS, are exceptions to this rule. All known PheRSs belong to class-2 on the basis of their structural characteristics, but transfer amino acids onto the 2'-OH group [16, 17]. The lysyl-tRNA synthetases are found in both class-1 and class-2. Most LysRSs belong to class-2 and are found in many eubacteria and all eukaryotes, while class-1 LysRSs are mainly found in archaea and some eubacteria [18, 19]. This class determination is very important for the functional annotation of tRNA synthetases. At present, genome sequencing projects continuously produce huge amounts of sequence data, but the function of several proteins is still unclear.
Some aaRSs have been recognized as validated drug targets, and each of the twenty essential aaRSs represents a potential drug target. Many methods have previously been developed for the prediction of DNA or RNA binding proteins [21–25] and for their further classification into mRNA, rRNA, tRNA & snRNA binding proteins [21, 22]. In this paper, an attempt has been made to predict and classify aaRSs from the primary structure of the protein. A novel hybrid approach based on SVM and PROSITE has been adopted in order to predict aaRSs and further classify them into two classes. In this hybrid approach, the most distinguishable domains were selected from the PROSITE database by using the ProfileScan method of InterProScan, and integrated into composition-based SVM models. The domains PS50862, PS00178, PS50860 and PS50861 were used for the prediction of aaRSs, and one additional domain, PS50889, was integrated with SVM for the classification of aaRSs into class-1 and class-2.
The performance of BLAST at E-value threshold 10.
Discrimination between aaRSs and Non-aaRSs
Discrimination between class-1 and class-2 aaRSs
We have developed two different types of prediction methods: (1) prediction of aaRSs, and (2) discrimination between class-1 and class-2 aaRSs. For each prediction, we applied five approaches: (i) domain-based approach, (ii) SVM modules using amino acid composition, (iii) SVM modules using dipeptide composition, (iv) Hybrid approach1 based on SVM using composition and PROSITE domains and (v) Hybrid approach2 based on SVM using dipeptide composition and PROSITE domains. We trained and tested all our models on a 30% non-redundant dataset of aaRSs, non-aaRSs, class-1 and class-2 aaRSs.
First we developed prediction tools for discriminating between aaRSs and non-aaRSs. We used aaRSs and non-aaRSs as positive and negative instances respectively.
A weight matrix-based signature profile captures evolutionary information about a protein family or group of protein sequences. It is possible to discriminate some protein families based on their distinguishable constant and variable regions. We therefore used PROSITE through the ProfileScan method of InterProScan. We used aaRS and non-aaRS protein sequences, each group represented by 117 sequences. We analysed all profiles and selected the four most distinguishable domains, which were PS50862, PS00178, PS50860 and PS50861. These are ProfileScan accession numbers; the short names of these domains are AA_TRNA_LIGASE_II, AA_TRNA_LIGASE_I, AA_TRNA_LIGASE_II_ALA and AA_TRNA_LIGASE_II_GLYAB respectively. It has been observed that 62% of aaRSs contain at least one of these PROSITE domains; about 9.4% of non-aaRSs also contain one of the above-mentioned domains. This indicates that a domain-based approach is not sufficient for discriminating amongst all aaRSs. Thus, there is a need to develop more sophisticated techniques in combination with domain-based approaches. An SVM-based classifier was developed using a vector of four dimensions, one for each domain; this achieved 61.41% sensitivity, 90.65% specificity, 76.08% accuracy and 0.54 MCC. In this dataset, ~38% of tRNA synthetases lacked any distinguishable PROSITE domain.
In hybrid approach1, we combined the 20 features of amino acid composition with 4 features for the four selected PROSITE domains, generating a vector of 24 dimensions. We developed an SVM-based model and achieved 80.36% sensitivity, 79.64% specificity, 79.93% accuracy and 0.60 MCC.
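The construction of the 24-dimensional hybrid vector can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text states only that one feature is used per domain, so the binary presence/absence encoding (1.0 if ProfileScan reports the domain, 0.0 otherwise) and the function names `domain_features` and `hybrid_vector` are assumptions.

```python
# Sketch of the hybrid feature vector: composition features followed by
# one feature per selected PROSITE domain.
# ASSUMPTION: each domain is encoded as 1.0 (present) or 0.0 (absent);
# the paper does not specify the encoding.

AARS_DOMAINS = ("PS50862", "PS00178", "PS50860", "PS50861")


def domain_features(domain_hits, domains=AARS_DOMAINS):
    """Return one presence/absence value per domain, in a fixed order."""
    hits = set(domain_hits)
    return [1.0 if d in hits else 0.0 for d in domains]


def hybrid_vector(composition, domain_hits, domains=AARS_DOMAINS):
    """Append the domain features to a composition vector."""
    return list(composition) + domain_features(domain_hits, domains)


# Example: a placeholder 20-value amino acid composition plus one domain hit.
vector = hybrid_vector([0.05] * 20, {"PS00178"})
print(len(vector))   # 24 dimensions
print(vector[-4:])   # domain part of the vector
```

Passing the five-domain tuple used for class-1/class-2 discrimination as `domains` would give the 25-dimensional vector in the same way.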
The performance of SVM modules developed for discriminating aaRSs and non-aaRSs.
Name of Approach
Amino acid Composition
We further developed tools for discriminating class-1 and class-2 aaRSs. We used class-1 and class-2 aaRSs as positive and negative instances respectively. We applied the same approaches and techniques that were used for the prediction of aaRSs.
We analysed all protein sequences of class-1 and class-2 aaRSs by using ProfileScan. We selected the five most distinguishable domains, which were PS50862, PS00178, PS50860, PS50861 and PS50889 (ProfileScan accession numbers). Domains PS50862, PS00178, PS50860 and PS50861 were already used for the prediction of aaRSs. The additional domain, PS50889, is an S4 RNA-binding domain profile commonly found in tyrosyl-tRNA synthetases (class-1). It was observed that two domains (PS00178, PS50889) are preferred in class-1 and three domains (PS50862, PS50860, PS50861) in class-2. An SVM-based classifier was developed using 5 features, one for each domain, and achieved 100.00% sensitivity, 53.85% specificity, 74.64% accuracy and 0.56 MCC. In this case, class-1 was considered as the positive and class-2 as the negative class. These results show that the method recovered all class-1 sequences (100% sensitivity) but performed poorly on class-2 (53.85% specificity). Thus there is a need to develop a method which can discriminate the two classes with reasonable accuracy.
In the hybrid approach1, a vector of 25 dimensions was created from five domains and the 20 features of amino acid composition. Finally, an SVM-based model has been developed using the above vector and this achieved 81.09% sensitivity, 83.08% specificity, 82.18% accuracy and 0.65 MCC.
The performance of SVM modules developed for discriminating class-1 and class-2 aaRSs.
Name of Approach
Amino acid composition
The performance of Hybrid approach2 based SVM model on independent dataset.
Prediction results of SVM-Prot by using 117 aaRSs of main dataset.
Predicted Result of SVM-Prot
Number of aaRSs
EC 6.1.-.-: Ligases - Forming Carbon-Oxygen Bonds
EC 2.7.-.-: Transferases - Transferring Phosphorus-Containing Groups
EC 3.6.-.-: Hydrolases - Acting on Acid Anhydrides
All lipid-binding proteins
EC 3.1.-.-: Hydrolases - Acting on Ester Bonds
EC 2.4.-.-: Transferases - Glycosyltransferases
EC 5.4.-.-: Isomerases - Intramolecular Transferases
Ligases - Forming Carbon-Oxygen Bonds
TC 3.A.5 Type II (general) secretory pathway (IISP) family
Tyrosine Kinase Receptors
In recent years, rapid advances in genomic and proteomic studies have yielded a tremendous amount of data. The functional annotation of all these sequences using experimental approaches is a very labour-intensive and time-consuming process. Therefore, computational approaches are required to fill the gap. Prediction-based functional annotation of all protein classes is not possible. Thus, it is important to concentrate on a single class of functionally important proteins. The aaRS enzymes constitute a major protein class and play a vital role in protein synthesis. The catalytic activities of these aaRSs affect the determination of the genetic code. For this reason, they are essential for protein synthesis and cell viability. The catalytic specificity of aaRSs is necessary for cell survival. Many natural compounds and antibiotics specifically target aaRSs, and inhibit the growth or survival of the target bacteria. The aaRSs have already been recognized as validated drug targets. At present, resistance to existing antibiotics is continuously increasing and we need more novel antimicrobial agents directed against novel targets. Each of the twenty essential aaRSs represents a potential drug target. However, these are still poorly exploited drug targets and only one aaRS inhibitor, mupirocin, is a marketed drug; it specifically targets the isoleucyl-tRNA synthetase. The investigation of aaRS families by genomic and biochemical research has been suggested. In this respect, aaRSs constitute a promising platform for the development of novel antibiotics, which are predicted to have no cross-resistance to other classical antibiotics. This protein family has limited sequence homology; therefore, we need more powerful computational tools than BLAST and other similarity-based methods.
In order to assist biologists in assigning the function of unknown aaRS proteins, a systematic attempt has been made to predict aaRS proteins and their classes. We obtained aaRS protein sequences from the ENZYME database of ExPASy and created class-1 and class-2 specific datasets. The creation of a negative dataset (non-aaRSs) is as important as that of the positive dataset (aaRSs) for developing any classification method. Thus, we manually extracted non-aaRSs from Swiss-Prot by using appropriate search options. We selected the four most distinguishable domains from PROSITE, PS50862, PS00178, PS50860 and PS50861, which can discriminate aaRSs from non-aaRSs. For the classification between class-1 and class-2 aaRSs, we selected one more domain, PS50889 (S4 RNA-binding). S4 is a small globular domain consisting of 60-65 amino acid residues, found in class-1 (tyrosyl-tRNA synthetases) and many other RNA-related protein families but absent in class-2 aaRSs. Domain PS00178 contains the 'HIGH' signature, which is a part of the adenylate binding site of class-1 aaRSs. Class-2 aaRSs do not share a high degree of similarity; however, at least three conserved regions are present and PS50862 is a domain of these conserved regions. PS50860 and PS50861 are the domains of AlaRS and GlyRS respectively, and both belong to class-2 aaRSs. This domain information was used for discrimination between class-1 (PS00178 and PS50889) and class-2 (PS50862, PS50860 and PS50861) aaRSs. We also found domain PS50886, which is a signature of the tRNA binding domain. This domain is widely distributed among different tRNA synthetases (both class-1 and class-2) and is also found in their association factors (such as p43, ARC1, and Trbp111 isolated from various species); this is why we have not used it in our hybrid approach. The main limitation is that ~38% of aaRSs do not contain any of these distinguishable domains.
We implemented SVM for the prediction and classification of aaRSs because machine learning-based approaches were essential for the development of prediction tools. We found that the composition of 18 dipeptides (CL, DR, FY, GM, GR, GS, IN, MV, ND, PL, QI, QR, RD, RF, ST, WF, YD, YV) was significantly different between aaRSs and non-aaRSs. The fraction of 14 dipeptides (AY, DP, DW, EN, EV, GH, IG, KM, PW, QW, SG, WD, YA, YV) differs between class-1 and class-2 aaRSs. The SVM module based on hybrid approach2 [dipeptide composition and PROSITE] achieved higher accuracy than amino acid composition, dipeptide composition and hybrid approach1 [amino acid composition and PROSITE], both in the prediction of aaRSs and in class-specific predictions. This shows that aaRSs or class-specific aaRSs that could not be predicted by dipeptide composition alone could be predicted by hybrid approach2, which contains the additional PROSITE information. These domains capture regions of the aaRS protein sequences that have remained constant throughout evolution. Therefore, it is better to use the combined approach for functional analysis where there are unique domains. We implemented all approaches in our web-server "icaars", which uses hybrid approach2 by default. We anticipate that users will want to query many sequences at a time, and the processing time of the online server "icaars" will increase accordingly. We are in the process of developing a stand-alone version to make annotation faster. We hope that the annotation of aaRS enzymes will be helpful for the design of new drug targets and for drug-discovery processes.
To conclude, the present work is an attempt to predict and classify aminoacyl tRNA synthetases. We analysed protein sequences of aaRSs (class-1 and class-2) and non-aaRSs and derived four types of distinguishing features: amino acid composition, dipeptide composition, hybrid approach1 and hybrid approach2. These features were used as input for SVM-based machine learning. The most efficient classifier was obtained from hybrid approach2. A server, icaars, has been developed based on the SVM modules obtained.
Number of protein sequences of aaRSs, class-1 aaRSs and class-2 aaRSs at different redundancy level by using CD-HIT software.
Firstly, we used aaRSs as the positive dataset and non-aaRSs as the negative dataset to develop the tools for prediction of aaRSs. Protein sequences of both the positive and negative datasets were divided into five parts, so that each of the five sets consists of one-fifth of the aaRSs and one-fifth of the non-aaRSs. For training, testing and evaluating our methods, we used a five-fold cross-validation technique. In this technique, training and testing are carried out five times, each time using one distinct set for testing and the remaining four sets for training. Secondly, we used class-1 aaRSs as the positive dataset and class-2 aaRSs as the negative dataset, and repeated the same five-fold cross-validation procedure to develop the tools for discrimination between class-1 and class-2 aaRSs.
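The five-fold partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions: the shuffling, the fixed random seed, and the function name `five_fold_splits` are hypothetical and not taken from the study, which does not say how sequences were assigned to the five sets.

```python
import random


def five_fold_splits(positives, negatives, k=5, seed=0):
    """Yield (train, test) lists of (item, label) pairs for k-fold CV.

    Positives and negatives are partitioned separately so that every
    fold holds about one k-th of each class.
    ASSUMPTION: random assignment with a fixed seed; illustrative only.
    """
    rng = random.Random(seed)
    pos = [(x, 1) for x in positives]
    neg = [(x, -1) for x in negatives]
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Fold i receives every k-th positive and every k-th negative.
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Example with placeholder identifiers for 117 aaRSs and 117 non-aaRSs.
aars = [f"aaRS_{i}" for i in range(117)]
non_aars = [f"non_{i}" for i in range(117)]
for train, test in five_fold_splits(aars, non_aars):
    print(len(train), len(test))
```

Each sequence appears in exactly one test set, so the five test-set results can be averaged into a single performance figure.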
PROSITE is a database of protein families and domains. It is apparent, when studying protein sequences, that during evolution all protein families conserve some portion of their sequences for efficient function and/or stability of the three-dimensional structure, which distinguishes their members from all other unrelated proteins. InterProScan (iprscan) is a Perl-based stand-alone tool that combines different protein signature-recognition methods into a single resource. The PROSITE database is an integrated part of InterProScan. We applied version 4.3 of the InterProScan tool, using the PROSITE-based ProfileScan method, to all datasets of aaRSs, class-1 aaRSs, class-2 aaRSs and non-aaRSs. This is a weight matrix-based technique and is useful for the detection of diverse protein sequences.
The aim of calculating the composition of proteins is to transform variable-length protein sequences into fixed-length feature vectors. This is an important and crucial step because SVM machine learning techniques require fixed-length patterns. The amino acid composition is the fraction of each amino acid in a protein sequence and provides a vector of 20 dimensions. The dipeptide composition was used to encapsulate the global information about each protein sequence, and gives a fixed-length vector of 400 (20 × 20) dimensions. Both amino acid and dipeptide compositions were calculated and used as SVM input for classification between aaRSs and non-aaRSs as well as between class-1 and class-2 aaRSs.
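The two composition vectors can be computed as in the following sketch. It assumes that compositions are expressed as fractions (the text describes amino acid composition as a fraction; the same normalization is assumed here for dipeptides), that dipeptides are overlapping adjacent residue pairs, and that non-standard residues are skipped. The function names are illustrative.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard residues (20-dimensional vector)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Fraction of each overlapping adjacent residue pair (400-dimensional).

    ASSUMPTION: pairs containing a non-standard residue are ignored.
    """
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(pairs)
    total = len(pairs)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Example on a short toy sequence.
toy = "MKHIGHAAC"
print(len(amino_acid_composition(toy)), len(dipeptide_composition(toy)))
```

Whatever the sequence length, the outputs always have 20 and 400 values, which is what makes them usable as SVM input.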
In this study, a highly successful machine learning technique, the Support Vector Machine (SVM), was used. SVM is based on the structural risk minimization principle of statistical learning theory; SVMs are a set of related supervised learning methods used for classification and regression. SVM allows us to choose a number of parameters and kernels (e.g. linear, polynomial, radial basis function and sigmoidal) or any user-defined kernel. In this study, we used the SVMlight package, version 6.01, and learning was carried out with three kernels (linear, polynomial and radial basis function). SVM takes a set of feature vectors as input, along with their known outputs, which are used for training the model. After training, the learned model can be used for the prediction of unknown examples. In this work, SVM training was carried out by optimizing various kernel function parameters and the value of the regularization parameter C. Preliminary tests showed that the radial basis function (RBF) kernel gives better results than the other kernels. Therefore, the RBF kernel was used for all the experiments in this work. In total, four feature sets were used for SVM-based machine learning: amino acid composition, dipeptide composition, hybrid approach1 [amino acid composition and PROSITE] and hybrid approach2 [dipeptide composition and PROSITE].
One of the common practices for predicting the function of a new protein is to perform a similarity search against a database of well-annotated proteins. In this study, we used BLAST for predicting tRNA synthetase proteins using 5-fold cross-validation, where four sets of aaRS and non-aaRS proteins were used to create a BLAST database and the aaRS proteins of the corresponding test set were searched against this BLAST database. This process was repeated five times, so that the BLAST search was performed once for each tRNA synthetase protein. We calculated the performance of BLAST in terms of accuracy (percentage coverage), which indicates the proportion of proteins correctly predicted from the BLAST search. Positive and negative sequences without any hit (target) were counted as false negatives and false positives respectively.
We selected the most significant dipeptides from all datasets by using WEKA version 3.6.0. WEKA is a package of Java programs for machine learning. We used the SVMAttributeEval attribute evaluator (parameters -X 1 -Y 0 -Z 0 -P 1.0E-25 -T 1.0E-10 -C 1.0 -N 0) with the Ranker search method (parameters -T -1.7976931348623157E308 -N -1). We used the selected dipeptide compositions together with PROSITE domains for the hybrid approaches.
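As an illustration of how feature vectors are passed to SVMlight, the sketch below writes labelled vectors in the sparse text format read by `svm_learn`: one example per line, a target of +1 or -1, followed by `index:value` pairs with indices starting at 1 in ascending order. The helper name `to_svmlight_line` and the number formatting are assumptions, not part of the study.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...'.

    Feature indices start at 1 and are written in ascending order;
    zero-valued features are omitted, which the sparse format permits.
    """
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{value:.6g}" for i, value in enumerate(features, start=1) if value != 0]
    return " ".join([target] + pairs)


def svmlight_lines(examples):
    """Format an iterable of (label, features) pairs, one line per example."""
    return [to_svmlight_line(label, features) for label, features in examples]


# Example: one positive (aaRS) and one negative (non-aaRS) toy vector.
for line in svmlight_lines([(1, [0.5, 0.0, 0.25]), (-1, [0.0, 1.0, 0.0])]):
    print(line)
```

A file of such lines can then be given to `svm_learn`; in SVMlight the RBF kernel is selected with `-t 2`, with `-g` setting the kernel width and `-c` the regularization parameter C. The parameter values used in the study are not stated in the text.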
In the case of prediction between aaRSs and non-aaRSs -
TP is the number of aaRSs correctly predicted as positive (aaRSs)
TN is the number of non-aaRSs correctly predicted as negative (non-aaRSs)
FP is the number of non-aaRSs wrongly predicted as positive (aaRSs)
FN is the number of aaRSs wrongly predicted as negative (non-aaRSs).
In the case of prediction between aaRSs class-1 and aaRSs class-2 -
TP is the number of class-1 aaRSs correctly predicted as positive (class-1)
TN is the number of class-2 aaRSs correctly predicted as negative (class-2)
FP is the number of class-2 aaRSs wrongly predicted as positive (class-1)
FN is the number of class-1 aaRSs wrongly predicted as negative (class-2).
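Given these four counts, the reported measures (sensitivity, specificity, accuracy and MCC) can be computed as in the sketch below. The formulas are the standard definitions and are assumed to match those used in the study; returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
import math


def performance(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy (in %) and MCC.

    Standard definitions; ASSUMPTION: undefined ratios are reported as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = 100.0 * ratio(tp, tp + fn)
    specificity = 100.0 * ratio(tn, tn + fp)
    accuracy = 100.0 * ratio(tp + tn, tp + tn + fp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_denominator)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Example with made-up counts (not results from the study).
print(performance(tp=50, tn=40, fp=10, fn=0))
```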
The performance of a method is the average over the five subsets created by the five-fold cross-validation technique. For the evaluation of any prediction method, MCC is considered to be the most robust parameter. An MCC value of 1 corresponds to a perfect prediction, whereas 0 corresponds to a completely random prediction. A limitation of all the parameters described above is that they are threshold-dependent and require proper optimization for better performance. We optimized the threshold manually and selected the value that gave the best performance. A well-known threshold-independent measure is the Receiver Operating Characteristic (ROC) curve, which is a plot of the true positive proportion, TP/(TP+FN), against the false positive proportion, FP/(FP+TN). We used the SPSS package to plot ROC curves.
We have developed a user-friendly web-server "icaars" for the prediction of aaRSs. This prediction method is freely available from URL http://www.imtech.res.in/raghava/icaars/. It was developed under a Solaris environment on a SUN system, using CGI-PERL. The server predicts whether a query protein sequence is an aaRS or a non-aaRS. If a protein sequence is predicted as an aaRS, the server further predicts whether it belongs to class-1 or class-2 aaRSs. All datasets used in this study are available from this server.
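To make the threshold-independent evaluation concrete, the sketch below computes ROC points by sweeping the decision threshold over a set of prediction scores, and estimates the area under the curve with the trapezoidal rule. The study used SPSS for this step; the code is only an illustrative equivalent, and the function names, the +1 label convention for positives, and the toy scores are assumptions.

```python
def roc_points(scores, labels):
    """Return (false positive proportion, true positive proportion) pairs.

    An example is predicted positive when its score >= threshold; the
    threshold is swept over the distinct scores from high to low.
    Labels are 1 for positives; anything else is treated as negative.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative examples are required")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y != 1)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Example with toy SVM scores (not data from the study).
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1])
print(pts, area_under_curve(pts))
```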
We are grateful to Dr. Purnananda Guptasarma for critically reading this manuscript. The authors are thankful to Council of Scientific and Industrial Research (CSIR), Govt. of India for financial support under project Open Source Drug Discovery (OSDD). This report has Institute of Microbial Technology (IMTECH) communication number 065/2009.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.