Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition
© Habib et al. 2008
Published: 20 March 2008
Knowing where a protein occurs in the cell is an important step in understanding its function, so it is highly desirable to predict a protein's subcellular location automatically from its sequence. The most studied methods for predicting the subcellular localization of proteins are based on signal peptides, on sequence homology, and on the correlation between the total amino acid compositions of proteins. Taking both amino acid composition and amino acid pair composition into consideration helps improve the prediction accuracy.
We constructed a dataset of protein sequences from the SWISS-PROT database and divided them into 12 classes based on their subcellular locations. SVM modules were trained to predict the subcellular location from amino acid composition and amino acid pair composition. Results were calculated after 10-fold cross-validation. The radial basis function (RBF) kernel outperformed the polynomial and linear kernel functions. Total prediction accuracy reached 71.8% for amino acid composition and 77.0% for amino acid pair composition. To observe the impact of the number of subcellular locations, we constructed two more datasets with nine and five subcellular locations, for which total accuracy improved further to 79.9% and 85.66%, respectively.
A new SVM-based approach is presented that uses amino acid and amino acid pair composition. The results show that data simulation and taking more protein features into consideration improve the accuracy to a great extent. It was also noticed that the dataset needs to be crafted to take account of the distribution of data across all classes.
Subcellular localization is a key functional characteristic of proteins. Each protein has some elementary functions, and to cooperate in carrying out such physiological functions, proteins must be localized to the correct intra- or extracellular compartments, either in a soluble form or attached to a membrane. Although the subcellular location of a protein can be determined by various experimental methods, acquiring this knowledge solely through experiments is time-consuming and costly. Because the number of protein sequence entries is increasing rapidly, it is highly desirable to develop a theoretical method that predicts protein subcellular location quickly and accurately. A number of automated systems have been developed for this purpose. Most of these methods are based on signal peptides, on sequence homology, or on the correlation between the total amino acid compositions of proteins and their location, using artificial neural networks (ANNs). Some prediction schemes rely upon the identification of a key sorting signal (e.g. a signal peptide, mitochondrial targeting signal or nuclear localization signal), the presence of which suggests a fairly unambiguous subcellular localization. von Heijne and colleagues have developed subcellular localization predictors designed to identify either signal peptides or chloroplast transit peptides. According to Claros et al. (1997) and Nakai and Kanehisa (1992), the merit of such predictors is their biological basis: newly synthesized proteins are directed to their destination by an intrinsic signal sequence, whether they are to be translocated through a membrane into a particular organelle or to become integrated into the membrane.
The utility of such localization prediction depends on the availability of an accurate N-terminal sequence. This can be problematic when start codons are not predicted correctly, because leader sequences may then be missing or only partially included, confusing the algorithms that depend on them. Subcellular localization can often also be assigned by searching for homologous sequences. Several localization predictors instead assign proteins to compartments on the basis of amino acid sequence composition, correlating a typical amino acid composition with localization to a particular subcellular compartment or organelle. Unfortunately, these studies of the predictive power of amino acid compositional data for subcellular localization were restricted to small sets of only a few hundred proteins. Other studies showed that classifying proteins into 12 groups according to their subcellular locations improves the prediction accuracy. On the basis of this classification, a covariant discriminant algorithm was proposed to predict the subcellular location of a query protein. This method is also based on amino acid composition, and the results obtained through self-consistency, jackknife and independent dataset tests indicate an improved accuracy rate.
In this paper, an attempt is made to improve the prediction accuracy of subcellular localization of proteins using a support vector machine (SVM). Two feature vectors, i.e. amino acid composition and amino acid pair (dipeptide) composition, are considered for the prediction.
Performance comparison of accuracies for various values of the gamma (γ) parameter of the RBF kernel in the SVM.
Performance comparison of total accuracies for various values of the C parameter (trade-off between training error and margin) with the RBF kernel in the SVM.
Comparison of total accuracies for different values of the degree (d) parameter of the polynomial kernel in the SVM.
Comparison of TNf, TPf and TA for the balanced and unbalanced datasets using the polynomial kernel in the SVM with d = 2.
The SVM was provided with a 20-dimensional feature vector based on amino acid composition. RBF, polynomial and linear kernel functions were used, each with the best-performing parameter values found. The best results were achieved with the RBF kernel, with the gamma factor and the regularization parameter C optimized to 0.1 and 100, respectively. The results obtained after five-fold cross-validation give an overall total accuracy of 71.8%.
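To make the three kernel choices concrete, the following is a minimal sketch of how each one scores a pair of feature vectors. The function names are illustrative, the default `gamma=0.1` simply mirrors the value reported above, and the `+ 1` offset in the polynomial kernel is an assumption about its parameterization rather than something stated in the article.

```python
import math


def linear_kernel(x, y):
    """Plain inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2):
    """(x . y + 1) ** degree; the +1 offset is an assumed parameterization."""
    return (linear_kernel(x, y) + 1) ** degree


def rbf_kernel(x, y, gamma=0.1):
    """exp(-gamma * ||x - y||^2), the radial basis function kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

With the RBF kernel, identical vectors score exactly 1 and the score decays toward 0 as the vectors move apart, with gamma controlling how quickly.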
Prediction results of Support Vector Machine (SVM) with radial basis function (RBF) kernel. Table shows comparison of true negative fraction (TNf), true positive fraction (TPf), and total accuracy (TA).
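As an illustration of the three measures named in this caption, the sketch below computes them from true and predicted binary labels. The formulas are the conventional definitions (TPf as sensitivity, TNf as specificity, TA as overall accuracy) and are assumed here, since the article text does not spell them out; the function name is hypothetical.

```python
def binary_metrics(true_labels, predicted_labels):
    """Return TPf, TNf and TA for labels coded as +1 (positive) and -1 (negative)."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth > 0 and predicted > 0:
            tp += 1
        elif truth > 0:
            fn += 1
        elif predicted > 0:
            fp += 1
        else:
            tn += 1
    total = tp + tn + fp + fn
    return {
        # fraction of true positives among all actual positives
        "TPf": tp / (tp + fn) if tp + fn else 0.0,
        # fraction of true negatives among all actual negatives
        "TNf": tn / (tn + fp) if tn + fp else 0.0,
        # fraction of all examples classified correctly
        "TA": (tp + tn) / total if total else 0.0,
    }
```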
Subcellular locations and number of sequences in each location for the unbalanced and balanced datasets. Columns: unbalanced data (number of sequences); balanced data (number of sequences).
Comparison of total accuracies of our method with other methods.
Generalization ability is important for learning algorithms because the main purpose of learning is to predict unseen data accurately. Comprehensibility, i.e. the transparency of the learned knowledge and the ability to explain the reasoning process, is also important for a learning algorithm. Neural networks are good examples of generalization ability, whereas decision trees are good examples of comprehensibility.
Support Vector Machines (SVMs), proposed by Vapnik and co-workers [10–12], are a new generation of learning systems based on advances in statistical learning theory. SVMs deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification and biosequence analysis. Support vector training algorithms draw on principles of optimization, generalization and kernel theory. Central to the construction of the support vector learning algorithm is the “inner-product kernel” between a “support vector” and an input vector. To design a learning algorithm, one must choose a class of functions whose capacity can be computed. The goal of an SVM is to construct a classifier that correctly classifies the data instances in the testing data. Each instance in the training data contains one class label and one feature vector. SV classifiers are based on the class of hyperplanes corresponding to decision functions. The support vectors are the data points that lie closest to the decision surface; they form a small, informative subset of the training data. Kernel functions are used to construct the optimal hyperplane, for which the margin of separation (i.e. the distance to the closest data point) is maximized. From a set of positively and negatively labelled training vectors, support vector learning algorithms learn a classifier that can be used to classify test samples. The SVM learns the classifier by solving an optimization problem, i.e. a trade-off between maximizing the geometric margin and minimizing margin violations. The classifier maps the input samples into a high-dimensional feature space and seeks a hyperplane that separates the positive samples from the negative ones with the largest possible margin.
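For reference, the classifier described above is conventionally written as follows. This is a sketch of the textbook formulation using the symbols of this section, not a reproduction of the article's own displayed equations; the bias term b and the predicted label ŷ are assumed notation.

```latex
% linear form: prediction from the sign of the decision function
\hat{y} = \operatorname{sign} f(\mathbf{x}), \qquad
f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \qquad
\mathbf{w} = \sum_{i=1}^{N} \alpha_i \, y_i \, \mathbf{x}_i

% kernelized form: the inner product is replaced by a kernel K
f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b
```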
where sign f(x) gives the prediction, w is a vector formed as a combination of the weights α_i and labels y_i of the feature vectors x_i, K is the kernel function, and N is the size of the training set.
We constructed twelve SVM modules, one per location, to classify proteins. Each SVM module is trained with all samples of one class labelled positive and all remaining samples labelled negative, i.e. as 1-v-r (one-versus-rest) SVMs. The goal is to construct a binary classifier, or derive a decision function, from the available samples. Input vectors of 20 amino acid compositions and of 400 amino acid pair compositions were both used in order to compare performance and prediction accuracy.
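The one-versus-rest setup can be sketched in a few lines. The relabelling step follows the description above; combining the twelve binary outputs by taking the largest decision value is a common convention and is an assumption here, since the article does not state how the outputs are merged. Function names and the example location names are hypothetical.

```python
def one_vs_rest_labels(class_labels, positive_class):
    """Relabel a multi-class list: +1 for the chosen class, -1 for all others."""
    return [1 if label == positive_class else -1 for label in class_labels]


def predict_location(decision_values):
    """Pick the class whose binary classifier returns the largest decision value.

    `decision_values` maps each class name to the output f(x) of its classifier.
    Taking the maximum is an assumed combination rule.
    """
    return max(decision_values, key=decision_values.get)
```

For example, `one_vs_rest_labels(["nuclear", "cytoplasm", "nuclear"], "nuclear")` yields the training labels for the hypothetical "nuclear" module, and the same call with each of the other locations yields the labels for the remaining modules.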
Amino acid composition consists of 20 components, representing the occurrence frequency of each of the 20 native amino acids in a given protein, and thus corresponds to a 20-dimensional vector. In this study, we considered both amino acid composition and amino acid pair composition in order to detect different sequence features. The feature vector has 20 dimensions for amino acid composition and 400 for amino acid pair composition.
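The two feature vectors described above can be computed as in the following minimal sketch. It assumes that pairs are counted over adjacent, overlapping positions and that frequencies are normalized by sequence length (or by the number of adjacent pairs); the article does not specify these details, and the function names and residue ordering are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(sequence):
    """Return 20 values: the fraction of the sequence made up by each residue."""
    sequence = sequence.upper()
    if not sequence:
        return [0.0] * len(AMINO_ACIDS)
    return [sequence.count(residue) / len(sequence) for residue in AMINO_ACIDS]


def pair_composition(sequence):
    """Return 400 values: the fraction of adjacent positions holding each residue pair."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs containing non-standard letters are ignored
            counts[pair] += 1
    windows = len(sequence) - 1
    if windows <= 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[pair] / windows for pair in DIPEPTIDES]
```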
Two datasets with 1741 and 9943 protein sequences were used for the evaluation. The datasets were generated from versions 44.3 and 44.4 of SWISS-PROT. Since the neural networks are static pattern analyzers, the sequence datasets are segmented to provide a much coarser representation. Sequences are grouped into 12 different cellular locations (see Table 6). Sequences sharing more than 40% homology were removed using PSI-BLAST. For protein sequences with the same name but from different species, only one was included. Protein sequences containing the unknown amino acid ‘X’ were not considered, as little information is available about it.
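Two of the filtering rules above are simple enough to sketch directly: dropping sequences that contain the unknown residue 'X' and keeping one entry per protein name. The record layout (name, species, sequence) and the choice to keep the first entry encountered are assumptions for illustration; the homology filter performed with PSI-BLAST is not reproduced here.

```python
def filter_records(records):
    """Drop sequences containing 'X' and keep one record per protein name.

    `records` is an iterable of (name, species, sequence) tuples; this layout
    is hypothetical. When a name occurs for several species, the first record
    seen is kept (an assumed tie-breaking rule).
    """
    seen_names = set()
    kept = []
    for name, species, sequence in records:
        if "X" in sequence.upper():
            continue  # unknown residue present: skip this sequence
        if name in seen_names:
            continue  # same protein name already included from another species
        seen_names.add(name)
        kept.append((name, species, sequence))
    return kept
```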
The Java programming language was used to generate the 420-column feature matrix: with the 20 amino acid frequencies at the front and the 400 dipeptide frequencies at the back, each sequence is represented by a 420-dimensional vector. Routines written in C create the training and testing datasets for 10-fold cross-validation. SVMlight is used to predict the subcellular localization of proteins. The software is freely downloadable from http://www.cs.cornell.edu/People/tj/svm_light/.
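The data preparation described above can be sketched as follows, in Python rather than the Java and C used by the authors. The first function writes one example in the sparse text format that SVMlight reads (a target followed by `index:value` pairs, with indices starting at 1); the second produces 10-fold train/test index splits. The function names and the interleaved fold assignment are assumptions, and the examples are assumed to have been shuffled beforehand.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices."""
    parts = ["+1" if label > 0 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # sparse format: zero-valued features can be omitted
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


def k_fold_indices(n_examples, k=10):
    """Yield (train, test) index lists; fold f tests examples with index % k == f."""
    for fold in range(k):
        test = [i for i in range(n_examples) if i % k == fold]
        train = [i for i in range(n_examples) if i % k != fold]
        yield train, test
```

A 420-dimensional vector would be built by concatenating the 20 amino acid frequencies with the 400 dipeptide frequencies before calling `to_svmlight_line`.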
All programs were implemented on a Linux-based PC.
The authors thank the Mississippi Functional Genomics Network (DHHS/NIH/NCRR grant # 2PORR016476-04) for providing support.
This article has been published as part of BMC Genomics Volume 9 Supplement 1, 2008: The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.