Sequence feature-based prediction of protein stability changes upon amino acid substitutions
© Wang et al. 2010
Published: 02 November 2010
Protein destabilization is a common mechanism by which amino acid substitutions cause human diseases. Although several machine learning methods have been reported for predicting protein stability changes upon amino acid substitutions, the previous studies did not utilize relevant sequence features representing biological knowledge for classifier construction.
In this study, a new machine learning method has been developed for sequence feature-based prediction of protein stability changes upon amino acid substitutions. Support vector machines were trained with data from experimental studies on the free energy change of protein stability upon mutations. To construct accurate classifiers, twenty sequence features were examined for input vector encoding. Classifier performance varied significantly with the sequence features used. The most accurate classifier in this study was constructed using a combination of six sequence features. This classifier achieved an overall accuracy of 84.59% with 70.29% sensitivity and 90.98% specificity.
Relevant sequence features can be used to accurately predict protein stability changes upon amino acid substitutions. Predictive results at this level of accuracy may provide useful information to distinguish between deleterious and tolerated alterations in disease candidate genes. To make the classifier accessible to the genetics research community, we have developed a new web server, called MuStab (http://bioinfo.ggc.org/mustab/).
Amino acid substitutions can cause a series of changes to normal protein function, such as geometric constraint changes, physico-chemical effects, and disruption of salt bridges or hydrogen bonds. These changes may lead to protein destabilization or some abnormal biological functions. Previous studies suggest that each person may have 24,000 – 40,000 non-synonymous Single Nucleotide Polymorphisms (nsSNPs), and there are a total of 67,000 – 200,000 common nsSNPs in the human population. These nsSNPs give rise to amino acid substitutions in proteins. While most nsSNPs appear to be functionally neutral, the others affect protein function and may cause or influence diseases. Yue and Moult investigated the effect of amino acid substitutions on protein stability, and estimated that approximately 25% of nsSNPs in the human population might be deleterious to protein function. Of the known disease-causing missense mutations, the vast majority (up to 80%) resulted in protein destabilization. However, it is not feasible to experimentally determine the effect of each human nsSNP on protein stability. Rather, computational methods are needed to provide fast and efficient tools for examining a large number of nsSNPs for potential disease-causing mutations.
Machine learning has been applied to sequence-based prediction of protein stability changes upon amino acid substitutions. The machine learning problem can be specified as follows: given the amino acid sequence of a protein and a single amino acid substitution, the task is to predict whether the substitution may alter protein stability. By using the available data from experimental studies, classifiers can be constructed for predicting either the free energy change (ΔΔG) of protein stability upon mutations or the direction of the change (increased stability if ΔΔG > 0, or decreased stability if ΔΔG < 0). Nevertheless, for many biological applications, correctly predicting the direction of the stability change (a binary classification problem) is more relevant than estimating the magnitude of the free energy change (a regression problem).
Capriotti et al. reported an artificial neural network-based method for predicting the direction of protein stability changes upon point mutations. The predictor was trained with protein sequence alone. It was shown that the sequence-based system could be used to complement the available energy-based methods for improving protein design strategies. The same research group also developed support vector machine (SVM) models for sequence-based prediction of both the free energy change and the direction of the change upon mutations. These SVM models were used to develop the I-Mutant2.0 web server, which could predict protein stability increase or decrease at the overall accuracy of 77% (based on cross-validation). Interestingly, it was found that the sequence-based system was almost as accurate as the structure-based method (80% overall accuracy) on the same dataset. This observation was further confirmed by Cheng and coworkers, who trained SVMs for predicting protein stability changes from amino acid sequence and structural information. More recently, Huang and coworkers developed the iPTREE-STAB web server, which used decision trees with an adaptive boosting algorithm to discriminate stabilizing and destabilizing substitutions in protein sequences. Among all the existing methods, iPTREE-STAB achieved the best classifier performance in cross-validation tests (82.1% overall accuracy with 75.3% sensitivity and 84.5% specificity).
The above-mentioned studies suggest that protein stability changes can be predicted directly from primary sequence data with prediction accuracy similar to that of structure-based methods. The sequence-based approach is particularly appealing since structural information is still not available for most proteins. However, little domain-specific knowledge in terms of biological features was used for classifier construction in the previous studies. In the present study, we have examined twenty sequence features for classifier construction. Support vector machines (SVMs) have been trained with the feature-encoded data instances of protein stability changes upon amino acid substitutions. Our results indicate that accurate SVM models can be constructed by using relevant sequence features for input vector encoding. To make the classifier publicly available, we have developed a new web server, called MuStab (http://bioinfo.ggc.org/mustab/).
The dataset used in this study was derived from two previous studies, in which experimental data for the free energy changes of protein stability upon mutations were collected from the ProTherm database. To construct a robust classifier, data redundancy was removed and the dataset had less than 25% identity among the amino acid sequences. Each data instance in the dataset had the following attributes: amino acid sequence, wild-type amino acid identity and sequence position, mutant amino acid identity, pH value, and free energy change. If the free energy change was negative (protein destabilization), the instance was labelled as a negative example. Otherwise, the instance was labelled as a positive example. The dataset contained 464 positive instances and 1,016 negative instances.
Biochemical features: including molecular weight (feature M); side-chain pKa value (K); hydrophobicity index (H); polarity (P); and overall amino acid composition (Co). Each amino acid has a unique molecular weight (M), which is related to the volume of space that a residue occupies in protein structures. Side-chain pKa (K) is related to the ionization state of a residue, and thus plays a key role in pH-dependent protein stability. Hydrophobicity (H) is important for amino acid side chain packing and protein folding. Hydrophobic interactions cause non-polar side chains to pack together inside proteins, and disruption of these interactions may cause protein destabilization. Polarity (P) reflects the dipole-dipole intermolecular interactions between the positively and negatively charged residues. The amino acid composition (Co) was previously shown to be related to the evolution and stability of small proteins.
Structural features: including the conformational parameters for alpha-helix (A), beta-sheet (B), and coil (C); average area buried on transfer from standard state to folded protein (Aa); and bulkiness (Bu). Protein secondary structures can be divided into alpha-helix, beta-sheet, and coil conformations. Each amino acid has a different tendency to form each of the three types of secondary structures. For instance, amino acids A, I, E, L and M tend to be in the alpha-helical conformation, whereas K, N and D are often found in beta-sheets. In this study, the conformational parameters reported by Deléage and Roux were used for features A, B and C. Feature Aa is another structural parameter, which estimates a residue’s average area buried in the interior core of a globular protein. Bulkiness (Bu), the ratio of the side chain volume to the length of an amino acid, may affect the local structure of a protein.
Empirical features: the protein stability scale based on atom-atom potential (S1); the relative protein stability scale derived from mutation experiments (S2); and the side-chain contribution to protein stability (S3). Zhou et al. derived two protein stability scales from atom-atom potential of mean force based on Distance scaled Finite Ideal-gas REference (DFIRE) state (S1) and a large database of mutations (S2). Takano and Yutani calculated the transfer Gibbs energy of mutant proteins, and derived the amino acid scale for the side-chain contribution to protein stability (S3) based on data from protein denaturation experiments.
Other biological features: including the average flexibility index (F); the mobility of an amino acid on chromatography paper (Mc); the number of codons for an amino acid (No); refractivity (R); recognition factor (Rf); the relative mutability of an amino acid (Rm); and transmembrane tendency (Tt). The average flexibility index of an amino acid (F) was derived from structures of globular proteins. Feature Mc was derived from experimental data by Aboderin. Refractivity (R) refers to protein density and folding characteristics. Recognition factor (Rf) is the average of stabilization energy for an amino acid. The relative mutability (Rm) indicates the probability that a given amino acid can be changed to others during evolution. Feature Tt is the transmembrane tendency scale described by Zhao and London.
where x_i and x_j are two data vectors, and γ is a training parameter. A smaller γ value makes the decision boundary smoother. The regularization factor C, another parameter for SVM training, controls the tradeoff between low training error and large margin.
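The kernel equation that this paragraph refers to is not reproduced in the text. Given the γ parameter and the remark about smoothness, it was presumably the Gaussian radial basis function (RBF) kernel, K(x_i, x_j) = exp(-γ‖x_i − x_j‖²); that identification is an assumption. The following minimal Python sketch evaluates this kernel on two made-up vectors to show how γ controls how quickly similarity decays with distance.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2) (assumed kernel form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two hypothetical encoded vectors (illustrative values, not from the study).
x = [0.0, 1.0, 2.0]
y = [1.0, 1.0, 1.0]

# A smaller gamma keeps distant points similar, giving a smoother boundary.
for gamma in (0.01, 0.1, 1.0):
    print(gamma, rbf_kernel(x, y, gamma))
```

With a small γ the kernel value stays close to 1 even for distant vectors, so many training instances influence each prediction and the boundary is smooth; with a large γ only very close neighbours matter.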
The SVMlight software package (available at http://svmlight.joachims.org/) was used to construct the SVM classifiers in this study. Each training instance was a subsequence of w consecutive residues, where w was also called the window size. The amino acid substitution site was positioned in the middle of the subsequence, and the other (w – 1) neighbouring residues provided context information for the substitution site. The input vector was then obtained by encoding each residue with one or more biological features. The input vector also included the pH value at which the free energy change was measured experimentally. In this study, various values of w, γ and C parameters were examined to optimize SVM classifier performance.
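The window-based encoding described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the zero padding beyond the sequence termini, and the toy scale values are all hypothetical, and the text does not say how the mutant residue enters the vector, so the sketch encodes only the sequence window and the pH.

```python
def encode_window(sequence, position, window, scales, ph, pad_value=0.0):
    """Encode `window` residues centred on the 1-based `position`.

    `scales` is a list of dicts mapping one-letter residue codes to a
    feature value. Positions beyond the termini are filled with
    `pad_value` (an assumption; the source does not describe padding).
    The pH is appended as the last element of the vector.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so the site is centred")
    half = window // 2
    centre = position - 1  # convert to 0-based index
    vector = []
    for i in range(centre - half, centre + half + 1):
        for scale in scales:
            if 0 <= i < len(sequence):
                vector.append(scale[sequence[i]])
            else:
                vector.append(pad_value)
    vector.append(ph)
    return vector

# Toy scale with made-up values, for illustration only.
toy_scale = {"M": 0.1, "K": 0.2, "T": 0.3, "A": 0.4, "Y": 0.5, "I": 0.6}

print(encode_window("MKTAYIA", 4, 3, [toy_scale], 7.0))
```

The vector length is w times the number of features, plus one for the pH, so a window of 11 residues encoded with six features would give 67 inputs under this layout.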
where TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives; and FN is the number of false negatives. In addition to the commonly used performance measures (overall accuracy, sensitivity and specificity), the average of sensitivity and specificity or the so-called prediction strength was also used for classifier comparison in this study. Matthews Correlation Coefficient (MCC) measures the correlation between predictions and the actual class labels. Nevertheless, for imbalanced datasets, different tradeoffs of sensitivity and specificity may give rise to different MCC values for a classifier.
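The equations for these measures are not reproduced in the text. The sketch below uses the standard textbook definitions of accuracy, sensitivity, specificity and MCC, and takes prediction strength as the average of sensitivity and specificity, as stated above. The function name and the example counts are hypothetical and are not results from the study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    strength = (sensitivity + specificity) / 2  # prediction strength
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "strength": strength,
        "mcc": mcc,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=30, tn=40, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because prediction strength weights both classes equally, it is less sensitive than overall accuracy to the roughly 1:2 imbalance between positive and negative instances in this dataset.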
We also used the Receiver Operating Characteristic (ROC) curves for classifier evaluation and comparison. In this study, the ROC curve was generated by varying the output threshold of an SVM classifier and plotting the true positive rate (sensitivity) against the false positive rate (1 – specificity) for each threshold value. Since the ROC curve of an accurate classifier is close to the left-hand and top borders of the plot, the area under the curve (AUC) can be used as a reliable measure of classifier performance. The maximum value of AUC is 1, which indicates a perfect classifier. Weak classifiers and random guessing have AUC values close to 0.5.
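The threshold sweep described above can be sketched in a few lines of Python. This is a generic illustration, assuming class labels of +1 and -1 and real-valued classifier outputs; the function names and the scores are made up, and the trapezoidal rule is one common way to compute the area, not necessarily the one used in the study.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != 1)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical SVM outputs and true labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, -1, 1, -1]
print(auc(roc_points(scores, labels)))
```

The AUC equals the fraction of positive-negative pairs in which the positive instance receives the higher score, which is why a value near 0.5 corresponds to random ranking.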
Effect of window sizes on sequence-based prediction of protein stability changes.
To determine whether classifier performance was affected by the sequence context of the substitution site, SVMs were trained with data instances of various window sizes. As shown in Table 1, protein stability prediction was affected by window sizes. The classifier constructed without any context information (w = 1) gave 67.94% prediction strength (70.69% sensitivity and 65.20% specificity), MCC = 0.3349 and AUC = 0.7425. The prediction strength, MCC and AUC were improved when neighbouring residues of the substitution site were included for input encoding. The use of w = 11 gave the highest prediction strength (79.79%), MCC (0.5843) and AUC (0.8804), and classifier performance was not further improved by including more neighbouring residues (Table 1).
Predictive performance of classifiers constructed using single sequence features.
The results suggest that a variety of sequence features are relevant for predicting protein stability changes upon amino acid substitutions. Of the five biochemical features (H, K, M, P and Co), the hydrophobicity index (H) gave the best predictive performance at 74.70% prediction strength (71.62% sensitivity and 77.79% specificity), MCC = 0.4728 and AUC = 0.8237 (Table 2). Hydrophobicity is a key factor in amino acid side chain packing and protein folding. Hydrophobicity changes owing to amino acid substitutions may prevent proteins from folding into a stable conformation, and thus result in protein destabilization.
Of the structural features (A, B, C, Aa and Bu), bulkiness (Bu) gave rise to the highest prediction strength at 80.28% with MCC = 0.5919 and AUC = 0.8777. In contrast, the classifier using the conformational parameter for coil (C) had relatively low performance with 71.86% prediction strength, MCC = 0.4116 and AUC = 0.7847 (Table 2). One possible explanation is that, since coils are often unstructured and flexible, amino acid substitutions in the coil region may not cause significant changes in protein structure and stability.
The empirical features (S1, S2 and S3) are protein stability scales based on experimental data. Interestingly, when used for SVM classifier construction, these features did not give significantly better performance than the other sequence features. While the use of the S3 feature (side-chain contribution to protein stability) resulted in the highest level of AUC (0.8835) with 79.67% prediction strength and MCC = 0.5922, the other two empirical features (S1 and S2) were much less accurate for predicting protein stability changes (Table 2). Thus, it is possible that the empirical features do not capture all the information about the determinants of protein stability.
Of the other biological features, transmembrane tendency (Tt) achieved the highest level of MCC (0.6035) with 78.86% prediction strength and AUC = 0.8704 (Table 2). The feature Mc (the mobility of an amino acid on chromatography paper) also gave rise to relatively high classifier performance (77.02% prediction strength, MCC = 0.5202 and AUC = 0.8417). Therefore, multiple features from each of the four feature classes achieved high performance for predicting protein stability changes upon amino acid substitutions. It might be possible that classifier performance could be further improved by combining several sequence features for input encoding.
Predictive performance of classifiers constructed by combining the best single features. Feature combinations evaluated: (S3, Bu, Tt); (S3, Bu, Tt, B); (S3, Bu, Tt, B, Aa); (S3, Bu, Tt, B, Aa, Mc); and all 20 features.
We then constructed SVM classifiers by combining some of the best single features for input encoding. Interestingly, none of these feature combinations gave rise to better classifier performance than the best single feature S3 (Table 3). For example, the classifier constructed using the best six single features (S3, Bu, Tt, B, Aa, and Mc) achieved only 77.54% prediction strength with MCC = 0.5993 and AUC = 0.8737.
Predictive performance of classifiers constructed using the optimal subsets of sequence features. Optimal subsets listed: (B, Co, S3); (B, Co, H, S3); (A, Aa, B, Co, P); (A, Aa, B, Co, No, P).
The results suggest that classifier performance can be enhanced by combining certain sequence features for input encoding. The optimal six-feature subset contains sequence features from different classes, especially biochemical features and structural features. Each of these features may not be an accurate scale of protein stability, but when combined, they can outperform the best empirical feature (S3) for predicting protein stability changes upon amino acid substitutions.
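The conclusions attribute the optimal subsets to a wrapper approach for feature selection, but the text does not say which search strategy was used. As one common possibility, the sketch below shows greedy forward selection in Python. The function names, the generic feature names and the toy scoring function are all hypothetical; in a real wrapper, `score` would train and cross-validate a classifier on the candidate subset.

```python
def forward_selection(features, score, max_size):
    """Greedy wrapper search: add the feature that most improves `score`.

    `score` takes a list of feature names and returns a number (higher
    is better). The search stops when no candidate improves the score
    or when `max_size` features have been selected.
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < max_size:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score

# Toy stand-in for cross-validated performance (not a real result):
# rewards two "useful" features and slightly penalises the others.
def toy_score(subset):
    useful = {"f2", "f4"}
    hits = len(useful & set(subset))
    misses = len(set(subset) - useful)
    return hits - 0.1 * misses

print(forward_selection(["f1", "f2", "f3", "f4", "f5"], toy_score, max_size=6))
```

A greedy search of this kind evaluates far fewer subsets than an exhaustive search over twenty features, at the cost of possibly missing combinations whose members are individually weak.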
To make the accurate SVM classifier accessible to the biological research community, we have developed the MuStab web server (http://bioinfo.ggc.org/mustab/). Users can enter an amino acid sequence in FASTA format, and specify the position and the identity of the substituting residue. The system encodes the input sequence with the optimal feature subset, and then calls the svm_classify program of the SVMlight software package to classify the protein stability changes upon the amino acid substitution using the best SVM model developed in this study.
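How the server prepares its input is not described beyond the call to svm_classify. As a hedged illustration, the Python sketch below formats an encoded vector as one line of the SVMlight example-file format (a target value followed by ascending `index:value` pairs, with indices starting at 1). The function name and the file names in the comment are hypothetical.

```python
def to_svmlight_line(vector, label=0):
    """Format a dense feature vector as one SVMlight example line.

    Feature indices start at 1 and appear in increasing order;
    zero-valued features are omitted, which the sparse format allows.
    A label of 0 marks an example whose class is unknown.
    """
    parts = [
        f"{index}:{value:g}"
        for index, value in enumerate(vector, start=1)
        if value != 0
    ]
    return " ".join([str(label)] + parts)

# Hypothetical encoded query vector (illustrative values only).
line = to_svmlight_line([0.3, 0.0, 0.5, 7.0])
print(line)

# The line would be written to an example file and classified with, e.g.:
#   svm_classify query.dat model.dat predictions.out
# (file names here are placeholders)
```

svm_classify writes one real-valued output per example line, and the sign of that value gives the predicted class.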
In this study, we have developed a machine learning method for predicting protein stability changes upon amino acid substitutions. The novelty of our method lies in the use of sequence features representing biological knowledge for input encoding. Twenty sequence features were examined for SVM classifier construction, and several of them were shown to be highly relevant for protein stability prediction. However, the SVM classifier constructed using all twenty features did not show high predictive performance. We thus used a wrapper approach for feature selection, and identified the optimal subset of six sequence features for input encoding. The best classifier achieved an overall accuracy of 84.59% with 70.29% sensitivity and 90.98% specificity. This SVM classifier compares favourably in performance with the previously published models for protein stability prediction. Since the previous studies did not utilize biological knowledge for classifier construction, our method can be used to complement the existing methods to predict the consequences of amino acid alterations in disease candidate genes and may provide useful information for elucidating the molecular mechanisms of human genetic disorders. We have thus developed the MuStab web server (http://bioinfo.ggc.org/mustab/) to make our classifier accessible to the genetics research community.
This work is supported by the CSREES/USDA, under project number SC-1700355. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.