On the effect of protein conformation diversity in discriminating among neutral and disease related single amino acid substitutions
© Juritz et al; licensee BioMed Central Ltd. 2012
Published: 18 June 2012
Skip to main content
© Juritz et al; licensee BioMed Central Ltd. 2012
Published: 18 June 2012
Non-synonymous coding SNPs (nsSNPs) that are associated to disease can also be related with alterations in protein stability. Computational methods are available to predict the effect of single amino acid substitutions (SASs) on protein stability based on a single folded structure. However, the native state of a protein is not unique and it is better represented by the ensemble of its conformers in dynamic equilibrium. The maintenance of the ensemble is essential for protein function. In this work we investigated how protein conformational diversity can affect the discrimination of neutral and disease related SASs based on protein stability estimations. For this purpose, we used 119 proteins with 803 associated SASs, 60% of which are disease related. Each protein was associated with its corresponding set of available conformers as found in the Protein Conformational Database (PCDB). Our dataset contains proteins with different extensions of conformational diversity summing up a total number of 1023 conformers.
The existence of different conformers for a given protein introduces great variability in the estimation of the protein stability (ΔΔG) after a single amino acid substitution (SAS) as computed with FoldX. Indeed, in 35% of our protein set at least one SAS can be described as stabilizing, destabilizing or neutral when a cutoff value of ±2 kcal/mol is adopted for discriminating neutral from perturbing SASs. However, when the ΔΔG variability among conformers is taken into account, the correlation among the perturbation of protein stability and the corresponding disease or neutral phenotype increases as compared with the same analysis on single protein structures. At the conformer level, we also found that the different conformers correlate in a different way to the corresponding phenotype.
Our results suggest that the consideration of conformational diversity can improve the discrimination of neutral and disease related protein SASs based on the evaluation of the corresponding Gibbs free energy change.
Human single nucleotide polymorphisms (SNPs) are the most frequent type of genetic variation in humans. Less than 1% variations are associated with non-synonymous coding SNPs (nsSNPs). About 64,971 nsSNPs are presently listed as human polymorphisms and disease single amino acid substitutions, SASs, (http://www.uniprot.org/docs/humsavar) and approximately 40% of these SASs are disease related.
It has been documented that in proteins a single amino acid substitution (SAS), can produce the loss of function in different ways. Although the less frequently found , the most obvious mechanism at the disease origin is due to change of key residues participating directly in protein function. This is the case when residue substitution occurs at the active site or in binding-sites for substrate and/or allosteric regulators [2–4]. When the biological functional unit is a complex, SASs at the subunit interface may also hamper the activity [4, 5]. A second possible mechanism is related with the perturbation of protein stability. Residue substitution can indeed destabilize the native protein fold [1, 6]. Also stabilizing residue changes have been reported to be associated with diseases [7, 8]. Furthermore, related with protein stability alteration, the origin of pathogenesis was also related with anomalous post-translational modifications  and aggregation . The correlation among protein SASs and their involvement in human diseases has been proven to be moderate , suggesting that change in protein stability is not the only source of diseases.
Protein stability can be estimated measuring the variation of Gibbs free energy (ΔΔG) between the folded and unfolded state of the protein. Most of the experimental data reported in literature are contained in ProTherm , a thermodynamic database of proteins and their variation in different organisms. Alternatively, several computational methods have been developed to estimate stability changes caused by substitution of lateral side chains in proteins (ΔΔG=ΔGwild-ΔGmutated). Most of them rely on the analysis of the energetic and/or structural perturbation introduced by the variations in the protein native structure. Although computationally intensive, early methods used all atom models to estimate ΔΔG . Soon later, simplified potentials coupled with limited conformational searches [14, 15] and the use of different types of potentials, like those based on hydrophobic interactions , secondary structure , inter-residue contacts  and knowledge-based , allowed to study the effect of different mutations in proteins in a reasonable computational time. Recently machine learning based approaches have been implemented for the prediction of ΔΔG in proteins upon residue substitution taking as input either the protein structure or sequence (for a recent review see ). The discrimination among disease related and neutral SASs can be investigated by determining ΔΔG upon residue change in the protein. This analysis is based on the notion that most harmful SASs are related to protein stability perturbation above a certain threshold ΔΔG (±1 kcal/mol). Most methods suited to predict free energy changes have been recently benchmarked in relation to their ability to discriminate disease from neutral SASs based on the corresponding ΔΔG value and their performance has been proven to be rather poor .
In most methods, predictions of SASs effects are commonly estimated using a single structure of the corresponding protein [21–24]. This approach apparently underestimates the well established concept that the native state of a protein is better represented by an ensemble of conformers [25–27]. The conformational ensemble is a key concept to explain essential properties of proteins like function [28–30], enzyme and antibody promiscuity [31, 32], enzyme catalytic power , signal transduction , protein-protein recognition  and the origin of new functions . Conformers describing the native state of a protein exist in a dynamic equilibrium which changes in response to the presence of ligands such as substrate or allosteric modulators that shift the relative conformational population [37, 38]. From a practical perspective, conformational diversity could be described using experimentally available structures of the same protein obtained in alternative conditions. As these different structures for the same protein have been obtained under different conditions (for example presence of substrate, inhibitors or allosteric activators) they can be taken as snapshots of protein dynamics and then characterize putative conformers belonging to the native ensemble [39, 40]. In this way, the description of the native ensemble of the protein will be more or less complete depending on available experimental data. A way to describe the extension of conformational diversity could be the estimation of the maximum RMSD measured between the available conformers. The distribution of the structural diversity extension measured in this way in the protein space was recently studied . The analysis involves an all vs. all comparisons between structures of the same sequence deposited in PDB database . Conformer distribution exhibits a peak at 0.3 Å RMSD with a large skew that ends at about 24 Å RMSD.
In the present work we investigated how the presence and extension of conformational diversity affect the estimation of ΔΔG in proteins with neutral and disease related SASs. Like protein function relies on the existence and preservation of the dynamic ensemble of conformers, the study of the effect of a SAS in each conformer of a given protein could help to understand the loss of function. We used a set of 803 SASs (482 disease related and 323 neutral) in 119 proteins showing different extension of conformational diversity. These proteins were taken from the Protein Conformational Database (PCDB)  a redundant collection of protein structures linked with biological information. ΔΔG for each SAS in each conformer for a given protein was estimated using FoldX . We found that the ΔΔG estimated value for a SAS highly depends on the conformer used in the estimation. In 35% and 58% of the studied proteins we found that at least one SAS could be classified as neutral, stabilizing or destabilizing depending on a ΔΔG threshold value of ±2Kcal/mol and ±1 Kcal/mol respectively. We also found that the consideration of conformational diversity increases the performance of the prediction of disease related SASs based on ΔΔG analysis. Our results show that the different conformers correlate in different ways with the phenotype (disease or neutral) and that, in most cases, one conformer per protein correlates perfectly with the corresponding phenotype. Our results indicate that the use of conformational diversity may be important to understand the effects of neutral and disease related SASs on protein stability.
When each SAS is mapped into its corresponding protein, we found that in 35% of the cases there is at least one SAS that can be classified as neutral and stabilizing or neutral and destabilizing depending on the considered conformer. It is interesting to note that for this 35%, the average of the maximum RMSD between conformers is 3.78 Å compared to 2.38 Å for proteins without ambiguous predicted stability changes. The discriminative threshold of ΔΔG was set at ± 2 kcal/mol, as it was previously discussed by Worth and coworkers (Worth, Preissner, and Blundell 2011). However, other works have defined different thresholds to classify the stability changes of a SAS [47–49]. When we analyzed the data distribution for ΔΔG values using a threshold of ±1 kcal/mol, the number of proteins with at least one ambiguous prediction was even higher (58%) (these calculations can be done using the data and information included as additional file 1). Thus, the extension of conformational diversity measured by the maximum RMSD between conformers increases the variability of the stability prediction of SASs generating certain ambiguity in the prediction. It is noteworthy that this ambiguity is manifested in the same proportion respect to the total population of SASs in disease-associated SASs as in polymorphic SASs. That is, the proportion of neutral SASs with respect to the total number of SASs is similar to the proportion of neutral SASs with ambiguous prediction with respect to the total number of SASs with ambiguous predictions (almost 29%). From this we can conclude that the uncertainty in the evaluation of the thermodynamic effect of a SAS equally affects neutral and disease related SASs.
Summing up, all the observations indicate that there is a large spread of estimated ΔΔG values for the different conformers due to their structural differences. The spread and, eventually, the corresponding ambiguity in the prediction of ΔΔG, can blur the correlation among ΔΔG values and the SAS type (disease related or neutral) when only one protein structure is used, as it is routinely the case.
Scoring the capability of discriminating among disease related and neutral SASs on different set of conformers.
The elucidation of the effect that a single amino acid substitution (SAS) has in a specific phenotype is a central problem in different areas of research. Particularly, in protein computational biology it is very challenging to understand the mechanisms that lead to disease . Estimation of protein stability and its perturbation after single SAS have played a key role to predict the effects of SASs [1, 6, 11, 51]. Notwithstanding important advances in the area, the estimation of stability in proteins is still problematic and its correlation with disease or neutral variants highly depends on the dataset considered . Here, we improved the correlation between the perturbation of protein stability and the classification of SASs as disease related or neutral. This improvement was obtained by considering maximum values of ΔΔG computed taking into account conformational diversity. As the native state is not unique, the conservation of the biological activity of a protein relies on the different conformers of the ensemble. It has been found that the different conformers for a given protein constraints in different way the substitution pattern of proteins . Then it is expected that the different conformers should have different robustness to SASs. Although the correlation is still far from being perfect, our results suggest that the effect of each SAS should be studied in all the protein conformers in order to obtain a better understanding of protein function perturbation and the disease origin. Our results also suggest that conformational diversity could add value to new computational tools for predicting SAS effects and play a key role to obtain a deeper understanding of the relationship among protein structure and function.
A list of proteins with disease and polymorphic SASs was extracted from http://www.uniprot.org/docs/humsavar. This list was linked with PCDB (http://pcdb.unq.edu.ar) database . PCDB contains a collection of redundant protein structures obtained in different conditions (for example presence of ligands, change in oligomerization state, etc.). These structures can be considered as snapshots of protein dynamism, assumption validated by previous works that have proved the correspondence between structural deformations detected under different crystallographic conditions and conformational changes related to the flexibility of the native state [39, 40]. The description of the conformational ensemble of the protein native state is limited to the information deposited in PCDB and PDB databases. The maximum RMSD (RMSDmax) between alpha carbon coordinates of the different conformers, calculated with MAMMOTH , is taken as a measure of the conformational diversity of the protein. From this cross linking, a dataset containing 119 proteins with different number of conformers was defined (8.6 conformers in average per protein, with a minimum of 2 and a maximum of 73). The length of all the structures in this dataset covers more than 70% of the length of the corresponding protein sequence. The dataset contains 803 SASs with 482 disease related and 323 neutral.
The presence of cognate ligands in each structure was determined using the Procognate database . The cognate ligands deposited in this database are those involved in the biological function of the proteins. After this filtering we found that 35 proteins have bound/unbound states in our dataset.
For each SAS and for each protein and conformation, we estimated ΔΔG values using the program FoldX . FoldX uses an empirical potential calibrated to fit in vitro ΔΔG values. Area exposed to solvent was calculated with Naccess program (http://www.bioinf.manchester.ac.uk/naccess/).
A value of MCC=1 defines the best possible prediction, while a value of MCC=-1 indicates the worst possible prediction. A value of MCC=0 corresponds to predictions made by chance.
RC thanks the following grants: PRIN 2009 project 009WXT45Y (Italian Ministry for University and Research: MIUR), COST BMBS Action TD1101 (European Union RTD Framework Programme), and PON project PON01_02249 (Italian Ministry for University and Research: MIUR).
MSF thanks Erasmus Mundus Action 2 Lot 13a – EADIC II proyect, funded by the European Commission and coordinated by the University of Bologna for the research fellowship. GP and MSF thank the following grants: PIP CONICET (112-200801-02849) and UNQ (1004/11). EJ has a Type II fellowship from CONICET.
This article has been published as part of BMC Genomics Volume 13 Supplement 4, 2012: SNP-SIG 2011: Identification and annotation of SNPs in the context of structure, function and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S4.