The SAAP pipeline and database: tools to analyze the impact and predict the pathogenicity of mutations
© Al-Numair and Martin; licensee BioMed Central Ltd. 2013
Published: 28 May 2013
Understanding and predicting the effects of mutations on protein structure and phenotype is an increasingly important area. Genes for many genetically linked diseases are now routinely sequenced in the clinic. Previously we focused on understanding the structural effects of mutations, creating the SAAPdb resource.
We have updated SAAPdb to include 41% more SNPs and 36% more PDs.
Introducing a hydrophobic residue on the surface, or a hydrophilic residue in the core, no longer shows significant differences between SNPs and PDs. We have improved some of the analyses, significantly enhancing the analysis of clashes and of mutations to-proline and from-glycine. A new web interface has been developed, allowing users to analyze their own mutations. Finally, we have developed a machine learning method which gives a cross-validated accuracy of 0.846, considerably out-performing well-known methods including SIFT and PolyPhen2, which give accuracies between 0.690 and 0.785.
We have updated SAAPdb and improved its analyses but, given the increasing rate at which mutation data are generated, we have also created a new analysis pipeline and web interface. Machine learning using the structural analysis results to predict pathogenicity considerably outperforms other methods.
The explosion in the availability of mutation data, resulting from the application of SNP chips and next-generation sequencing, has led to a huge demand to analyze and predict the effects of mutations. The genes for many genetically linked diseases are now routinely sequenced in the clinic.
While a mutation is defined as 'any change in the DNA', most work has focused on studying 'Single Nucleotide Variations' (SNVs). Broadly these can be classified into Single Nucleotide Polymorphisms (SNPs) and pathogenic deviations (PDs). SNPs which, if strictly defined, occur in at least 1% of a normal population, are estimated to occur once every 100-300 bases in the human genome, giving rise to subtle phenotypic variation without causing major deleterious phenotypic changes; PDs occur at much lower frequencies and are causative of disease.
In reality, SNVs form a spectrum from completely silent SNPs at one end, to 100% penetrance, Mendelianly inherited PDs at the other end. In between, SNVs show partial penetrance; that is, only a fraction of individuals having the mutation show altered phenotype and this can be influenced by the presence of other mutations and/or environmental factors.
To date, most effort has gone into understanding the effects of missense SNVs that lead to changes in protein sequence. We use the term 'Single Amino Acid Polymorphism' (SAAP) to refer to such amino acid changes whatever the frequency and resulting phenotype of the mutation. More than a dozen groups have devised methods to analyze the effects a given SAAP will have and in some cases attempt to predict whether the mutation will have a deleterious effect on phenotype [4–15]. However, the best known methods are SIFT (an evolutionary method which calculates a sophisticated residue conservation score from multiple alignment) and PolyPhen-2, which uses machine learning on a set of eight sequence- and three structure-based features. A recent addition to the set of tools is Condel, a consensus predictor which makes use of SIFT, PolyPhen-2 and MutationAssessor. Condel significantly outperforms any of its component predictors.
Until recently, rather than trying to predict whether a given SAAP will result in a deleterious phenotype, our focus has been on trying to understand the effects that mutations have on protein structure, comparing these effects in SNPs (that is, non-pathogenic mutations) and PDs. Our approach has been to map SAAPs onto protein structure and to perform a rule-based analysis of the likely structural effects of these mutations in order to 'explain' the functional effect (if any) of the mutation. Since we map mutations to structure, we only consider mutations in proteins for which a structure has been solved. Data resulting from the analysis of SNPs and PDs have been collected into a relational database and made available over the web in the resource SAAPdb (http://www.bioinf.org.uk/saap/db/).
In this paper we describe (i) an update of the data in SAAPdb, (ii) enhancements to methods used to analyze the structural impact of SNPs, (iii) a new web interface allowing the analysis of new mutations and (iv) results of the application of machine learning to predict the phenotypic effects of mutations based on our structural analyses.
Number of distinct mutations from different sources that have been mapped to protein sequence and included in SAAPdb
ADABase Adenosine DeAminase deficiency
Hamsters The Haemophilia A Mutation, Structure, Test
IARC-p53-Germline Tumor Protein 53 gene germline mutation in familial cancers
IARC-p53-Somatic Tumor Protein 53 gene somatic mutations in sporadic cancers
G6PD Glucose-6-Phosphate Dehydrogenase
OTC Ornithine TransCarbamylase (OTC)
SODdb SuperOxide Dismutase 1
ZAP70Base Zeta-chain-Associated Protein kinase 70
Kinbase Somatic protein kinase driver mutations
Kinbase Somatic protein kinase passenger mutations
LDLR Low Density Lipoprotein Receptor
PAHdb Human Phenylalanine Hydroxylase gene
STAT3 Signal Transducer and Activator of Transcription
Since we map mutations to protein structure and therefore require a structure to be solved of the protein of interest, we are not able to analyze all mutations. Of the amino acid mutations in OMIM, we are only able to map approximately 57% to structure, while only approximately 22% of 'valid' SNPs from dbSNP, which result in an amino acid change, map to structure. Consequently the coverage of our analysis is currently somewhat limited, but clinically relevant proteins tend to be key targets for structural studies, so we expect this figure to improve. Where multiple structures have been solved, we analyze the effects of the mutations in all available structures.
In summary, the number of SNPs in the database has risen by 41% and the number of PDs by 36% (including two new sources of mutation data). The comparison of structural effects between SNPs and PDs shows the same trends as in the previous analysis, but the 'surfacephobic' (introducing a hydrophobic residue onto the surface) and 'corephilic' (introducing a hydrophilic residue into the core) analyses no longer show significant differences between SNPs and PDs.
In SAAPdb, all assignments of structural effects are Boolean -- that is, any mutation either does, or does not, have a given effect. While Boolean assignment is appropriate in some cases (for example, a residue either is, or is not, annotated as a feature in UniProtKB/SwissProt), in other cases, it relies on some cutoff (for example, energy, void volume, hydrophobicity difference) as described previously [19, 21–23].
We found that assigning a mutation as (not) having a structural effect is very sensitive to precise structural details; where multiple structures are available for the same protein, one structure may indicate that a mutation has a value just below a cutoff while another structure has a value just above. Wherever appropriate, we have now implemented real-number scores or pseudo-energies for each effect. In particular, we have enhanced the analysis of clashes and torsion angles to provide energy values.
Overall, approximately 32% of mutations previously classified as not clashing are now found to clash while approximately 15% of mutations previously classified as clashing are now found to have only minor clashes which could be relieved by very slight movements in the structure.
Glycine and proline are the 'structural' amino acids which show an unusual Ramachandran distribution. Because glycine has no sidechain, it is able to access a wider range of phi/psi combinations while the cyclic sidechain of proline restricts the available phi angles. Consequently, backbone conformational changes may be necessary to accommodate mutations from-glycine or to-proline.
where 'obs' is the (smoothed) observed number of residues with a given phi/psi combination while 'exp' is the expected number, calculated as the total number of observations divided by the number of cells. A threshold energy was calculated for each plot based on 1% of observations in high quality non-redundant structures having a worse energy.
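The calculation described here can be illustrated with a minimal Python sketch. It assumes the common inverse-Boltzmann form, energy = -ln(obs/exp), which is an assumption on our part rather than a formula given in this text. The 10-degree cell size, the flat pseudo-count standing in for smoothing, and the function names `cell_index`, `pseudo_energy_map` and `threshold_energy` are likewise hypothetical.

```python
import math
from collections import Counter


def cell_index(phi, psi, bin_size=10):
    """Map (phi, psi) in degrees, each in [-180, 180], to a grid cell."""
    n_bins = 360 // bin_size
    return (int((phi + 180) // bin_size) % n_bins,
            int((psi + 180) // bin_size) % n_bins)


def pseudo_energy_map(angles, bin_size=10, pseudo_count=1.0):
    """Return {cell: pseudo-energy} for a list of (phi, psi) observations.

    ASSUMPTION: energy = -ln(obs / exp); the text only defines 'obs' and 'exp'.
    ASSUMPTION: a flat pseudo-count replaces the smoothing mentioned in the text.
    """
    n_bins = 360 // bin_size
    n_cells = n_bins * n_bins
    counts = Counter(cell_index(phi, psi, bin_size) for phi, psi in angles)
    # exp: total number of (smoothed) observations divided by the number of cells
    total = len(angles) + pseudo_count * n_cells
    expected = total / n_cells
    return {
        (i, j): -math.log((counts[(i, j)] + pseudo_count) / expected)
        for i in range(n_bins)
        for j in range(n_bins)
    }


def threshold_energy(angles, energies, bin_size=10, worst_fraction=0.01):
    """Energy at which roughly `worst_fraction` of observations are as bad or worse."""
    per_obs = sorted(energies[cell_index(phi, psi, bin_size)] for phi, psi in angles)
    n_worst = max(1, round(len(per_obs) * worst_fraction))
    return per_obs[-n_worst]
```

Under these assumptions, well-populated regions of the plot receive low (favourable) energies, and a mutant residue whose backbone angles fall in a cell with an energy above the threshold would be flagged as unfavourable.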
SAAPdb was designed as a regularly updated pre-calculated resource. However, it has proved very difficult to maintain and changes in licensing of OMIM data mean that we may no longer be able to use this as our primary source of PDs. In addition, with the increasing routine use of high-throughput sequencing methods to detect mutations, more and more people want to be able to analyze their own mutations.
Consequently we believe the value of SAAPdb has diminished and have now implemented SAAPdap (Single Amino Acid Polymorphism Data Analysis Pipeline). This is a complete rewrite of the mutation analysis software in SAAPdb using a plugin architecture and making use of the new non-Boolean analyses. While we still indicate whether a mutation is likely to have a detrimental effect on structure using cutoff values, we also provide continuous values for each of the analyses.
The data in SAAPdb (Figure 1) show clear differences in the sequence and structural characteristics of SNPs and PDs: PDs have additional, and more severe, structural effects. Thus there is a clear signal that can be used to predict the pathogenicity of a novel mutation.
Breakdown of the number of mutations in SAAPdb and their mapping to structure. Rows: Number of Mutations; Mapped to UniProtKB/SwissProt; Mapped to PDB; Mapped to multiple PDBs; Mapped to multiple Chains.
González-Pérez and López-Bigas report that well known individual methods (SIFT, PolyPhen2, Logre, MAPP and MutationAssessor) give accuracies between 0.690 and 0.771 evaluated on the HumVar dataset developed for PolyPhen2. Their consensus method (Condel) gives an accuracy of 0.882. While our preliminary value of 0.935 is considerably better, we are using a different dataset.
Consequently, for the final version of SAAPpred, we both trained and tested our method on the HumVar dataset (using 10-fold cross-validation). HumVar (2011/12) contains 22,196 deleterious mutations and 21,151 neutral mutations of which 7,182 and 1,540, respectively, can be mapped to structure. Consequently, to obtain a balanced dataset, only 3,080 mutations (equal numbers of deleterious and neutral) can be used. Ten runs were performed, each of which used all 1,540 neutral mutations with a random selection of 1,540 deleterious mutations from the total of 7,182. Results from the ten runs were then averaged. In each run, to avoid the 'structural overlap' between the training and testing data during cross-validation (which was present in the preliminary experiments with SAAPdb data), the mutations were split into 10 sets of approximately the same size. Each of these 10 sets in turn was chosen as a test set. The remaining 9 sets were used for training by randomly drawing balanced datasets of different sizes from the mutations as mapped to protein chains (see Table 2). This manual cross-validation ensures that there are no cases of the same mutation in the training and test sets but from different PDB chains.
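The balancing and splitting steps can be sketched as follows. This is a simplified illustration, not the code used in this work: the function names and the `(mutation_id, chain_id, label)` row layout are hypothetical, and the sketch omits the step of drawing training sets of different sizes. It assumes, as in HumVar, that there are fewer neutral than deleterious mutations.

```python
import random


def balanced_sample(neutral, deleterious, rng):
    """Keep every neutral mutation and draw an equal-sized random subset of deleterious ones."""
    neutral = list(neutral)
    return neutral, rng.sample(list(deleterious), len(neutral))


def mutation_level_folds(mutation_ids, n_folds, rng):
    """Split distinct mutation IDs (not chain-level rows) into folds of similar size."""
    ids = sorted(set(mutation_ids))
    rng.shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def train_test_rows(rows, folds, test_index):
    """Partition (mutation_id, chain_id, label) rows so that a mutation mapped to
    several PDB chains never appears in both the training and the test set."""
    test_ids = set(folds[test_index])
    train = [r for r in rows if r[0] not in test_ids]
    test = [r for r in rows if r[0] in test_ids]
    return train, test
```

Splitting on mutation identifiers rather than on chain-level rows is what prevents the same mutation, seen in two different PDB chains, from leaking between training and test data.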
Nonetheless, our results from training and testing on HumVar mutations that map to structure considerably outperform other well-known individual methods, including SIFT and PolyPhen2, as reported by González-Pérez and López-Bigas (accuracies between 0.690 and 0.771). However, these results are still not directly comparable with the other methods, since those methods were evaluated on the complete HumVar dataset, and it may be argued that they perform better on the subset of mutations for which structures are available than on those for which structures are not available. For example, PolyPhen2 makes limited use of structural data where these are available.
Consequently, we have evaluated the performance of PolyPhen2, SIFT and MutationAssessor using balanced datasets (1,451 neutral mutations and ten random selections of 1,451 deleterious mutations) used to assess the performance of our method. (Note that we could only use 1,451 rather than 1,540 mutations since the remaining 89 PDs failed in at least one of the other predictors.) In fact, this gives a significant advantage to PolyPhen2, which is itself trained on HumVar, leading to an overlap between the training and test sets. It is not clear precisely what data are used to train SIFT; in their latest paper, Sim et al. state that SIFT was originally trained and tested on LacI, Lysozyme and HIV protease, and refer to the original SIFT papers, but they do not state whether the training has since been modified. MutationAssessor does not appear to use a training set per se.
Performance of different prediction methods using a balanced dataset of mutations that map to structure extracted from HumVar.
While SAAPpred is clearly performing extremely well, we expect to be able to improve results further through feature selection (to help with the relatively small HumVar dataset size), feature combination (e.g. subtracting native void sizes from mutant void sizes) and feature normalization (e.g. taking the log of some feature values to improve the distribution of values). We also hope to develop methods to make more complete use of unbalanced datasets and intend to use our predictor as a component predictor of the consensus predictor Condel, which outperforms the other individual methods (Acc = 0.882).
In conclusion, we have updated the data in SAAPdb, improved the analyses and integrated these into the new SAAPdap pipeline and web interface. It is intended that SAAPdap will replace SAAPdb (which has proved difficult to update regularly and reliably). The submission page for SAAPdap is available at http://www.bioinf.org.uk/saap/dap/.
We are currently working on new analyses that examine sequence differences at the DNA and RNA level. In addition to changes to the protein structure, mutations can have effects on expression, RNA splicing and RNA folding and stability [28–30].
Results of machine learning using the structural parameters calculated in SAAPdap considerably out-perform any other individual predictor and approach the performance of the combined predictor, Condel. Future work will further optimize the performance of this method using feature selection, feature combination and feature normalization, as well as exploiting strategies to make more complete use of unbalanced datasets. We will integrate our predictor as a component of Condel and expect the result to outperform the current Condel method.
While the coverage of our method is currently somewhat limited by the need for a structure of the protein, we plan to investigate the use of modelled structures, although we do not yet know how well this will work given the detailed structural analysis (e.g. of hydrogen bonds) that our method performs. Clinically relevant proteins tend to be key targets for structural studies and, as more structures become available, the number of mutants mapped to structure will increase, improving the coverage of our method. In addition, more structural data will allow us to train the machine learning methods with more data. Consequently, as shown in Figure 8, we expect performance to increase further.
SNP data were extracted from the XML format dump of dbSNP obtained from the NCBI. Non-synonymous, 'valid' human SNPs (i.e. those annotated with validation strings 'by frequency', 'by 2hit 2allele', or 'by hapmap') were extracted and combined into a single XML file. Any mutations not annotated as having disease involvement were assumed to be neutral. PDs were obtained from Online Mendelian Inheritance in Man (OMIM, http://www.ncbi.nlm.nih.gov/omim/) and a number of locus-specific mutation databases ('LSMDBs'); see Table 1. All mutations were then mapped to protein sequences and thence to structure as described previously.
SAAPdb and SAAPdap perform fourteen analyses:
Interface (or PQS): Residue is in an interface according to solvent accessibility criteria
Binding: Residue makes specific interactions with a different protein chain or ligand
SprotFT: Residue is annotated as functionally relevant by UniProtKB/SwissProt
Clash: Mutation introduces a steric clash with an existing residue
Void: Mutation introduces a destabilizing void in the protein core
Cis-Proline: Mutation from cis-proline, introducing an unfavorable omega torsion angle
Glycine: Mutation from glycine, introducing unfavorable torsion angles
Proline: Mutation to proline, introducing unfavorable torsion angles
HBond: Mutation disrupts a hydrogen bond
Corephilic: Introduction of a hydrophilic residue in the protein core
Surfacephobic: Introduction of a hydrophobic residue on the protein surface
Buriedcharge: Mutation causes an unsatisfied charge in the protein core
SSgeometry: Mutation disrupts a disulphide bond
Impact: Residue is significantly conserved
From these analyses (using software written in Perl and C) we derive 47 features that are used for machine learning. Random Forests (implemented in Weka) were used for all predictions. Random Forests are ensemble classifiers that consist of multiple decision trees, each of which uses a random subset of the available features. The output of the predictor is the fraction of trees voting for the most popular class (in this case PD or SNP). Initial trials were performed using SAAPdb and HumVar data with 1000 trees and between 4 and 45 features per tree. 40 features performed best with SAAPdb while 4 features performed best with HumVar and these values were used for building the final machine learning models.
Data are stored in a PostgreSQL relational database.
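To make the form of the predictor output concrete, the following minimal Python sketch turns a list of per-tree votes into a predicted class and the fraction of trees supporting it. It is an illustration only and is not the Weka implementation used in this work; the function name `forest_prediction` and the example vote counts are hypothetical.

```python
from collections import Counter


def forest_prediction(tree_votes):
    """Return (most popular class, fraction of trees voting for it).

    `tree_votes` holds one label per decision tree, e.g. "PD" or "SNP".
    """
    counts = Counter(tree_votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(tree_votes)


# Hypothetical example: 1000 trees, 800 of which vote for "PD"
label, confidence = forest_prediction(["PD"] * 800 + ["SNP"] * 200)
```

With two classes the fraction always lies between 0.5 and 1.0, so it can be read as a rough confidence in the prediction.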
An additional predictor, FATHMM (http://fathmm.biocompute.org.uk/), has become available since this work was completed (Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, Day INM, Gaunt TR. (2013). Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat., 34:57-65). Evaluation of FATHMM on the same dataset shows a performance of ACC=0.836, MCC=0.671. While this approaches our cross-validated performance, it is likely that some of the HumVar data were included in training FATHMM.
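For readers unfamiliar with the two measures quoted here, the sketch below computes accuracy (ACC) and the Matthews correlation coefficient (MCC) from the four cells of a confusion matrix, using their standard definitions. The function names and the example counts are hypothetical and are not results from this work.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: 90 PDs and 80 SNPs predicted correctly, 20 + 10 errors
acc_value = accuracy(tp=90, tn=80, fp=20, fn=10)
mcc_value = mcc(tp=90, tn=80, fp=20, fn=10)
```

MCC ranges from -1 to +1 and, unlike accuracy, is not inflated when one class dominates, which is one reason both values are commonly reported together.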
NSAN thanks the King Faisal Specialist Hospital and Research Centre and the Royal Embassy of Saudi Arabia Cultural Bureau (reference S12063) for funding. ACRM thanks Prof Giuliano Armano, DIEE, University of Cagliari for useful discussions on machine learning.
The publication costs for this article were funded by the above grant.
This article has been published as part of BMC Genomics Volume 14 Supplement 3, 2013: SNP-SIG 2012: Identification and annotation of SNPs in the context of structure, function, and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S3
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.