Volume 14 Supplement 3
Assessment of computational methods for predicting the effects of missense mutations in human cancers
© Gnad et al.; licensee BioMed Central Ltd. 2013
Published: 28 May 2013
Recent advances in sequencing technologies have greatly increased the identification of mutations in cancer genomes. However, it remains a significant challenge to identify cancer-driving mutations, since most observed missense changes are neutral passenger mutations. Various computational methods have been developed to predict the effects of amino acid substitutions on protein function and classify mutations as deleterious or benign. These include approaches that rely on evolutionary conservation, structural constraints, or physicochemical attributes of amino acid substitutions. Here we review existing methods and further examine eight tools: SIFT, PolyPhen-2, Condel, CHASM, mCluster, logRE, SNAP, and MutationAssessor, with respect to their coverage, accuracy, availability and dependence on other tools.
Single nucleotide polymorphisms with high minor allele frequencies were used as a negative (neutral) set for testing, and recurrent mutations from the COSMIC database as well as novel recurrent somatic mutations identified in very recent cancer studies were used as positive (non-neutral) sets. Conservation-based methods generally had moderately high accuracy in distinguishing neutral from deleterious mutations, whereas the performance of machine learning-based predictors with comprehensive feature spaces varied between assessments using different positive sets. MutationAssessor consistently provided the highest accuracies. For certain combinations, metapredictors slightly improved on the performance of the included individual methods, but did not outperform MutationAssessor as a stand-alone tool.
Our independent assessment of existing tools reveals various performance disparities. Cancer-trained methods did not improve upon more general predictors. No method or combination of methods exceeds 81% accuracy, indicating there is still significant room for improvement for driver mutation prediction, and perhaps more sophisticated feature integration is needed to develop a more robust tool.
Cancer arises as a result of genetic and epigenetic alterations in the genome. While most DNA mutations are considered neutral passenger mutations, driver mutations can increase the fitness of a cancer cell, allowing its clonal expansion. Identifying driver mutations is crucial to elucidating tumorigenesis and revealing novel therapeutic targets. Recent developments in next-generation sequencing technologies enable extensive identification of DNA mutations in cancer as well as normal genomes. Large-scale efforts such as the Cancer Genome Atlas have uncovered tens of thousands of sequence variants. While the avalanche of sequence data has revealed the spectrum of genetic variations in cancer, the results are difficult to interpret, as the vast majority of mutations do not drive tumorigenesis. Non-synonymous changes (those that change protein sequences) are the most investigated group of genetic perturbations. These mutations vary greatly in their functional impact, depending on their position and function in the protein and the nature of the replacement amino acid. Several computational methods have been developed to predict the effect of any missense mutation on protein function, using evolutionary sequence comparison, structural constraints, and physicochemical attributes of amino acids. More recently, machine learning methods aim to specifically predict cancer-driving deleterious mutations, based on a wider set of attributes and training with sets of likely cancer mutations. These mutations form a subset of deleterious mutations in that they are positively selected during tumor evolution, but are negatively selected during organismal evolution. Metapredictors that combine several methods have also been developed.
In this study, we introduce and compare the results of several general and cancer-focused methods, using both known and novel testing sets. We discuss their individual strengths and highlight associated challenges as well as future prospects. We also examine the availability, coverage and inter-dependence of various tools.
Materials and methods
We created a positive (non-neutral) test set of likely cancer driver mutations from the COSMIC database (v58). From a total of 40,707 missense mutations, we picked 2,682 mutations (corresponding to 482 genes) found in at least two tumor samples as likely driver mutations (Additional file 1). Since COSMIC has been used to develop some of the methods reviewed, we also created a novel test set from recurrent somatic mutations in colorectal carcinoma identified in a very recent study of the Cancer Genome Atlas Network. In total, 455 somatic missense mutations were found in at least two tumor samples but not seen in COSMIC or dbSNP. A second novel set of 147 recurrent unique mutations found in breast [6, 7] or colon cancer was similarly created.
Our negative (neutral) set of likely non-deleterious variants was built from germline SNPs found in dbSNP (Build Id 135). To avoid rare deleterious mutations and errors, we selected only SNPs with a minor allele frequency of at least 0.25, resulting in a set of 7,170 variants.
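The two selection rules described above (recurrence in at least two tumor samples for the positive set, minor allele frequency of at least 0.25 for the negative set) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study; the record layouts, identifiers and toy values are hypothetical.

```python
from collections import Counter

def recurrent_mutations(observations, min_samples=2):
    """Return mutations seen in at least `min_samples` distinct tumor samples.

    `observations` is an iterable of (sample_id, gene, protein_change)
    tuples; this layout is a hypothetical stand-in for a real export.
    """
    # Deduplicate so a mutation reported twice for one sample counts once.
    unique = {(s, g, p) for s, g, p in observations}
    counts = Counter((g, p) for _, g, p in unique)
    return sorted(m for m, n in counts.items() if n >= min_samples)

def common_snps(snps, min_maf=0.25):
    """Keep SNPs whose minor allele frequency is at least `min_maf`."""
    return [s for s in snps if s["maf"] >= min_maf]

# Toy input with made-up identifiers.
observations = [
    ("T1", "GENE1", "A10V"),
    ("T2", "GENE1", "A10V"),
    ("T2", "GENE1", "A10V"),   # duplicate report for the same sample
    ("T3", "GENE2", "R20H"),   # seen in one sample only
]
snps = [{"id": "snpA", "maf": 0.31}, {"id": "snpB", "maf": 0.02}]

positive_set = recurrent_mutations(observations)
negative_set = common_snps(snps)
```

Counting distinct samples rather than raw reports keeps a mutation that is listed twice for one tumor from being treated as recurrent.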
We obtained SIFT 4.0.4 from http://sift.jcvi.org and followed the default instructions to install and run it. A Java-based pipeline was implemented to manage input and output data. We obtained PolyPhen-2 from http://genetics.bwh.harvard.edu/pph2 and followed the standard instructions for installation. Condel scores for the combination of SIFT and PolyPhen-2 were calculated with a Perl program provided by Ensembl. We retrieved functional impact scores from MutationAssessor, using http://mutationassessor.org. LogRE scores were derived with a Java class to align wild-type and mutant protein sequences against Pfam protein domain models (version 25.0) using HMMER 3.0. The differences (wild-type versus mutant) of resulting E-values were used to calculate LogRE scores. SNAP was installed and applied in coordination with its developers from the Technische Universitaet Muenchen. mCluster scores were calculated as described. CHASM scores were derived with CRAVAT (http://www.cravat.us).
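As an illustration of the LogRE step, the sketch below compares the Pfam E-values of the wild-type and mutant sequences on a log scale. The text only states that E-value differences were used, so the base-10 logarithm and the sign convention (positive when the mutant fits the domain model worse) are assumptions, and `log_re` is a hypothetical helper name.

```python
import math

def log_re(e_wild_type, e_mutant):
    """Log10 ratio of mutant to wild-type E-value (assumed definition).

    Positive values mean the mutant sequence matches the domain model
    less well than the wild type does.
    """
    if e_wild_type <= 0 or e_mutant <= 0:
        raise ValueError("E-values must be positive")
    # Subtract logs instead of dividing, which is safer for tiny E-values.
    return math.log10(e_mutant) - math.log10(e_wild_type)

# Toy E-values: the mutant alignment is 1000-fold less significant.
score = log_re(1e-40, 1e-37)
```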
ROC curves and specificity/sensitivity estimation
Receiver operating characteristic (ROC) curves are composed of points that reflect the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) at varying threshold values. For each predictive method, the score range was divided into 1000 bins, for which the proportions of variants from the positive and the negative set above and below the given threshold were calculated. Variants that were not covered (scored) by a method were excluded from the evaluation of that particular method. To ensure the same number of mutations in the positive and negative sets, for each tool assessment the size of the neutral set was adjusted to the resulting depth of the covered non-neutral set, with a preference for variants with high minor allele frequencies.
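The threshold sweep described above can be sketched in a few lines. This is an illustrative reconstruction under two assumptions: higher scores indicate a more deleterious prediction, and the area under the curve is obtained with the trapezoidal rule (the text does not state how AUC was computed). The function names are hypothetical.

```python
def roc_points(pos_scores, neg_scores, n_bins=1000):
    """Return sorted (FPR, TPR) points for evenly spaced thresholds."""
    lo = min(min(pos_scores), min(neg_scores))
    hi = max(max(pos_scores), max(neg_scores))
    points = set()
    for k in range(n_bins + 1):
        t = lo + (hi - lo) * k / n_bins
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.add((fpr, tpr))
    # Corner reached by a threshold above every observed score.
    points.add((0.0, 0.0))
    return sorted(points)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy scores: a perfectly separating predictor and an uninformative one.
perfect = auc(roc_points([3, 4], [1, 2]))
chance = auc(roc_points([1, 2], [1, 2]))
```

For a tool whose scores run the other way (SIFT, for example, assigns low scores to deleterious changes), the scores would need to be negated or otherwise flipped before the sweep.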
To calculate specificity and sensitivity values for each tool, we used score cutoffs that yielded the highest accuracy as measured by the proportion of correctly classified variations to the total number of variants in the test set.
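A minimal sketch of this cutoff selection, together with the summary statistics reported in the result tables (accuracy, sensitivity, specificity and Matthews correlation coefficient), is given below. It assumes that scores at or above the cutoff are called deleterious and that candidate cutoffs are the observed score values; both choices are illustrative rather than taken from the study.

```python
import math

def confusion(pos_scores, neg_scores, cutoff):
    """Count TP, FN, FP, TN when scores >= cutoff are called deleterious."""
    tp = sum(s >= cutoff for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= cutoff for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and MCC from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # MCC is undefined when a margin is zero; report 0.0 in that case.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def best_cutoff(pos_scores, neg_scores):
    """Return the observed score that maximizes accuracy as a cutoff."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(
        candidates,
        key=lambda c: metrics(*confusion(pos_scores, neg_scores, c))["accuracy"],
    )

# Toy scores for three positive and three negative variants.
pos, neg = [0.9, 0.8, 0.4], [0.1, 0.3, 0.6]
cutoff = best_cutoff(pos, neg)
result = metrics(*confusion(pos, neg, cutoff))
```

Because the cutoff is tuned on the same variants that are then scored, the resulting accuracy is an optimistic upper bound for each tool on that test set.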
Si is the normalized score as calculated by the i-th tool, while Wi is the weight for the given classification. Weights were calculated on the basis of the proportions (probabilities) of tolerated (Pti) or deleterious (Pdi) variants with a normalized score higher than Si as observed for COSMIC mutations as positive set and dbSNP mutations as negative set.
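The combination step can be illustrated with a generic weighted average of normalized scores. The exact expression used in the study is not shown in this text, so dividing by the sum of the weights is an assumption, and the weights are taken here as given inputs rather than derived from the COSMIC and dbSNP score distributions.

```python
def weighted_average_score(scores, weights):
    """Weighted average of normalized per-tool scores S_i with weights W_i.

    Normalizing by the sum of weights is an assumption of this sketch.
    """
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per score")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total

# Toy example: two tools, the first trusted three times as much.
combined = weighted_average_score([0.9, 0.5], [0.75, 0.25])
```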
Results and discussion
Overview of general tools for predicting the functional impact of amino acid changes
SIFT (Sorting Intolerant From Tolerant) [13, 14] is a widely used pioneering method for identifying deleterious mutations using only evolutionary information. Installation and usage are straightforward, and the method depends only on PSI-BLAST. SIFT identifies conserved protein residues based on multiple sequence alignment of homologous proteins, and calculates the probability for each of the 19 amino acid changes to be tolerated relative to the most frequent residue. Mutations of highly conserved protein positions tend to be predicted as deleterious, whereas changes in less conserved protein regions are more likely to be neutral. Bi-directional SIFT (B-SIFT) is a modification of SIFT that attempts to classify both gain- and loss-of-function mutations. By calculating SIFT scores for both the mutant and wild-type alleles, it identifies potential gain-of-function mutations where the mutant residue is more similar to those found in homologous proteins. As B-SIFT is exclusively based on SIFT, its implementation is also straightforward.
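The bi-directional idea can be sketched as a comparison of the two SIFT tolerance probabilities at one position. Taking the plain difference (mutant minus wild type) is an assumption made for illustration; the text only says that both scores are calculated and compared. The helper name `b_sift` is hypothetical.

```python
def b_sift(sift_mutant, sift_wild_type):
    """Difference of SIFT tolerance probabilities, in the range [-1, 1].

    Positive values mean the mutant residue is better tolerated in the
    alignment than the wild-type residue (a candidate gain of function);
    negative values point toward loss of function.
    """
    for p in (sift_mutant, sift_wild_type):
        if not 0.0 <= p <= 1.0:
            raise ValueError("SIFT scores are probabilities in [0, 1]")
    return sift_mutant - sift_wild_type

# Toy values: mutant residue is common among homologs, wild type is rare.
gain_like = b_sift(0.9, 0.1)
loss_like = b_sift(0.02, 0.6)
```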
MutationAssessor has a more elaborate conservation-based approach. It distinguishes between conservation patterns within aligned families (conservation score) and sub-families (specificity score) of homologs and so attempts to account for functional shifts between subfamilies of proteins. Specificity residues are defined by the clustering-based identification of homologous sequence subfamilies to determine functional specificity on the background of overall conservation. Interestingly, specificity residues were found to be predominantly located in binding interfaces on the protein surface, implicating them in protein interaction.
In addition to conservation, the feature space can be further increased by the inclusion of physicochemical characteristics. MAPP (Multivariate Analysis of Protein Polymorphism) [19, 20] and Align-GVGD, for example, combine both evolutionary conservation and physicochemical information. While most sequence-based tools are capable of predicting the functional consequence of any mutation in a protein with homologs in other species, some are restricted to the classification of a subset of amino acid alterations. For example, LogRE (Log R Pfam E-value) predicts only on Pfam domains, by comparing the Pfam score of the wild-type and mutant alleles.
Structure-based methods model the structure of a protein using a protein structure database, and then examine structural features such as solvent accessibility or crystallographic B-factor surrounding the substituted amino acid. Predictors based exclusively on structural information have been clearly outcompeted. Their coverage is relatively low due to the lack of available protein structures, and the isolated context of a crystal structure might not reflect the functional importance of certain residues in an interactive environment. For example, a multitude of solvent accessible residues such as posttranslational modification sites are fundamental for protein function, which is reflected in their conservation [23, 24], but not in their structural context. Combining sequence and structure information can increase prediction accuracy to a certain degree. PolyPhen-2 is the most prominent tool based on both sequence and structural information. It uses eight sequence-based and three structure-based features as input to a naive Bayes classification. Due to the diverse feature space, PolyPhen-2 is dependent on a variety of tools. For single amino acid substitutions it is therefore more straightforward to use the associated website (http://genetics.bwh.harvard.edu/pph2/).
To our knowledge the neural network-based tool SNAP (screening for non-acceptable polymorphisms) [27, 28] spans the most comprehensive feature space. SNAP incorporates evolutionary constraints, structural features and protein annotation information. The most important single feature for SNAP prediction is conservation in a family of related proteins as reflected by PSIC scores. As a result of the extensive feature space, SNAP depends on several other tools, which makes its installation complex. For a limited set of mutations it is possible to use SNAP's website.
These methods can give widely differing scores on the same variant, and have individual strengths and weaknesses. A combination of predictors may improve predictability. Condel (consensus deleteriousness score of missense mutations) is a weighted average of the normalized scores from multiple methods. Implementing Condel is not complicated, but it involves the installation of various predictive methods and their supporting tools. Condel scores can be derived for a limited set of specified mutations via the corresponding web application, and the Ensembl database provides position-specific Condel predictions that combine SIFT and PolyPhen-2 for every possible amino acid substitution in all human proteins.
Overview of cancer-specific predictors
Cancer driver mutations are a subset of deleterious mutations that decrease the organism's evolutionary fitness, while increasing cellular proliferation, survival or metastasis. Cancer-specific mutation predictors mainly use frequency-based or machine learning techniques trained on recurrent cancer mutations that are likely to be drivers. A variety of statistical methods has been developed to determine increased mutation frequency. mCluster aggregates mutation data by mapping known disease-related mutations to positions along conserved domains, and then mapping novel variants to those same conserved domains. Conserved mutation-enriched domain regions reflect hotspots for cancer-driving functional changes. The mCluster score expresses the probability of observing a cluster of a certain size given the number of positions in the domain and the mutation frequency. As a consequence of the underlying methodology, only mutations that occur in protein domains can be scored.
CHASM (cancer-specific high-throughput annotation of somatic mutations) is a major machine learning approach that uses a random forest classifier and is trained on cancer mutations from COSMIC and other cancer-related resources. CHASM uses an extensive set of 49 predictive features ranging from exon conservation to UniProt annotation and frequency of missense change type in COSMIC. Notably, the latter feature was ranked as the second most predictive feature. CHASM is available via the web application CRAVAT (http://www.cravat.us).
Analogously to Condel, CanPredict uses a random forest classifier to combine results from different methods. It uses SIFT and LogRE to determine the functional impact of changes, and Gene Ontology Similarity Score (GOSS) to estimate the resemblance between the given mutated gene and known cancer-causing genes.
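To make the idea of a cluster probability concrete, the sketch below uses a simple binomial tail as a stand-in. It is not the published mCluster model: it assumes that mutations fall uniformly and independently on the positions of a domain and asks how likely it is that one given position collects at least a certain number of them. All names and numbers are hypothetical.

```python
from math import comb

def cluster_tail_probability(cluster_size, n_mutations, n_positions):
    """P(X >= cluster_size) for X ~ Binomial(n_mutations, 1 / n_positions).

    Simplified illustrative model: every mutation lands on each of the
    `n_positions` domain positions with equal probability.
    """
    p = 1.0 / n_positions
    return sum(
        comb(n_mutations, k) * p**k * (1 - p) ** (n_mutations - k)
        for k in range(cluster_size, n_mutations + 1)
    )

# Toy case: 20 mutations spread over a 100-position domain.
p_cluster = cluster_tail_probability(3, 20, 100)
```

Under this model a small tail probability marks a position whose mutation count is unlikely to arise by chance, which is the intuition behind treating mutation-enriched domain regions as hotspots.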
Missense mutations from COSMIC and dbSNP used for testing
Notably, these datasets do not represent a true gold standard in which all variants are either functionally deleterious or neutral, and there is in any case no uniform definition of functionality. However, they provide a sufficient enrichment in both classes of variants to be effective for comparison of methods. In general it is not straightforward to generate an optimal set for benchmark analysis. In contrast to the assessment of protein structure predictors, where the experimental structure gives a clear answer, the biology of underlying sets of missense mutations is far more complicated. We took a relatively intuitive approach, using recurrent somatic mutations as the positive set. The overrepresentation of mutations of some canonical cancer genes in the COSMIC set supports our selection. For example, TP53, PTEN and EGFR each have more than 100 mutations reported in COSMIC.
Prediction accuracy based on curated datasets
Prediction accuracies, sensitivities, specificities, AUC values and Matthews correlation coefficients (MCC) compared between methods (based on COSMIC dataset)
In comparison, we found SIFT and PolyPhen-2 to have maximum accuracies of 76% and 77%, respectively. Saunders and Baker showed that in general the additional inclusion of structural information (if available) contributes to a slight increase in performance. This might also play a role in the marginally increased performance of PolyPhen-2. The combination of PolyPhen-2 and SIFT as reflected by the Condel score did not improve the accuracy significantly (78%).
Interestingly the accuracy of SNAP (68%) was lower than those of SIFT and PolyPhen-2, despite its more elaborate feature set. CHASM (89%) was the only tool that outperformed MutationAssessor in this assessment. CHASM predicted 99% of the negative set as non-drivers. However, recurrent COSMIC mutations were used to train the CHASM predictor, and several properties in CHASM's complex feature space are derived from COSMIC. For this reason, the CHASM performance in this test should be viewed with caution.
Excluding CHASM, the results of this assessment suggest that conservation-based predictors, MutationAssessor in particular, achieve the highest accuracies in distinguishing neutral from deleterious mutations. However, none of these methods gives correct classifications of all mutations in the test sets. As an example of a likely misclassification, MutationAssessor predicted the somatic G1007D mutation in phosphatidylinositol-4,5-bisphosphate 3-kinase (PIK3CA), which was identified in haematopoietic, lymphoid and thyroid cancer, to be neutral, while all other methods defined the amino acid change to be deleterious. On the other hand, Bromberg and Rost showed that SNAP, which achieved relatively low sensitivity but high specificity in our assessment, outperformed competing approaches when using an independent dataset from four proteins (LacI repressor, bacteriophage T4 lysozyme, HIV-1 protease and human Melanocortin-4 receptor). The difference in performance might reflect testing dataset bias, or that cancer mutations are inherently different from those enzyme mutations commonly used in various training and testing programs.
Prediction accuracy based on novel recurrent somatic mutations
Overall, we see a slight drop in prediction accuracy. This may be due to a drop in the severity of mutations in these new sets, since they exclude highly recurrent mutations seen in COSMIC. The most notable change is that CHASM accuracy dropped from 89% to 50%, as all mutations from the positive set were predicted to be neutral. The reason for this drop is not clear, but it has to be noted that mutations matching COSMIC variants were ignored in this evaluation and these excluded mutations were the ones with the highest frequencies in the test sets. It should also be noted that the CHASM algorithm was developed to predict both tumor suppressor mutations and oncogene mutations. In our particular test, our choice of using recurrent mutations biased the data toward oncogenic driver mutations, which might contribute to the poor performance by CHASM. Furthermore, it is important to note that the relative frequency of missense changes in the COSMIC database is one of the 49 features used for CHASM prediction. Remarkably, this feature was shown to be the second most important feature for CHASM prediction. We purposely excluded any known COSMIC mutations from our independent test data, presumably causing the sharp performance drop by CHASM. It would be interesting to determine whether the CHASM performance might be more consistent across multiple test data sets if the COSMIC mutation frequency were excluded from the set of 49 features.
Prediction accuracies, sensitivities, specificities, AUC values and Matthews correlation coefficients (MCC) compared between methods (based on TCGA dataset)
Prediction accuracies, sensitivities, specificities, AUC values and Matthews correlation coefficients (MCC) compared between methods (based on COBR dataset)
The observation that the performance of individual methods can vary extremely between different test sets is in concordance with findings from the Critical Assessment of Genome Interpretation (CAGI) project (http://genomeinterpretation.org) - an analogous approach to the critical assessment of techniques for protein structure prediction (CASP).
Combining individual predictors
To determine if multiple methods can be combined into a unified classification, we implemented metapredictors on the basis of weighted average scores (Materials and Methods). We used cumulative distributions of true and false positives from the COSMIC set as reference to estimate weights (Figure 6). To validate the consensus classification on a dataset different from the reference set, we used the two sets of novel mutations. For both test sets, the performances of Condel (combining PolyPhen-2 and SIFT) and our metapredictor that combined PolyPhen-2 and SIFT predictions were almost identical (Additional file 5), even though underlying distributions for weight estimation and cutoff optimization were different.
We examined several combinations of predictors and found that unifying predictions from PolyPhen-2 and MutationAssessor, SIFT and MutationAssessor, or PolyPhen-2 and SIFT achieved better predictions compared to other combinations. However, none of the combinations improved significantly on the best included predictor, and no combination improved on MutationAssessor alone. This is in contrast to a previous report in which combining prediction results from LogRE, MAPP, MutationAssessor, PolyPhen-2 and SIFT was shown to outperform each individual method. The reason for this difference is not clear, but it is possible that only certain datasets are suitable for metaprediction approaches.
Our independent assessment of commonly available tools reveals challenges and inconsistencies of existing tools. Although the cancer-specific predictor CHASM performed particularly well using COSMIC mutations, we observed a dramatic drop in performance when using novel recurrent mutations not present in the COSMIC database. Other cancer-specific methods did not perform better than general tools for predicting the functional impact of amino acid changes. It is debatable what causes such performance differences. One major challenge is the generation of underlying datasets for training and testing. Using recurrent somatic changes as the positive set seems to be an intuitive and reasonable approach. However, there is no experimental evidence that these recurrent mutations actually act as driver mutations in cancer. It is clear that machine learning-based approaches are substantially affected by this problem and need further improvement to become generally applicable. In contrast, sequence conservation-based approaches seem to be less affected by different testing datasets. In fact, MutationAssessor provides consistently reasonable prediction results in this study. However, it is premature to declare any single predictor as the sole winner, since we have identified many instances where an otherwise good predictor would completely miss obvious driver mutations. It is not obvious that metapredictors based on multiple approaches would produce the "silver bullet" cancer driver mutation predictor; therefore, novel and more robust methodology development is still needed.
One idea for potential improvement is to train specialized predictors on different classes of putative driver mutations. Functional driver mutations can impact both tumor suppressors and oncogenes, and the characteristics of these mutations are expected to be different. While tumor suppressors are likely impacted by inactivating mutations, oncogenes can be impacted by a more complex pattern. Mutations that activate oncogenes may exert their effect by different mechanisms, such as utilizing residues that are evolutionarily more fit, inactivating a regulatory region to make a kinase constitutively active, or simulating the activated state of a protein. It is perhaps more practical to develop multiple specific algorithms for different classes of mutations, instead of developing a "one-size-fits-all" approach. With more validated, novel driver mutation data available, such robust and specialized prediction tools should be within reach.
We thank Rachel Karchin (Johns Hopkins University), Guy Yachdav (TUM, Munich), Yana Bromberg (Rutgers), Boris Reva (Memorial Sloan-Kettering Cancer Center), Abel Gonzalez-Perez (Research Programme on Biomedical Informatics Barcelona) and Nuria Lopez-Bigas (Research Programme on Biomedical Informatics Barcelona) for helpful discussions. Special thanks go to Slaton Lipscomb (Genentech) for our compute cluster and Peng Yue (Genentech) for support of mCluster.
The costs for this article were covered by Genentech Inc.
This article has been published as part of BMC Genomics Volume 14 Supplement 3, 2013: SNP-SIG 2012: Identification and annotation of SNPs in the context of structure, function, and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S3
- Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008, 455 (7216): 1061-1068. 10.1038/nature07385.
- Gonzalez-Perez A, Lopez-Bigas N: Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. American journal of human genetics. 2011, 88 (4): 440-449. 10.1016/j.ajhg.2011.03.004.
- Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A et al: COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic acids research. 2011, 39 (Database): D945-950. 10.1093/nar/gkq929.
- Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012, 487 (7407): 330-337. 10.1038/nature11252.
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic acids research. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
- Banerji S, Cibulskis K, Rangel-Escareno C, Brown KK, Carter SL, Frederick AM, Lawrence MS, Sivachenko AY, Sougnez C, Zou L et al: Sequence analysis of mutations and translocations across breast cancer subtypes. Nature. 2012, 486 (7403): 405-409. 10.1038/nature11154.
- Stephens PJ, Tarpey PS, Davies H, Van Loo P, Greenman C, Wedge DC, Nik-Zainal S, Martin S, Varela I, Bignell GR et al: The landscape of cancer genes and mutational processes in breast cancer. Nature. 2012, 486 (7403): 400-404.
- Seshagiri S, Stawiski EW, Durinck S, Modrusan Z, Storm EE, Conboy CB, Chaudhuri S, Guan Y, Janakiraman V, Jaiswal BS et al: Recurrent R-spondin fusions in colon cancer. Nature. 2012, 488 (7413): 660-664. 10.1038/nature11282.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J et al: The Pfam protein families database. Nucleic acids research. 2012, 40 (Database): D290-301.
- Eddy SR: Accelerated Profile HMM Searches. PLoS computational biology. 2011, 7 (10): e1002195. 10.1371/journal.pcbi.1002195.
- Yue P, Forrest WF, Kaminker JS, Lohr S, Zhang Z, Cavet G: Inferring the functional effects of mutation through clusters of mutations in homologous proteins. Human mutation. 2010, 31 (3): 264-271. 10.1002/humu.21194.
- Vitkup D, Sander C, Church GM: The amino-acid mutational spectrum of human genetic disease. Genome biology. 2003, 4 (11): R72. 10.1186/gb-2003-4-11-r72.
- Kumar P, Henikoff S, Ng PC: Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature protocols. 2009, 4 (7): 1073-1081.
- Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic acids research. 2003, 31 (13): 3812-3814. 10.1093/nar/gkg509.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Lee W, Zhang Y, Mukhyala K, Lazarus RA, Zhang Z: Bi-directional SIFT predicts a subset of activating mutations. PloS one. 2009, 4 (12): e8311. 10.1371/journal.pone.0008311.
- Reva B, Antipin Y, Sander C: Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic acids research. 2011, 39 (17): e118. 10.1093/nar/gkr407.
- Reva B, Antipin Y, Sander C: Determinants of protein function revealed by combinatorial entropy optimization. Genome biology. 2007, 8 (11): R232. 10.1186/gb-2007-8-11-r232.
- Binkley J, Karra K, Kirby A, Hosobuchi M, Stone EA, Sidow A: ProPhylER: a curated online resource for protein function and structure based on evolutionary constraint analyses. Genome research. 2010, 20 (1): 142-154. 10.1101/gr.097121.109.
- Stone EA, Sidow A: Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome research. 2005, 15 (7): 978-986. 10.1101/gr.3804205.
- Tavtigian SV, Deffenbaugh AM, Yin L, Judkins T, Scholl T, Samollow PB, de Silva D, Zharkikh A, Thomas A: Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. Journal of medical genetics. 2006, 43 (4): 295-305.
- Clifford RJ, Edmonson MN, Nguyen C, Buetow KH: Large-scale analysis of non-synonymous coding region single nucleotide polymorphisms. Bioinformatics. 2004, 20 (7): 1006-1014. 10.1093/bioinformatics/bth029.
- Gnad F, Forner F, Zielinska DF, Birney E, Gunawardena J, Mann M: Evolutionary constraints of phosphorylation in eukaryotes, prokaryotes, and mitochondria. Molecular & cellular proteomics : MCP. 2010, 9 (12): 2642-2653. 10.1074/mcp.M110.001594.
- Gnad F, Ren S, Cox J, Olsen JV, Macek B, Oroshi M, Mann M: PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome biology. 2007, 8 (11): R250. 10.1186/gb-2007-8-11-r250.
- Saunders CT, Baker D: Evaluation of structural and evolutionary contributions to deleterious mutation prediction. Journal of molecular biology. 2002, 322 (4): 891-901. 10.1016/S0022-2836(02)00813-6.
- Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nature methods. 2010, 7 (4): 248-249. 10.1038/nmeth0410-248.
- Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic acids research. 2007, 35 (11): 3823-3835. 10.1093/nar/gkm238.
- Bromberg Y, Yachdav G, Rost B: SNAP predicts effect of mutations on protein function. Bioinformatics. 2008, 24 (20): 2397-2398. 10.1093/bioinformatics/btn435.
- Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, Kuznetsov EN: PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein engineering. 1999, 12 (5): 387-394. 10.1093/protein/12.5.387.
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S et al: Ensembl 2012. Nucleic acids research. 2012, 40 (Database): D84-90.
- Carter H, Chen S, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, Vogelstein B, Karchin R: Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer research. 2009, 69 (16): 6660-6667. 10.1158/0008-5472.CAN-09-1133.
- Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic acids research. 2012, 40 (Database): D71-75.
- Kaminker JS, Zhang Y, Watanabe C, Zhang Z: CanPredict: a computational tool for predicting cancer-associated missense mutations. Nucleic acids research. 2007, 35 (Web Server): W595-598. 10.1093/nar/gkm405.
- Kaminker JS, Zhang Y, Waugh A, Haverty PM, Peters B, Sebisanovic D, Stinson J, Forrest WF, Bazan JF, Seshagiri S et al: Distinguishing cancer-associated missense mutations from common polymorphisms. Cancer research. 2007, 67 (2): 465-473. 10.1158/0008-5472.CAN-06-1736.
- Moult J, Fidelis K, Kryshtafovych A, Tramontano A: Critical assessment of methods of protein structure prediction (CASP)--round IX. Proteins. 2011, 79 (Suppl 10): 1-5.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.