Predicting cancer-associated germline variations in proteins
© Martelli et al; licensee BioMed Central Ltd. 2012
Published: 18 June 2012
Various computational methods are currently available to classify a protein variation as disease-associated or not. However, data derived from recent technological advances make it feasible to extend the annotation of disease-associated variations to include specific phenotypes. Here we tackle the problem of distinguishing genetic variations associated with cancer from variations associated with other genetic diseases.
We implement a new method based on Support Vector Machines that takes as input the protein variant and the protein function, as described by its associated Gene Ontology terms. Our approach succeeds in discriminating germline variants that are likely to be cancer-associated from those that are related to other genetic disorders. The method reaches 90% accuracy and a Matthews correlation coefficient of 0.61 on a set comprising 6478 germline variations (16% are cancer-associated) in 592 proteins. The sensitivity and the specificity on the cancer class are 69% and 66%, respectively. Furthermore, the method correctly excludes some 96% of 3392 somatic cancer-associated variations in 1983 proteins not included in the training/testing set.
Here we show that a large set of cancer-associated germline protein variations can be successfully discriminated from those associated with other genetic disorders. This is a further step in the process of protein variant annotation. Scoring improves substantially when protein function, as encoded by Gene Ontology terms, is considered, corroborating the role of protein function as a key feature for the correct annotation of protein variations.
The problem of annotating variations in proteins is particularly urgent given the high frequency with which non-synonymous Single Nucleotide Variations (SNVs) are detected in humans, thanks to recent technological advances in nucleotide sequencing. This direct approach allows the identification of both common and rare disease-associated germline variants that may play a role in susceptibility to different genetic disorders. Indeed, better knowledge of all the genes carrying inherited variations will help case-control variation screening of human genetic diseases. At present, several computational tools estimate, with varying scoring efficiency, whether a variation is disease-associated or not, starting from the protein sequence and/or structure [2, 3]. Recently, the performance of methods that predict the pathogenicity of missense variants was assessed, and two methods, SNPs&GO and MutPred, scored with accuracies of 82% and 81%, respectively. However, the characterization of variations associated with specific phenotypes is still in its infancy. The vast majority of the methods classify a variation as disease-associated or not, with a likelihood attached to the prediction, without providing the type of associated pathogenicity. Only a few methods focus on variations that are known to be associated with specific disorders.
This is particularly so for cancer-associated variations [7–9]. All the methods designed to predict cancer-associated variations are based on the COSMIC dataset, which contains both germline and somatic variations. The role of individual somatic mutations in cancer pathogenesis and progression cannot be easily characterized and generally requires the application of computational filtering procedures. Karchin et al. recently developed a method based on features of the variation and of the protein at hand, and on genomic information such as the conservation of genomic sequences among different species and the SNP density within exons as reported in HapMap. Their main result is the discrimination between driver and passenger mutations. Our goal in this paper is different and complementary. We focus on germline variations and describe a newly implemented method that, taking as input the protein sequence, its function (as described by its associated GO terms), and the mutation type, discriminates well between cancer-associated germline variations in proteins and those related to other genetic disorders. Our results support the notion that information on protein function helps to improve the performance of predictors designed to interpret the phenotypic effects of protein variations.
The GO-scores are computed with a previously defined formula. For each protein this score is the sum of the log-odds associated with its GO annotations. The log-odds of each GO term is the logarithm of the ratio between its occurrence in the “Cancer” class and in the “Other diseases” class. We computed three different GO-scores, one for each of the three GO sub-ontologies (Molecular Function=F, Cellular Component=C, Biological Process=P). We always computed the GO-scores in cross-validation. This means that, using the 10 protein subsets split by similarity (see Dataset section), we generated 30 different sets of GO-probabilities: one for each of the 10 training sets, times 3 (the 3 GO sub-ontologies). All the proteins in each test set were annotated by computing the 3 GO-scores of the protein terms using the GO-frequencies obtained in the corresponding training set.
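The scoring scheme described above can be illustrated with a minimal Python sketch for one sub-ontology. The exact formula is the previously defined one and is not reproduced here, so several details are assumptions made only for illustration: the function names `go_log_odds` and `go_score`, the pseudocount that avoids taking the logarithm of zero, the normalisation of term counts by class size, and the choice to let terms unseen in the training set contribute zero.

```python
import math
from collections import Counter

def go_log_odds(train, pseudocount=1.0):
    """Estimate a log-odds value per GO term from a training set.

    train: iterable of (go_terms, label) pairs, one per protein, where
    label is "cancer" or "other". The pseudocount and the normalisation
    by class size are assumptions, not taken from the original formula.
    """
    counts = {"cancer": Counter(), "other": Counter()}
    totals = Counter()
    for terms, label in train:
        totals[label] += 1
        counts[label].update(set(terms))  # count each term once per protein
    vocabulary = set(counts["cancer"]) | set(counts["other"])
    log_odds = {}
    for term in vocabulary:
        p_cancer = (counts["cancer"][term] + pseudocount) / (totals["cancer"] + 2 * pseudocount)
        p_other = (counts["other"][term] + pseudocount) / (totals["other"] + 2 * pseudocount)
        log_odds[term] = math.log(p_cancer / p_other)
    return log_odds

def go_score(terms, log_odds):
    """Sum the log-odds of a protein's GO terms; unseen terms add 0 (assumption)."""
    return sum(log_odds.get(term, 0.0) for term in set(terms))

# Toy training fold with made-up term identifiers.
train = [
    ({"GO:A"}, "cancer"),
    ({"GO:A", "GO:B"}, "cancer"),
    ({"GO:B"}, "other"),
    ({"GO:C"}, "other"),
]
log_odds = go_log_odds(train)
test_score = go_score({"GO:A", "GO:C"}, log_odds)
```

In a cross-validation setting, `go_log_odds` would be called on the training proteins of each fold, once per sub-ontology, and `go_score` on the held-out proteins, so that no test protein contributes to the frequencies used to score it.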
the variation type represented by a 20-valued vector where the wild type position is set to –1, the mutant residue position is set to 1 and all the other elements are set to 0 (indicated as “mut” in the Tables);
the evolutionary information of the variation obtained by extracting the 4 columns that represent the wild-type and the mutant residues as reported by PSI-BLAST PSSM/PROFILE output generated using the -Q option (indicated as “E” in the Tables);
a sequence profile window of dimension x centred on the mutated residue; the profile is encoded as a vector of 20*x elements (indicated as “Wx” in the Tables);
the GO-scores encoded with 3 elements representing the log-odds scores of the three GO sub-ontologies (indicated as “GO” in the Tables).
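As an illustration of the encodings listed above, the following Python sketch builds the “mut”, “Wx” and “GO” parts of an input vector. It is not the authors' code: the residue ordering in `AMINO_ACIDS`, the zero padding used when the window runs past the sequence ends, the 0-based position, and all function names are assumptions. The “E” feature is omitted because it depends on the PSI-BLAST output format.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues

def encode_mutation(wild, mutant):
    """20-valued vector: -1 at the wild-type residue, +1 at the mutant, 0 elsewhere."""
    vec = [0.0] * 20
    vec[AMINO_ACIDS.index(wild)] = -1.0
    vec[AMINO_ACIDS.index(mutant)] = 1.0
    return vec

def profile_window(profile, pos, x):
    """Flatten x profile rows centred on pos (0-based) into 20*x values.

    profile: list of rows, each with 20 values, one row per sequence position.
    Positions outside the sequence are zero-padded (assumption). x must be odd.
    """
    half = x // 2
    window = []
    for i in range(pos - half, pos + half + 1):
        if 0 <= i < len(profile):
            window.extend(profile[i])
        else:
            window.extend([0.0] * 20)
    return window

def build_features(wild, mutant, profile, pos, x, go_scores):
    """Concatenate the 'mut', 'Wx' and 'GO' encodings into one input vector."""
    return encode_mutation(wild, mutant) + profile_window(profile, pos, x) + list(go_scores)

# Toy example: a 3-residue protein with a made-up profile and GO-scores.
toy_profile = [[float(i)] * 20 for i in range(1, 4)]
mut_vec = encode_mutation("A", "V")
edge_window = profile_window(toy_profile, 0, 3)
features = build_features("A", "V", toy_profile, 1, 3, (0.5, -0.2, 1.1))
```

With a window of x = 3 the resulting vector has 20 + 60 + 3 = 83 elements; it is this kind of fixed-length vector that is passed to the SVM.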
To assess the performance of the tested methods we counted in cross-validation the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) with respect to either one of the two classes. We then computed the following indices:
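The indices reported in this work (accuracy, Matthews correlation coefficient, sensitivity and specificity) can be sketched from the four counts using their standard definitions. The Python function below is an illustration under assumptions rather than the exact formulas of the paper: in particular, "specificity" is defined in the literature either as TN/(TN+FP) or as TP/(TP+FP), so both quantities are returned under explicit names. The counts in the example are made up.

```python
import math

def indices(tp, tn, fp, fn):
    """Standard performance indices for one class from a confusion matrix."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        # MCC is set to 0 when any marginal sum is zero (common convention).
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Two quantities that are both called "specificity" in the literature:
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "true_negative_rate": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical counts, for illustration only.
result = indices(tp=70, tn=865, fp=35, fn=30)
```

Because the data set is unbalanced, accuracy alone is dominated by the larger class; the Matthews correlation coefficient uses all four counts and is therefore more informative here.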
Table 1. Dataset of variations adopted for training/testing the method (number of proteins and number of variations; cancer and other diseases).
We adopted a cross-validation procedure to evaluate the predictors. We split the dataset into 10 cross-validation subsets. Sequences in one subset share <25% sequence identity with proteins in the complementary sets, according to an all-against-all BLAST search with E-value <0.001. For validating predictors, we also collected a second dataset of 3392 variations labelled “Somatic cancer” from UniProtKB present in 1983 proteins not included in the training set. This dataset does not include variants detected in cancer cell lines.
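One way to build subsets that respect such a similarity constraint is to cluster the proteins first and then distribute whole clusters over the folds. The Python sketch below is an assumption about how this could be done, not a description of the procedure actually used: `similar_pairs` is a hypothetical, pre-parsed list of protein pairs that exceed the identity threshold in the all-against-all BLAST search, clusters are formed by single linkage (union-find), and clusters are assigned greedily to the currently smallest fold.

```python
def similarity_folds(proteins, similar_pairs, n_folds=10):
    """Split proteins into folds so that similar proteins never cross folds.

    proteins: list of protein identifiers.
    similar_pairs: iterable of (a, b) pairs judged similar (hypothetical input).
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)

    folds = [[] for _ in range(n_folds)]
    # Place the largest clusters first, always into the smallest fold so far.
    for cluster in sorted(clusters.values(), key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds

# Toy example with made-up identifiers.
toy_proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
toy_pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
toy_folds = similarity_folds(toy_proteins, toy_pairs, n_folds=3)
fold_of = {p: i for i, fold in enumerate(toy_folds) for p in fold}
```

Single linkage is stricter than the constraint requires, since it also keeps together proteins that are only indirectly connected, but it guarantees that no similar pair is split between a training and a test subset.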
Prediction of the disease type by protein similarity
The unbalanced distribution of the available data set reported in Table 1 shows that the majority of the proteins have variations associated with a single disease type (in this binary view, familial cancer and non-cancer). This suggests that the protein itself and/or its biological role carries a significant amount of information for the task at hand. We therefore evaluated the role of functional information, a property of the whole protein, in the prediction of the disease type. To label protein function we took advantage of the Gene Ontology (GO) annotation present in UniProtKB. In the absence of any other source of information, protein variants were labelled as in the previous experiment (see previous section), assigning all the variants in the same sequence to the same disease type.
We then applied a 10-fold cross-validation to compute the GO-scores on the 10 different subsets including the proteins of our data set (as described in the Methods section). In this way, each prediction based on the protein GO-terms was made without including the GO-terms of the protein to be classified (or those of the other proteins in the same test subset).
Prediction of the disease type by protein function
Most discriminative GO annotations
Cellular Component (C)
Mismatch repair complex
Beta-catenin destruction complex
Lateral plasma membrane
Molecular Function (F)
Mismatched DNA binding
Guanine/thymine mispair binding
Mispaired DNA binding
Protein serine/threonine kinase inhibitor activity
Cyclin-dependent protein kinase regulator activity
Cyclin-dependent protein kinase inhibitor activity
Protein kinase regulator activity
Kinase regulator activity
Ras GTPase activator activity
Biological Process (P)
Cellular nitrogen compound biosynthetic process
Carboxylic acid catabolic process
Organic acid catabolic process
Amine catabolic process
Regulation of microtubule cytoskeleton organization
Regulation of microtubule-based process
Cellular amino acid catabolic process.
Prediction of the disease type with an SVM-based method
With the most accurate SVM of Table 5 we also predicted a set of 3392 variations associated with “Somatic cancer” in 1983 proteins not included in the training/testing set. On this set the predictor correctly discards 96% of the variations, mispredicting only 4% of the cases as germline cancer-associated. This indicates that our method can discriminate quite accurately between cancer-related germline variations and somatic ones.
Cross-validation performance of an SVM-based predictor in cascade with SNPs&GO
Overall, our work aims at filling the gap between predictors that classify variations as disease-associated or not and association studies between genotypes and phenotypes. In this paper we focus on discriminating cancer-associated germline variations from variations associated with other genetic diseases. Our results indicate that protein function, when integrated with the variation information in an SVM-based method, is a key feature for a correct classification.
Furthermore, when the method is applied to somatic cancer variations, it predicts most of them as not belonging to the cancer-associated germline class. Our predictor can therefore be applied to prioritize germline variations in proteomes of cancer cells.
This work was supported by the following grants: PRIN 2009 project 009WXT45Y (Italian Ministry for University and Research: MIUR), COST BMBS Action TD1101 (European Union RTD Framework Programme), and PON project PON01_02249 (Italian Ministry for University and Research: MIUR).
This article has been published as part of BMC Genomics Volume 13 Supplement 4, 2012: SNP-SIG 2011: Identification and annotation of SNPs in the context of structure, function and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S4.