Collective judgment predicts disease-associated single nucleotide variants

Background In recent years the number of human genetic variants deposited into the publicly available databases has been increasing exponentially. The latest version of dbSNP, for example, contains ~50 million validated Single Nucleotide Variants (SNVs). SNVs make up most of human variation and are often the primary causes of disease. The non-synonymous SNVs (nsSNVs) result in single amino acid substitutions and may affect protein function, often causing disease. Although several methods for the detection of nsSNV effects have already been developed, the consistent increase in annotated data offers the opportunity to improve prediction accuracy. Results Here we present a new approach for the detection of disease-associated nsSNVs (Meta-SNP) that integrates four existing methods: PANTHER, PhD-SNP, SIFT and SNAP. We first tested the accuracy of each method using a dataset of 35,766 disease-annotated mutations from 8,667 proteins extracted from the SwissVar database. The four methods reached overall accuracies of 64%-76% with a Matthews correlation coefficient (MCC) of 0.38-0.53. We then used the outputs of these methods to develop a machine learning-based approach that discriminates between disease-associated and polymorphic variants (Meta-SNP). In testing, the combined method reached 79% overall accuracy and 0.59 MCC, ~3% higher accuracy and ~0.05 higher correlation than the best-performing method. Moreover, for the hardest-to-define subset of nsSNVs, i.e. variants for which half of the predictors disagreed with the other half, Meta-SNP attained 8% higher accuracy than the best predictor. Conclusions Here we find that the Meta-SNP algorithm achieves better performance than the best single predictor. This result suggests that the methods used for the prediction of variant-disease associations are orthogonal, encoding different biologically relevant relationships. Careful combination of predictions from various resources is therefore a good strategy for the selection of high reliability predictions. Indeed, for the subset of nsSNVs where all predictors were in agreement (46% of all nsSNVs in the set), our method reached 87% overall accuracy and 0.73 MCC. The Meta-SNP server is freely accessible at http://snps.biofold.org/meta-snp.


Introduction
The most common form of human genetic variation is the single nucleotide variant (SNV) [1]. Trivially, SNVs in non-coding regions are more common than those in coding regions. However, a smaller percentage of non-coding variants than of coding, non-synonymous SNVs (nsSNVs) has thus far been characterized as disease-causing (e.g. HGMD [2]). This fact is likely the result of experimentally difficult and therefore limited exploration of the non-coding world. Whatever the reason, most of the existing computational tools study the effects of nsSNVs specifically [3][4][5][6].
Many human diseases are monogenic, i.e. caused by damage to a single gene [7]. Identifying SNVs causative of monogenic disease is fairly straightforward. These are always functionally disruptive and consistently present in the disease population, but less frequent in healthy controls [8]. Complex diseases, on the other hand, are generally caused by a combination of moderately deleterious mutations in different genes, often leading to a disruption of the broader functional networks involved. Any one of these SNVs is unlikely to be significantly visible against the overarching background of human variation [9,10].
In the last decade, several algorithms have been developed to predict disease-related and functionally deleterious variants [11][12][13][14][15][16][17][18][19][20][21]. Methods that aim to identify disease-associated nsSNVs (which cause single amino acid substitutions in the protein sequence) are a different set of tools from those that look for nsSNVs that disrupt protein molecular function [11,17,21]. The latter focus on a single protein, a hard enough task in itself, while the former need to identify the mutation effect on the phenotype of an entire organism. One of the biggest challenges facing the nsSNV-disease mapping methods is the collection of development/testing data sets: beyond monogenic disease mutations and coding variants found by genome-wide association studies (GWAS), the experimental identification of complex disease-associated mutations is very subjective. The majority of the current methods rely on the manually curated collections of disease-associated variants from OMIM [7], SwissVar [22], and, more recently, the dbSNP [23] clinical SNV collections. Once the data sets are collected, all methods use some combination of the affected protein sequence/structure features and functional annotations to look for patterns indicative of disease involvement.
The overlap in development data sets and features of interest suggests that most methods should "pick up" similar patterns in the data. However, recent estimates [3] show that different tools vary significantly in the predictions they make, while each still attains relatively high levels of accuracy. Method orthogonality, i.e. each method getting a different set of variants right, may be one explanation for this phenomenon.
Given the relative lack of new annotated data sets and the high levels of accuracy already attained, moving the field forward has been difficult. Newly developed methods at best boast improved annotation speeds or incremental gains in performance, often at the cost of limited applicability. In this study we aimed to take advantage of method orthogonality to complement each tool's predictions with those of the other tools. The meta-predictor that we developed (Meta-SNP) identifies disease-causing nsSNVs by coupling some of the leading methodologies in prediction of nsSNV-disease (PhD-SNP [16]) and nsSNV-function associations (PANTHER [18], SIFT [17], SNAP [11]).

Dataset and benchmarking
Training and testing machine learning approaches requires an appropriate, representative set of reliably annotated data.
To develop a method for the detection of disease-associated nsSNVs we needed a large set of well-annotated disease-related (positive cases) and polymorphic (negative cases) variants. Although for Mendelian disease the annotation of disease-causing variations is reliable, the selection of polymorphic nsSNVs is still a problem. In this work, we consider as disease and polymorphic variants those annotated in SwissVar [22] as Disease and Polymorphism, respectively. Note that the SwissVar Polymorphisms may (and probably do) still carry undiscovered disease associations.
Our training set is composed of disease-related and polymorphism variants from the SwissVar database Octo  Table S1). The Consensus subset consists of the variations for which all four predictors returned identical predictions (46% of SV-2009 and 42% of NSV-2012). The Tie subset is the set of variants that were classified as disease-related and polymorphic by equally many predictors, two for each classification (14%/16%). The Majority is the subset of nsSNVs where three predictors agreed in the judgment and one disagreed (40%/42%).
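The Consensus, Majority and Tie definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that a binary call (True for disease-related or deleterious, False for polymorphic or neutral) is available from each of the four predictors for every variant.

```python
from collections import Counter

def agreement_subset(calls):
    """Assign a variant to Consensus, Majority or Tie from four binary calls.

    `calls` holds one boolean per predictor (True = disease/deleterious).
    Assumption: every predictor produced a call for this variant.
    """
    if len(calls) != 4:
        raise ValueError("expected exactly four predictor calls")
    positives = sum(bool(c) for c in calls)
    if positives in (0, 4):   # all four predictors agree
        return "Consensus"
    if positives == 2:        # two against two
        return "Tie"
    return "Majority"         # three against one

def subset_fractions(all_calls):
    """Fraction of variants falling into each subset."""
    counts = Counter(agreement_subset(c) for c in all_calls)
    total = sum(counts.values())
    names = ("Consensus", "Majority", "Tie")
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: counts[name] / total for name in names}
```

Applied to a full data set, `subset_fractions` would yield proportions of the kind quoted above (for example 46%, 40% and 14% for SV-2009).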

Prediction methods
In this work we predicted the effect of nsSNVs using PANTHER, PhD-SNP, SIFT and SNAP. Note that PANTHER, SIFT, and SNAP annotate variants as disruptive of protein function or equivalent to wild-type, while PhD-SNP particularly recognizes disease-associated substitutions.
PhD-SNP is a Support Vector Machine (SVM) based method trained to predict disease-associated nsSNVs using sequence information. The method takes as input the information about the mutation, such as its sequence environment and the profile at the mutated site, calculated by BLASTing [24] the protein sequence against the UniRef90 database [25]. For each mutation, PhD-SNP returns an output score (ranging from 0 to 1) that represents the probability of this nsSNV being associated with disease. The method considers 0.5 to be the threshold above which the nsSNVs are predicted to be disease-associated.
The PANTHER algorithm is based on a library of Hidden Markov Models (HMMs) obtained from the multiple sequence alignments of different protein families. PANTHER predicts the effect of nsSNVs in a two-step procedure. First, the affected protein is compared to all HMMs in the library to find the HMM of the query protein family. Then, this HMM is used to calculate the probability of the nsSNV disrupting the function of the affected protein. Note that when PANTHER is not able to map the affected protein to one of the families in its library, no output is returned.
SIFT uses evolutionary information to make predictions with regard to functional effects of nsSNVs. Our local installation of SIFT used the UniRef90 database for the necessary PSI-BLASTs. SIFT scores are normalized to range 0-1, where any score >0.05 represents a neutral substitution, while mutations scoring <0.05 are functionally deleterious.
SNAP is a neural network-based method that takes as input the biochemical features of the given substitution as well as predicted protein structural and functional features to differentiate neutral and non-neutral variants. The local installation of SNAP produces a raw score of -100 to 100, where all predictions >0 are non-neutral and <0 are neutral. Note the raw score is converted into a reliability index for all web-based predictions.
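As a minimal sketch, the method-specific thresholds described above can be collected into one helper that turns a raw score into a binary call. The names `DEFAULT_RULES` and `binary_call` are hypothetical. The PANTHER cutoff of 0.5 is the default quoted in the Meta-SNP implementation section, and its direction (higher means more likely disruptive) is an assumption, as is the treatment of scores that fall exactly on a threshold.

```python
# (threshold, direction): "above" means score > threshold is called deleterious,
# "below" means score < threshold is called deleterious.
DEFAULT_RULES = {
    "PhD-SNP": (0.5, "above"),   # probability of disease association
    "PANTHER": (0.5, "above"),   # assumed direction for the disruption probability
    "SIFT": (0.05, "below"),     # low SIFT scores are functionally deleterious
    "SNAP": (0.0, "above"),      # raw score in [-100, 100]; > 0 is non-neutral
}

def binary_call(method, score):
    """Return True (deleterious/disease), False (neutral), or None (no prediction).

    Assumption: a score exactly at the threshold is treated as neutral.
    """
    if score is None:            # e.g. PANTHER could not map the protein to a family
        return None
    threshold, direction = DEFAULT_RULES[method]
    if direction == "above":
        return score > threshold
    return score < threshold
```

For example, `binary_call("SIFT", 0.01)` gives a deleterious call, while `binary_call("PANTHER", None)` propagates a missing prediction.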

Implementation of the Meta-SNP algorithm
We trained Meta-SNP, a random forest-based binary classifier to discriminate between disease-related and polymorphic non-synonymous SNVs. Meta-SNP takes as input the output of the four predictors described above as an eight-element feature vector composed of two groups of four elements each. The first group is the set of raw output scores of the variant predictions from PANTHER, PhD-SNP, SIFT and SNAP. In case one of the input methods does not return a prediction, we used the method-defined default threshold for differentiating neutrals and non-neutrals as input to Meta-SNP (SNAP = 0, SIFT = 0.05, PhD-SNP = 0.5, PANTHER = 0.5).
The second group contains four elements extracted from the PhD-SNP protein sequence profile: (1 and 2) the frequencies of the wild-type (F_wt) and mutant (F_mut) residues at the mutated site, (3) the total number of sequences aligned at the mutated site (N_al) and (4) the conservation index (CI) [26]. Sequence profile information modulates Meta-SNP predictions by the conservation of the mutated position. This information is redundant across the four component methods, so for Meta-SNP we used only one version of the sequence profile, that from PhD-SNP.
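The eight-element input described above might be assembled as follows. This is a sketch under stated assumptions: the function and constant names are hypothetical, and the order of the elements within the vector is chosen for illustration only (the text specifies the two groups but not an ordering beyond the list of methods).

```python
# Default inputs used when a method returns no prediction (values from the text).
DEFAULT_SCORE = {"PANTHER": 0.5, "PhD-SNP": 0.5, "SIFT": 0.05, "SNAP": 0.0}
METHOD_ORDER = ("PANTHER", "PhD-SNP", "SIFT", "SNAP")  # assumed ordering

def build_features(scores, f_wt, f_mut, n_al, ci):
    """Build the eight-element Meta-SNP style feature vector.

    scores: dict mapping method name to raw score; missing or None entries
            are replaced by the method's default threshold.
    f_wt, f_mut: wild-type and mutant residue frequencies at the mutated site.
    n_al: number of sequences aligned at the mutated site.
    ci: conservation index.
    """
    raw = [
        DEFAULT_SCORE[m] if scores.get(m) is None else float(scores[m])
        for m in METHOD_ORDER
    ]
    profile = [float(f_wt), float(f_mut), float(n_al), float(ci)]
    return raw + profile
```

A variant without a PANTHER prediction, for instance, would receive 0.5 in the PANTHER slot and its actual scores elsewhere.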
Meta-SNP is a 100-tree RandomForest implementation from the WEKA [27] library, trained on SV-2009 using 20-fold cross-validation. The predictor outputs the probability that a given nsSNV is disease-related, where scores >0.5 indicate that the given variant is disease-causing.

Measures of performance
In all measures of performance (assuming that positives indicate disease and negatives indicate polymorphisms), TP (true positives) are correctly predicted disease-associated variants, TN (true negatives) are correctly predicted polymorphisms, FP (false positives) are polymorphic variants predicted to be disease-causing, and FN (false negatives) are disease-associated variants predicted to be polymorphic.
Predictor performance was evaluated using the following metrics: positive and negative predictive values (PPV and NPV, respectively), true positive and true negative rates (TPR and TNR, respectively), and overall accuracy (Q2; Eqn. 1):

$$\mathrm{PPV} = \frac{TP}{TP + FP}, \quad \mathrm{NPV} = \frac{TN}{TN + FN}, \quad \mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{TNR} = \frac{TN}{TN + FP}, \quad Q_2 = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

We also computed the Matthews correlation coefficient (MCC; Eqn. 2) as:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{2}$$

For each prediction, the binary classification (disease/polymorphism) is made at the output threshold of 0.5. Thus, if the probability of disease classification, P(D), is >0.5, the mutation is predicted to be disease-associated. If P(D) ≤0.5, the variant is predicted to be polymorphic. A reliability index (RI) for all predictions is calculated as follows:

$$\mathrm{RI} = 20 \cdot \left| P(D) - 0.5 \right|$$

Thus, RI ranges from 0 to 10 for both negative (polymorphic) and positive (disease-associated) predictions. Varying the RI threshold for annotating variants allows trading off accuracy of predictions for the coverage of all disease-associated and polymorphic variants in any given set.
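For readers who want to reproduce these measures, the following Python sketch computes them from confusion-matrix counts using the standard definitions. The function names are hypothetical, and the reliability index is written as 20·|P(D) − 0.5|, which is assumed here because it is the linear rescaling consistent with the 0-10 range and the 0.5 decision threshold stated above.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    def ratio(num, den):
        # Undefined ratios (empty denominators) are reported as NaN.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "TPR": ratio(tp, tp + fn),
        "TNR": ratio(tn, tn + fp),
        "Q2": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }

def classify(p_disease):
    """Binary call at the 0.5 threshold: P(D) > 0.5 is disease, otherwise polymorphism."""
    return "Disease" if p_disease > 0.5 else "Polymorphism"

def reliability_index(p_disease):
    """Assumed form of the RI: distance from the threshold rescaled to 0-10."""
    return 20.0 * abs(p_disease - 0.5)
```

With 40 true positives, 45 true negatives, 5 false positives and 10 false negatives, for instance, `confusion_metrics` gives Q2 = 0.85 and an MCC of about 0.70.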
We also report the area under the receiver operating characteristic (ROC) curve (AUC), calculated by plotting the True Positive Rate (TPR) as a function of the False Positive Rate (1 − TNR) at different probability thresholds for annotating a variant as disease-associated or polymorphic. The same metrics (Eqn. 1 and 2) were used to calculate the pairwise similarities between predictors on the subsets of variants predicted by both methods.
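The ROC construction described above can be illustrated with a short, dependency-free sketch. It sweeps every distinct score as a threshold and integrates the resulting curve with the trapezoidal rule. The function names are hypothetical, and a variant is counted as predicted positive when its score is at or above the current threshold, which is a convention chosen for this sketch.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    labels: 1/True for disease-associated, 0/False for polymorphic variants.
    scores: predicted probability of disease for each variant.
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if not y and s >= t)
        points.append((fp / neg, tp / pos))
    return points  # ends at (1.0, 1.0) once the lowest score is the threshold

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

This quadratic-time version is meant for clarity rather than speed; on data sets with tens of thousands of variants a sorted single pass would be preferable.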

Performances of four available methods
First we tested the accuracy of four stand-alone methods, PANTHER, PhD-SNP, SIFT and SNAP, on a large dataset of nsSNVs (SV-2009; Table 1 and Additional File 1 Table S2). For this set, PANTHER and PhD-SNP are most accurate, reaching ~75% overall accuracy (Eqn. 1) and ~0.83 AUC (Q2/AUC of 70%/0.73 for SIFT and 64%/0.79 for SNAP). Note that due to the lack of an appropriate number of homologous sequences, SIFT and PANTHER did not return any predictions in 8% and 26% of the cases, respectively. Also note that even though the PhD-SNP results are obtained with a 20-fold cross-validation procedure, its performance estimates may be biased as it was trained on the SV-2009 dataset. SNAP's lower Q2 value at high AUC suggests many false positive predictions. This is an expected outcome, as not all functionally deleterious mutations are disease-associated [8,28,29].

Scoring consensus predictions
We first analyzed the similarities between PANTHER, PhD-SNP, SIFT and SNAP by calculating percentage consensus predictions and the correlation between all possible pairs of methods (Table 2). These values were used to visualize the similarities between the methods with two Unweighted Pair Group Method with Arithmetic Mean (UPGMA) trees (Figure 1). PANTHER and PhD-SNP algorithms returned the highest number of common predictions (76%, MCC = 0.52). On the other hand, PhD-SNP and SNAP had only 64% of the predictions in common (correlation 0.36).
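A sketch of the pairwise comparison described above: for each pair of methods, it computes the fraction of common predictions and the correlation (MCC) of their binary calls, restricted to variants for which both methods returned a prediction. The function names are hypothetical and the input is assumed to be one list of True/False/None calls per method, aligned by variant.

```python
import math
from itertools import combinations

def pairwise_agreement(calls_a, calls_b):
    """Fraction of identical calls and MCC between two predictors.

    Variants where either predictor returned None are skipped.
    """
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no variants predicted by both methods")
    n11 = sum(1 for a, b in pairs if a and b)          # both deleterious
    n00 = sum(1 for a, b in pairs if not a and not b)  # both neutral
    n10 = sum(1 for a, b in pairs if a and not b)
    n01 = sum(1 for a, b in pairs if not a and b)
    agreement = (n11 + n00) / len(pairs)
    den = math.sqrt((n11 + n10) * (n11 + n01) * (n00 + n10) * (n00 + n01))
    mcc = (n11 * n00 - n10 * n01) / den if den else float("nan")
    return agreement, mcc

def all_pairs(calls_by_method):
    """Agreement and MCC for every pair of methods in a dict of call lists."""
    return {
        (m1, m2): pairwise_agreement(calls_by_method[m1], calls_by_method[m2])
        for m1, m2 in combinations(sorted(calls_by_method), 2)
    }
```

The resulting pairwise values could then be converted to distances (for example 1 − agreement) and passed to any UPGMA implementation to draw trees like those mentioned above.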
In addition, we evaluated the accuracy of PhD-SNP on the Consensus, Majority and Tie subsets of predictions (see Dataset and benchmarking). We expected a decrease in the level of accuracy from the set of variants where all methods agree in their predictions (Consensus) to those where methods largely disagree (Tie). This hypothesis is confirmed (Table 3), with the overall accuracy and MCC of PhD-SNP decreasing between these two subsets from 87% to 61% and from 0.73 to 0.16, respectively. An intermediate level of accuracy is attained on the Majority subset, where most of the predictors agree (70% Q2 and 0.37 MCC).
To understand the difference in the performance achieved on the three subsets (Consensus, Majority and Tie) we evaluated the residue conservation in the mutated positions using the protein sequence profile calculated by the PhD-SNP BLAST run (see Prediction methods). We compared the distributions of the wild-type and mutant residue frequencies (F_wt and F_mut, respectively) in the mutated positions for disease-related and polymorphic nsSNVs. In addition, we analyzed the differences in the distributions of the conservation index (CI). Our results show (Figure 3 and Table S3) increasing overlap between the distributions of F_wt, F_mut and CI for disease-related and polymorphic nsSNVs from the Consensus (Figure 3D,E,F) to the Majority (Figure 3G,H,I) to the Tie (Figure 3J,K,L) subset. As expected, an intermediate (average) distribution difference is observed for the whole SV-2009 dataset (Figure 3A,B,C).

Performances of the meta-predictor
To improve the detection of deleterious variants, we developed a meta-predictor (Meta-SNP) that combines the outputs of PANTHER, PhD-SNP, SIFT and SNAP. Meta-SNP uses single predictor outputs as input; it was trained and tested on the SV-2009 dataset using a 20-fold cross-validation procedure. Meta-SNP reaches 79% overall accuracy, 0.59 MCC and 0.87 AUC (Table 4). While Meta-SNP outperforms all other methods for all data sets, an accuracy decrease from the Consensus to the Majority to the Tie subset is still observed (87%, 75%, 69% Q2, respectively). The AUC for Meta-SNP is also higher than that of the single methods for all of the SV-2009 subsets (Figure 4).
Meta-SNP was additionally tested on NSV-2012, a disease/polymorphism balanced subset of nsSNVs added to SwissVar from October 2009 to February 2012 and belonging to proteins not found in SV-2009 (see Methods). The results on this dataset confirm that Meta-SNP performs better than PhD-SNP and all other predictors (see Figure 5A, Table 4 and Additional File 1 Table S4). Note that Meta-SNP performance on the whole set of variants added to SwissVar in 2009-2012 is similar in overall accuracy and AUC to that achieved on the disease/polymorphism balanced NSV-2012 dataset, albeit, as expected, at a lower PPV.
We also compared the performance of Meta-SNP to that of CONDEL [30], another recently developed meta-predictor. Meta-SNP is 4% more accurate (Q2) than CONDEL and achieves higher AUC (Table 4). As expected, the Meta-SNP prediction accuracy still drops between the Consensus, Majority and Tie subsets of NSV-2012 (Q2: 87% to 77% to 68%, AUC: 0.91 to 0.83 to 0.72, respectively, see Table S4 and Figure 5B). Finally, the Meta-SNP reliability index (see RI in Methods) helps to select more accurate predictions (Figure 5C); e.g., the NSV-2012 predictions with RI≥5 are on average ~87% accurate, albeit at a cost to recall (only ~65% of the variants reach this score). Similar trends are observed for the Consensus, Majority and Tie subsets (Additional File 1 Figure S1).
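The trade-off between accuracy and coverage at a given reliability cutoff can be sketched as below. The function name is hypothetical, and the RI is again assumed to take the form 20·|P(D) − 0.5|, i.e. the rescaling consistent with the 0-10 range and the 0.5 threshold given in the Methods.

```python
def accuracy_at_ri(predictions, min_ri):
    """Accuracy and coverage of predictions whose RI is at least `min_ri`.

    predictions: list of (p_disease, is_disease) pairs, where p_disease is the
                 predicted probability and is_disease the annotated class.
    Returns (accuracy on retained predictions, fraction of predictions retained).
    """
    if not predictions:
        raise ValueError("no predictions given")
    # Assumed RI definition: 20 * |P(D) - 0.5|, giving values from 0 to 10.
    kept = [(p, y) for p, y in predictions if 20.0 * abs(p - 0.5) >= min_ri]
    coverage = len(kept) / len(predictions)
    if not kept:
        return float("nan"), coverage
    correct = sum(1 for p, y in kept if (p > 0.5) == bool(y))
    return correct / len(kept), coverage
```

Evaluating this function over cutoffs from 0 to 10 would trace out a curve of the kind shown in Figure 5C, with accuracy rising as coverage falls.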

Discussion
The results presented in this work show that combining predictors of nsSNV effects into a single meta-predictor (Meta-SNP) improves the detection of disease-causing variants. The Meta-SNP algorithm performs slightly better (3% gain in accuracy, Q2) than PhD-SNP, the best of the component methods at picking out disease associations. Although this improvement cannot be considered very high, the advantage of Meta-SNP over a single predictor is three-fold: (1) the use of four orthogonal methods makes Meta-SNP more robust in handling new data sets, which may not follow the same distribution as the sets used for method development, (2) Meta-SNP produces a single score, rather than four separate scores, for the prediction of disease-related nsSNVs and (3) Meta-SNP significantly outperforms all component methods in classifying "borderline" mutations, i.e. ones that are very difficult to classify as disease-associated or polymorphic with current computational means. As our observations (Figure 3 and Additional File 1 Table S3) suggest, the overlap between distributions of evolutionary features of disease and polymorphic variants, especially for the Tie data set, may indicate either (1) a lack of resolution in experimental data (i.e. polymorphisms may actually be disease-causing mutations that have not yet been identified as such), (2) inaccuracies in building evolutionary profiles (i.e. simple PSI-BLAST searches may not be enough for all cases), or (3) our inability to differentiate variants contributing to complex disease phenotypes (i.e. when more than one variant is necessary for the disease phenotype to become visible). In all of these cases, however, the computational algorithms that strongly rely on a single evolutionary model are unable to differentiate disease variants from polymorphisms. While the calculation of accurate alignments and profiles is key to the performance of the predictive methods, we should also focus on improving the resolution of our experimental annotations and available data collections. Additionally, understanding the contribution of multiple correlated nsSNVs in one or many proteins will enable discrimination between disease-associated and polymorphic variants in unconserved sites. In the meantime, combining many methods into a single model, Meta-SNP, provides a new and significantly more accurate way of assessing the disease association of human variants that are most often mis-predicted by single sequence-based methods.

[Figure 1 caption (fragment): ... Table 2. The trees were drawn using the drawtree package [31].]

[Figure caption: Overlap between the predictions returned by PANTHER (blue), PhD-SNP (red), SIFT (grey) and SNAP (green), generated using Venny [32].]

Conclusion
We developed a meta-predictor (Meta-SNP) that integrates the PANTHER, PhD-SNP, SIFT and SNAP methods to predict disease-associated nsSNVs. To quantify the increase in accuracy achieved by the combination of the different methods we compared the performance of our meta-predictor against that reached by the single methods. Using a balanced set of 35,766 nsSNVs, the meta-predictor attains ~3% higher accuracy, 0.03 higher AUC and 0.06 higher MCC with respect to PhD-SNP, the highest scoring of all stand-alone predictors. Although this overall increase in performance is not high, the performance is significantly improved on the ~58% of the dataset where the component predictors disagree (Majority and Tie subsets). For these subsets the meta-predictor achieved ~6% higher overall accuracy and 0.12 higher MCC with respect to PhD-SNP. Meta-SNP is robust for new data as it reached similar levels of accuracy on a set of 972 new nsSNVs in proteins not included in the initial training dataset.

Abbreviations used
SNP: single nucleotide polymorphism; SNV: single nucleotide variant; nsSNV: non-synonymous single nucleotide variant; Q2: overall accuracy; TPR and TNR: true positive and negative rates; PPV and NPV: positive and negative predictive values

Additional material
Additional file 1: Collective judgment predicts disease-associated single nucleotide variants. Table S1. Composition of the datasets. Table S2. Performance of the four methods on the SV-2009 subsets. Table S3. Comparison of the distribution of sequence profile features. Table S4. Performance of the four methods on the NSV-2012 subsets. Fig. S1. Performance of Meta-SNP as a function of the RI.