Better prediction of functional effects for sequence variants

Elucidating the effects of naturally occurring genetic variation is one of the major challenges for personalized health and personalized medicine. Here, we introduce SNAP2, a novel neural network-based classifier that improves over the state of the art in distinguishing between effect and neutral variants. Our method's improved performance results from screening many potentially relevant protein features and from refining our development data sets. Cross-validated on >100k experimentally annotated variants, SNAP2 significantly outperformed other methods, attaining a two-state accuracy (effect/neutral) of 83%. SNAP2 also outperformed combinations of other methods. Performance increased for human variants but much more so for other organisms. Our method's carefully calibrated reliability index informs selection of variants for experimental follow-up, with the most strongly predicted half of all effect variants predicted at over 96% accuracy. As expected, the evolutionary information from automatically generated multiple sequence alignments gave the strongest signal for the prediction. However, we also optimized our new method to perform surprisingly well even without alignments. This feature reduces prediction runtime by over two orders of magnitude, enables cross-genome comparisons, and makes our new method the best solution for the 10-20% of sequence orphans. SNAP2 is available at: https://rostlab.org/services/snap2web

Definitions used: Delta, input feature that results from computing the difference between the feature scores for the native amino acid and the feature scores for the variant amino acid; nsSNP, non-synonymous SNP; PMD, Protein Mutant Database; SNAP, Screening for non-acceptable polymorphisms; SNP, single nucleotide polymorphism; variant, any amino acid-changing sequence variant.

: Performance on independent data sets
4. Table SOM_3: Performance estimates on ALL data set
5. Figure SOM_1: Accuracy-Coverage curves on ALL data set
6. Figure SOM_2: Score distribution for SNAP2 on ALL data set

Short description of Supporting Online Material
This SOM contains a detailed description of the features and their extraction for use in the neural network predictor. We also included three tables: (1) a list of the cluster representatives from the AAindex database that were selected as helpful features in SNAP2 noali, (2) a performance comparison on independent protein-specific data, namely the HIV-1 protease and the Escherichia coli LacI repressor, and (3)

Material
Input feature calculation. To use amino acid and protein properties in neural networks, they have to be presented as normalized numerical values. The following section describes the exact calculation or extraction of these values.
Delta features. Where applicable, we calculated delta features that describe the change in certain features between the native amino acid and its variant. All delta features are encoded by two nodes per residue: one for the "severity" (absolute difference between wildtype and mutant value), the other for the "direction" ('1' if positive and '0' if negative) of change.
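The two-node encoding can be illustrated with a minimal Python sketch. The function name `delta_feature`, the sign convention (variant minus native), and the handling of a zero difference are assumptions made for illustration; the text specifies only the severity and direction nodes.

```python
from typing import Tuple


def delta_feature(native_value: float, variant_value: float) -> Tuple[float, int]:
    """Encode a feature change as (severity, direction).

    severity  -- absolute difference between native and variant value
    direction -- 1 if the change is positive, 0 otherwise
    """
    # Assumption: the change is measured as variant minus native.
    diff = variant_value - native_value
    severity = abs(diff)
    # Assumption: a zero difference is encoded as direction 0.
    direction = 1 if diff > 0 else 0
    return severity, direction
```

For a normalized hydrophobicity of 0.25 for the native residue and 0.75 for the variant, the sketch returns a severity of 0.5 and a direction of 1.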
Biophysical properties. In addition to mass, volume, charge, hydrophobicity and the presence of C-beta branching amino acids (as already present in SNAP), we collected one representative for each cluster of correlated amino acid indices from the AAindex database 1. These indices are matrices containing values for each amino acid (or pair of amino acids) that cover a variety of amino acid properties and features derived from these (Table SOM_1). We extracted the corresponding (already normalized) value for each residue in the window, resulting in w input values. Then we calculated the two-node delta feature. The first node was the absolute difference between the wildtype and the mutant value; the second node encoded the direction of change, as defined for all delta features.
Binding residues. We used ISIS 2 to predict the protein-protein binding sites and DISIS 3 to predict the protein-DNA binding sites. We extracted both the binary prediction (binding/non-binding) and the raw prediction score for each residue in the window (21 * 2 = 42 input nodes).
Disordered regions. We used the META-Disorder predictor (MD; 4) to calculate a three-node disorder feature for all residues in the window: we extracted the binary per-residue prediction (disordered/not-disordered) and the prediction reliability.
Proximity to N- and C-terminus. We calculated the proximity of the variant position to each terminus individually as the normalized number of residues between the terminus and the position of interest (2*1 = 2 input nodes).
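A minimal Python sketch of the two terminus nodes follows. The text does not state how the residue count is normalized; dividing by the sequence length minus one is an assumption of this sketch, as are the function name `terminus_proximity` and the 1-based position.

```python
from typing import Tuple


def terminus_proximity(position: int, length: int) -> Tuple[float, float]:
    """Return (N-terminal node, C-terminal node) for a 1-based position.

    Each node is the number of residues separating the position from the
    terminus, scaled to the range 0..1.
    """
    if not 1 <= position <= length:
        raise ValueError("position must lie within the sequence")
    residues_to_n = position - 1
    residues_to_c = length - position
    # Assumption: normalize by the largest possible separation (length - 1).
    denominator = max(length - 1, 1)
    return residues_to_n / denominator, residues_to_c / denominator
```

With this scaling, a value of 0 means the variant sits directly at the respective terminus and a value of 1 means it sits at the opposite end.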
Contact potentials. We extracted normalized distance-dependent statistical potentials (for contacts within 5 Ångströms = 0.5 nm) 5. For both native amino acid and variant, we extracted the potential as a 20-node feature. Additionally, we calculated the delta values for this feature (difference between native and variant) for their eight (four residues before and after) sequence neighbors (20*2 + 8*2 = 56 input nodes).
Co-evolving positions. We estimated the co-evolution of positions in a multiple sequence alignment following the approach from 6. For each position in the multiple alignment, we used the OMES 7 algorithm to calculate the correlation with any other position. The OMES method compares the observed co-occurrence of amino acid X at position i and amino acid Y at position j to the expected co-occurrence at positions i and j. This pairwise comparison yielded a ranking of all positions based on their pairwise correlation to any other position. From these, we extracted a six-node feature indicating the rank and the score (i.e. the deviation from the expectation value) for the three positions most correlated with the mutation position (2*3 = 6 input nodes).
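The comparison of observed and expected co-occurrence can be sketched in Python as follows. This follows one common formulation of OMES (squared deviation between observed and expected pair counts, divided by the number of gap-free sequences); the exact variant and normalization used in SNAP2 are not given here, so the summation over all residue combinations, the gap symbol, and the helper names `omes_score` and `top_partners` are assumptions of this sketch.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple

GAP = "-"  # assumption: gaps are written as '-'


def omes_score(alignment: Sequence[str], i: int, j: int) -> float:
    """Observed-minus-expected-squared score for alignment columns i and j."""
    # Keep only sequences that have a residue (no gap) in both columns.
    pairs = [(s[i], s[j]) for s in alignment if s[i] != GAP and s[j] != GAP]
    n_valid = len(pairs)
    if n_valid == 0:
        return 0.0
    observed = Counter(pairs)
    count_i = Counter(x for x, _ in pairs)
    count_j = Counter(y for _, y in pairs)
    score = 0.0
    # Assumption: sum over all combinations of residues seen in each column.
    for x, y in product(count_i, count_j):
        expected = count_i[x] * count_j[y] / n_valid
        score += (observed[(x, y)] - expected) ** 2
    return score / n_valid


def top_partners(alignment: Sequence[str], position: int,
                 k: int = 3) -> List[Tuple[int, float]]:
    """Return the k columns most correlated with `position` as (column, score)."""
    n_cols = len(alignment[0])
    scored = [(omes_score(alignment, position, j), j)
              for j in range(n_cols) if j != position]
    # Highest score first; ties are broken by column index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(j, s) for s, j in scored[:k]]
```

In the toy alignment `["AAC", "AAD", "GGC", "GGD"]`, columns 0 and 1 always change together and score 1.0, whereas columns 0 and 2 vary independently and score 0.0.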
Residue annotation. In addition to SWISS-PROT annotations and SIFT predictions as already used in SNAP we considered residue annotation from Pfam 8 and PROSITE 9 to describe native and variant amino acids: (i) We determined whether the position was part of a PfamA domain. If so, we collected metrics of domain conservation and the posterior probability of native and variant belonging to that domain (4 input nodes). (ii) From PROSITE we extracted a binary single-node feature for all residues in the window indicating whether the specific residue is part of a PROSITE pattern (21 input nodes).
Low-complexity regions. We used the SEG 10 algorithm to mask protein regions of low complexity. From this masking, we extracted a feature of 21 binary input nodes indicating whether a mutation is in or close to a low-complexity region.
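The 21 binary nodes can be read as a window of 10 residues on either side of the variant position. The sketch below assumes that the SEG output has already been converted into a per-residue boolean mask and that window positions beyond the sequence ends are encoded as 0; both points, and the function name `window_flags`, are assumptions for illustration.

```python
from typing import List, Sequence


def window_flags(mask: Sequence[bool], center: int,
                 half_width: int = 10) -> List[int]:
    """Return 2 * half_width + 1 binary nodes centered on a 0-based position.

    mask[k] is True if residue k lies in a masked (low-complexity) region.
    """
    flags = []
    for pos in range(center - half_width, center + half_width + 1):
        inside = 0 <= pos < len(mask)
        # Assumption: positions outside the protein are encoded as 0.
        flags.append(1 if inside and mask[pos] else 0)
    return flags
```

The same windowing pattern applies to the other per-residue binary features described above, such as the PROSITE pattern membership.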

RICJ880113  Relative preference values of amino acids at C2 17
SIMK990101  Distance-dependent statistical potential (contacts within 0-5 Ångströms) 5

* We listed the best-performing input features, i.e. amino acid indices that were selected by the feature selection procedure. Other indices from the corresponding clusters performed similarly. For each of these features, both window-based and delta features were included in the final sequence-only network SNAP2 noali.

Figure SOM_1: Accuracy-Coverage curves for ALL data. These figures show performance on the ALL data set. Our new method SNAP2 (dark blue) outperforms its predecessor (SNAP, light blue), and SIFT (green) for both the variants that do not affect function (neutral, a) and for those that affect function (b). The x-axes indicate coverage/recall (Eqn. 1,2), i.e. the percentage of observed neutral (a) and effect (b) variants that are correctly predicted at the given threshold. The y-axes indicate accuracy/precision (Eqn. 1,2), i.e. the percentage of neutral (a) and effect (b) variants among all variants predicted in either class at the given threshold. The dark line (SNAP2 noali) marks the performance of a SNAP2 version that does not use any information from sequence alignment. All results are computed on the test sets not used in training. A pink line marks the performance of a random predictor.
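One point on such an accuracy-coverage curve can be computed as in the following Python sketch, which follows the verbal definitions in the caption. The function name `accuracy_coverage`, the convention that a variant is assigned to the class when its score reaches the threshold, and the return value `None` for an undefined ratio are assumptions of this sketch, not part of the SNAP2 implementation.

```python
from typing import Optional, Sequence, Tuple


def accuracy_coverage(scores: Sequence[float], labels: Sequence[bool],
                      threshold: float) -> Tuple[Optional[float], Optional[float]]:
    """Return (accuracy, coverage) for one class at one threshold.

    labels[k] is True if variant k is observed to belong to the class.
    """
    # Assumption: a variant is predicted in the class if score >= threshold.
    predicted = [s >= threshold for s in scores]
    true_positives = sum(1 for p, l in zip(predicted, labels) if p and l)
    n_predicted = sum(predicted)
    n_observed = sum(1 for l in labels if l)
    # accuracy/precision: correct predictions among all predictions in the class
    accuracy = true_positives / n_predicted if n_predicted else None
    # coverage/recall: correct predictions among all observed class members
    coverage = true_positives / n_observed if n_observed else None
    return accuracy, coverage
```

Sweeping the threshold over the full score range and plotting the resulting pairs traces one curve per method and class.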