Prediction of disease-related mutations affecting protein localization

Background Eukaryotic cells contain numerous compartments, which have different protein constituents. Proteins are typically directed to compartments by short peptide sequences that act as targeting signals. Translocation to the proper compartment allows a protein to form the necessary interactions with its partners and take part in biological networks such as signalling and metabolic pathways. If a protein is not transported to the correct intracellular compartment either the reaction performed or information carried by the protein does not reach the proper site, causing either inactivation of central reactions or misregulation of signalling cascades, or the mislocalized active protein has harmful effects by acting in the wrong place. Results Numerous methods have been developed to predict protein subcellular localization with quite high accuracy. We applied bioinformatics methods to investigate the effects of known disease-related mutations on protein targeting and localization by analyzing over 22,000 missense mutations in more than 1,500 proteins with two complementary prediction approaches. Several hundred putative localization affecting mutations were identified and investigated statistically. Conclusion Although alterations to localization signals are rare, these effects should be taken into account when analyzing the consequences of disease-related mutations.


Background
Eukaryotic cells contain numerous compartments, such as cytoplasm, mitochondria, Golgi apparatus, and peroxisomes, all of which contain different protein constituents and have different functions. Proteins are typically directed to these compartments by short peptide sequences that act as targeting signals. For example, secretory, chloroplast and mitochondrial targeting peptides are located at the N terminus, whereas signals for other compartments can be within the amino acid sequence. Terminal signal peptides are typically cleaved during the protein translocation process.
Protein function depends on numerous factors. One important but often neglected property is its subcellular localization. Translocation to the proper compartment allows a protein to form the necessary interactions with its partners and take part in biological networks. For example, signalling and metabolic pathways are dependent on the location of the constituent proteins. Failure to be transported to the correct intracellular compartment can have detrimental effects, which appear in different ways. Either the reaction performed or information carried by the protein does not reach the proper site, causing either inactivation of central reactions or misregulation of, eg, signalling cascades, or the mislocalized protein is active, but has harmful effects by acting in the wrong place. Subcellular localization of proteins and peptides has long been investigated using numerous methods. Recently, high-throughput methods have been developed based either on the use of reporter genes/tags or by purification, fractionation and analysis of cellular compartments [1,2]. Information on protein localization is scattered throughout publications and numerous databases. Fortunately, central resources such as the Human Protein Reference Database (HPRD) [3], UniProt [4] and Gene Ontologies [5] now exist to integrate information from several sources. A problem with these databases, however, is that data quality and experimental methods vary. Further, some databases contain experimentally validated localization information whereas others also contain localization predictions. The picture is further complicated by the fact that a protein can be localized in more than one compartment, often depending on the state of the cell. Thus, databases that contain only experimentally validated data may not provide complete information for all proteins.
Numerous methods have been developed to predict protein subcellular localization (for review, see eg, [6]). The very first methods, developed in the 1970s, identified microbial signal peptides [7,8]. Now, methods and protocols exist for the prediction of over 10 cellular compartments and subcompartments. Although the actual prediction algorithms and methods differ, all are based on sequence signature patterns. Some general predictors are useful for all subcompartments, but the majority of methods are specific for individual compartments and organisms or groups of organisms. The reliability of individual methods is relatively high, close to 90% (see, eg, [9-11]).
Disease-causing mutations result in abnormal cellular function through numerous mechanisms. To date, pathological mechanisms have been revealed for only a fraction of all known mutations. Mutation information has been collected and stored in locus-specific (eg, [12,13]) and general (such as Online Mendelian Inheritance in Man (OMIM) and Human Gene Mutation Database (HGMD)) databases. Many experimental methods are tedious, expensive and difficult to use. Disease-causing mutations are identified for diagnostic purposes, and thus most medical centers identify the genetic mutation(s) without acquiring further information about the protein. We and others have applied numerous bioinformatic methods to predict and explain the consequences of mutations. Recently, we discussed the applicability of some 40 analysis and prediction methods [14,15]. The effects and consequences vary depending on the site and type of mutation, with insertions and deletions usually leading to truncated proteins. These cases are easy to explain if a substantial part of the protein is missing. To understand protein structure and function, however, missense mutations are most interesting because they often indicate residues that are critical for, and changes that are deleterious to, structure and/or function.
Most mutations reduce protein activity, but increasing numbers of gain-of-function mutations [16,17] are also being identified. Relatively few detailed investigations have described protein mislocalization due to disease-related mutations or introduced genetic alterations. In addition, all such publications report a limited number of mutations in a single protein.
Targeting signals tend to be conserved and thus sensitive to alterations; therefore, we can assume that localization prediction methods can be applied to the analysis of point mutations.
Here we use bioinformatics to investigate the effects of known disease-related mutations on protein targeting and localization by analyzing 22,416 missense mutations. Several hundred putative localization mutations were identified with two complementary multiprediction approaches. The results indicate that although alterations to localization signals are rare, localization predictors should be added to the toolkit used for mutation analysis. Our results also suggest pathological mechanisms for a number of mutations and point to cases for further experimental investigation.

Results and discussion
We investigated the effects of disease-related mutations on protein localization by performing large-scale analysis and prediction with two different but complementary methods. Because we needed unambiguous mapping of DNA mutations to protein sequences, we performed filtering steps. We obtained experimentally identified protein localizations from HPRD [3], which is considered a highly accurate, consistent and reliable source of protein annotations.

Reliability of the individual localization predictors
Before approaching the mutation effect predictions, we wanted to test the applicability of the methods to the dataset. Because HPRD contains experimentally verified data, we compared the localizations to predictions for the wild type proteins. The analysis was made for localizations for which the Scandinavian protocol (SP) and/or WoLF PSORT make predictions.
The compartments with the largest numbers of proteins are plasma membrane, cytoplasm, nucleus, extracellular space, and mitochondria (see Additional file 1). Endoplasmic reticulum (ER) had the highest number of proteins as a secondary classification (see Additional file 2). Proteins were distributed unequally among the different compartments. Although disease-related proteins form a special group, they still reflect the overall properties of all proteins. One reason for the poor behaviour of certain predictors is likely that they are usually not used alone; other programs are used to sort the data into localization routes before these tools are applied. Overall, the methods obtained good precision at the cost of recall (false negatives). In summary, the individual methods can be applied with relatively high accuracy and precision to localization predictions. Methods that predict the localization at the end of a complex pathway are less reliable when applied directly to sequences.

Reliability of the combined localization predictors
The SP predicted localization for 12 possible compartments and WoLF PSORT (animal version) predicted ten localizations. The results for the two approaches and comparison to experimental data for the wild type proteins are shown in Tables 2 and 3. Several parameters were calculated to describe the prediction performance. For the SP, altogether 60.9% (966/1586) of predictions were correct. Seventy proteins received two predictions in TargetP and thus two routes in SignalP. In these cases both predictions are included. No mitochondrial periplasmic space proteins were predicted and the false negative rate is very high for these cases. Precision and recall are low for Golgi transmembrane as well as for peroxisomal localization. Accuracy and precision are usually clearly better than the recall values, which is in line with the results for individual predictors (Table 1). The results for gPM and mPM were combined with those for plasma membrane, since these localizations were predicted only for 3 and 6 proteins, respectively.
In the case of WoLF PSORT, 33.7% (1696/5095) gave correct predictions (Table 3). There are a number of dual predictions, eg, for proteins that shuttle between cytosol and nucleus. Results for these predictions were considered as correct only if the protein was found in both compartments. Values for accuracy ranged from 0.69 to 0.98 (average 0.854), whereas recall ranged from 0 to 0.84 (average 0.375). Peroxisomal proteins clearly had the lowest prediction accuracy. The results for WoLF PSORT do not allow a direct comparison with the SP, because WoLF PSORT considers combined predictions to be correct when one of the predictions is correct. Actually, just six classes had a substantial number of predicted proteins. The overall accuracy is almost identical for the two protocols, whereas the SP has clearly better precision and a somewhat higher MCC score. The recall is slightly better for WoLF PSORT.
In conclusion, detailed analysis of the prediction performance indicates that the subcellular localization predictors still have much to improve. However, because the accuracy of individual predictions is rather high, these methods are indeed applicable to systematic analysis of mutations even though the precision, recall and MCC are clearly suboptimal. The more steps there are in the analysis, the lower the expected accuracy (and other parameter values). Thus, if the analysis is based on five consecutive steps (as in the SP) in which each step has 90% accuracy, the final expected accuracy would be 59% (0.9^5 ≈ 0.59).
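The compounding of per-step accuracy can be illustrated with a short calculation. This is only a sketch: it assumes that the steps err independently of each other, and the function name is introduced here for illustration.

```python
def chained_accuracy(step_accuracy: float, steps: int) -> float:
    """Expected accuracy of a pipeline of consecutive steps.

    Assumes every step has the same accuracy and that errors in
    different steps are independent, so the accuracies multiply.
    """
    return step_accuracy ** steps


if __name__ == "__main__":
    # Accuracy after 1..5 consecutive steps at 90% accuracy each.
    for n in range(1, 6):
        print(n, round(chained_accuracy(0.9, n), 3))
```

With five steps at 90% each, the product falls to about 0.59, which is the figure quoted in the text.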

Analysis of mutation effects
As the results above indicate, the subcellular localization of the investigated proteins can be predicted with rather high accuracy for individual compartments, and the multipredictors also provide useful data. The effect of mutations on protein localization was tested for all 22,416 missense mutations. In this analysis we looked for differences in predicted localization compared with that for wild type forms. Even if the prediction of the compartment was incorrect, a change in the predicted localization due to mutation might indicate the mutation mechanism and be useful for further studies. A similar effect has also been useful in some other bioinformatics predictions, such as those of protein secondary structures.
Table footnotes: tp, the number of positive cases that were correctly predicted; tn, the number of negative cases correctly predicted; fp, the number of positive cases incorrectly predicted; fn, the number of negative cases incorrectly predicted. Abbreviations in the SP tables: C, cytosol; Gtm, Golgi, transmembrane; Mma, mitochondrial matrix; Mps, mitochondrial, periplasmic space; Mtm, mitochondrial transmembrane; N, nuclear; P, peroxisomal; PM, plasma membrane; S, secreted. Abbreviations in the WoLF PSORT tables: C, cytosol; CK, cytoskeleton; G, Golgi compartment; M, mitochondrial; N, nuclear; P, peroxisomal; PM, plasma membrane; S, secreted. A slash sign between localizations indicates alternative predicted localizations and an underline sign multiple predicted localizations. In the mutation tables, the numbers separated by the slash sign are the number of proteins for which the wild type localization was correctly predicted and the number of analyzed mutations, respectively.
Figure: Amino acid distribution for the two prediction schemes.
The SP predicted that 203 mutations would alter protein localization. Results in Table 4 and in Additional file 3 show the distribution of mutations in the different subcompartments for the mutations and proteins in which the mutations appear, respectively. The numbers represent correctly predicted proteins and the total number of mutations for each category. The most common original compartments for proteins whose localization changed on account of the mutation were plasma membrane, Golgi transmembrane, nucleus, and cytoplasm. Most common among the mutant sublocalizations were plasma membrane, Golgi transmembrane, and nucleus. The single most common predicted mutation type was from plasma membrane (for wild type localization) to Golgi transmembrane: altogether 47 cases, 17 of which had the correct prediction for the wild type form. Although the number of correct wild type predictions was not directly related to the mutation predictions, the numbers varied widely; an extreme case is the C to N prediction, with 19 of 20 mutations, in 11 proteins out of 12, having the correct wild type prediction. The range of changes to localizations of mutations varied from one to five, the highest being for Gtm proteins. Similarly, the predicted range of mislocalizations from one subcompartment to others varied from one to six, with cytoplasmic proteins being redirected to six different compartments when mutated.

Figure: Schematic illustration of the analysis of protein localization with the Scandinavian protocol.
Results for the mutations and proteins analyzed by WoLF PSORT are shown in Table 5 and Additional file 4, respectively. To avoid excessive partitioning of the results into very small groups, only the results for the highest prediction score are indicated. About 50% of the wild type proteins had the correct localization. Altogether, WoLF PSORT found 183 cases with a predicted alteration caused by mutation. The highest number of mutation-based reroutings to other compartments was for proteins whose wild type form was predicted to localize to the cytoplasm. Extracellular space, cytoplasm, plasma membrane, nucleus and mitochondria are the most common localizations for mutant proteins. In comparison to the SP, WoLF PSORT had somewhat lower numbers in target compartments. The changes with the largest number of mutations were CN to C, and N to C, which are related predictions. WoLF PSORT may suffer from using BLAST as part of its algorithm. For the SP, the search for homologues was left out; however, leaving it out was not possible for WoLF PSORT.
The results for the identified changes in protein localization due to missense mutations are shown in Additional file 5 and Additional file 6. The two prediction approaches, SP and WoLF PSORT, agreed on 17,267 (77%) of the total 22,416 mutations when all predictions of WoLF PSORT were taken into account. The two approaches predicted 203 and 183 mutations, respectively, to alter the target compartments of mutant proteins, affecting 105 and 92 proteins. Eighteen of these proteins were common to the two methods, and in these proteins the protocols agreed that 12 mutations affect protein localization. The two methods predicted the same compartment mislocalization in seven cases. This indicates that neither of the methods was able to detect all putative localization mutations. A similar result, calling for the use of several tools, was apparent when splice site prediction tools were tested for mutation analysis [18].
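The overlap figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the function names are introduced here for illustration, and the union is computed by inclusion-exclusion on the assumption that the 12 shared mutations are the only ones flagged by both methods.

```python
def agreement_fraction(agreed: int, total: int) -> float:
    """Fraction of mutations on which the two approaches agree."""
    return agreed / total


def predicted_by_either(n_first: int, n_second: int, n_both: int) -> int:
    """Mutations flagged by at least one method (inclusion-exclusion)."""
    return n_first + n_second - n_both


if __name__ == "__main__":
    # Counts taken from the text: 17,267 agreements out of 22,416 mutations;
    # 203 (SP) and 183 (WoLF PSORT) predicted changes, 12 flagged by both.
    print(f"agreement: {agreement_fraction(17267, 22416):.1%}")
    print("flagged by either method:", predicted_by_either(203, 183, 12))
```

The union comes to 374 mutations, the number reported in the Conclusion as predicted by at least one method.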
We can estimate the number of expected mutations in localization sites. Our data set contains 1,516 proteins, which consist of 1,054,823 amino acid residues, and which have 2,373 localizations based on HPRD. The length of the targeting peptides varies from a few residues to close to 30. If we use an average value of eight residues for the targeting peptide, we should see 403 ((2,373 × 8 / 1,054,823) × 22,416) mutations in localization signals. This number is almost exactly what was observed.
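The estimate can be reproduced as follows. This is a sketch under the assumptions implicit in the text: one targeting signal per annotated localization, a fixed average signal length, and mutations spread uniformly over all residues. The function and parameter names are illustrative.

```python
def expected_signal_mutations(
    n_localizations: int,
    signal_length: float,
    total_residues: int,
    n_mutations: int,
) -> float:
    """Expected number of mutations that fall inside targeting signals.

    Assumes one signal of `signal_length` residues per localization and
    a uniform distribution of mutations over all residues.
    """
    signal_fraction = n_localizations * signal_length / total_residues
    return signal_fraction * n_mutations


if __name__ == "__main__":
    expected = expected_signal_mutations(2373, 8, 1_054_823, 22_416)
    print(round(expected))  # rounds to 403
```

Because the result scales linearly with the assumed signal length, the estimate is only as good as the chosen average of eight residues.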
The distribution of the amino acid changes in the predicted localization alterations is shown in Fig. 1. The amino acid distributions for mutations were compared with information for all human proteins taken from Codon Usage Tabulated from GenBank (CUTG) [19]. The distribution of all the mutations was significantly biased compared to random distribution in all amino acid types except for D and H (see Additional file 7). The results are in line with previous mutation distribution studies for numerous proteins and secondary structural elements within them [20-22], including mutations in the protein kinase family [23] and in immunodeficiencies [12]. These studies indicated a highly skewed distribution for disease mutations, which also varies between secondary structural elements. Data for the SP indicated that mutations are most common in R (Fig. 2 and Additional file 8). Arginine is coded by six synonymous codons, four of which contain a CpG dinucleotide, a well known mutational hot spot [24]. Also G, L and M are frequently mutated. The most common mutant residues were R, C, and P, of which arginine is the most common. Arginine was usually replaced by C (14 of 50 cases), making this the single most frequent mutation type. Eight of 9 mutations to W were from R, and 7 of 8 mutations to Q were from arginine. Arginine was mutated altogether to 10 other residues, i.e., all except two (K and M) of the possible substitutions with single nucleotide changes. Arginine was also the most common resulting residue from mutations in other codons, and it was the residue type with the highest number of original residues, 9. Somewhat surprisingly, no localization mutations were identified in Q, which however occurred 8 times as a mutant residue.
Of note, only two mutations to A and three mutations to F were predicted to be disease-related. H, Q and E were the least frequently mutated residues. These results follow somewhat the general amino acid distribution with prominent exceptions like arginine.
WoLF PSORT results show some differences from the SP, which may have originated from the prediction algorithm. R, G, and L were the most commonly mutated residues. However, arginine did not show the clear overrepresentation seen in the SP data. D, H and K were the least mutated residues (Fig. 2 and Additional file 9). Mutations to G and R both appeared in seven original residue types, whereas S was mutated from eight original residues. These were also the residues that had the highest number of mutant residue types. Only one change to H and two each to N, D and I were predicted to be related to diseases.

Comparison to known mislocalization mutations
Our results predicted localization changes that underlie many different types of diseases, including those involving signal transduction, metabolism, immunodeficiencies, eye diseases, developmental disorders and cancers (see Additional file 5 and Additional file 6). Some disease-related mutations that have been confirmed to affect protein localization have been described. Such reports are scattered in the literature. Because no database is available for such mutations, we performed a literature search and identified a number of cases.
Mutations in SHOX, a homeobox-containing gene, cause idiopathic short stature, Léri-Weill dyschondrosteosis and Langer mesomelic dysplasia. The substitution R173C prevents the transport of the SHOX-encoded protein to the nucleus and its subsequent function as a transcription activator [25]. Both the SP and WoLF PSORT correctly predicted the mislocalization and the effect of the mutation.
AIRE, autoimmune regulator, is a nuclear protein and transcriptional regulator. Wild type AIRE appears in nuclear dots, evenly distributed in the nucleus, and in the cytoplasm. Several mutations have been shown to affect the distribution of AIRE between compartments [26-28]. Mutations R14L, T16M, A21V and Y85C were correctly predicted to affect protein localization by the SP, and L28P and L29P by WoLF PSORT. However, the predicted changes were not accurate, because the SP predicted a change from cytoplasmic and mitochondrial matrix to cytoplasm and WoLF PSORT a change from secreted to mitochondrial.
Similar results were obtained for BSND mutations. Barttin, encoded by BSND, is involved in Bartter syndrome, a renal tubular salt-wasting disease. Barttin localizes to the plasma membrane, whereas mutant forms are retained in the ER [29]. R8L was predicted by SP to change the localization from Golgi transmembrane to plasma membrane. A milder form, G10S, which appears in both the ER and the plasma membrane, was not predicted to affect localization. We also consider this kind of prediction useful because a localization change is forecast due to the mutation. Thus, the predictions can give a hint of the possible mechanism, even though the final validation must be obtained experimentally.
We made predictions for cases that, according to the literature, affect localization: ATP7B mutations in Wilson disease [30], ABCA1 mutations in Scott syndrome [31], RPS19 mutations in Diamond-Blackfan anemia [32], ABCA1 mutations in Tangier disease [33], and lamin A/C mutations in heritable dilated cardiomyopathy [34]. However, the predictions agreed with the experimental data only for the FXYD2 mutation in hereditary primary

Conclusion
The applicability of protein localization prediction methods was tested in detecting changes in localization due to point mutations. Altogether 374 mutations were predicted by at least one method to affect protein localization. Because disease mutations are unequally distributed throughout protein sequences, having a higher occurrence in structurally/functionally important sites, we can expect the number of localization mutations to be higher than calculated. The expected number is 403 mutations. Localization mutations are rare events, but they should be taken into account when predicting consequences of mutations. A service for SP predictions will be released in the near future as part of the Pathogenic-Or-Not-Pipeline (PON-P, http://bioinf.uta.fi/PON-P).
[12]. The dataset was filtered to include only genes for which a cDNA sequence was available. The experimental localization(s) of each identified protein was collected from the Human Protein Reference Database (HPRD, http://www.hprd.org/) [3] (4.5.2007). We excluded proteins for which the experimental localization was unknown. After these filtering steps, 1,516 proteins remained, which contained altogether 22,416 missense mutations (on average 14.8 per protein). Altogether, we identified 2,373 localizations, indicating that the average number of localizations per protein was ~1.6 for the 1,516 proteins we investigated. The proteins had 34 primary localizations (Table 1) and altogether 56 localizations (Additional file 2).

Mutation, localization and sequence data
We identified both the first (most common) and all localizations for each protein. The wild type protein sequences were translated from cDNA sequences obtained from HGMD. The disease-related mutations were introduced into the protein sequences one by one and analyzed individually. Programs and scripts for the analysis were written in Java or Perl.
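Introducing a single missense mutation into a wild type sequence can be sketched as below. The original programs were written in Java or Perl and are not shown in the text, so this Python function is an illustrative stand-in: its name, the mutation notation parser (for strings such as R173C) and the toy sequence are assumptions made here.

```python
import re


def apply_missense(sequence: str, mutation: str) -> str:
    """Return `sequence` with one missense mutation such as 'R173C' applied.

    The notation is wild type residue, 1-based position, mutant residue.
    Raises ValueError if the notation is malformed, the position is out of
    range, or the wild type residue does not match the sequence.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation notation: {mutation!r}")
    wild, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild:
        raise ValueError(
            f"expected {wild} at position {position}, "
            f"found {sequence[position - 1]}"
        )
    return sequence[: position - 1] + mutant + sequence[position:]


if __name__ == "__main__":
    toy_sequence = "MKRLA"  # toy example, not a real protein
    print(apply_missense(toy_sequence, "R3C"))
```

Checking the wild type residue before substitution guards against a mismatch between the mutation numbering and the translated cDNA sequence, which is the kind of ambiguity the filtering steps were meant to exclude.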

Localization prediction methods
First, predictions were made separately for certain localizations and then by two strategies for combined predictions. Groups in Stockholm, Sweden, and Lyngby, Denmark, whose long-term efforts have resulted in methods for numerous tasks in subcellular localization prediction, recently published a protocol to combine different predictions developed by them and others into a comprehensive prediction scheme [49]. This Scandinavian protocol (SP) is rather complicated and requires the use of numerous separate prediction tools. To facilitate the analysis, we developed a program that automatically runs all the predictions, parses the results, and provides the outcome of the prediction. The flow chart for the analysis steps and programs is in Fig 2. As a modification, nuclear localization signals were predicted only with the PredictNLS program. Because we analyzed human proteins, it was not necessary to investigate chloroplast localization or prokaryotic predictions. Some of the programs use database searches to identify homologues to strengthen the predictions. We had to omit this step because the wild type sequences in the databases are identical to the mutant sequences, apart from the single missense mutation. Because sequence conservation is indicative of protein colocalization [50,51], database searches would have selected the wild type sequence for prediction and thereby hampered the analysis of mutants. The step for β-barrel prediction was also omitted. [65] http://au.expasy.org/prosite/ were run over the Internet. Altogether, this procedure could predict 12 different localizations (Fig 1).
In the SP, TargetP first assigns whether a protein goes to the mitochondria, to the secretory pathway, or to neither. The mitochondrial proteins are classified further as transmembrane, periplasmic space or matrix proteins based on the analysis of transmembrane and signal peptide sequences. Transmembrane proteins are predicted via two routes and are then classified into those ending in Golgi transmembrane or plasma membrane. Proteins containing signal peptide(s) are classified to different compartments according to whether they contain transmembrane region(s) or a signal peptide, are myristoylated, have GPI anchors, or are predicted to localize to the endoplasmic reticulum.
The other method we applied, WoLF PSORT, is an integrated program that makes predictions for 10 subcellular compartments [66]. The WoLF PSORT program was downloaded from http://wolfpsort.cbrc.jp/ and run locally with default parameters.
We ran each prediction strategy for both wild type and mutated sequences and determined whether the mutation(s) changed the localization prediction. Both protocols may predict multiple localizations for a protein, for example nucleus and cytosol for a protein that is transported between nucleus and cytosol. Thus, all highest-score predictions provided by the programs were taken into account. In the SP, if TargetP predicted a protein to mitochondria with a poor reliability coefficient (RC; value 4 or 5), the protein was also analyzed with SignalP and received two alternative localizations (Fig 1).
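The wild type versus mutant comparison can be sketched as follows. The predictor is passed in as a function, because the actual tools (the SP pipeline and WoLF PSORT) are external programs whose interfaces are not described here. The toy predictor below is a purely hypothetical rule used only to make the example runnable; it is not a real nuclear localization signal model.

```python
from typing import Callable, FrozenSet, Tuple

# A predictor maps a protein sequence to the set of its top-scoring
# localizations; a set is used because multiple localizations may be
# predicted for one protein.
Predictor = Callable[[str], FrozenSet[str]]


def localization_changed(
    predict: Predictor, wild_type: str, mutant: str
) -> Tuple[bool, FrozenSet[str], FrozenSet[str]]:
    """Compare predicted localizations of wild type and mutant sequences."""
    before = predict(wild_type)
    after = predict(mutant)
    return before != after, before, after


def toy_predictor(sequence: str) -> FrozenSet[str]:
    """Hypothetical stand-in: 'N' if the motif KRK is present, else 'C'."""
    return frozenset({"N"}) if "KRK" in sequence else frozenset({"C"})


if __name__ == "__main__":
    changed, before, after = localization_changed(
        toy_predictor, "MAKRKLS", "MAKCKLS"
    )
    print(changed, sorted(before), sorted(after))
```

Comparing sets rather than single labels means that a change from a dual prediction to a single compartment (such as CN to C) is also registered as an alteration.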
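The quality of the predictions was measured by four parameters: accuracy, recall, precision, and the Matthews correlation coefficient (MCC), as follows:
$$\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$
$$\mathrm{recall} = \frac{tp}{tp + fn}$$
$$\mathrm{precision} = \frac{tp}{tp + fp}$$
$$\mathrm{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$
where tp is the number of positive cases that were correctly predicted, tn is the number of negative cases correctly predicted, fp is the number of cases incorrectly predicted as positive, and fn is the number of cases incorrectly predicted as negative.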
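The four quality parameters can be computed from the confusion counts as in the sketch below, which uses the standard definitions of these measures. The function name, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made for illustration; the counts are not taken from the study.

```python
import math
from typing import Dict


def prediction_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, recall, precision and MCC from confusion-matrix counts.

    Returns 0.0 for any measure whose denominator is zero (a convention
    chosen here, e.g. for a compartment with no predicted proteins).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts for one compartment with few true members.
    for name, value in prediction_metrics(tp=40, tn=900, fp=10, fn=50).items():
        print(f"{name}: {value:.3f}")
```

When a compartment has few true members, accuracy is dominated by the true negatives and can stay high while recall is poor, which is the pattern reported for the predictors in the Results.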