Plus ça change – evolutionary sequence divergence predicts protein subcellular localization signals

Fukasawa, Yoshinori; Leung, Ross KK; Tsui, Stephen KW; Horton, Paul

doi:10.1186/1471-2164-15-46

Research article
Open access
Published: 20 January 2014

BMC Genomics volume 15, Article number: 46 (2014)

Abstract

Background

Protein subcellular localization is a central problem in understanding cell biology and has been the focus of intense research. In order to predict localization from amino acid sequence, a myriad of features have been tried, including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for tasks such as the prediction of transcription factor binding sites.

Results

Here, we flip the problem around, and present a proof of concept for the idea that the lack of sequence conservation can be a novel feature for localization prediction. We show that for yeast, mammal and plant datasets, evolutionary sequence divergence alone has significant power to identify sequences with N-terminal sorting sequences. Moreover, sequence divergence is nearly as effective when computed on automatically defined ortholog sets as on hand-curated ones. Unfortunately, sequence divergence did not necessarily increase classification performance when combined with some traditional sequence features such as amino acid composition. However, a post-hoc analysis of the proteins in which sequence divergence changes the prediction yielded some proteins with atypical (i.e. not MPP-cleaved) matrix targeting signals as well as a few misannotations.

Conclusion

We report the results of the first quantitative study of the effectiveness of evolutionary sequence divergence as a feature for protein subcellular localization prediction. We show that divergence is indeed useful for prediction, but it is not trivial to improve overall accuracy simply by adding this feature to classical sequence features. Nevertheless we argue that sequence divergence is a promising feature and show anecdotal examples in which it succeeds where other features fail.

Background

Since proper subcellular localization is a prerequisite for protein function, there is a high demand for accurate and complete localization annotation of all proteins [1]. Although proteomics data has allowed large scale determination of protein localization for model organisms [2, 3], no experimental evidence is available for the vast majority of organisms. Although sequence similarity can be a good indicator of identical localization site [4], distant similarity is not [5], and thus for many proteins we must rely on computer prediction.

In cells, the localization of proteins is largely determined by “zip-code” like sorting signals, encoded in their amino acid sequence [6]. Unfortunately these sorting signals seem to be only very loosely determined, accepting very diverse sequences, subject to some constraints on their physico-chemical properties [7].

Among those signals, the best known is the signal peptide of secretory pathway proteins. A typical signal peptide spans 15–30 amino acids near the N-terminus. Signal peptides typically show three distinct blocks: the n-region containing positively charged residues, the h-region mainly consisting of hydrophobic residues, and the c-region which includes polar uncharged residues and a weakly conserved cleavage motif [8].

Similarly, the targeting signals of mitochondria and chloroplasts are also N-terminally coded [7], and cleaved after import to their final location. In the mitochondrial matrix, the N-terminal signal is usually cleaved off by the Mitochondrial Processing Peptidase (MPP) [9, 10], while the corresponding chloroplast targeting N-terminal signals are processed by an analogous protease in the chloroplast stroma [10]. Like signal peptides, these signals are often poorly conserved and difficult to align properly between orthologs [11]. Although a consensus motif has been reported for mitochondrial targeting signals [12, 13], it is information poor and produces too many false positives to be used for reliable prediction.

To date, an impressive number of methods have been developed for protein sorting prediction. For example, in 2004 a survey already listed dozens of methods employing fifteen broad categories of features [14]; from commonly used ones such as amino acid composition [15–19] (and many more) to rare categories such as sequence periodicity [20] and mRNA expression level [21]. Sequence similarity as defined by programs such as BLASTP has been explored as a feature for signal peptide detection [22]. Among these features, amino acid composition is attractive due to its simplicity. The significant correlation between amino acid composition and subcellular location is partially causative and partially due to indirect effects such as adaptation of surface residues to the pH of the protein’s localization site [23].

The one feature conspicuously missing from this list has been evolutionary sequence conservation, despite the fact that it has seen extensive use in sequence analysis from the prediction of transcription factor binding sites [24], to short linear motifs in proteins [25] and functional RNA [26]. Although profile feature methods which indirectly reflect evolutionary conservation have been employed [27], sequence conservation per se has not – presumably because sorting signals are indeed not well conserved at the sequence level. Here, we propose that instead of looking for sequence conservation of sorting signals, a more effective approach is to exploit their high evolutionary sequence divergence.

In this paper we first describe our datasets of yeast, animal and plant proteins with their orthologs, divergence and other features we used for classification, and the classifiers we employed. Then, we present a simple statistical feature analysis followed by performance evaluation of localization prediction for various combinations of features, classifiers and datasets. Unfortunately, combining other features with our sequence divergence did not lead to a systematic improvement in overall performance. However we show that consideration of sequence divergence is critical for correct prediction in certain cases and can sometimes flag non-cleaved or misannotated targeting signals. Finally we discuss future directions and conclude.

Methods

Sorting signal classes

We mainly focused on the two most common N-terminal sorting signals: Signal Peptides (SP), targeting proteins to the endoplasmic reticulum, and Matrix Targeting Signals (MTS), which target proteins to the matrix (inner compartment) of the mitochondria. In the plant dataset, we also consider Chloroplast Transit Peptides (CTP). All of these signals reside near the N-terminus but in general have different properties and are effectively discriminated by the cell. In some cases however, the N-terminal “signal” can be ambiguous. In particular many examples are known in which the same amino acid sequence directs some copies of a protein to the mitochondria and others to the chloroplast [28, 29]. Nevertheless these examples still constitute only a small percentage of proteins and therefore we simplify the analysis by treating N-terminal sorting signal identification as a simple three- or four-way classification problem: {MTS, SP, (CTP), no signal}. Other types of N-terminal sorting signals exist, for example the PTS2 signal targeting proteins to the peroxisome [30], but the number of proteins using such signals is much smaller than those using the SP, MTS or CTP signals.

The sorting signal class labels we use in our datasets are partially based on direct experimental evidence. In the dataset of S. cerevisiae, we used UniProtKB/Swiss-Prot [31] to assign localization class labels, augmented by MTS-containing proteins determined in the proteomics experiment of Vögtle et al. [32]. Because only a small number of SPs have been directly confirmed experimentally, we also included proteins whose SP is inferred in the database and predicted positive by SignalP [33]. We used proteins annotated to localize to the cytosol or nucleus as proteins without N-terminal signals. To reduce bias in training and accuracy estimation, we used BLASTClust 2.2.22 [34] to remove redundant sequences with a setting of 20% identity. For proteins in human and a few plant species we adopted the dataset of Predotar [35], and for plants we augmented that small set with experimental proteomics data from the mass spectrometry experiment of Huang et al. [11].

Dataset

Organisms used

We gathered protein sequences from 11 relatively diverse and well annotated representative species for each of the three phylogenetic divisions: yeast, mammal and plant (Table 1). The 11 mammal species and most of the plant species are annotated reference proteomes in UniProt, but a few of the plant species are only included in UniProt as complete, but not fully annotated, proteomes. Note that our “plant” dataset contains the unicellular green alga Chlamydomonas reinhardtii, which is not a typical plant but is classified in the kingdom Viridiplantae.

Table 1 List of species used to define orthologs in each phylogenetic category

In each of the three divisions we designated one species as the “reference” species. We used information in proteins from the non-reference species only for computation of sequence divergence (via ortholog multiple sequence alignments). We chose S. cerevisiae, H. sapiens, and A. thaliana as the reference species for yeast, animals and plants respectively, because they have the most complete annotation. However, for plants even A. thaliana has rather limited annotation of SPs, so in order to increase the plant dataset size we used other species as the reference species in some cases.

Ortholog determination

We performed some experiments on hand curated ortholog sets downloaded from the Yeast Gene Order Browser (YGOB) [36], but also computed ortholog sets for each of the three phylogenetic divisions.

Automatic identification of orthologs is a complex subject for which many sophisticated methods have been developed, the most suitable one being application dependent [37]. For this study, we adopted a simple procedure based on reciprocal best hits (RBHs) [38]. Formally, proteins P and P′ from species S and S′ respectively, are RBHs if P is more similar to P′ than any other protein in S′ and P′ is more similar to P than any other protein in S. We define the ortholog set of a reference species protein as all of its RBHs. When computing RBHs it is important that proteins from as many organisms as possible are included; but in the end we only have use for those ortholog sets in which the reference species is annotated, so in general we discarded the rest. However, in the case of plants, we attempted to rescue those discarded sequences by also trying O. sativa, G. max and C. reinhardtii in turn as the reference species.
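The RBH definition above can be sketched in a few lines of Python. This is only an illustration of the definition, not the pipeline used in the study: the function names, the input format (one similarity score per unordered protein pair, treated as symmetric) and the tie-breaking rule (the first best hit seen wins) are all assumptions made for the example.

```python
def best_hits(scores, species_of):
    """For every protein, find its best-scoring partner in each other species.

    scores: dict mapping (protein_a, protein_b) -> similarity score
            (assumed symmetric; one entry per unordered pair).
    species_of: dict mapping protein -> species name.
    """
    best = {}  # (protein, other_species) -> (score, partner)
    for (a, b), s in scores.items():
        for p, q in ((a, b), (b, a)):
            if species_of[p] == species_of[q]:
                continue  # only compare proteins across species
            key = (p, species_of[q])
            if key not in best or s > best[key][0]:
                best[key] = (s, q)
    return best


def ortholog_set(ref_protein, scores, species_of):
    """Return all reciprocal best hits of a reference-species protein."""
    best = best_hits(scores, species_of)
    ref_species = species_of[ref_protein]
    orthologs = []
    for (p, _species), (_score, q) in best.items():
        if p != ref_protein:
            continue
        # q is the best hit of ref_protein; keep it only if the reverse holds.
        back = best.get((q, ref_species))
        if back is not None and back[1] == ref_protein:
            orthologs.append(q)
    return sorted(orthologs)


# Hypothetical toy data: two proteins in species S, two in K, one in C.
species_of = {"y1": "S", "y2": "S", "k1": "K", "k2": "K", "c1": "C"}
scores = {
    ("y1", "k1"): 0.90, ("y1", "k2"): 0.50,
    ("y2", "k1"): 0.95, ("y2", "k2"): 0.40,
    ("y1", "c1"): 0.70, ("y2", "c1"): 0.60,
}
print(ortholog_set("y1", scores, species_of))
print(ortholog_set("y2", scores, species_of))
```

In the toy data, k1 is the best hit of y1 in species K, but the best hit of k1 in species S is y2, so the pair (y1, k1) is not reciprocal and is excluded from the ortholog set of y1.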

In computing the similarity scores for RBH we chose to use global alignment rather than local alignment. Our motivation for this was: 1) sorting signals often appear on the N- or C-terminal region of proteins, so differences in those regions may indicate a different localization of the “ortholog”, and 2) for multiple domain proteins, strong similarity in one domain may not imply the same localization site (or signal). We used the heuristic but fast USEARCH [39] program with its default parameters to compute the global similarity scores. Table 2 summarizes the datasets.

Table 2 The number of ortholog sets by localization class in each phylogenetic division

Multiple alignment

We computed multiple alignments for each of the 4 ortholog sets (1 curated and 3 automatic) by aligning with the MAFFT program [40], using “LINSI”, its most accurate mode. Hereafter, we denote these alignments as “orthoMSA” in general, and as “autoOrthoMSA” when specifically referring to multiple alignments of automatically generated ortholog sets. The number of sequences in the automatically generated ortholog sets generally differs from that in the YGOB-based sets. However, the distribution of the divergence score appears to stabilize when the number of sequences exceeds three (Figure 1), so we decided to include ortholog sets with at least four sequences.

Features for classification

Column entropy score

Several measures have been suggested for scoring evolutionary sequence conservation (or conversely divergence) [41, 42]. Here we adopt a simple Shannon entropy based score. The Shannon entropy H(i) of the i-th column of an orthoMSA is defined as:

$$H(i) = -\sum_{j \in A} F(i, j) \log_{2} F(i, j) \tag{1}$$

where A denotes the set of 20 amino acid characters plus gap characters, and F(i,j) denotes the frequency of character j in column i of an orthoMSA. Note that when multiple gap characters are present in a column, we consider each to be a unique character. For example, the entropy of an orthoMSA column ‘{L, L, I, -, -}’ is computed as one character (the ‘L’) with frequency 0.4 and three characters with frequency 0.2, because we treat the two ‘-’ characters as distinct. We adopted this treatment of gap characters so that the divergence of orthoMSA columns with many gaps is considered high (we also tried using straight entropy, but the results, not shown, were slightly worse). This divergence score ranges from 0 to $\log_{2} n$, where n is the number of sequences.
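As a minimal illustration of Equation 1 with the gap treatment described above, the following Python sketch computes the entropy of a single alignment column. The function name `column_entropy` and the choice of ‘-’ as the gap character are assumptions made for the example and are not taken from the original software.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of one alignment column.

    Each gap character is counted as its own unique symbol, so a column
    with many gaps receives a high divergence score.
    """
    n = len(column)
    residue_counts = Counter(c for c in column if c != gap)
    freqs = [count / n for count in residue_counts.values()]
    # Every gap contributes a separate symbol with frequency 1/n.
    n_gaps = sum(1 for c in column if c == gap)
    freqs += [1 / n] * n_gaps
    return -sum(f * math.log2(f) for f in freqs)


# The worked example from the text: frequencies 0.4, 0.2, 0.2, 0.2.
print(round(column_entropy("LLI--"), 4))  # 1.9219
```

For the example column the score is -(0.4 log2 0.4 + 3 × 0.2 log2 0.2) ≈ 1.92 bits, whereas pooling the two gaps into a single symbol would give a lower value of about 1.52 bits. A column consisting only of gaps reaches the maximum, log2 n.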

Divergence based features

For many orthoMSAs, the entropy varies widely from column to column. Therefore, we defined a number of evolutionary divergence features based on a smoothed entropy score, ${\bar{H}}_{i, j}$ , defined as the average entropy score for columns in the interval [i, j]. For example we define the local divergence (LD) of an orthoMSA at position k as ${\bar{H}}_{k - 10, k + 10}$ . Another feature we defined is NCdiff, the average difference in divergence between the first 20 residues and residues 80 to 99. Our motivation for this definition was the hope that subtracting the divergence from residues 80 to 99 would approximately normalize the feature when comparing proteins with different overall rates of evolution. These features are summarized in Table 3.
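The smoothed features can be sketched as follows, given a list of per-column entropy scores. Several details here are assumptions made for illustration, since the text does not specify them: columns are indexed from 0, windows that extend past either end of the alignment are clipped, and NCdiff is taken as the N-terminal mean minus the downstream mean.

```python
def mean_entropy(H, i, j):
    """Average of H over the closed interval [i, j], clipped to the alignment."""
    lo, hi = max(i, 0), min(j, len(H) - 1)
    window = H[lo:hi + 1]
    return sum(window) / len(window)


def local_divergence(H, k, half_width=10):
    """LD(k): mean entropy over columns k-10 .. k+10."""
    return mean_entropy(H, k - half_width, k + half_width)


def nc_diff(H, n_len=20, c_start=80, c_end=99):
    """NCdiff: mean entropy of the first 20 columns minus that of columns 80-99."""
    return mean_entropy(H, 0, n_len - 1) - mean_entropy(H, c_start, c_end)


# Hypothetical profile: a divergent N-terminus followed by a conserved body.
H = [2.0] * 20 + [0.5] * 100
print(nc_diff(H))               # 1.5
print(local_divergence(H, 0))   # 2.0
```

A protein with a divergent N-terminal region and a conserved body gives a large positive NCdiff, while a uniformly fast-evolving protein gives a value near zero, which is the normalizing effect described above.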

Table 3 List of entropy derived features

Physico-chemical propensities

To explore the possibility of combining sequence divergence with standard features used in protein localization prediction, we defined three features computed from the first 20 or 40 N-terminal residues of each S. cerevisiae protein: 1) the number of positively charged residues (#pos), 2) the number of negatively charged residues (#neg), and 3) the average hydrophobicity as measured by the Kyte-Doolittle [43] index (Hphob).
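A sketch of these three features is given below. The hydropathy values are the standard Kyte-Doolittle scale. Treating K and R (but not H) as positively charged, treating D and E as negatively charged, and skipping non-standard residues when averaging are assumptions made for this example; the text does not state how these cases were handled.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")  # assumption: histidine is not counted as positive
NEGATIVE = set("DE")


def nterm_features(seq, length=20):
    """Return (#pos, #neg, Hphob) for the first `length` residues of seq."""
    window = seq[:length].upper()
    n_pos = sum(1 for c in window if c in POSITIVE)
    n_neg = sum(1 for c in window if c in NEGATIVE)
    values = [KYTE_DOOLITTLE[c] for c in window if c in KYTE_DOOLITTLE]
    hphob = sum(values) / len(values) if values else 0.0
    return n_pos, n_neg, hphob


# Hypothetical six-residue example.
print(nterm_features("MLRKDE", length=6))
```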

Amino acid composition

Amino acid composition is another standard feature for protein localization. We tested this feature computed on the first 20 residues, the first 40 residues, and the entire protein sequence.
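For concreteness, amino acid composition over a prefix or the whole sequence can be computed as in the sketch below. The function name, the fixed alphabetical residue order, and the decision to ignore non-standard characters in the denominator are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq, length=None):
    """Fraction of each standard amino acid in the first `length` residues.

    With length=None the entire sequence is used. Returns a list of 20
    fractions in the order given by AMINO_ACIDS.
    """
    window = seq if length is None else seq[:length]
    total = sum(1 for c in window if c in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [window.count(a) / total for a in AMINO_ACIDS]


# Hypothetical example: composition of the first 4 residues only.
vec = composition("AAACWWWW", length=4)
print(vec[AMINO_ACIDS.index("A")], vec[AMINO_ACIDS.index("C")])  # 0.75 0.25
```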

Classifiers

Majority class classifier

The majority class classifier unconditionally predicts all examples to belong to the most common class. Its accuracy is equal to the fraction of examples belonging to the most common class.

J48

J48 is a version of the C4.5 decision tree induction algorithm of Quinlan [44, 45], implemented in the Weka software package [46]. We used the default value of 0.25 for the confidence factor, which controls the complexity of the induced tree.

Support vector machine

The Support Vector Machine (SVM) [47] is perhaps the most popular classifier in current bioinformatics work. In its basic form it is a linear, binary classifier, but it has been extended to non-linear, multiclass classification. In this project, we used the LIBSVM implementation [48]. We used the Gaussian radial basis function (RBF) kernel with the default γ value (1.0 / number of features). We used 50.0 for the SVM cost parameter C, because with the default cost parameter (1.0) prediction by the RBF kernel failed for some features. In our study we conducted binary and 3-class classification. For multiclass discrimination LIBSVM adopts the “one-versus-one” method, in which a separate SVM is learned for each pair of classes, and majority voting among those SVMs is used when classifying examples [49].
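The one-versus-one voting step can be illustrated independently of any SVM library. In the sketch below, the outcome of each pairwise classifier is supplied directly as a dictionary, and the tie-breaking rule (the earliest class in the supplied order wins) is an assumption made for the example rather than a statement about LIBSVM internals.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(pairwise_winner, classes):
    """Combine pairwise decisions by majority vote.

    pairwise_winner: dict mapping each class pair (a, b), in the order
        produced by combinations(classes, 2), to the class that the
        corresponding binary classifier chose.
    classes: ordered list of class labels.
    """
    votes = Counter(pairwise_winner[pair] for pair in combinations(classes, 2))
    top = max(votes.values())
    # Assumed tie-break: the first class in `classes` with the top vote count.
    return next(c for c in classes if votes[c] == top)


classes = ["MTS", "SP", "none"]
decisions = {("MTS", "SP"): "MTS", ("MTS", "none"): "none", ("SP", "none"): "none"}
print(one_vs_one_predict(decisions, classes))  # none
```

With three classes there are three pairwise SVMs, which is why six SVM scores arise when predictions with and without divergence features are compared.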

Measuring the influence of divergence features

As reported in the Results section, we performed a post-hoc analysis of proteins for which the divergence features greatly influenced the prediction outcome. To do this we needed to combine 6 numbers (three SVM scores {MTS vs SP, MTS vs none, SP vs none}, each computed with and without the divergence features) into a single measure of how much the divergence features influenced the prediction. Because the SVM scores are not given directly as probabilities and each individual SVM addresses a different subset of classes, it is not trivial to derive a well-principled way to do this. As described in more detail in Additional file 1, we chose to define this in terms of exponential loss-based decoding [50]. We do not claim that this is necessarily the best measure, but it appears to give reasonable results. Fortunately, for our purposes it is enough that truly large differences are assigned in a roughly suitable order.

Quantifying feature importance

We used the so called “information gain” to quantify the importance of each feature. Information gain is a simple measure of the predictive power of a feature in isolation (i.e. without consideration of its relationship to other features), defined as:

$$I(C, F) = H(C) - H(C \mid F) \tag{2}$$

where C and F denote class and feature respectively. H(C) denotes the information theoretic entropy of the overall distribution of the class labels, while H(C|F) denotes the conditional entropy of the class label when feature F is given. A larger information gain indicates greater predictive power. Because the divergence based features have a large number of possible values, we first binned those values into a smaller number by the method of Fayyad & Irani [51].
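Equation 2 can be illustrated with a short Python sketch for a feature that has already been discretized. The Fayyad & Irani binning step is not implemented here; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(classes, feature_bins):
    """I(C, F) = H(C) - H(C | F) for an already-binned feature."""
    n = len(classes)
    groups = defaultdict(list)
    for c, f in zip(classes, feature_bins):
        groups[f].append(c)
    # H(C | F) is the entropy within each bin, weighted by bin size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(classes) - conditional


classes = ["MTS", "MTS", "none", "none"]
print(information_gain(classes, ["high", "high", "low", "low"]))  # 1.0
print(information_gain(classes, ["a", "b", "a", "b"]))            # 0.0
```

The first feature separates the two classes perfectly and recovers the full one bit of class entropy, while the second is independent of the class and yields zero gain.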

Classification performance evaluation

Accuracy is not always the most meaningful measure of performance for skewed datasets (i.e. datasets with a very uneven number of examples from different classes) [52]. Therefore we report several measures in addition to accuracy.

Matthews correlation coefficient

The Matthews correlation coefficient, MCC [53, 54], is a measure of performance for binary classification defined as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \tag{3}$$

where “T” and “F” stand for “true” and “false”, while “N” and “P” stand for “negative” and “positive”. Equivalently, MCC can be defined as the Pearson’s correlation coefficient of the binary vector of class labels compared to the binary vector of predicted class labels. MCC ranges from 1.0 for perfect prediction to -1.0 for perfect inverse prediction. Note that the MCC of the majority class classifier is identically zero, as is the expected value of MCC under random prediction.
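A direct transcription of Equation 3 is sketched below. Returning zero when the denominator vanishes is a convention adopted here, so that the majority class classifier, which never predicts one of the two classes, receives an MCC of zero as stated above.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: undefined cases (e.g. no positive predictions) map to 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(mcc(10, 10, 0, 0))   # 1.0  (perfect prediction)
print(mcc(0, 0, 10, 10))   # -1.0 (perfect inverse prediction)
print(mcc(0, 66, 0, 34))   # 0.0  (always predicting the majority class)
```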

Area under the ROC curve

The area under the curve (AUC) of a receiver operating characteristic (ROC) graph is a widely used metric to evaluate binary classification accuracy [55]. The usual way to generate an ROC plot is to rank instances by their predicted scores and sweep a decision threshold across them, plotting true positive rate (y-axis) versus false positive rate (x-axis). AUC ranges from 0 to 1.0, with perfect prediction yielding 1.0 and perfectly wrong prediction 0.0. AUC can be interpreted as the probability that a classifier assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example [56]. For this task, the majority class classifier gives no information over coin flipping and therefore can be considered to yield an AUC of 0.5.
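The probabilistic interpretation of AUC leads to a simple, if quadratic-time, way to compute it: count the fraction of positive-negative pairs in which the positive example receives the higher score. The sketch below follows that definition, counting ties as one half; the function name and the toy scores are illustrative only.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


print(auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
print(auc([1.0, 1.0], [1.0, 1.0]))  # 0.5
```

The second call mimics a classifier that gives every example the same score, as the majority class classifier does, and returns 0.5.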

Results

Feature analysis

N-terminal sorting signals are evolutionarily divergent

It is well known that N-terminal sorting signals exhibit relatively low sequence conservation [57]. As shown in Figure 2, this phenomenon is particularly clear for the mitochondrial heat shock protein, mtHSP70, in which the main part of the protein is highly conserved but the N-terminal region is highly divergent. Figure 3 quantifies this trend for the proteins in the YGOB ortholog set.

Estimate of importance of each feature

As a rough estimate of feature importance, we computed the information gain for each feature (Figure 4). The two highest scoring features are the physico-chemical features #neg and Hphob, but the LD features near the N-terminus also show information gain significantly greater than zero.

Sequence divergence is not redundant to physico-chemical trends or amino acid composition

To be promising as a feature for prediction, it is desirable that evolutionary sequence diversity not be perfectly correlated with other features. To investigate this we plotted LD(13), the divergence feature with the highest information gain, against Hphob, #neg and the arginine composition (the three highest scoring standard features in the 40 residue N-terminal region) (Figure 5). Although there may be some relationship, the feature pairs do not appear highly correlated.

Divergence predicts the presence of N-terminal signals

We tested whether sequence divergence can be used to distinguish between proteins with an N-terminal localization signal (MTS or SP) and those with none. As shown in Table 4, for this binary classification task, sequence divergence alone allows for significantly higher prediction accuracy than randomized control experiments or the majority class fraction (66.0%) in the yeast dataset.

Table 4 Performance of N-signal vs N-signal-free protein binary classification

Divergence distinguishes SP vs. MTS vs. N-signal-free

Although the sequence divergence profiles of SPs and MTSs appear similar when averaged (Figure 3), we found that sequence divergence is still somewhat effective for the three-way classification of SP vs MTS vs N-signal-free. As shown in Table 5, the performance with divergence features is slightly better than the majority class fraction (66.0%), and divergence features also slightly improve the performance when added to the physico-chemical features in the N-terminal 40 residues or to amino acid composition in either the N-terminal 40 residues or the full length sequence (Additional file 1).

Table 5 Performance of 3-way classification using SVM classifier

The ratio of examples in our dataset is 8.5:3.4:1, for N-signal-free, MTS and SP containing proteins respectively. Skewed datasets are known to complicate both learning and performance evaluation [52]. Therefore we also measured performance on a dataset with uniform class occupancy, created by randomly discarding all but 53 proteins from each class. As shown in Table 6, in this experiment the performance with divergence features only (63%) is much higher than the majority class fraction (33%), and the divergence features also contribute more to the performance when combined with the standard features.
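Balancing by random downsampling, as described above, can be sketched as follows. The function name, the fixed random seed and the toy class sizes (chosen to mirror the 8.5:3.4:1 ratio) are assumptions made for the example; the study kept 53 proteins per class.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(examples, labels, per_class, seed=0):
    """Randomly keep at most `per_class` examples from every class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    balanced = []
    for y in sorted(by_class):
        keep = rng.sample(by_class[y], min(per_class, len(by_class[y])))
        balanced.extend((x, y) for x in keep)
    return balanced


# Hypothetical skewed dataset with class sizes 85 : 34 : 10.
labels = ["none"] * 85 + ["MTS"] * 34 + ["SP"] * 10
examples = list(range(len(labels)))
balanced = balance_by_downsampling(examples, labels, per_class=10)
print(Counter(y for _, y in balanced))
```

On a three-class dataset balanced in this way, the majority class fraction is one third, which is the 33% baseline quoted above.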

Table 6 Performance on balanced dataset for MTS vs SP vs N-signal-free protein prediction using SVM classifier

We further tested the prediction power of divergence features when combined with classical features computed on a 20 residue N-terminal instead of 40 (which might be too long for the SP class). In this experiment, divergence features improved the performance only slightly when combined with other standard features (Table 7). We also computed the confusion matrix for this dataset (Table 8) and the other datasets investigated in the study (Additional file 1: Tables S14–S25).

Table 7 Performance of 3-way classification using SVM classifier (feature length 20)

Table 8 Confusion Matrix from 3-way classification using SVM classifier (feature length 20)

Divergence computed from automatically generated ortholog sets is consistent with the hand curated dataset

Although the YGOB based dataset convincingly demonstrates that the divergence score has discriminative power for N-terminal signal prediction, it covers only 11 yeast species and requires hand curation. Thus as described in the Methods section, in this work we adopted a simple procedure based on reciprocal best hit relationships to obtain automatically generated ortholog sets as well (Table 2).

In yeast, the average divergence score at each position is similar to the score from the YGOB ortholog set, and the overall tendency looks similar for animals and plants (Figure 3). Interestingly, CTPs show a longer region of elevated divergence, consistent with previous observations that CTPs tend to be longer than MTSs [11]. Additionally, we note that the score range of the human autoOrthoMSAs is significantly different from those of yeast or plants. This is expected because divergence amongst yeast sequences is at least as large as that of the chordates [58], so divergence in mammals should be smaller.

Divergence computed from autoOrthoMSA also predicts N-terminal signals

First, we tested whether divergence features can be applied to a simple binary classification: discrimination between N-terminal signal containing proteins and N-signal-free proteins. Although the ratio of positive to negative examples differs between datasets, the accuracy of prediction by divergence features alone is higher than that of the majority class classifier for all datasets (Table 9).

Table 9 Performance of N-signal vs N-signal-free protein binary classification on automatically collected orthologs

Next, we tested the predictive power of divergence in three-way classification on a dataset balanced to have equal class frequency (Table 10). It is evident that on balanced datasets, divergence also shows significant predictive power in distinguishing between the two different kinds of N-terminal signals, even for the relatively closely related mammal species.

Table 10 Performance for 3-way classification using SVM classifier on automatically collected orthologs

In plants, the divergence score can also discriminate between the three possible kinds of N-terminal signals better than random. However, there are only 15 experimentally validated SPs in this phylogenetic category (Table 2). Since this small sample size leads to a high statistical variance, we also computed the performance on balanced 3-way classification of MTS vs CTP vs N-signal-free (Table 11).

Table 11 Performance on balanced plant dataset using SVM classifier on automatically collected orthologs

In the Additional file 1 we list cross-validated performance estimates on various combinations of datasets and features. From these we draw two conclusions: in most cases divergence features slightly improve prediction when combined with standard features and in general computing standard features on the N-terminal 20 residues leads to higher accuracy than computing on 40 residues.

Post-hoc analysis of proteins for which divergence strongly influences the prediction result

In this section we discuss proteins for which the use of divergence features strongly affects the results. The ortholog MSA’s of all proteins mentioned in this section are available in the Additional file 2.

Divergence features may help flag misannotation

Prior to this work, evolutionary divergence had not been applied systematically to N-terminal signal prediction. However we expected that it might be able to capture interesting examples not revealed by other features. To investigate this, we ranked instances whose SVM prediction changes drastically depending on whether or not divergence features are used. Because of its rich annotation, we focused on S. cerevisiae, using the automatically defined ortholog set. The prediction result of 43 proteins changed depending on whether divergence features were added to conventional features. For these 43 proteins, we used the SVM numerical scores to rank the size of the effect as explained in Additional file 1 (ranked list in Additional file 3: Table S1). The ortholog set multiple sequence alignments for these proteins are also available in Additional file 2 in HTML form. In general, prediction differences are observed between the MTS and N-signal-free classes. The most highly affected protein is mitochondrial alanine tRNA ligase, ALA1 (P40825), which is predicted to have an MTS when sequence divergence features are used. Upon closer inspection we discovered that the sequence we used for this protein should in fact have been labeled as an MTS-containing protein, but our dataset, based on an earlier version of UniProtKB/Swiss-Prot, contained a mistaken annotation which holds for an alternative translation start site. Thus in this case sequence divergence yields the correct answer.

PTP1 (P25044) is another protein whose prediction changes from N-signal-free to MTS when divergence is considered. Following UniProtKB/Swiss-Prot, we treated it as a cytoplasmic protein, but there is no reference given for this annotation. Moreover PTP1 is identified as a mitochondrial protein by two large-scale experiments. This is suggestive that it may have a mitochondrial localization, although even in that case it would not necessarily have an MTS. Hopefully future work will clarify if this is another case in which divergence features flagged misannotations in our dataset.

Divergence features may help detect mitochondrial proteins with non-classical MTS signals

FMP52 (P40008) is a protein included in our dataset for which the SVM with standard features predicts an MTS but the SVM with divergence features predicts N-signal-free. As shown in Figure 6, FMP52’s N-terminal region is not divergent like typical MTSs, especially very near the N-terminus. FMP52 is indeed a mitochondrial protein, but upon closer scrutiny we discovered a previous report that it strongly associates with the outer membrane [59], and it is therefore unlikely to have a matrix targeting MTS. Moreover, FMP52 is one of the non-MTS containing proteins in the yeast proteomic analysis [32]. Swiss-Prot does annotate FMP52 with an MTS (1-44), but we could not find a reference or supporting information for this MTS annotation; therefore, we conclude that it is unlikely to have an MTS. CYM1 (P32898) is another interesting example, which has been reported to localize in the intermembrane space and not to be processed by mitochondrial proteases [60]. Since an MTS is a cleavable targeting signal for the matrix, the intermembrane space localization and lack of proteolytic cleavage of CYM1 suggest that its N-terminal signal is not a classical MTS.

MrpL19 (P53875) is another case in which sequence divergence features highlight a mitochondrial ribosomal protein which does not appear to have a classical MTS signal. According to both UniProtKB/Swiss-Prot annotation and a large-scale proteomics experiment [32] MrpL19 has an MTS, but the annotated “MTS” is unusually long and lacks an arginine in position -2, which is normally observed in MPP cleavage sites [9]. Moreover the N-terminal sequence of MrpL19 is very well conserved not only in yeasts but even in bacteria. Indeed the three dimensional structure of rplK, a homolog of MrpL19 in E. coli, has been solved and it is evident that the two proteins have a similarly structured N-terminus. Taken together the evidence suggests that MrpL19 may not have an N-terminal mitochondrial localization signal, but rather be imported via an alternative pathway.

On the other hand, we also observed mitochondrial ribosomal proteins whose N-terminus is poorly conserved. One example is MrpL32 (P25348), which cannot be predicted as having an MTS by standard tools such as TargetP [61] or Predotar [35], nor by our SVMs trained without divergence features. MrpL32 shows a high divergence in its N-terminal region (Figure 7) and is predicted to have an MTS by our SVM when using divergence features. A literature search revealed that MrpL32 does indeed have an MTS, but it is unusual in the sense that it is cleaved by the protease m-AAA [62, 63] instead of MPP. Mrp7 (P12687) is a similar case. Like MrpL32, Mrp7 is also a component of the large ribosomal subunit and is not predicted to have an MTS by TargetP, Predotar, nor by our SVM without divergence features, but is predicted to have an MTS when divergence features are used. In UniProtKB/Swiss-Prot, Mrp7 is annotated as having an MTS, and indeed the processing of Mrp7 by MPP has been reported multiple times [32, 64]. So in this case high sequence divergence allows an MTS to be correctly predicted.

Another case worth discussing is IMO32 (P53219), which has recently been reported to be processed by the intermediate protease Oct1 (after MPP) in the matrix [65]. It is unusual in that its inferred MPP cleavage site represents a rare exception to the almost invariant presence of arginine at the -2 position. IMO32 is predicted to have an MTS by Predotar [35] and by our SVM when we use divergence features, but not by our SVM without divergence features, nor by TargetP [61].
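The position -2 convention used for MrpL19 and IMO32 can be made concrete with a short Python sketch. It assumes that position -1 denotes the last residue of the presequence (the residue immediately before the cleavage site), so position -2 is the residue before that. The function name and the toy sequence are hypothetical and are not taken from any real protein or from our pipeline.

```python
def has_arginine_at_minus2(sequence, presequence_length):
    """Return True if the residue at position -2 relative to the cleavage site is arginine.

    Assumed convention: the presequence is sequence[:presequence_length], so
    position -1 is sequence[presequence_length - 1] and position -2 is
    sequence[presequence_length - 2].
    """
    if presequence_length < 2 or presequence_length > len(sequence):
        raise ValueError("presequence_length must be between 2 and len(sequence)")
    return sequence[presequence_length - 2] == "R"


# Made-up toy sequence, for illustration only.
toy_sequence = "MLSKLARSFTQE"

# Cleavage after residue 8: position -2 is 'R'.
typical_site = has_arginine_at_minus2(toy_sequence, 8)
# Cleavage after residue 5: position -2 is 'K'.
atypical_site = has_arginine_at_minus2(toy_sequence, 5)
```

A check of this kind only tests one feature of a putative MPP site; it is not a cleavage site predictor.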

Discussion

Although strong sequence similarity is a widely used indicator of co-localization, characteristically low sequence conservation in signal sequence regions has not been utilized for prediction. Other authors have noted the low sequence conservation of N-terminal sorting signals such as MTS sequences [66], but our work reported here is the first investigation of the utility of sequence divergence as a predictive feature for N-terminal sorting signals.

Our method requires defining an ortholog set for each gene. The YGOB curated dataset for 11 yeast species is a reliable way to obtain orthologs, but this kind of database is not available for most species. We show that a simple reciprocal best hit method identified orthologs with sufficient reliability for the purposes of computing sequence divergence. One avenue for future research is to relax the requirement of a global alignment reciprocal best hit, which is designed to find orthologs, and simply use (possibly paralogous) homologous sequences. In this study we chose to focus on orthologs because paralogs often have distinct localization sites. For example, Rosso et al. [67] describe the interesting case of the human glutamate dehydrogenases GLUD1 and GLUD2. These paralogs result from a gene duplication event, but GLUD1 localizes to both the cytosol and the mitochondria while GLUD2 localizes exclusively to the mitochondria. Interestingly, the N-terminal region of GLUD2, which functions as an MTS, has evolved faster than that of GLUD1 [67].
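The reciprocal best hit idea can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it assumes that pairwise similarity scores between the proteins of two species have already been computed by some alignment tool, and the function names, gene identifiers and scores are all hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score;
    on ties the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which a and b are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores with made-up gene names, for illustration only.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 120.0, ("a3", "b1"): 200.0}
b_vs_a = {("b1", "a1"): 245.0, ("b1", "a3"): 198.0,
          ("b2", "a2"): 118.0, ("b2", "a1"): 88.0}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In this toy example a3 has b1 as its best hit, but b1 prefers a1, so a3 is left without an ortholog. This is the behaviour that makes the criterion conservative with respect to paralogs.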

Since we made a few somewhat arbitrary choices when defining divergence features, we performed a post hoc analysis to see whether simply tuning those parameters would significantly affect the prediction accuracy. Namely, we investigated the effect of changing the length and position of the downstream normalizing window used to define NCdiff, but found that prediction accuracy is not strongly dependent on the exact value of these parameters (Additional file 1: Figures S1, S2). Another potential weakness of our method is the simple entropy-based definition we used for sequence divergence, which ignores the phylogenetic relationship of the species involved. Many sophisticated measures have been proposed to quantify the degree of sequence conservation [42]. We did experiment with some of them, such as the Jensen-Shannon divergence [68], to try to improve prediction, but without success (results not shown). However, we did not extensively explore the possibilities and believe that the simple entropy score employed here can probably be improved upon.
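To make the role of the window parameters concrete, the following Python sketch computes a per-column Shannon entropy over a multiple sequence alignment and contrasts an N-terminal window with a downstream window. It is only an illustration under stated assumptions: the exact definitions of the column entropy score and of NCdiff are those given in Methods, whereas here we simply assume a difference of mean entropies, treat gaps as an ordinary symbol, and use hypothetical function names, window settings and toy sequences.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def window_mean_entropy(msa, start, length):
    """Mean column entropy over columns [start, start + length) of the alignment."""
    columns = range(start, min(start + length, len(msa[0])))
    values = [column_entropy([seq[i] for seq in msa]) for i in columns]
    return sum(values) / len(values)


def nc_diff_sketch(msa, n_length, down_start, down_length):
    """Assumed form: mean entropy of the N-terminal window minus that of a downstream window."""
    n_term = window_mean_entropy(msa, 0, n_length)
    downstream = window_mean_entropy(msa, down_start, down_length)
    return n_term - downstream


# Made-up aligned toy sequences: divergent in columns 1-3, conserved in columns 4-7.
toy_msa = ["MLSRAKLV", "MASTAKLV", "MPTQAKLV", "MRKNAKLV"]

score = nc_diff_sketch(toy_msa, n_length=4, down_start=4, down_length=4)
print(score)
```

Varying `down_start` and `down_length` in such a sketch corresponds to the kind of parameter sweep described above; a positive value indicates an N-terminal region that is more divergent than the downstream reference window.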

On the other hand, we did provide quantitative evidence that the entropy divergence score has considerable predictive power by itself. The examples ALA1 and FMP52 show that divergence can flag proteins (typically mitochondrial ones) with misannotated MTS information and give a hint regarding which compartment of the mitochondria they localize to. Examples like MrpL32 show that when the predictions of standard predictors are inconsistent with the degree of sequence divergence, non-typical MTSs, unusual processing proteases or alternative mitochondrial localization pathways may be indicated.

One weakness in our datasets is that many of our SP proteins are not experimentally validated, but rather are labeled as SP proteins on the basis of UniProtKB/Swiss-Prot annotation and, in the yeast dataset, prediction from amino acid sequence with SignalP [33]. This unfortunate circularity (predicting predictions) is unavoidable because: 1) only a handful of SPs have been experimentally verified, and 2) the presence of SPs cannot be reliably inferred exclusively from localization site for most S.cere. proteins. It may be reasonable to assume that secreted proteins all have SPs, but S.cere. secretes very few proteins (the Swiss-Prot derived WoLF PSORT [69] dataset lists only six). Proteins which localize to the E.R. or Golgi body generally possess SPs, but many proteins annotated as E.R. or Golgi are non-SP-containing peripheral membrane proteins, which localize to the periphery of these organelles. However, the risk of incorrect conclusions resulting from the use of non-verified SP data is small. First, this problem only applies to the SP class, as recent proteomics data have provided direct measurement of many MTSs [11, 32]. Second, given the intense study of S.cere. and the continued scrutiny of UniProtKB/Swiss-Prot by the research community, we find it unlikely that a large fraction of the SP proteins in our dataset are incorrectly labeled. Third, our argument is not completely circular. SignalP prediction is based on physico-chemical features and does not use divergence (or conservation), and the results shown in Figure 5 suggest that physico-chemical features do not correlate very closely with sequence divergence.

Conclusion

We find it rather remarkable that the accuracy of balanced 3-way prediction can be improved to more than 50% just by using simply defined sequence divergence features, while otherwise completely hiding the amino acid sequence of the protein. Although we readily admit the limited scope of this work, it is the first to quantitatively explore sequence divergence as a feature for localization signal prediction. We feel confident that our observation will stand the test of time, as more and more organisms are fully sequenced.

Note

A preliminary version of this work appeared as a conference proceedings paper [70].

References

1. Eisenhaber F, Bork P: Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol. 1998, 8: 169-170. 10.1016/S0962-8924(98)01226-4.
2. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, Liu Y, Cheung KH, Miller P, Gerstein M, Roeder GS, Snyder M: Subcellular localization of the yeast proteome. Genes Dev. 2002, 16 (6): 707-719. 10.1101/gad.970902.
3. Huh WK, Falvo JV, Gerke LG, Carroll AS, Howson RW, Weissman JS, O’Shea EK: Global analysis of protein localization in budding yeast. Nature. 2003, 425 (6959): 689-691.
4. Imai K, Nakai K: Prediction of subcellular locations of proteins: where to proceed?. Proteomics. 2010, 10 (22): 3970-3983. 10.1002/pmic.201000274.
5. Nair R, Rost B: Sequence conserved for subcellular localization. Protein Sci. 2002, 11 (12): 2836-2847.
6. Blobel G, Dobberstein B: Transfer of proteins across membranes. I. Presence of proteolytically processed and unprocessed nascent immunoglobulin light chains on membrane-bound ribosomes of murine myeloma. J Cell Biol. 1975, 67 (3): 835-851. 10.1083/jcb.67.3.835.
7. Schatz G, Dobberstein B: Common principles of protein translocation across membranes. Science. 1996, 271 (5255): 1519-1526. 10.1126/science.271.5255.1519.
8. von Heijne G: Patterns of amino acids near signal-sequence cleavage sites. Eur J Biochem. 1983, 133: 17-21. 10.1111/j.1432-1033.1983.tb07424.x.
9. Gakh O, Cavadini P, Isaya G: Mitochondrial processing peptidases. Biochim Biophys Acta. 2002, 1592: 63-77. 10.1016/S0167-4889(02)00265-3.
10. Teixeira PF, Glaser E: Processing peptidases in mitochondria and chloroplasts. Biochim Biophys Acta. 2013, 1833 (2): 360-370. 10.1016/j.bbamcr.2012.03.012.
11. Huang S, Taylor NL, Whelan J, Millar AH: Refining the definition of plant mitochondrial presequences through analysis of sorting signals, N-terminal modifications, and cleavage motifs. Plant Physiol. 2009, 150 (3): 1272-1285. 10.1104/pp.109.137885.
12. Saitoh T, Igura M, Obita T, Ose T, Kojima R, Maenaka K, Endo T, Kohda D: Tom20 recognizes mitochondrial presequences through dynamic equilibrium among multiple bound states. EMBO J. 2007, 26 (22): 4777-4787. 10.1038/sj.emboj.7601888.
13. Yamamoto H, Itoh N, Kawano S, Yatsukawa Y, Momose T, Makio T, Matsunaga M, Yokota M, Esaki M, Shodai T, Kohda D, Hobbs AE, Jensen RE, Endo T: Dual role of the receptor Tom20 in specificity and efficiency of protein import into mitochondria. Proc Natl Acad Sci U S A. 2011, 108: 91-96. 10.1073/pnas.1014918108.
14. Horton P, Mukai Y, Nakai K: Protein localization prediction. The Practical Bioinformatician. Edited by: Wong L. 2004, Singapore: World Scientific, 193-215.
15. Nakashima H, Nishikawa K: Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol. 1994, 238: 54-61. 10.1006/jmbi.1994.1267.
16. Yuan Z: Prediction of protein subcellular locations using Markov chain models. FEBS Lett. 1999, 451: 23-26. 10.1016/S0014-5793(99)00506-2.
17. Cedano J, Pérez-Ponsa JA, Querol E: Relation between amino acid composition and cellular location of proteins. J Mol Biol. 1997, 266 (3): 594-600. 10.1006/jmbi.1996.0804.
18. Reinhardt A, Hubbard T: Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 1998, 26 (9): 2230-2236. 10.1093/nar/26.9.2230.
19. Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19 (13): 1656-1663. 10.1093/bioinformatics/btg222.
20. Sakiyama N, Runcong K, Sawada R, Sonoyama M, Mitaku S: Nuclear localization of proteins with a charge periodicity of 28 residues. Chem-BioInformatics J. 2007, 7: 35-48.
21. Drawid A, Gerstein M: A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol. 2000, 301 (4): 1059-1075. 10.1006/jmbi.2000.3968.
22. Frank K, Sippl MJ: High-performance signal peptide prediction based on sequence alignment techniques. Bioinformatics. 2008, 24 (19): 2172-2176. 10.1093/bioinformatics/btn422.
23. Andrade MA, O’Donoghue SI, Rost B: Adaptation of protein surfaces to subcellular location. J Mol Biol. 1998, 2 (1998): 517-525.
24. McCue LA, Thompson W, Carmack CS, Ryan MP, Liu JS, Derbyshire V, Lawrence CE: Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res. 2001, 29 (3): 774-782. 10.1093/nar/29.3.774.
25. Davey NE, Shields DC, Edwards RJ: Masking residues using context-specific evolutionary conservation significantly improves short linear motif discovery. Bioinformatics. 2009, 25 (4): 443-450. 10.1093/bioinformatics/btn664.
26. Martinsen L, Johnsen A, Venanzetti F, Bachmann L: Phylogenetic footprinting of non-coding RNA: hammerhead ribozyme sequences in a satellite DNA family of Dolichopoda cave crickets (Orthoptera, Rhaphidophoridae). BMC Evol Biol. 2010, 10: 3. 10.1186/1471-2148-10-3.
27. Nair R, Rost B: Better prediction of sub-cellular localization by combining evolutionary and structural information. PROTEINS. 2003, 53 (4): 917-930. 10.1002/prot.10507.
28. Yogev O, Pines O: Dual targeting of mitochondrial proteins: mechanism, regulation and function. Biochim Biophys Acta. 2011, 1808 (3): 1012-1020. 10.1016/j.bbamem.2010.07.004.
29. Christopher C, Small I: A reevaluation of dual-targeting of proteins to mitochondria and chloroplasts. Biochim Biophys Acta. 2013, 1833 (2): 253-259. 10.1016/j.bbamcr.2012.05.029.
30. Tsukamoto T, Hata S, Yokota S, Miura S, Fujiki Y, Hijikata M, Miyazawa S, Hashimoto T, Osumi T: Characterization of the signal peptide at the amino terminus of the rat peroxisomal 3-ketoacyl-CoA thiolase precursor. J Biol Chem. 1994, 269 (8): 6001-6010.
31. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bairoch A: UniProtKB/Swiss-Prot. Methods Mol Biol. 2007, 406: 89-112.
32. Vögtle F, Wortelkamp S, Zahedi R, Becker D, Leidhold C, Gevaert K, Kellermann J, Voos W, Sickmann A, Pfanner N, Meisinger C: Global analysis of the mitochondrial N-proteome identifies a processing peptidase critical for protein stability. Cell. 2009, 139 (2): 428-439. 10.1016/j.cell.2009.07.045.
33. Bendtsen J, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004, 340 (4): 783-795. 10.1016/j.jmb.2004.05.028.
34. Dondoshansky I: Blastclust (NCBI Software Development Toolkit). 2002
35. Small I, Peeters N, Legeai F, Lurin C: Predotar: a tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics. 2004, 4 (6): 1581-1590. 10.1002/pmic.200300776.
36. Byrne KP, Wolfe KH: The yeast gene order browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 2005, 15 (10): 1456-1461. 10.1101/gr.3672305.
37. Altenhoff AM, Dessimoz C: Inferring orthology and paralogy. Evolutionary Genomics: Statistics and Computational Methods. Methods in Molecular Biology. Edited by: Anisimova M. 2012, USA: Humana Press, 259-277.
38. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A. 1999, 96 (6): 2896-2901. 10.1073/pnas.96.6.2896.
39. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010, 26 (19): 2460-2461. 10.1093/bioinformatics/btq461. [USEARCH]
40. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002, 30 (14): 3059-3066. 10.1093/nar/gkf436.
41. Mayrose I, Graur D, Ben-Tal N, Pupko T: Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol Biol Evol. 2004, 21 (9): 1781-1791. 10.1093/molbev/msh194.
42. Johansson F, Toh H: A comparative study of conservation and variation scores. BMC Bioinformatics. 2010, 11: 388. 10.1186/1471-2105-11-388.
43. Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157: 105-132. 10.1016/0022-2836(82)90515-0.
44. Quinlan JR: Induction of decision trees. Mach Learn. 1986, 1: 81-106.
45. Quinlan JR: C4.5: Programs for Machine Learning. 1993, San Francisco: Morgan Kaufmann Publishers Inc.
46. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsl. 2009, 11: 10. 10.1145/1656274.1656278.
47. Vapnik VN: The Nature of Statistical Learning Theory. 1995, New York: Springer-Verlag New York, Inc.
48. Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2 (3): 1-27.
49. Hsu C, Lin C: A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw. 2002, 13 (2): 415-425. 10.1109/72.991427.
50. Allwein EL, Schapire RE, Singer Y: Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res. 2001, 1: 113-141.
51. Fayyad UM, Irani KB: Multi-interval discretization of continuous-valued attributes for classification learning. International Joint Conference on Artificial Intelligence. 1993, 1022-1027.
52. He H, Garcia EA: Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009, 21 (9): 1263-1284. [http://portal.acm.org/citation.cfm?id=1591901.1592322]
53. Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975, 405 (2): 442-451. 10.1016/0005-2795(75)90109-9.
54. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000, 16 (5): 412-424. 10.1093/bioinformatics/16.5.412.
55. Fawcett T: An introduction to ROC analysis. Pattern Recognit Lett. 2006, 27 (8): 861-874. 10.1016/j.patrec.2005.10.010.
56. Agarwal S, Graepel T, Herbrich R, Har-Peled S, Roth D: Generalization bounds for the area under the ROC curve. J Mach Learn Res. 2005, 6: 393-425.
57. Williams EJ, Pal C, Hurst LD: The molecular evolution of signal peptides. Gene. 2000, 252 (2): 313-322.
58. Dujon B: Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet. 2006, 22 (7): 357-387. 10.1016/j.tig.2006.05.002.
59. Zahedi RP, Sickmann A, Boehm AM, Winkler C, Zufall N, Schönfisch B, Guiard B, Pfanner N, Meisinger C: Proteomic analysis of the yeast mitochondrial outer membrane reveals accumulation of a subclass of preproteins. Mol Biol Cell. 2006, 17 (3): 1436-1450.
60. Kambacheld M, Augustin S, Tatsuta T, Muller S, Langer T: Role of the novel metallopeptidase Mop112 and saccharolysin for the complete degradation of proteins residing in different subcompartments of mitochondria. J Biol Chem. 2005, 280 (20): 20132-20139. 10.1074/jbc.M500398200.
61. Emanuelsson O, Brunak S, von Heijne G, Nielsen H: Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc. 2007, 2 (4): 953-971. 10.1038/nprot.2007.131.
62. Nolden M, Ehses S, Koppen M, Bernacchia A, Rugarli EI, Langer T: The m-AAA protease defective in hereditary spastic paraplegia controls ribosome assembly in mitochondria. Cell. 2005, 123 (2): 277-289. 10.1016/j.cell.2005.08.003.
63. Bonn F, Tatsuta T, Petrungaro C, Riemer J, Langer T: Presequence-dependent folding ensures MrpL32 processing by the m-AAA protease in mitochondria. EMBO J. 2011, 30 (13): 2545-2556. 10.1038/emboj.2011.169.
64. Grohmann L, Graack HR, Kruft V, Choli T, Goldschmidt-Reisin S, Kitakawa M: Extended N-terminal sequencing of proteins of the large ribosomal subunit from yeast mitochondria. FEBS Lett. 1991, 284: 51-56. 10.1016/0014-5793(91)80759-V.
65. Vögtle FN, Prinz C, Kellermann J, Lottspeich F, Pfanner N, Meisinger C: Mitochondrial protein turnover: role of the precursor intermediate peptidase Oct1 in protein stabilization. Mol Biol Cell. 2011, 22 (13): 2135-2143. 10.1091/mbc.E11-02-0169.
66. Doyle SR, Kasinadhuni NR, Chan CK, Grant WN: Evidence of evolutionary constraints that influences the sequence composition and diversity of mitochondrial matrix targeting signals. PLoS ONE. 2013, 8 (6): e67938. 10.1371/journal.pone.0067938.
67. Rosso L, Marques AC, Reichert AS, Kaessmann H: Mitochondrial targeting adaptation of the hominoid-specific glutamate dehydrogenase driven by positive Darwinian selection. PLoS Genetics. 2008, 4 (8): e1000150. 10.1371/journal.pgen.1000150.
68. Capra JA, Singh M: Predicting functionally important residues from sequence conservation. Bioinformatics. 2007, 23 (15): 1875-1882. 10.1093/bioinformatics/btm270.
69. Horton P, Park KJ, Obayashi T, Fujita N, Harada H, Adams-Collier C, Nakai K: WoLF PSORT: protein localization predictor. Nucleic Acids Res. 2007, 35 (Web Server issue): W585-W587.
70. Fukasawa Y, Leung RK, Tsui SK, Horton P: Evolutionary sequence divergence predicts protein sub-cellular localization signals. Proceedings 5th IEEE International Conference on Systems Biology. 2011, IEEE Publishing, 307-312.

Acknowledgements

This work was supported by a JSPS KAKENHI, Grant-in-Aid for JSPS Fellows (grant number 12J06550) and a Monkasho KAKENHI Grant-in-Aid for Scientific Research (grant number 23300112).

Author information

Authors and Affiliations

Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan
Yoshinori Fukasawa & Paul Horton
Japan Society for the Promotion of Science, Tokyo Chiyoda, Japan
Yoshinori Fukasawa
Hong Kong Bioinformatics Centre and School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, China
Ross KK Leung & Stephen KW Tsui
Computational Biology Research Center, Advanced Industrial Science and Technology, Tokyo, Japan
Paul Horton

Corresponding author

Correspondence to Paul Horton.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

YF performed most of the study and wrote much of the manuscript. RL helped with initial attempts at automatic ortholog set determination. PH conceived of the study and wrote some of the manuscript. All authors contributed to discussion and have read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Supplementary Text. Contains the supplementary text with tables and figures. (PDF 324 KB)

12864_2013_5703_MOESM2_ESM.zip

Additional file 2: MSAs of proteins for which sequence divergence changes predicted localization signals. Contains links to ortholog multiple sequence alignments of each protein in Additional file 3: Table S1. (ZIP 316 KB)

12864_2013_5703_MOESM3_ESM.csv

Additional file 3: List of proteins for which sequence divergence changes predicted localization signals. A tab separated values file listing proteins and their prediction scores with and without the use of divergence features. (CSV 3 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Fukasawa, Y., Leung, R.K., Tsui, S.K. et al. Plus ça change – evolutionary sequence divergence predicts protein subcellular localization signals. BMC Genomics 15, 46 (2014). https://doi.org/10.1186/1471-2164-15-46

Received: 31 July 2013
Accepted: 06 January 2014
Published: 20 January 2014
DOI: https://doi.org/10.1186/1471-2164-15-46
