The conservation pattern of short linear motifs is highly correlated with the function of interacting protein domains
© Ren et al. 2008
Received: 04 March 2008
Accepted: 01 October 2008
Published: 01 October 2008
Many well-represented domains recognize primary sequences usually less than 10 amino acids in length, called Short Linear Motifs (SLiMs). Accurate prediction of SLiMs has been difficult because they are short (often < 10 amino acids) and highly degenerate. In this study, we combined scoring matrices derived from peptide libraries with conservation analysis to identify protein classes enriched in functional SLiMs recognized by SH2, SH3, PDZ and S/T kinase domains.
Our combined approach revealed that SLiMs are highly conserved in proteins from functional classes that are known to interact with a specific domain, but that they are not conserved in most other protein groups. We found that SLiMs recognized by SH2 domains were highly conserved in receptor kinases/phosphatases, adaptor molecules, and tyrosine kinases/phosphatases, that SLiMs recognized by SH3 domains were highly conserved in cytoskeletal and cytoskeletal-associated proteins, that SLiMs recognized by PDZ domains were highly conserved in membrane proteins such as channels and receptors, and that SLiMs recognized by S/T kinase domains were highly conserved in adaptor molecules, S/T kinases/phosphatases, and proteins involved in transcription or cell cycle control. We studied Tyr-SLiMs recognized by SH2 domains in more detail, and found that SH2-recognized Tyr-SLiMs on the cytoplasmic side of membrane proteins are more highly conserved than those on the extra-cellular side. Also, we found that SH2-recognized Tyr-SLiMs that are associated with SH3 motifs and a tyrosine kinase phosphorylation motif are more highly conserved.
The interactome of protein domains is reflected by the evolutionary conservation of SLiMs recognized by these domains. By combining scoring matrices derived from peptide libraries with conservation analysis, we can identify the protein groups that are more likely to interact with specific domains.
Selective protein-protein interactions are important for cellular functions and are often mediated by protein domains that recognize specific primary sequences within target proteins called Short Linear Motifs (SLiMs). Accurate prediction of SLiMs has been difficult because they are short (often < 10 amino acids) and highly degenerate. A major advance in SLiM identification came with a peptide library-based technique that can map the sequence motif recognized by an SH2 domain without prior knowledge of in vivo interaction sites. Similar peptide library experiments have been performed to map the motifs recognized by other domains. Motifs discovered through peptide library screening have shown high levels of agreement with reported domain interaction sites [1, 2]. This became the basis for Scansite [3, 4], a bioinformatics program developed to predict SLiMs in query proteins that are recognized by specific protein domains. Other bioinformatic approaches, like those available in Minimotif-Miner, QuasiMotifFinder, MCS and a tree-based scoring method, applied evolutionary conservation as well as other sequence filters to assess the functional relevance of a hit.
Table 1. Invariant features in SLiMs recognized by SH2, SH3, PDZ and S/T kinase domains (surviving row labels: SH3 Type 1; SH3 Type 2)
To evaluate the specificity of motif prediction, we compared the SH2 selectivity values (which are calculated using enrichment values from peptide library screening) of SLiMs in proteins from reported binding groups with the SH2 selectivity values of SLiMs in proteins from groups that are not reported to bind. We found that fewer than 40% of non-binding SLiMs have a selectivity value > 5, whereas over 80% of binding SLiMs have a selectivity value > 5. Higher selectivity values correspond to a higher specificity of interaction (Figure 2B). These results demonstrate that predicting domain binding to SLiMs based on motifs from peptide library experiments is effective.
Taking into consideration all binding partners, we found that for 20 of the 21 SH2 motifs, Tyr-SLiMs recognized by SH2 domains (selectivity value ≥5) have a higher average ln(CR) score than those not recognized by SH2 domains (selectivity value < 5); 11 of these are statistically significant (p < 0.05). In the receptor kinase and phosphatase group, 8 cases showed a significant increase in ln(CR) score. However, no significant increase in ln(CR) score was observed in the cell cycle control protein group (Figure 3A, right panel).
Table 2. Molecular functional classes frequently reported to interact with SH2, SH3 or PDZ domains, or to be phosphorylated by S/T kinases (surviving labels: Cell surface receptor; Guanine nucleotide exchange factor; GTPase activating protein; Phospho ratio #; Cell cycle control protein; Transcription regulatory protein)
For SH3-recognized SLiMs (Figure 3B, second panel), conservation was strongest in cytoskeletal and cytoskeletal-associated proteins. Calcium binding proteins, RNA binding proteins, Tyr-kinases/phosphatases and guanine nucleotide exchange factors also had strong conservation signals. The conservation signal was almost absent in other functional classes. This is largely consistent with the protein groups frequently reported to interact with SH3 domains (Table 2).
Consistent with biochemical evidence that PDZ domains frequently interact with membrane proteins, we found that PDZ domain-recognized SLiMs (Figure 3B, third panel) are specifically conserved in membrane proteins including channels, integral membrane proteins, cell surface receptors, G protein/G protein coupled receptors and membrane transport proteins. The frequent interacting partners of PDZ domain-containing proteins are channels, adhesion molecules and cell surface receptors (Table 2). Our results suggest that some membrane proteins, such as integral membrane proteins, have probably been less well studied but nevertheless play an important role in interactions with PDZ domains.
As shown in Figure 3B, fourth panel, the proteins containing SLiMs recognized by S/T kinases in the basophilic group (basophilic S/T kinases in this study included AKT, PKA, PKC, SRPK2, Clk2, NIMA, PhK, CamK2, SLK and MAPKAPK2) seem to be involved in a wider variety of cellular functions than proteins with SLiMs recognized by SH2, SH3 and PDZ domains. S/T kinase domain-recognized SLiMs were conserved in proteins involved in signal transduction (adaptor proteins and Ser/Thr kinases/phosphatases), in cytoskeletal-associated proteins, in proteins related to transcription and cell-cycle control, and also in some membrane proteins. However, the proteins containing conserved SLiMs recognized by proline-dependent Ser/Thr kinases (including CDK2, CDC2 and CDK5) were more specifically involved in transcription and cell-cycle control, with almost no conservation signal from other functional categories. The conservation pattern of SLiMs recognized by S/T kinases is highly consistent with the protein functional groups that have a high serine phosphorylation ratio (Table 2).
Remarkably, most functional classes of proteins with a significant conservation signal were highly specific for the signal within one group of domains, but not in other groups. For example, the receptor kinase/phosphatase group shows a conservation signal only in the SH2 domain group, and transcription factors show one only in the Ser/Thr kinase domain group (Figure 3B). Nevertheless, a few protein functional classes, such as adaptor molecules and cytoskeletal-associated proteins, exhibited a significant conservation signal in multiple groups of domains; this corresponds to the fact that these proteins participate in multiple signaling pathways involving interactions with more than one domain.
Using SH2 domain-interacting SLiMs as a model, we applied our method of conservation analysis to study additional aspects of SLiM conservation. Specifically, we investigated the conservation of SLiMs in proteins that interact with two different protein domains in a signaling pathway, and we studied the relationship between conservation of SLiMs and sub-cellular localization.
On the other hand, many tyrosine kinases (including the well-known Src family kinases) and adaptor molecules have both SH2 and SH3 domains, and it has been suggested, with support from biochemical studies [10, 11], that proteins containing multiple SH3 binding sites are more likely to be tyrosine phosphorylated and to bind to SH2 domains. Consistent with this reasoning, SH2-recognized Tyr-SLiMs in signal transduction proteins that have more than ten PXXP SH3 binding motifs are significantly more conserved than SLiMs without this selection (Figure 6A). However, this trend is not observed in SLiMs in functional classes other than the signal transduction class (Figure 6A), which agrees well with the fact that most SH2-binding proteins are signal transduction proteins.
We further divided signal transduction groups into subgroups according to sub-cellular localization. Under selections for both the kinase motif and SH3 binding motifs, a high level of SLiM conservation was most evident in signal transduction proteins localized to the cytoplasm or plasma membrane, but conservation of SLiMs was weaker for those proteins localized to the nucleus (Figure 6A). This is consistent with biochemical evidence that tyrosine phosphorylation occurs mainly in the cytoplasm and plasma membrane (the ratios of proteins that bind to SH2-containing proteins in the cytoplasm, plasma membrane and nucleus are 16.1%, 11.4% and 4.7% respectively, according to Hprd). Conservation profiles for different functional classes of proteins with or without SH3 and Tyr-kinase motif selection are shown in Figure 6B.
These findings support the hypothesis that tyrosine kinases and SH3 domains are frequently coupled to SH2 domain signaling. The coupling between a tyrosine kinase and SH2 domains is expected, since an SH2 domain can only bind to a Tyr-SLiM after the tyrosine residue has been phosphorylated by a Tyr-kinase. However, the coupling between SH2 and SH3 domains might be less direct. Either a sequential model or a cooperative model, depending on whether the target tyrosine residue is phosphorylated before the interaction, may be used to explain the coupling between SH2 and SH3 domains (Figure 6C). In the sequential model, PXXP motifs recruit SH3 domain containing Tyr-kinases, which in turn phosphorylate the tyrosine residues in the target protein. The pYXXX motif can then recruit an SH2 domain (Figure 6C, upper panel). In the cooperative model, the SH2 and SH3 domains in a single kinase or adaptor molecule bind to the pYXXX motif and the PXXP motif, respectively, to increase the strength of the interaction (Figure 6C, lower panel). Both of these models may explain the coupling between SH2 and SH3 domains. Early in tyrosine phosphorylation-mediated signal transduction, most tyrosine residues are not phosphorylated, so the sequential model may prevail. However, after more tyrosine residues in signaling proteins become phosphorylated, the cooperative model may become increasingly relevant.
Protein-protein interactions mediated by SLiMs have a widespread influence on cellular functions [12, 13]. In this study, we examined these interactions by combining scoring matrices derived from peptide libraries with conservation analysis. We detected signals of evolutionary conservation in SLiMs in proteins from functional classes that are known to participate in the signal transduction of a specific protein domain. Further, our analysis of membrane proteins indicated that only the cytoplasmic side is involved in SH2 signaling in both Type I and II membrane proteins. Our results also suggest that tyrosine kinase and SH3 domains are coupled with SH2 domain signaling in signal transduction proteins.
It was recently reported that several bacterially secreted cytotoxins contain multiple repeated Tyr-SLiMs with high affinity for both tyrosine kinases and SH2 domains [14–17]. Many of these cytotoxins are phosphorylated upon entry into host cells and bind to a variety of SH2 proteins. For example, the CagA protein secreted by Helicobacter pylori can be phosphorylated by Src and associates with the Shp2 and Csk SH2 domains, which is essential for cellular changes induced by the bacteria. The strong cellular response initiated by these SH2 binding Tyr-SLiMs further supports our assumption that SLiMs are under continuous evolutionary selection to preserve functional sites and eliminate harmful mutations. Recent work on the negative selection of SH3 domain-recognized sequences also suggests that SLiMs may undergo strong evolutionary selection.
While most protein functional classes with a strong conservation signal are known to be involved in the signaling of the respective domains, there are a few exceptions, which may represent undiscovered but functional binding sites. For example, although fewer than 3% of structural and cytoskeletal proteins have been recorded to bind to SH2 proteins, their Tyr-SLiMs selected by SH2 domains had significantly increased CR scores. It has been reported that alpha-Tubulin, a cytoskeletal protein, binds to the Fyn SH2 domain, and that intermediate filaments of the cytokeratin type undergo tyrosine phosphorylation. In the latter case, further evaluation is necessary to determine whether the phosphorylation leads to SH2 binding.
Another interesting observation is that DNA binding proteins also show a conservation signal in their potential SH2 binding sites. Although tyrosine phosphorylation is generally believed to be less common in the nucleus, a growing body of evidence for the tyrosine phosphorylation of DNA binding proteins has been reported, as in the cases of the KRC DNA binding protein, the estrogen receptor, TFII-I and other examples. Since many SH2-containing proteins, such as Fes, SHC, Nck and Vav, have been reported to enter the nucleus, SH2 domains may mediate functional interactions with DNA binding proteins. As with SH2 domains, we observed that DNA binding proteins also show a conservation signal in potential PDZ binding sites. Although most reported interactions mediated by PDZ domains are restricted to membrane proteins, proteins that contain PDZ domains (for example, LIM-kinase 1 and Par3) have been reported to enter the nucleus, suggesting that they may mediate protein-protein interactions in the nucleus. Whether these observations point to a new direction of research is worth investigating.
Although our results from conservation analysis correlated well with biochemical data in general, our method is still prone to error. First, our motif prediction is based on in vitro peptide scanning techniques, which may be biased due to differences between in vitro and in vivo conditions. Second, we assumed that each position of the SLiM contributed equally to binding, and only SLiMs that were conserved at each position were assumed to be conserved. To improve this method in the future, different weights could be assigned to each position, and amino acid similarity could be considered. Finally, evolutionary conservation can only provide indirect clues regarding function. For example, some SLiMs may only be important for a few species, and these would not have been detected in our analysis.
Our results indicate that the conservation pattern of SLiMs recognized by SH2, SH3, PDZ, and S/T kinase domains correlates highly with the function of these domains. As motifs recognized by other domains become better defined, conservation analysis will be able to provide valuable clues as to their functional roles, as well as possible preferences for their sub-cellular localization or for their coupling with other domains, and even structural implications. For example, the authors of a recently published paper show that SLiMs are more likely to be conserved in disordered protein regions. Recently, peptide array-based technology has been developed and is becoming increasingly available [33, 34]. New technologies are expected to make motif discovery easier and potentially more accurate. Currently, many of the motifs discovered are defined only as regular expressions, which usually provide less information than motifs defined from the results of peptide library screening. Nevertheless, it should be possible to retrieve useful information from those less well-defined motifs using more sophisticated algorithms in the future.
This study systematically examined the evolutionary conservation of SLiMs recognized by SH2, SH3, PDZ and S/T kinase domains, which reflects the interactome of these domains. Specifically, SLiMs within protein functional groups that are frequently involved in interactions with a given domain are significantly more conserved than SLiMs within other groups. A study of manually extracted SH2 interaction sites in the 11 most-studied receptor tyrosine kinases provided experimental evidence that Tyr-SLiMs reported to interact with SH2 domains are significantly more conserved than those that are not. Furthermore, by analyzing SLiMs in membrane proteins and SLiMs under selection by two different domains, we show that this conservation analysis can also provide useful information about the sub-cellular localization of the interaction and about domain coupling.
We selected 7,248 human proteins for our protein functional classification analysis and 8,682 proteins for our cellular process classification analysis, using the following criteria: (1) the protein had a SwissProt-annotated sequence; (2) the protein had a molecular function or cellular process annotated by the Human Protein Reference Database (Hprd); (3) the molecular function or cellular process of the protein was within 34 well-represented functional classes of proteins in Hprd.
Human protein sequence data are from the SwissProt database, downloaded from ftp://ftp.ncbi.nih.gov in November 2005. Protein-protein interactions, and classifications for protein molecular functions, biological processes and sub-cellular localizations are from the Hprd dataset. This is a non-redundant, manually curated protein database, and the data were downloaded in November 2005 from http://www.hprd.org. Phosphorylated sites were obtained from the Phospho.ELM database, provided by Francesca Diella in December 2005. We excluded several sequence regions unlikely to contain SLiMs (globular domains, coiled-coils, collagen regions and signal peptides, as annotated in SwissProt), because no more than 15% of known SLiMs [12, 37, 38] occur in these regions.
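The region exclusion described above can be illustrated with a minimal sketch. It assumes that excluded regions are available as 0-based, half-open `(start, end)` intervals and that a SLiM is dropped if it overlaps an excluded region at any position; the text does not state how partial overlaps were handled, so that rule, along with the function names `overlaps` and `keep_slim`, is hypothetical.

```python
def overlaps(start, end, region):
    """True if the half-open span [start, end) intersects the region."""
    r_start, r_end = region
    return start < r_end and r_start < end


def keep_slim(slim_start, slim_len, excluded_regions):
    """Keep a SLiM only if it lies entirely outside every excluded region.

    Assumption: any overlap with an annotated globular domain, coiled-coil,
    collagen region or signal peptide removes the SLiM from the analysis.
    """
    slim_end = slim_start + slim_len
    return not any(overlaps(slim_start, slim_end, r) for r in excluded_regions)


# Hypothetical annotation: one excluded region covering residues 10..49.
excluded = [(10, 50)]
print(keep_slim(5, 4, excluded))   # SLiM at 5..8, outside the region
print(keep_slim(8, 4, excluded))   # SLiM at 8..11, overlaps the region
```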
Using the human protein sequences selected as described above, we generated pair-wise local alignments with BLAST against 12 higher eukaryotic species (Canis familiaris, Bos taurus, Mus musculus, Rattus norvegicus, Gallus gallus, Xenopus tropicalis, Tetraodon nigroviridis, Danio rerio, Strongylocentrotus purpuratus, Drosophila melanogaster, Apis mellifera, and Caenorhabditis elegans) to obtain homologous sequences for the respective human proteins. Species were selected according to their unique evolutionary positions (four mammals, four non-mammal vertebrates and four invertebrates) and sequence availability in the RefSeq database. Sequence data for all non-human species were from the RefSeq database, downloaded from ftp://ftp.ncbi.nih.gov in June 2006, except Tetraodon nigroviridis, which was from the NCBI Entrez non-redundant protein sequence database, downloaded from ftp://ftp.ncbi.nih.gov in June 2006. We applied two cutoff levels to avoid inclusion of insignificant hits: a score cutoff of 50 bits, and an overlap cutoff of 50%, as applied in Inparanoid. If more than one homologous sequence was obtained from a single species, the one with the lowest E-value was selected. Unlike Inparanoid or COG (Cluster of Orthologous Groups), which consider all species as equal entries, we compared sequences of all other species to those of human, because most of the biochemical data we used, including protein interaction data and protein classification data, were from human. Therefore, we considered only the best hit from each non-human species as homologous to the human query protein; we did not require mutually best matches between human and non-human species or among the non-human species themselves. We did not remove low complexity regions because SLiMs frequently occur within them.
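The hit-selection rule in this paragraph (two cutoffs, then the lowest E-value per species) can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record and `best_hits` function are hypothetical names, overlap is taken here as the fraction of the human query covered by the alignment, and a hit at exactly 50 bits or 50% overlap is treated as passing; the text does not specify either detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pair-wise BLAST hit for a human query (hypothetical record)."""
    species: str
    subject_id: str
    bit_score: float
    evalue: float
    aligned_query_len: int  # number of query residues covered by the alignment


def best_hits(hits, query_len, min_bits=50.0, min_overlap=0.5):
    """Return {species: Hit}: the lowest-E-value hit per species that
    passes both the bit-score and the overlap cutoff."""
    best = {}
    for hit in hits:
        if hit.bit_score < min_bits:
            continue  # fails the 50-bit score cutoff
        if hit.aligned_query_len / query_len < min_overlap:
            continue  # fails the 50% overlap cutoff (assumed: query coverage)
        current = best.get(hit.species)
        if current is None or hit.evalue < current.evalue:
            best[hit.species] = hit
    return best


hits = [
    Hit("Mus musculus", "a", 120.0, 1e-30, 80),
    Hit("Mus musculus", "b", 200.0, 1e-50, 90),
    Hit("Danio rerio", "c", 45.0, 1e-5, 90),     # below 50 bits
    Hit("Gallus gallus", "d", 80.0, 1e-10, 30),  # below 50% overlap
]
selected = best_hits(hits, query_len=100)
print({species: hit.subject_id for species, hit in selected.items()})
```

Note that the selection is one-directional (human query against each species), matching the statement that mutually best matches were not required.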
SLiM occurrences were defined based on invariant features for each domain as listed in Table 1 (Thr-SLiMs were not included in the analysis for Ser/Thr kinase domains because peptide library-mapped motifs were available only for Ser-SLiMs). All occurrences in the proteins that matched these invariant features were included in the analysis. For example, all sequences with the pattern YXXX were selected. For a particular protein sequence, we assumed that the sequence identity rate between a reference species (human in this study) and a species i is p(i), equal to the number of identical sites divided by the total number of sites aligned; where gaps occur in the aligned sequence of species i, the number of gaps was subtracted from the number of sites aligned to give the final alignment length. We also assumed that the SLiM under study is n amino acids in length (where the SLiM is at the terminus of a protein and is only partially available, the available length was used). If the SLiM is under the same selective pressure as the full-length protein, then the probability that the SLiM is conserved between the two species should be:
$$P_1(i) = p(i)^n$$
The probability that the SLiM is unconserved should be:
$$P_2(i) = 1 - P_1(i) = 1 - p(i)^n$$
The SLiM is considered unconserved if any gap occurs within its sequence alignments.
Here we define Relative Conservation (CR) between human and the ith species as:
a. if the SLiM is conserved: $CR(i) = 1/P_1(i) = 1/p(i)^n$;
b. if the SLiM is unconserved: $CR(i) = P_2(i) = 1 - p(i)^n$.
A CR score greater than 1 indicates that the SLiM is CR times more conserved than the average level of the protein. A score smaller than 1 indicates 1/CR times greater variability between species. Note that the number k may be different for different SLiMs according to the pair-wise BLAST results.
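The per-species CR definition can be illustrated with a short sketch. The function names are hypothetical, the gap character `-` is an assumed alignment convention, and only the per-species score is computed; how scores are combined across species is not shown here.

```python
import math


def identity_rate(identical_sites, aligned_sites, gaps):
    """p(i): identical sites divided by the alignment length after
    subtracting gaps in the aligned sequence of species i."""
    return identical_sites / (aligned_sites - gaps)


def slim_is_conserved(human_slim, aligned_slim):
    """A SLiM counts as conserved only if every position is identical
    and no gap ('-', an assumed convention) falls inside it."""
    return "-" not in aligned_slim and human_slim == aligned_slim


def relative_conservation(p, n, conserved):
    """CR(i) for a SLiM of length n at identity rate p:
    1 / p**n if conserved, 1 - p**n if unconserved."""
    p1 = p ** n
    return 1.0 / p1 if conserved else 1.0 - p1


# Hypothetical alignment: 160 identical sites, 210 aligned sites, 10 gaps.
p = identity_rate(160, 210, 10)                  # identity rate of 0.8
cr_conserved = relative_conservation(p, 4, slim_is_conserved("YENF", "YENF"))
cr_unconserved = relative_conservation(p, 4, slim_is_conserved("YENF", "YDNF"))
print(cr_conserved, math.log(cr_conserved))      # CR > 1, ln(CR) > 0
print(cr_unconserved, math.log(cr_unconserved))  # CR < 1, ln(CR) < 0
```

With an identity rate of 0.8 and a four-residue SLiM, a conserved occurrence scores 1/0.8^4 (about 2.44) and an unconserved one scores 1 - 0.8^4 (about 0.59), so the logarithm separates the two cases by sign.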
This method may not be suitable for SLiMs longer than 10 amino acids, since it assumes that most residues in the SLiM could influence the interaction. This may not be the case in longer sequences where only a small subset of the residues is critical to binding. This method was first developed in our lab and has demonstrated its effectiveness in another study, in which SLiMs were found to be more conserved in disordered protein regions.
For a putative SLiM, the selectivity value for a domain was calculated as the product of enrichment values from peptide library experiments [43, 44]. For example, to calculate the Src SH2 selectivity value of the SLiM YENF, we found that the enrichment values for E(Y+1) and N(Y+2) for Src SH2 (Table 3) are 2.5 and 2.4, respectively. No enrichment value for F(Y+3) was found (thus Y+3 does not contribute to the final value), and the selectivity value is the product of the two enrichment values (2.5 × 2.4 = 6.0). The enrichment values for SH3 domain-recognized motifs were assigned based on the amino acid sequences of peptides expressed by SH3-binding phage clones.
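The YENF calculation above can be reproduced with a minimal sketch. The enrichment table below contains only the two Src SH2 values quoted in the text; its layout, keyed by (offset from Tyr, residue), and the function names are hypothetical. Positions without an enrichment value contribute a factor of 1, mirroring the statement that Y+3 does not contribute. The scanner uses a lookahead so that overlapping YXXX occurrences are all reported, and it ignores partial motifs at the C-terminus for simplicity.

```python
import re

# Hypothetical layout: (offset from Tyr, residue) -> enrichment value.
# Only the two Src SH2 values quoted in the text are included.
SRC_SH2_ENRICHMENT = {(1, "E"): 2.5, (2, "N"): 2.4}


def selectivity(slim, enrichment):
    """Product of enrichment values over positions Y+1, Y+2, ...;
    positions with no recorded value contribute a factor of 1."""
    value = 1.0
    for offset, residue in enumerate(slim[1:], start=1):
        value *= enrichment.get((offset, residue), 1.0)
    return value


def find_tyr_slims(sequence):
    """Return (position, motif) for every YXXX occurrence, including
    overlapping ones (0-based positions)."""
    return [(m.start(), m.group(1))
            for m in re.finditer(r"(?=(Y.{3}))", sequence)]


for position, motif in find_tyr_slims("AYENFYAAA"):
    print(position, motif, selectivity(motif, SRC_SH2_ENRICHMENT))
```

In this toy sequence, YENF scores 2.5 × 2.4 and would pass a selectivity threshold of 5, while YAAA has no recorded enrichment values and keeps the neutral value of 1.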
Table 3. Enrichment values for the Src SH2 domain
We thank Wenchao Zhou and Uros Midic for comments on this manuscript. We are grateful to Dr. Michael B. Yaffe, Dr. Tieliu Shi, Dr. Longhou Fang, Li Zhuo and Dan Du for useful criticism and discussions. This work was supported by grants from the National Natural Science Foundation of China (No. 30730055 and No. 30623002), the National Key Scientific Program of China (No. 2007CB914504) and the National High Technology Research and Development Program of China (No. 2006AA02A308).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.