Effector prediction in host-pathogen interaction based on a Markov model of a ubiquitous EPIYA motif
© Xu et al. 2010
Published: 01 December 2010
Effector secretion is a common strategy that pathogens use to mediate host-pathogen interactions. Eight EPIYA-motif containing effectors have recently been discovered in six pathogens. Once these effectors enter host cells through type III/IV secretion systems (T3SS/T4SS), the tyrosine in the EPIYA motif is phosphorylated, which triggers binding of the effectors to other proteins and thereby manipulates host-cell functions. The objectives of this study are to evaluate the distribution pattern of the EPIYA motif across a broad range of biological species, to predict potential effectors containing the EPIYA motif, and to suggest roles and biological functions of these potential effectors in host-pathogen interactions.
A hidden Markov model (HMM) of five amino acids was built for the EPIYA motif based on the eight known effectors. Using this HMM to search the non-redundant protein database containing 9,216,047 sequences, we obtained 107,231 sequences with at least one EPIYA motif occurrence and 3,115 sequences with multiple repeats of the EPIYA motif. Although the EPIYA motif occurs across a broad range of species, it is significantly over-represented in particular groups of species. Most proteins containing at least four copies of the EPIYA motif are from intracellular bacteria, extracellular bacteria with T3SS or T4SS, or intracellular protozoan parasites. By combining the EPIYA motif and the adjacent SH2 binding motifs (KK, R4, Tarp and Tir), we built HMMs of nine amino acids and used them to predict many potential effectors in bacteria and protista. Some potential effectors for pathogens (such as Lawsonia intracellularis, Plasmodium falciparum and Leishmania major) are suggested.
Our study indicates that the EPIYA motif may be a ubiquitous functional site for effectors that play an important pathogenicity role in mediating host-pathogen interactions. We suggest that some intracellular protozoan parasites could secrete EPIYA-motif containing effectors through secretion systems similar to the T3SS/T4SS in bacteria. Our predicted effectors provide useful hypotheses for further studies.
As a complex and interesting relationship between organisms in ecology and evolution, host-pathogen interaction is the basis of infectious diseases. Pathogens span a broad spectrum of biological species, including viruses, bacteria, fungi, protozoa, and multicellular parasites. In all these cases, a pathogen causing an infection usually interacts extensively with the host during pathogenesis. The cross-talk between a host and a pathogen allows the pathogen to invade the host organism, to breach its immune defence, and to replicate and persist within the organism. One of the most important, and therefore most widely studied, groups of host-pathogen interactions is the interaction between pathogen proteins (effectors) and host cells. Effectors are secreted through the pathogens' secretion systems. So far, five types of secretion systems have been identified (Types I-V). Among them, T3SS (Type III Secretion System) and T4SS (Type IV Secretion System) can cross bacterial cell walls and host eukaryotic cell membranes to deliver effectors directly into host cells without passing through the extracellular matrix. These effectors can manipulate host cell functions once they enter the host cell. Identifying effectors and exploring their molecular mechanisms are not only critical to understanding disease mechanisms but also provide theoretical foundations for the diagnosis, prognosis and treatment of infectious diseases [3, 4].
A well-studied effector is Cytotoxin-associated gene A (CagA), one of the most important virulence factors in Helicobacter pylori (H. pylori), which is a major pathogen of upper gastrointestinal diseases (e.g., peptic ulcer and gastric cancer). CagA can be delivered into gastric epithelial cells by the T4SS of H. pylori. Recent studies of CagA sequences found that they have a variable region within which the EPIYA (glutamic acid-proline-isoleucine-tyrosine-alanine) motif is repeated one to seven times. The tyrosine in the EPIYA motif can be phosphorylated in the host cell. The phosphorylated CagA protein binds to the phosphatase SHP-2, which interferes with the signal transduction pathway of the host cell and manipulates cell growth, differentiation and apoptosis [6–8]. This interference causes restructuring of the host cell cytoskeleton, cell scattering, invasive growth of cells, and formation of the hummingbird phenotype in gastric epithelial cells. Such a process is considered not only an important strategy of interaction between H. pylori and the host cell, but also the most significant mechanism of pathogenesis and carcinogenesis of H. pylori [9–11].
Table 1. Experimentally determined tyrosine-phosphorylated effectors and their motifs
Columns: Locus of protein; Motif (phosphorylated Y position)
Table 2. Distribution of protein sequences containing the EPIYA motif
Columns: Number of motif repeats in one protein; Number of protein sequences
We found that the repeats of the EPIYA motif in a protein are highly non-random. As the probability of one protein sequence having a copy of the EPIYA motif is 1.13E-02 (104,116/9,216,047), the expected probabilities of one protein sequence containing 2, 3 and 4 copies of the EPIYA motif are (1.13E-02)^2 = 1.28E-04, (1.13E-02)^3 = 1.44E-06, and (1.13E-02)^4 = 1.63E-08, respectively, assuming that motif occurrences within a sequence are independent and random. The observed probabilities of one sequence containing multiple copies of the EPIYA motif are much larger than the expected probabilities, as shown in Table 2. Hence, the repeats of the EPIYA motif may have resulted from evolution and may carry biological significance. This is also reflected in Table 1, where most effectors with a known EPIYA motif have 2-7 motif repeats. Thus, we suggest that multiple copies of the EPIYA motif in the same protein are more likely to be functional than a single motif occurrence.
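The expected probabilities quoted above can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic under the independence assumption stated in the text; the variable and function names are illustrative and do not come from the original analysis, which used Perl and SAS.

```python
# Counts taken from the text: sequences with one motif copy, and database size.
single_copy_sequences = 104_116
total_sequences = 9_216_047

# Probability that a sequence carries one copy of the EPIYA motif.
p_single = single_copy_sequences / total_sequences


def expected_probability(copies: int) -> float:
    """Expected probability of `copies` motif occurrences in one sequence,
    assuming occurrences are independent (the assumption made in the text)."""
    return p_single ** copies


if __name__ == "__main__":
    print(f"one copy: {p_single:.2e}")
    for k in (2, 3, 4):
        print(f"{k} copies: {expected_probability(k):.2e}")
```

Comparing these values with the observed fractions in Table 2 is what supports the claim that repeats are over-represented.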
Table 3. Distribution of EPIYA-motif containing proteins at genus and species levels (as of July 6th 2009)
Columns: Number of genera (with copies of motif ≥1 (%); with copies of motif ≥2 (%)); Number of species (with copies of motif ≥1 (%); with copies of motif ≥2 (%))
We listed the top 10 species and genera with the most EPIYA-motif containing proteins for the groups of archaea, viruses, bacteria, protista, fungi, metazoa and viridiplantae (Additional File 1). In archaea, Methanococcus is the genus that includes the most EPIYA-motif containing proteins. In viruses, Potyvirus has the highest number of EPIYA-motif containing proteins among genera, while Bovine Viral Diarrhea Virus has the highest among species. The top four genera (and the corresponding species) in bacteria are Helicobacter (Helicobacter pylori), Clostridium (Clostridium botulinum, Clostridium perfringens), Bacillus (Bacillus cereus) and Anaplasma (Anaplasma phagocytophilum). Plasmodium (Plasmodium falciparum) and Tetrahymena (Tetrahymena thermophila) are the top genera in protista. In fungi and viridiplantae, the corresponding top genera are Candida (Candida tropicalis) and Oryza (Oryza sativa), respectively. Two well-studied genera, Drosophila (Drosophila melanogaster) and Homo (Homo sapiens), take the top two places in metazoa. It should be noted that the data in Additional File 1 are biased: widely studied species such as Helicobacter pylori have the same gene sequenced many times, while some other species have incomplete proteomes. Nevertheless, the table in Additional File 1 provides a useful reference for known and putative pathogens with effectors.
Table 4. Known intracellular bacterial pathogens or bacteria containing type III/IV secretion systems, and intracellular parasitic protozoa
Number of species
Number of species
Table 5. Distribution of top 40 protein sequences containing at least two copies of the EPIYA motif
Number of proteins (number of genera)
zinc finger protein
TPR repeat protein
dynein heavy chain
elongation factor 2
FAT tumor suppressor homolog 3
paternally expressed 3
26S proteasome regulatory subunit
cell division protein
centaurin, delta 3
fat tumor suppressor homolog 2
guanine nucleotide exchange factor
cytochrome c oxidase subunit VI
polysaccharide biosynthesis protein
translation initiation factor
Although many of these predicted effectors are false positives and the EPIYA motif may not be functional in them, a significant portion of them are likely to be true effectors. As known effectors, ankyrin and TPR (tetratricopeptide repeat) proteins are related to protein-protein interaction [21, 22]. Considering the sequence similarity of the above proteins, the 44 ankyrin sequences are highly similar to each other and come exclusively from Anaplasma phagocytophilum, Wolbachia endosymbiont and Ehrlichia sp., all of which belong to Rickettsiales. Except for the sequences from Haliangium ochraceum (ZP_03879805 and ZP_03880192), the other sequences of TPR repeat-containing proteins are also similar, and they are from Trichodesmium erythraeum, Stigmatella aurantiaca, Acaryochloris marina, Cyanothece sp. and Microcoleus chthonoplastes. Among the 20 hypothetical proteins, YP_034066 and YP_001610012 (Bartonella), YP_153762 and YP_002563468 (Anaplasma), XP_001623017 and XP_001636029 (Nematostella), ZP_01620341 and ZP_01622571 (Lyngbya), and XP_001468598 and XP_001686356 (Leishmania) form pairs with similar sequences (more than 30% sequence identity within each pair), and the two proteins in each pair are from the same genus. The EPIYA motif in these proteins is highly conserved during evolution, and it may play a role similar to that of the motif in CagA.
Among the proteins containing at least four copies of the EPIYA motif (Additional File 2; 286 sequences in total), most are from bacteria, especially intracellular bacterial pathogens or extracellular bacterial pathogens with T3SS or T4SS, and some are from protists, e.g., intracellular protozoan parasites. Four out of eight known effectors (CagA, Ankyrin, BepD, and Tarp) are found among these sequences, and thus other proteins from bacteria and protista in Additional File 2 may also be effectors. An interesting observation is that the percentage of protein sequences having the EPIYA motif is highest in archaea among all groups (see Table 3), but none of these archaeal proteins contains four or more copies of the EPIYA motif. Previous studies revealed that CagA sequences with more EPIYA-motif occurrences are more virulent. Since archaea have relationships of either mutualism or commensalism with other organisms, and so far there is no clear evidence for the existence of archaeal parasites [23, 24], it is unlikely that the archaeal proteins containing the EPIYA motif act as pathogen effectors. Compared to other groups, archaea are not well studied, but we can still find some interesting examples, such as Methanobrevibacter smithii (ranking 5th in the species list in Additional File 1). It is the most common commensal archaeon in the human gut and plays an important role in digesting polysaccharides, although it may not benefit the host directly. We speculate that EPIYA-motif containing proteins of Methanobrevibacter smithii may have some biological function in this commensal interaction.
The many functions listed in Table 5 may reflect the fact that these proteins have multiple functions other than phosphorylation-induced signalling control in the host cell. Some of these proteins may mimic host protein functions. For example, it was suggested that CagA functions as a prokaryotic mimic of the eukaryotic Grb2-associated binder (Gab) adaptor protein. Some of the predicted effectors may mimic signalling proteins, such as HPK (histidine protein kinase) listed in Table 5, which is an important part of the two-component signal transduction system that recognizes and transmits environmental signals. Some known effectors induce protein expression with increased expression of RNA polymerase. It is therefore not surprising that a significant number of proteins in Table 5 are related to protein synthesis, such as RNA polymerase, elongation factor, and helicase. It is noted that CagA itself also contains an RNA polymerase domain based on a BLAST search. These connections also suggest possible ancestor proteins of the predicted effectors. Many of the proteins listed in Table 5 are encoded by ancient house-keeping genes. The predicted effectors might have evolved from these house-keeping genes by mimicking the host genes. Furthermore, over the course of evolution some of the effectors or their ancestors might have evolved into genes with different functions unrelated to host-pathogen interactions, such as the EPIYA-motif containing proteins in archaea and metazoa.
Sequences containing KK and R4 motifs in known effectors
Bartonella tribocorum: Since BepH contains the EPLYAQVNK (YP_001610013, Y-8) motif (KK motif), we predicted it as a phosphorylation effector like BepD-F secretory proteins.
Lawsonia intracellularis: Lawsonia intracellularis is an obligate intracellular bacterial pathogen that infects a wide range of animals, mainly pigs, and causes proliferative enteropathy, a contagious disease [30, 31]. Its symptoms are acute and include diarrhea, loss of appetite and stunting. After an initial close association with the cell membrane of the enterocytes, Lawsonia intracellularis is endocytosed into the host cell. Infected host cells are inhibited in maturation, continue to undergo mitosis and proliferation, and eventually form hyperplastic crypts, but the mechanism is unknown. The genome sequence of Lawsonia intracellularis indicates that it may possess a type III secretion system, which may assist the bacterium during cell invasion and evasion of the host's immune system and could be a mechanism for inducing cellular proliferation [34, 35], but effectors secreted by its T3SS have never been reported. The current database contains 20 proteins of Lawsonia intracellularis with the EPIYA motif, and all of them are from strain PHE/MN1-00. The maximum sequence identity between any two of these 20 proteins is 22%, and most of them are enzymes, e.g. ATP synthase. Among them, in the HMM search result using the R4 motif, we found that hypothetical protein L10666 (YP_595041) contains two copies of the EPIYA motif (EPIYAEIKT Y-149, EPIYAEIKT Y-186), which are similar to the R4 and Tir motifs, respectively. Thus, we speculate that this protein might be the effector that Lawsonia intracellularis uses to interact with intestinal epithelial cells.
Ehrlichia sp.: It belongs to the same family, Ehrlichiaceae, as Anaplasma. Ankyrin of Ehrlichia sp. and ankyrin of Anaplasma share 89% sequence identity. Ankyrin (T08612) of Ehrlichia sp. contains six copies of 9-mer motifs including the KK and R4 motifs, and thus it is a likely effector that Ehrlichia uses to interact with the host.
Wolbachia: Wolbachia belongs to Rickettsiales. Wolbachia is a symbiotic bacterium found in the sex organs of many insects. Although ankyrin (AAY54257) of Wolbachia and ankyrin of Anaplasma share only 15% sequence identity, they contain almost exactly the same motifs. Hypothetical protein WD0942 (NP_966676), which is not similar to ankyrin in sequence, has two motifs, and one of them is EPIYATVPK (Y-318), which is similar to the KK motif. EsorChan1 (AAP34173) contains the motif EPIYDEVYD (Y-77), which is similar to the Tir motif. Therefore, the above three proteins, especially the first two, are potential effectors of Wolbachia.
Pasteurella multocida: The major pathogen causing infectious atrophic rhinitis in swine, it secretes the toxin filamentous hemagglutinin, which contains six copies of the EPIYA motif. Based on the BLAST search results, we found that the filamentous hemagglutinin (AAK61595) of Pasteurella multocida and the filamentous hemagglutinins of Bordetella pertussis and Bordetella parapertussis share ~30% sequence identity. Filamentous hemagglutinin, the major virulence factor of Bordetella pertussis, not only has an adhesion function but also plays a critical role in immunomodulation. Since filamentous hemagglutinin has the sequence EDIYATINK (Y-2792), which is similar to the KK motif, the sequences EHIYADIRD (Y-2550) and ENLYAEISD (Y-2651), both of which are similar to the R4 motif, and the sequence EHLYAEINE (Y-2387), which is similar to the Tir motif, we suggest that filamentous hemagglutinin is an effector of Pasteurella multocida and that it might be secreted by the TPS (Two-Partner Secretion) system. PfhB2 (NP_244996) has four sequences that are similar to the KK, R4 and Tir motifs, and thus it might be another candidate effector in Pasteurella multocida.
Haemophilus ducreyi: Haemophilus ducreyi is a facultative anaerobic Gram-negative coccobacillus that can cause the sexually transmitted disease chancroid. Large supernatant protein2 (NP_873623) of Haemophilus ducreyi has six copies of the EPIYA motif. Its sequence and that of the filamentous hemagglutinin of Bordetella pertussis share 41% sequence identity. Its sequences EPVYADLHF and EPVYADLRF are similar to the R4 motif. Hence, we suggest that large supernatant protein2 (NP_873623) is a potential effector of Haemophilus ducreyi and that it could be secreted by T4SS. The effector can lead to immunosuppression, inhibition of proliferation, and permanent changes in host cells [42–44].
Haemophilus somnus: Haemophilus somnus can survive in host cells and is the cause of a variety of systemic diseases in cattle, including thrombotic meningoencephalitis, pneumonia, arthritis, myocarditis, septicemia and reproductive diseases [45, 46]. Cysteine protease domain YopT-type (YP_001784809) and the filamentous hemagglutinin of Bordetella pertussis share 42% sequence identity. The sequence EPIYATLDK (Y-2933) in YP_001784809 is similar to the KK motif, EHIYEQIGE (Y-2358) is similar to the Tarp motif, and EPVYDKVSA (Y-2287) is similar to the Tir motif. Thus, YP_001784809 might be the effector through which Haemophilus somnus causes immunosuppression.
Chlamydophila pneumoniae: Hypothetical protein CPj0472 (NP_300527) contains three copies of the EPIYA motif. EPIYANTPE (Y-647) is similar to the KK motif, EPIYEEIGG (Y-346) is similar to the Tir motif and EPIYDEIPW (Y-681) is similar to the R4 motif. Although we did not find any similar protein through a BLAST search, hypothetical protein CPj0472 (NP_300527) is a good candidate effector of Chlamydophila pneumoniae.
Leishmania major: Leishmania major can parasitize the phagocytes of humans and other mammals and is responsible for leishmaniasis, a serious zoonosis. Leishmania major has 6 proteins containing at least two copies of the EPIYA motif, and 4 of them are proteins with unknown functions. Among these 6 proteins, Cytochrome C oxidase subunit VI (XP_001683136) contains two copies of the EPIYA motif. One is at position Y-107 with sequence EPLYQPVKK, which is similar to the KK motif. The other is at position Y-130 with sequence EPLYDVDAA, which is similar to the Tir motif. Hence, XP_001683136 might be an effector. Hypothetical protein (XP_001686159) has three copies of the EPIYA motif and the sequences are all EPLYAVTIE, which is similar to the KK and R4 motifs. Hypothetical protein (XP_001686160) also has three copies of the EPIYA motif and the sequences are all EPLYAVTID, which is similar to the R4 motif. In addition, hypothetical proteins XP_001686159 and XP_001686160 share 43% sequence identity. Hypothetical protein (XP_001686356) has 29 copies of the EPIYA motif (the most in our data) and all sequences are EPLYAVTLE, which is similar to the R4 motif. Microtubule-associated protein (XP_001687515) contains two copies of the EPIYA motif, one at Y-1543 and the other at Y-1589. The sequence for both is ESIYAKDYK, which is similar to the KK motif. Thus, we predict that hypothetical protein (XP_001686159), hypothetical protein (XP_001686160), hypothetical protein (XP_001686356) and Microtubule-associated protein (XP_001687515) might also be effectors of Leishmania major. Another potential effector, hypothetical protein (XP_001683914), contains two copies of the EPIYA motif (ESLYE at Y-1006 and EHLYD at Y-1047), but they are not similar to the KK, R4, Tarp or Tir motif, and hence this protein is less likely to be an effector than the above five proteins.
Plasmodium falciparum: Plasmodium falciparum can invade human liver cells and red blood cells to cause malaria, a dangerous infection. It has many proteins with the EPIYA motif, including 47 proteins with at least two copies of the EPIYA motif. Among them, Plasmodium exported protein (XP_001347309) has three copies of the motif, all of which are similar to the Tarp motif. The sequences and the corresponding pY sites are ESIYKNKLK (Y-331), ESIYKNKLK (Y-359), and ESIYKNKLE (Y-387). Thus we predict it to be an effector of Plasmodium falciparum. Conserved Plasmodium protein (XP_001347469) has eight copies of the EPIYA motif, RNA pseudouridylate synthase (XP_001350676) has nine copies and hypothetical protein (XP_001351018) has three copies, but none of them contains any of the KK, R4, Tir and Tarp motifs, and therefore they are less likely to be effectors than Plasmodium exported protein (XP_001347309).
We performed subcellular localization prediction for all the predicted bacterial effectors above by using CELLO v.2.5 (http://cello.life.nctu.edu.tw). All the associated bacteria are Gram-negative. As a real effector should be secreted from a Gram-negative bacterium and then enter a eukaryotic host, we ran the localization prediction twice, once with Gram-negative bacteria and once with eukaryotes as the organism type. When using Gram-negative bacteria as the organism type, 9 out of 11 effectors were predicted to be extracellular or outer-membrane proteins (Additional File 8). When repeating the prediction with eukaryotes as the organism type, all 11 effectors were predicted to have nuclear localization (Additional File 8). These results show that most of our predicted effectors have the localization attributes expected of effectors, which provides some supporting evidence for our effector predictions.
In this paper, we showed that the EPIYA motif might be a ubiquitous functional site for effectors that play an important role in pathogenicity by mediating host-pathogen interactions. Most known effectors have more than one copy of the EPIYA motif. The predicted effector sequences of pathogens from the same genus are likely homologous, whereas those from different genera are rarely homologous although they often share common motifs. Most of the pathogens involved are intracellular bacteria or extracellular bacteria that cause long-term chronic infections, e.g., H. pylori. Effectors are usually secreted by T3SS or T4SS to enter host cells, where they interfere with the signal transduction pathways of the host cell to disturb host cell functions, mainly actin polymerization, cell proliferation, apoptosis and immune responses, and thereby improve the survival and propagation of the microorganism during host-pathogen interaction.
Our study predicted many putative effectors. We grouped the phosphorylated EPIYA motifs into four types, KK, R4, Tir and Tarp, based on the sequence features of the five amino acids after Y, and then used each type individually to build an HMM. After using the HMMs to search our database and considering the known pathogenic characteristics of the pathogens, we predicted some effectors of bacteria, and we suggest that our method can discover more effectors with the EPIYA motif. Besides the findings in bacteria, we also found many protein sequences containing the EPIYA motif in protist pathogens. Intracellular protozoan parasites can survive and reproduce in host cells by subverting host cell signalling to induce downstream effects, e.g., inhibiting apoptosis of host cells and restructuring the host cell cytoskeleton. However, the pathogen mediators responsible for this modulation are still unknown. Based on this study, we hypothesize that during the interaction between protist and host, there is a secretion system that secretes effectors to disturb the signal transduction pathways of the infected host and to control the apoptosis of host cells.
Our predictions provide useful hypotheses for further studies exploring pathogenic mechanisms in host-pathogen interactions. They also have scientific and clinical implications for the prevention and treatment of infectious diseases, as they may provide some guidance for vaccine and drug development. That said, EPIYA-motif containing proteins do not exist in all intracellular bacteria, and therefore EPIYA-motif mediated interaction is only one of various host-pathogen mechanisms. Furthermore, our prediction results are based on computation and certainly contain false positives, and thus they require further experimental validation.
Protein sequence data: We used the NR (non-redundant) protein database at the National Center for Biotechnology Information (NCBI) in this study. All protein sequences in the FASTA format were downloaded from the NCBI site (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz; as of July 6th 2009; 9,216,047 sequences). We excluded “other” sequences and “unclassified” sequences in the database (as labelled in http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Root).
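The nine-residue windows discussed above (three residues before the tyrosine, the tyrosine itself, and the five residues after it) can be pulled out of a sequence with a short Python sketch. The regular expression used here is a hypothetical, deliberately loose stand-in for the five-residue HMM, chosen only because it matches the EPIYA-like variants quoted in this paper (for example EPIYA, EPLYA, ESIYK); it is not the authors' model, and the function name is illustrative.

```python
import re

# Hypothetical loose pattern: E, any residue, I/L/V, Y, then five more residues.
# The lookahead allows overlapping windows to be reported.
NINE_MER = re.compile(r"(?=(E.[ILV]Y.{5}))")


def extract_nine_mers(sequence):
    """Return (nine_mer, tyrosine_position) pairs for EPIYA-like windows.

    Tyrosine positions are 1-based, matching the Y-<position> notation
    used in the text.
    """
    hits = []
    for match in NINE_MER.finditer(sequence):
        start = match.start(1)                      # 0-based start of the window
        hits.append((match.group(1), start + 4))    # Y is the 4th residue
    return hits


if __name__ == "__main__":
    # Toy sequence built from two 9-mers quoted in the text; not a real protein.
    toy = "MKEPIYATVPKGGEPLYAQVNK"
    for nine_mer, y_pos in extract_nine_mers(toy):
        print(f"{nine_mer} (Y-{y_pos})")
```

Each extracted window could then be compared with the KK, R4, Tir and Tarp groups, for instance by scoring it against one profile per group.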
Taxonomy data: The taxonomy data was obtained from the NCBI website (http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi; as of July 6th 2009).
A hidden Markov model was built by using Hmmer 2.3.2 (http://hmmer.janelia.org). We used selected sequences to run the command hmmbuild.exe for building and calibrating the HMM. We then used the HMM to run the command hmmsearch.exe for searching protein sequences. We used a natural cutoff for the HMM score, chosen so that the lowest-scoring known motif is still retrieved.
We used Perl (release ActivePerl 5.8.8) as the programming language to analyse the data and build the database. We applied SAS 9.0 (http://www.sas.com) as the statistical analysis tool and chose p<0.01 as the significance threshold.
BioEdit 7.0 (http://www.mbio.ncsu.edu/BioEdit/bioedit.html), Lasergene 7 (http://www.dnastar.com/products/lasergene.php), and BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) were used to compare and analyse the protein sequences. Sequence logos were constructed using WebLogo.
This work was partially supported by the International Exchange and Cooperation Office of Nanjing Medical University, China. It was also supported in part by the US National Institutes of Health [grant number R21/R33 GM078601]. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).
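The idea of a "natural cutoff" can be illustrated without HMMER. The sketch below is a simplified stand-in, not the authors' pipeline: it replaces the profile HMM with a gap-free position-specific scoring matrix (log-odds against a uniform background, with a pseudocount), sets the cutoff to the lowest score among the training motifs, and scans a sequence with a sliding window. The training motifs, the pseudocount and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(motifs, pseudocount=1.0):
    """Per-position log2-odds scores against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(motifs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(len(motifs[0])):
        counts = Counter(m[pos] for m in motifs)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, window):
    """Sum of column scores; unknown residues get the column minimum."""
    return sum(col.get(aa, min(col.values())) for col, aa in zip(profile, window))


def natural_cutoff(profile, motifs):
    """Lowest score among the known motifs, so that all of them are retrieved."""
    return min(score(profile, m) for m in motifs)


def search(profile, sequence, cutoff):
    """Return (1-based start, window) for every window scoring >= cutoff."""
    width = len(profile)
    return [
        (i + 1, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
        if score(profile, sequence[i:i + width]) >= cutoff
    ]


if __name__ == "__main__":
    # Illustrative training set only; not the eight known effector motifs.
    known = ["EPIYA", "EPIYA", "EPLYA", "ESIYE", "EHLYA"]
    profile = build_profile(known)
    cutoff = natural_cutoff(profile, known)
    print(search(profile, "MKEPIYAGG", cutoff))
```

A cutoff chosen this way guarantees that every training motif is recovered, at the cost of admitting any window that scores at least as well as the weakest known example, which is one reason false positives are expected.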
This article has been published as part of BMC Genomics Volume 11 Supplement 3, 2010: The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.