Prediction of bacterial type IV secreted effectors by C-terminal features

Background Many bacteria can deliver pathogenic proteins (effectors) through type IV secretion systems (T4SSs) to eukaryotic cytoplasm, causing host diseases. The inherent property, such as sequence diversity and global scattering throughout the whole genome, makes it a big challenge to effectively identify the full set of T4SS effectors. Therefore, an effective inter-species T4SS effector prediction tool is urgently needed to help discover new effectors in a variety of bacterial species, especially those with few known effectors, e.g., Helicobacter pylori. Results In this research, we first manually annotated a full list of validated T4SS effectors from different bacteria and then carefully compared their C-terminal sequential and position-specific amino acid compositions, possible motifs and structural features. Based on the observed features, we set up several models to automatically recognize T4SS effectors. Three of the models performed strikingly better than the others and T4SEpre_Joint had the best performance, which could distinguish the T4SS effectors from non-effectors with a 5-fold cross-validation sensitivity of 89% at a specificity of 97%, based on the training datasets. An inter-species cross prediction showed that T4SEpre_Joint could recall most known effectors from a variety of species. The inter-species prediction tool package, T4SEpre, was further used to predict new T4SS effectors from H. pylori, an important human pathogen associated with gastritis, ulcer and cancer. In total, 24 new highly possible H. pylori T4S effector genes were computationally identified. Conclusions We conclude that T4SEpre, as an effective inter-species T4SS effector prediction software package, will help find new pathogenic T4SS effectors efficiently in a variety of pathogenic bacteria.

Background
Type IV secretion system (T4SS) is a membrane-associated multi-component transporter complex, which plays important roles both in horizontal DNA transfer between different bacteria and in bacterial pathogenesis by translocating pathogenic substrates (DNA or protein) into host plant, animal or human cells [1,2]. A large number of T4SSs have been identified in a variety of bacterial species [1,2]. In many cases T4SSs have been implicated in protein delivery during the infection process, such as Helicobacter Cag-T4SS in human gastric ulcer and cancer, Legionella Dot/Icm-T4SS in [...] methods. Additionally, in a given bacterial strain, most effectors are scattered throughout the genome rather than clustered in a narrow genomic region. Moreover, the validated effectors in different species are significantly diverse in sequence. Therefore, the bioinformatic methods so far used, based essentially on sequence comparison, can hardly reveal new effectors.
Recently, two groups of investigators performed large-scale screening of T4S effectors by bioinformatic analysis [17,18]. Integrating multiple features including gene G + C content, sequence conservation, within-genome gene organization, regulatory elements, signal sequence composition, etc., Burstein et al. set up, for the first time, a machine learning method to predict and experimentally identify new T4S effectors from Legionella pneumophila [17]. The prediction accuracy was considerably high, but the method is suitable only for T4S protein prediction in Legionella or closely related species, since the training sequences are all from Legionella and the features describing sequence conservation, gene organization and regulatory elements are specific to Legionella. In addition, a similar training pipeline cannot be used to develop T4S effector predictors for a broader range of bacteria, because the numbers of validated T4S effectors in most other bacterial genera, unlike in Legionella (more than 100), are so small (0-5) that the training data cannot provide reliable feature information. In another study, based on weak sequence similarity with Legionella effectors, Chen et al. identified a group of effectors in Coxiella burnetii [18]. Most effectors, however, especially those in distantly related species, have no or very low sequence similarity. Therefore, new effectors without sequence similarity cannot be captured through sequence alignment.
We have focused on Helicobacter pylori to predict T4S effectors for insights into the pathogenesis of the distinct infections caused by these bacteria. H. pylori may elicit human gastritis and gastric ulcer, and this pathogen is also associated with gastric cancer [4]. In the pathogenesis, Cag-T4SS plays key roles as an important virulence factor in the bacterial interaction with human stomach cells [3,4]. To date, only one effector, CagA, has been identified, although several lines of evidence have indicated that there should be other effectors that participate in bacterial infection and pathogenesis [4,19,20]. No experimental, sequence alignment or comparative genomic methods are available for identifying new effectors. Nor can a single protein, CagA, provide any statistical information about its sequence features as a T4S effector.
Numerous reports have indicated that, in many different bacteria, the C-terminal peptide sequences of T4S effectors are necessary for their secretion [21-25]. Do these amino acid sequences share any common composition or structural features among different effectors in different bacterial species? Could such features, if any, be used to develop an inter-species T4S effector predictor? Such a generally suitable prediction tool would be especially useful for identification of new effectors in species like H. pylori, which is supposed to have multiple effectors that are not experimentally validated yet and lacks a sufficient number of within-species validated effectors for species-specific effector feature extraction. Recently, many inter-species prediction tools have been developed to predict type III secreted (T3S) effectors [26-32], but no similar software tool has been developed for T4S effector prediction. In this research, we collected a full set of T4S effectors and made systematic comparisons of their C-terminal sequence-based and position-specific amino acid compositions, motifs, secondary structures and solvent accessibility properties. Based on these features, we developed a series of machine learning methods to classify T4S effectors and non-effectors. To our knowledge, this is the first inter-species T4S protein prediction tool, which can be applied to different bacteria and is especially useful for bacteria that have limited effector information for species-specific bioinformatic analysis.

Results
Sequence-based amino acid composition (Aac) differences between C-termini of T4S and non-T4S proteins
The T4S proteins were annotated from the literature, while the non-T4S proteins were randomly selected from the genome-encoded proteins after removal of known T4S proteins and their homologs (Methods). The number of non-T4S proteins was twice that of T4S proteins. The GC content of the nucleotide sequences encoding the T4S proteins was roughly equal to that of the non-T4S encoding nucleotide sequences (Methods).
Comparisons were performed on sequence-based Aac of the C-terminal 100 positions (C100) between T4S and non-T4S sequences. Most amino acid species were not equally distributed in the two types of sequences, with glutamic acid, serine, lysine, threonine, asparagine and proline enriched and isoleucine, glycine, valine, tyrosine, tryptophan, methionine, leucine, phenylalanine and alanine depleted in T4S sequences (p < 0.05, Bonferroni-corrected binomial test and t-test; Figure 1A). The relative enrichment ratio of Aac was calculated for each amino acid species that showed a statistical difference between T4S and non-T4S sequences. Glutamic acid and serine had the largest enrichment in T4S sequences, whereas isoleucine, tyrosine and glycine were enriched in non-T4S sequences (tryptophan and methionine were not considered because of their low occurrence in both types of sequences) (Figure 1A). The relative enrichment ratios of biased amino acids between T4S and non-T4S [...]

Figure 1 Sequence-based Aac difference between T4S and control proteins for C-terminal 100-aa positions. (A) Single-residue composition difference. The different amino acids are listed along the horizontal axis, and the length of each bar represents the frequency of the corresponding amino acid. T4S and non-T4S proteins are represented in black and gray, respectively. Amino acids with significantly different compositions between effectors and non-effectors are indicated with a star above the bar (Bonferroni-corrected Student's t test and binomial test, p < 0.05). The logarithm of the amino acid frequency ratio is also shown, with red representing preference and black representing depletion in effectors. (B) Continual and spanned bi-residues with statistically significant composition difference between effectors and non-effectors (Bonferroni-corrected Student's t test and binomial test, p < 0.05). 'Px' represents 'Position x'. 'X' represents any type of amino acid. The amino acid at the last position is in red if the corresponding bi-residue was preferred and in black if depleted in T4S sequences. (C) Distribution of motifs in T4S and non-T4S proteins (T4S, all motifs: 175 (50.4%); non-T4S, all motifs: 82 (11.8%)).
Tri-residue (tAac) and quart-residue (qAac) compositions were further compared, so as to refine the conserved motifs buried in T4S signal sequences. Taking into account the bi-residue composition preference described above, a consensus method disclosed three degenerate motifs, [PQRS]S', which were significantly enriched in T4S sequences (p < 0.05, Bonferroni-corrected binomial test).
In total, more than 50% (175/347) of the T4S sequences contained at least one of these three motifs, whereas only 12% (82/694) of the non-T4S sequences contained one or more of them ( Figure 1C and Additional file 3: Table S2). The motifs existed in effectors of different bacteria with IVA or IVB T4SS (Additional file 3: Table S2).
The patterns with more than four residues were quite degenerate, and represented by very few T4S sequences (data not shown).
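The sequence-based composition comparison described in this subsection can be illustrated with a minimal sketch. It pools residue counts over the C-terminal window of each sequence set and reports a log ratio of frequencies per amino acid. The function names, the base-2 logarithm, the pseudocount and the toy sequences are assumptions made for illustration; the significance tests (Bonferroni-corrected binomial test and t-test) are not reproduced here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def c_terminal(seq, n=100):
    """Return the last n residues (the whole sequence if it is shorter)."""
    return seq[-n:]

def aac(seqs, n=100):
    """Pooled amino acid frequencies over the C-terminal n residues of seqs."""
    counts = Counter()
    for s in seqs:
        counts.update(c for c in c_terminal(s, n) if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}

def log_enrichment(pos_seqs, neg_seqs, n=100, pseudo=1e-6):
    """log2(frequency in positives / frequency in negatives) per amino acid.

    The pseudocount (an assumption) avoids division by zero for residues
    absent from one of the two sets.
    """
    p, q = aac(pos_seqs, n), aac(neg_seqs, n)
    return {a: math.log2((p[a] + pseudo) / (q[a] + pseudo)) for a in AMINO_ACIDS}

# Toy, made-up sequences used purely for illustration
t4s = ["MKTEESSEEKSE", "MSEEKKSSTE"]
non_t4s = ["MGIVLLAGIY", "MAVLGGIIFA"]
ratios = log_enrichment(t4s, non_t4s, n=8)
```

A positive value in `ratios` indicates a residue that is more frequent in the positive set, a negative value one that is more frequent in the negative set.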

Distinct position-specific Aac profiles in C-termini of T4S effectors
Besides sequence-based Aac preference in T4S signal peptides, the position-specific Aac profiles were also compared between T4S and non-T4S sequences. As shown in Additional file 4: Figure S2 and Figure 2, T4S sequences showed apparently different amino acid composition profiles from non-T4S sequences. These differences were most striking for C-terminal 1-50 (especially 1-25) positions (Additional file 4: Figure S2). More positions in T4S effectors exhibited specific amino acid preference, while in non-T4S sequences, different species of amino acids appeared more evenly distributed at each position (Figure 2A and B). Consistent with the sequence-based observations, glutamic acid, serine and lysine were also frequently preferred in T4S sequences (Figure 2A). Leucine was enriched in both T4S and non-T4S sequences (Figure 2A [...] species did not show equal preference. Some amino acids were enriched while some others were depleted significantly (Figure 3A; Additional file 5: Table S3). Tryptophan and cysteine were most generally depleted in T4S C-termini. Additionally, leucine (enriched), methionine (depleted), serine (enriched), glutamic acid (enriched or depleted) and histidine (depleted) were also frequently biased in the composition (Figure 3B; Additional file 5: Table S3). The total number of amino acids with significant position-specific composition difference between T4S and non-T4S proteins was much smaller than that of theoretically biased amino acids in T4S proteins, demonstrating that there are many common amino acid composition biases between the two types of proteins (Additional file 5: Table S3). However, the difference between T4S and non-T4S proteins was even more pronounced at the C-terminal 30 positions (Figure 3C).
The most profound composition difference between T4S and non-T4S sequences in most positions was the frequency bias of glutamic acid (enriched or depleted), followed by those of serine (enriched), aspartic acid (enriched or depleted), proline (enriched or depleted), threonine (enriched) and phenylalanine (enriched or depleted) (Figure 3D). It should be noted that leucine was also frequently biased (depleted) in T4S sequences compared with its composition in non-T4S sequences, indicating the larger enrichment in the latter (Figure 3B and D). Other amino acids, e.g., cysteine, tryptophan, methionine and histidine, did not contribute much to the composition bias, as they are depleted in both T4S and non-T4S proteins (Figure 3B and D). Notably, glutamic acid, though enriched in most C-terminal positions of T4S proteins when compared with non-T4S proteins, showed significant depletion in C-terminal positions 1-4 of T4S proteins and was significantly enriched at positions 9 to 19 continuously (Additional file 5: Table S3). Some of the amino acids enriched or depleted in T4S sequences (e.g., serine, threonine, proline and glutamic acid) could be related to secondary structure and hydrophilicity, two possibly important secondary features related to signal recognition [26,30]. The biological relevance of the biases of the amino acids remains to be clarified.
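A position-specific profile of the kind compared above can be sketched as follows. Positions are counted from the C-terminus, so position 1 is the last residue, matching the "C-terminal positions 1-4" convention in the text. The function names and the toy sequences are hypothetical, and the sketch only tabulates frequencies; it does not perform the statistical tests behind Figure 3.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_profile(seqs, n=100):
    """profile[i][aa] = frequency of aa at C-terminal position i + 1.

    Position 1 is the last residue. Sequences shorter than a given
    position simply do not contribute to that position.
    """
    profile = []
    for pos in range(1, n + 1):
        counts = Counter(s[-pos] for s in seqs if len(s) >= pos)
        total = sum(counts.values())
        profile.append(
            {a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS}
        )
    return profile

def profile_difference(pos_profile, neg_profile, aa):
    """Per-position frequency difference (positives minus negatives) for one residue."""
    return [p[aa] - q[aa] for p, q in zip(pos_profile, neg_profile)]

# Toy, made-up sequences used purely for illustration
toy_pos = position_profile(["AKE", "GSE"], n=2)
toy_neg = position_profile(["AKL", "GSL"], n=2)
diff_e = profile_difference(toy_pos, toy_neg, "E")
```

In such a table, a run of positive differences for one residue over consecutive positions would correspond to the kind of continuous enrichment reported for glutamic acid at positions 9 to 19.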

Structural flexibility of the C-termini of T4S effectors
The primary peptide sequence determines its secondary structure (Sse) and solvent accessibility (Acc), which may be associated with the specificity of signal recognition. Therefore, we compared the Sse and Acc composition in each C-terminal position of T4S effectors with those of the non-T4S proteins. As expected, T4S effectors showed a position-specific Sse preference pattern apparently different from that of the non-T4S proteins in the C-terminal region, especially at the C-terminal 40 positions (Additional file 6: Figure S3A and B). In contrast to helices in the non-T4S sequences, coils are more common in most regions of the T4S sequences, indicating that they are more flexible (Additional file 6: Figure S3A and B). Besides, β-strands were less frequently adopted by T4S sequences (Additional file 6: Figure S3A and B). T4S and non-T4S sequences also showed different position-specific Acc profiles, with more positions being exposed in the C-termini of T4S sequences (Additional file 6: Figure S3C and D). The distinct Sse and Acc profiles adopted by the C-terminal region of T4S effectors were similar to those of the N-terminal region of type III secreted (T3S) proteins, indicating possibly similar signal recognition mechanisms between the type IV and type III secretion systems [26].
When twenty T4S C-terminal peptides were randomly selected for 3D structure prediction, six peptides were predicted with high accuracy. The C-terminal ends of all six peptides form helices or coiled coils, always exposed outside (Additional file 7: Figure S4). A structure alignment showed that these six peptides could form a cluster with quite similar structures (37% structure similarity, <10 Å; Additional file 8: Figure S5A). Most interestingly, though without similarity at the sequence level, Legionella VipE (YP_096808.1) and YP_094180.1 had an extremely similar 3D structure, with a mirror symmetry for the C-terminal end parts (76% structure similarity, <5 Å; Additional file 8: Figure S5B). Legionella YP_094076.1 and Coxiella YP_001597263.1 also showed 74% similarity, and these four proteins, VipE, YP_094180.1, YP_094076.1 and Coxiella YP_001597263.1, had 52% structure similarity (<10 Å; Additional file 8: Figure S5C and D). The 3D structure similarity suggested that the high-order structure could exert an important function in specific T4S signal recognition.

Inter-species prediction of T4S effectors based on Aac and structural features
It is interesting to determine whether the distinct Aac (sequence-based and position-specific), motifs, Sse and Acc profiles can be used for distinguishing T4S proteins. Support Vector Machine (SVM) based machine learning models were therefore trained with different features and/or their combination, and their classification power was compared. SVM was adopted since it often generates high classification accuracy and especially high specificity [26-28,31]. Additional file 9: Table S4 shows the parameters optimized for different models.
As shown in Table 1, the decision model based only on the motifs detected above had the worst distinguishing power, with an average accuracy of 75.6%. The distinguishing power was similar among the models based on sequential Aac, bi-residue composition (bAac), their combination and the combination of significantly biased Aac and bAac between T4S and non-T4S peptides, in terms of sensitivity, specificity, accuracy, AUC and MCC values (Table 1). The SVM model based on position-specific, Single-Profile Bayesian (SPB) features performed only a little better than the sequence-based models (Table 1). The Bi-Profile Bayesian (BPB) model, however, considerably outperformed both the SPB model and the sequence-based models (Table 1 and Figure 4A). Interestingly, the combination of SPB Aac features and sequential Aac features could greatly improve the classifying performance, which was comparable to that of the BPB Aac model (Table 1 and Figure 4A). Inclusion of secondary structure and solvent accessibility improved the distinguishing power of both sequence-based models and position-specific Bayesian models. The model based on sequential joint features of Aac, Sse and Acc outperformed any other pure sequential features-based model (Table 1) [...] outperformed all other models in terms of any evaluation parameter (Table 1 and Figure 4B). The five-fold cross-validation sensitivity, specificity, accuracy, AUC and MCC of this model reached 89.14%, 97.14%, 94.57%, 0.9883 and 0.8770, respectively (Table 1).
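For readers unfamiliar with the evaluation parameters named above, the threshold-based ones (sensitivity, specificity, accuracy and the Matthews correlation coefficient, MCC) can be computed from a confusion matrix as in the sketch below. These are the standard definitions; the function name and the counts in the example are hypothetical and are not taken from Table 1. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics for a binary classifier.

    tp/fn: effectors predicted as effector / non-effector
    tn/fp: non-effectors predicted as non-effector / effector
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts with a 1:2 positive:negative ratio, for illustration only
example = classification_metrics(tp=90, fn=10, tn=190, fp=10)
```

With a 1:2 ratio of positives to negatives, as in the training data, accuracy is weighted toward specificity, which is why MCC is a useful complementary summary.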
We also tested the influence of different signal sequence lengths on model performance. Among the models based on C-terminal 25aa, 30aa, 40aa, 50aa and 100aa (C25, C30, C40, C50 and C100, respectively), C100 models apparently outperformed the others (data not shown). Since the models based on combined SPB Aac and sequential Aac features (T4SEpre_psAac), BPB Aac features (T4SEpre_bpbAac) and position-specific joint features of Aac, Sse and Acc (T4SEpre_Joint) showed the best performance on classification of T4S and non-T4S sequences, the rest of the research uses only these three models based on C-terminal 100aa signals. To further confirm the classification performance of these three models, we changed the size of the negative dataset (from 2-fold to 6-fold the size of the positive dataset, Additional file 10: Text S1), and assessed the performance with 5-fold and 10-fold cross-validation. As shown in Additional file 11: Table S5 and Additional file 12: Table S6, the prediction performance improved slightly when a larger negative dataset (Additional file 11: Table S5) was used and was quite stable when 5-fold (Additional file 11: Table S5) or 10-fold (Additional file 12: Table S6) [...]

It is also important to observe the inter-species effector discriminating power of the models. A Leave-One-Genus-Out strategy was proposed previously and adopted here. As shown in Figure 5, T4SEpre_Joint exhibited the best inter-species prediction performance, while T4SEpre_psAac performed worst among the three software tools. For most genera, T4SEpre_Joint could recall all or nearly all known effectors without any prior knowledge about the targeted genus (Figure 5A) and at very high prediction specificity (Figure 5B). The specificity of T4SEpre_Joint for Brucella appeared lower because the total number of negative control proteins was only 4, and in fact, only one of them was misclassified (Figure 5B). It is worth pointing out that only 73 training effectors remained after all the 274 Legionella effectors were excluded, and the T4SEpre_Joint model with such limited training data (21% of the original training data) could still correctly recognize most of the known Legionella effectors (222/274, 81%). One genus, Ochrobactrum, was an apparent exception: the models based on the effectors of other genera could at best recall 2/5 of the known effectors (Figure 5A, T4SEpre_bpbAac).
There are two types of T4SSs, type A and type B. It is interesting to observe the inter-category discrimination power of these models. The effectors were therefore assigned to two subsets, type A T4SS substrates and type B T4SS substrates. The negative controls were divided into two parts accordingly. Models were trained with either type of sequences and were then used to classify the other type of sequences. As shown in Figure 5, whereas T4SEpre_bpbAac and T4SEpre_psAac also showed some discriminating power, T4SEpre_Joint showed the best classification power. The relatively low recall rates of type B effectors (67.85% for T4SEpre_Joint) with the model based on type A effectors were due to the extremely limited number of type A effectors (36/347, 10.4%) (Figure 5A). Again, the specificity of different models on either type was very high, further demonstrating the reliability of inter-species prediction with all three software tools (Figure 5B).
Taken together, the results demonstrated that features extracted purely from C-terminal sequences could distinguish T4S effectors from non-effectors well. The models, especially T4SEpre_Joint, showed an excellent inter-species prediction performance.
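The Leave-One-Genus-Out evaluation described above can be sketched as a simple data-splitting loop: each genus is held out in turn, a model is fitted on the remaining genera, and recall and specificity are measured on the held-out genus. The record layout, the function names and the toy classifier below are hypothetical stand-ins; the actual study trained SVM models, which are not reproduced here.

```python
def leave_one_genus_out(records):
    """records: list of (genus, sequence, label), label 1 = effector, 0 = control.

    Yields (held_out_genus, train_records, test_records).
    """
    for held in sorted({g for g, _, _ in records}):
        train = [r for r in records if r[0] != held]
        test = [r for r in records if r[0] == held]
        yield held, train, test

def recall_and_specificity(test, predict):
    """Recall on effectors and specificity on controls; None if a class is absent."""
    pos = [s for _, s, y in test if y == 1]
    neg = [s for _, s, y in test if y == 0]
    recall = sum(predict(s) for s in pos) / len(pos) if pos else None
    spec = sum(1 - predict(s) for s in neg) / len(neg) if neg else None
    return recall, spec

def evaluate(records, fit):
    """fit(train_records) must return a predict(sequence) -> 0/1 callable."""
    return {
        genus: recall_and_specificity(test, fit(train))
        for genus, train, test in leave_one_genus_out(records)
    }

def toy_fit(train):
    """Placeholder model (an assumption): calls a sequence positive if E/S-rich."""
    return lambda s: int((s.count("E") + s.count("S")) / len(s) >= 0.4)

# Toy, made-up records used purely for illustration
records = [
    ("Legionella", "MKSEESE", 1), ("Legionella", "MGIVLLA", 0),
    ("Coxiella", "MSSEEKE", 1), ("Coxiella", "MAVLGIF", 0),
]
results = evaluate(records, toy_fit)
```

Because the held-out genus never appears in the training split, the reported recall reflects prediction without prior knowledge about that genus.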

New T4S effector candidates in H. pylori and Salmonella typhimurium
H. pylori is reported to encode multiple T4S effectors [4,19,20], among which only one, CagA, has been experimentally validated. As a result, direct statistical feature analysis for H. pylori effectors is impossible. It has been a big challenge to look for new effectors in H. pylori. We therefore used T4SEpre (Additional file 13), the inter-species T4S effector prediction software containing 3 highly efficient models (T4SEpre_Joint, T4SEpre_bpbAac, and T4SEpre_psAac), to screen the H. pylori genome (NC_000915) for possible T4S effectors.
T4SEpre_Joint, T4SEpre_bpbAac and T4SEpre_psAac identified 58, 78 and 37 T4S effectors, respectively (Additional file 14: Table S7). In total, 25 candidates were predicted by T4SEpre_Joint and at least one other model, and these constituted the most likely true effectors (Table 2). The genes encoding these effector candidates were widely scattered throughout the genome. Among these candidates, CagA was a known effector [...] Table S7; Table 2, italic). It should be noted that ~70% of the T4S candidates were hypothetical proteins with unknown function (Table 2). Previous studies have demonstrated that many proteins with unknown function were likely to function as pathogenic effectors [27]. Therefore, these proteins deserve further experimental validation.
As a control, we also made a whole-genome T4S effector prediction for Salmonella typhimurium LT2, a strain that has never been reported to have a functional protein-transporting T4SS. As shown in Additional file 15: Table S8, T4SEpre_Joint, T4SEpre_bpbAac and T4SEpre_psAac identified 57, 81 and 27 T4S effectors, respectively. Dividing by the total number of genome-encoded proteins (S. typhimurium LT2, 4423; H. pylori, 1573), the percentages of positive T4S proteins predicted in S. typhimurium (1.29, 1.83 and 0.61, respectively) were lower than in H. pylori (3.69, 4.96 and 2.35, respectively). Furthermore, the prediction results of the three software tools were combined to increase prediction specificity, as was done for H. pylori. We found that only 13 proteins were predicted by both T4SEpre_Joint and at least one other software tool (Additional file 15: Table S8). This positive ratio (0.29%, 13/4423) was also much lower than that in H. pylori (1.59%, 25/1573). Similar to the distribution of T3S signals among different bacteria, it is not surprising to find T4S signal-containing proteins in strains without protein-transporting T4SSs such as S. typhimurium LT2, though the number of positive proteins could be much smaller [27,29,30]. Three proteins in LT2 were predicted to be positive T4S effectors by all three tools (STM1870, STM2074 and STM2256; Additional file 15: Table S8). Among them, STM1870 is particularly interesting. It was predicted by all three models with the highest scores and hence most likely represents a true T4S effector (Additional file 15: Table S8). In a previous report, STM1870 was also found to contain a T3S signal [27]. The function of STM1870 remains to be clarified. STM2074 is annotated as a histidinol phosphatase and STM2256 encodes a cytochrome c-type subunit. These two proteins are more likely to represent false positives predicted by the software tools, but the possibility cannot be excluded that they contain T4S signal sequences and could be translocated through the T4SS conduit to host cells if there were a functional T4SS in Salmonella.
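The consensus rule used above (predicted by T4SEpre_Joint and by at least one of the other two models) and the genome-wide positive ratios can be written out as a short sketch. The function names and the protein identifiers in the toy sets are hypothetical; the counts passed to `positive_rate` are the ones quoted in the text.

```python
def consensus_candidates(joint, bpb, ps):
    """Proteins predicted by the Joint model and by at least one other model."""
    return set(joint) & (set(bpb) | set(ps))

def positive_rate(n_positive, n_proteins):
    """Percentage of genome-encoded proteins predicted as positive."""
    return 100.0 * n_positive / n_proteins

# Hypothetical prediction sets, for illustration only
joint_hits = {"p1", "p2", "p3"}
bpb_hits = {"p2", "p4"}
ps_hits = {"p3", "p5"}
candidates = consensus_candidates(joint_hits, bpb_hits, ps_hits)

# Consensus counts and proteome sizes as reported in the text
hp_rate = positive_rate(25, 1573)   # H. pylori
st_rate = positive_rate(13, 4423)   # S. typhimurium LT2
```

Requiring agreement with the Joint model keeps its higher specificity as the baseline, while the second model acts as an independent confirmation.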

Discussion and conclusion
Bacteria encode diverse protein secretion or translocation systems to effectively interact with host cells. Type III and type IV secretion systems play especially important roles in gram-negative bacteria [2,9,33,34]. Through comparative genomic analysis, Guglielmini et al. found that more bacteria than expected could encode potential protein-exporting T4SSs [9]. This is an interesting finding, indicating that these bacteria potentially interact with host cells by injecting effector proteins through T4SSs. It is much easier to detect whether these T4SSs are assembled and functional than to analyze how they could function. Identifying possible effectors is the decisive step in solving the latter problem. Currently, the most effective way to identify new T4S effectors is to validate candidates predicted according to the common features of known effectors encoded by the same or closely related bacteria [17,18]. However, for most species that have T4SSs, only a small number of effectors have been identified to date. Due to the small sample pool of known T4S effectors, no reliable features could be generalized from them. The species-specific methods described above therefore could not be adopted directly either. The number of newly discovered effectors is increasing for a limited number of representative species, e.g., L. pneumophila, but very few new effectors are being identified for other important species, e.g., H. pylori. These factors prompted us to develop an inter-species T4S effector prediction method.
In this study, we focused on sequence and structure-derived features. Through sequence-based single-, bi-, tri-residue Aac and motif analysis, we found distinct composition preference in C-terminal sequences of T4S effectors relative to control proteins. Glutamic acid and serine were most strikingly preferred by T4S effector sequences (Figure 1A, B and C). Position-specific Aac comparison demonstrated significant biases in the composition of glutamic acid and serine in a number of positions. Unlike serine, which always showed preference in T4S sequences, glutamic acid was preferred in most positions but depleted in C-terminal positions 1-4 (Figure 3). In the C-terminal sequences of more than 50% of effectors, three possible motifs were identified, which always contained one (or more) glutamic acid or serine as the consensus residue(s) (Figure 1C). It is interesting to examine whether and how these two amino acids or the motifs play roles in the specificity of type IV secretion recognition. The biological meaning of the other Aac preferences also remains to be clarified.
We also tried to observe the different secondary structure and solvent accessibility determined by the different Aac features between T4S and control proteins. The T4S effectors had much more flexible and exposed C-terminal regions than the control proteins (Additional file 6: Figure S3). We made a similar observation for the N-terminal sequences of type III secreted effectors, as reported previously [26]. It is not clear whether this is a common property of protein secretion signal sequences. Interestingly, 3D structure modeling revealed similar tertiary structure of the T4S C-terminal sequences (Additional file 8: Figure S5). Due to the relatively low accuracy and heavy computation cost of de novo structure prediction, it is not feasible to predict the structure of all T4S effectors with high precision. However, it is still interesting to observe the structural basis of specific type IV secretion recognition.
A variety of computational models were trained based on the different types or combinations of features. Three of them, T4SEpre_Joint trained on joint features of position-specific Aac, Sse and Acc, T4SEpre_bpbAac trained on Bi-Profile Bayesian Aac, and T4SEpre_psAac trained on both position-specific (Single-Profile Bayesian) and sequence-based Aac features, considerably outperformed the others in terms of sensitivity, specificity, accuracy, AUC and MCC (Table 1 and Figure 4). Additionally, T4SEpre_Joint also exhibited strong inter-species prediction power. Due to the lack of known effectors in most bacterial species, Legionella effectors represented the overwhelming majority of the training data (274/347, 79%). Remarkably, the T4SEpre_Joint model trained on the sequences of the other species (21% of the original training data) could still correctly recall ~81% of the known Legionella effectors (Figure 5). Even with less training data (type A effectors and control proteins, 10.4% of the original training data), T4SEpre_Joint could correctly recognize ~68% of the relatively independent type B effectors (Figure 5). Though with lower distinguishing performance than T4SEpre_Joint, T4SEpre_bpbAac and T4SEpre_psAac revealed different features of T4S effectors. These three tools, therefore, may be combined in practice for T4S effector prediction.
Prediction of Sse and Acc is relatively time-consuming for all bacterial proteins. We therefore only used T4SEpre_bpbAac and T4SEpre_psAac to screen T4S signals in all the bacteria with possible protein-delivery T4SSs [9]. We found that all the bacterial chromosomes containing protein-exporting T4SSs encode possible T4S effectors. On average, up to 5% of genes encode T4S effectors (data not shown). We further focused on H. pylori, for which all three T4SEpre models were adopted to predict possible new effectors other than CagA. A total of 25 genes were predicted by both T4SEpre_Joint and at least one other model. Notably, nearly 70% of the predicted genes encoded hypothetical proteins with unknown functions (Table 2). Besides, many genes, especially those with higher prediction scores, contained at least one of the three types of T4S motifs. These genes and others with high prediction values provide a valuable list of effector candidates for pathogenic study of H. pylori.
An ideal computational model would predict all the true positive effectors (highest sensitivity) without any false positive effector (highest specificity). However, it is infeasible to develop such a perfect model. In practice, we have to strike a balance between sensitivity and specificity to cope with different situations. For example, in bacteria with many known effectors such as Legionella, the prediction specificity has to be sacrificed to increase the sensitivity, so as to find more new effectors. However, to identify effectors from bacteria with few known effectors such as H. pylori, it is recommended to increase prediction specificity at a cost of sensitivity. Higher specificity will ensure fewer false positives and lower experimental cost. The three software tools proposed here all exhibited quite high prediction specificity (93-97%). It should be pointed out that, even with the highest cross-validation specificity of 97%, ~86 false positives would be predicted from a genome encoding 2850 non-effector proteins. The sensitivity of T4SEpre_Joint is 89% at the specificity of 97%, so about 134 effectors can be correctly predicted assuming there are 150 effector proteins in the same genome. Therefore, in a genome encoding 3000 total proteins and 150 (5%) T4S effectors, T4SEpre_Joint will predict 220 candidates, 61% (134/220) of which are true positives. In order to further increase the specificity, we suggest the following two strategies, as adopted in H. pylori effector prediction: (1) combining all three tools and looking for the effectors predicted by both T4SEpre_Joint and at least one other software tool, and (2) increasing the prediction threshold value to 0.5 or higher. From our observations, the true positives are more often predicted by combining multiple models, and with higher prediction scores. Therefore, both strategies should decrease the ratio of false positives in the prediction results.
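The arithmetic behind the 3000-protein example above can be checked with a few lines. The sketch takes the genome size, the assumed effector fraction, and the cross-validation sensitivity and specificity quoted in the text, and returns the expected numbers of true and false positives and the resulting fraction of true positives among predictions. The function name is hypothetical.

```python
def expected_predictions(n_proteins, effector_fraction, sensitivity, specificity):
    """Expected true positives, false positives and precision for one genome."""
    n_effectors = n_proteins * effector_fraction
    n_non_effectors = n_proteins - n_effectors
    true_pos = sensitivity * n_effectors
    false_pos = (1.0 - specificity) * n_non_effectors
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision

# Values quoted in the text: 3000 proteins, 5% effectors, 89% / 97%
tp, fp, precision = expected_predictions(3000, 0.05, 0.89, 0.97)
# tp is about 133.5 and fp about 85.5, i.e. the ~134 and ~86 quoted above;
# precision is about 0.61
```

The same function shows how quickly the fraction of true positives drops when effectors are rarer: with the same sensitivity and specificity but a 1% effector fraction, fewer than a quarter of the predictions would be expected to be true effectors.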
T4S proteins were also predicted from bacteria without known protein-transporting T4SSs (e.g., S. typhimurium LT2, Additional file 15: Table S8). It is not unexpected that some proteins in such bacteria also contain T4S signals. Löwer and Schneider [29] and Arnold et al. [30] independently found T3S signals in proteins of bacteria without known Type III Secretion Systems (T3SSs). In a previous study, we also demonstrated that T3S signals could exist in proteins of Gram-negative bacteria without T3SSs, Gram-positive bacteria and even yeasts [27]. As with T3S signals, it makes sense that some proteins in bacteria without protein-delivery T4SSs may happen to have T4S signal sequences. Strictly, a protein containing a T4S signal sequence does not necessarily represent a T4S effector. A T4S effector must have the signal sequence, be encoded in a host strain bearing a functional protein-transporting T4SS, and be co-expressed with the T4SS apparatus genes [27]. A tentative hypothesis, however, is that, as in S. typhimurium LT2, the total number of proteins with T4S signals in bacteria without protein-transporting T4SSs should be much smaller than in strains with functional protein-transporting T4SSs.

Datasets
Experimentally validated T4S effectors were retrieved from the literature and their putative orthologs were extracted from genome annotation files. In total, we analyzed 1913 effectors from 10 genera, including Agrobacterium, Anaplasma, Bartonella, Bordetella, Brucella, Coxiella, Ehrlichia, Helicobacter, Legionella and Ochrobactrum. The T4S signal peptide, i.e., the C-terminal 100-aa fragment, was extracted from each effector sequence. Pairwise alignment was performed for the 100-aa T4S signal peptides with JAligner, which implements the Smith-Waterman algorithm (http://jaligner.sourceforge.net/). The ratio between the similarity score of each sequence pair and the self-similarity score was calculated. Conserved paralogs or orthologs were identified when a pair of sequences had a similarity score ratio higher than 0.30. For each orthologous or paralogous cluster, only one representative was selected as the training sequence. This homology-filtering procedure reduced the number of T4S peptides to 347. The non-redundant peptides constitute the positive training dataset. Non-T4S proteins were randomly selected from the same strains from which the positive training sequences originated, followed by removal of the known T4S effectors and their homologs. The C-terminal 100-aa peptide fragment was also extracted from each non-T4S protein, and the same homology-filtering procedure was performed. Finally, for each strain, the ratio of non-T4S:T4S peptides was set to 2:1, and the GC content of the encoding nucleotides was kept equal or similar between the two types of sequences (T4S 40% vs. non-T4S 41%) [26]. The 347 T4S and 694 non-T4S sequences constituted the final positive and negative datasets, respectively (Additional file 16: Text S2).
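The homology-filtering step can be illustrated with a minimal sketch. It is not the JAligner pipeline: the scoring scheme (match 2, mismatch -1, linear gap -2) is an assumed stand-in for a substitution matrix, the choice of the smaller self-score as the denominator is an assumption (the text does not specify which self-score is used), and the greedy choice of the first sequence as cluster representative is likewise hypothetical.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (assumed simple scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def similarity_ratio(a, b):
    """Pairwise score divided by a self-similarity score.

    Assumption: the smaller of the two self-scores is the denominator.
    """
    self_score = min(sw_score(a, a), sw_score(b, b))
    return sw_score(a, b) / self_score if self_score else 0.0


def filter_homologs(peptides, cutoff=0.30):
    """Greedily keep peptides whose ratio to every kept one is <= cutoff."""
    representatives = []
    for pep in peptides:
        if all(similarity_ratio(pep, rep) <= cutoff for rep in representatives):
            representatives.append(pep)
    return representatives


# Toy peptides: the second differs from the first by one residue.
kept = filter_homologs(["ACDEFGHIKL", "ACDEFGHIKM", "WWWWYYYYPP"])
print(kept)
```

In this toy input the second peptide is removed as a close homolog of the first, while the unrelated third peptide is retained.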
For 5-fold (or 10-fold) cross-validation, the negative and positive training datasets were pooled as the final training dataset, which was evenly split into five (or ten for 10-fold cross-validation) sub-datasets, each containing the same number of positive/negative samples.
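A class-balanced split of this kind can be sketched as follows. The function name, the fixed random seed and the round-robin assignment are illustrative assumptions; because 347 and 694 are not multiples of five, the per-fold class counts can only be equal to within one sample.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with near-equal class counts."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out round-robin.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


# Dataset sizes from the text: 347 T4S (1) and 694 non-T4S (0) peptides.
labels = [1] * 347 + [0] * 694
folds = stratified_folds(labels, k=5)
for n, fold in enumerate(folds):
    n_pos = sum(labels[i] for i in fold)
    print(f"fold {n}: {n_pos} positive, {len(fold) - n_pos} negative")
```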
To examine whether the size of the negative dataset influences the classification performance, another independent negative dataset was prepared (Additional file 10: Text S1). The proteins were randomly selected from different bacteria (from all the bacterial classes listed in the NCBI Genome database). The C-terminal 100 amino acids were extracted from each protein, and a similar homology-filtering strategy was then applied to remove homologs of known effectors and redundant homologs of the included negative sequences. Finally, 2082 non-redundant negative sequences were included (6-fold the size of the positive dataset). These negative sequences were combined with the positive T4S sequences to form an independent training dataset. For the new sequences, Sse and Acc were predicted with the same procedures described before.

Extraction of sequence-based and position-specific Aac features
Sequence-based Aac was calculated for each T4S or non-T4S sequence. The occurrence of each of the 20 amino acid species was counted within the C-terminal 100, 50 and 30 positions (C100, C50 and C30, respectively). An Aac frequency vector was obtained for each sequence, and the vectors for all sequences composed a frequency matrix. The composition of each amino acid species was compared between T4S and non-T4S sequences with Student's two-tailed t-test and a binomial distribution-based statistical test. The resulting p-values were further adjusted by Bonferroni multiple testing correction [35]. The significance level was set as p < 0.05 for both tests. For each amino acid species with significant bias, the log ratio of average composition was calculated between the two types of sequences, which represented the relative advantage of the amino acid composition in T4S (positive) or non-T4S (negative) sequences, with a larger absolute value indicating a more striking advantage. The bi-residue (bAac) and tri-residue (tAac) compositions were calculated with a similar procedure. Putative and conserved motifs were screened with MEME [36], followed by an iterative calculation of the frequency of possible motifs derived from single Aac, bAac or tAac preference.
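The composition and log-ratio calculation can be sketched as below. The logarithm base (2) and the small pseudocount that guards against division by zero are assumptions, as the text specifies neither; the statistical tests and Bonferroni correction are omitted, and sequences are assumed to be non-empty.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq, window=100):
    """Amino acid frequencies within the C-terminal `window` residues."""
    tail = seq[-window:]
    counts = Counter(tail)
    return {aa: counts[aa] / len(tail) for aa in AMINO_ACIDS}


def mean_aac(seqs, window=100):
    """Average composition over a set of sequences."""
    vectors = [aac(s, window) for s in seqs]
    return {aa: sum(v[aa] for v in vectors) / len(vectors) for aa in AMINO_ACIDS}


def log_ratio(pos_seqs, neg_seqs, window=100, pseudo=1e-6):
    """log2(mean positive composition / mean negative composition).

    Positive values indicate enrichment in the positive (T4S) set.
    """
    pos = mean_aac(pos_seqs, window)
    neg = mean_aac(neg_seqs, window)
    return {aa: math.log2((pos[aa] + pseudo) / (neg[aa] + pseudo))
            for aa in AMINO_ACIDS}


# Toy sequences (hypothetical, not taken from the datasets).
ratios = log_ratio(["EEEEKKKK"], ["EEKKKKKK"], window=8)
print(round(ratios["E"], 2), round(ratios["K"], 2))
```

Calling the same functions with `window=50` or `window=30` gives the C50 and C30 variants.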
The position-specific Aac features were extracted as follows. Let the vector S = s_1, s_2, s_3, …, s_n denote a peptide sequence, in which s represents an amino acid, the subscript 1, 2, … or i represents its position, and n represents the sequence length. For m sequences, the position-specific occurrence of a certain amino acid A is described as p(A_i) = f(A_i)/m, in which f(A_i) denotes the frequency (number of occurrences) of amino acid A at position i. For each position, the p(A_i) values of the different amino acids form a position set, and for a sequence S with n amino acids, n values (one extracted from each position set) comprise a composition vector. A binomial distribution B_i(m, p_aa) was modeled for each amino acid species at each position, where p_aa was set to the p(A_i) of the negative dataset or to 1/20 (the ideal random situation), depending on the purpose of the comparison. A Bonferroni-corrected binomial test was performed based on the distribution model to identify the significantly preferred or disfavored amino acids at the corresponding positions of T4S sequences. The significance level was also set as p < 0.05.
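A minimal sketch of the position-specific enrichment test is given below. It covers only the one-sided test for preferred residues; the number of tests used for the Bonferroni correction (20 amino acids times the number of positions) is an assumption, and p_aa must lie strictly between 0 and 1. The binomial probabilities are evaluated in log space so that the sketch also works for datasets with thousands of sequences.

```python
from math import exp, lgamma, log


def binom_pmf(j, m, p):
    """P(X = j) for X ~ Binomial(m, p), computed in log space."""
    log_comb = lgamma(m + 1) - lgamma(j + 1) - lgamma(m - j + 1)
    return exp(log_comb + j * log(p) + (m - j) * log(1 - p))


def binom_upper_tail(k, m, p):
    """P(X >= k) for X ~ Binomial(m, p)."""
    return sum(binom_pmf(j, m, p) for j in range(k, m + 1))


def enriched_positions(seqs, aa, p_aa=1 / 20, alpha=0.05, n_tests=None):
    """Positions where `aa` is significantly preferred (one-sided).

    All sequences are assumed to have the same length.
    Returns tuples (position, p(A_i), Bonferroni-adjusted p-value).
    """
    m = len(seqs)
    n = len(seqs[0])
    if n_tests is None:
        n_tests = 20 * n  # assumed: one test per residue type per position
    hits = []
    for i in range(n):
        count = sum(1 for s in seqs if s[i] == aa)  # f(A_i)
        p_adj = min(1.0, binom_upper_tail(count, m, p_aa) * n_tests)
        if p_adj < alpha:
            hits.append((i, count / m, p_adj))
    return hits


# Toy peptides (hypothetical): glutamate is fixed at the first position.
toy = ["EAC"] * 20 + ["EGD"] * 20
hits = enriched_positions(toy, "E")
print(hits)
```

Passing the negative-set value of p(A_i) as `p_aa` instead of 1/20 gives the comparison against the background dataset; a lower-tail test would be needed to detect disfavored residues.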

Secondary structure, solvent accessibility and tertiary structure
SCRATCH was used to predict the secondary structure (Sse, represented for each sequence as a string of 'C', 'H' or 'E', where 'C' denotes coil, 'H' helix and 'E' strand) and solvent accessibility (Acc, a string of 'b' or 'e', denoting 'buried' or 'exposed', respectively) [37]. Tertiary structures of T4S peptides were predicted with I-TASSER [38]. The structures with TM-score ≥ 0.5 were further analyzed for their structural similarity using MultiProt [39].

Models and performance assessment
Sequence-based Aac features were directly represented by the frequency of each amino acid species ('Seq_Aac') or each bi-residue ('Seq_bAac'). The combination of all the 'Seq_Aac' and 'Seq_bAac' features, or of those significantly preferred/depleted in T4S peptides, led to the features of model 'Seq_Aac, bAac' or 'Seq_Sig', respectively. The sequence-based joint Aac, Sse and Acc features were extracted with the strategy described in Yang et al. [28]. Position-specific Single-Profile and Bi-Profile Bayesian features were extracted with the same pipeline as for the type III secreted effector prediction model BPBAac [26]. The combination of sequence-based Aac and position-specific Single-Profile Aac features formed the features of model 'Pos_Aac_SPB + Seq_Aac'. Position-specific joint Aac, Sse and Acc features were extracted according to Wang et al. [27]. The feature values for each training sequence formed a vector. The vectors were then used to train support vector machine (SVM) models with the R package 'e1071' (http://cran.r-project.org), using a radial basis kernel function. The SVM parameters were optimized using grid search based on 10-fold cross-validation.
The model performance was evaluated and compared with a five-fold cross-validation and Leave-One-Genus-Out strategy [26]. Accuracy (A), Specificity (Sp), Sensitivity (Sn), the Receiver Operating Characteristic (ROC) curve, the area under the ROC curve (AUC) and the Matthews Correlation Coefficient (MCC) were used to assess the predictive performance. In the following formulas, A denotes the percentage of both positive instances (T4S) and negative instances (non-T4S) correctly predicted. Sn (true positive rate) and Sp (true negative rate) represent the percentage of positive instances (T4S) and the percentage of negative instances (non-T4S) correctly predicted, respectively. An ROC curve is a plot of Sn versus (1 − Sp) and is generated by shifting the decision threshold. AUC gives a measure of classifier performance. MCC takes into account true and false positives and true and false negatives, and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.
$$A = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Sn = \frac{TP}{TP + FN}$$
$$Sp = \frac{TN}{TN + FP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN denote the number of true positives, true negatives, false positives and false negatives, respectively.
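The standard definitions of these measures can be computed from confusion-matrix counts as in the following sketch. The function name is illustrative, and returning an MCC of 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return (A, Sn, Sp, MCC) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc


# Hypothetical counts, for illustration only.
acc, sn, sp, mcc = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(f"A={acc:.3f} Sn={sn:.3f} Sp={sp:.3f} MCC={mcc:.3f}")
```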

Genome-wide prediction of T4S effectors
The proteins were deduced from the H. pylori (NC_000915) and S. typhimurium LT2 (NC_003197) genome DNA sequences downloaded from the NCBI Genome database. The sequences were screened for possible T4S effectors with the three independent models in the T4SEpre package (T4SEpre_Joint, T4SEpre_bpbAac and T4SEpre_psAac). The default cutoff SVM score (≥ 0.5) was adopted for all three models. The standalone T4SEpre package can be freely downloaded from the web site: http://biocomputer.bio.cuhk.edu.hk/softwares/T4SEpre/.
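Selecting candidates from the per-model scores can be sketched as follows, using the rule applied to H. pylori in this study (predicted by T4SEpre_Joint and by at least one other model, each at the default cutoff). The dictionary layout, the short model keys and the gene identifiers are hypothetical; the sketch does not reflect the actual output format of the T4SEpre package.

```python
def consensus_candidates(scores, cutoff=0.5):
    """Genes called by the Joint model and at least one other model.

    `scores` maps a gene identifier to a dict of SVM scores keyed by
    'Joint', 'bpbAac' and 'psAac' (assumed layout).
    Results are sorted by decreasing Joint score.
    """
    hits = []
    for gene, s in scores.items():
        joint_call = s["Joint"] >= cutoff
        other_calls = sum(s[model] >= cutoff for model in ("bpbAac", "psAac"))
        if joint_call and other_calls >= 1:
            hits.append(gene)
    return sorted(hits, key=lambda g: scores[g]["Joint"], reverse=True)


# Hypothetical scores for four placeholder genes.
example_scores = {
    "geneA": {"Joint": 0.91, "bpbAac": 0.72, "psAac": 0.40},
    "geneB": {"Joint": 0.65, "bpbAac": 0.10, "psAac": 0.20},
    "geneC": {"Joint": 0.30, "bpbAac": 0.80, "psAac": 0.90},
    "geneD": {"Joint": 0.97, "bpbAac": 0.55, "psAac": 0.88},
}
candidates = consensus_candidates(example_scores)
print(candidates)
```

Raising `cutoff` above 0.5 applies the stricter threshold strategy to the same data.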