This study served as a proof of concept for the GenSET method for accurate T3SS effector prediction. We used a wide range of effector attributes to build predictive models through machine learning that could identify differences between effector and non-effector data sets. In GenSET Phase 1, our approach significantly increased effector prediction accuracy for the majority of species tested (3 out of 4) (Tables 4 and 6). The method predicted 10 to 80% more effectors in the top 40 proteins than the other established methods in three out of four test organisms (Table 6). The method was customized to four specific organisms and can be applied to other organisms to predict effectors in individual genomes. The GenSET method can therefore reduce the number of labor-intensive wet-lab validation experiments for effector prediction.
In GenSET Phase 1, we used 15 effectors for E. coli, S. dysenteriae, and S. Typhimurium, or 21 effectors for P. syringae, for the machine learning, whereas in GenSET Phase 2 we used 30 effectors (15 each) from two related organisms in the family Enterobacteriaceae. A larger positive training set did not improve the performance values (Table 3) or the prediction rate (Table 4) when compared with the smaller positive training sets. Thus, the minimum size of the positive set could be set at 15 effectors. The strength of our GenSET method lies in the use of small positive sets, such as 15 confirmed effectors for Phase 1 and 30 for Phase 2, and in the potential to customize the method for species-specific prediction in any genome. In contrast, other published programs such as BPBAac [12], T3MM [16] and EffectiveT3 [10] used a pool of heterogeneous effectors from many genomes to construct their positive sets. These genomes were from phylogenetically diverse organisms with different T3SS families, which prevented the customization of the pooled effector data set for species-specific predictions.
Seven different families of T3SSs in gram-negative bacteria have been proposed based on the phylogram of ATPases. The families include SPI-1 and SPI-2 in S. Typhimurium, Ysc in Y. pestis, Hrp 1 in P. syringae, Hrp 2 in Xanthomonas campestris, chlamydiales and rhizobiales [17, 18]. These T3SS families may have different translocation signals embedded in their respective effectors. One problem with combining effectors from different T3SS families to create a generic classifier is that different effectors may emphasize different attributes. This may introduce a strong bias into the classifier and reduce the performance of generic effector prediction algorithms when they are applied to specific organisms. For example, Sato et al. [14] defined the training set with effectors from single genomes of two organisms, S. Typhimurium and P. syringae. One disadvantage of this approach was that the SVM machine learning was performed on one organism and then applied to another, unrelated organism for effector prediction. Similarly, Yang et al. [19] based their classifier training on P. syringae and then applied the method to rhizobial strains. Because effectors from different T3SS families may carry different translocation signals, combining effectors from unrelated organisms in machine learning may reduce prediction rates. The species-specific approach used in GenSET Phase 1 avoids these biases and improves T3SS effector prediction in individual species. Indeed, GenSET Phase 1 achieved better prediction rates, ranking an average of 70.3% of effectors in the top 40 positive predictions and 78.6% of the effectors overall (Table 4). These prediction rates were 10 to 80% better than those of other established methods on three out of the four organisms tested (Table 6). Although Yang et al. [19] pooled effectors from three different strains of the same species, P. syringae, to make the positive set homogeneous, the trained algorithm was applied to unrelated species or families of T3SSs. Indeed, a trained algorithm called TREEE, based on machine learning on P. syringae, performed poorly when applied to S. Typhimurium [20].
Additionally, the mixing of two different types of T3SS effectors in the training set, such as SPI-1 and SPI-2 of S. Typhimurium, may have reduced the performance of the SVM training by Sato et al. [14]. In the GenSET Phase 1 approach, our machine learning was not only species-specific but also T3SS family-specific in order to increase prediction accuracy. For example, our machine learning on S. Typhimurium was specific to SPI-2 family effectors and predicted 88.9% of known effectors in the top 40 positive predictions (Table 4). GenSET was thus successful in predicting effectors in a species-specific manner as long as the organism had a minimum number of known effectors, such as 15. Our results strongly suggest that GenSET Phase 1 can be customized to any organism, and we have successfully applied it to four organisms in this study.
To investigate whether GenSET can be applied universally to less-studied organisms with fewer identified effectors, we combined positive sets from two closely related organisms (S. Typhimurium and S. dysenteriae) in GenSET Phase 2 and then applied the algorithm to E. coli, P. syringae, Y. pestis and S. fredii. It should be noted that E. coli belongs to the same T3SS family as the Salmonella species used to construct the training set, whereas P. syringae, Y. pestis and S. fredii belong to different families of T3SSs [17, 18]. This may explain why the top 40 positive prediction rates and sensitivity values for the other three organisms were lower than those observed for E. coli (Tables 3 and 4). The lower sensitivity values suggest slightly different translocation signals in the training and testing data sets. However, these prediction rates are still meaningful for an initial screening tool intended to reduce the time spent on wet-bench experiments for T3SS effector identification.
The GenSET Phase 1 approach had the highest prediction accuracy, was T3SS family-specific, and has the potential to be applicable to any organism. Some of the effectors of S. Typhimurium can be translocated by both the SPI-1 and SPI-2 apparatuses [21]. Therefore, it was not surprising to see SPI-1 effectors identified by SPI-2-specific machine learning in S. Typhimurium in GenSET Phase 1. The GenSET Phase 1 method predicted eight out of nine (89%) SPI-2 effectors in the top 40 positive predictions (Table 4). It also picked out four out of eight (50%) SPI-1 effectors in the top 40 positive predictions (Additional files 3 and 4). Furthermore, our method not only predicted effectors in less-studied organisms but was also able to predict novel effector candidates in well-studied organisms. For example, the top 40 ranked effector predictions from S. Typhimurium strain LT2 by GenSET included about 30 hypothetical proteins. Some of these hypothetical proteins may be novel effectors that await further characterization.
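The top 40 prediction rate quoted here is a simple recovery fraction: the share of known effectors that appear among the first 40 proteins of a ranked list. The sketch below illustrates that arithmetic only. The function name `top_k_recovery`, the protein identifiers and the ranking are hypothetical and are not taken from the study's data; they are arranged so that eight of nine known effectors fall in the top 40.

```python
def top_k_recovery(ranked_ids, known_effectors, k=40):
    """Return the fraction of known effectors found among the first k ranked proteins."""
    known = set(known_effectors)
    if not known:
        raise ValueError("known_effectors must not be empty")
    top = set(ranked_ids[:k])
    return len(known & top) / len(known)


# Synthetic ranking of 100 hypothetical protein IDs, highest-scoring first.
ranked = [f"prot{i:03d}" for i in range(100)]

# Nine hypothetical known effectors; eight of them sit inside the top 40.
known = [
    "prot000", "prot002", "prot005", "prot009", "prot014",
    "prot021", "prot030", "prot039", "prot071",
]

rate = top_k_recovery(ranked, known, k=40)
print(f"{rate:.1%}")  # prints 88.9%
```

With real data, `ranked` would be the genome's proteins sorted by classifier score and `known` the experimentally confirmed effectors held out for evaluation.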
GenSET used five different algorithms and a voting algorithm for machine learning on organisms with different effector population sizes and compositions of positive sets. The voting algorithm is advantageous because it increases error tolerance. If one algorithm is completely off target in its predictions while the other four work well, the averaging process reduces the impact of the poorly performing algorithm. For example, the filtered SVM algorithm on S. Typhimurium did not pick out any of the known effectors, but because the other algorithms did, the voting algorithm still predicted them to be effectors (Additional file 3). In comparison, other programs used only one to three algorithms for training. For example, SVM was used for the meta-analytical approach [14], SIEVE [11] and BPBAac [12], and SVM, a generalized linear model and RandomForest were used for T3MM by Wang et al. [16].
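The error tolerance of averaging can be illustrated with a minimal sketch. It assumes that each classifier outputs a probability that a protein is an effector and that the vote is the plain mean of those probabilities compared against a 0.5 cutoff. The classifier names, the scores and the cutoff are all hypothetical and are not taken from the GenSET implementation.

```python
from statistics import mean


def vote(probabilities, threshold=0.5):
    """Average per-classifier effector probabilities and apply a cutoff.

    Both the plain mean and the 0.5 threshold are assumptions made
    for this illustration.
    """
    avg = mean(probabilities)
    return avg, avg >= threshold


# Hypothetical scores for one known effector: four classifiers agree,
# while one (standing in for the filtered SVM) is completely off target.
scores = {
    "clf_a": 0.90,
    "clf_b": 0.85,
    "clf_c": 0.80,
    "clf_d": 0.75,
    "filtered_svm": 0.05,
}

avg, is_effector = vote(list(scores.values()))
print(f"mean={avg:.2f} effector={is_effector}")  # prints mean=0.67 effector=True
```

In this toy case the single outlier lowers the mean from 0.825 to 0.67 but does not change the call, which is the behaviour described above for the filtered SVM on S. Typhimurium.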
We started with 21 attributes (features) and employed attribute selection methods to define subsets of attributes called the filtered sets. Other published methods (e.g. SIEVE and T3SEpre) concentrate on a few attributes, such as G + C content and amino acid composition [11, 13]. However, we preferred to use a comprehensive list of attributes so that we could cover as many characteristics as possible. We looked at peptide properties to understand the physico-chemical nature of effectors; these properties have been well examined by other researchers [9–11]. We also examined molecular weight, charge and pI, as effectors are generally small in size and have a charged residue bias [4]. Other features were related to structure and environment, and included protein stability (aliphatic index and N-terminal disorder), a solubility measure (PEPIB), hydropathy (GRAVY score), and gene G + C content bias.
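Two of the attributes named above have simple closed-form definitions and can be sketched directly. The example below computes the GRAVY score as the mean Kyte-Doolittle hydropathy per residue and the aliphatic index using Ikai's formula, AI = X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is the mole percent of each residue. It is an illustration of the definitions, not the study's feature-extraction code; the function names and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Mean hydropathy per residue; raises KeyError on non-standard residues."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Ikai aliphatic index computed from mole percentages of A, V, I and L."""
    seq = seq.upper()
    n = len(seq)

    def pct(aa):
        return 100.0 * seq.count(aa) / n

    return pct("A") + 2.9 * pct("V") + 3.9 * (pct("I") + pct("L"))


# Hypothetical peptide used only to show the calls.
example = "MAVLIKDE"
features = {"gravy": gravy(example), "aliphatic_index": aliphatic_index(example)}
print(features)
```

The same functions could be applied to an N-terminal window rather than the full sequence by slicing the input first, for example `gravy(seq[:30])`.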
In general, the unfiltered sets performed better than the filtered sets in all organisms. Using the full attribute set appears to suit our species- and T3SS family-specific approach and may be adaptable to other organisms. Future directions for improving this method include researching and evaluating additional attributes. The goal is to develop an exhaustive list of attributes that can characterize the translocation signals embedded in effector sequences. Additionally, we can fine-tune the parameters of the machine learning algorithms to increase precision and reduce the number of false positive predictions. For example, we can increase the length of the N-terminal sequence from N30 to N50 or longer, as suggested by Wang et al. [13], or we can increase the size of the training sets.
It is not clear to us why the unfiltered attribute sets generally performed better than the filtered sets. One possible explanation is that there are clusters of effectors that are more closely related to each other than they are to other effectors. Indeed, effectors can be classified and grouped into common families [3]. If the majority of the effectors in a training set fall inside one such relational cluster, the feature selection algorithms would be biased toward attributes that pinpoint that cluster. The unfiltered set can cover more such clusters and thus compensate for effectors that lie outside the dominant cluster. There may be several such relational clusters of effectors within the same organism or in different organisms, especially since about 30% of each genome is still unannotated.