A generalized approach to predicting protein-protein interactions between virus and host

Background: Viral infection involves a large number of protein-protein interactions (PPIs) between a virus and its host. These interactions range from the initial binding of viral coat proteins to host membrane receptors to the hijacking of the host transcription machinery by viral proteins. Therefore, identifying PPIs between a virus and its host helps to understand the mechanism of viral infection and to design antiviral drugs. Many computational methods have been developed to predict PPIs, but most of them are intended for PPIs within a species rather than PPIs across different species, such as PPIs between virus and host.
Results: In this study, we developed a prediction model of virus-host PPIs that is applicable to new viruses and hosts. We tested the prediction model on independent datasets of virus-host PPIs, which were not used in training the model. Despite a low sequence similarity between proteins in training datasets and target proteins in test datasets, the prediction model showed a high performance comparable to the best performance of other methods for single virus-host PPIs.
Conclusions: Our method will be particularly useful to find PPIs between host and new viruses for which little information is available. The program and support data are available at http://bclab.inha.ac.kr/VirusHostPPI.


Background
There are many types of viruses that cause a wide variety of viral infections or viral diseases. For example, more than 11,000 deaths were reported in Africa during the outbreak of Ebola virus disease in 2014 and 2015 [1]. More recently, an outbreak of Middle East respiratory syndrome coronavirus (MERS-CoV) [2], which began with a patient in an emergency room, occurred in South Korea. So far, there is no specific vaccine or effective treatment for Ebola virus and MERS-CoV [1,2]. Viral infection involves a large number of protein-protein interactions (PPIs) between a virus and its host. These interactions range from the initial binding of viral coat proteins to host membrane receptors to the hijacking of the host transcription machinery by viral proteins. [...] support vector machine (SVM) model to predict PPIs between human and two types of viruses (hepatitis C virus and human papillomavirus). However, these prediction methods cannot be applied to new viruses or new hosts that have no PPIs known to the methods. The inter-species PPIs predicted by these methods are limited to PPIs between a single type of virus and a single type of host. A recent SVM model called DeNovo is perhaps the only one that can predict PPIs of new viruses with a shared host [6]. Amino acid sequence similarity between different types of viruses or hosts is relatively low, so sequence-based prediction of virus-host PPIs for new viruses or hosts is quite challenging. In this study, we developed a new prediction method of virus-host PPIs that is applicable to new viruses or hosts. The rest of this paper discusses the details of the method and its experimental results.

Data of virus-host PPIs
We obtained all known PPIs between virus and host using the PSICQUIC web service (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml). We extracted virus-host PPIs from four databases, APID, IntAct, Mentha and UniProt, which use the same protein identifiers. The sequences of the proteins involved in any of the PPIs were obtained from the UniProt database (http://www.uniprot.org). As of December 2016, there were a total of 12,157 PPIs between 29 hosts and 332 viruses (Table 1). Human is listed as a separate category from other animals (i.e., non-human animals) in the classification of hosts because human has a much larger number of known PPIs with viruses than other animals. Detailed information on the viruses involved in the virus-host PPIs is available at http://bclab.inha.ac.kr/VirusHostPPI.
Learning-based prediction of PPIs requires both positive and negative PPI data, but negative data are not readily available in databases. For negative data, we obtained protein sequences of major hosts (human, non-human animal, plant, and bacteria) from UniProt, and removed those with a sequence similarity higher than 80% to any positive data using CD-HIT-2D [7].

Datasets
We constructed several datasets to examine the applicability of our prediction method to new viruses or hosts. The datasets are classified into two types: datasets for new viruses and datasets for new hosts. To examine the applicability of the prediction method to new viruses, we constructed a training dataset with 10,955 PPIs between human and any virus except H1N1 virus (hereafter called TR1). The prediction method was later tested on a test dataset with 381 PPIs between human and H1N1 virus (called TS1), which were not used in training the method. We constructed another training dataset TR2 with 11,341 PPIs between human and any virus except Ebola virus. The prediction method trained with TR2 was tested on a test dataset TS2, which contains 150 PPIs between human and Ebola virus (Fig. 1a). Additional training datasets for studying the applicability to new viruses are TR3 and TR4. TR3 contains 11,617 virus-host PPIs except PPIs of H1N1 virus. TR4 consists of 12,007 virus-host PPIs except PPIs of Ebola virus. The prediction models trained with TR3 and TR4 were later tested on TS1 and TS2, respectively (see Fig. 1b for details).
The viruses for the SVM model were selected for the following reasons: (1) For training the SVM model, we tried to select as many virus proteins as possible that have known interactions with host proteins. (2) For testing the SVM model on new viruses, we selected H1N1 and Ebola virus because these viruses recently caused a large number of deaths, but no specific vaccine or effective treatment is available yet.
The applicability of the prediction method to new hosts was evaluated using training dataset TR5 and test datasets TS5.1-TS5.4. TR5 contains 11,491 PPIs between human and any virus. The prediction method trained with TR5 was tested on PPIs of non-human hosts with virus, which were not used in training the method. The test datasets include TS5.1 (PPIs of non-human animal with virus), TS5.2 (PPIs of plant with virus), TS5.3 (PPIs of bacteria with virus) and TS5.4 (PPIs of any non-human host with virus) (Fig. 2).
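The hold-out construction of the training and test datasets can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the record fields, the function names `hold_out_virus` and `hold_out_hosts`, and the toy entries are all hypothetical.

```python
from typing import NamedTuple

class PPI(NamedTuple):
    host: str           # host category, e.g. "human" (hypothetical label)
    virus: str          # virus name, e.g. "H1N1" (hypothetical label)
    host_protein: str   # host protein identifier
    virus_protein: str  # virus protein identifier

def hold_out_virus(ppis, virus, train_host=None, test_host="human"):
    """Hold out every PPI of `virus` from training.

    train_host=None keeps all hosts in training (as in TR3 and TR4);
    train_host="human" keeps human-virus PPIs only (as in TR1 and TR2).
    The test set holds the PPIs of `virus` with `test_host`.
    """
    train = [p for p in ppis
             if p.virus != virus and train_host in (None, p.host)]
    test = [p for p in ppis if p.virus == virus and p.host == test_host]
    return train, test

def hold_out_hosts(ppis, train_host="human"):
    """Train on one host, test on all other hosts (as in TR5 and TS5.4)."""
    train = [p for p in ppis if p.host == train_host]
    test = [p for p in ppis if p.host != train_host]
    return train, test

# Toy records, for illustration only.
toy = [
    PPI("human", "H1N1", "hA", "vA"),
    PPI("human", "HCV", "hA", "vB"),
    PPI("human", "Ebola", "hB", "vC"),
    PPI("plant", "TMV", "pA", "vD"),
    PPI("bacteria", "phage", "bA", "vE"),
]
tr1, ts1 = hold_out_virus(toy, "H1N1", train_host="human")
tr3, _ = hold_out_virus(toy, "H1N1")
tr5, ts54 = hold_out_hosts(toy)
```

Splitting by virus or by host, rather than by random PPI, is what guarantees that no protein of the held-out virus or host is seen during training.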
To assess the independence of the test data from the training data, we analyzed the sequence similarity between the training datasets and test datasets using the EMBOSS Needle tool [8]. As shown in Table 2, target proteins in the test datasets showed a very low sequence similarity with proteins in the training datasets (see the supporting data at http://bclab.inha.ac.kr/VirusHostPPI for the similarity of every sequence pair between the training datasets and test datasets).

Features and representation
Feature selection and representation are critical to the success of prediction of PPIs. In particular, one of the challenges in sequence-based prediction of virus-host PPIs is to represent two types of proteins of variable lengths as a feature vector of a fixed length. Several encoding schemes have been used to represent protein sequences for predicting PPIs. For instance, Shen et al. [9] clustered 20 amino acids into seven groups, and represented the relative frequency of three consecutive amino acids (referred to as an 'amino acid triplet') in a protein sequence using the classification. In our previous work [5], we redefined the relative frequency of an amino acid triplet using six groups of amino acids. However, both Shen's representation and ours generate a feature vector with many zero-valued elements, which lowers the prediction performance. In this study, we used the classification of amino acids into seven groups used by Shen et al. [9] and others [10]. In this classification of amino acids, there are 7 × 7 × 7 = 343 possible amino acid triplets.
For each pair of host and virus proteins, we represent the relative frequency of amino acid triplets (RFAT) as a feature vector with 686 elements (343 for a host protein and 343 for a virus protein). The RFAT of the i-th amino acid triplet is defined by Eq. 1. In the equation, f_i, avgF, and maxF denote the frequency of the i-th amino acid triplet, the average, and the maximum frequency of amino acid triplets in the protein sequence, respectively. Another feature is the frequency difference of amino acid triplets (FDAT) between virus and host proteins, which is defined by Eq. 2. In Eq. 2, f_hi is the frequency of the i-th amino acid triplet in the host protein of the host-virus pair, and f_vi is the frequency of the i-th amino acid triplet in the virus protein of the same host-virus pair. avgFD and maxFD denote the average and the maximum frequency difference of amino acid triplets in a host-virus pair, respectively.
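As a minimal sketch of the triplet counting that RFAT and FDAT build on, the Python code below maps each residue to one of seven groups and counts the 343 possible triplets in a host and a virus protein. It assumes that the seven-group classification is the grouping of Shen et al. [9] ({A,G,V}, {I,L,F,P}, {Y,M,T,S}, {H,N,Q,W}, {R,K}, {D,E}, {C}); the function name `triplet_frequencies` and the two toy sequences are hypothetical. The sketch stops at the raw quantities f_i, avgF and maxF and does not apply the normalizations of Eq. 1 and Eq. 2.

```python
from itertools import product

# Assumed seven-group classification (grouping of Shen et al. [9]).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: g for g, members in enumerate(GROUPS) for aa in members}

# All 7 x 7 x 7 = 343 triplets of group indices, in a fixed order.
TRIPLETS = list(product(range(len(GROUPS)), repeat=3))
TRIPLET_INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_frequencies(sequence):
    """Count each amino acid triplet (three consecutive residues, by group)."""
    groups = [AA_TO_GROUP.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIPLETS)
    for triplet in zip(groups, groups[1:], groups[2:]):
        if None not in triplet:  # skip windows with non-standard residues
            counts[TRIPLET_INDEX[triplet]] += 1
    return counts

# Toy sequences, for illustration only.
host_f = triplet_frequencies("MKVLAAGIC")
virus_f = triplet_frequencies("MDEKRC")

avg_f = sum(host_f) / len(host_f)  # avgF of the host protein
max_f = max(host_f)                # maxF of the host protein

# Host block followed by virus block: 343 + 343 = 686 raw counts.
pair_counts = host_f + virus_f
```

Fixing the order of `TRIPLETS` once is what makes feature vectors of different protein pairs comparable element by element.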
We also represent amino acid composition (AC) in each pair of host and virus proteins (Eq. 3). AC i is the frequency of the i-th amino acid present in a host-virus pair divided by the maximum frequency of an amino acid in the pair.
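The AC feature as described in words above can be sketched in Python as follows. This is an illustration rather than the authors' implementation: it assumes that the frequency of an amino acid "in a host-virus pair" is its count over the host and virus sequences taken together, and the function name and toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def amino_acid_composition(host_seq, virus_seq):
    """AC_i = count of amino acid i in the pair / largest such count."""
    # Assumption: counts are pooled over both sequences of the pair.
    counts = Counter((host_seq + virus_seq).upper())
    freqs = [counts[aa] for aa in AMINO_ACIDS]
    max_freq = max(freqs)
    if max_freq == 0:  # no standard residue at all
        return [0.0] * len(AMINO_ACIDS)
    return [f / max_freq for f in freqs]

# Toy sequences, for illustration only.
ac = amino_acid_composition("MKVLAAGIC", "MDEKRC")
```

Dividing by the maximum count scales every element into the range from 0 to 1, with the most frequent amino acid of the pair mapped to 1.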
The above three features, RFAT, FDAT and AC were developed in our previous study for inter-species PPIs of a single type [11]. However, the previous study used a different classification of amino acids and computed the average and the maximum frequency from all proteins in a dataset instead of a single protein being encoded.
As additional features, we used composition, transition and distribution of amino acid groups [10]. Composition represents the normalized frequency of each amino acid group in the protein sequence. Transition represents the normalized frequency of transitions between amino acid groups in the protein sequence. Distribution is the normalized position of the first, 25%, 50%, 75% and 100% amino acids of each amino acid group in the protein sequence. Figure 3 shows an example of a feature vector for a pair of host and virus proteins.
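The composition, transition and distribution descriptors can be sketched for a single protein as below. Several details are assumptions made for illustration and are not taken from this study: the seven groups are those of Shen et al. [9], transitions are counted between unordered pairs of distinct groups and normalized by the number of adjacent residue pairs, and the 25%, 50%, 75% and 100% occurrences are located by rounding up. The function name `ctd_features` is hypothetical.

```python
import math
from itertools import combinations

# Assumed seven-group classification (grouping of Shen et al. [9]).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: g for g, members in enumerate(GROUPS) for aa in members}

def ctd_features(sequence):
    """Return (composition, transition, distribution) for one protein."""
    groups = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    n = len(groups)
    if n < 2:
        raise ValueError("need at least two standard residues")
    n_groups = len(GROUPS)

    # Composition: fraction of residues that belong to each group.
    composition = [groups.count(g) / n for g in range(n_groups)]

    # Transition: fraction of adjacent residue pairs that switch between
    # groups a and b, in either direction (21 unordered pairs).
    adjacent = list(zip(groups, groups[1:]))
    transition = []
    for a, b in combinations(range(n_groups), 2):
        changes = sum(1 for x, y in adjacent if {x, y} == {a, b})
        transition.append(changes / (n - 1))

    # Distribution: relative position of the first, 25%, 50%, 75% and
    # 100% occurrence of each group (zeros if the group is absent).
    distribution = []
    for g in range(n_groups):
        positions = [i + 1 for i, x in enumerate(groups) if x == g]  # 1-based
        if not positions:
            distribution.extend([0.0] * 5)
            continue
        ranks = [1] + [max(1, math.ceil(q * len(positions)))
                       for q in (0.25, 0.5, 0.75, 1.0)]
        distribution.extend(positions[r - 1] / n for r in ranks)

    return composition, transition, distribution

# Toy sequence, for illustration only.
composition, transition, distribution = ctd_features("MKVLAAGIC")
```

With seven groups this yields 7 composition, 21 transition and 35 distribution values per protein under the assumptions above.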

Prediction models of virus-host PPIs
We built several support vector machine (SVM) models using LIBSVM [12] to predict the interactions between virus and host proteins. The radial basis function (RBF) was used as a kernel function for training the SVM models, and the best values of parameters C and γ were found by running the grid search of LIBSVM on training datasets. Unless specified otherwise, the results shown in this paper were obtained with C = 32, γ = 0.03125. The SVM models take a pair of virus and host protein sequences as input. As output, the SVM models classify whether or not the virus protein interacts with the host protein. The SVM models and supporting data are available at http://bclab.inha.ac.kr/VirusHostPPI.
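For reference, the RBF kernel and a power-of-two parameter grid can be written in a few lines of plain Python. This is a sketch and not the LIBSVM code itself: the grid ranges are assumed to be the defaults of LIBSVM's grid search script and are not stated in this paper, and the names `rbf_kernel`, `C_GRID` and `GAMMA_GRID` are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Candidate values as powers of two. The exponent ranges are an assumption
# (taken to be the defaults of LIBSVM's grid search), not from the paper.
C_GRID = [2.0 ** e for e in range(-5, 16, 2)]       # 2^-5, 2^-3, ..., 2^15
GAMMA_GRID = [2.0 ** e for e in range(3, -16, -2)]  # 2^3, 2^1, ..., 2^-15

# Parameter values reported in the text: C = 2^5 and gamma = 2^-5.
C, GAMMA = 32, 0.03125

# Kernel value for two toy feature vectors, for illustration only.
k = rbf_kernel([0.2, 0.0, 1.0], [0.0, 0.5, 1.0], GAMMA)
```

Both reported values lie on such a grid, which is consistent with their having been selected by a grid search over powers of two.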

Performance measures
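The performance of the prediction models was evaluated by several measures: sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC). With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, respectively, the first six measures are defined as follows:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{PPV} = \frac{TP}{TP + FP}$$
$$\text{NPV} = \frac{TN}{TN + FN}$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$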
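As a worked example of the threshold-based measures, the Python function below computes them from the four confusion-matrix counts using their standard definitions. The counts in the usage example are not reported in this paper; they are inferred from the percentages given later for our model on DeNovo's testing set (425 positives at 80.00% sensitivity and 425 negatives at 88.94% specificity). The function name is hypothetical, and AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math

def classification_measures(tp, tn, fp, fn):
    """Standard measures from true/false positive and negative counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # MCC is set to 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

# Counts inferred from the reported percentages (an assumption):
# 0.8000 * 425 = 340 true positives, 378 / 425 = 0.8894 specificity.
m = classification_measures(tp=340, tn=378, fp=47, fn=85)
print(round(m["accuracy"], 4))  # 0.8447
```

The recovered accuracy of 84.47% matches the value reported in the comparison, which supports the inferred counts.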

Results of cross validation
We performed 10-fold cross validation of the SVM model with several datasets that contain different ratios of positive to negative data (1:1, 1:2 and 1:3). Because negative data are selected at random, we constructed three different datasets for each ratio of positive to negative data. Table 3 shows the results of the cross validation. The best performance of the SVM model was observed in the balanced dataset with a 1:1 ratio of positive to negative data. As expected, running the SVM model on unbalanced datasets resulted in lower performance than running it on the balanced dataset. Datasets are available at http://bclab.inha.ac.kr/VirusHostPPI.
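The dataset construction for this experiment can be sketched as follows: for each ratio, negatives are drawn at random three times, and each resulting dataset is then split into 10 folds. This is a minimal illustration with toy stand-ins for the data; the function names, the seeds and the plain (non-stratified) fold assignment are assumptions, not details taken from this study.

```python
import random

def sample_dataset(positives, negative_pool, ratio, seed):
    """Pair all positives with `ratio` times as many random negatives."""
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, ratio * len(positives))
    return [(x, 1) for x in positives] + [(x, 0) for x in negatives]

def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# Toy stand-ins for positive PPIs and the pool of candidate negatives.
positives = [f"pos{i}" for i in range(20)]
negative_pool = [f"neg{i}" for i in range(200)]

# Three random datasets for each ratio 1:1, 1:2 and 1:3.
datasets = {
    (ratio, seed): sample_dataset(positives, negative_pool, ratio, seed)
    for ratio in (1, 2, 3)
    for seed in (0, 1, 2)
}

splits = list(k_fold_splits(len(datasets[(1, 0)])))
```

Repeating the draw with different seeds gives a rough measure of how much the reported performance depends on the particular negatives that were sampled.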
We also examined the contribution of features to the prediction performance of our SVM model. Table 4 compares different combinations of features in 10-fold cross validation of the SVM model with the 1:1 dataset of Table 3. Among the single features, RFAT was better than the others (i.e., FDAT, AC, composition, transition, and distribution) in all performance measures. With RFAT alone, the SVM model achieved an accuracy above 83% and an MCC above 0.668, which indicates that RFAT is a very powerful feature in predicting virus-host PPIs. Nevertheless, a performance gain was obtained when it was used in combination with other features. For example, using the three features RFAT, FDAT and AC showed a better performance than using RFAT alone. The best performance of the SVM model was observed when all six features were used.
Table 5 shows the results of testing the prediction models on two independent datasets of PPIs of H1N1 and Ebola virus, which were not used in training the models. As discussed earlier, proteins of H1N1 virus have a sequence similarity of 9.6% to those of other viruses, and proteins of Ebola virus have a sequence similarity of 10.9% to those of other viruses on average. Despite such a low sequence similarity of proteins in test datasets to those in training datasets, all prediction models trained with TR1-TR4 showed a relatively high performance in independent testing. Prediction models trained with host-virus PPIs (TR3 and TR4) showed a slightly better performance than those trained with human-virus PPIs (TR1 and TR2) for both H1N1 and Ebola viruses. The models showed a higher sensitivity for

Applying the prediction model to new hosts
To examine the applicability of our prediction model to new hosts, we tested it on PPIs of viruses with new hosts, which were not used in training the model. As described earlier, the model trained with human-virus PPIs was tested on PPIs of viruses with non-human hosts (i.e., non-human animal, plant and bacteria). As shown earlier in Table 2, the average sequence similarity of human proteins to non-human animal, plant, and bacteria proteins is 10.7%, 10.6%, and 10.4%, respectively. Despite the low sequence similarity, tests of the model on new hosts showed a reasonably good performance (Table 6), but its performance for new hosts was slightly lower than that for new viruses.
The difference seems attributable to the difference in the number of target proteins in test datasets and to the difference in the number of partner proteins of the target proteins, which are shared by training and test datasets. Test datasets TS1 and TS2 have 381 interactions of 11 H1N1 virus proteins and 150 interactions of 3 Ebola virus proteins with human proteins, respectively (Fig. 1 and Table 2). Test datasets TS5.1, TS5. [...] (Table 2).
On average, a test dataset for new viruses has (381 + 150)/2 ≈ 266 PPIs and a test dataset for new hosts has (488 + 17 + 143)/3 = 216 PPIs. Thus, the difference in the average number of PPIs of the two types of test datasets is not large. However, there is a big difference in the number of target proteins in the test datasets and in the number of proteins common to training and test datasets. The average number of virus proteins in a test dataset for new viruses is only (11 + 3)/2 = 7, whereas the average number of host proteins in a test dataset for new hosts is (368 + 13 + 106)/3 ≈ 162. Thus, virus-host PPIs in the test datasets for new viruses share many host proteins with the training datasets (248 host proteins common to TR1 and TS1, 129 host proteins common to TR2 and TS2, 248 host proteins common to TR3 and TS1, and 129 host proteins common to TR4 and TS2) even though no virus proteins are shared by the test and the training datasets. In contrast, virus-host PPIs in the test datasets for new hosts share a much smaller number of virus proteins with the training datasets (85 virus proteins common to TR5 and TS5.1, 0 common to TR5 and TS5.2, 2 virus [...]).
This is a known problem with pair-input methods, which was first reported by Park and Marcotte [13] but is not widely known to researchers. According to their study [13], prediction methods that operate on pairs of objects such as PPIs perform much better for test pairs that share components with a training set than for those that do not. Thus, our prediction model showed a better performance in testing for new viruses, which share more partner proteins (i.e., host proteins) with training datasets, than in testing for new hosts, which share fewer partner proteins (i.e., virus proteins) with training datasets.
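The effect of shared components can be checked for any pair of training and test datasets by counting, for each test pair, whether its host protein or its virus protein also occurs in the training data. The Python sketch below illustrates such a count; the function name, the category labels and the toy pairs are hypothetical and are not data from this study.

```python
def shared_partner_summary(train_pairs, test_pairs):
    """Count test pairs by whether their proteins also occur in training.

    Each pair is a (host_protein, virus_protein) tuple.
    """
    train_hosts = {h for h, _ in train_pairs}
    train_viruses = {v for _, v in train_pairs}
    summary = {"both": 0, "host_only": 0, "virus_only": 0, "neither": 0}
    for host, virus in test_pairs:
        host_seen = host in train_hosts
        virus_seen = virus in train_viruses
        if host_seen and virus_seen:
            summary["both"] += 1
        elif host_seen:
            summary["host_only"] += 1
        elif virus_seen:
            summary["virus_only"] += 1
        else:
            summary["neither"] += 1
    return summary

# Toy pairs: a held-out virus (v9) whose host partners are mostly known.
train = [("h1", "v1"), ("h2", "v1"), ("h3", "v2")]
test_new_virus = [("h1", "v9"), ("h2", "v9"), ("h7", "v9")]
summary = shared_partner_summary(train, test_new_virus)
```

In this toy case two of the three test pairs fall in `host_only`, which mirrors the situation of the test datasets for new viruses, where many host proteins are shared with the training datasets.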

Comparison to other methods
We compared our method with two other methods, DeNovo [6] and Barman's method [14], using their datasets. For comparison with DeNovo's SVM model, we tested our SVM model on DeNovo's SLiM testing set, which contains 425 positive and 425 negative PPIs (Supplementary file S12 used in DeNovo's study ST6). While DeNovo's SVM model showed an accuracy of 81.90%, a sensitivity of 80.71% and a specificity of 83.06%, our SVM model achieved an accuracy of 84.47%, a sensitivity of 80.00% and a specificity of 88.94% (Table 7). Our model showed a slightly lower sensitivity but a higher specificity and accuracy. The dataset used for comparison of our SVM model with DeNovo is available at http://bclab.inha.ac.kr/VirusHostPPI.
In Barman's study [14], three machine learning methods (SVM, Naïve Bayes, and Random Forest) were used to predict virus-host PPIs using several features such as domain-domain association in interacting protein pairs and composition of methionine, serine, and valine in viral proteins. In a 5-fold cross validation with virus-host PPIs from VirusMINT [15], their SVM showed higher sensitivity and F1 score than Naïve Bayes and Random Forest. Thus, we tested our SVM model on the same dataset used in Barman's study, which contains 1,035 positive and 1,035 negative interactions between 160 virus proteins of 65 types and 667 human proteins. As shown in Table 8

Conclusion
Most computational methods of predicting PPIs are intended for interactions within a species rather than for interactions across different species such as interactions between virus and host cell proteins. A small number of computational methods which were recently developed for predicting PPIs between virus and host are limited to interactions of single virus or single host, and therefore a separate prediction model is required to predict PPIs of new viruses or hosts. However, proteins of new viruses or hosts often exhibit quite a low sequence similarity to proteins of known viruses or hosts, and little information is available for new viruses or hosts.
In this study, we developed a prediction model of virus-host PPIs, which is applicable to new viruses and hosts. We tested the prediction model on independent datasets of virus-host PPIs, which were not used in training the model and have a very low sequence similarity to any protein in training datasets of the model. Despite a low sequence similarity between proteins in training datasets and target proteins in test datasets, the prediction model showed a high performance comparable to the best performance of other methods for single virus-host PPIs. Our