MHC2MIL: a novel multiple instance learning based method for MHC-II peptide binding prediction by considering peptide flanking region and residue positions

Background Computational prediction of major histocompatibility complex class II (MHC-II) binding peptides can assist researchers in understanding the mechanism of immune systems and developing peptide based vaccines. Although many computational methods have been proposed, the performance of these methods are far from satisfactory. The difficulty of MHC-II peptide binding prediction comes mainly from the large length variation of binding peptides. Methods We develop a novel multiple instance learning based method called MHC2MIL, in order to predict MHC-II binding peptides. We deem each peptide in MHC2MIL as a bag, and some substrings of the peptide as the instances in the bag. Unlike previous multiple instance learning based methods that consider only instances of fixed length 9 (9 amino acids), MHC2MIL is able to deal with instances of both lengths of 9 and 11 (11 amino acids), simultaneously. As such, MHC2MIL incorporates important information in the peptide flanking region. For measuring the distances between different instances, furthermore, MHC2MIL explicitly highlights the amino acids in some important positions. Results Experimental results on a benchmark dataset have shown that, the performance of MHC2MIL is significantly improved by considering the instances of both 9 and 11 amino acids, as well as by emphasizing amino acids at key positions in the instance. The results are consistent with those reported in the literature on MHC-II peptide binding. In addition to five important positions (1, 4, 6, 7 and 9) for HLA(human leukocyte antigen, the name of MHC in Humans) DR peptide binding, we also find that position 2 may play some roles in the binding process. By using 5-fold cross validation on the benchmark dataset, MHC2MIL outperforms two state-of-the-art methods of MHC2SK and NN-align with being statistically significant, on 12 HLA DP and DQ molecules. In addition, it achieves comparable performance with MHC2SK and NN-align on 14 HLA DR molecules. MHC2MIL is freely available at http://datamining-iip.fudan.edu.cn/service/MHC2MIL/index.html.


Background
Major Histocompatibility Complex (MHC) molecules play important roles in adaptive immune response. An important function of MHC molecules is to bind peptide fragments derived from pathogens and to display them on the cell surface for being recognized by appropriate T cells [1]. This stimulates subsequent immune response in order to fight against these pathogens. The MHC gene family is mainly divided into two subgroups: class I (MHC-I) and class II (MHC-II). Peptides presented by MHC-I originate from proteins produced within a cell, while those by MHC-II are from the outside of the cell. T helper cells (one type of T cells), which are activated by the peptides presented by MHC-II molecules, control or help the immune response. Therefore, accurate identification of MHC-II binding peptides is crucial in understanding the mechanism of immune systems, as well as in developing peptide-based vaccines for treating some serious diseases, such as hepatitis, EB virus, autoimmunity, and cancer [2]. Compared to biochemical experiments with high costs in both time and finance, computational methods are more economic in that they can be quickly deployed to select a small number of promising peptides for further biochemical experimental verification.
Computational methods for predicting MHC-I binding peptides are reported to produce considerable results of AUC (Area Under ROC Curve) between 0.85 and 0.95 [2]. This is because the prediction of MHC-I binding peptides is relatively easy, where the binding groove of MHC-I molecules is closed at both ends and the binding peptides usually have a length between 8 and 11 amino acids [1]. However, the prediction of MHC-II binding peptides is more challenging. One main reason for this is that the binding groove of MHC-II molecules is open at both ends, which leads that the MHC-II binding peptides typically vary from 11 to 20 amino acids in length [3]. Usually MHC peptide binding core is of 9 amino acids that fit into 9 pockets of MHC binding groove. Although a number of computational methods has been developed to predict MHC-II binding peptides in the last few years [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18], recent experimental results on various benchmark datasets show that the performance of these methods needs to be improved [2,19]. These computational methods can be divided into two groups: allele-specific and pan-specific methods [2]. In the allele-specific methods, both training and test peptides are for a same MHC molecule. In contrast, panspecific methods can predict the binding peptides of MHC molecules that have very few or even no training data [2,20,21]. In this work, we focus on allele-specific methods that are the basis of pan-specific methods.
According to underlying techniques used, the allelespecific methods can be roughly divided into four groups: position specific score matrix (PSSM) based methods, artificial neural network (ANN) based methods, kernel based methods, and multiple instance learning based methods. Although TEPITOPE [10], ARB [8], CombLib [9], TEPITOPEpan [17] and SMM-align [4] are all PSSM based methods, they differ significantly in the way of generating the score matrices. TEPITOPE is the first PSSM method in which the score matrix is obtained by examining the binding specificities of target MHC molecule in each pocket using biochemical experiments. The PSSM of ARB is modeled by the difference of average binding affinity among 6 residue groups. The CombLib method derives the matrix from a positional scanning combinatorial libraries consisting of 180 peptides for each MHC molecule. TEPITOPEpan extends TEPITOPE by covering all HLA-DR molecules instead of only 51 HLA-DR molecules. In SMM-align, a Metropolis Monte Carlo procedure is employed to search for an optimal score matrix from the binding affinity data [4]. Unlike these PSSM based methods, NN-align [5] is an ANN based method. It first estimates the peptide binding core, then this part and flanking region are co-encoded and input into ANN in order to learn a prediction model. Finally, an ensemble of ANN is used to improve the robustness of the method. To avoid the difficulty of estimating the binding core, several kernel based methods have been proposed to directly measure the similarities among peptides of different lengths. Local alignment (LA) kernel [7] makes use of the alignment method for similarity calculations, while GS [22] and MHC2SK [18] are two alignment free methods that use a series of substrings to measure peptide similarities. One notable difference between GS and MHC2SK is that MHC2SK emphasizes long substrings while GS considers a substring whose length is as short as 1. Several recent methods resort to a multiple instance learning (MIL) framework to handle the length variation of peptides. In this framework, each pepitde is regarded as a bag, and some substrings of the peptide as the instances in the bag. Two MIL based methods, MHCMIR [16] and MultiMHCII [13], have been proposed to tackle the problem of MHC-II peptide binding prediction. Both of them create bags with 9 amino acids (AA) long instances and use Support Vector Regression (SVR) as the predictor. MultiMHCII uses a normalized set kernel as its kernel function and MHCMIR is an adaption of MILES [23].
Although both MHCMIR and MultiMHCII can deal with the length variation of binding peptides, some important peptide information has been ignored in their modeling process. First, in addition to the binding core region (9 amino acids), the peptide flanking region also contributes the binding process. However, both MHCMIR and MutliMHCII consider only instances of 9 AA, where the effect of the peptide flanking region could not be modeled. Second, it has been reported that the amino acids at some specific positions (e.g. 1,4,6,7,9) of the peptide binding core are more crucial for the binding process [10]. This kind of important domain knowledge is not well incorporated into MHCMIR and MultiMHCII for improving the prediction performance. To address these concerns, we develop a novel MHC-II multiple instance learning based method, MHC2MIL, that makes use of these important domain knowledge. Unlike MHCMIR and MultiMHCII, both length 9 and 11 instances are considered in our MHC2MIL so as to integrate useful information in both potential binding cores and peptide flanking region. We use a benchmark data covering 26 HLA DR, DP, and DQ molecules to evaluate the performance of MHC2MIL. Experimental results demonstrate the effectiveness of incorporating peptide flanking region, as well as the importance of some crucial pepitde positions. Interestingly, the experimental results show that position 2 may also play some roles in HLA-DR peptide binding process. Furthermore, the performance of MHC2MIL is compared with two state-of-the-art methods of MHC2SK and NN-align. MHC2MIL outperforms both MHC2SK and NN-align with being statistically significant on 12 HLA DP and DQ molecules. In addition, it achieves the comparable performance to MHC2SK and NN-align on 14 HLA DR molecules.

Data
We use a benchmark dataset (Wang dataset) described in [9] to compare the performance of different computational methods. This dataset is of high quality, which has been widely used in the evaluation of different computational methods. As shown in Table 1, the dataset consists of 24382 peptides with binding affinities covering 26 different HLA DR, DP and DQ molecules. Each HLA molecule has the sufficient number of peptides with binding affinity, ranging from 474 (HLA-DRB1_0404) to 3504 (HLA-DRB1_0101). The dataset has also been divided into five folds with the similar size.

Multiple instance learning
Multiple Instance Learning (MIL) is first introduced in 1997 [24]. Differing from classic learning methods, MIL uses "bags" instead of "instances". A bag consists of one or more instances. Note that a bag is positive if at least one of its instances is positive, and negative, otherwise. We denote B as a set of bags, B i as the ith bag in B, and B ij as the jth instances in B i . Function Label maps a bag or a instance into {-1, +1}. Then the label of a bag is as follows: where i f f . stands for "if and only if". If a bag is positive, we would not know which instance (instances) is (are) positive.
The MIL method of Axis-Parallel Rectangles (APR) [24] tries to find a rectangle that contains at least one instance from each positive bag while no instances from negative bags are included. Unfortunately, this kind of rectangle may not always exist. The famous MIL algorithm of Diverse Density (DD) [25] attempts to find a concept t with the highest "diverse density" instead of a rectangle. Such t is close to positive bags and far from negative bags. If we use B + i to stand for the ith positive bag, and B − i for the ith negative bag, then DD maximizes the equation: Multiple-instance learning via embedded instance selection (MILES) extends DD by considering that the DD framework appears to be rather restrictive because it always seeks for one and only one feature in feature selection. DD can be improved by searching for multiple features [23]. In MILES, all instances from bags are shuffled to a meta-space C for mapping each bag into a feature: where x k is a re-indexed instance from all of the bags and n is the total number of instances. Based on the meta-space, the probability of generating x k from B i is where s(x k , B i ) can be interpreted as a measure of the similarity between x k and B i . The similarity is determined by the closest instance to x k in B i . Then bag B i is mapped to a vector of features: MILES maps each bag into the feature space and applies SVM as the classifier. The essence of MILES is that if examples cannot be separated in the low dimension, they are mapped into high dimension space. If x k has a high similarity to some positive bags and a low similarity to some negative bags, it would provide useful information in the classification.

MHCMIR
Based on the general assumption that the peptide binding core is of 9 AA, MIL can be applied to the prediction of MHC-II binding peptides by creating a bag with 9-long instances. We denote p as an original peptide, p(i) as the ith amino acid in the p, and len(p) as the total number of amino acids. A bag B i corresponding to p is generated as follows: (7) MHCMIR adapts MILES to this problem after creating bags, by (1) replacing SVM with SVR; and (2) using matrix BLOSUM62 to compute the similarity between a bag and an instance [23]. Equation 5 is rewritten as: where dis function is computed as follows: x(i) in Equation 9 is the ith amino acid of instance x. The distance between two 9-long instances is computed by two steps: calculating the sum of the corresponding positions in BLOSUM62, and scaling the value obtained in step one to (0, 1] according to its sign.

MHC2MIL
The critical drawback of MHCMIR is that important domain knowledge, such as the effect of peptide flanking region and the importance of key positions in binding core, are ignored in the modelling. To address this shortcoming, we extends MHCMIR to MHC2MIL by taking into account two facts: (1) A bag can have instances with different lengths; and (2) Key positions of an instance should be emphasized. Specifically, we create bags to include both 9-long and 11-long instances. The assumption is that the information from the left and right amino acids of the 9-long instance is useful, and meta-space is expanded by adding 11-long instances, both of which are benefit to the prediction. In addition, according to related literature on MHC-II peptide binding, positions 1, 4, 6, 7 and 9 are the most influential for binding process. Therefore we use the amino acids at these positions instead of all positions to measure the distance between two instances.
Thus Equation 9 is modified as: where KP 9 = {1, 4, 6, 7, 9} is the key positions for 9long instances and KP 11 = {1, 2, 5, 7, 8, 10, 11} for 11long instances. According to our assumption, an 11-long instance is its middle 9-long instance with the left and right amino acids, so the first, last positions and 1,4,6,7,9 positions of the middle 9-long instance are taken into consideration. For the sake of simplicity, we assign ∞ for the distance between instances with different lengths.

Results and discussion
Experimental procedure and evaluation metric According to [9], Wang's dataset was divided into 5 partitions, and the peptide with the binding affinity of less than 1000 nM was deemed as a binder. We thus validated our model by 5-fold cross validation. For training MHC2MIL, similar to [5], the binding value was transformed by 1 -log(IC50)/log(50, 000), where IC50 is binding affinity measured in nM. libsvm [26] was used for our implementation of SVR in MHC2MIL and MHCMIR. The parameters c and g of SVR were set as the default values where c = 1 and g = 1/#f eatures (#f eatures is the total number of the input features). We implemented MHC2SK according to the related paper [18]. AUC was used as the evaluation metric to compare different models. In addition, for comparing two predictors, we used one-tailed per-allele binomial test to measure their performance differences, where pvalue < 0.05 is considered to being statistically significant. In the following, we denote MHC2MIL as MHC2MIL(fl) that considers instances of flexible length (both 9-long and 11-long instances), and MHC2MIL as MHC2MIL(fl+kp) that considers both instances of flexible length and key positions.

Effect of adding 11-long instances
First, we examined the effect of incorporating both 9-long and 11-long instances into bags. Comparisons were done among MHCMIR(9), MHCMIR (11), and MHC2MIL(fl), where MHCMIR(n) stands for filling bags with n AA long instances. The results are given in Table 2. In all 26 alleles, the best performance is achieved by MHCMIR(9) on 2 alleles, MHCMIR(11) on 7 alleles, and MHC2MIL(fl) on 20 alleles. MHC2MIL(fl) achieved the highest average AUC of 0.777, outperforming both MHCMIR(9) (AUC of 0.770) and MHCMIR(11) (AUC of 0.771) with being statistically significant (binomial test, pvalue < 0.05). From this we can see that addition of 11-long instances into bags provides useful information to improve the prediction performance. This is consistent with previous studies in that not only the binding core but also peptide flanking region contribute to the binding process.

Effect of first position limitation on HLA DR
Second, we explored the effect of adding first position limitation on generating instances for HLA DR molecules. Previous work suggests that a 9-long peptide in HLA DR could become a binding core only when aliphatic (I, L, M, V) and aromatic (F, W, Y) amino acids appear in its first position [10]. We can reduce the instances in a bag by adding a constraint that the first position of the instance should be in (I, L, M, V, F, W, Y). The comparison results are shown in Table 3, where MHC2MIL(fl+fpl) denotes the method that creates bags by utilizing first position limitation based on MHC2MIL (fl). Compared with MHC2MIL(fl), MHC2MIL(fl+fpl) performed worse in all alleles. The average AUC was reduced greatly from 0.777 to 0.734. This suggests that reducing too many negative instances may lose some important information for discriminating binders from non-binders.

Effect of utilizing key positions
Finally, we examined the effect of emphasizing some key positions of instances. According to Equation 11, MHC2MIL(fl+kp) computes the distance between two instances based on only key positions. Table 4     Comparison with other allele-specific methods In this section, we present the performance comparisons of MHC2MIL with other allele-specific methods against Wang's dataset using five fold cross validation. The results on HLA DP and DQ are shown in Table 6, and those on HLA DR in Table 7. For HLA DR, position 2 is also deemed as a key position in the implementation of MHC2MIL. There are total 7 allele specific methods of ARB, SMM-align, CombLib, TEPITOPE, MHC2SK, NN-align, and MHC2MIL. CombLib is only for HLA DP and DQ, while TEPITOPE is only for HLA DR.
Since we used the same dataset, partition, and experimental procedure, the experimental results of ARB, SMM-align, CombLib, TEPITOPE and NN-align were directly taken from [9]. As shown in their paper [13], MultiMHCII's performance is as low as TEPITOPE, so we did not compare this method. As given in Table 6, MHC2MIL, MHC2SK and NN-align are three best prediction methods for HLA DP and DQ molecules with the highest average AUC of 0.794, 0.783 and 0.783, respectively. In particular, MHC2MIL outperformed NN-align in 9 out of 11 alleles, and MHC2SK in 9 out of 12 alleles, both of which are being statistically significant (binomial test, pvalue < 0.05). For HLA DR molecules, as shown in Table 7, MHC2MIL, MHC2SK and NN-align are also the three best prediction methods with the highest average AUC of 0.780, 0.780 and 0.776 respectively. Specifically, MHC2MIL outperformed NNalign in 9 out of 14 alleles, and MHC2SK in 5 out of 14 alleles (with one tie). Overall, MHC2MIL is the best prediction method for HLA DP and DQ molecules, and achieved comparable performance with two state-of-theart methods of MHC2SK and NN-align on HLA DR molecules.
As shown in Table 8, we present the prediction binding affinities (IC50nM) of 5 peptides to HLA-DQA1_0501-DQB1_0201 in Table 8 by MHC2MIL, MHC2SK and NN-align, respectively. For all these 5 peptides, MHC2MIL achieved the closest binding affinity, compared with MHC2SK and NN-align. For example, for peptide "SVLLVVALFAVFLGS" of binding affinity of 748.1nM, the predicted affinity by MHC2MIL is 745nM, while the predicted values by MH2SK and NN-align are 1072.4nM and 2064.6nM, respectively. All these results demonstrate the effectiveness of MHC2MIL on predicting MHC-II binding peptides.

Discussion
Although multiple instance learning has been already applied in MHCMIR and MultiMHCII, the domain knowledge in MHC-II peptide binding is not incorporated very well. MHCMIR and MultiMHCII consider only 9 length instances and MHCMIL treats 9 amino acids in each instances equally. In contrast, MHC2MIL incorporates both length 9 and 11 instances to measure the effect of flanking regions. In addition, only the amino acids in the key positions of the instance are used to measure the distance between two instances in MHC2MIL. It is not surprising that MHC2MIL outperformed MHCMIR on the benchmark dataset, with being statistically significant.
In fact, previous studies show that the performance of MHCMIR is comparable with SMM-align [16], and Multi-MHCII with TEPITOPE [13]. In contrast, our experimental results demonstrate that, MHC2MIL is the best performed method in HLA DP and DQ molecules out of all six well-known computational methods, achieving comparable performance on HLA DR molecules with two state-of-the-art computational methods of NN-align and MHC2SK. All these indicate that incorporating domain knowledge into the algorithm design is able to greatly boost the prediction accuracy. On the other hand, some interesting knowledge can be captured from the data. In this work, we find that, for HLA DR molecules, in addition to 5 well-known key positions (1, 4, 6, 7 and 9), position 2 may also play some roles in MHC-II pepitde binding process. In fact, according to the position specific scoring matrix provided by TEPITOPE, both positions 2 and 3 contribute slightly to the HLA DR binding process. More experimental studies should be carried out to elucidate the role of these two positions.
Our experimental results suggest that, for the MHC-II pepitde binding prediction, MHC2MIL, MH2SK and NNalign are the three best predicting methods. Differing from NN-align using multiple neural network ensembles, both MHC2SK and MHC2MIL use a single classifier. Moreover, the underlying principles of these algorithms are quite different. For NN-align, the binding core is estimated and encoded with peptide flanking region as a vector. Neither MHC2MIL nor MHC2SK needs to estimate the binding core. MHC2SK relies on string kernel based methods, where short substrings of varying sizes are used to measure the similarity between two peptides. MHC2MIL resorts to multiple instance learning techniques to map each peptide into a common metaspace. It is obvious that these methods are complement to each other, so that ensemble techniques could be used to further enhance the prediction performance [27,28].

Conclusion
In this paper, we have presented a novel MIL algorithm called MHC2MIL that predicts MHC-II binding peptides by creating bags of flexible instances and utilizing key positions to calculate the similarity. MHC2MIL achieves competitive results with two stateof-the-art computational approaches of MHC2SK and NN-align. Considering the high diversity of MHC-II molecules, we would extend MHC2MIL to be a panspecific algorithm in order to cover many more MHC-II molecules.