Datasets
We obtained two benchmark datasets of linear B-cell epitopes and non-epitopes from EL-Manzalawy et al. [11] (henceforth the EL-Manzalawy dataset) and Chen et al. [10] (the Chen dataset). The EL-Manzalawy dataset contained 701 unique, homology-reduced epitopes of five peptide lengths (12-, 14-, 16-, 18- and 20-mers), and the Chen dataset contained 872 unique 20-mer epitopes. In both datasets, the authors provided equal numbers of non-epitope sequences, generated by randomly extracting peptides from sequences in the UniProt database [22] while ensuring that none of them appeared among the epitopes.
We constructed an analysis subset containing a pool of 20-mer epitopes and an equal-sized pool of 20-mer non-epitopes from the EL-Manzalawy dataset; an analysis subset was constructed from the Chen dataset in the same manner. To further validate the analyses, we constructed two control subsets from the EL-Manzalawy and Chen datasets respectively; each control subset contains only non-epitope data, divided into two pools of equal size. Thus, for the EL-Manzalawy dataset, the analysis subset comprised a pool of 351 epitope and a pool of 350 non-epitope sequences, while the control subset contained two pools of non-epitopes with 350 sequences each. For the Chen dataset, the analysis subset comprised a pool of 436 epitope and a pool of 436 non-epitope sequences, while the control subset contained two pools of 436 non-epitopes each.
For SVM training and testing, peptides from the EL-Manzalawy dataset (12- to 20-mers) were divided into a training set of 601 epitopes/601 non-epitopes and an independent test set of 100 epitopes/100 non-epitopes. Peptides from the Chen dataset (20-mers) were divided into a training set of 736 epitopes/736 non-epitopes and an independent test set of 100 epitopes/100 non-epitopes.
Relative position-specific amino acid propensities
The relative position-specific amino acid propensity, Px, of an amino acid is a quantitative indicator of its propensity to occur at a particular position on the epitope. It is defined as the ratio of the frequency of occurrence of the amino acid at a specific position in the epitope pool to the frequency of occurrence of the same amino acid at that position in the non-epitope pool. As 20-mer peptides were used, Px values were calculated for every amino acid at each of the twenty residue positions and visualized on heat maps. Px values were calculated using the epitope and non-epitope pools in the analysis subsets, and using the two pools of non-epitopes in the control subsets. The average Px of a specific amino acid was calculated by averaging its Px values across all positions of the 20-mer peptides.
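The Px calculation described above can be sketched as follows. This is a minimal illustration, not the authors' code; the small pseudocount used to avoid division by zero is an assumption not specified in the text.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_propensities(epitopes, non_epitopes, length=20, pseudocount=1e-6):
    """Px[a, j]: frequency of amino acid a at position j in the epitope pool
    divided by its frequency at position j in the non-epitope pool.
    A pseudocount (an assumption here) avoids division by zero."""
    def freq_matrix(pool):
        counts = np.zeros((len(AMINO_ACIDS), length))
        for pep in pool:
            for j, aa in enumerate(pep[:length]):
                counts[AMINO_ACIDS.index(aa), j] += 1
        return counts / max(len(pool), 1)

    fe = freq_matrix(epitopes)      # position-specific frequencies, epitope pool
    fn = freq_matrix(non_epitopes)  # position-specific frequencies, non-epitope pool
    return (fe + pseudocount) / (fn + pseudocount)

def average_px(px, aa):
    """Average Px of one amino acid across all 20 positions."""
    return px[AMINO_ACIDS.index(aa)].mean()
```

The resulting 20 × 20 matrix maps directly onto the heat-map visualization: one row per amino acid, one column per residue position.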
Vector encoding schemes
To encapsulate sequence information in a format suitable for SVM training and testing, the sequences were coded as input vectors in simple binary format or in a bi-profile manner using Bayes Feature Extraction. In simple binary encoding, each amino acid is represented by a 20-dimensional vector whose elements are either zero or one. For example, alanine was represented as [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1] and cysteine as [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]. A 20-mer peptide is therefore encoded by a 400-dimensional vector (20 × 20). For details on bi-profile vector encoding using Bayes Feature Extraction, readers are advised to refer to Shao et al. [19]. Briefly, feature vectors are encoded in a bi-profile manner containing attributes from positive and negative position-specific profiles. These profiles are generated by calculating the frequency of occurrence of each amino acid at each position of the peptide sequence in the epitope pool and the non-epitope pool respectively. A 20-mer input peptide is therefore encoded by a 40-dimensional (20 × 2) feature vector containing information on the residues of the peptide in the positive (epitope) and negative (non-epitope) spaces.
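Both encoding schemes can be sketched as below. The ordering of the indicator positions in the binary scheme (here alphabetical) is a convention assumed for illustration, and the profile lookup tables are shown as plain dictionaries rather than the authors' actual data structures.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical ordering is an assumption

def binary_encode(peptide):
    """Simple binary encoding: each residue becomes a 20-D indicator
    vector, so a 20-mer yields a 400-dimensional vector."""
    vec = np.zeros(len(peptide) * 20)
    for j, aa in enumerate(peptide):
        vec[j * 20 + AA.index(aa)] = 1.0
    return vec

def bfe_encode(peptide, pos_profile, neg_profile):
    """Bi-profile encoding in the spirit of Bayes Feature Extraction:
    for each position j, look up the frequency of the observed residue in
    the positive (epitope) and negative (non-epitope) position-specific
    profiles, giving a 2L-dimensional vector (40-D for a 20-mer)."""
    pos = [pos_profile[j][aa] for j, aa in enumerate(peptide)]
    neg = [neg_profile[j][aa] for j, aa in enumerate(peptide)]
    return np.array(pos + neg)
```

Here `pos_profile` and `neg_profile` are lists (one entry per position) of amino-acid-to-frequency mappings computed from the epitope and non-epitope training pools.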
SVM Training and Testing
For SVM implementation, we used the freely downloadable LIBSVM package by Chang and Lin [23]. Details of the SVM methodology can be obtained from the article by Burges [24]. In short, SVM is based on the structural risk minimization principle from statistical learning theory. A set of positive and negative examples can be represented by the feature vectors x_i (i = 1, 2, ..., N) with corresponding labels y_i ∈ {+1, −1}. To classify the data, the SVM trains a classifier by mapping the input samples, using a kernel function in most cases, onto a high-dimensional space, and then seeking a separating hyperplane that differentiates the two classes with maximal margin and minimal error. The decision function for new predictions on unseen examples is given as:

f(x) = sgn( Σ_i α_i y_i K(x_i·x) + b )

where K(x_i·x_j) is the kernel function, and the parameters α_i are determined by maximizing the following:

W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i·x_j)

under the conditions,

0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
The variable C serves as the regularization parameter that controls the trade-off between margin and classification error. We used the radial basis function (RBF) kernel and performed parameter optimization for γ, which determines the capacity of the RBF kernel, and the regularization parameter C using 10-fold cross-validation on the EL-Manzalawy and Chen training sets (the optimization process is further described in Additional file 2). In 10-fold cross-validation, the training dataset was split into ten subsets; one subset was held out as the test set while the remaining subsets were used for training the classifier, and the trained classifier was then evaluated on the held-out subset. The process was repeated ten times with a different subset held out each time, ensuring that every subset was used for both training and testing. The optimal values of γ and C obtained from the optimization process were subsequently used to train on the entire training sets, creating the final SVM classifiers for evaluation on the independent test sets.
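The grid search over C and γ with 10-fold cross-validation can be sketched as follows. The authors used the LIBSVM package directly; here scikit-learn's `SVC` (which wraps LIBSVM) is used for brevity, the data are random placeholders, and the exponential grid ranges are assumptions (the authors' actual ranges are in Additional file 2).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the encoded peptide vectors
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))        # e.g. 40-D bi-profile vectors
y = np.array([0, 1] * 60)             # balanced epitope/non-epitope labels

# Exponential grids for C and gamma are common LIBSVM practice;
# these specific ranges are illustrative assumptions.
grid = {"C": 2.0 ** np.arange(-5, 6, 2), "gamma": 2.0 ** np.arange(-7, 2, 2)}

# 10-fold cross-validated grid search, then retrain on the full
# training set with the optimal parameters (as described in the text)
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10, scoring="accuracy")
search.fit(X, y)
final_clf = SVC(kernel="rbf", **search.best_params_).fit(X, y)
```

The final classifier is then evaluated once on the independent test set, which never participates in the cross-validation.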
Performance metrics
Various quantitative variables were employed to measure the effectiveness of the SVM model for predicting linear B-cell epitopes:
(i) TP, true positives - the number of correctly classified epitopes.
(ii) FP, false positives - the number of incorrectly classified non-epitopes.
(iii) TN, true negatives - the number of correctly classified non-epitopes.
(iv) FN, false negatives - the number of incorrectly classified epitopes.
Using the variables above, the metrics Sensitivity (Sn) and Specificity (Sp), which indicate the ability of the prediction model to correctly classify the epitope and non-epitope sequences respectively, were computed:

Sn = TP / (TP + FN)

Sp = TN / (TN + FP)

To provide an indication of the overall performance of the prediction model, we computed Accuracy (Acc):

Acc = (TP + TN) / (TP + TN + FP + FN)
While these metrics are generally indicative of model performance, they are dependent on the decision threshold. Therefore, a threshold-independent metric, the area under the receiver operating characteristic curve (AROC) was computed as well.
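The threshold-dependent metrics follow directly from the four counts defined above; a minimal sketch (labels assumed to be 1 for epitope, 0 for non-epitope):

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels
    (1 = epitope, 0 = non-epitope)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity
    acc = (tp + tn) / len(y_true)             # accuracy
    return sn, sp, acc
```

The AROC, by contrast, is computed from the continuous SVM output scores rather than thresholded labels, e.g. by sweeping the decision threshold and integrating the ROC curve.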
Residue epitope propensities and performance evaluation on Pellequer dataset
We define ‘residue epitope propensity’, Ep, as a quantitative measure of the likelihood of a residue being part of an epitope. Ep scores were computed to evaluate the effectiveness of the prediction model on antigens with known epitopes. Fourteen antigens with experimentally verified epitopes from a dataset derived from Pellequer et al. [2, 7] were scanned with a 20-mer sliding window and linear B-cell epitopes were predicted using the BFE-SVM20 classifier. SVM output scores from the 20-mer sliding windows were integrated and re-computed as a residue epitope propensity score, Ep, for each residue of the antigen. Here, the Ep of a residue is computed as the sum of the SVM output scores from the predictions of all 20-mer sliding windows that cover the residue, i.e. with the residue located at different positions along the sliding window. Ep values were calculated for residues in all antigen sequences from the 20th residue up to the 20th residue from the end. To measure its effectiveness as a predictive metric, Ep was benchmarked against the annotated epitopes on the antigens. Residues with Ep > 0 and annotated as epitope were assigned as true positives (TP), while residues with Ep ≤ 0 and not annotated as epitope were assigned as true negatives (TN). Residues with Ep ≤ 0 and annotated as epitope were assigned as false negatives (FN), while residues with Ep > 0 and annotated as non-epitope were assigned as false positives (FP). To measure prediction performance, accuracy scores were computed for each antigen as described earlier, using the TP, TN, FN and FP variables.
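The sliding-window aggregation into per-residue Ep scores can be sketched as below, assuming `scores[k]` holds the SVM output for the window starting at residue k (0-based); the function name and interface are illustrative.

```python
def residue_epitope_propensity(scores, window=20, seq_len=None):
    """Ep of residue r is the additive sum of the SVM output scores of
    every sliding window that covers r. Residues near the termini are
    covered by fewer than `window` windows, which is why the text
    restricts Ep to residues away from both ends of the antigen."""
    if seq_len is None:
        seq_len = len(scores) + window - 1
    ep = [0.0] * seq_len
    for k, score in enumerate(scores):
        for r in range(k, k + window):  # residues covered by window k
            ep[r] += score
    return ep
```

Thresholding these Ep values at zero then yields the TP/TN/FN/FP assignments described above.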