HVR1 sequence data
Sequences of the intra-host HVR1 variants (n = 15,041) sampled from 301 HCV-infected patients diagnosed with chronic (n = 123) or recent (n = 178) infection –patients infected for more than 1 year or less than a year, respectively– were described in our previous study [4]. The four nucleotide (nt) bases (A, G, U and C) present in the HVR1 of HCV RNA genomes were converted to the corresponding DNA format (A, G, T and C) because of the greater availability of PhyChem properties for the DNA-specific base T than for the RNA-specific U.
For statistical and classification tests, the data were divided into two datasets (training/testing). Sequences of intra-host HVR1 variants (n = 5681) derived from 222 persons (R, n = 124; C, n = 98) were used for training of the classifier, while remainder of the data (n = 9360) from 79 persons (R, n = 54; C, n = 25) were used for testing of the classifier. HVR1 variants comprising the training and test datasets were represented as feature vectors of 148 PhyChem indexes and assigned to the R or C class based on the R/C infection status of the corresponding patient.
To examine effects of data randomization on performance of RBFNN classifier, five training datasets were generated from the HVR1 sequence data, where instances in each dataset were randomly shuffled using different randomization seeds. In addition, to account for the possibility of random correlations in data, four random datasets were generated from the training dataset, each generated by randomly class-labeling the instances using different randomization seeds.
PhyChem features
The PhyChem indices of DNA nt dimers used to generate feature vectors representing the PhyChem features of HVR1 variants were derived from [6, 7]. Correlation measures for the same PhyChem index between two nt dimers separated by a distance (Lag) along the sequence were calculated using the following equation (described in [8]):
$$ \mathit{\mathsf{DAC}}\left(\mathit{\mathsf{u}},\mathit{\mathsf{L}\mathsf{ag}}\right)=\sum \limits_{\mathit{\mathsf{i}}=\mathsf{1}}^{\mathit{\mathsf{L}}-\mathit{\mathsf{L}\mathsf{ag}}-\mathsf{1}}\left({\mathit{\mathsf{P}}}_{\mathit{\mathsf{u}}}\ \left({\mathit{\mathsf{R}}}_{\mathit{\mathsf{i}}}{\mathit{\mathsf{R}}}_{\mathit{\mathsf{i}}+\mathsf{1}}\right)-{\overline{\mathit{\mathsf{P}}}}_{\mathit{\mathsf{u}}}\right)\left({\mathit{\mathsf{P}}}_{\mathit{\mathsf{u}}}\ \left({\mathit{\mathsf{R}}}_{\mathit{\mathsf{i}}+\mathit{\mathsf{L}\mathsf{ag}}}{\mathit{\mathsf{R}}}_{\mathit{\mathsf{i}}+\mathit{\mathsf{L}\mathsf{ag}}+\mathsf{1}}\right)-{\overline{\mathit{\mathsf{P}}}}_{\mathit{\mathsf{u}}}\right)/\left(\mathit{\mathsf{L}}-\mathit{\mathsf{L}\mathsf{ag}}-\mathsf{1}\right) $$
where u is a PhyChem index, L is the length of the HVR1 sequence, (R
i
R
i + 1) term is the numerical value of PhyChem index u for the Nt dimer R
i
R
i + 1 at position i, and \( {\overline{P}}_u \) is the average value of the PhyChem index u along the HVR1 sequence, which is calculated as follows:
$$ {\overline{\mathit{\mathsf{P}}}}_{\mathit{\mathsf{u}}}=\sum \limits_{\mathit{\mathsf{j}}=\mathsf{1}}^{\mathit{\mathsf{L}}-\mathsf{1}}{\mathit{\mathsf{P}}}_{\mathit{\mathsf{u}}}\ \left({\mathit{\mathsf{R}}}_{\mathit{\mathsf{j}}}{\mathit{\mathsf{R}}}_{\mathit{\mathsf{j}}+\mathsf{1}}\right)/\left(\mathit{\mathsf{L}}-\mathsf{1}\right) $$
Calculations were performed as implemented in the Pse-in-One software (v1.0.3, 2015–08-21 dev) [8], and done in a manner so that length of the PhyChem feature vector is N*Lag, where N is the number of DNA PhyChem indices (N = 148) and Lag = 1.
Comparative analysis of the HVR1 PhyChem variants
The HVR1 PhyChem variants derived from sequences of intra-host HVR1 variants from chronically infected patients were compared with PhyChem profiles of variants derived from recently infected patients. We examined the differences between the population means for a given PhyChem index of HVR1 variants sampled from acute and chronic patients. To illustrate differences in binned plots, values for the same PhyChem index between two contiguous nt dimers were binned into equal-width bins (threshold = 0.006). Statistical analysis of differences in means of nt frequencies between the R/C patient-derived HVR1 variants were also conducted. In addition, differences in the PhyChem properties between HVR1 PhyChem variants were examined by the multi-dimensional scaling (MDS) technique as implement in [9]. Briefly, the MDS algorithm iteratively moves the points around in a kind of simulation of a physical model, where there is a force pushing them apart or together. A Euclidean distance matrix was computed to represent the spacing of the HVR1 PhyChem variants comprising the training dataset in Euclidean space. The two-dimensional MDS projection was initialized by randomizing the positions of the instances (or points). Sammon stress [10] was used as the stress function to define how the difference between the desired and the actual distance between points translates into the forces acting on the points.
RBFNN classifier and classification schemes
RBFNN classifier model
A machine-learning approach based on feed-forward neural networks (FFNNs) was used to examine the practical significance of DAC-based PhyChem features generated from sequences of HVR1 variants for developing computer applications for the R/C assessment. We implemented the Gaussian RBFNN classifier technique as described in [11]. Briefly, the RBFNN is a type of FFNN that uses a Gaussian radial basis function and consists of units divided into three layers: an input layer, a hidden (or radial basis) layer and an output layer (the linear model). The hidden layer of such types of networks are commonly trained using unsupervised learning by k-means clustering and the output layer using supervised learning by logistic regression (for classification tasks) or by linear regression (for regression tasks). For either task, penalized squared error, using a quadratic penalty on the non-bias weights in the output layer, is used as the loss function to find the model’s parameters.
The constructed RBFNN classifier had 2 output units (one output unit per class of infection durations), and the learned model for the lth output unit (i.e., class value) is described by the follow formula:
$$ {f}_l\left({x}_1,{x}_2,,\dots, {x}_m\right)=g\left({w}_{l,0}+\sum \limits_{i=1}^b{w}_{l,i}\exp \left(-\sum \limits_{j=1}^m\frac{a_j^2\ {\left({x}_j-{c}_{i,j}\right)}^2}{2{\sigma}_{i,j}^2}\right)\right) $$
where x
1, x
2, …, x
m
is the feature vector for the HVR1 PhyChem variant concerned, the activation function g(.) is the logistic function, b is the number of basis functions, w
i
is the weight for each basis function, \( {a}_j^2 \) is the weight of the jth feature, and c
i, j
and \( {\sigma}_{i,j}^2 \) are the basis function centers and variances, respectively.
Settings for the parameters w
(l, )i
, \( {a}_j^2 \), c
i, j
and \( {\sigma}_{i,j}^2 \) were established by finding a local minimum of the penalized squared error on the training dataset using the following error function:
$$ {L}_{SSE}=\left(\frac{1}{2}\sum \limits_{i=1}^n\sum \limits_{l=1}^k{\left({y}_{i,l}-{f}_l\left({\overrightarrow{x}}_i\ \right)\right)}^2\right)+\left(\lambda \sum \limits_{l=1}^k\sum \limits_{i=1}^b{w}_{l,i}^2\right) $$
where k classes = 2, y
i
is the class value for training instance \( {\overrightarrow{x}}_i \), the first sum ranges over all n instances in the training dataset and λ is the ridge parameter establishing the size of the penalty on the weights to control overfitting.
A value setting of 39 that was used for the b parameter, which was determined empirically based on the well-known strategy of grid search with cross-validation (GridSearchCV). The hidden unit centers and variances were initialized as follows: the k-means implementation in [12] was used to initialize the c
i, j
, where the number of k clusters was set at 39 and the minimum standard deviation for the clusters set at 1 × 10−3; and the initial value of all variance parameters \( {\sigma}_{i,j}^2 \) in the network was set to the maximum squared Euclidean distance between any pair of cluster centers to prevent initial value of the variance parameters from being too small [11]. The parameter λ for the logistic regression was set at 1 × 10−8.
Tuning of the b parameter
The number of basis functions (i.e., number of hidden units) that are employed in RBF networks is a relevant parameter that requires particular attention as it directly impacts complexity of the model. The GridSearchCV method was used to search through the hyper-parameter space for the best value for parameter b. Briefly, GridSearchCV implements a fit and a score method to optimize parameters of a model by cross-validated grid-search over a parameter grid (i.e., a range of values). The lower boundary of the grid was set at 2 and the upper boundary limit was set at 66, which was inferred by clustering the training dataset using an expectation-maximization (EM) algorithm (discussed in [12]). The GridSearchCV implementation used here is as follows: the initial grid is worked on with 2-fold cross-validation (2× CV) to determine the values of parameter b based on an evaluation metric(s) (hereafter, classification accuracy). The best point in the grid is then taken and 10× CV is performed with the adjacent point. If a better point is found, then this will act as new center and another 10× CV is performed. This process is repeated until no better point is found or the best (optimal) point is on the border of the grid.
Classification schemes
The RBFNN classifier was trained and evaluated on the training dataset comprising PhyChem variants of HVR1 labeled according to the actual R/C class associations, and with the randomly-labeled datasets where class-labels were randomly assigned to the variants. Classification performances of the RBFNN classifier derived from each scheme was also evaluated on the other dataset (i.e., unseen data).
Applied statistical tests
The Welch two sample t-test was used to examine the statistical significance of differences between the population means in HVR1 variants sampled from acute (n = 124) and chronic (n = 98) patients. The null hypothesis is that the difference between the means is 0 (making the difference between these two groups not statistically significant) and the alternative hypothesis is that their difference is not zero. The variance parameter was set to ‘false’ to account for the difference in sample size.
The strength of the association of DNA nt’s and PhyChem variables to the R/C durations of infection was measured using the Pearson product-moment correlation coefficient (r). Additionally, the heuristic Merit metric [13] was used to measure the importance of different subsets of DNA PhyChem features for establishing the association between HVR1 and R/C states. Merit scores for various subsets of PhyChem variables were computed using the following formula,
$$ {\mathit{\mathsf{Merit}}}_{\mathit{\mathsf{S}}}=\frac{\mathit{\mathsf{k}}\times \overline{{\mathit{\mathsf{r}}}_{\mathit{\mathsf{ca}}}}}{\sqrt{\mathit{\mathsf{k}}+\left(\mathit{\mathsf{k}}-\mathsf{1}\right)\times \overline{{\mathit{\mathsf{r}}}_{\mathit{\mathsf{aa}}}}}} $$
where \( \overline{r_{ca}} \) is the average feature-class correlation and \( \overline{r_{aa}} \) is the average feature-feature inter-correlation in a feature subset S containing k features.
Measures of statistical significance of the pairwise comparisons of classifying schemes performed in this study was done using the corrected resampled two-tailed T-test [14]. The Welch two sample t-test and Pearson product-moment correlation coefficient (r) were implemented in R (v3.0.1). Computations of the corrected two-tailed T-test and of the Merit scores were implemented as discussed in [12].
Classifier performance evaluation
Four metrics used to evaluate the RBFNN classifier(s) are reported herein: classification accuracy (CA), F1 measure, the Mathews correlation coefficient (MCC) and the Receiver Operating Characteristic (ROC) curve, which was summarized as a single value by computing the area of the convex shape below the ROC curve (AUROC). These metrics were computed as follows:
$$ \mathit{\mathsf{CA}}=\frac{\mathit{\mathsf{TP}}+\mathit{\mathsf{TN}}}{\mathit{\mathsf{TP}}+\mathit{\mathsf{TN}}+\mathit{\mathsf{FP}}+\mathit{\mathsf{FN}}}\times \mathsf{100}\% $$
$$ {\mathit{\mathsf{F}}}_{\mathsf{1}}=\mathsf{2}\bullet \left(\frac{\left(\frac{\mathit{\mathsf{TP}}}{\mathit{\mathsf{TP}}+\mathit{\mathsf{F}\mathsf{P}}}\right)\times \left(\frac{\mathit{\mathsf{TP}}}{\mathit{\mathsf{TP}}+\mathit{\mathsf{F}\mathsf{N}}}\right)}{\left(\frac{\mathit{\mathsf{TP}}}{\mathit{\mathsf{TP}}+\mathit{\mathsf{F}\mathsf{P}}}\right)+\left(\frac{\mathit{\mathsf{TP}}}{\mathit{\mathsf{TP}}+\mathit{\mathsf{F}\mathsf{N}}}\right)}\right) $$
$$ \mathit{\mathsf{MCC}}=\frac{\mathit{\mathsf{TP}}\times \mathit{\mathsf{TN}}-\mathit{\mathsf{FP}}\times \mathit{\mathsf{FN}}}{\sqrt{\left(\mathit{\mathsf{TP}}+\mathit{\mathsf{FP}}\right)\left(\mathit{\mathsf{TP}}+\mathit{\mathsf{FN}}\right)\left(\mathit{\mathsf{TN}}+\mathit{\mathsf{FP}}\right)\left(\mathit{\mathsf{TN}}+\mathit{\mathsf{FN}}\right)}} $$
where TP is the number of true positives; TN, the number of true negatives; FP, the number of false positive and FN, the number of false negatives. The interpolated curve (TPR vs FPR), made of points whose coordinates are functions of the threshold = θ ∈ [0, 1], was generated using the following equations:
$$ {\mathit{\mathsf{ROC}}}_{\mathit{\mathsf{x}}}\left(\mathit{\mathsf{\theta}}\right)=\mathit{\mathsf{FP}\mathsf{R}}\left(\mathit{\mathsf{\theta}}\right)=\frac{\mathit{\mathsf{FP}}\left(\mathit{\mathsf{\theta}}\right)}{\left(\mathit{\mathsf{FP}}\left(\mathit{\mathsf{\theta}}\right)+\mathit{\mathsf{TN}}\left(\mathit{\mathsf{\theta}}\right)\right)} $$
$$ {\mathit{\mathsf{ROC}}}_{\mathit{\mathsf{y}}}\left(\mathit{\mathsf{\theta}}\right)=\mathit{\mathsf{TP}\mathsf{R}}\left(\mathit{\mathsf{\theta}}\right)=\frac{\mathit{\mathsf{TP}}\left(\mathit{\mathsf{\theta}}\right)}{\left(\mathit{\mathsf{FN}}\left(\mathit{\mathsf{\theta}}\right)+\mathit{\mathsf{TP}}\left(\mathit{\mathsf{\theta}}\right)\right)} $$
where FPR is the false positive rate and TPR is the true positive rate. Computation of AUROC values was done by computing the probability that the RBFNN classifier ranks a randomly chosen positive instance above a randomly chosen negative instance, which was accomplished by calculating the ρ statistic from the U statistic. Equations and description of the method can be found in [12].