Classification relevance units machine
The sparse kernel-based binary classification model called the Classification Relevance Units Machine (CRUM) provides probabilistic predictions [6, 9]. Let Ψ be a set of objects, e.g., Ψ ⊆ R^d. The CRUM models the posterior probability P(C+|x) that an object x ∈ Ψ is a member of the positive class C+ using the following model

P(C+|x) = σ( ∑_{i=1}^{M} w_i k(x, u_i) + b ),   (1)

where σ is the sigmoid function, M is a positive integer, k(·,·) is a kernel function, the weights w_i ∈ R, the bias b ∈ R, and the Relevance Units (RUs) u_i ∈ R^d. The posterior of the negative class is then P(C−|x) = 1 − P(C+|x).
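To make Equation (1) concrete, the following is a minimal sketch of the binary CRUM posterior computation, assuming a Gaussian (RBF) kernel; the parameter values (weights, RUs, bias, kernel width) are illustrative placeholders, not values from this work:

    import numpy as np

    def rbf_kernel(x, u, gamma=1.0):
        """Gaussian (RBF) kernel k(x, u); the kernel choice is an assumption."""
        return np.exp(-gamma * np.sum((x - u) ** 2))

    def crum_posterior(x, relevance_units, weights, bias, gamma=1.0):
        """P(C+|x) from Equation (1): sigmoid of a weighted sum of kernel values."""
        activation = bias + sum(
            w_i * rbf_kernel(x, u_i, gamma)
            for w_i, u_i in zip(weights, relevance_units)
        )
        return 1.0 / (1.0 + np.exp(-activation))  # sigmoid

    # Example with M = 2 relevance units in R^2 (illustrative values only)
    x = np.array([0.5, -1.0])
    relevance_units = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
    weights = [1.5, -0.7]
    bias = 0.1
    p_pos = crum_posterior(x, relevance_units, weights, bias)
    p_neg = 1.0 - p_pos  # P(C-|x)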
For a given k(·,·), M, an observed dataset X = {x1, x2, ..., xN} ⊆ Ψ and the associated class labels, the binary CRUM learning algorithm first estimates the kernel parameter(s) and the u_i's through unsupervised learning, and then learns the values of the w_i's and b through an iterative approach. The CRUM learning algorithm minimizes structural risk under log loss [7, 14] and determines the error/complexity trade-off parameter using an empirical Bayes method. Further details can be found in [6, 9].
The multiclass classification problem and solutions
Multiclass classification is the generalization of binary classification to an arbitrary number of classes K > 1. We denote the set of K classes as T = {C1, C2, ..., CK}, and want to learn a classifier function g: Ψ → T.
There are two major approaches to converting a binary classifier to a multiclass classifier: the direct approach and through the aggregation of multiple binary classifiers.
Direct approach
In the direct approach, the internals of the binary classifier are changed to reflect the K class situation. For CRUM, this is done by changing the underlying model from the binary sigmoid model to a multiclass softmax model,
P(C_i|x) = exp( ∑_{m=1}^{M} w_{mi} k(x, u_m) + b_i ) / ∑_{j=1}^{K} exp( ∑_{m=1}^{M} w_{mj} k(x, u_m) + b_j ),   (2)
where the M RUs u_m, the M·K weights w_{mi}, and the K biases b_i are to be learned. The RUs can be learned using unsupervised learning on the unlabeled data, as done in the binary case [9]. The K-fold increase in the number of parameters leads to a K³ increase in the run-time complexity of the CRUM training algorithm compared to the binary case, due to the inversion of the (M·K + 1) × (M·K + 1) Hessian matrix. Similar to the RVM, this may make this method impractical for large problems [8]. Furthermore, related work in softmax regression suggests the need for more elaborate and costly methods for matrix inversion due to ill-conditioning [15].
Likewise, reformulating the SVM for multiclass classification leads to high-cost training algorithms [16]. Therefore, the second approach of aggregating multiple binary classifiers, discussed next, has been the more popular and practical way to solve the multiclass classification problem.
Decomposition of a multiclass problem into binary classification problems
The idea of the aggregation approach is to decompose the multiclass problem into multiple binary problems that can then be solved with binary classifiers. The most popular framework for this approach is the method of error-correcting output codes (ECOC) [10]. In this framework, the decomposition of a K-class problem into L binary problems is expressed with a coding matrix

M ∈ {0, 1, Δ}^{K × L},   (3)

where each column of M specifies one binary classifier.
For example, the one-versus-rest (OVR) matrix for three classes is the 3 × 3 identity matrix:

[ 1  0  0
  0  1  0
  0  0  1 ]   (4)
There are three columns, and thus this decomposition requires the training of three binary classifiers. The first binary classifier is trained with the training data belonging to class C1 as the positive class set and the data belonging to classes C2 and C3 as the negative class set. The second binary classifier is trained with the training data belonging to class C2 as the positive class set and the data belonging to classes C1 and C3 as the negative class set. The third binary classifier is trained similarly. This decomposition is called one-versus-rest (OVR) because each binary classifier is trained with only one class serving as the positive class and all other classes serving as the negative class. In general, the OVR matrix for K classes is the K × K identity matrix.
The all-pairs (AP) matrix for three classes is also a 3 × 3 matrix:

[ 1  1  Δ
  0  Δ  1
  Δ  0  0 ]   (5)
The Δ symbol denotes omission of the corresponding class in the training of the binary classifier. Therefore, in this case, the first binary classifier is trained with the training data belonging to class C1 as the positive class set and the data from C2 as the negative class set, while the data from C3 is omitted. The next two binary classifiers are trained in a similar way. This decomposition is called one-versus-one or all-pairs (AP) because each binary classifier is trained with only a single class serving as the positive class and another single class serving as the negative class. Since there are K(K − 1)/2 distinct pairs of classes, the general AP matrix for K classes is a K × K(K − 1)/2 matrix.
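As a sketch of how these decompositions can be generated programmatically, the following uses None to stand in for Δ; the pair ordering in the all-pairs matrix (lower-indexed class as the positive class) is an illustrative convention:

    from itertools import combinations

    DELTA = None  # stands in for the Δ (omitted-class) entry

    def ovr_matrix(K):
        """One-versus-rest coding matrix: the K x K identity matrix."""
        return [[1 if i == j else 0 for j in range(K)] for i in range(K)]

    def ap_matrix(K):
        """All-pairs coding matrix: K rows, K(K-1)/2 columns.

        For each pair (a, b) with a < b, class a is used as the positive
        class and class b as the negative class (an illustrative convention).
        """
        pairs = list(combinations(range(K), 2))
        M = [[DELTA] * len(pairs) for _ in range(K)]
        for col, (a, b) in enumerate(pairs):
            M[a][col] = 1
            M[b][col] = 0
        return M

    print(ovr_matrix(3))  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(ap_matrix(3))   # columns: C1 vs C2, C1 vs C3, C2 vs C3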
In general, any coding matrix M defined by Equation (3) can be used under the following constraints (a validity check is sketched after this list):

1. All rows and columns are unique.
2. No row is composed solely of Δ.
3. Each column has at least one 1 and at least one 0.
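A minimal sketch of checking these three constraints, using the same None-for-Δ convention as the sketch above:

    def is_valid_coding_matrix(M):
        """Check the three constraints on a K x L coding matrix M.

        M is a list of K rows, each a list of L entries in {1, 0, None},
        where None stands for Δ.
        """
        rows = [tuple(row) for row in M]
        cols = [tuple(col) for col in zip(*M)]
        unique = len(set(rows)) == len(rows) and len(set(cols)) == len(cols)
        no_all_delta_row = all(any(e is not None for e in row) for row in rows)
        cols_have_1_and_0 = all(1 in col and 0 in col for col in cols)
        return unique and no_all_delta_row and cols_have_1_and_0

    # The OVR and AP matrices for three classes both satisfy the constraints
    OVR_3 = [[1, 0, 0],
             [0, 1, 0],
             [0, 0, 1]]
    AP_3 = [[1, 1, None],
            [0, None, 1],
            [None, 0, 0]]
    assert is_valid_coding_matrix(OVR_3)
    assert is_valid_coding_matrix(AP_3)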
Aggregating the binary outputs
Given a coding matrix M and the outputs of the L binary classifiers, how do we compute the probability P(Ck|x)? Let us first consider the simple case of hard decoding, leading to a hard decision. Assume that the binary classifier g_i, corresponding to the i-th column of M, returns hard decisions, where an output of 1 denotes the positive class and 0 denotes the negative class. Then the collective output of the binary classifiers on x can be collected into a row vector g(x) = [g1(x), g2(x), ..., gL(x)]. The predicted class of x is determined by finding the row of M with the smallest distance to g(x). Let y, z ∈ {0, 1, Δ}^{1 × L}.
A commonly used measure of distance is a modified Hamming distance [17]:

d(y, z) = ∑_{i=1}^{L} δ(y_i, z_i),   (6)

where

δ(a, b) = 0 if a = b and a, b ≠ Δ,
δ(a, b) = 1 if a ≠ b and a, b ≠ Δ,
δ(a, b) = 1/2 if a = Δ or b = Δ.   (7)
Let M(k) denote the k-th row of M, and let

k* = argmin_{k ∈ {1, ..., K}} d(M(k), g(x)).   (8)

Then the predicted class of x is C_k*. In the case of the AP coding matrices, this corresponds to choosing the class with the majority vote of the binary classifiers. Note that the rows of M can be interpreted as the unique codewords representing the K classes and that the predicted g(x) is one of those codewords corrupted by noise. In this context, the above algorithm decodes g(x) into the closest codeword, thus performing error correction on the corrupted bits; this gives this approach to classification its name, ECOC.
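A minimal sketch of this hard-decoding rule (Equations (6)-(8)), again with None standing in for Δ:

    def modified_hamming(y, z):
        """Modified Hamming distance between two codewords over {1, 0, None}."""
        d = 0.0
        for a, b in zip(y, z):
            if a is None or b is None:
                d += 0.5          # Δ entries contribute 1/2
            elif a != b:
                d += 1.0          # disagreement on a non-Δ bit
        return d

    def hard_decode(M, g_x):
        """Return the index of the row of M closest to the hard outputs g(x)."""
        distances = [modified_hamming(row, g_x) for row in M]
        return min(range(len(M)), key=lambda k: distances[k])

    # Example: OVR decomposition for three classes, classifiers vote [0, 1, 0]
    OVR_3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(hard_decode(OVR_3, [0, 1, 0]))  # predicts class index 1 (i.e., C2)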
Unfortunately, computing the posterior probabilities p_k = P(Ck|x) for all K classes is more difficult. For general coding matrices, the Generalized Bradley-Terry (GBT) model is used to estimate the posterior probabilities [11]. Let I_i^+ and I_i^- denote the sets of positive and negative classes, respectively, used by g_i. Then the output g_i(x) is the probability of the positive class of the i-th binary classification problem. Also let N_i denote the number of training data with classes in I_i^+ ∪ I_i^-, and define the following quantities:

q_i^+ = ∑_{k: Ck ∈ I_i^+} p_k,   (10)

q_i^- = ∑_{k: Ck ∈ I_i^-} p_k.   (11)

Given the probabilistic outputs g_i(x) of the binary classifiers, the core idea of the GBT model is that

g_i(x) ≈ q_i^+ / (q_i^+ + q_i^-).   (12)
Through these relations the posterior probabilities p = [p1, p2, ..., pK]^T can be retrieved. This is done by finding the p that minimizes the negative log-likelihood

−∑_{i=1}^{L} N_i [ g_i(x) log( q_i^+ / (q_i^+ + q_i^-) ) + (1 − g_i(x)) log( q_i^- / (q_i^+ + q_i^-) ) ],   (13)

under the constraints that each p_k > 0 and that the p_k sum to unity. This optimization can be interpreted as the minimization of the weighted Kullback-Leibler divergence between the binary classifier outputs g_i(x) and the corresponding model estimates q_i^+ / (q_i^+ + q_i^-). Huang et al. [11] proposed an iterative algorithm to solve this optimization.
Note that the optimization of Equation (13) must be done for every object x that we want to make a prediction on. This could be too expensive in large-scale prediction applications. Furthermore, the computational complexity of the algorithm is not completely characterized. While Huang et al. [11] provide a proof of convergence under some assumptions, under a general decomposition the algorithm may not converge. In the cases that are known to converge, the speed of convergence is unknown. Therefore, a naive approach is proposed.
We make the naive assumption that the outputs of the binary classifiers are independent. Under the interpretation of error-correcting codes, the formulation below is a soft decoding of the observed g(x) to the codewords in M under the assumption that bit errors are independent. Then we can compute the class posteriors as simple products of the binary posteriors, as follows

P(Ck|x, M) = ∏_{i: M_{ki} = 1} g_i(x) · ∏_{i: M_{ki} = 0} (1 − g_i(x)),   (14)

where the outputs of classifiers not trained on data from class Ck are omitted. For example, from the decomposition given in Equation (5), P(C2|x, M) = (1 − g1(x)) g3(x). Given the outputs of the binary classifiers, the algorithm is linear in L. In the implementation, the log of Equation (14) is used for computational stability, as shown in Step 4 of Algorithm 2.
The above formulation is a generalization to any valid M of the Resemblance Model for AP decomposition proposed in [18]. Again, the key assumption is the independence of the L binary classifiers, which is highly dependent on the decomposition M. Thus in general, this method is possibly only a crude approximation.
The following pseudocode summarizes the training and prediction processes of McRUM.
Algorithm 1: Training McRUM
Input: K × L decomposition matrix M, labeled training data
1:  for i = 1 to L
2:      positive_data ← initialize as empty
3:      negative_data ← initialize as empty
4:      for j = 1 to K
5:          if M_{ji} == 1
6:              Add data from class j to positive_data
7:          else if M_{ji} == 0
8:              Add data from class j to negative_data
9:          end if
10:     end for
11:     Set g_i to the binary CRUM trained on positive_data and negative_data as described in [9]
12: end for
13: return g = [g1, g2, ..., gL]
Algorithm 2: Prediction
Input: K × L decomposition matrix M, L binary CRUMs g_i, input x
1:  Set p = [p1, p2, ..., pK] = [0, 0, ..., 0]
2:  for i = 1 to K
3:      for j = 1 to L
4:          if M_{ij} == 1, set p_i ← p_i + log g_j(x); else if M_{ij} == 0, set p_i ← p_i + log(1 − g_j(x))
5:          end if
6:      end for
7:  end for
8:  Set p_i ← exp(p_i) for all i and normalize (to ensure the posterior probabilities sum to 1)
9:  return p
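A minimal runnable sketch of the prediction step (Equation (14) computed in log space, as in Algorithm 2), assuming the binary classifiers are supplied as callables returning P(positive|x) and using None to stand in for Δ; the names are illustrative:

    import math

    def mcrum_predict_posteriors(M, classifiers, x):
        """Naive decoding of Equation (14), computed in log space (Algorithm 2).

        M: K x L coding matrix with entries in {1, 0, None} (None = Δ)
        classifiers: list of L callables, each returning P(positive | x) in (0, 1)
        x: the input object
        """
        K, L = len(M), len(M[0])
        g = [clf(x) for clf in classifiers]          # binary posteriors g_j(x)
        log_p = [0.0] * K
        for i in range(K):
            for j in range(L):
                if M[i][j] == 1:
                    log_p[i] += math.log(g[j])       # classifier j treats class i as positive
                elif M[i][j] == 0:
                    log_p[i] += math.log(1.0 - g[j]) # classifier j treats class i as negative
                # entries equal to None (Δ) are omitted
        m = max(log_p)
        p = [math.exp(v - m) for v in log_p]         # subtract max for numerical stability
        s = sum(p)
        return [v / s for v in p]                    # normalize so the posteriors sum to 1

    # Illustrative use with the all-pairs decomposition for three classes
    AP_3 = [[1, 1, None], [0, None, 1], [None, 0, 0]]
    fake_classifiers = [lambda x: 0.2, lambda x: 0.3, lambda x: 0.8]
    print(mcrum_predict_posteriors(AP_3, fake_classifiers, x=None))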
Optimal coding matrix
The next question is whether there is any theory that can guide us in designing the optimal coding matrix, i.e., the one that gives the smallest error. There is, but it is not practically useful. These are some of the properties a good ECOC-based classifier should have [17]:

1. The minimum distance between rows of M (using the Hamming distance of Equation (6)) should be maximized.
2. The number of Δ entries should be minimized.
3. The average error of the L binary classifiers should be minimized.
All the criteria are at odds with each other. Consider the OVR decomposition, Equation (4), again. Since all but one class is assigned to the negative class, the training data are likely to be imbalanced. To see why this is a problem, consider an extreme case where 99% of the training data is negative and only 1% is positive. Then a binary classifier that always predicts the negative class would achieve 1% error. Under the framework of empirical or structural risk minimization, classifier training would tend to converge to this solution, as it provides low empirical risk under 0-1 loss. Therefore, a large imbalance between the sizes of the positive and negative sets would bias the classifier against the smaller class. So while OVR does not have any Δ entries, the average error of the binary classifiers could be high.
In the case of the AP decomposition shown in Equation (5), each individual binary classifier has only a single class serving as the positive data and another single class serving as the negative data. If the overall training set is balanced across all K classes, then each of the binary classifiers will also be balanced and achieve good average error. On the other hand, the AP matrix contains many Δ entries, which tends to increase error. As a side effect, each individual binary classifier is faster to train compared to OVR, as the amount of data per binary classifier is reduced. This advantage can be overshadowed by the sheer number of classifiers to train when K is large.
Therefore, knowing a priori which coding matrix is superior is not possible, and the choice of coding matrix M is application-dependent. We must therefore experimentally try different matrices to find the one best suited to the particular application.
ncRNA dataset preparation and features
The ncRNA dataset is gathered from miRBase's collection of miRNAs [19], NONCODE 3.0's collection of piRNAs [20], and the remaining ncRNAs in the NONCODE 3.0 database, which serve as representatives of other ncRNAs. Each of these three sets is individually reduced using CD-HIT [21] to remove sequences with 80% or higher identity. This helps reduce the evolutionary correlations among the data and improves the generalization of the CRUM model, which assumes independent samples. The resulting dataset contains 9,439 miRNAs, 81,147 piRNAs, and 94,809 other ncRNAs.
In the gathered data, miRNAs are observed to be 15 to 33 nt long and piRNAs 16 to 40 nt long. For the other ncRNAs, the training and evaluation of the McRUM does not necessarily use the entire sequence. We chose to use fragments of length 20 nt, which lies in the overlapping range of miRNA and piRNA lengths, so that a fragment could plausibly be an miRNA or piRNA had its identity been unknown. If an other-ncRNA sequence is longer than 20 nt, we take a random 20 nt fragment from the sequence instead. Due to the imbalance of the dataset among the three classes, the training set is a sample of the available data. After holding out 20% of the miRNA sequences for an independent test set, we are left with 7,552 miRNAs in the training set. Therefore, we sample 7,552 piRNAs and 7,552 other ncRNAs to form a balanced 1:1:1 training set. Together with the held-out 1,887 miRNAs, the remaining 73,595 piRNAs and 87,257 other ncRNAs serve as an independent test set.
Since mature miRNAs and piRNAs lack strong secondary structures, the McRUM internally represents each ncRNA using k-mers, for k = 1, 2, ..., 5. For each value of k, the number of occurrences of each type of k-mer is computed and normalized.
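The k-mer representation described above might be computed as in the following sketch; the normalization shown (dividing each k-mer count by the total number of k-mers of that length in the sequence) is an assumption, as the exact normalization is not spelled out here:

    from itertools import product

    def kmer_features(seq, ks=(1, 2, 3, 4, 5)):
        """Normalized k-mer occurrence counts for an RNA sequence over A, C, G, U."""
        seq = seq.upper().replace("T", "U")
        features = []
        for k in ks:
            counts = {"".join(kmer): 0 for kmer in product("ACGU", repeat=k)}
            n_kmers = max(len(seq) - k + 1, 1)
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                if kmer in counts:           # skip k-mers containing ambiguous bases
                    counts[kmer] += 1
            # normalize the counts for this k by the number of k-mers in the sequence
            features.extend(counts[key] / n_kmers for key in sorted(counts))
        return features

    # Example: a 20 nt fragment yields a feature vector of length 4+16+64+256+1024
    print(len(kmer_features("ACGUACGUACGUACGUACGU")))  # 1364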
Performance measures
A Receiver Operating Characteristic (ROC) curve is a visualization of the performance of a binary classifier at various thresholds. On the x-axis is the false positive rate (FPR) and on the y-axis is the true positive rate (TPR), which is also called sensitivity or recall. These two quantities are calculated as follows,

FPR = FP / (FP + TN),   (15)

TPR = TP / (TP + FN),   (16)

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.
For classification of more than two classes, we can compute ROC curves by considering one class as the positive class and the remaining classes jointly as the negative class. For the small ncRNA experiment, we have three classes. For Figures 1(a) and 2(a), we consider miRNA as the positive class, and piRNA and other ncRNAs jointly as the negative class. Under this setting, TP is the count of miRNAs correctly classified, while FP is the number of piRNAs and other ncRNAs classified as miRNAs. FN is the sum of the number of miRNAs incorrectly classified and the number of miRNAs left unclassified. Lastly, TN is the sum of the number of piRNAs and other ncRNAs correctly classified, the number of piRNAs incorrectly classified as other ncRNAs, the number of other ncRNAs incorrectly classified as piRNAs, and the number of piRNAs and other ncRNAs left unclassified. The piRNAs incorrectly classified as other ncRNAs, and vice versa, are counted as true negatives because they are correctly classified into the negative class and not into the positive class, miRNA. Similarly, Figures 1(b) and 2(b) are computed by considering piRNA as the positive class and miRNA and other ncRNAs jointly as the negative class, while Figures 1(c) and 2(c) are computed by considering other ncRNAs as the positive class and miRNA and piRNA jointly as the negative class.
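The following sketch tallies TP, FP, TN, and FN for one choice of positive class under the counting rules described above; the label strings are illustrative, and "unclassified" marks objects for which the classifier declines to make a call:

    def one_vs_rest_counts(true_labels, predicted_labels, positive_class):
        """Count TP, FP, TN, FN treating positive_class versus all other classes.

        predicted_labels may contain the value "unclassified"; unclassified
        positives count as FN and unclassified negatives count as TN, following
        the counting rules described in the text.
        """
        tp = fp = tn = fn = 0
        for truth, pred in zip(true_labels, predicted_labels):
            if truth == positive_class:
                if pred == positive_class:
                    tp += 1
                else:                      # misclassified or unclassified positive
                    fn += 1
            else:
                if pred == positive_class:
                    fp += 1
                else:                      # any non-positive prediction, or unclassified
                    tn += 1
        return tp, fp, tn, fn

    # Illustrative example with miRNA as the positive class
    truth = ["miRNA", "piRNA", "other", "miRNA", "piRNA"]
    pred = ["miRNA", "other", "miRNA", "unclassified", "piRNA"]
    tp, fp, tn, fn = one_vs_rest_counts(truth, pred, "miRNA")
    fpr = fp / (fp + tn)   # Equation (15)
    tpr = tp / (tp + fn)   # Equation (16)
    print(tp, fp, tn, fn, fpr, tpr)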
The timing results for the Naïve and GBT decoding algorithms in the benchmark experiments were obtained using MATLAB implementations on a PC with a 2.83 GHz Intel Core 2 Quad processor and 8 GB of memory.