#### Support vector machines

SVM [23] has been widely used in classification. It constructs an optimal hyperplane decision function in a feature space that is mapped from the original input space by kernels, briefly introduced as follows:

Let x_{i} denote the *i*^{th} feature vector in the original input space and z_{i} denote the corresponding vector in the feature space, z_{i} = Φ(x_{i}). The kernel function *k*(x_{i}; x_{j}) computes the inner product of two vectors in the feature space and thereby implicitly defines the mapping function:

*k*(x_{i}; x_{j}) = Φ(x_{i})•Φ(x_{j}) = z_{i}•z_{j}

Three types of commonly used kernel functions are:

Linear kernel: *k*(x_{i}; x_{j}) = x_{i}•x_{j}

Polynomial kernel: *k*(x_{i}; x_{j}) = (1 + x_{i}•x_{j})^{p}

Gaussian kernel: *k*(x_{i}; x_{j}) = exp(-||x_{i} - x_{j}||^{2}/(2*σ*^{2}))
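As an illustration, the three kernels above can be written directly in a few lines of numpy; the vectors and parameter values below are arbitrary examples:

```python
import numpy as np

def linear_kernel(xi, xj):
    # k(xi, xj) = xi . xj
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, p=3):
    # k(xi, xj) = (1 + xi . xj)^p
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    # k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = np.asarray(xi) - np.asarray(xj)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

xi = np.array([1.0, 2.0])
xj = np.array([2.0, 1.0])
print(linear_kernel(xi, xj))           # 4.0
print(polynomial_kernel(xi, xj, p=2))  # (1 + 4)^2 = 25.0
print(gaussian_kernel(xi, xj))         # exp(-1) ≈ 0.368
```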

For a typical classification problem with *l* training samples (x_{1}, y_{1}), ..., (x_{l}, y_{l}), where *y*_{i} ∈ {+1, -1}, the discriminant function *f*(x) = *w*•Φ(x) + *b* is found by solving the following optimization problem:

min_{w, b, ξ} (1/2)||*w*||^{2} + C Σ_{i=1}^{l} *ξ*_{i}

subject to *y*_{i}(*w*•Φ(x_{i}) + *b*) ≥ 1 - *ξ*_{i} and *ξ*_{i} ≥ 0 for *i* = 1, ..., *l*.

This optimization problem is usually solved in its dual form, in which the training data appear only through inner products and can therefore be replaced by kernel evaluations.
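In practice the dual problem is rarely solved by hand; as a minimal sketch (assuming scikit-learn is available, with synthetic data purely for illustration), a soft-margin SVM with a Gaussian kernel can be trained as follows:

```python
import numpy as np
from sklearn.svm import SVC  # solves the dual problem internally

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs labeled -1 and +1.
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# C is the soft-margin constant; for the Gaussian kernel, gamma = 1/(2*sigma^2).
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(clf.predict([[-2.0, -2.0], [2.0, 2.0]]))
```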

#### Distance metric learning

Depending on the availability of training examples, distance metric learning algorithms can be divided into two categories: supervised and unsupervised distance metric learning. Given class labels for the training samples, supervised distance metric learning can be further divided into global and local distance metric learning. The global approach learns the distance metric in a global sense, i.e., it satisfies all the pairwise constraints; the local approach learns the distance metric in a local setting, i.e., it meets only local pairwise constraints.

Unsupervised distance metric learning is also called manifold learning. Its main idea is to learn an underlying low-dimensional manifold on which the geometric relationships between most of the observed data are preserved; in essence, every dimension reduction approach learns a distance metric without label information. Manifold learning algorithms can be divided into global linear dimension reduction approaches, including Principal Component Analysis (PCA) and Multidimensional Scaling (MDS); global nonlinear approaches, for instance ISOMAP [24]; and local linear approaches, including Locally Linear Embedding (LLE) [25] and the Laplacian Eigenmap [26].
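For reference, the methods named above all have standard implementations; a brief sketch (assuming scikit-learn, with its synthetic S-curve data used purely for illustration):

```python
# Three of the dimension reduction methods named above, applied to a
# 3-D S-curve dataset whose points lie on a 2-D manifold.
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=500, random_state=0)  # shape (500, 3)

Z_pca = PCA(n_components=2).fit_transform(X)                     # global linear
Z_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)  # global nonlinear
Z_lle = LocallyLinearEmbedding(n_neighbors=10,
                               n_components=2).fit_transform(X)  # local linear
print(Z_pca.shape, Z_iso.shape, Z_lle.shape)  # each (500, 2)
```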

In supervised global distance metric learning, the representative work formulates distance metric learning as a constrained convex programming problem [27]. In local adaptive distance metric learning, many researchers have presented approaches that learn an appropriate distance metric to improve a KNN classifier [28–32]. Inspired by the work on neighborhood component analysis [30] and metric learning with energy-based models [33], Weinberger *et al*. proposed distance metric learning for Large Margin Nearest Neighbor classification (LMNN). Specifically, a Mahalanobis distance is optimized with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin [34]. LMNN has several parallels to learning in SVMs; for example, both share the goal of margin maximization and have a convex objective function based on the hinge loss. In multi-class classification, the training time of SVMs scales at least linearly in the number of classes; by contrast, LMNN has no explicit dependence on the number of classes [34]. We introduce the idea of LMNN as follows:

Given a training set of *n* labeled samples (x_{i}, y_{i}), *i* = 1, ..., *n*, the binary matrix *y*_{ij} ∈ {0, 1} indicates whether or not the labels *y*_{i} and *y*_{j} match, and *η*_{ij} ∈ {0, 1} indicates whether *x*_{j} is a target neighbor of *x*_{i}. Both matrices *y*_{ij} and *η*_{ij} are fixed during training. The goal is to learn a linear transformation L: R^{d} → R^{d} that optimizes KNN classification. The transformation is used to compute squared distances as

D(x_{i}, x_{j}) = ||L(x_{i} - x_{j})||^{2} (4)
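The squared distance under L, and its equivalent Mahalanobis form with M = L^{T}L, can be sketched in numpy (the transformation and vectors below are arbitrary examples):

```python
import numpy as np

def lmnn_sq_distance(L, xi, xj):
    # D(xi, xj) = ||L (xi - xj)||^2
    d = L @ (xi - xj)
    return float(d @ d)

L = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # illustrative linear transformation
xi = np.array([1.0, 1.0])
xj = np.array([0.0, 0.0])
print(lmnn_sq_distance(L, xi, xj))       # 5.0 (= 2^2 + 1^2)

# Equivalent Mahalanobis form with M = L^T L:
M = L.T @ L
print(float((xi - xj) @ M @ (xi - xj)))  # 5.0
```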

The cost function is given as follows:

*ε*(L) = Σ_{ij} *η*_{ij}||L(x_{i} - x_{j})||^{2} + C Σ_{ijl} *η*_{ij}(1 - *y*_{il})[1 + ||L(x_{i} - x_{j})||^{2} - ||L(x_{i} - x_{l})||^{2}]_{+} (5)

where [z]_{+} = max(z, 0) denotes the standard hinge loss and the constant C > 0. The first term penalizes large distances between each input and its target neighbors, and the second term penalizes small distances between each input and all other inputs that do not share the same label. The optimization of eq. (5) can be reformulated as an instance of semidefinite programming (SDP) [35], and the global minimum of eq. (5) can be computed efficiently. In terms of the Mahalanobis distance metric M = L^{T}L, eq. (4) is

D(x_{i}, x_{j}) = (x_{i} - x_{j})^{T}M(x_{i} - x_{j})

Slack variables *ξ*_{ijl}, one for each triple of an input, a target neighbor, and a differently labeled input, are introduced so that the hinge loss can be mimicked. The resulting SDP is given by:

min Σ_{ij} *η*_{ij}(x_{i} - x_{j})^{T}M(x_{i} - x_{j}) + C Σ_{ijl} *η*_{ij}(1 - *y*_{il})*ξ*_{ijl}

Subject to

(1) (x_{i} - x_{l})^{T}M(x_{i} - x_{l}) - (x_{i} - x_{j})^{T}M(x_{i} - x_{j}) ≥ 1 - *ξ*_{ijl}

(2) *ξ*_{ijl} ≥ 0

(3) M ⪰ 0 (M is positive semidefinite)
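To make the objective concrete, here is a direct, unoptimized numpy sketch of the LMNN cost of eq. (5); the function name, matrix layout, and toy data are illustrative assumptions, not the authors' code:

```python
import numpy as np

def lmnn_cost(L, X, eta, y_match, C=1.0):
    """LMNN cost: pull target neighbors close, push differently
    labeled inputs beyond a unit margin.
    eta[i, j] = 1 if x_j is a target neighbor of x_i;
    y_match[i, l] = 1 if x_i and x_l share a label."""
    n = X.shape[0]
    pull, push = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if not eta[i, j]:
                continue
            d_ij = L @ (X[i] - X[j])
            d_ij2 = float(d_ij @ d_ij)
            pull += d_ij2                    # first term of eq. (5)
            for l in range(n):
                if y_match[i, l]:
                    continue                 # only differently labeled inputs
                d_il = L @ (X[i] - X[l])
                hinge = 1.0 + d_ij2 - float(d_il @ d_il)
                push += max(hinge, 0.0)      # hinge loss [z]_+
    return pull + C * push

# Toy 1-D example: x_0 and x_1 share a label, x_2 does not.
X = np.array([[0.0], [0.1], [5.0]])
eta = np.zeros((3, 3))
eta[0, 1] = 1                                # x_1 is x_0's target neighbor
y_match = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
print(lmnn_cost(np.eye(1), X, eta, y_match))  # 0.01: pull term only, no margin violation
```

Minimizing this cost over L, or equivalently over M = L^{T}L subject to M ⪰ 0, recovers the SDP above.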