**Multi-layer perceptron (MLP):** An error backpropagation neural network is a feedforward multilayer perceptron (MLP) that is applied in many fields because of its powerful and stable learning algorithm [13]. The network learns the training examples by adjusting its synaptic weights according to the error that occurs at the output layer. The backpropagation algorithm has two main advantages: it is local in updating the synaptic weights and biases, and it is efficient in computing all the partial derivatives of the cost function with respect to these free parameters. A perceptron is a simple pattern classifier.

The weight-update rule in the backpropagation algorithm is defined as follows:

$$\Delta w_{ji}(n) = \eta\,\delta_j\,x_{ji} + \alpha\,\Delta w_{ji}(n-1)$$

where $\Delta w_{ji}(n)$ is the weight update performed during the *n*th iteration through the main loop of the algorithm, η is a positive constant called the learning rate, $\delta_j$ is the error term associated with unit *j*, $x_{ji}$ is the input from unit *i* to unit *j*, and 0 ≤ α < 1 is a constant called the momentum [9][11, 12].
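As a minimal sketch of the momentum rule, assuming the common form in which the update at iteration *n* equals η·δ_j·x_ji plus α times the previous update, the following Python fragment applies two consecutive updates to a single weight. The function name `weight_update`, the input symbol `x_ji`, and all numeric values are illustrative assumptions, not taken from the cited works.

```python
def weight_update(delta_j, x_ji, prev_update, eta=0.5, alpha=0.9):
    """Update for one weight: eta * delta_j * x_ji + alpha * prev_update.

    eta is the learning rate, alpha the momentum (0 <= alpha < 1).
    x_ji (input from unit i to unit j) is an assumed symbol for illustration.
    """
    return eta * delta_j * x_ji + alpha * prev_update


# Two iterations on a single weight with made-up error terms and inputs.
w = 0.2
update = 0.0  # there is no previous update before the first iteration
for delta_j, x_ji in [(0.1, 1.0), (0.1, 1.0)]:
    update = weight_update(delta_j, x_ji, update)
    w += update
```

With a constant gradient term, the second update is larger than the first because the momentum term carries over part of the previous step.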

**Radial basis function (RBF) networks:** RBF networks process their input in two stages. First, the input is mapped onto the hidden layer. The output layer is then a linear combination of the hidden layer values, representing the mean predicted output. This output value is interpreted in the same way as the output of a regression model in statistics [9]. In classification problems, the output layer is usually a sigmoid function of a linear combination of the hidden layer values. Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics, which correspond to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework.
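The two stages can be sketched in a few lines of Python. The sketch assumes Gaussian basis functions with a single shared width, which the text does not specify; the names `rbf_hidden` and `rbf_output`, the centers, and the weights are hypothetical.

```python
import math


def rbf_hidden(x, centers, width):
    """Stage 1: Gaussian activation of each hidden unit for input vector x."""
    activations = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        activations.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return activations


def rbf_output(x, centers, width, weights, bias=0.0, classify=False):
    """Stage 2: linear combination of hidden values (regression),
    optionally passed through a sigmoid (classification)."""
    hidden = rbf_hidden(x, centers, width)
    z = bias + sum(w * h for w, h in zip(weights, hidden))
    return 1.0 / (1.0 + math.exp(-z)) if classify else z


# Illustrative values only: two hidden units in a 2-D input space.
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [1.0, -1.0]
y_reg = rbf_output([0.0, 0.0], centers, 1.0, weights)
y_cls = rbf_output([0.0, 0.0], centers, 1.0, weights, classify=True)
```

Here `y_reg` is the plain linear combination, while `y_cls` squashes the same value into the interval (0, 1).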

Moody and Darken [14] have proposed a multi-phase approach to RBFNs. This multi-phase approach is straightforward and is often reported to be much faster than, e.g., the backpropagation training of an MLP. A possible problem of the approach is that it uses a clustering method (e.g., k-means) to define a number of centers in the input space, and the clustering method is completely unsupervised and does not take the given output information into account. Clustering methods usually try to minimize the mean distance between the centers they distribute and the given data, which is only the input part of the training data. Therefore, the resulting distribution of RBF centers may be poor for the classification or regression problem.

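**Support Vector Machines (SVM):** Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^l$, the support vector machines require the solution of the following optimization problem:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

Here the training vectors $x_i$ are mapped into a higher dimensional space by the function $\phi$. SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space. $C > 0$ is the penalty parameter of the error term. $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is called the kernel function [6]. There are four basic kernels: linear, polynomial, radial basis function (RBF), and sigmoid:

Linear: $K(x_i, x_j) = x_i^T x_j$

Polynomial: $K(x_i, x_j) = (\gamma\, x_i^T x_j + r)^d$, $\gamma > 0$

RBF: $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, $\gamma > 0$

Sigmoid: $K(x_i, x_j) = \tanh(\gamma\, x_i^T x_j + r)$

Here $\gamma$, $r$, and $d$ are kernel parameters.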
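The four basic kernels can be written as short Python functions. This is a sketch that assumes the common parameterization with a scale `gamma`, an offset `r`, and a degree `d`; the function names and default parameter values are illustrative.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(u, v) + r) ** d


def rbf_kernel(u, v, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(u, v) + r)


# Illustrative vectors: their inner product is 11 and their squared distance is 8.
u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v)
k_rbf = rbf_kernel(u, v)
k_sig = sigmoid_kernel(u, v)
```

Each function returns a single similarity value, so a kernel matrix for a training set can be filled by evaluating the chosen function on every pair of instances.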

**The k-means:** The k-means algorithm takes a dataset and partitions it into *k* clusters, where *k* is a user-defined value. Computationally, one may think of this method as analysis of variance (ANOVA) in reverse. The algorithm starts with *k* random clusters and then moves objects between those clusters with the goal of 1) minimizing variability within clusters and 2) maximizing variability between clusters [21]. In other words, the similarity rules apply maximally to the members of one cluster and minimally to members belonging to the rest of the clusters. The significance test in ANOVA evaluates the between-group variability against the within-group variability when testing the hypothesis that the means in the groups are different from each other. Usually, as the result of a *k*-means clustering analysis, the means for each cluster on each dimension are examined to assess how distinct the *k* clusters are. Ideally, the means differ substantially on most, if not all, dimensions [22].
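A compact sketch of the procedure in plain Python follows. It assumes the widely used Lloyd-style iteration (assign each point to its nearest center, then recompute each center as the mean of its members) with initial centers drawn at random from the data; the function names and the toy dataset are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(data, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # assumption: start from k random data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in data]
        if new_labels == labels:  # no point moved, so the partition is stable
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return centers, labels


# Toy data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
        [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
centers, labels = kmeans(data, k=2)
```

Inspecting `centers` afterwards corresponds to examining the cluster means on each dimension, as described above.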

**Farthest First:** The Farthest First Traversal algorithm works as a fast, simple, approximate clustering model modeled after Simple K-Means. To find *k* cluster centers, it randomly chooses one point as the first center, and then repeatedly selects the point with the maximal minimum distance to the current centers as the next center [23].
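The center-selection step can be sketched as follows, assuming Euclidean distance. The function name `farthest_first` and the toy data are illustrative.

```python
import math
import random


def farthest_first(data, k, seed=0):
    """Pick k centers: a random first one, then repeatedly the point whose
    distance to its nearest already-chosen center is largest."""
    rng = random.Random(seed)
    centers = [rng.choice(data)]
    # min_dist[i] holds the distance from data[i] to its nearest chosen center.
    min_dist = [math.dist(p, centers[0]) for p in data]
    while len(centers) < k:
        idx = max(range(len(data)), key=lambda i: min_dist[i])
        centers.append(data[idx])
        min_dist = [min(m, math.dist(p, data[idx]))
                    for m, p in zip(min_dist, data)]
    return centers


# Toy data: two points on the left, two on the right.
data = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 1.0]]
centers = farthest_first(data, k=2)
```

Whichever point is drawn first, the second center comes from the opposite group, because that is where the maximal minimum distance lies.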

**Density Based Clustering (DBC):** Density-based clustering has turned out to be one of the most successful traditional approaches to clustering. It can be extended to detect subspace clusters in high dimensional spaces. A cluster is defined as a maximal set of density-connected points. Correlation clusters are sets of points that fit a common hyperplane of arbitrary dimensionality. Density-based clustering starts by estimating the density of each point to identify core, border, and noise points. A core point is a point whose density is greater than a user-defined threshold. A noise point is a point whose density is less than a user-defined threshold. Noise points are usually discarded in the clustering process. A non-core, non-noise point is considered a border point [24].
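The labeling of core, border, and noise points can be illustrated with a short sketch. It assumes a DBSCAN-style density estimate, which the text does not prescribe: the density of a point is the number of points within radius `eps` (including itself), a core point has at least `min_pts` such neighbors, and a border point is a non-core point that lies within `eps` of some core point. The names and thresholds are hypothetical.

```python
import math


def classify_points(data, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    n = len(data)
    # Neighborhood of each point: indices of all points within eps (self included).
    neighbors = [[j for j in range(n) if math.dist(data[i], data[j]) <= eps]
                 for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")  # sparse itself, but next to a dense point
        else:
            labels.append("noise")
    return labels


# Toy data: a unit square, one point attached to its edge, one far outlier.
data = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [8, 8]]
labels = classify_points(data, eps=1.1, min_pts=3)
```

In this toy set the four corners of the square are core points, the attached point is a border point, and the outlier is noise.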

**Expectation Maximization (EM) clustering:** An expectation-maximization (EM) algorithm finds maximum likelihood estimates of parameters in probabilistic models. EM alternates between an expectation (E) step, which computes the expected likelihood of the observed variables under the current parameter estimates, and a maximization (M) step, which computes the parameters that maximize the expected likelihood found in the E step. EM assigns to each instance a probability distribution that indicates the probability of it belonging to each of the clusters [25]. EM can decide how many clusters to create by cross-validation.

The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data. The results of EM clustering are different from those computed by k-means clustering [26]. K-means assigns observations to clusters to maximize the distances between clusters. The EM algorithm computes classification probabilities, not actual assignments of observations to clusters.
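A minimal sketch of EM clustering is given below, assuming a one-dimensional mixture of Gaussians with a fixed number of clusters; selecting the number of clusters by cross-validation is not shown. The function names, the initial guesses, and the data are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def em_gmm_1d(xs, mus, sigmas, weights, iters=50):
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    resp = []
    for _ in range(iters):
        # E step: probability that each observation belongs to each cluster.
        resp = []
        for x in xs:
            dens = [weights[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: re-estimate means, standard deviations, and mixing weights.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor guards against collapse
            weights[j] = nj / len(xs)
    # resp holds the membership probabilities from the last E step.
    return mus, sigmas, weights, resp


# Toy data: one group near 1 and one group near 5; rough initial guesses.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
mus, sigmas, weights, resp = em_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

Each row of `resp` is a probability distribution over the clusters rather than a hard assignment, which is the difference from k-means noted above.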

**Cross validation:** To measure classification error, it is necessary to have test data samples that are independent of the learning dataset used to build the classifier. However, obtaining independent test data is difficult or expensive, and it is undesirable to hold back data from the learning dataset to use for a separate test because that weakens the learning dataset. The V-fold cross-validation technique performs independent tests without requiring separate test datasets and without reducing the data used to build the classifier. The learning dataset is partitioned into some number of groups called “folds” [31]. The number of groups that the rows are partitioned into is the ‘V’ in *V-fold cross-validation*. Ten is the recommended and default value for “V”. It is also possible to apply the *V-fold cross-validation* method to a range of numbers of clusters in *k*-means or *EM* clustering, and observe the resulting average distance of the observations from their cluster centers.

Leave-one-out cross-validation involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data [31].
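The partitioning behind both schemes can be sketched with one generator, since leave-one-out is the special case in which the number of folds equals the number of observations. The function name `v_fold_splits` is hypothetical, and the sketch assumes the rows are already in random order, so no shuffling is performed.

```python
def v_fold_splits(n_rows, v):
    """Yield (train_indices, test_indices) pairs for V-fold cross-validation.

    Folds are contiguous blocks; the first n_rows % v folds get one extra row.
    """
    indices = list(range(n_rows))
    fold_sizes = [n_rows // v + (1 if i < n_rows % v else 0) for i in range(v)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Five folds over ten rows, and leave-one-out over four rows.
splits = list(v_fold_splits(10, 5))
loo = list(v_fold_splits(4, 4))
```

A classifier would be built on each `train` index set and evaluated on the matching `test` set, and the per-fold errors averaged to estimate the classification error.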