Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis

Background Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression data sets hurts the generalization performance of classifiers. Dimension reduction comprises two types of methods: feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods, and in previous work the top several components of PCA or PLS, taken in descending order of eigenvalue, were selected for modeling. In this paper, we show that not all of the top components are useful; instead, features should be selected from all the components by feature selection methods. Results We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on gene expression microarray data. We consider both an unsupervised method, PCA, and a supervised method, PLS, for extracting new components, genetic algorithms for feature selection, and support vector machines and k nearest neighbor for classification. Experimental results illustrate that the proposed framework effectively selects feature subsets and reduces classification error rates. Conclusion The top features newly extracted by PCA or PLS are not the only important ones; therefore, feature selection should be performed to select subsets of the new features in order to improve the generalization performance of classifiers.


Background
Tumor classification is performed on microarray data collected by DNA microarray experiments from tissue and cell samples [1][2][3]. The wealth of this kind of data across different stages of cell cycles helps to explore gene interactions and to discover gene functions. Moreover, obtaining genome-wide expression data from tumor tissues gives insight into the gene expression variation of various tumor types, thus providing clues for tumor classification of individual samples. The output of a microarray experiment is summarized as an n × p data matrix, where n is the number of tissue or cell samples and p is the number of genes.
Here p is always much larger than n, which hurts the generalization performance of most classification methods. To overcome this problem, dimension reduction methods are applied to reduce the dimensionality from p to q with q ≪ p.
Dimension reduction usually consists of two types of methods: feature selection and feature extraction [4]. Feature selection chooses a subset of the original features according to classification performance; the optimal subset should contain relevant but non-redundant features. Feature selection can help to improve the generalization performance and speed of classifiers, and a great deal of work in machine learning and related areas addresses this issue [5][6][7][8][9]. But in most practical cases the relevant features are not known beforehand, and finding out which features to use is hard work. Moreover, feature selection may discard information carried jointly by several features, whereas feature extraction is good at handling such interactions among features.
Feature extraction projects the whole data set into a low dimensional space and constructs the new dimensions (components) by analyzing the statistical relationships hidden in the data set. Principal component analysis (PCA) is one of the most frequently used feature extraction methods for microarray data. It is unsupervised, since it does not need the label information of the data sets. Partial least squares (PLS) is one of the widely used supervised feature extraction methods for the analysis of gene expression microarray data [10,11]; it represents the data in a low dimensional space through a linear transformation. Although feature extraction methods produce uncorrelated features, a large number of features is usually extracted to represent the original data, and the extracted features may still contain noise or irrelevant information. Choosing an appropriate set of features is therefore critical. Some researchers have considered the initial several components of PLS to contain more information than the others, but it is hard to decide how many tail components are trivial for discrimination. Some authors proposed fixing the number of components at three to five [12]; others proposed determining the size of the space by cross-validated classification performance [13]. Each approach has its own weakness: fixing an arbitrary dimensionality is not applicable to all data sets, and the cross-validation method is often obstructed by its high computational cost. An efficient and effective model selection method for PLS is needed. Furthermore, we argue that not all of the initial components are important for classification; subsets should be selected for classification.
Here, we propose and demonstrate the importance of feature selection after feature extraction in tumor classification problems. We have performed experiments using PCA [14] and PLS [15] separately as feature extraction methods. In this paper, we carry out a systematic study of both PCA and PLS, combined with a feature selection method (a genetic algorithm) to obtain a more robust and efficient dimensional space; the data constructed from the original data are then used with support vector machines (SVM) and k nearest neighbor (kNN) for classification. By applying this systematic study to the analysis of gene microarray data, we examine whether feature selection chooses proper components for PCA and PLS dimension reduction and whether only the top components are nontrivial for classification.

Results by using SVM
In order to demonstrate the importance of feature selection in dimension reduction, we have performed the following series of experiments using the support vector machine (SVM) as the classifier:
1. SVM is the baseline method; all the genes, without any selection or extraction, are input into SVM for classification.
2. PCASVM uses PCA as the feature extraction method; all the newly extracted components are input into SVM.
3. PLSSVM uses PLS as the feature extraction method; all the newly extracted components are input into SVM.
4. PPSVM uses PCA+PLS as the feature extraction method; all the newly extracted components are input into SVM.
5. GAPCASVM uses PCA as the feature extraction method to extract new components from the original gene set and GA as the feature selection method to select a feature subset from the newly extracted components; the selected subset is input into SVM.
6. GAPLSSVM uses PLS as the feature extraction method to extract new components from the original gene set and GA as the feature selection method to select a feature subset from the newly extracted components; the selected subset is input into SVM.
7. GAPPSVM uses PCA+PLS as the feature extraction method to extract new components from the original gene set and GA as the feature selection method to select a feature subset from the newly extracted components; the selected subset is input into SVM.
Since SVM has parameters, we try to reduce their effect on our comparison by using four different parameter pairs for SVM: (C, σ) = (10, 0.01), (10, 10), (1000, 0.01) and (1000, 10). Note that different data sets, including the extracted and selected data sets, need different optimal parameters for different methods. We do not choose the optimal parameters, because 1) this is impractical, as searching for the optimal parameters across all methods and data sets is prohibitively expensive; and 2) we do not aim to exhibit the top performance of one particular method on one single data set, but to show the effect of our proposed framework. The average error rates and the corresponding standard deviation values are shown in Table 1, where the standard deviation values are computed from our 50 repeated experiments. From Table 1, we can find that:

Prediction performance
• Results of all the classification methods with feature extraction and selection, i.e. PLSSVM, GAPLSSVM, PCASVM, GAPCASVM and GAPPSVM, are better on average than those of SVM without any dimension reduction. Only on the LUNG data set, when SVM uses the parameters C = 10, σ = 0.01, are the results of PPSVM worse than those of SVM.
• Results of classification methods with feature selection, i.e. GAPLSSVM, GAPCASVM and GAPPSVM, are better on average than those of the corresponding feature extraction methods without feature selection, i.e. PLSSVM, PCASVM and PPSVM. Only in a few cases, e.g. with C = 10, σ = 10 for SVM, are the results of GAPCASVM slightly worse than those of PCASVM on the COLON data set.
• Results of GAPLSSVM are better on average than those of PCASVM and GAPCASVM, and even than the corresponding results of PPSVM and GAPPSVM. Only on the CNS data set out of the four data sets does GAPCASVM obtain better results than the other methods.
• Results of PPSVM and GAPPSVM, which combine PCA and PLS as feature extraction methods, are not the best; they are merely comparable with those of PCASVM and GAPCASVM.

Number of selected features
We also report the number of features selected by each method, with corresponding standard deviations. Fig. 1 compares the distributions of components selected by GA in the two cases of GAPCASVM and GAPLSSVM, and Fig. 2 shows that of GAPPSVM. The difference between Fig. 1 and Fig. 2 is that in Fig. 1 PCA and PLS are used individually for feature extraction, while in Fig. 2 PCA is combined with PLS as the feature extraction method.

Distribution of selected features
From Fig. 1 and Fig. 2, we can find that:
• When only PLS is used for feature extraction, the top components appear somewhat more often than the others among the selected components, but the others are also important.
• When only PCA is used, the top components appear less often than the others among the selected features, and the tail components are more important than the others.
• When both PCA and PLS are used as feature extraction methods, their components are selected nearly equally often, and the top components of PLS appear a little more often than the others.

Results by using kNN
In order to show the importance of feature selection, we have also performed the following series of experiments with the kNN learning machine, to reduce the bias caused by any single learning machine.
1. KNN is the baseline method; all the genes, without any selection or extraction, are input into kNN for classification.
2. PCAKNN uses PCA as the feature extraction method; all the newly extracted components are input into kNN.
3. PLSKNN uses PLS as the feature extraction method; all the newly extracted components are input into kNN.
4. PPKNN uses PCA+PLS as the feature extraction method; all the newly extracted components are input into kNN.
5. GAPCAKNN uses PCA as the feature extraction method to extract new components from the original gene set and GA as the feature selection method to select a feature subset from the newly extracted components; the selected subset is input into kNN.
Figure 1
Comparison of distributions of eigenvectors used by GAPCASVM and GAPLSSVM with C = 10, σ = 0.01 for SVM. The x-axis corresponds to the eigenvectors in descending order of their eigenvalues and has been divided into bins of size 5; the y-axis gives the average number of times that eigenvectors within each bin are selected by GA.

6. GAPLSKNN uses PLS as the feature extraction method to extract new components from the original gene set and GA as the feature selection method to select a feature subset from the newly extracted components; the selected subset is input into kNN.
7. GAPPKNN uses PCA+PLS as the feature extraction method to extract new components from the original gene set and GA as the feature selection method to select a feature subset from the newly extracted components; the selected subset is input into kNN.
Since kNN has a parameter, we try to reduce its effect on our comparison by using three values: k = 1, k = 4 and k = 7.
Note again that different data sets need different optimal parameters for different methods; we do not choose the optimal parameters, because we do not aim to exhibit the top performance of one particular method on one single data set, but to show the effect of our proposed framework.

Prediction performance
The average error rates and the corresponding standard deviation values are shown in Table 3, from which we can draw similar observations:
• Results of all the classification methods with feature extraction and selection, i.e. PLSKNN, GAPLSKNN, PCAKNN, GAPCAKNN and GAPPKNN, are better than those of KNN without any dimension reduction, both on average and in each case.
• Results of classification methods with feature selection, i.e. GAPLSKNN, GAPCAKNN and GAPPKNN, are better than those of the corresponding feature extraction methods without feature selection, i.e. PLSKNN, PCAKNN and PPKNN, both on average and in each case.
Figure 2
Comparison of distributions of eigenvectors used by GAPPSVM with C = 10, σ = 0.01 for SVM. The x-axis corresponds to the eigenvectors in descending order of their eigenvalues and has been divided into bins of size 5; the y-axis gives the average number of times that eigenvectors within each bin are selected by GA.

Fig. 3 compares the distributions of components selected by GA in the two cases of GAPCAKNN and GAPLSKNN, and Fig. 4 shows that of GAPPKNN. The difference between Fig. 3 and Fig. 4 is that in Fig. 3 PCA and PLS are used individually for feature extraction, while in Fig. 4 PCA is combined with PLS as the feature extraction method.

Distribution of selected features
Figure 3
Comparison of distributions of eigenvectors used by GAPCAKNN and GAPLSKNN with k = 1 for kNN. The x-axis corresponds to the eigenvectors in descending order of their eigenvalues and has been divided into bins of size 5; the y-axis gives the average number of times that eigenvectors within each bin are selected by GA.

From Fig. 3 and Fig. 4, we can draw similar observations:
• When only PLS is used for feature extraction, the top components appear more often than the others among the selected components, but the others are also selected; the nearer the top, the more often a component is chosen.
• When only PCA is used, the top components appear less often than the others among the selected features, and the tail components are more important than the others.
• When both PCA and PLS are used as feature extraction methods, their components are selected nearly equally often, and the top components of PLS appear a little more often than the others.

Discussion
The results are somewhat unexpected, but they are reasonable. It is commonly assumed that each top component of PLS is more important than the later ones for classifiers. However, the top components of PCA are not the feature subset with the highest discriminative power; the top components of PLS do form a highly discriminative subset, but the tail components also carry discriminative power and are selected as well. Therefore, we should not simply choose the top components, but should employ feature selection methods to select a feature subset from the extracted components for classifiers.
Feature selection is performed by a genetic algorithm (GA), which shows great power in selecting feature subsets for classifiers, as can be seen from the experimental results. The genetic algorithm based feature selection used here follows the so-called wrapper model, which uses the classifier itself to measure the discriminative power of feature subsets drawn from the extracted components. This strategy has been reported to be among the best feature selection methods [16]. However, the wrapper method is time consuming, and since the scale of data sets is increasing rapidly, more efficient feature selection methods need to be developed.
Partial least squares is superior to principal component analysis as a feature extraction method here. The reason is simple: PLS extracts components by maximizing the covariance between the response variable y and the original genes X; it uses the labels y and can be viewed as a supervised method. PCA extracts components by maximizing the variance of a linear combination of the original genes; it does not use the labels y and can be viewed as an unsupervised method. Since our goal is to improve the classification accuracy of SVM, a supervised task, PLS, a supervised method, is superior to PCA, an unsupervised one.
The features selected by different classifiers differ slightly, and the resulting prediction accuracies also differ. Feature selection has a larger effect on kNN than on SVM, because kNN is more sensitive to high dimensional data than SVM. Nevertheless, both classifiers benefit from feature selection.

Conclusion
We have investigated a systematic feature reduction framework that combines feature extraction with feature selection. To evaluate the proposed framework, we used four typical data sets. In each case, we used principal component analysis (PCA) and partial least squares (PLS) for feature extraction, a GA for feature selection, and support vector machines (SVM) and k nearest neighbor (kNN) for classification. Our experimental results illustrate that the proposed method improves classification accuracy on gene expression microarray data. Further analysis of our experiments indicates that not all of the top components of PCA and PLS are useful for classification; the tail components also contain discriminative information. Therefore, it is necessary to combine feature selection with feature extraction, replacing the traditional feature extraction step with this new preprocessing step when analyzing high dimensional problems.

A novel framework of dimension reduction
Principal component analysis (PCA) and partial least squares (PLS) are two favorite methods in gene expression analysis, but how to determine the number of extracted components for classifiers is a critical problem. In previous work, the number was fixed at the top 3 or 5 components, or obtained by cross-validation. Such work assumes that only the top several components are important. In fact, the components are ranked from a statistical point of view, and this ranking need not coincide with a ranking by discriminative ability. Therefore, we propose to apply feature selection techniques to select components for classifiers. Fig. 5 illustrates the main steps of the framework employed here. The main difference from the traditional approach is the inclusion of a step that performs feature selection among the features produced by feature extraction. From Fig. 5, we can see that dimension reduction consists of two parts, feature extraction and feature selection; here feature extraction is performed by PCA and PLS, feature selection by GA, and classification by a support vector machine (SVM) or k nearest neighbor (kNN). In Fig. 5, the classifier is also involved in feature selection; this is the so-called wrapper evaluation strategy, in which the classification performance of the classifier is used to evaluate the selected feature subset. These steps are explained in detail as follows.
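As a minimal sketch of the extract-select-classify flow (with a crude one-component-drop search standing in for the GA step, and illustrative function names and parameter values that are not the paper's), the framework might look like this in Python with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def extract_select_classify(X, y, n_components=10, seed=0):
    # hold out 1/3 of the data as the test set D_s
    Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=1/3, random_state=seed)
    # feature extraction: project onto the first n_components PCs
    pca = PCA(n_components=n_components).fit(Xtr)
    Ttr, Tts = pca.transform(Xtr), pca.transform(Xts)
    # wrapper-style selection on a validation split D_v of the training data
    Tr, Tv, yr, yv = train_test_split(Ttr, ytr, test_size=1/3, random_state=seed)
    best_err, best_subset = 1.0, list(range(n_components))
    # crude search: try dropping each single component (the paper searches
    # the space of all subsets with a GA instead)
    for drop in range(n_components):
        subset = [i for i in range(n_components) if i != drop]
        clf = SVC(kernel="rbf", C=10, gamma=1.0).fit(Tr[:, subset], yr)
        err = 1 - clf.score(Tv[:, subset], yv)
        if err < best_err:
            best_err, best_subset = err, subset
    # retrain on the full training set with the chosen subset, report test error
    clf = SVC(kernel="rbf", C=10, gamma=1.0).fit(Ttr[:, best_subset], ytr)
    return 1 - clf.score(Tts[:, best_subset], yts)
```

The key structural point is that subset evaluation uses the classifier itself on a held-out validation split, i.e. the wrapper strategy described below.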

Feature extraction
Principal component analysis
PCA is a well-known method of dimension reduction [17]. The basic idea of PCA is to reduce the dimensionality of a data set while retaining as much as possible of the variation present in the data set. The new variables, linear combinations of the original ones, are known as the principal components (PCs). Geometrically, these linear combinations represent the selection of a new coordinate system obtained by rotating the original system. The new axes represent the directions of maximum variability and are ordered by the amount of variation of the original data they account for. The first PC accounts for as much of the variability as possible, and each succeeding component accounts for as much of the remaining variability as possible. Computation of the principal components reduces to the solution of an eigenvalue-eigenvector problem. The projection vectors (also called weighting vectors) u can be obtained by eigenvalue decomposition of the covariance matrix S_X, i.e. S_X u_i = λ_i u_i, where λ_i is the i-th eigenvalue in descending order, for i = 1,..., q, and u_i is the corresponding eigenvector. The eigenvalue λ_i measures the variance of the i-th PC, and the eigenvector u_i provides the weights (loadings) for the linear transformation (projection).

Figure 5
A framework of dimension reduction for the analysis of gene microarray data.
The maximum number of components q is determined by the number of nonzero eigenvalues, i.e. the rank of S_X, so q ≤ min(n, p). In practice, however, the maximum value of q is not needed: the tail components, which have tiny eigenvalues and represent little of the variance of the original data, are usually discarded. The threshold for q is often determined by cross-validation or by the proportion of explained variance [17]. The computational cost of PCA, determined by the number of original predictor variables p and the number of samples n, is of order O(min(np^2 + p^3, pn^2 + n^3)). In other words, the cost is O(pn^2 + n^3) when p > n.
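As an illustration (a sketch, not the paper's code), the eigendecomposition view of PCA above can be written in a few lines of numpy:

```python
import numpy as np

def pca_components(X, q):
    """Project X onto its top-q principal components via eigendecomposition
    of the covariance matrix S_X (for p > n, a Gram-matrix formulation
    would be cheaper, as noted in the text)."""
    Xc = X - X.mean(axis=0)               # center the data
    S = np.cov(Xc, rowvar=False)          # p x p covariance matrix S_X
    eigvals, eigvecs = np.linalg.eigh(S)  # returned in ascending order
    order = np.argsort(eigvals)[::-1]     # re-sort descending by eigenvalue
    U = eigvecs[:, order[:q]]             # top-q projection vectors u_i
    return Xc @ U, eigvals[order[:q]]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
T, lams = pca_components(X, 2)
# the sample variance of the i-th projected component equals lambda_i
```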

Partial least squares based dimension reduction
Partial least squares (PLS) was first developed as an algorithm for performing matrix decompositions, and was then introduced as a multivariate regression tool in the context of chemometrics [18,19]. In recent years, PLS has also been found to be an effective dimension reduction technique for tumor discrimination [11,12,20], denoted Partial Least Squares based Dimension Reduction (PLSDR).
The underlying assumption of PLS is that the observed data are generated by a system or process driven by a small number of latent (not directly observed or measured) features. PLS therefore aims at finding uncorrelated linear transformations (latent components) of the original predictor features which have high covariance with the response features. Based on these latent components, PLS predicts the response features y (the task of regression) and reconstructs the original matrix X (the task of data modeling) at the same time.
The objective in constructing components in PLS is to maximize the covariance between the response variable y and the original predictor variables X, subject to the constraint t_i^T t_j = 0 for all 1 ≤ i < j, i.e. the components are mutually uncorrelated. The central task of PLS is to obtain the vectors of optimal weights w_i (i = 1,..., q) that form this small number of components; in contrast, PCA is an "unsupervised" method that utilizes the X data only.
To derive the components [t_1, t_2,..., t_q], PLS decomposes X and y to produce a bilinear representation of the data [21]:

X = t_1 w_1^T + t_2 w_2^T + ... + t_q w_q^T + e,
y = v_1 t_1 + v_2 t_2 + ... + v_q t_q + f,

where the w's are vectors of weights for constructing the PLS components t = Xw, the v's are scalars, and e and f are the residuals. The idea of PLS is to estimate w and v by regression; specifically, PLS fits a sequence of bilinear models by least squares, hence the name partial least squares [18]. Each component is constructed to have maximal covariance with the response variable y subject to being uncorrelated with all previously constructed components.
The first PLS component t_1 is obtained from the covariance between X and y. Each subsequent component t_i (i = 2,..., q) is computed using the residuals of X and y from the previous step, which account for the variation left by the previous components. As a result, the PLS components are uncorrelated and ordered.
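A compact NIPALS-style sketch of this sequential construction (illustrative code, not the authors' implementation) for a single response vector y:

```python
import numpy as np

def pls_components(X, y, q):
    """Construct q PLS components: each weight vector w_i points in the
    direction of maximal covariance between the (deflated) X and y, then
    X and y are deflated by the fitted component."""
    Xd = X - X.mean(axis=0)
    yd = y - y.mean()
    T = []
    for _ in range(q):
        w = Xd.T @ yd                  # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xd @ w                     # the new component t = Xw
        T.append(t)
        # deflate: remove the variation explained by t from X and y,
        # which makes the next component uncorrelated with t
        p = Xd.T @ t / (t @ t)
        Xd = Xd - np.outer(t, p)
        yd = yd - t * (t @ yd) / (t @ t)
    return np.column_stack(T)
```

The deflation step is what enforces the uncorrelatedness (orthogonality of the score vectors) stated above.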
The number of components q is the only parameter of PLS; it can be set by the user [11,12], by cross-validation [13], or by the regression goodness-of-fit [22]. As q increases, the explained variances of X and y grow, and all the information in the original data is preserved when q reaches the rank of X, which is the maximal value of q.
Like PCA, PLS reduces the complexity of microarray data analysis by constructing a small number of gene components, which can be used to replace the large number of original gene expression measures. Moreover, obtained by maximizing the covariance between the components and the response variable, the PLS components are generally more predictive of the response variable than the principal components.
PLS is computationally efficient, with a cost of only O(npq); i.e. the number of calculations required by PLS is a linear function of n and p. It is thus much faster than PCA, since q is always smaller than n.

Feature selection
Finding the optimal feature subset according to classification performance is referred to as feature selection: given a set of features, the problem is to select a subset that leads to the smallest classification error. A number of feature selection methods have been studied in bioinformatics and machine learning [6][7][8]. Every feature subset selection system has two main components: the search strategy used to pick candidate feature subsets, and the evaluation method used to test their goodness according to some criterion. The genetic algorithm, as a search strategy, has been reported to be among the best of the complete and heuristic methods [16]. There are two categories of evaluation strategies: 1) filter and 2) wrapper. The distinction depends on whether feature subset evaluation is performed using the learning algorithm employed in the classifier design (wrapper) or not (filter).
Filter approaches are computationally more efficient than wrapper approaches, since they evaluate the goodness of selected features using criteria that can be tested quickly. This, however, can lead to non-optimal features, especially when the usefulness of the features depends on the classifier; as a result, classifier performance might be poor. Wrapper methods, on the other hand, perform evaluation by estimating the classification error on a validation set. Although this procedure is slower, the selected features are usually closer to optimal for the classifier employed. Since we want to improve classification performance, we use the wrapper strategy; the classification performance of SVM or kNN is used as the criterion in this paper.

Genetic algorithm
The genetic algorithm (GA) is a class of optimization procedures inspired by the biological mechanisms of reproduction [23]. A GA operates iteratively on a population of structures, each of which represents a candidate solution to the problem at hand, properly encoded as a string of symbols (e.g., binary). Three basic genetic operators guide this search: selection, crossover, and mutation. The genetic search process is iterative: evaluating, selecting, and recombining strings in the population during each iteration until some termination condition is reached.
The basic idea is that selection probabilistically filters out solutions that perform poorly, choosing high performance solutions to concentrate on or exploit. Crossover and mutation, through string operations, generate new solutions for exploration. Given an initial population of elements, GA uses the feedback from the evaluation process to select fitter solutions, generating new solutions through recombination of parts of selected solutions, eventually converging to a population of high performance solutions.
In our proposed algorithm GA-FS (Genetic Algorithm based Feature Selection), we use a binary chromosome of the same length as the feature vector, in which a bit equals 1 if the corresponding feature is selected as an input and 0 if the feature is discarded. The goal of using a GA here is to achieve the same or better performance with fewer features. Therefore, the fitness evaluation contains two terms: 1) the classification error and 2) the number of selected features. We use the fitness function fitness = error + γ · number_of_selected_features, where error is the classification error on the validation data set D_v and γ is a trade-off between classification error and the number of selected features. Since reducing classification error is our major concern, γ is set to 1/(2 × 10^4).
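For concreteness, the fitness function and the basic operators can be sketched as follows (illustrative code: GAMMA matches the γ = 1/(2 × 10^4) above, while the operator details are assumptions, since the paper uses the MATLAB GA defaults):

```python
import numpy as np

GAMMA = 1 / (2 * 10**4)   # trade-off between error and subset size

def fitness(chromosome, error):
    """fitness = error + gamma * number_of_selected_features (lower is
    better); chromosome is a binary mask over the extracted components."""
    return error + GAMMA * int(np.sum(chromosome))

def crossover(p1, p2, point):
    """one-point crossover of two binary chromosomes (assumed variant)"""
    return np.concatenate([p1[:point], p2[point:]])

def mutate(c, rng, rate=0.01):
    """independent bit-flip mutation (assumed variant)"""
    flips = rng.random(c.size) < rate
    return np.where(flips, 1 - c, c)

# a chromosome selecting fewer features wins when errors are equal
a = np.array([1, 0, 1, 0, 0])
b = np.array([1, 1, 1, 1, 1])
print(fitness(a, 0.10) < fitness(b, 0.10))   # True
```

Because γ is tiny, the subset-size term only breaks ties between chromosomes with (nearly) equal validation error, keeping error reduction as the dominant objective.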
The GA-FS approach is summarized in Fig. 6, where the data set is divided into 3 parts, training set D r , validation set D v and test set D s as in the subsection of experimental setting.
Classifier - support vector machines
Support vector machines (SVM), proposed by Vapnik and his co-workers in the 1990s, have developed quickly during the last decade [24] and have been successfully applied to biological data mining [6], drug discovery [25,26], etc.
Denoting the training sample as {(x_i, y_i)}_{i=1}^n ⊆ ℝ^p × {-1, 1}, where n is the size of the training sample, the SVM discriminant hyperplane can be written as f(x) = w^T x + b, where w is a weight vector and b is a bias. According to the generalization bound in statistical learning theory [27], for a 2-norm soft margin version of SVM we need to minimize the objective function

min_{w, b, ξ} (1/2)||w||^2 + C Σ_{i=1}^n ξ_i^2 subject to y_i(w^T x_i + b) ≥ 1 - ξ_i,

in which the slack variables ξ_i are introduced when the problem is infeasible. The constant C > 0 is a penalty parameter; a larger C corresponds to assigning a larger penalty to errors.
By building a Lagrangian and using the Karush-Kuhn-Tucker (KKT) complementarity conditions [28,29], we can solve this optimization problem.
Because of the KKT conditions, only those Lagrange multipliers α_i that make the constraints active are nonzero; the points corresponding to the nonzero α_i are called support vectors (sv). We can therefore describe the classification hyperplane in terms of α and b:

f(x) = Σ_{i ∈ sv} α_i y_i K(x_i, x) + b,

where K(x, z) is a kernel function [30], introduced into SVM to treat nonlinear cases; the Gaussian kernel K(x, z) = exp(-||x - z||^2 / σ^2) is considered a good default choice [31].
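As a sketch, the resulting decision rule (given support vectors, multipliers α, and bias b, which would come from a trained model) is:

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)"""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def svm_decision(x, sv_X, sv_y, alpha, b, sigma):
    """sign of f(x) = sum over support vectors of alpha_i y_i K(x_i, x) + b"""
    s = sum(a * yi * gaussian_kernel(xi, x, sigma)
            for xi, yi, a in zip(sv_X, sv_y, alpha))
    return np.sign(s + b)
```

Note that scikit-learn's SVC parameterizes the same kernel as exp(-gamma ||x - z||^2), so gamma = 1/σ^2 reproduces the kernel used here.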
Classifier - k nearest neighbor
k nearest neighbor (kNN) is a non-parametric classifier [32] in which a new instance is classified by the majority category among its k nearest neighbors, with ties broken at random. It fits no model and relies only on the training data set.
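A minimal kNN classifier can be sketched as follows (illustrative code; ties are broken deterministically here rather than at random, for reproducibility):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training samples
    under Euclidean distance; on a tie, the smallest label wins here."""
    d = np.linalg.norm(X_train - x, axis=1)     # distances to all samples
    nearest = np.argsort(d)[:k]                 # indices of k nearest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]            # majority label
```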

Experimental data sets
Eight microarray data sets are used in our study; they are listed in Table 5 and briefly described below.

Figure 6
Genetic algorithm based feature selection.

CNS
The 2,000 genes with the highest minimal intensity across the 62 tissues were used in the analysis.
Leukemia [1] consists of 72 bone marrow samples with 47 ALL and 25 AML. The gene expression intensities are obtained from Affymetrix high-density oligonucleotide microarrays containing probes for 7,129 genes.
Lung [34] is a data set for classifying lung cancer between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. The data set includes 181 tissue samples (31 MPM and 150 ADCA), each described by 12,533 genes.

Experimental settings
To evaluate the performance of the proposed approach, we use a hold-out validation procedure. Each data set is used as a whole (originally split data sets are merged), and we randomly divide the whole set into a training set and a test set D_s (2/3 for training and the rest for testing). Furthermore, when a validation set is needed, we split the training data again, keeping 2/3 of its samples for training D_r and the rest for validation D_v. The classification error is measured on the test set D_s. We repeat the whole process 50 times.
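The nested split can be sketched as follows (illustrative; the paper does not specify its rounding, so the use of round() is an assumption):

```python
import numpy as np

def holdout_split(n, seed):
    """Randomly split n sample indices into D_r (training), D_v (validation)
    and D_s (test): 2/3 training vs 1/3 test, then the training part is
    split again 2/3 vs 1/3 into D_r and D_v."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = round(n * 2 / 3)
    train, d_s = idx[:n_train], idx[n_train:]   # 2/3 train, 1/3 test
    n_r = round(n_train * 2 / 3)
    d_r, d_v = train[:n_r], train[n_r:]         # 2/3 of train, 1/3 validation
    return d_r, d_v, d_s
```

Repeating this with 50 different seeds and averaging the test errors reproduces the 50-repetition protocol described above.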
The parameters of the GA are set to the defaults of the MATLAB implementation, and we vary the parameters of SVM and kNN to test how they affect the results.