### Supervised recursive learning

Our method, RFA, uses supervised learning to achieve the highest training accuracy and uses statistical similarity measures to choose the next variable with the least dependence on, or correlation with, the already identified variables. It proceeds as follows:

1. Insignificant genes are removed according to their statistical insignificance. Specifically, a gene with a high *p*-value is usually not differentially expressed and therefore contributes little to distinguishing normal tissues from tumor tissues or to classifying different types of tissues. To reduce the computational load, those genes are removed. The filtered gene data are then normalized. Here we use the standard normalization method MANORM, available in the MATLAB Bioinformatics Toolbox.
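As an illustration, this filtering-and-normalization step might be sketched as follows. The two-sample t-test, the 0.05 threshold, and the per-gene z-scoring (standing in for MANORM, which has no Python equivalent) are assumptions for the sketch, not the paper's exact procedure:

```python
import numpy as np
from scipy import stats

def filter_and_normalize(X, y, p_threshold=0.05):
    """Remove genes whose two-sample t-test p-value exceeds the threshold,
    then z-score normalize each remaining gene.
    X: samples x genes expression matrix; y: binary class labels.
    (Z-scoring is a stand-in for MATLAB's MANORM.)"""
    X0, X1 = X[y == 0], X[y == 1]
    _, pvals = stats.ttest_ind(X0, X1, axis=0)
    keep = pvals <= p_threshold          # low p-value = differentially expressed
    Xf = X[:, keep]
    # normalize each retained gene to zero mean, unit variance
    Xn = (Xf - Xf.mean(axis=0)) / Xf.std(axis=0, ddof=1)
    return Xn, keep
```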

2. Each individual gene is evaluated by supervised learning. The gene with the highest classification accuracy is chosen as the most important feature and the first element of the feature set. If multiple genes achieve the same highest classification accuracy, the one with the lowest *p*-value under a test statistic (e.g., the score test) is selected as the first element. At this point the chosen feature set, **G**_{1}, contains just one element, *g*_{1}, corresponding to feature dimension one.

3. The (*N*+1)^{st}-dimension feature set, **G**_{N+1} = {*g*_{1}, *g*_{2}, ..., *g*_{N}, *g*_{N+1}}, is obtained by adding *g*_{N+1} to the *N*^{th}-dimension feature set, **G**_{N} = {*g*_{1}, *g*_{2}, ..., *g*_{N}}. The choice of *g*_{N+1} is described as follows:

Add each gene *g*_{i} (*g*_{i} ∉ **G**_{N}) into **G**_{N} and obtain the classification accuracy of the feature set **G**_{N} ∪ {*g*_{i}}. The *g*_{i} (*g*_{i} ∉ **G**_{N}) whose group **G**_{N} ∪ {*g*_{i}} obtains the highest classification accuracy is a candidate for *g*_{N+1} (not yet *g*_{N+1}). Considering the large number of variables, it is highly possible that multiple features correspond to the same highest classification accuracy. These multiple candidates are placed into the set **C**, but only one candidate from **C** will be identified as *g*_{N+1}. How this selection is made is described next.
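A minimal sketch of this recursive step, using a toy nearest-mean classifier as a stand-in for the paper's supervised learners (the function names and classifier are illustrative assumptions):

```python
import numpy as np

def nearest_mean_classify(Xf, y):
    """Toy nearest-mean classifier evaluated on its own training data
    (a stand-in for NBC/NMSC)."""
    classes = np.unique(y)
    means = np.array([Xf[y == c].mean(axis=0) for c in classes])
    d = ((Xf[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def training_accuracy(X, y, features, classify):
    """Training accuracy of `classify` using only the given feature columns."""
    preds = classify(X[:, features], y)
    return float(np.mean(preds == y))

def candidate_set(X, y, G, classify=nearest_mean_classify):
    """Step 3: try adding each gene not in G; return every gene tying for
    the highest training accuracy (the candidate set C), plus that accuracy."""
    remaining = [i for i in range(X.shape[1]) if i not in G]
    accs = {i: training_accuracy(X, y, G + [i], classify) for i in remaining}
    best = max(accs.values())
    return [i for i in remaining if accs[i] == best], best
```

With `G = []`, the same routine also performs step 2 (choosing the first gene by individual classification accuracy).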

### Candidate feature addition

To find the most informative (or least redundant) next feature *g*_{N+1}, two formulas may be designed that measure the statistical similarity between the chosen feature set and each candidate. Here we use Pearson's correlation coefficient [28] between the chosen features *g*_{n} (*g*_{n} ∈ **G**_{N}, *n* = 1, 2, ..., *N*) and a candidate *g*_{c} (*g*_{c} ∈ **C**) to measure the similarity.

In the first formula, the sum of the square of the correlation, SC, is calculated to measure the similarity and is defined as follows:

\mathrm{SC}\left({g}_{c}\right)={\sum }_{n=1}^{N}{\mathrm{cor}}^{2}\left({g}_{c},{g}_{n}\right)

(7)

where *g*_{c} ∈ **C** and *g*_{n} ∈ **G**_{N}.

The selection of *g*_{N+1} can then be based on the Minimum Sum of the square of the Correlation (MSC), that is,

{g}_{N+1}\leftarrow \left\{{g}_{c}\,|\,\mathrm{SC}\left({g}_{c}\right)=\min \left(\mathrm{SC}\right),\ {g}_{c}\in \mathbf{C}\right\}

(8)

In the second formula, the maximum value of the square of the correlation, MC, is calculated:

\mathrm{MC}\left({g}_{c}\right)=\underset{n=1,2,...,N}{\max }\left({\mathrm{cor}}^{2}\left({g}_{c},{g}_{n}\right)\right)

(9)

where *g*_{c} ∈ **C** and *g*_{n} ∈ **G**_{N}.

The selection of *g*_{N+1} follows the criterion that the MC value is the minimum, which we call the Minimum of the Maximum value of the square of the Correlation (MMC).

{g}_{N+1}\leftarrow \left\{{g}_{c}\,|\,\mathrm{MC}\left({g}_{c}\right)=\min \left(\mathrm{MC}\right),\ {g}_{c}\in \mathbf{C}\right\}

(10)
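The two selection rules translate directly from Eqs. (7)-(10). A sketch (function names are illustrative):

```python
import numpy as np

def sq_corr(X, c, n):
    """Squared Pearson correlation between gene columns c and n."""
    return float(np.corrcoef(X[:, c], X[:, n])[0, 1]) ** 2

def select_next(X, G, C, rule="MSC"):
    """Choose g_{N+1} from the candidate set C given the chosen set G.
    MSC: minimize the sum of squared correlations (Eqs. 7-8).
    MMC: minimize the maximum squared correlation (Eqs. 9-10)."""
    agg = sum if rule == "MSC" else max
    scores = {c: agg(sq_corr(X, c, n) for n in G) for c in C}
    return min(scores, key=scores.get)   # least similar candidate wins
```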

In the methods mentioned above, a feature is recursively added to the chosen feature set based on supervised learning and the similarity measures. With the use of a classifier XXX, we call the first gene selection method XXX-MSC and the second one XXX-MMC. For example, if the classifier is Naive Bayes Classifier (NBC), we call the two strategies NBC-MSC and NBC-MMC, respectively.

### Lagging Prediction Peephole Optimization (LPPO)

We want to find a combination of features (genes) that yields the best classification performance on testing samples. Normally, with recursive addition of the next feature, the training accuracy will increase, reach a peak classification performance at some point, and possibly maintain it with subsequent feature additions; after that, the training accuracy may decrease. Generally speaking, any strategy for determining the final feature set should be based on the best training classification. In high-dimensional data analysis, it is common that the best training accuracy corresponds to different feature sets; that is, multiple feature sets achieve the same highest training accuracy. Although all these feature sets are associated with the same highest training accuracy, their testing accuracies may differ. Among these highest-training feature sets, the one with the best testing accuracy is called the optimal feature set, which is difficult to characterize when the sample size is small. Applying different gene selection methods to the same training samples, applying the same gene selection method to different training samples, or applying different learning classifiers to the same training samples will each produce a different optimized feature set. Pochet *et al.* [29] presented a method of determining the optimal number of genes by means of a cross-validation procedure; the drawback of this method is that it actually uses the whole data set, including both training samples and testing samples.

How do we choose the optimal feature set? If multiple feature sets achieve the best training classification, a random choice among them, called the random strategy, is one option. In the recursive addition of features, a classification model that is best for the training samples may not be optimal for the testing samples, because of the differences between the training and testing samples; the optimal classification model for testing will lag in appearance (see Figure 1). Based on this observation, we propose the following optimization algorithm.

Under feature dimension *j*, the training accuracy of the *i*^{th} experiment is *r*(*i*, *j*). Suppose the feature set **G**_{k}, corresponding to feature dimension *k*, achieves the best training accuracy among the feature sets **G**_{1} to **G**_{D} (feature dimensions 1 to *D*). Let **HR** denote the set of all such **G**_{k}, that is, all feature sets attaining the highest classification accuracy over feature dimensions 1 to *D*:

\mathbf{HR}=\left\{{\mathbf{G}}_{k}\,|\,r\left(i,k\right)=\max \left(r\left(i,\cdot \right)\right),\ 1\le k\le D\right\}

(11)

In general, the best classification model for testing samples will lag in appearance behind the initial best training model. We will exclude the elements of **HR** that correspond to the initial best training. The remaining elements in **HR** constitute the candidate set **HRC** for optimization.

Each element in **HRC** is associated with the best training accuracy. We set a peephole for each element and choose the element associated with the optimal peephole. The details are described as follows:

a. For each element **G**_{k} ∈ **HRC**, the peephole over **G**_{k} with length 2*l*+1 covers the feature sets **G**_{k-l}, **G**_{k-l+1}, ..., **G**_{k}, ..., **G**_{k+l-1}, **G**_{k+l}, corresponding to the training accuracies *r*(*i*, *k-l*), *r*(*i*, *k-l*+1), ..., *r*(*i*, *k*), ..., *r*(*i*, *k+l*-1), *r*(*i*, *k+l*). The mean training accuracy of the peephole is denoted by mp_r(*i*, *k*):

mp\_r\left(i,k\right)=\frac{1}{2l+1}{\sum }_{m=k-l}^{k+l}r\left(i,m\right)

(12)

The peephole with the highest mean training accuracy mp_r is then chosen as the optimal one.

b. If there are multiple optimal peepholes, we apply random forest to these peepholes and compare the mean values of the Out-of-Bag (OOB) error rates [24, 25, 30]. The feature sets **G**_{k-l}, **G**_{k-l+1}, ..., **G**_{k}, ..., **G**_{k+l-1}, **G**_{k+l} correspond to the OOB errors oob_e(*i*, *k-l*), oob_e(*i*, *k-l*+1), ..., oob_e(*i*, *k*), ..., oob_e(*i*, *k+l*-1), oob_e(*i*, *k+l*). The mean value of the OOB errors is denoted by mp_oob_e(*i*, *k*):

mp\_oob\_e\left(i,k\right)=\frac{1}{2l+1}{\sum }_{m=k-l}^{k+l}oob\_e\left(i,m\right)

(13)

The peephole with minimum mp_oob_e is the optimal one.

c. If there are multiple peepholes with the same best mp_r and minimum mp_oob_e, set *l* + 1 → *l* and repeat steps 'a' to 'c' until a unique optimal peephole is determined.

d. The feature set located at the center of the final optimal peephole is chosen as the final optimal feature set.
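Steps a-d can be sketched on a single training-accuracy curve. This is a simplified version: the random-forest OOB tie-breaking of step b and the peephole widening of step c are omitted, the peephole is clipped at the curve boundaries, and all names are illustrative:

```python
import numpy as np

def lppo(r, l=2):
    """Simplified LPPO: r[k] is the training accuracy at feature dimension k
    (0-based). Returns the dimension at the center of the optimal peephole."""
    r = np.asarray(r, dtype=float)
    hr = np.flatnonzero(r == r.max())       # Eq. 11: all best-training dimensions
    hrc = hr[1:] if len(hr) > 1 else hr     # exclude the initial best (it lags)

    def mp_r(k):                            # Eq. 12: mean accuracy of the peephole
        lo, hi = max(0, k - l), min(len(r), k + l + 1)
        return r[lo:hi].mean()

    return int(max(hrc, key=mp_r))          # step a/d: best peephole, its center
```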

This optimization of RFA is called Lagging Prediction Peephole Optimization (LPPO). Figure 3 briefly outlines the LPPO on the prostate data set, which was studied by Singh *et al.* [31].

### Data sets

The following six benchmark microarray data sets have been extensively studied; we use them in our experiments to compare the performance of our methods with that of others. Unless otherwise specified, the data are available at: http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.

1) The LEUKEMIA data set consists of two types of acute leukemia: 48 acute lymphoblastic leukemia (ALL) samples and 25 acute myeloblastic leukemia (AML) samples, with 7129 probes from 6817 human genes. It was studied by Golub *et al.* [32].

2) The LYMPHOMA data set consists of 58 diffuse large B-cell lymphoma (DLBCL) samples and 19 follicular lymphoma (FL) samples. It was studied by Shipp *et al.* [33]. The data file, lymphoma_8_lbc_fscc2_rn.res, and the class label file, lymphoma_8_lbc_fscc2.cls were used in our experiments for identifying DLBCL and FL.

3) The PROSTATE data set used by Singh *et al.* [31] contains 52 prostate tumor samples and 50 non-tumor prostate samples.

4) The COLON cancer data set used by Alon *et al.* [34] contains 62 samples collected from colon-cancer patients. Among them, 40 are tumor biopsies and 22 are normal biopsies taken from healthy parts of the colons of the same patients. Based on the confidence in the measured expression levels, 2000 genes were selected. The data source is available at: http://microarray.princeton.edu/oncology/affydata/index.html.

5) The Central Nervous System (CNS) embryonal tumor data set that was originally studied by Pomeroy *et al.* [35] contains 60 patient samples. Among them, 21 are survivors who are alive after treatment, and 39 are failures who succumbed to their diseases. There are 7129 genes.

6) The BREAST cancer data set studied by van 't Veer *et al.* [36] contains 97 patient samples: 46 relapse patients, who developed distant metastases within 5 years, and 51 non-relapse patients, who remained healthy for at least 5 years after their initial diagnosis. This data source is available at: http://www.rii.com/publications/2002/vantveer.htm.

### Experiments

Our experiments are designed as follows:

1. The data sets are first divided randomly into training samples and testing samples. The ratio of training samples to testing samples is approximately 1:1 in each class.

2. Recursive feature addition with the Naive Bayes Classifier (NBC) and the Nearest Mean Scaled Classifier (NMSC) (i.e., NBC-MSC, NBC-MMC, NMSC-MSC, and NMSC-MMC) is applied to the training samples for gene selection. Different feature sets of the gene expression data are produced under feature dimensions 1 to 100. We compare the proposed methods with several recently developed and published gene selection methods: LOOCSFS, GLGS, SVMRFE, SFFS-LS bound, SFS-LS bound, and the T-TEST.

3. To compare different gene selection methods, the learning classifiers including NBC, NMSC, SVM [37, 38], and Random Forest are applied to the testing samples.

4. The experiments are repeated for 20 runs, and the average testing accuracies are compared to evaluate performance.