Sparse feature selection for classification and prediction of metastasis in endometrial cancer

Background Metastasis via pelvic and/or para-aortic lymph nodes is a major risk factor for endometrial cancer. Lymph-node resection ameliorates risk but is associated with significant co-morbidities. Incidence in patients with stage I disease is 4–22% but no mechanism exists to accurately predict it. Therefore, national guidelines for primary staging surgery include pelvic and para-aortic lymph node dissection for all patients whose tumor exceeds 2cm in diameter. We sought to identify a robust molecular signature that can accurately classify risk of lymph node metastasis in endometrial cancer patients. 86 tumors matched for age and race, and evenly distributed between lymph node-positive and lymph node-negative cases, were selected as a training cohort. Genomic micro-RNA expression was profiled for each sample to serve as the predictive feature matrix. An independent set of 28 tumor samples was collected and similarly characterized to serve as a test cohort. Results A feature selection algorithm was designed for applications where the number of samples is far smaller than the number of measured features per sample. A predictive miRNA expression signature was developed using this algorithm, which was then used to predict the metastatic status of the independent test cohort. A weighted classifier, using 18 micro-RNAs, achieved 100% accuracy on the training cohort. When applied to the testing cohort, the classifier correctly predicted 90% of node-positive cases, and 80% of node-negative cases (FDR = 6.25%). Conclusion Results indicate that the evaluation of the quantitative sparse-feature classifier proposed here in clinical trials may lead to significant improvement in the prediction of lymphatic metastases in endometrial cancer patients. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3604-y) contains supplementary material, which is available to authorized users.

major wound dehiscence, all of which adversely affect the patients' health and quality of life [3].
Incidence of pelvic and para-aortic node metastasis in patients with stage I endometrial cancer varies from 4-22% depending on grade, depth of invasion, lymphovascular space invasion, and histologic subtype [4]. Patients harboring tumors less than 2 centimeters in diameter and with less than 50% myometrial invasion are considered to be at low risk for lymphatic metastasis [5]. In a key clinical study, patients whose tumors violate these criteria were recommended for lymphadenectomy. Yet, within this high risk group, only 22% had lymph node metastasis, suggesting that 78% of the lymphadenectomies were unnecessary [5]. A more recent study [6] that separately considered pelvic versus paraaortic lymph node invasion showed little improvement in this statistic. It is therefore clear that current best practice clinical-pathologic parameters are grossly insufficient for reasonable prediction of metastatic disease [5].
To address this clinical need, efforts have been made to develop molecular signatures for predicting lymph node metastasis. An ideal classifier should consist of two parts: a set of features that are highly predictive, and a numerical procedure for combining the measured values of these features so as to make a binary prediction (yes or no) about the metastatic risk of a patient. Most of the current molecular signatures under consideration perform the first step but not the second; that is, they contain a set of key biomarkers, but do not apply a systematic method for predicting the outcome on an independent testing cohort. For example, absence of expression of the estrogen receptor (ER) and progesterone receptor (PR) genes, the so-called double-negative situation, correlates with increased risk of lymph node metastasis [7]. However, this correlation does not appear to translate into a prognostic test that can predict lymph node metastasis on a patientby-patient basis. Of note, levels of CA125 together with three parameters obtained from radiological images are sufficient to correctly identify low-risk patients. However, about half of patients are incorrectly classified as at risk for metastasis [8]. Finally, CA125 together with HE4 are positively correlated with tumor grade as well as risk of lymph node metastasis, but again, this correlation has not been translated into a prognostic classifier [9].
Machine learning is a discipline that combines engineering, statistics, and computer science that can potentially be used to generate highly informative biomarkers automatically from biological data sets. Most effective and widely used machine learning methods, such as the support vector machine (SVM) [10], are specialized for applications where the number of samples is far larger than the number of features per sample. However, a common conundrum for medical research applications is that the number of characterized patient samples is far smaller than the number of molecular measurements (features) per sample. Consequently, the application of machine learning methods to translational medical research must address two distinct but interwoven challenges: the selection of a handful of the most predictive features from a very large initial set of features, and a method for combining the measured values of these predictive features into a numerical recipe for making predictions. Motivated by this consideration, we have developed an algorithm that is specifically tailored for such biological applications. The 1 -norm SVM formulated in [11] guarantees sparse solutions, but in biological applications where the data is highly correlated its prediction performance is poor and the set of nonzero features is not stable when the data is noisy. To overcome this limitation, the Elastic Net (EN) algorithm was introduced in [12], which minimizes a convex combination of the 1 -norm and the square of the 2 -norm. It is shown in [12, Theorem 1] that the EN formulation achieves the so-called "grouping effect, " whereby highly correlated features are achieved near equal weight. However, in a theoretical paper written by a subset of the present authors [13], it is established that the EN formulation is not suitable for compressed sensing; see [13,Theorem 2.1]. To overcome these limitations, we take a novel approach and use a convex combination of 1 -and 2 -norms in our sparse classification algorithm. In [13], it is established that our algorithm out-performs both the 1 -norm SVM and the Elastic Net. Robustness to variations in experimental protocols is achieved by incorporating recursive feature elimination [14], and stability selection [15]. When applied to quantitative genome-scale microRNA expression data from 86 clinically annotated primary endometrial tumors, 18 micro-RNAs were recovered that are sufficient to predict the risk of lymph node metastasis within the training cohort. This biomarker panel was tested on an independent cohort of 28 tumors, and returned predictions with high sensitivity, low false discovery rate, and P < 0.0004. The panel therefore provides a path towards the development of a practical molecular diagnostic to avoid unnecessary surgeries (and their associated morbidities) in patients who are not at risk currently about 78% of all lymph node resections for endometrial cancer patients in the USA. This study is thus a transdisciplinary combination of two distinct advances: (i) a new algorithm for sparse feature selection in binary classification problems, and (ii) its application to predict the risk of metastasis in endometrial cancer.

Selection of training cohort and generation of the predictive feature matrix
We established strict inclusion and exclusion criteria for this study in an effort to control for the known clinical factors associated with lymph node metastasis. Specifically, we excluded all non-endometrioid histologies as these tumors are clinically and biologically distinct from the more common endometrioid histologic subtypes and are much more likely to show evidence of lymphatic spread. We also excluded those tumors with gross evidence of extra-uterine disease at the time of surgery, thus limiting inclusion to clinical stage I tumor, as the presence of gross pelvic or intra-abdominal tumor increases the likelihood of positive lymph node metastasis [3]. Fifty stage I (1988 FIGO staging) and 50 stage IIIC frozen endometrial cancer samples were obtained from the Gynecologic Oncology Group tumor bank according to the above criteria. The samples were collected from patients enrolled in GOG tissue acquisition protocol 210 which established a repository of clinical specimens with detailed clinical and epidemiologic data from patients with surgically staged endometrial carcinoma. Samples were matched for age, grade, presence of lymphvascular space invasion, and where possible for race. All patients enrolled in GOG 210 have undergone comprehensive surgical staging consisting of total abdominal hysterectomy, bilateral salpingooophorectomy, pelvic and para-aortic lymphadenectomy. All patients included in this study had no gross or pathologic evidence of extra-uterine disease aside from lymph node metastasis and could therefore be considered clinical stage I tumors. While more patients with stage IIIC tumors had LVSI and deep myometrial invasion relative to the stage I tumors, the majority of patients in both groups had poor prognostic factors and would have been included in the high-intermediate risk (HIR) subgroup set forth in GOG protocol 99 [16]. Specifically, 65% the stage I tumors and 81% of the stage IIIC tumors would be considered HIR (P = 0.5), highlighting the homogeneity of the entire tumor set ( Table 1). All tumors were subjected to central pathologic review by the GOG. Quantitative measurement of miRNA expression was chosen for detection of putative predictive features. As a family, miRNAs represent a relatively compact feature set which is, never-the-less, profoundly integrated with cell and tissue behavior [17][18][19]. Moreover, miRNA expression patterns have been identified that can predict benign vs. malignant disease, histologic subtypes, survival, and response to chemotherapy [20][21][22]. Two recent surveys highlight the role of miRNAs in cancer in general [23] and endometrial cancer in particular [24].
Total cellular miRNA was extracted from all tissues and measured using LNA-based detection arrays (Additional file 1: Table S1). 86 samples passed quality controls based on RNA integrity and expression array performance. Among the 1,428 available probe sets, 213 miRNAs were detectable in all 86 samples (Additional file 1: Table S2). An unsupervised two-way hierarchical clustering of the resulting miRNA expression values within each subclass revealed substantial expression variation between tumors, with no qualitatively evident distinctions between subclasses ( Fig. 1).

Generation of molecular signature for predicting lymph node metastasis
In order to detect candidate quantitative microRNA feature sets within the primary tumors that may discriminate between node positive and node negative disease, as well as a numerical procedure for combining the measured values of the features, we turned to machine learning protocols. When the number of features is larger than the number of samples, which is typical for biological problems such as the one here, machine learning approaches commonly encounter a phenomenon known as "overfitting, " wherein a classifier does an excellent job on the training data, but has poor generalization abilities. To overcome this problem, we developed a sparse classification algorithm that uses a convex combination of 1 -and 2 norms as a regularization term in its objective function. The traditional support vector machine (SVM), as broadly applied to medical research, generates a classifier via a so-called discriminant function, which is a weighted linear combination of the measured values of the features, minus a threshold. If the discriminant value associated with a particular sample is positive, the sample is assigned to the positive class (in our case, at risk of metastasis), and is assigned to the negative class otherwise (in our case, not at risk). The main drawback of the traditional SVM is that in general the discriminant assigns a nonzero weight to all the features, which is unacceptable when the number of features is large. Therefore we replaced the Euclidean or 2 -norm distance measure used in the traditional SVM algorithm by a combination of the 2 -norm and the so-called 1 -norm, which is the sum of the absolute values of a vector. The use of 1 -norm in the penalty function causes the classifier to be sparse, while the 2 -norm causes correlated features to be selected together providing robustness to the method; see [13,Theorem 2.2]. To reduce the size of the feature set still further, we applied recursive feature elimination (RFE) [14]. When RFE is applied with the traditional SVM, the performance is often erratic, and the algorithm must be iterated many times before a satisfactory result can be obtained, if at all [14]. However, because the combined 1 -and 2 -norm SVM assigns "exactly" zero weights to several features at once, RFE together with the combined 1 -and 2 -norm SVM led to a steady improvement in the fitting at each iteration. Finally, to ensure that the chosen feature set is relatively insensitive to noise, at each iteration of lone star we divided the available samples into random training and crossvalidation sets, repeated this exercise many times till the number of the selected features stabilizes. This approach is known as "stability selection" [15]. The number of such divisions is the only user defined parameter in lone star and in practice we have observed that 80 iterations is optimal, in the sense that increasing this number does not lead to better performance. Furthermore, to avoid over-fitting lone star compensates for it automatically by increasing the number of iterations. The overall algorithm is referred to in its full form as " 1 -, 2 -norm SVM ttest and RFE, " or "lone star" for short. To facilitate its use by the general community, a Matlab implementation of the algorithm has been made freely available by the authors at the following URL: http://sourceforge.net/ projects/lonestar/.
To detect discriminatory features that may predict metastatic disease, 213 miRNA expression features measured in 86 samples (43 lymph node-positive and 43 lymph node-negative) were used as the training data after normalization (Additional file 1: Table S3). The application of the lone star algorithm in the training data with 80 random cross validations at each iteration resulted in a set of 18 features. Afterwards, to compute a unique classifier, a single iteration of lone star is run with these 18 features and the 20 best-performing classifiers giving the best cross-validation error were computed (Additional file 1: Table S4). To have a more robust classifier the weight vectors and thresholds of these 20 classifiers were averaged to arrive at the weight vector and threshold of the final classifier. Table 2 gives the details of the classifier, including the 18 miRNAs, the weights assigned to their expression levels, and the threshold. This classifier was applied to the 86 tumor training cohort, and it classified all 86 tumors correctly. Figure 2 shows the values of the discriminant function on the expression levels of all 86 tumors.

Biological significance of selected biomarkers
We next carried out an analysis of the various genes that are regulated by the 18 miRNAs in the final feature set. The results are shown in Table 3. We retrieved data from the miRTarbase database, which comprises experimentally validated micro-RNA to target gene interactions in humans. A total of 740 genes were recovered, the vast majority of which are associated with the micro-RNA hsa-mir-155. A recent study suggests that hsa-mir-155 is over-expressed in endometrial cancer patients vis-avis normal patients [25]. We next computed the average expression value of each of the 18 miRNAS within the 43 node-positive samples as well as the 43 node-negative samples to identify those with a statistically significant differential representation between node positive versus node negative tumors. This returned hsa-miR-340, hsa-miR-451, hsa-miR-1284, has-miR-1908 and hsa-miR-223 (P < 0.05, student t-test). To prune the list of 740 miRNA targets, we used two criteria: (i) A gene is targeted by   Next, we compared the list of 23 genes to the pathways in the KEGG database. Several cancer pathways were examined, and for each pathway, the q-value of the gene set was computed. The q-value is obtained from the Fisher exact test after the Benjamini-Hochberg multiple testing correction and quantifies the statistical significance of the overlap between the gene list and a set of genes in a particular pathway. The complete list of pathways examined and the associated q-values were computed (Additional file 1: Table S10). The most enriched pathways are Prostate Cancer, Small Cell Lung Cancer and Bladder Cancer with q-values of 0.00074, 0.01048 and 0.01418 respectively.

Classifier validation with an independent cohort
To rigorously test the classifier developed using the lone star algorithm, an independent cohort of primary tumors with known metastatic state was collected. This comprised 28 endometrial cancer samples obtained between 2010 and 2012 under an IRB approved Comprehensive Gynecologic Oncology Tumor Repository protocol. Patients were consented according to protocol and fresh tumor was obtained in the operating room after the uterus was excised from the patients and bivalved with a scalpel. Tissue was flash frozen in liquid nitrogen and stored at −80 • . The cohort included 19 corpus confined endometrial cancers and 9 metastatic endometrial cancers (Additional file 1: Table S5). Eight cases in the latter group demonstrated nodal metastasis while one (sample 198) had metastatic disease involving the left fallopian tube. Six sampled lymph nodes from this patient were negative for metastatic disease. However, this patient developed recurrent disease involving the left lung within six months of completing adjuvant chemotherapy. Thus the surgeons believe that this patient actually presented with metastatic disease that was not detected.
MicroRNAs were extracted and measured using the identical procedures as described for the training cohort with the exception that the Exiqon version 6 arrays were replaced by version 7 (Additional file 1: Table S9). For all of the 28 samples the discriminant value is calculated using the classifier obtained from the training data. The discriminant values for each of the 28 samples are given in the Additional file 1: Table S11 and plotted in Fig. 5. The left-most nineteen samples correspond to lymph nodenegative patients while the right-most nine samples are lymph node-positive patients. A patient with a positive discriminant value is predicted to be lymph positive, while a patient with a negative discriminant value is predicted to be lymph negative. Sample 198 is the right-most point in each plot. It can be seen that the value of the discriminant function is very large for this sample. This reinforced our suspicion that the clinical annotation of this sample as lymph node-negative is erroneous and that in fact this patient had metastatic disease.
The quality of the classification results were determined with a 2 × 2 contingency table, and computing the likelihood of arriving at the classifications purely through chance. P-values were computed using the Fisher exact test [26] and the Barnard exact test [27] (Table 4). When sample 198 was treated as being lymph node-positive, as potentially justified by the fact that the patient subsequently developed recurrent disease within the left lung, the P-value was 0.0004 with the Barnard exact test, and 0.0012 with the less powerful Fisher exact test. Even when sample 198 was treated as lymph node-negative, and thus

Lone star as a sparse classification algorithm
The development of the support vector machine (SVM) [10] was a major milestone in machine learning, because the algorithm is very robust numerically, and can therefore handle very large datasets. The original SVM formulation was for engineering problems, where it is relatively easy to generate a large number of samples, as a result of which the number of features is far smaller than the number of samples. However, it was recognized almost at once that the standard SVM formulation had some weaknesses when applied to biological datasets, where the situation is the reverse. Specifically, the classifier produced by the traditional SVM assigns a nonzero weight to every single feature. When the number of features is larger than the number of samples, this leads to a phenomenon known as "over-fitting," wherein a classifier does an excellent job on the training data, but has poor generalization abilities. This phenomenon is also referred to as "memorization" of the training data.
To overcome this shortcoming, several approaches have been proposed in the literature. The 1 -norm SVM of [11] suggests replacing the traditional 2 -norm proposed in [10] with the 1 -norm, which is the sum of the absolute values of the components of a vector. As shown in [11], the 1 -norm SVM is guaranteed to choose no more features than the number of samples, no matter how large the number of features happens to be. However, when the number of samples is around a hundred, as in the application studied in the present paper, even this number is too large to be useful in practice. An entirely different approach known as recursive feature elimination is proposed in [14], in which a traditional ( 2 -norm) SVM is trained, the feature with the smallest weight (in magnitude) is dropped, and the algorithm is run anew. In [14] the approach was applied to a leukemia data set, and identified just two features as being sufficient. However, in general, the performance of the algorithm is non-monotonic, meaning that the performance on training data tends to go up and down as more and more features are dropped. For this reason, the recursive feature elimination step needs to be repeated many times from different starting points. Another improvement in machine learning is stability selection proposed in [15], that suggests running an algorithm many times with different random partitionings of the available data into training and testing datasets. Stability selection ensures that the finally selected feature set is quite robust against measurement noise. The lone star algorithm combines the above-mentioned ideas in a self-contained package. Specifically, the objective function minimized in the lone star algorithm is a convex combination of the 1 -and the 2 -norms. In this respect, the algorithm differs from both the 2 -norm SVM of [10] as well as the 1 -norm SVN of [11]; it also differs from the Elastic Net formulation of [12]. In addition, the algorithm also incorporates differential weighting for false positive and false negatives [28] and an optional t-test to filter the features when their initial number is very large. Our new algorithm is therefore of interest to the theoretical machine learning community.
While the above-mentioned ideas have been individually proposed in the machine learning literature, thus far they have not been effectively combined into one algorithm. The closest approach to the lone star algorithm is the so-called SVM-T-RFE algorithm introduced in [29]. In that algorithm, the authors use as their starting point the SVM-RFE approach suggested in [14], and also compute the t-test statistic to determine whether an individual gene does, or does not, show a significant variation between the two classes to be discriminated. Thus the SVM-T-RFE algorithm in [29] still uses the traditional SVM formulation based on Euclidean distances, which causes all genes to be assigned positive weights in general. Then a new figure of merit is computed for each gene, which is a combination of its weight from the Euclidean norm-based SVM output and the t-test statistic. The gene (or feature) with the smallest of merit is discarded, and the process is repeated. This is in contrast to the lone star algorithm, wherein a combination of the 2 -and the 1 -norm distance measures is used, which causes most weights to exactly equal zero. As a result, a large number of features can be eliminated at each iteration, as opposed to one feature at a time in SVM-T-RFE. Consequently the lone star algorithm converges far more quickly and is also more numerically stable, compared to SVM-T-RFE and other methods based on using Euclidean distance measures.

Application to endometrial cancer
The problem of assessing the risk of endometrial cancer patients for lymph node metastasis has been the subject of much study over the years. So far as we are able to determine, the present study is one of only two in which predictive biomarkers were tested with an independent sample cohort, the other being [30]. Validation on an independent cohort is vital to determine whether the prediction methodology is robust against unavoidable variations in measurement platforms and experimental protocols. If a prediction methodology is cross-validated on a common cohort, all of the potential variations in data introduced by platform-and protocol-dependencies are absent. This can lead to misleadingly high performance that may or may not be repeated with a genuinely independent data set.
The ultimate objective of a molecular signature for endometrial cancer should be to identify patients who are not at risk of lymph node metastasis, in such a way that most patients who require lymphadenectomy receive it. However, in every clinical test there is an associated false negative rate and a good test should be able to make this rate acceptably small, say around 5%. In the validation analysis presented here, 8 out of the 9 surgically confirmed node positive patients were correctly identified. In addition, 15 out of the 16 patients classified as not requiring surgery were surgically confirmed node negative patients. Thus the classifier achieved both desired objectives within a significant confidence interval. Application to much larger patient cohorts is anticipated to determine if appropriate receiver operator characteristics can be achieved for clinical application as a diagnostic.

Conclusions
In this work, we have developed a novel sparse classification algorithm and applied it to predict risk of lymph node metastasis in endometrial cancer patients. The algorithm produced a weighted classifier, using 18 micro-RNAs, and achieved 100% accuracy on the training cohort. When applied to an independent testing cohort, the classifier correctly predicted 90% of node-positive cases, and 80% of node-negative cases (FDR= 6.25%).
The classifier developed in this study was based on molecular measurements from excised tumors. If one could predict the risk of lymph node metastasis on the basis of a biopsy, then the decision to carry out lymphadenectomy or not could be made at the time of excision of the primary tumor. Therefore a useful next step would be to repeat the present study on a cohort of biopsies. Pending the completion of such a study, it is worth noting that a prediction of the risk of metastasis is valuable even if lymphadenectomy is not performed, as it can inform choices for post-resection patient care.

Selection of specimens
Fifty stage I and 50 stage IIIC frozen endometrioid endometrial cancer samples were obtained from the Gynecologic Oncology Group (GOG) tumor bank. The samples were collected from patients enrolled in GOG tissue acquisition protocol 210 which established a repository of clinical specimens with detailed clinical and epidemiologic data from patients with surgically staged endometrial carcinoma. All patients enrolled in GOG 210 have undergone comprehensive surgical staging consisting of total abdominal hysterectomy, bilateral salpingo-oophorectomy, pelvic and para-aortic lymphadenectomy. While there was no mandated minimum lymph node count for inclusion on GOG protocol 210, specific procedural requirements for pelvic and paraaortic lymphadenectomy were stipulated which necessitated removal of all lymphatic tissue from the relevant lymphatic beds. Patients included in our study had no gross or pathologic evidence of extra-uterine disease and could be considered clinical stage I tumors. All tumors have undergone central pathologic review by the GOG and contain ≥ 75% tumor.

MicroRNA isolation and array analysis:
Once the tumor samples were collected, frozen tissue was added to a chilled BioPulverizer H tube (Bio101, Irvine, CA). Lysis buffer from the Ambion mirVana microRNA isolation kit (Ambion, Austin TX) was added and the tissue homogenized for two minutes in a Mini-Beadbeater (Biospec Products, Bartlesville, OK). Tubes were spun briey to pellet the garnet mixture and reduce foam. The lysates were then transferred to a new 1.5 ml tube using a syringe. MicroRNA was then extracted using the Ambion mirVana microRNA isolation kit (Ambion, Austin TX).

Array methods:
Total RNA samples were labeled with Hy3 using the Hi-Power labeling kit (Exiqon) per the manufacturers protocol. miRCURY LNA microRNA Array Spike-in kit v2 (Exiqon) was used as a control for the labeling reaction and to calibrate scanner settings. Briefly, 1.5 μg total RNA in 3 μL,1 μL spike-in miRNA kit v2, 0.5 μL CIP buffer and 0.5 μL CIP enzyme were mixed on ice and incubated at 37 • for 30 minutes. The RNA was then denatured at 95 • and then immediately placed on ice for at least 2 minutes. This reaction product was then mixed with 3 μL Hi-Power labeling buffer, 1.5 μL Hy3 uorescent label, 2 μL DMSO and 1 μL Hi-Power labeling enzyme for a total of 12.5 μL, and then incubated for 2 hours at 16 • . Samples were subsequently hybridized to microarray slides (Exiqon miRCURY LNA microRNA 6th generation array) using a NimbleGen/MAUI 4-Bay hybridization station per the manufacturers protocol. Briefly, the labeled RNA was brought up to 25 μL volume and 25 μL of hybridization buffer (Exiqon) was added. This solution was then denatured at 95 • and put on ice for at least 2 minutes. Microarray slides were placed in hybridization chambers and pre-warmed to 56 • for at least 5 minutes. A total of 45 μL of sample was added to the microarray slide and hybridized for 16 hours at 56 • in the hybridization chamber. Slides were then washed once for 2 minute at 56 • in wash buffer A (Exiqon) and once for 2 min at 23 • in wash buffer B (Exiqon). Slides were then washed for 2 minutes at 23 • in wash buffer C (Exiqon), washed briey in 99% ethanol, and then spun in a centrifuge (1000 rpm) for 5 minutes to dry. Microarray slides were scanned using the Tecan PowerScanner scanning system. Spot quantification and statistical analysis were performed using ImaGene 9 and Nexus Expression 2 software (BioDiscovery Inc.) using the Exiqon default settings. Briefly, for quality control, correlation coefficients of spike-in controls across arrays were calculated, and arrays with correlation coefficients less than 0.8 were removed from the dataset; spot background subtraction was done by subtracting the median local background from the mean intensity of the spot, replicated probes on each array were combined into one output value using the median value, and normalized across all arrays using quantile normalization. The data output was log 2 transformed.

Lone star algorithm:
Suppose we are given a set of labeled data here x i ∈ R n and y i ∈ {−1, 1} for i = 1, 2, · · · , m. Therefore, n denotes the number of features and m denotes the number of samples. The feature vector x i is viewed as a row vector. The objective is to choose a subset of features F ⊆ {1, · · · , n}, a weight vector w ∈ R n and a threshold θ ∈ R such that (a) the discriminant function f (x i ) = x i w − θ has the same sign as y i for most indices i, (b) w j = 0 for all j / ∈ F, and (c) |F| m. In words, the discriminant function f is linear, and the set of features used by the discriminant has smaller cardinality than the number of samples. Define and let m 1 = |P|, m 2 = |N|. The algorithm consists of three parts, namely: an optional preprocessing step, an iterative loop and a final classifier generation step. The first step is the preprocessing, steps 2 through 4 are the iterative loop, and step 5 is the final classifier generation.
1. Normalization of Feature Vectors: Normalize each of the remaining feature vectors by subtracting the mean over all m samples and then scaling so that the resulting vector has Euclidean norm of one. The resulting vector is just the set of Z-scores divided by √ m. Set the iteration counter to 1, the feature set F to the set of significant features, the feature count s 1 to |F|, the iteration count i to one, and proceed to the iterative loop. subject to the constraints w t x j − θ + y j ≥ 1, j ∈ P, w t x j − θ − z j ≤ −1, j ∈ N, y ≥ 0 k 1 , z ≥ 0 k 2 . The parameter λ should be chosen "close to" zero but not exactly zero. The parameter α should be chosen as 0.5 if sensitivity and specificity are equally important. To place more emphasis on sensitivity, α should be chosen less than 0.5, while α should be chosen to be greater than 0.5 to place more emphasis on specificity. Finally the parameter γ adjusts the relative weights given to the 1 -and 2 -norms. In this study, we choose α = 0.5 and we used 2-fold cross-validation for tuning γ = 0.5.

Recursive Feature Elimination (RFE):
The previous step results in l different optimal weight vectors w i 1 , · · · , w i l , where i is the iteration count. Each weight vector will have a different number of nonzero components. Compute the average number of nonzero components, and round upwards to the exact test after the Benjamini-Hochberg multiple testing correction and quantifies the statistical significance of the overlap between the gene list and a set of genes in a particular pathway. (1170 KB XLSX)