To build a robust classifier, the number of training instances usually needs to exceed the number of features. In many real-life applications such as bioinformatics, natural language processing, and computer vision, a high number of features may be provided to the learning algorithm without any prior knowledge about which ones should be used. The number of features can therefore drastically exceed the number of training instances, and the model is prone to overfitting the training data. Many regularization methods have been developed to prevent overfitting and to improve the generalization error bound of the predictor in this learning situation.
Most notably, Lasso [1] is an ℓ₁-regularization technique for linear regression which has attracted much attention in machine learning and statistics. The same approach is useful in classification because any binary classification problem can be reduced to a regression problem by treating the class labels as real numbers and taking the sign of the model prediction as the predicted class. The features selected by the Lasso depend on the regularization parameter, and the set of solutions for all values of this free parameter is given by the regularization path [2]. Although efficient algorithms exist for recovering the whole regularization path of the Lasso [3], finding a subset of highly relevant features that leads to a robust predictor remains a prominent research question.
In this paper, we propose a novel algorithm for feature selection based on the Lasso. Our hypothesis is that defining a scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to a theoretical analysis of our feature scoring scheme, we provide empirical evaluations on a real-life lymphoma dataset as well as several UCI datasets, which confirm the superiority of our method in exploratory data analysis and prediction performance.
Background and previous work
Lasso is an ℓ₁-regularization technique for least-squares linear regression:
\mathcal{L} := \sum_{i=1}^{n} \frac{1}{2n} \left\| y_i - w^T \mathbf{x}_i \right\|_2^2 + \lambda \left\| w \right\|_1
(1)
where the response random variable Y ∈ ℝ depends on a d-dimensional covariate X ∈ ℝ^d, and the training data D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n} is sampled independently and identically from a fixed joint distribution P_{XY}. It is well known that the ℓ₁-regularization term shrinks many components of the solution to zero, and thus performs feature selection [4]. There have also been variants, such as the elastic net [5], designed to select highly correlated predictive features. The number of selected features in eqn (1) is controlled by the regularization parameter λ.
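To illustrate the sparsity induced by the ℓ₁ penalty, the following sketch fits the objective in eqn (1) when d far exceeds n. It uses Python with scikit-learn purely for illustration; the synthetic data and the value of the regularization parameter are our own assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 50, 200                       # far fewer instances than features
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3, -2, 1.5, 2, -1]     # only 5 features are truly relevant
y = X @ w_true + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of lambda in eqn (1)
selected = np.flatnonzero(model.coef_)
print("non-zero coefficients:", len(selected))
```

Even though d = 200 > n = 50, the fitted coefficient vector is sparse: most components are shrunk exactly to zero and only a handful of features survive.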
A common practice is to find the best value of λ by cross-validation so as to maximize prediction accuracy. Having found the best value of the regularization parameter, features are selected based on the non-zero components of the global and unique minimizer of the training objective in equation (1). However, recent research on the consistency of the Lasso [4, 6–10] shows that a fixed value of λ for all n does not yield a consistent estimate of the parameter vector [7]. The question, then, is what a theoretically well-founded value of λ as a function of n would be.
Various decaying schemes for the regularization parameter have been studied [4, 7, 8, 11], and it has been shown that under specific settings, Lasso selects the relevant features with probability one and each irrelevant feature with a positive probability less than one, provided that the number of training instances tends to infinity. To perform better feature selection, note that each run of the cross-validation yields a value of the regularization parameter λ and a corresponding set of selected features. If several samples were available from the underlying data distribution, irrelevant features could be removed by simply intersecting the sets of features selected for each sample. The idea in [7] is to provide such datasets by resampling with replacement from the given training dataset using the bootstrap method [12]. This approach leads to the Bolasso algorithm for feature selection, which is theoretically motivated by Proposition 1.
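A minimal sketch of this bootstrap-and-intersect idea follows. This is a simplified stand-in for Bolasso written in Python with scikit-learn; the synthetic data and the use of `LassoCV` to pick λ on each replicate are our own illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def bolasso(X, y, n_boot=16, seed=0):
    """Bolasso sketch: intersect the Lasso supports over bootstrap replicates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    support = None
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample with replacement
        coef = LassoCV(cv=3, max_iter=5000).fit(X[idx], y[idx]).coef_
        sel = set(np.flatnonzero(coef))              # features with non-zero weight
        support = sel if support is None else support & sel
    return sorted(support)

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 40))
y = X[:, :3] @ np.array([2.0, -3.0, 1.5]) + 0.1 * rng.normal(size=80)
selected_features = bolasso(X, y)
print(selected_features)   # noise features rarely survive the intersection
```

Features with strong signal tend to be selected in every replicate and thus survive the intersection, while spuriously selected noise features tend to be dropped.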
Proposition 1. [7] Suppose P_{XY} satisfies some mild assumptions and let \lambda = \mu_0 n^{-1/2} for a fixed constant μ₀ > 0. Let J denote the index set of the true relevant features, and let Ĵ denote the index set of the relevant features found by Bolasso. Then the probability that Bolasso does not select the correct model is upper-bounded by:
\Pr\left(\hat{J} \ne J\right) \le m A_1 e^{-A_2 n} + A_3 \frac{\log n}{n^{1/2}} + A_4 \frac{\log m}{m},
where m > 1 is the number of bootstrap samples and all A_i are positive constants.
Now, if we let m tend to infinity more slowly than e^{A_2 n}, then with probability tending to one Bolasso selects exactly the relevant features J. Proposition 1 guarantees the performance of Bolasso only asymptotically, i.e. as n → ∞. However, in real applications, where the number of training samples is often limited, the probability of selecting the relevant features can be significantly less than 1. One of the main goals of our proposed framework is to address this problem by scoring the features.
Previous studies have shown that there is room for improving Bolasso [7, 8]. For example, while it outperforms similar methods such as ridge regression, Lasso, and bagging of Lasso estimates [13] on synthetic data, Bolasso is sometimes too strict on real data because it requires the relevant features to be selected in all bootstrap runs. Bolasso-S, a soft version of Bolasso, performs better in practice because it relaxes this condition and selects a feature if it is chosen in at least a user-defined fraction of the bootstrap replicates (a threshold of 90% is considered to be enough). Bolasso-S is more flexible and thus more appropriate for practical models that are not extremely sparse [8].
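The soft variant can be sketched as a simple selection-frequency rule. The per-replicate supports below are toy values of our own making, used only to show the difference between the strict and soft rules:

```python
from collections import Counter

def bolasso_soft(supports, d, threshold=0.9):
    """Keep features selected in at least `threshold` of the bootstrap runs."""
    m = len(supports)
    counts = Counter(f for sel in supports for f in set(sel))
    return sorted(f for f in range(d) if counts[f] / m >= threshold)

# toy supports from 10 hypothetical bootstrap replicates
supports = [{0, 1, 2}] * 9 + [{0, 1, 7}]
print(bolasso_soft(supports, d=10))                 # -> [0, 1, 2]
print(bolasso_soft(supports, d=10, threshold=1.0))  # strict Bolasso: [0, 1]
```

Feature 2 is missed by a single replicate, so the strict intersection drops it, while the 90% rule keeps it.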
Our contributions
In this paper, we develop the FeaLect algorithm, which is softer than Bolasso in the following three directions:

- For each bootstrap sample, Bolasso considers only the one model that minimizes the training objective \mathcal{L} in eqn (1), whereas we include the information provided by the whole regularization path.
- Instead of making a binary inclusion/exclusion decision, we compute a score for each feature that helps the user select the more relevant ones.
- While Bolasso-S relies on a threshold, our theoretical study of the behaviour of irrelevant features leads to an analytical criterion for feature selection without any predefined parameter.
We compared the performance of the Bolasso, FeaLect, and Lars algorithms for feature selection on six real datasets in a systematic manner. The source code implementing our algorithm has been released as FeaLect, a documented R package on CRAN.
Feature scoring and mathematical analysis
In this section, we describe our novel algorithm, which scores the features based on their performance on samples obtained by bootstrapping. Afterwards, we present a mathematical analysis of our algorithm, which builds the theoretical basis for its automatic thresholding in feature selection.
The FeaLect algorithm
Our feature selection algorithm is outlined in Figure 1 and described in Algorithm 1. Let B be a random sample of size γn generated by choosing from the given training data D without replacement, where n = |D| and γ ∈ (0, 1) is a parameter that controls the size of the sample sets. Using a training set B, we apply the Lars algorithm to recover the whole regularization path efficiently [3]. Let F_k^B be the set of features selected by the Lasso when λ allows exactly k features to be selected. The number of selected features is decreasing in λ, and we have:
\varnothing = F_0^B \subset \cdots \subset F_k^B \subset F_{k+1}^B \subset \cdots \subset F_d^B = F.
For each feature f, we define a scoring scheme depending on whether or not it is selected in F_k^B:
S_k^B(f) := \begin{cases} \frac{1}{k} & \text{if } f \in F_k^B \\ 0 & \text{otherwise} \end{cases}
(2)
The above randomized procedure is repeated several times for various random subsets B to compute the average score of f when exactly k features are selected, i.e. \mathbb{E}_B\left[S_k^B(f)\right] is estimated empirically. According to our experiments, convergence to the expected score is fast, and there is no significant difference between the average scores computed from 100 or 1000 samples (Figure 2). The total score of each feature is then defined as the sum of the average scores:
S(f) := \sum_{k} \mathbb{E}_B\left[S_k^B(f)\right]
(3)
Algorithm 1 Feature Scoring
1: for t = 1 to m do
2:   Sample (without replacement) a random subset B ⊂ D of size γ|D|
3:   Run Lars on B to obtain F_1^B, ..., F_d^B
4:   Compute S_1^B, ..., S_d^B using eqn (2)
5:   for k ∈ {1, ..., d} do
6:     Update the score of every feature f: S(f) ← S(f) + S_k^B(f)/m
7:   end for
8: end for
9: Fit a 3-segment spline (g_1(·), g_2(·), g_3(·)) to the log-scale feature score curve (see the text for more information)
10: return the features corresponding to g_3 as informative features
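The scoring loop of Algorithm 1 can be sketched as follows. This Python version uses scikit-learn's `lars_path` in place of the paper's Lars routine and is a simplified illustration, not the released FeaLect R package; the demo data at the bottom are our own:

```python
import numpy as np
from sklearn.linear_model import lars_path

def fealect_scores(X, y, m=20, gamma=0.75, seed=0):
    """Estimate S(f) = sum_k E_B[S_k^B(f)] over m random subsamples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(d)
    for _ in range(m):
        idx = rng.choice(n, size=int(gamma * n), replace=False)  # B, without replacement
        _, _, coefs = lars_path(X[idx], y[idx], method="lasso")
        for step in range(1, coefs.shape[1]):
            active = np.flatnonzero(coefs[:, step])   # F_k^B at this point of the path
            k = len(active)
            if k > 0:
                scores[active] += 1.0 / (k * m)       # accumulate S_k^B(f) = 1/k
    return scores

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)
scores = fealect_scores(X, y)
print(np.argsort(scores)[-2:])   # the two truly relevant features score highest
```

Relevant features enter the path early, when k is small and the per-step reward 1/k is large, so they accumulate much higher total scores than noise features that appear only deep in the path.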
Before describing the rest of the algorithm, let us look at the feature scores for our lymphoma classification problem (the task and dataset are described in detail in the Experiments section). Figure 2 depicts the total scores of the features on a log scale, with features sorted by increasing total score. The feature score curve is almost linear in the middle and bends at both ends. We hypothesize that features with a very high score, in the top bending part of the curve, are good candidates for informative features. Furthermore, the linear middle part of the curve consists of features that are responsible for overfitting the model, and we therefore call them irrelevant features. In the next section, a formal definition will clarify this intuitive idea, and we show how this insight helps identify informative features.
The final step of our feature selection algorithm is to fit a 3-segment spline model to the feature score curve: the first, quadratic lower part captures the low-score features; the linear middle part captures the irrelevant features; and the last, quadratic upper part captures the high-score informative features. As discussed below, the linear middle part provides an analytic threshold on the score of relevant features: the features scoring above this threshold are reported as informative and can be used for training the final predictor and/or for exploratory data analysis.
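As a rough numerical stand-in for the spline step (our own simplification: a straight line fitted to the middle of the sorted log-score curve, rather than a full 3-segment spline), one could flag features that rise clearly above the extrapolated linear trend:

```python
import numpy as np

def select_above_linear_trend(scores, middle=(0.2, 0.8), tol=3.0):
    """Fit a line to the middle of the sorted log-score curve and return
    indices of features rising `tol` residual-SDs above that line."""
    order = np.argsort(scores)
    logs = np.log(scores[order])
    n = len(scores)
    lo, hi = int(middle[0] * n), int(middle[1] * n)
    x = np.arange(n)
    slope, intercept = np.polyfit(x[lo:hi], logs[lo:hi], 1)   # linear middle part
    resid_sd = np.std(logs[lo:hi] - (slope * x[lo:hi] + intercept))
    above = logs > slope * x + intercept + tol * resid_sd
    return sorted(order[above].tolist())

# toy curve: a noisy log-linear middle plus three clearly informative features
rng = np.random.default_rng(3)
scores = np.concatenate([0.01 * 1.1 ** np.arange(50) * rng.uniform(0.9, 1.1, 50),
                         [5.0, 8.0, 12.0]])
flagged = select_above_linear_trend(scores)
print(flagged)
```

The three artificially high-score features sit far above the log-linear trend and are flagged, while the features on the noisy linear part are not.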
The analysis
The aim of this analysis is to provide a mathematical explanation for the linearity of the middle part of the scoring function (Figure 2), and a justification for why the features corresponding to this part can be excluded. We first present a probabilistic interpretation of the feature scores, and then we provide a precise definition of an irrelevant feature. Our definition formalizes the idea that such a feature is selected by the Lasso if and only if a particular fixed finite subset U of instances is included in the random training set, whereas a relevant feature should be selected for almost any U. We estimate the probability that a random sample B ⊂ D contains U as n grows to infinity. Finally, we show that asymptotically, the log-score of an irrelevant feature is linear in |U|. This explains the linearity of the middle part of the feature score curve in Figure 2.
Proposition 2. Suppose Pr(f = f_i) is the probability that feature f_i is selected by the Lasso at some stage of our feature selection method in Algorithm 1. Then the probability distribution of the random variable f is given by:

\Pr(\mathbf{f} = f_i) = \frac{1}{d} S(f_i)
Proof. By conditional probability:
\begin{aligned}
\Pr(\mathbf{f} = f_i) &= \sum_{B} \Pr(\mathbf{f} = f_i \mid B)\,\Pr(B) \\
&= \sum_{B} \sum_{k=1}^{d} \Pr\left(\mathbf{f} = f_i \mid \mathbf{f} \in F_k^B\right) \Pr\left(\mathbf{f} \in F_k^B\right) \Pr(B) \\
&= \sum_{B} \sum_{k=1}^{d} S_k^B(f_i)\,\Pr\left(\mathbf{f} \in F_k^B\right) \Pr(B)
\end{aligned}
Since we have not imposed any prior assumption, we place a uniform distribution on \Pr\left(\mathbf{f} \in F_k^B\right) to get:
\begin{aligned}
\Pr(\mathbf{f} = f_i) &= \frac{1}{d} \sum_{B} \sum_{k} S_k^B(f_i)\,\Pr(B) \\
&= \frac{1}{d} \mathbb{E}_B\!\left[\sum_{k} S_k^B(f_i)\right] \\
&= \frac{1}{d} S(f_i).
\end{aligned}
□
The following definition formalizes the idea that irrelevant features depend only on a specific subset of the whole data set.
Definition 3. For any subset of samples U ⊆ D and any feature f_i, we say that f_i overfits on U if:
\forall k, \forall B: \quad f_i \in F_k^B \iff U \subseteq B
In words, f_i is selected in F_k^B if and only if B contains U. Next, we derive the probability that a specific set U is included in a randomly generated sample.
Lemma 4. For any U ⊆ D, we have:
\lim_{n \to \infty} \Pr_B\left(U \subseteq B\right) = \gamma^r
where r is the number of samples in U and γ is the fraction of samples chosen for a random set B.
Proof. Assuming B has γn members chosen without replacement, we have:
\begin{aligned}
\Pr_B(U \subseteq B) &= \binom{n-r}{\gamma n - r} \bigg/ \binom{n}{\gamma n} \\
&= \frac{(n-r)!\,(\gamma n)!}{n!\,(\gamma n - r)!} \\
&= \prod_{i=0}^{r-1} \frac{\gamma n - i}{n - i} \\
&= \gamma^r \prod_{i=0}^{r-1} \frac{n - i/\gamma}{n - i} \\
&= \gamma^r \prod_{i=0}^{r-1} \left(1 + \frac{i(1 - 1/\gamma)}{n - i}\right) \\
&= \gamma^r \left(1 + O(n^{-1})\right).
\end{aligned}
The first line of the above proof relies on the assumption that the members of the random set B are chosen without replacement, and the last step follows from the fact that γ is a fixed constant. □
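A quick numerical check of Lemma 4 (our own illustration, not part of the paper) confirms that the exact inclusion probability approaches γ^r from below as n grows:

```python
from math import comb

def prob_contains(n, gamma, r):
    """Exact Pr(U ⊆ B) for |U| = r, with B a size-γn subset of n items
    drawn without replacement: C(n-r, γn-r) / C(n, γn)."""
    b = int(gamma * n)
    return comb(n - r, b - r) / comb(n, b)

gamma, r = 0.75, 3
for n in (20, 200, 2000):
    print(n, round(prob_contains(n, gamma, r), 6))
# the printed values increase toward gamma**r = 0.421875
```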
The following theorem concludes our argument for the exponential behavior of the total score of irrelevant features. It relates the probability of selecting a feature f_i that overfits on U to the probability of including U in the sample.
Theorem 5. If a feature f_i overfits on a set of samples U of size r, then:
\lim_{n \to \infty} S(f_i) = d \gamma^r.
Proof. From proposition 2 we have:
\begin{aligned}
S(f_i) &= d\,\Pr(\mathbf{f} = f_i) \\
&= d \sum_{B} \Pr\left(f_i \in F^B\right) \Pr(B) \\
&= d\,\Pr_B(U \subseteq B) \\
&= d\left(\gamma^r + O(n^{-1})\right).
\end{aligned}
The last equality was proved in Lemma 4, and the one before it follows from Definition 3. □
Although we presented the above arguments for the Lasso, they should also apply to any other feature selection algorithm whose feature score curve exhibits a linear part. That is, the features corresponding to the linear part of the scoring curve are the irrelevant features for that algorithm, and therefore the features on the non-linear upper part should be considered informative. Obviously, the features on the non-linear lower part are not interesting for any prediction task because their scores are even lower than those of the irrelevant features. We speculate that these features do not exhibit linear behavior because not only are they irrelevant to the outcome, but they are also not associated with any particular set U, meaning they are not even included in an overfitted model. A follow-up study may investigate this hypothesis further.