Skip to main content

A clustering-based approach for efficient identification of microRNA combinatorial biomarkers



MicroRNAs (miRNAs) have great potential serving as tumor biomarkers and therapeutic targets. As the rapid development of high-throughput experimental technology, gene expression experiments have become more and more specialized and diversified. The complex data structure has brought great challenge for the identification of biomarkers. In the meantime, current statistical and machine learning methods for detecting biomarkers have the problem of low reliability and biased criteria.


This study aims to select combinatorial miRNA biomarkers, which have higher sensitivity and specificity than single-gene biomarkers. In order to avoid exhaustive search and redundant information, miRNAs are firstly clustered, then the combinations of representative cluster members are assessed as potential biomarkers. Both the criteria for the partition of clusters and selection of representative members are based on Fisher linear discriminant analysis (FDA). The FDA-based criterion has been demonstrated to be superior to three other criteria in selecting representative members, and also good at refining clusters. In the comparison with eight common feature selection methods, this clustering-based method performs the best with regard to the discriminative ability of selected biomarkers.


Our experimental results demonstrate that the clustering-based method can identify microRNA combinatorial biomarkers with high accuracy and efficiency. Our method and data are available to the public upon request.


MicroRNAs (miRNAs) play important regulatory roles in many fundamental biological processes for disease development and progression. Especially, tremendous researches have demonstrated that miRNAs can serve as oncogene or tumor suppressor in various cancer types [1, 2]. During the last decade, benefitting from the development of miRNA microarray and small RNA-Seq techniques, miRNA expression data has been widely used in the comparison of diseased samples with control samples, or different subtypes of diseased samples. The miRNAs with most discriminant capacity, regarded as biomarkers, have assisted in diagnosis, prognosis prediction and therapeutic assessment of cancers [3, 4], and sometimes they are even more accurate than coding-gene markers [5, 6].

In order to search biomarkers, the analysis of differential gene expression is performed and genes are ranked according to certain criteria. The evaluation on the quality of biomarkers is mainly based on statistical or machine learning approaches, whose corresponding measurements are statistical significance and classification accuracy, respectively.

Till now, a variety of statistical methods have been applied into the gene expression analysis. Fold change has been used as an initial metric for measuring the significance of change in expression levels between two groups of samples [7], and t-test [8] is a widely-used statistical method to select differentially expressed genes. Besides, researchers have developed many alternatives of t-test, such as ANOVA [9], Wilcoxon test [10], SAM [7], RVM [11], LIMMA [12], VarMixt [13] and SMVar [14]. Most of the present criteria are for univariate analysis. As the rapid development of high-throughput experimental technology, gene expression experiments have become more and more specialized and diversified. Especially, tissue-specific and condition-specific researches have largely been emerged. The single-gene biomarkers are often unreliable or have insufficient ability to distinguish subtypes or different conditions for complex diseases.

In order to increase the sensitivity and specificity of biomarkers, in many studies, the top ranked genes according to some selection metric were put together and used as composite biomarkers. This method is not guaranteed to find optimal biomarkers, since there may be redundant information among the genes because of correlation. And, many genes individually do not show good discriminative ability between different groups, while perform well together with other genes. Therefore, multivariate analysis is necessary to examine the performance of multiple genes as a whole. A common method for multivariate statistical analysis is Hotelling’s t-square test [15]. Note that in gene expression analysis, the number of samples is often very limited. As the dimensionality increases, the statistical inference often fails to provide reliable results.

Feature selection is a major branch of methods for screening biomarkers [16]. From a machine learning point of view, biomarkers correspond to the features with most discerning power. A multivariate feature selection method scores feature subsets and rank them, usually by their classification accuracy. For example, in order to select gene combinations, Cui et al. [17] and Xu et al. [18] used support vector machines to separate cancer and normal tissues, and assessed classification accuracy for all the k-gene combinations, for k≤4 and k≤8, respectively. These multivariate analysis methods can avoid feature redundancy but may run into exponential complexity due to the huge search space. Another issue is about the interpretation of computational results. Too complex classifier (often regarded as a black-box) and too many variables/features in the composite biomarkers could be useless, because the results are extremely difficult for biological explanation and validation.

MiRNA expression analysis usually follow the same procedure and approaches as mRNA expression analysis, such as hypothesis test [19], clustering [20] and classification [21] based on machine learning models. Meanwhile, the above mentioned problems also exist in miRNA data. Besides, due to the low intensity on expression level and small difference between miRNA sequences, the selection of miRNA biomarkers becomes more challenging. In this study, instead of screening single miRNAs or large miRNA sets, we aim to find the combinatorial biomarkers, i.e., k-miRNA combinations, where k is a small number. To avoid exponential number of combinations, we propose a clustering-based method to reduce the number of candidate combinations and conduct a highly efficient search. The basic idea is to assess only the combinations consisting of representative members from clusters that are generated based on expression level similarity, rather than all combinations. In order to further reduce the search space, a proper criterion is needed to rank the miRNAs in the clusters, and only the most promising ones can be selected as the representatives of their clusters to form the candidate biomarkers.

Clustering approaches have been extensively used to find co-expressed genes. Genes in the same clusters are usually functionally related. There have been some studies that adopted clustering-based methods for feature selection. For example, Jaeger et al. proposed to use a fuzzy C-means clustering method to pre-filter genes before ranking genes individually [22]. That is, only one representative gene is selected from each cluster and involved in the ranking procedure. A similar approach was proposed by Hanczar et al. [23], who used k-means clustering to select ‘prototype genes’. In both of these two methods, the number of clusters needs to be pre-defined. Actually, it is an important issue to find the proper number of clusters. In order to address this issue, Wang et al. developed a novel hybrid approach [24]. They applied hierarchical clustering on these genes to generate a dendrogram, then the optimal number of clusters was determined by a leave-one-out cross-validation (LOOCV) strategy by trying all of the different clusterings by breaking up the dendrogram.

In all of these methods, there is no defined criterion on how to determine the number of clusters or the proper size of clusters, though Wang et al. conducted an empirical analysis of LOOCV [24]. Moreover, these methods typically used genes which are the closest to centers of their clusters as the representative genes, while whether the center gene is the best choice is questionable. In another similar research proposed by Sahu et al. [25], k-means clustering was adopted, while signal-to-noise ratio (SNR score) was used to rank genes in every clusters.

Our approach has two major differences from the previous approaches: i) there is a new criterion to select the most discriminant member in each cluster, ii) there is a defined criterion to determine whether a cluster should be split. And the goal of this study is slightly different from the aforementioned literatures in that we aim to develop efficient method for identifying miRNA combinatorial biomarkers, instead of large feature subsets which are hard to be interpreted in biology. We have conducted a series of experiments to compare different criteria for selecting representative genes from clusters and splitting raw clusters. We also compared the new method with some widely used feature selections methods. The experimental results demonstrate that our proposed method is very effective in screening genes in the clusters. The resulting clusters can greatly reduce the number of combinations to be assessed, and obtain high-quality combinations in the mean time. The selected miRNA combinations have not only high discriminative ability, but also enriched pathways closely related with tumorigenesis. Moreover, many frequently present miRNAs in these combinations have been validated to be associated with breast cancer development in previous literatures.


The proposed method consists of three major steps. The first step is a pre-screening to remove uninformative miRNAs using Welch’s t-test. The second step is a hierarchical clustering on the remaining genes. In the last step, representative miRNAs are selected from every clusters to form miRNA combinations as candidate biomarkers. Both the criteria for assessing the qualities of clustering and selecting representative miRNAs within clusters are defined through a linear discriminant method. The flowchart of the method is shown in Fig. 1.

Fig. 1

Flowchart of the feature extraction method

Fisher linear discriminant analysis

Fisher linear discriminant analysis (FDA) [26] seeks a best linear combinations of features to achieve maximum separation on the projected feature space, by optimizing the object function which is a ratio of inter-class difference to intra-class difference of data. Since FDA projects original features onto one-dimensional features, it is used not only for classification but also for dimensionality reduction. Different from principal component analysis (PCA), FDA works in a supervised manner, thus the projected features are more discriminative with respect to the classification task. The algorithm of FDA is briefly described in the below.

For a binary classification problem, suppose X is the training set which has n samples with p dimensions, i.e., X={x 1,x 2,x 3,,x n }, where x i s (1≤in) are p-dimensional sample vectors belonging to class c 0 or c 1. Let m 0 and m 1 be the mean vectors of samples in these two classes, respectively, and w be the optimal projection direction. According to FDA’s object function, w is obtained by Eq. (1),

$$ \mathbf{w} \propto S_{w}^{-1}\left(\mathbf{m}_{0} - \mathbf{m}_{1}\right), $$

where S w is the sum of variance within each class, i.e.,

$$ {}S_{w}\,=\,\! \sum_{\mathbf{x}_{i} \in c_{0}}\left(\mathbf{x}_{i} - \mathbf{m}_{0} \right)\left(\mathbf{x}_{i} - \mathbf{m}_{0} \right)^{\mathrm{T}} + \sum_{\mathbf{x}_{i} \in c_{1}}\left(\mathbf{x}_{i} - \mathbf{m}_{1} \right)\left(\mathbf{x}_{i} - \mathbf{m}_{1} \right)^{\mathrm{T}} $$

Given this optimal direction, all x i s are projected onto w to get the new one-dimensional sample sets Y={y 1,y 2,,y n }, where

$$\begin{array}{*{20}l} {y_{i}} = {\mathbf{w}}^{T} \mathbf{x}_{i} \text{ for} \ i = 1, 2, \cdots, n \end{array} $$

As for classification, the definition of threshold (class boundary) has multiple choices. Normally, the threshold, y 0, can be computed as Eq. (4),

$$ y_{0} = \frac{n_{0}{m_{0}} + n_{1}{m_{1}}}{n_{0} + n_{1}}, $$

where m 0 and m 1 are means for the two classes in the projected data space, i.e.,

$$\begin{array}{*{20}l} m_{0}=\mathbf{w}^{T}\mathbf{m}_{0} \end{array} $$
$$\begin{array}{*{20}l} m_{1}=\mathbf{w}^{T}\mathbf{m}_{1} \end{array} $$

In the test phase, a test sample x is firstly projected onto w, then the resulted value y is compared against y 0. If y is larger than or equal to y 0, it will be assigned to Class c 0. Otherwise, it is regarded as belonging to class c 1.

The criteria for selecting representative members

In order to avoid feature redundancy, a representative member is selected from each cluster of miRNAs. Although many methods directly choose the mean or center member, it is necessary to define some criterion to rank the members by their contribution or potential in the separation of groups of samples. As described in “Fisher linear discriminant analysis” section, FDA aims to find the projection direction, w, with maximum discriminative capacity. And, w can be regarded as a vector of weights, indicating the importance of features. Intuitively, those features with the largest weights are the most informative for classification. In other words, the magnitude of each component of w implies the relevance of the corresponding miRNA to classification.

Let I be the index set of all miRNAs, i.e. I={1,2,…,p}, and I c be the index set of the miRNAs in the cluster c. The index of the representative member of c, i c , satisfies Eq. (7):

$$ |\mathbf{w}(i_{c})| = \max_{j\in {I}_{c}}(|\mathbf{w}(j)|), $$

where w(·) is a component of w.

The criteria for splitting clusters

Besides selection of miRNAs, the determination of number or size of clusters also has a great impact on the performance of search algorithms. Too many clusters are more likely to find high quality combinatorial biomarkers but can result into huge computational complexity. The extreme case is the trivial clustering that each single miRNA is a cluster. On the contrary, too few clusters would miss valuable combinations since only a few representative miRNAs are considered. Thus here is a tradeoff between accuracy and efficiency. Instead of explicitly specify the number of clusters, we seek proper criteria for determining whether a given cluster needs to be split into smaller clusters.

Here, we define the criterion mainly based on the loss of information caused by projection. Intuitively, if the cluster members are diversified, it would be very hard to find a unified direction for projection, so the data samples would suffer great information loss after projection, which indicates that the cluster needs to be split. Thus, we define a measure called mean squared loss (MSL), to estimate average information loss in a cluster. Equation (8) formulates this measure.

Let h be the hyperplane that passes the mean point of the data samples and has normal direction of w (FDA projection direction), then MSL is defined as:

$$ MSL = \frac{\sum_{i=1}^{n} |\mathbf{P}_{h}(\mathbf{x}_{i}-\mathbf{m})|^{2}}{n}, $$

where P h (·) denotes the projection of a vector onto h, m is the mean vector of samples, x i is the ith data sample. Since h is perpendicular to w, we regard the projection of the difference between x i and m on h as an approximative loss caused by FDA projection.

Furthermore, considering that the samples may differ in data magnitudes, we define another criterion called mean loss rate (MLR) as shown in Eq. (9),

$$ MLR = \frac{\sum_{i=1}^{n} \frac{|\mathbf{P}_{h}(\mathbf{x_{i}}-\mathbf{m})|}{|\mathbf{x}_{i}|}}{n}, $$

where MLR denotes the averaged loss rate, i.e. the ratio of the loss (in the projection) to the norm of sample.

The whole pipeline is described in Algorithm ??, in which the MLR is used as the selection criterion.

Evaluation criteria

The performance of different criteria are evaluated using two measures for the resulted combinations which are ranked top 10, 100 and 1000, respectively. One is average rank, denoted by AvgRank, and the other is the proportion of the true top combinations identified by the method, denoted by HitRatio.

These two measures are defined in Eqs. (10) and (11), respectively. For top n k-miRNA combinations searched by the method,

$$ {AvgRank}_{n} = \frac{\sum_{1\leq i \leq n} {rank}_{i}}{n}, $$

where r a n k i is the true rank of the ith best combination identified among all k-miRNA combinations (In contrast of the rank obtained by the proposed heuristic search, we call the original rank of the miRNA combination by using the exhaustive search as “true rank”). All these ranks are determined according to classification accuracy.

$$ {HitRatio}_{n} = \frac{{hit}_{n}}{n}, $$

where h i t n is the number of hits in the n best combinations searched by the method. A hit means the searched result is truly among the top-n combinations.

Apparently, small AvgRank and high HitRatio of the search results indicate good performance of the algorithm for identifying high-quality biomarker candidates.

In addition, to evaluate the classification performance of the selected miRNA combinations, we used three accuracy measures, namely sensitivity, specificity and total accuracy (TA).


Data sets

In this study, we used two public miRNA data sets from NCBI GEO [27], GSE22220 [28] and GSE40525 [29], which were measured by Illumina Human v1 miRNA panel and Agilent-019118 Human miRNA microarray platform, respectively. Both of these two studies aim to explore function of microRNAs in breast tumorigenesis, and reveal potential therapeutic targets. There are a total of 120 samples collected from 64 breast cancer patients, including 56 pairs of matched tumor and adjacent peri-tumoral breast tissues, and 8 unmatched tissues in GSE 40525. And in GSE22220, there are 210 samples from 219 breast cancer patients, including 84 estrogen receptor (ER)-negative tissues, and 135 ER-positive tissues. The detailed statistics of patient characteristics are shown in Table 1.

Table 1 Sample statistics

In order to ensure the data quality, we removed the miRNAs whose expression levels were not detected or below the threshold value in more than 30% of the samples. GSE40525 was classified into two categories according to tumor and peri-tumor status, while GSE22220 was divided into two categories according to ER status. Finally, the GSE40525 data set contains 52 pairs of tumor and peri-tumor profiles and the GSE22220 data set contains 127 samples of ER-positive and 80 of ER-negative.

Experimental settings

As a pre-screening step, Welch’s t-test was conducted on the two data sets. MiRNAs with pvalue greater than 0.05 were filtered out, and the remaining miRNAs were clustered by a hierarchical clustering with average-link method. Next, the hierarchical tree was cut into raw clusters. In order to find natural cluster divisions in the hierarchical tree, we computed inconsistency coefficient for each link in the tree [30]. This value compares the height of a link in a hierarchical tree with the average height of links below it. Inconsistent links indicate the border of naturally divided clusters. The inconsistency coefficients range from 0 to 1.15 for both the two data sets. Thus we specified an inconsistency coefficient threshold of 1 to partition the two hierarchical trees into raw clusters, resulting in 82 and 63 clusters, respectively.

Further, FDA-based criteria were used to determine whether or not those clusters should be split into smaller clusters. After the final clusters were determined, a representative member was selected from each cluster to form miRNA combinations. The comparison on several criteria for selecting representative members within clusters and for splitting clusters are given in the following two sections.

In the final step, each combination was assessed by classification accuracy. We evaluated the classification accuracies of all k-combinations (k≤4) comprised by the selected representative miRNAs using LIBSVM [31] with default parameters via 5-fold cross validation. The AvgRank and HitRatio were calculated based on the true ranking lists obtained by exhaustive searches.

Comparison of criteria for selecting representatives in clusters

In previous researches, the center gene, i.e. the gene closest to the cluster center, was usually selected as the representative member of a cluster [2224]. Also, some researchers proposed specific ranking criteria, such as the signal-to-noise ratio (SNR) proposed by Sahu et al. [25]. Here, we compared the FDA measure with three other methods based on center-gene, SNR, and pvalue of t-test, respectively. All the k-miRNA combinations (2≤k≤4, i.e. pair, triple and quadruple) resulted from these selection criteria were assessed.

In order to evaluate the quality of the search results, we examined top 10, 100 and 1000 best combinations identified by these four methods and recorded their AvgRanks and HitRatios obtained on GSE22220 and GSE40525 in Tables 2 and 3, respectively.

Table 2 Comparison of four selection criteria on GSE22220
Table 3 Comparison of four selection criteria on GSE40525

The results show that a proper selection criterion is crucial for searching high-quality miRNA combinations. Specifically, FDA and t-test based criteria have significant advantage over other two methods, and SNR is slightly better than center-gene. For instance, on GSE22220, FDA and t-test successfully identified the best pair and triple miRNAs, and the second-best quadruple, whose accuracy is only 0.4% lower than the best one. FDA and t-test have much smaller AvgRanks than SNR and center-gene, no matter what the k is and how long the top list considered. Moreover, FDA hits 80% of the top 10 pairs. Both FDA and t-test catch majority of the top-ranked miRNA pairs and triples. As k increases to 4, the hit ratio decreases greatly, which is mainly due to the exponentially expanded search space. AvgRank and HitRatio values of the top 100 lists obtained by the four methods on GSE22220 data set are depicted in Figs. 2 and 3, respectively.

Fig. 2

AvgRank of top 100 lists obtained by the four methods for GSE22220

Fig. 3

HitRatio of top 100 lists obtained by the four methods for GSE22220

Generally, these methods have consistent performance on the two data sets. For GSE40525, the accuracies of combinatorial miRNAs are very high. Even a pair of miRNAs can achieve the accuracy as high as 92.3%, and the highest accuracy of quadruples is 95.2%, which suggests that the k-miRNA combinations (k≤4) are sufficient for separating the samples from two classes. The goal of GSE40525 is to discriminate tumor and peri-tumor samples, thus the differential expressed signal may be widespread. If too many combinations can achieve high accuracies, the real biomarkers may become not that notable. Thus, the results of average rank and hit ratio seem to be worse than those of GSE22220.

Comparison of criteria for splitting clusters

In this study, we propose two criteria for determining whether a given cluster should be split, i.e., mean squared loss (MSL) and mean loss rate (MLR). Considering that different clusters contain different numbers of miRNAs, instead of using the original MSL, we divide the squared loss by m (number of miRNAs in the cluster), and use \(MSL' = \frac {MSL}{m}\) in the analysis. The M S L s for all raw clusters in GSE40525 sorted in ascending order are shown in Fig. 4. It can be observed that a dramatic change occurs a little above 0.6 on the curve. Thus, we set the threshold as 0.65, where the steepest ascent locates. And we found that in GSE22220 the value is very close.

Fig. 4

MSL curve of the initial clusters of GSE40525

Obviously, MLR would grow rapidly as the number of miRNAs in the clusters increases. Here we set the threshold as \(1-\frac {1}{m^{2}}\), which is a relatively loose criterion. MLR works as a supplement to MSL. In our experiment, if either of these two criteria is not satisfied (i.e., MLR/MSL is greater than its threshold), the cluster should be partitioned.

We compared the refined clustering (RC) by using these two criteria and the conventional hierarchical clustering (HC) without further splitting. The results are shown in Table 4. Considering that the results obtained after refinement are generally better than those from raw clusters because more clusters make larger search space of miRNA combinations, we did not use results of the raw clusters in the RC experiment. Instead, we tried different inconsistency coefficients for HC, which produced close number of clusters as RC did, and selected the best results, while in RC the inconsistency coefficient and thresholds of MSL and MLR are fixed as mentioned above.

Table 4 Comparison of clustering methods on two data sets

Generally, RC has a comparable or better performance to the best HC. For the top 10 list, HC shows some advantage, while for top 100 and 1000, RC performs better. These results suggest that MSL and MLR can help to improve the clustering quality, and save effort on searching good threshold to yield clusters in the hierarchical tree. Basically, both MSL and MLR measure the part of information that cannot be expressed by the projected features, i.e. information loss during the projection. Different from the absolute loss represented by MSL, MLR measures the relative loss and plays a part in screening low-quality clusters when the variances of miRNAs differ greatly.

Comparison with existing feature selection methods

We further compared the new method with some widely used feature selection methods, including the Correlation-based Feature Selection (CFS) [32], best-first search (BFS), consistency-based selection [33], Chi-square score [34], information gain (IG) [35], Random forest (RF) filter [36], t-test [37] and the Wilcoxon rank-sum test [38].

Among these methods, CFS, BFS and consistency-based methods determine the number of selected features automatically. For other methods, we chose the subsets consisting of top 2, 3 and 4 features in the assessment. The R package FSelector [39] was adopted to implement these eight methods in the comparison experiments.

Most feature selection methods shown in Table 5 are filtering methods, except that BFS is a wrapper method, and the proposed method can be regarded as a hybrid method, which conducts filtering within clusters and then acts as a wrapper method using SVMs. The best miRNA combinations identified by the new method achieve the highest accuracies on both data sets, increasing the total accuracies by about 0.5% on GSE22220 and 1.9% on GSE40525 compared with the best accuracies obtained by other methods. This result again demonstrates the validity of clustering-based screening and the FDA criteria. Given the representative members selected from clusters, the search space is greatly reduced and the best combinations can be efficiently searched. Hence, the new method achieves a good balance between efficiency and accuracy.

Table 5 Comparison of feature selection methods on two data setsa

Functional analysis on the selected miRNAs

In order to perform functional enrichment on the miRNA combinatorial biomarkers, we analyzed the enriched pathways of their target genes by using mirPath [40]. For GSE40525 data set, triples of miRNAs have the best discriminant capacity, and the top 5 significant pathways for the best triple are: Fatty acid biosynthesis, PI3K-Akt signaling pathway, Prostate cancer, TGF-beta signaling pathway and p53 signaling pathway, all of which have pvalues below 5×10−7. For the GSE22220 data set, the enriched pathways include PI3K-Akt signaling pathway, NF-kappa B signaling pathway, focal adhesion, etc. Interestingly, PI3K-Akt signaling pathway is significantly enriched in both data sets. This pathway acts as regulator of cell proliferation, differentiation, apoptosis, and plays important roles in tumorigenesis.

In addition, we found that the top-ranked combinations usually have overlapped members. For example, all the top 10 pairs and triples of GSE40525 contains hsa-miR-139-5p. And, best quadruples often contain best pairs and triples. Therefore, we recorded the most frequent miRNAs in pairs and triples respectively and got their intersection set (Table 6). There are 8 miRNAs and 7 miRNAs for GSE22220 and GSE40525, respectively. Furthermore, these miRNAs were searched against two miRNA-disease relationship databases, namely HMDD v2.0 [41] and miR2Disease [42]. Among the 15 most frequent miRNAs, 9 miRNAs were reported in previous literatures as being involved in the development of breast cancer (Table 7). It is worth noting that 4 of the miRNAs are not covered in the top 10 list evaluated by statistical significance of the conventional t-test ranking method, but all of them have supporting literatures. Specifically, hsa-miR-365 ranks 11, hsa-miR-340 ranks 33, hsa-miR-100 ranks 34, and hsa-miR-141 ranks 56. Both miR-340 and miR-100 have been demonstrated as inhibitors of tumorigenesis with biological-experimental evidence.

Table 6 Most frequent miRNAs in pairs and triplesa
Table 7 Most frequent miRNAs in pairs and triples


In this paper, we propose to identify miRNA combinatorial biomarkers due to the important role that miRNAs play in the development of cancer and also some good properties of combinatorial biomarkers. The reasons for searching biomarkers of miRNA combinations are manifold. Firstly, single-gene biomarkers identified by uni-variate analysis are often unreliable with low specificity for discriminating complex disease properties. Thus, multi-gene biomarkers are in great need. However, the biomarkers containing too many genes, resulted from feature subset selection, are extremely difficult to be interpreted in biomedicine. For instance, if we have identified a k-tuple combinatorial biomarker, and we want to validate the overexpress/unexpress rule as well as inter-correlation in this biomarker, the over/under express pattern has a total of 2 k possibilities. Moreover, correlation coefficient can only be computed between two genes, and now there have been some studies on the conditionally independent properties in triples (3-gene combinations). But there have been no effective means to measure or validate the interconnection among multiple genes yet. Moreover, according to our results, combinations with small k have sufficient capability to separate groups of samples. We have also examined the accuracy of using all representative members selected from every clusters (Table 8), which are much lower than the best k-miRNA combinations (k≤4), decreasing by about 4% on GSE22220 and 9% on GSE40525. This result further demonstrates the usefulness of small combinatorial biomarkers.

Table 8 Accuracies of different feature subsetsa


MiRNA expression files have been widely used in the identification of biomarkers for complex diseases. Due to the low specificity of single-gene biomarker and difficulty in interpretating large gene set, this study aims to develop efficient algorithm for searching miRNA combinatorial biomarkers with high discriminability. We propose a clustering-based method to avoid brute force search, and define two types of criteria for refining clusters and selecting representative members. The former criterion aims to measure the loss during the feature projection by Fisher linear discriminant analysis, and determine whether or not to partition the given clusters. The latter criterion aims to select the most informative miRNAs in the clusters according to their contribution for classification in FDA model. We conducted experiments on two breast cancer miRNA expression profiles. The FDA-based selection method performs the best with regard to average rank of the top searched results and hit ratio on the true top list. The FDA-based cluster splitting rule has also been demonstrated to be effective in refining the clustering results. For the two data sets, k-miRNA combinations with k≤4 have sufficient capacity to discriminate the samples (83% for GSE22220 and 95% for GSE40525). This method can also be applied to the search of combinations with larger k, and mRNA expression data. The top-ranked miRNA combinations are worth further study on their functions as well as interactions of the miRNAs. As an additional computational analysis, the most frequent miRNAs occurring in top 10 pairs and triples have been searched again miRNA-disease database. Among the 15 most frequent miRNAs, 9 miRNAs have supporting literatures of their roles in the development of breast cancer.


  1. 1

    Chen C, et al. Micrornas as oncogenes and tumor suppressors. N Engl J Med. 2005; 353(17):1768.

    CAS  Article  PubMed  Google Scholar 

  2. 2

    Zhang B, Pan X, Cobb GP, Anderson TA. micrornas as oncogenes and tumor suppressors. Dev Biol. 2007; 302(1):1–12.

    CAS  Article  PubMed  Google Scholar 

  3. 3

    Calin GA, Croce CM. Microrna signatures in human cancers. Nat Rev Cancer. 2006; 6(11):857–66.

    CAS  Article  PubMed  Google Scholar 

  4. 4

    Bertoli G, Cava C, Castiglioni I. Micrornas: new biomarkers for diagnosis, prognosis, therapy prediction and therapeutic tools for breast cancer. Theranostics. 2015; 5(10):1122.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  5. 5

    He L, Thomson JM, Hemann MT, Hernando-Monge E, Mu D, Goodson S, Powers S, Cordon-Cardo C, Lowe SW, Hannon GJ, et al. A microrna polycistron as a potential human oncogene. Nature. 2005; 435(7043):828–33.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  6. 6

    Ghelani HS, Rachchh MA, Gokani RH, et al. Micrornas as newer therapeutic targets: A big hope from a tiny player. J Pharmacol Pharmacother. 2012; 3(3):217.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  7. 7

    Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001; 98(9):5116–121.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  8. 8

    Mankiewicz R. The Story of Mathematics. Princeton: Princeton University Press; 2000.

    Google Scholar 

  9. 9

    Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol. 2000; 7(6):819–37.

    CAS  Article  PubMed  Google Scholar 

  10. 10

    Kruskal WH. Historical notes on the wilcoxon unpaired two-sample test. J Am Stat Assoc. 1957; 52(279):356–60.

    Article  Google Scholar 

  11. 11

    Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics. 2003; 19(18):2448–455.

    CAS  Article  PubMed  Google Scholar 

  12. 12

    Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3(1):3.

    Google Scholar 

  13. 13

    Delmar P, Robin S, Daudin JJ. Varmixt: efficient variance modelling for the differential analysis of replicated gene expression data. Bioinformatics. 2005; 21(4):502–8.

    CAS  Article  PubMed  Google Scholar 

  14. 14

    JaffrÉzic F, Marot G, Degrelle S, Hue I, Foulley J-l. A structural mixed model for variances in differential gene expression studies. Genet Res. 2007; 89(01):19–25.

    Article  PubMed  Google Scholar 

  15. 15

    Hotelling H. A Generalized T Test and Measure of Multivariate Dispersion. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press: 1951. p. 23–41.

  16. 16

    Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19):2507–517.

    CAS  Article  PubMed  Google Scholar 

  17. 17

    Cui J, Chen Y, Chou WC, Sun L, Chen L, Suo J, Ni Z, Zhang M, Kong X, Hoffman LL, et al. An integrated transcriptomic and computational analysis for biomarker identification in gastric cancer. Nucleic Acids Res. 2011; 39(4):1197–207.

    CAS  Article  PubMed  Google Scholar 

  18. 18

    Xu K, Cui J, Olman V, Yang Q, Puett D, Xu Y. A comparative analysis of gene-expression data of multiple cancer types. PloS ONE. 2010; 5(10):13696.

    Article  Google Scholar 

  19. 19

    Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, et al. Microrna expression profiles classify human cancers. Nature. 2005; 435(7043):834–8.

    CAS  Article  PubMed  Google Scholar 

  20. 20

    Miska EA, Alvarez-Saavedra E, Townsend M, Yoshii A, Šestan N, Rakic P, Constantine-Paton M, Horvitz HR. Microarray analysis of microrna expression in the developing mammalian brain. Genome Biol. 2004; 5(9):68.

    Article  Google Scholar 

  21. 21

    Erbes T, Hirschfeld M, Rücker G, Jaeger M, Boas J, Iborra S, Mayer S, Gitsch G, Stickeler E. Feasibility of urinary microrna detection in breast cancer patients and its potential as an innovative non-invasive biomarker. BMC Cancer. 2015; 15(1):1.

    CAS  Article  Google Scholar 

  22. 22

    Jäger J, Sengupta R, Ruzzo WL. Improved gene selection for classification of microarrays. Lihue: The Proceedings of the Eighth Pacific Symposium on Biocomputing; 2002, pp. 53–64.

    Google Scholar 

  23. 23

    Hanczar B, Courtine M, Benis A, Hennegar C, Clément K, Zucker JD. Improving classification of microarray data using prototype-based feature selection. ACM SIGKDD Explor Newsl. 2003; 5(2):23–30.

    Article  Google Scholar 

  24. 24

    Wang Y, Makedon FS, Ford JC, Pearlman J. Hykgene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics. 2005; 21(8):1530–7.

    CAS  Article  PubMed  Google Scholar 

  25. 25

    Sahu B, Mishra D. A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Eng. 2012; 38:27–31.

    Article  Google Scholar 

  26. 26

    McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition: Wiley Interscience; 2004.

  27. 27

    Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. Ncbi geo: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Res. 2007; 35(suppl 1):760–5.

    Article  Google Scholar 

  28. 28

    Buffa FM, Camps C, Winchester L, Snell CE, Gee HE, Sheldon H, Taylor M, Harris AL, Ragoussis J. microrna-associated progression pathways and potential therapeutic targets identified by integrated mrna and microrna expression profiling in breast cancer. Cancer Res. 2011; 71(17):5635–645.

    CAS  Article  PubMed  Google Scholar 

  29. 29

    Biagioni F, Ben-Moshe NB, Fontemaggi G, Canu V, Mori F, Antoniani B, Di Benedetto A, Santoro R, Germoni S, De Angelis F, et al. mir-10b*, a master inhibitor of the cell cycle, is down-regulated in human breast tumours. EMBO Mol Med. 2012; 4(11):1214–29.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  30. 30

    Kaufman L, Rousseeuw PJ, Vol. 344. Finding Groups in Data: an Introduction to Cluster Analysis. Hoboken: Wiley; 2009.

    Google Scholar 

  31. 31

    Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2:27–12727. Software available at

  32. 32

    Hall MA, Smith LA. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In: Proceedings of the 12th International Florida Artificial Intelligence Research Society Conference. Orlando: AAAI Press: 1999. p. 235–9.

    Google Scholar 

  33. 33

    Dash M, Liu H. Consistency-based search in feature selection. Artif Intell. 2003; 151(1):155–76.

    Article  Google Scholar 

  34. 34

    Liu H, Setiono R. Chi2: Feature selection and discretization of numeric attributes. In: Proceedings of the 24th International Conference on Tools with Artificial Intelligence. Herndon: 1995. p. 388–91.

  35. 35

    Lee C, Lee GG. Information gain and divergence-based feature selection for machine learning-based text categorization. Inf Process Manag. 2006; 42(1):155–65.

    Article  Google Scholar 

  36. 36

    Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinforma. 2006; 7(1):1.

    Article  Google Scholar 

  37. 37

    Jafari P, Azuaje F. An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Med Inform Decis Mak. 2006; 6(1):1.

    Article  Google Scholar 

  38. 38

    Thomas JG, Olson JM, Tapscott SJ, Zhao LP. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res. 2001; 11(7):1227–36.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  39. 39

    Romanski P. Fselector: Selecting attributes. 2009. R package version 0.18, URL Accessed 30 June 2016.

  40. 40

    Vlachos IS, Kostoulas N, Vergoulis T, Georgakilas G, Reczko M, Maragkakis M, Paraskevopoulou MD, Prionidis K, Dalamagas T, Hatzigeorgiou AG. Diana mirpath v. 2.0: investigating the combinatorial effect of micrornas in pathways. Nucleic Acids Res. 2012; 40(W1):498–504.

    Article  Google Scholar 

  41. 41

    Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, Cui Q. Hmdd v2. 0: a database for experimentally supported human microrna and disease associations. Nucleic Acids Res. 2013; 42(D1):D1070–4.

    Article  PubMed  PubMed Central  Google Scholar 

  42. 42

    Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y. mir2disease: a manually curated database for microrna deregulation in human disease. Nucleic Acids Res. 2009; 37(suppl 1):98–104.

    Article  Google Scholar 

Download references


Y. Yang is supported by the Shanghai Municipal Natural Science Foundation (No. 16ZR1448700) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry. W. Kong is supported by the National Natural Science Foundation of China (No. 61271446).


The publication costs for this article were partly funded by the the Shanghai Municipal Natural Science Foundation (No. 16ZR1448700) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.

Availability of data and materials

The datasets analysed during the current study are available in the Gene Expression Omnibus (GEO) repository,

Authors’ contributions

YY designed the system, NH and LH performed the computational tasks, YY and WK analyzed the results. YY and NH drafted the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

About this supplement

This article has been published as part of BMC Genomics Volume 18 Supplement 2, 2017. Selected articles from the 15th Asia Pacific Bioinformatics Conference (APBC 2017): genomics. The full contents of the supplement are available online

Author information



Corresponding author

Correspondence to Yang Yang.

Additional information

From The Fifteenth Asia Pacific Bioinformatics Conference Shenzhen, China. 16-18 January 2017

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Yang, Y., Huang, N., Hao, L. et al. A clustering-based approach for efficient identification of microRNA combinatorial biomarkers. BMC Genomics 18, 210 (2017).

Download citation


  • MicroRNA
  • Biomarker
  • Clustering