Automatically clustering large-scale miRNA sequences: methods and experiments
© Wan et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Skip to main content
© Wan et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Since the initial annotation of microRNAs (miRNAs) in 2001, many studies have sought to identify additional miRNAs experimentally or computationally in various species. MiRNAs act with the Argonaut family of proteins to regulate target messenger RNAs (mRNAs) post-transcriptionally. Currently, researches mainly focus on single miRNA function study. Considering that members in the same miRNA family might participate in the same pathway or regulate the same target(s) and thus share similar biological functions, people can explore useful knowledge from high quality miRNA family architecture.
In this article, we developed an unsupervised clustering-based method miRCluster to automatically group miRNAs. In order to evaluate this method, several data sets were constructed from the online database miRBase. Results showed that miRCluster can efficiently arrange miRNAs (e.g identify 354 families in miRBase16 with an accuracy of 92.08%, and can recognize 9 of all 10 newly-added families in miRBase 17). By far, ~30% mature miRNAs registered in miRBase are unclassified. With miRCluster, over 85% unclassified miRNAs can be assigned to certain families, while ~44% of these miRNAs distributed in ~300novel families.
In short, miRCluster is an automatic and efficient miRNA family identification method, which does not require any prior knowledge. It can be helpful in real use, especially when exploring functions of novel miRNAs. All relevant materials could be freely accessed online (http://admis.fudan.edu.cn/projects/miRCluster).
Over the last decade, 20-30 nt RNA molecules have emerged as critical regulators in the expression and function of eukaryotic genomes [1, 2]. microRNA (miRNA), one of the most important categories of these small RNAs, acts in both somatic and germline lineages in a broad range of eukaryotic species to regulate endogenous genes and to defend the genome from invasive nucleic acids [3–6]. In 1993, the first identification of lin-4 in C. elegant added another dimension to the known genome regulation networks . miRNAs act as guide molecules in post-transcriptional gene silencing (PTGS) by base pairing with target mRNAs, which leads to mRNA cleavage or translational repression .
The biogenesis of both animal and plant miRNAs is a two-step process [9, 10], which initiates with the nascent transcripts - primary miRNAs (pri-miRNAs) that are usually several hundred nucleotides long. For animal miRNAs, the primary transcripts are processed in the nucleus by a multiprotein complex containing an enzyme called Drosha to give rise to the ~70 nt long miRNA stem-loop precursors (pre-miRNAs) which are then exported to the cytoplasm . In the cytoplasm, a second step takes place where a pre-miRNA matures into a ~21-nt long miRNA:miRNA duplex, with each strand originating from opposite arms of the stem-loop. Although some miRNAs are Dicer independent [11, 12], most are produced by the action of an enzyme called Dicer, which recognizes the double-stranded stem . In general, the miRNA strand is then integrated into the miRNA-induced silencing complex (miRISC) or miRNA-containing ribonucleoprotein particles (miRNPs) and the miRNA* is degraded . Sometimes both strands can be detected, in which case the miRNA* designates the less predominant form of the mature miRNA [15, 16].
In plants, miRNA biogenesis follows a similar process, but a miRNA seems to be fully matured into a single stranded miRNA before being exported to the cytoplasm and integrated onto the silencing complex, which partially explains why intermediate forms of plant miRNAs are only rarely detected [9, 17]. All maturation steps of plant miRNAs are processed by Dicer-like proteins . Plant miRNA precursor sequences are much more diverse in both length and secondary structure than those in animals ; there are two precursor-processing pathways that have been identified for plant miRNA genes. Besides the primary pathway involves stem-to-loop processing [20, 21], the second pathway involves loop-to-base processing in which the sequence and structure beyond the miRNA-miRNA* site are necessary and used by the cleavage pathway components to excise the mature sequence(s) [22, 23].
In this paper, we intend to develop a method to automatically arrange miRNA maturities, especially after noticing that not all miRNAs in miRBase have been classified by far. For example, from version 5 to 18, the number of classified miRNAs increases from 1176 to 13,681, while unclassified matures increases much faster from 91 to 5,777 (see Figure 1).
Statistical results of miRNA families in miRBase16 and miRBase17
Family number of each organism
Notations of datasets
Feature selection on miRBase16
Animal and plant families that hold at least 5 members in miRBase16
New families evaluation in miRBase17
Besides R1, also includes new members of novel families and existing families (contain no less than 4 members) in miRBase17.
Prediction of unclassified members in miRBase17
Animal and plant families that contain no less than 5 members, plus unclassified matures in miRBase17.
In this paper, we develop an effective clustering-based method to automatically build up a family system of miRNA maturities, and compare the result with the family system provide by miRBase. Here, n-grams  is used as the feature extraction method to transform primary sequences to numeric vectors. We then consider several feature selection methods to reduce the dimension of feature matrix, thus get a smaller group of more informative features.
An n-gram is a subsequence consisting of n spatially consecutive items extracting from a given sequence . Here, the items are base nucleotides A, C, G and U. Considering the instinct difference between these features, we adopted a weighted concentration method from miRFam  to combine these features into one vector, , and all values sum up to 1 in every single feature vector. To facilitate the illustration, we used the notation of "GramN" as the combination of 1-gram, 2-gram...and N-gram features. Consequently, Gram4 or Gram5 are chosen as features in different datasets for our study.
As explained above, there are 4 features of 1-gram, 16 of 2-gram, 64 of 3-gram, 256 of 4-gram, and 1024 of 5-gram respectively. However, the disadvantage might be that: some redundant features or noise data may exist among the large feature space. In order to effectively select a subset of representative features and construct a more robust family discovery method, three different dimension reduction methods are introduced here. Detailed information about them is as follows.
Latent semantic analysis (LSA) , a famous method in natural language processing, which extracts and represents the contextual usage meaning of words by statistical computations applied to a large corpus of text. At the beginning, a matrix of word counts per document is constructed. Then, the particular technique singular value decomposition (SVD) is performed to reduce the columns while preserving the similarity between rows. LSA has been greatly successful in many information retrieval applications such as document classification , spam filtering , and text summarization .
Locally linear embedding (LLE)  is a recently proposed unsupervised learning algorithm to compute low dimensional embedding of high dimensional data, while neighbourhood relationship is preserved. It computes the k nearest neighbours of each data point in D-dimension at first, and then generates a weight matrix that best reconstructs each data point from its neighbours, and finally computes the d dimensional embedding coordinates that reconstructed by the weight matrix.
Isometric feature mapping (Isomap) , another low-dimensional embedding method, is widely used by incorporating geodesic distances on a weighted graph with metric multidimensional scaling. It consists of three main steps. First, it constructs a neighbourhood graph after computing k nearest neighbours, and then geodesic distances are estimated between all pairs of points by computing the shortest paths in the graph. Finally, an embedding is constructed in d dimensions that best preserves the estimated geometry in the second step.
There are some differences among them. LSA is a method that basically uses SVD to reduce the dimension according to singular values, while LLE and Isomap both compute low dimensional embedding based on k nearest neighbours. LLE is aimed to maintain the local linear reconstruction relationship among points; while on the other hand, the major concern of Isomap is to get a low-dimensional representation that best preserves geometry computed from the graph.
Based on their performances on dataset R1, the one with best performance is chosen to be used in the following experiments.
The main aim of this paper is to develop an effective method to automatically discover miRNA families by performing unsupervised clustering analysis. In this study, we adopt a simple K-means  clustering method from Biopython , which is a set of freely available tools for biological computation written in python. Currently, there are many distance measurements available to evaluate the relationship between two data points , such as Euclidean distance, city block distance, and spearman's distance. However, after some tests, "city block distance" is chosen here as it always gives better clustering results. To evaluate the performance of our clustering approach, we provide two measure metrics. Generally, the most straightforward way to evaluate the performance of a clustering method is F-measure . At the beginning, F-measure is computed to balance both precision and recall values for each family. Then, an average F-measure can be calculated by weighting different family size to quantify the agreement between true families and generated clusters. Besides, we also consider a specific accuracy score after adopting a "Vote strategy" to further explore those generated clusters. It works as follows. First, small clusters with less than 5 miRNAs are thought to be of bad quality and ignored. Second, after calculating detailed distribution of families in each cluster, we assign a cluster to the family that has the largest number, and only those miRNAs are thought as correctly clustered. However, in some cases, several miRNAs from different families (e.g. plant families MIR169_1 and MIR169_2) are so similar that they are clustered together frequently. Hence, in each cluster, if the number of the second largest family exceeds 1/3 of the largest one, we will also treat that meaningful family as correctly clustered.
Unfortunately, there're still two limitations of K-means algorithm. On one hand, it is significantly sensitive to the initial randomly chosen cluster centres. To reduce this unstable factor, we run the clustering procedure multiple times (e.g. 20 here) in each experiment, and eventually choose the one with best convergence. On the other hand, it is difficult to determine the number of clusters in a specific dataset without any prior information. For most of clustering methods, this parameter needs to be determined by users. To solve this problem, we perform the clustering process several times with different cluster number. Concretely, for a specific dataset if the family number defined by miRBase is N, the argument of cluster number K is set to 6 values that uniformly distributed between N and 2*N. For example, if there are 300 families in a dataset, then the testing cluster number is set to 300, 360, 420, 480, 540, and 600, consecutively. At last, the one with the highest accuracy score is chosen as the overall accuracy of K-means clustering approach.
We also use Clustal X2 (Version 2)  to do multiple sequence alignment of some miRNA samples during evaluating the discovering novel families from unclassified miRNAs by in miRBase17 our miRCluster.
We start with the feature selection test. The methods used here are latent semantic analysis (LSA), locally linear embedding (LLE) and Isometric feature mapping (Isomap). With selected features falling between 10 and 180, experimental results showed that Isomap achieves the best clustering accuracy on our datasets. Therefore, we chose Isomap as the feature selection method for all the following study. The second experiment is to evaluate miRCluster's ability to identify new families and new members of existing families in miRBase17. Result shows that miRCluster can correctly detect novel families and new members. From this point of view, it is a good candidate to help annotating miRNAs when enormous new miRNAs are registered in miRBase these days. The purpose of the last experiment is to discover novel families from those unclassified miRNAs. Currently, there are 5,127 unclassified miRNAs in miRBase17, which account for almost 1/3 of all the mature sequences, thus it is a great challenge to find meaningful results by mining these unclassified miRNAs.
In miRBase16, there are about 1,114 animal and plant families, but more than half are too small that contain less than 5 members. As we all know, for most clustering methods, it is hard to perfectly separate small families from big families which contain the majority of miRBase. Hence, small families are ignored, and finally we get 319 animal and 75 plant families, of which the total mature sequences are 7,197 and 2,028, respectively. For example, the three biggest families in animal are let-7, mir-17 and mir-154, which hold 195, 175 and 169 members respectively. Meanwhile, the top 3 families in plant are MIR166, MIR156 and MIR395, which have 141, 140, and 119 miRNAs respectively.
In our analysis, Gram4 (combination of 1, 2, 3, 4-gram) is extracted, followed by using a concentration method illustrated in miRFam  to get original features. With K-means clustering and vote strategy, we can calculate two performance metric values: F-measure and accuracy. The average F-measure is 0.7737, and accuracy is 91%, which means 8,395 mature sequences are correctly clustered and 830 are not.
Since the original feature vector contains 340 high dimensions, direct clustering approach is always time-consuming and accuracy-lacking. Hence, we consider three dimension reduction methods: LSA, LLE and Isomap. The parameter of k nearest neighbours for LLE and Isomap is set to 10, and the selected dimensions are tested from 10 to 180 with a step size 10.
According to the above experimental results, Isomap is employed for feature selection in our clustering approach, and the dimensionality could be set between 120 and 160, in order to get the balance of time and accuracy.
Additionally, we calculate the "dead" families, which are not discovered by miRCluster due to two reasons. One is the filtering of noisy clusters (contain less than 5 members), and the other is the difficulty of distinguishing small families from bigger ones. Results are summarized in Additional file 1 Table S1. Before feature selection, we find 42 "dead" families containing 285 miRNA sequences. However, after using Isomap to select 150 informative features, "dead" families are reduced to 40, which contained 270 mature sequences. For example, mir-1422 (containing 18 members) is successfully discovered after using Isomap.
In summary, the result shows that Isomap is effective to improve the overall accuracy of our clustering approach, and it succeeds in finding some families that are not discovered before. Furthermore, experiments in later sections will demonstrate its effectiveness again.
In this section, we try to evaluate miRCluster's capability to find new families and new miRNAs that belong to existing families.
There are 72 new families and 2,383 new miRNAs added to miRBase17 compared to miRBase16. Among the new families, only 10 hold at least 4 members and the total number of mature sequences is 58. It consists of a 12-member family (mir-3851), a 10-member family (mir-3811), an 8-member family (MIR5067), and 7 4-member families (Additional file 1 Table S2). Besides, there are 605 new mature sequences which belong to existing families (holds at least 5 members). By adding these data to R1, a new dataset R2 is constructed and used to evaluate miRCluster.
Result of evaluating new families in miRBase17
Correctly clustered families
Correctly clustered members
Correctly clustered new miRNAs belonged to existing families
Before feature selection
After feature selection
For the dataset with all features, miRCluster find 9 new families that contain 34 new mature sequences. After selecting features with Isomap, we find that the clustering approach successfully found 8 new families containing 37 mature sequences when 140 features are selected..
Although the number of correctly found new families decreases from 9 to 8 after feature selection, more members of the two big families (mir-3581 and mir-3811) are identified. In fact, all the members in mir-3811 have been identified, while only half are discovered before. On the other hand, two small families (mir-3836 and mir-3817) each with 4 members disappear after feature selection.
Meanwhile, miRCluster identify 447 new members of existing families and this number does not change a lot before and after feature selection. All the results indicate that our method can correctly discover new families and new miRNAs. Concerning the fast growing of new miRNAs these days, miRCluster will be helpful in real use.
After analysing miRBase carefully, we find that many miRNAs are not classified to any known family, and the number of unclassified miRNAs increases even faster than that of classified miRNAs. For example, in miRBase 5.0, the ratio of unclassified mature sequences over classified is 7.7% (91/1,176). However, the ratio increases to nearly 42% (5,127/12,051) in miRBase17. Similar to the above experiments, we first select the 331 animal and 82 plant families that hold no less than 5 members from miRBase17, and all then unclassified miRNAs to build dataset R3.
Here, we mainly focus on 3 types of clusters. First, those cluster called "specific novel family" whose members are mainly novel miRNAs. Second, the cluster mixed by novel and existing family members at a considerable rate: member in the second largest family is more than 1/3 of member in the largest family. Lastly, the cluster constructed by novel miRNAs and an existing family, but these novel miRNAs not enough to form quality family.
Evaluating unclassified miRNA in miRBase17
To novel family1
To mixed family2
To existing family
Gram4(140 features, Isomap)
Gram5(140 features, Isomap)
Below, cluster number is set to 1,200 all the time. With Isomap, we reduce the dimension to 140, this time the number of novel families increases to from 293 to 301. However, about 400 more miRNAs are clustered into these novel families, and the number of novel miRNAs mixed with known families decreases to 1,272 (1,457 before feature selection). This indicates that feature selection is a good choice to select informative features that make novel miRNAs easier to be clustered.
In all previous experiments, Gram4 is chosen. However, the relative position information is ignored because n-gram does not consider the specific position of a small fragment in a sequence. After exploring the clustering results by ClustalX, we find that some dissimilar sequences with only several similar fragments at different positions are frequently clustered together. Hence, we use Gram5 (1,346 features) to examine how clustering performance will be impacted by longer features. With 140 features selected by Isomap, we find that the number of novel miRNAs mixed with known families decreases by 187 compared to the result with Gram4. It seems that miRCluster performs better when longer n-grams are considered.
In order to clearly show this problem, we use ClustalX to do multiple sequence alignment on some clusters.
First, an example of novel family is explored (Additional file 1 Figure S1). We add a "*" in front of the sequence name to indicate it is a novel miRNA. This figure clearly shows that miRNAs in the novel family are similar to each other, and this novel family is clustered with good quality.
Then, we look a cluster that is mixed by novel miRNAs and members of known family (Additional file 1 Figure S2). This family was not discovered when the cluster number was set to 800 with Gram4, but was well established in other 3 experiments. By considering larger n-grams and bigger cluster number, more novel miRNAs (mmu-miR-3966, sme-miR-745, cte-miR-745a, bfl-miR-22, and pma-miR-22b) were grouped to the known family miR-22.
As the final product of the complicated biogenesis, biologically speaking, mature miRNAs are more important than their precursors. Two reasons make it a more challenge problem to classify mature miRNAs. First, sequence length of maturity is usually only 1/3 to 1/20 of its precursor. Second, one single pre-miRNA can generate several distinct mature miRNAs that belong to different families. By considering larger n-grams, miRCluster can achieve acceptable accuracy (always over 90%), but the side effect is also obvious. In order to improve the efficiency of our method, we do not directly use traditional machine learning methods to select featured n-grams. Instead, we treat this as a dimension reduction problem to reduce the search space. After testing three different reduction algorithms, results show that both LSA and Isomap can get comparable results while compared with the original situation when only less than 10% n-gram features are selected. Compared with miRFam, miRCluster employs similar number of features but it does not rely on any prior knowledge.
As the core functional region of miRNAs, previous studies have revealed that seed region is more conserved than other backbone regions [14, 18, 50]. We also constructed a weighted feature vector that emphases the seed region by calculating twice those n-grams extracted from this locus. Surprisingly, experiment results do not always turn better, which reveals that the seed region might not be so conserved, especially for animal miRNAs (see Additional file 1 Table S3).
Currently, around 30 percentage (5,127/16,772) miRNAs registered in miRBase (version 17) are unclassified. And the portion is becoming larger (Figure 1). In the foreseeable future, along with the development of advanced sequencing technologies and development of relevant analysis technologies, more novel and specific miRNAs will be identified from more re-sequenced or de novel sequenced species. How to manage and arrange those data automatically and effectively will be greatly helpful for miRNA research community. By far, the main part of current miRNAs family architecture is constructed by small families whose member is less than 5. And we all know that this issue is determined by the diversification of MIRNA genes and their functions. With the proposed miRCluster, we found that around 300 novel families contain ~44% of over 5,000 unclassified mature miRNAs in miRBase 17, but it is still hard to say whether there are overlaps between these novel families and those existing small families, and what is the portion. Actually, finding the minority from the majority is still an open issue from the perspective of machine learning. Currently, one possible solution is to combine fast machine learning methods and traditional sequence alignment methods. First, constructing a family architecture only includes big families while keeping the others in a candidate pool. Then, performing sequence alignment for each miRNA in the pool to get its homologies and determine its family.
Researchers working in both wet and dry labs can get benefit from miRNA family information [18, 51]. But the traditional manual or semi-automatic miRNA classification methods can't deal well with the rapid growth of miRNAs, especially after the widespread of NGS technologies. The development of miRFam fills the gap. As the first miRNA-oriented family classification method, it validates that automatic method can be greatly helpful to solve the problem and using n-gram features is a satisfactory choice to describe miRNA primary sequences , which thus avoids the traditional multiple sequence alignment procedure.
In this work, we aim to deal with shorter maturities, which is more difficult to tackle compared with the longer precursors. Although we still use n-grams but this time a larger number of n-grams are considered. Intuitively, it will be much easier to correctly assign the miRNAs based on more sophisticated description schemes. Here, the question is also obvious, that the speed will be significantly affected. Furthermore, we introduce three different dimension reduction methods that have been validated to be effective in many fields. Finally, comparative studies show that Isomap is a better choice when selecting a feature subset to represent the whole n-gram feature set. Surprisingly, by reducing feature dimension, not only the speed of miRCluster is greatly enhanced, but also the performance is also slightly improved. Thus makes our method more efficient and effective.
A significant difference between miRCluster and miRFam is that miRCluster can recognize both novel members and novel families, while miRFam can only identify the former. Although miRCluster can only achieve comparable performances compared with miRFam in most situations, but considering that it does not require any prior knowledge, which makes it more practical and powerful in real use.
This work was supported in part by the National Basic Research Program of China under grant No. 2010CB126604 and the National Natural Science Foundation of China (NSFC) under grant No. 60873040 and No. 61173118. JHG was also supported by the Fundamental Research Funds for the Central Universities and the Shuguang Scholar Program of Shanghai Education Development Foundation.
This article has been published as part of BMC Genomics Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.