In silicomiRNA prediction in metazoan genomes: balancing between sensitivity and specificity
© van der Burgt et al. 2009
Received: 12 September 2008
Accepted: 30 April 2009
Published: 30 April 2009
Skip to main content
© van der Burgt et al. 2009
Received: 12 September 2008
Accepted: 30 April 2009
Published: 30 April 2009
MicroRNAs (miRNAs), short ~21-nucleotide RNA molecules, play an important role in post-transcriptional regulation of gene expression. The number of known miRNA hairpins registered in the miRBase database is rapidly increasing, but recent reports suggest that many miRNAs with restricted temporal or tissue-specific expression remain undiscovered. Various strategies forin silicomiRNA identification have been proposed to facilitate miRNA discovery. Notably support vector machine (SVM) methods have recently gained popularity. However, a drawback of these methods is that they do not provide insight into the biological properties of miRNA sequences.
We here propose a new strategy for miRNA hairpin prediction in which the likelihood that a genomic hairpin is a true miRNA hairpin is evaluated based on statistical distributions of observed biological variation of properties (descriptors) of known miRNA hairpins. These distributions are transformed into a single and continuous outcome classifier called theLscore. Using a dataset of known miRNA hairpins from the miRBase database and an exhaustive set of genomic hairpins identified in the genome ofCaenorhabditis elegans, a subset of 18 most informative descriptors was selected after detailed analysis of correlation among and discriminative power of individual descriptors. We show that the majority of previously identified miRNA hairpins have highLscores, that the method outperforms miRNA prediction by threshold filtering and that it is more transparent than SVM classifiers.
TheLscore is applicable as a prediction classifier with high sensitivity for novel miRNA hairpins. TheL-score approach can be used to rank and select interesting miRNA hairpin candidates for downstream experimental analysis when coupled to a genome-wide set ofin silico-identified hairpins or to facilitate the analysis of large sets of putative miRNA hairpin loci obtained in deep-sequencing efforts of small RNAs. Moreover, the in-depth analyses of miRNA hairpins descriptors preceding and determining theLscore outcome could be used as an extension to miRBase entries to help increase the reliability and biological relevance of the miRNA registry.
MicroRNAs (miRNAs) are ~21-nucleotide (nt) short, single stranded RNA molecules involved in post-transcriptional regulation of gene expression . They are present in higher eukaryotes and some viral genomes . Because the miRNA and small-interfering RNA (siRNA) pathways partly overlap, current understanding of miRNA biogenesis has gained from advances made in the field of RNA interference. Mature, functional miRNAs develop from degenerate palindromic repeats with a characteristic hairpin-like secondary structure [1,3,4]. Initially, experimental identification of miRNAs was achieved through direct cloning and sequencing of small RNAs [5,6]. However, such relatively low-throughput screenings were biased towards abundantly or ubiquitously expressed miRNAs  and missed many miRNAs with restricted temporal or tissue-specific expression patterns . Recently, strategies using PCR , microarrays [8,9] or ultra high-throughput sequencing [10,11] have expanded the list of known miRNAs. Many of these show tissue-specific expression [9,12] or appear to be species-specific [12,13]. In bothArabidopsis thaliana[14,15] andCaenorhabditis elegans[10,16], for example, high-throughput sequencing of small RNAs shows moderate overlap in detected miRNAs between experiments in each species, indicating that also in well studied genomes many new miRNAs remain to be discovered. In addition, the observation that many miRNA loci exhibit compelling hairpin structures on both sense and antisense strands led to the discovery of anti-sense miRNA transcription . Antisense miRNA transcription and processing yield distinct mature miRNAs. This contributes to the functional diversification of miRNA genes for a considerable fraction of the known miRNA loci [17,18].
Various methods for thein silicoprediction of miRNAs have been developed to aid in experimental studies of miRNA discovery [19,20]. These methods generally consider the hairpin-like secondary structure of the miRNA precursor, the miRNA hairpin, as the most important characteristic of a miRNA gene. They use RNA secondary structure prediction (RSSP) algorithms, such as RNAfold  or Mfold , to predict the secondary structure and thermodynamic stability of the RNA hairpin structures. Current bioinformatics approaches for the prediction of miRNAs [19,20] generally include three steps: (1) genome-wide prediction of hairpin structures; (2) filtering or scoring of those hairpins on the basis of their similarity in physical and sequence features to known miRNA hairpins and (3) experimental validation of putative candidates.
A common approach for the first step is to search for hairpin structures using a sliding window and perform RSSP on each window [5,8,23,24]. An improvement in overall calculation time over this approach is to first identify degenerate palindromic sequences and to analyse only these further with RSSP [25,26]. Unfortunately, these approaches detect vast numbers of hairpin structures in complete eukaryotic genomes. Depending on the method used, 1E3 to 4E3 hairpins per Mb of genomic sequence are found, resulting in about 1E7 hairpins identified in the human genome [8,24–27]. The challenge is, therefore, to devise an appropriate filtering method to separate the chaff from the wheat.
Different criteria for filtering candidate miRNA sequences have been proposed, generally with the aim to reduce the search space and/or to increase the specificity of prediction . Evolutionary conservation is considered an important feature of the hairpin sequence  and analysis thereof is often used to identify and focus comparisons on the conserved non-coding sequence space in different genomes [5,23]. An evolutionary approach known as phylogenetic shadowing has been used for combined selection and filtering of miRNA candidates . This study revealed a characteristic camel-shaped conservation pattern of putatively orthologous miRNAs that was useful as a criterion for finding conserved miRNA candidates in primate genomes. Other filtering criteria include intragenomic matching of candidate miRNAs and their potential targets , evidence for expression, thresholds on structural properties of hairpins, e.g. minimal folding energy (MFE), absence of repetitive or low-complexity sequences, occurrence in introns or intergenic regions , or proximity to known miRNA loci [2,24,30].
Stringent filtering is performed to attain high specificity, that is, to minimize the number of false positive predictions of miRNA genes [20,25]. Obviously, maximizing specificity increases the number of false negative predictions, that is, a decreased sensitivity . This implies that with stringent filtering, relevant miRNAs will be missed. The success of any filtering procedure depends however on the validity of the underlying assumptions. For example, filtering on evolutionary conservation will miss species-specific or fast evolving miRNAs  and low-complexity filtering is likely to miss miRNAs originating from transposable elements [32,33]. Several recent methods employ machine learning techniques such as a Support Vector Machine (SVM) [24,27,34,35] for classification. Such SVMs evaluate differences in hairpin properties between true-positive and true-negative examples of miRNAs for a given taxon to generate a prediction classifier. SVMs have been successfully used for miRNA gene prediction [24,34,35] and for the prediction of 5' Drosha processing sites in miRNA hairpins [26,11]. Although the SVM approach is claimed to outperform earlier methods , SVM-based classifications combine many features in a single kernel function and therefore do not provide direct insight into the biological significance of these features. Such insight can only be obtained through expert analyses and dedicated feature selection procedures . Moreover, the set of true-negatives which is required for training an SVM is often very difficult to define.
Here, we present an innovative strategy for miRNA prediction that focuses on attaining optimal sensitivity. We define and combine 40 different filtering criteria and, using a set of genomic hairpins identified in the genome ofCaenorhabditis elegans, show that 18 of these characteristics capture the biological variation of miRNA features present in sets of known miRNA hairpins. These 18 criteria are used to establish a combined likelihood scoreLthat assesses the likelihood that a predicted hairpin structure in a genome contains a genuine miRNA.Lis a continuous classifier that allows user-adjustable thresholds for sensitivity and specificity inab initiomiRNA prediction and miRNA analysis. Good performance ofLfor large sets of hairpins from the genomes ofC. elegansand four viruses demonstrates the added value of the new analytical strategy for future miRNA discovery and selection.
Subset of 18 most informative miRNA hairpin descriptors
ratio asymmetrical bulges vs. stem length
adjusted base pairing propensity (dP)
longest bulge in stem (nt)
longest match-stretch in stem (nt)
central loop length (nt)
max match count
matches in 24 nt
MFEahl index 
MFEahl corrected for GC- Content
Normalized Shannon entropy (Q)
surplus of GA over CU in sequence
1.12 (u) 1.05 (l)
surplus of G over C in sequence
longest poly-A stretch (nt)
longest mono-nucleotide stretch (nt) in the hairpin
longest poly-U stretch (nt)
Di-nucleotide Sequence Complexity (-)
Mono-nucleotide Sequence Complexity (-)
ratio of GU-matches vs. all matches
MFE Adjusted for hairpin length
The CDF of the fitted distribution was transformed into an LDF with outcomeS. The range ofS(between 0 and 1) has anS< 1 and anS= 1 fraction which are separated by the cut-off value derived from the 95% confidence interval of the descriptor's CDF. TheS< 1 fraction contains miRNAs hairpins with descriptor values in the tail(s) of the distribution. These have a low probability of occurring in true miRNA hairpins. TheS= 1 fraction contains miRNA hairpins with values in the remainder of the distribution and corresponds to likely properties of miRNA hairpins. Descriptors were treated differently with respect to the transformation of the tails of the CDF (Table1; [see Additional file2]). For example, for the descriptor "minimal folding energy" (MFE) of a miRNA, there is in principle no need to impose a lower bound, even though the fitted distribution (Figure1A) indicates that very low MFE values occur rarely in known miRNAs. The LDF of the descriptor MFE assigns a scoreS= 1 for all values below -23.72 kcal/mol. Higher MFEs are penalized proportional to the LDF and therefore assigned a score S < 1. MFE is an example of a descriptor where theS< 1 fraction represents the upper 5% tail of the confidence interval. For other descriptors, theS< 1 fraction is represented by the lower 5% tail of the distribution (e.g. match ratio) or by both the lower 5% and the upper 5% tail (e.g. GC-content, Figure1B). Table1and Additional file1list the unlikely tails for each descriptor.
The 40 descriptors here defined were either based on the sequence of the hairpin, the structure of the hairpin or on a combination of both (Table1; [see Additional file1]). Correlations among these were obvious, for instance, between stem length or GC-content on the one hand and the MFE of a hairpin on the other hand. Correlated descriptors will overemphasize the importance of a more general feature of a miRNA hairpin and affect the usefulness ofL. To assess the correlation among descriptors in theirS< 1 fractions, we calculated Cohen's kappa coefficientκ for all 780 possible pairs of descriptors, using the miRNA hairpins of the taxonomic set Metazoa [see Additional file3]. For each descriptor the most strongly correlated descriptor, as determined by the highest observedκis listed in Table1(columnκ) and Additional file3. The highestκwas found between descriptors 'GCratio' and 'GsurplusC' (κ= 0.995), followed by 'D' and 'Q' (κ= 0.844) and the pair 'P' and 'Z' (κ= 0.784).
The correlation and discriminative power of descriptors were used to select a non-redundant subset of most informative descriptors. Descriptors that either correlated with a more selective descriptor, using a threshold for correlation ofκ>0.4, or that had a discriminative power smaller than 1.1 were omitted. This resulted in a subset of 18 descriptors, seven of which were sequence related, nine structure related, and two descriptors with mixed properties (Table1). Remarkably, the descriptors 'GC-content' and the MFE randomization descriptors 'P'  and 'Z' , which are often used in miRNA prediction studies, were not included in the subset [see Additional file1]. The latter two ranked among the most selective descriptors, but were excluded because of their strong correlation with the most selective descriptor MFEahl index, in which the MFE is adjusted for hairpin length and GC-content [see Additional file3].
The 18 most selective descriptors (Table1) were used to define and analyze different scoring models. The CDF of the fitted distribution of all 18 descriptors was transformed into an LDF with outcomeSand theLscore for a given miRNA sequence was calculated as the product of allSvalues. Scoring model performance, defined as the power to distinguish (potential) miRNA hairpins from other (or random) genomic hairpins, was compared for different models that were built using varying settings for five parameters (see below). Scoring model performance was measured as AUC performance and, where appropriate, with selectivity measured at two values of sensitivity (95% and 75%), using genomic hairpins inC. elegansand the collection of miRNAs hairpins in the taxonomic set Metazoa.
The effect of weighting individual descriptors was studied by assigning weights to each of the 18 previously selected descriptors. Weights were equal to the square root of a descriptor's discriminative power as measured at a sensitivity of 95% and ranged from 1.06 for GAsurplusCU to 3.65 for MFEahl (Table1; [see Additional file1]). Figure6shows that a scoring model with weighted descriptors (W95% and W90%) had a significantly increased AUC performance and selectivity index relative to their unweighted models (95% and 90%). Obviously, when weighted on the basis of their discriminative power, descriptors with a high discriminative power contribute stronger to the overallLscore than descriptors with low discriminative power. As such, a better separation between miRNA hairpins and random genomic hairpins is accomplished.
Finally, combinations of weighting and varying the parameterization of descriptors were tested (models W90% and W80%, Fig.6). Scoring model W90% had the highest AUC performance on metazoan miRNA hairpins of all tested models. When weighting and parameterization were compared, weighting showed a stronger effect on selectivity measured at 95% sensitivity, whereas parameterization had a stronger effect on selectivity at 75% sensitivity. This implies that both variables have a distinct effect on the shape of the ROC curve.
Scoring model performance
Scoring model description
Chi - Squared
Default models g
Optimized models g
Comparison of threshold filtering of miRNA hairpins with the L-score classifier
Scoring model Metazoa
Fixed at specificity
Fixed at sensitivity
Identified (miRNA) hairpins in genomes of C. elegans and four viruses
Identified miRNA hairpinsa
Identified genomic hairpins
miRNA hairpins % (number)b
genomic hairpin loci % (number)b
L = 1.0
L ≥ 0.05
L ≥ 1e-5
L = 1.0
L ≥ 0.05
L ≥ 1e-5
In the four viral genomes, 25,599 hairpins were identified, including all 55 known miRNAs hairpins of these viruses. InC. elegans, 3,526,115 hairpins were predicted and only four out of 132 known miRNAs hairpins were missed (cel-mir-262, cel-mir-260, cel-mir-272 and cel-mir-256). When benchmarking the performance of the hairpin identification with the miRBase entries  of all metazoan miRNA hairpins, 3,803 out of 3,902 (97.5%) miRNA hairpins were recovered. The hairpin prediction algorithm is independent of sequence context (data not shown). This benchmark is therefore an estimate of the algorithm's good performance on metazoan genomes. Similar performance has been reported for other edit-distance based hairpins detection methods . On a genome scale, all these methods yield around 10,000–20,000 hairpins per single stranded Mb of sequence. These hairpins are predominantly overlapping and nested.
The data in Table4show that only 1.3–2.6% of the indentified hairpin loci in four viral genomes have a highLscore (L≥ 0.05). This corresponds to 69–247 loci per genome. In addition, 83–100% of all known miRNA hairpins comply with this threshold forL. Out of 6,182 hairpins predicted in the genome of EBV, only 23 hairpins, originating from 20 unique loci, had anLscore of 1.0. Ten of these were experimentally validated miRNA loci  [see Additional file5]. Further support for our miRNA prediction and scoring method comes from recently discovered miRNAs in the MDV genome  that were not included in miRBase 9.0. All five novel miRNAs (mdv1-mir-M9 to mdv1-mir-M13) were present in the set of here predicted hairpins and three of these hadLscore of 1.0 [see Additional file6]. The 15 unique loci in MDV with anLscore of 1.0 collapsed into eight unique sequences due to a large inverted repeat. Out of these eight, all five loci that did not overlap with annotated exons corresponded to known miRNA loci.
The two examples show that most hairpins in viral genomes with highLscore are true miRNA hairpins and that the absolute number of hairpin loci with highLscores is small. This allowed us to manually examine the remaining, non-miRNA loci with highLscores, using additional filtering criteria for genomic location such as proximity to known miRNAs  and intronic position . Among the remaining loci in MDV withL≥ 0.05, one appeared to be located in the transcribed strand of an intron and two others closely flanked (0.3 kb) the mdv1-mir-M1 gene in the same orientation [see Additional file6]. Similarly, out of the remaining ten EBV loci withL= 1.0, two candidate miRNAs were located directly upstream and amidst a cluster of eleven known miRNAs in an intronic region of the BART gene  [see Additional file5].
Mining the C. elegans genome for putative miRNA hairpins
Rejected miRNA hairpinsa
5 kb flanking sequence
Similarity to mature miRNAc
No overlap with exonsd
L = 1.0
L >= 1e-5
Exclude TRF overlape
Filter on 7 descriptors
Exclude known miRNA loci
similarity positioned correctlyf
The data presented for the filtering protocols "Clustered" and "Similar" (Table5) show that combined filtering onLscore, genomic context and threshold filtering allows for compilation of a priority list of candidate miRNAs that is amenable to manual inspection and experimental verification. Apart from filtering on genomic clustering or similarity to known miRNAs, filtering onLscore attains the largest data reduction. This shows that theLscore was important in compilation of the priority list and demonstrates the added value of our approach forin silicomiRNA prediction.
We here present a new computational strategy for thein silicoprediction of miRNA hairpins and show the applicability of the method for predicting new candidate miRNA hairpins in four viral genomes and in the genome ofC. elegans. While using the latter as an example for model construction, theLscore method as here optimized forC. elegansis well usable for other metazoan genomes. However, further improvement of performance of the method on different taxonomic groups can be achieved by constructing dedicated scoring models [see Additional file9]. Our strategy aims at minimizing the number of false negative predictions (optimal sensitivity), rather than at minimizing the number of false positive predictions (optimal specificity), as proposed in previous studies. Focusing on sensitivity rather than specificity should help uncover new classes of miRNA molecules in biological systems. Hairpins in genomic sequences are identified with the help of an adjusted suffix-tree based method. When the performance of the hairpin prediction was benchmarked on sets of all known miRNAs hairpins from viruses and Metazoa, all 55 viral miRNAs were recovered (100%), 128 of 132 (97%)C. elegansmiRNAs were recovered and 3,803 out of 3,902 (97.5%) metazoan miRNAs were shown to be recoverable when considered in their genomic context. Similar performance was reported for another edit distance-based hairpin identification method .
FourC. elegansmiRNAs (cel-mir-262, cel-mir-260, cel-mir-272 and cel-mir-256) remained undetected due to the absence of a stringently base-pairing area in their stems. They all had extremely poorLscores (2.7e-41, 2.5e-27, 3.5e-20 and 3.8e-10), ranking first, third, fourth and eight among theC. elegansmiRNA hairpins with lowestLscores. None of the four was found in recent high-throughput sequencing datasets, which otherwise retrieved the vast majority of knownC. elegansmiRNAs [10,16]. This suggests that these four may not be genuine miRNAs. If so, the reported performance of our hairpin prediction method is underestimated.
The biological variation and evolutionary diversity of various properties of miRNA hairpins were captured in a likelihood scoreL, based on statistics derived from accurately fitted (generally skewed normal) distributions of hairpin characteristics derived from known miRNA hairpins. In total 40 hairpin characteristics were defined and analyzed. TheLscore is a measure for a single hairpin sequence: a descriptor that captures evolutionary conservation in another species is not included. Although conservation has proven to be an extremely selective miRNA detection criterion , it conflicts with the aim to maximize sensitivity, because of the existence of species-specific miRNAs. The strategy was evaluated on genomic hairpins and known miRNAs from theC. elegansgenome. Although details of analyses and results are likely to differ when applied to other genomes and other negative sets of genomic hairpins, major trends and results were shown to be similar [see Additional file9].
Based on analyses of correlation and discriminative power, 18 hairpin characteristics were identified as most selective. Comparison of the 18 descriptor scoring model with a binary threshold filtering protocol using the same 18 descriptors (Table3) shows that a 17% higher selectivity is achieved with theLscore strategy. Binary decision thresholds on all or even a few descriptors can easily result in a major decrease of sensitivity. In the strategy developed here,L< 1 represents individual sequences that have one or more descriptors withS< 1, indicating that these descriptors have a relatively low probability of occurrence in miRNA hairpins because they occur in the tail(s) of their respective distribution. When a descriptor value falls outside the observed range of biological variation, the continuous likelihood scoreLallows for the compensation of an unlikely score for a single descriptor by likely scores for other descriptors. In such a case, the sequence is nota priorirejected as a miRNA hairpin candidate. TheLscore thus allows for more deviations from 'genuine' miRNA characteristics than binary selection (yes/no) on the basis of the same characteristics. This is an important improvement over the use of pre-defined thresholds for filtering on single or multiple descriptors published previously . Assigning scores to descriptors, as opposed to binary selection on pre-defined thresholds, has also been used in previous work. For example, MIRscan [5,6] employs an heuristic score assignment to seven features and assigns weights based on the relative entropy between known miRNA hairpins and genomic hairpins. PalGrade  uses a statistical distribution by arbitrarily binning the ordered vector of descriptor values. SVM kernels achieve descriptor scoring and weighting as part of the SVM . Novel to the approach here developed is that the score assignment is based entirely on the statistical evaluation of the variation in physical and sequence properties observed in known miRNAs and no binary selection prior to building a scoring model is used.
TheLscore is a relative measure of the likelihood that a given hairpin is a miRNA hairpin candidate. An important parameter contributing to usefulLscores is the number of miRNAs used to derive the discriminating statistics of the individual characteristics. A minimum of about a thousand miRNA hairpins is required, but results indicate that the larger the available data set, the better the scoring model captures the variation in miRNA hairpin characteristics and performs. In the context of machine learning, the input miRNA set could be considered a training set, although in theLscore derivation no formal 'training' is included. It is noteworthy that the evolutionary distance between the species contained in the taxonomic set for a scoring model influencesLconsiderably. In general, the scoring model based on the set that is taxonomically closest to the organism being analyzed performs best. This indicates that over a wide range of characteristics, miRNA hairpins within (related) species are substantially more alike than miRNA hairpin sequences between less-related species. This should be taken into account when searching for similarity between miRNA hairpins from distantly related species. However, we observed that, for example, taxonomic sets that do not contain human miRNA hairpins can yield scoring models that accurately identify human miRNA hairpins. In addition, descriptor weighting and parameterization appeared to considerably influence the performance of a scoring model. The analyses presented allow the selection of optimal scoring models, with maximal discriminative power to distinguish true miRNA hairpins from other genomic (or random) hairpins for a selected set and a given data set. However, each set of data, for example in case of a new genome sequence, will require its own analysis to build an optimal scoring model.
Whereas the analyses started with 40 descriptors, based on extensive correlation and selectivity analyses, a subset model based on 18 descriptors was performing better than the model comprising all 40 descriptors. In view of the importance of the taxonomic composition of the set used in the model, it should be pointed out, that this may reflect a taxonomic bias for metazoan sequences that may not be valid for other taxonomic groups, such as, for example plant miRNA hairpins. In the set of 18 most informative descriptors selected, it is remarkable that three descriptors that are generally considered important are not represented: GC content and the MFE randomization descriptors P and Z. GC-content is widely used as a pre-filtering step ofin silicomiRNA prediction methods , but is not among the 18 descriptors used in the final scoring models. The GC-content showed the highest variability in the (skewed-normal) mean over different taxonomic sets (data not shown), reflecting the apparently large variation in GC-content found among miRNA hairpins from different species. Any scoring model that includes the fitted distribution of GC-content would disqualify hairpin structures in species containing miRNAs with relatively high or low GC-content. Also the descriptors P and Z, based on MFE randomization shown to be significantly lower for miRNAs hairpins than for randomized sequences [37,38], were not among the18 descriptors selected. Although P and Z ranked among the most selective descriptors, they were excluded because of their strong correlation [see Additional file3] with the most selective descriptor MFEahl index, in which the MFE is adjusted for hairpin length and GC-content.
SVMs have similar aims as theL-score strategy and perform well in miRNA classification . In assessing and comparing the performance of different methods, however, several caveats should be considered. First, methods that use evolutionary conservation perform well on conserved miRNAs [25,11] but fail to detect species-specific or fast evolving miRNAs . The importance of the latter should not be underestimated. Second, the particular data set(s) on which the performance is achieved is important to the evaluation and comparison of the results from different methods. Unfortunately, there is no benchmark set of both positive and negative examples of miRNA hairpins available. Many methods are tailored on a specific organism and are likely to perform best in that exact context. Assembling a set of true negative sequences is particularly challenging: hairpins in non-coding RNAs, e.g. the set of tRNAs as used in , are likely to possess different and more diverse hairpin features than miRNA hairpins, whereas a set of genomic hairpins might containbona fidemiRNA hairpins. SVMs explicitly require negative examples for training and testing, whereas theLscoring method uses such a set only for benchmarking purposes. Third, data on prediction performance are not reported consistently in literature. Our method enables reporting of AUC performance as well as sensitivity and specificity values over the entire range of the ROC curve. Comparison with binary classifiers from other methods therefore requires transformation of the continuous outcome to a binary outcome by choosing an arbitrary threshold forLand using the associated sensitivity and specificity values as measure of the performance.
With these caveats in mind, we compared the performance of our method to three leading SVM-based methodsmiPred, RNAmicro  and miRNA SVM  Note that RNAmicro is based on multiple sequence alignments. Using different positive and negative datasets, these methods report the following values of sensitivity and specificity; 1)MiPred: 86.69% and 97.68%, using 323 human miRNAs as positive and 646 human genomic hairpins as negative set; 2)MiPred: 87.65% and 97.75%, using 1,918 Metazoan, non-human miRNAs as positive and 3,836 human genomic hairpins as negative set; 3) RNAmicro: 90% and 99%, using 147 Metazoan miRNA hairpin alignments as positive and 383 shuffled miRNA hairpin and tRNA alignments as negative set, and; 4) miRNA SVM: 90% and 95%, using 322 human miRNAs as positive and 3,000 random human genomic hairpins as negative set. These performances compare well to the values of 87.26% and 97.02% obtained in the 10 fold-cross validated performance of ourLscore model, using 203 Metazoan miRNA hairpins as positive and 200,000 randomly selected genomic hairpins fromC. elegansas negative set. The performance of theLscoring model is most similar to that ofmiPred, which does not include sequence conservation as a parameter. An analysis of the performance of our model on sets of genomic hairpins other than those derived for theC. elegansgenome is provided as Additional file [see Additional file9].
A major challenge of any SVM is understanding its behavior in, for example, a biological context. While SVMs are known to produce classifiers that perform well in case of unseen data , SVMs are essentially black-box classifiers. This makes it difficult to judge the relative importance of individual descriptors or to translate results in biologically relevant understanding. As all parameters are embedded in the kernel function of the SVM, SVM classifiers are also difficult to adjust, although they do not require the pre-selection of parameters required for theLscore strategy here presented. Classifier selection based on the detailed descriptor analysis presented here may improve future SVM approaches. A key advantage of theLscore strategy over SVMs is that the contribution of individual descriptors to a scoring model can be analyzed in a straightforward way by adding or removing a descriptor or changing the parameterization or weight of descriptors. This way the scoring model becomes better tailored to the biologist's needs in a particular research environment.
At the laboratory bench, the criterion that is of most interest is simply how many putative hairpins should be evaluated by experimentation. With current developments in microarray analysis and high-throughput sequencing, the numbers of potential candidates that can be screened with relative ease will increase dramatically. Still large numbers of new miRNAs may be identified. Yet, for the time being, the individual laboratory would like to see as little putative candidates as possible with as high a success rate as feasible. The highestLscore is 1, implying that the given hairpin scores are maximal for all descriptors in the model. In the 100 Mb large genome ofC. elegans, still 3,110 hairpin loci remain that cannot be ranked further on the basis ofL. It implies that relatively large numbers of genomic hairpins (3,110 loci from 3,526,115 hairpins, i.e. 0.09%; see Table4) comply with all miRNA hairpin descriptors, whereas it is unlikely that they all generate mature miRNAs, given the relative small number of 132 currently knownC. elegansmiRNAs. It is likely that this situation will occur in most genomic contexts. If so, several strategies are open. The model parameters could be adjusted, so that the individual descriptor is less likely to get the maximal score. This way, theLscore approach will convert to more traditional threshold filtering. Given that the analysis requires high sensitivity, it would however be more advantageous to incorporate more biological expert knowledge in the selection process, such as the presence of the hairpin in an intron, sequence similarity to known miRNAs, etc. When following this strategy, one has to be aware that the number of novel miRNAs that could be discovered is constrained by the filtering on genomic context. For example, in the set of 132 miRNAs inC. elegans(miRBase 9.0), 87 occur in singletons when clustered on distance, 56 have no other family representative and 102 are not located in an intron . These numbers can thus be considered as indicative for the fraction of true miRNAs that will remain concealed when filtering on genomic context. With the filtering protocols ("Clustered" and "Similar") we show that a combination of filtering onLscore, genomic context and threshold filtering allows for compilation of a priority list of manageable size for manual inspection and further experimentation.
In addition to good performance in comparison with other leading (SVM-based) methods and a user-defined selectivity, an additional advantage of theLscore approach over threshold filtering and support vector machine classifiers is that the prior analysis of taxonomically defined sets and fitted distributions, correlations, and discriminative power of descriptors gives detailed insight in the behavior of a scoring model and can accommodate expert knowledge. It should therefore appeal to the experimental biologist, despite the fairly time-consuming construction of a suitable scoring model.
The scoring model proposed here is independent of the hairpin prediction step and can therefore be coupled to anyin silicoor experimental miRNA prediction method. It can facilitate the analysis of large sets of putative miRNA hairpin loci obtained in deep-sequencing efforts of small RNAs [10,14–16]. TheLscore approach can be used to rank and select interesting miRNA hairpin candidates for downstream experimental analysis in search for novel miRNAs. Moreover, our in-depth analyses of known miRNA hairpins from miRBase , our detailed descriptor analyses (Figure9) and theLscore approach here presented are likely to increase the reliability and evidence of miRBase entries and will help to further increase the biological relevance of the miRBase repository.
The complete set of 3,498 non-plant miRNA hairpin sequences were retrieved from the web resource miRBase version 9.0 . In addition, 474 miRNA hairpin sequences from human and chimpanzee  and 18 fromC. elegans were obtained from the supplementary material of the respective publications. Secondary structures of the sequences were predicted using RNAfold version 1.6  with the constrained folding option (-C) used to position the mature miRNA sequence(s) in the stem of the hairpin. The hairpin structure of six sequences, three from miRBase and three from the human/chimpanzee set , deviated considerably from the predicted characteristics of miRNA hairpins. These six sequences were therefore excluded from all subsequent analyses [see Additional file10]. The resulting 3,984 miRNA hairpin sequences were included in this study, 3,902 from Metazoa and 82 from virus genomes. Sequence and annotation of theC. elegansgenome (build 150) was obtained from Ensembl C. elegansunigenes (build 28) were downloaded from NCBI UniGene . Viral genome data for the Epstein-Barr virus [EMBL: AJ507799], Human cytomegalovirus [EMBL: X17403], Kaposi sarcoma-associated herpesvirus [EMBL: U75698] and Mareks disease virus [EMBL: AF243438] were obtained from EMBL .
Supplemental data are available through the Additional data files and the accompanying web documentμRNALL, which can be downloaded athttp://appliedbioinformatics.wur.nl/murnall/. All statistical analyses were performed using the package R , as integrated in python through Rpy .
A set of 40 potentially discriminative features of miRNA hairpins, hereafter referred to as descriptors, was defined based on the set of 3,984 miRNA hairpins. The descriptors include both physical and sequence characteristics of miRNA hairpins [Table1; see Additional files1and2]. A subset of descriptors is given in Table1. To take the evolutionary diversity of the descriptors into account in the statistical analyses, miRNA sequences were divided in hierarchically organized subsets based on their taxonomic relationships. Taxonomic sets that comprised at least 100 sequences were used for analysis. In total, 23 taxonomic sets were defined [see Additional file4], including one set containing all metazoan miRNA sequences, eleven sets representing metazoan taxa and eleven species-specific sets. The virus set was not used because it contained only 82 sequences. Unless stated otherwise, all results presented in this paper use the combined taxonomic set 'Metazoa' (3,902 miRNAs).
For all 3,902 sequences, descriptor values were calculated and their distributions within each of the 23 taxonomic sets were fitted to an appropriate probability distribution. Goodness-of-fit was determined by a Chi-square test. For each distribution, the probability that the descriptor takes a value less than or equal to a specified value was calculated as the cumulative distribution function (CDF) and transformed into a likelihood distribution function (LDF). For the LDF, a default cut-off value was set at 0.05, corresponding to the 95% confidence interval of the fitted distribution of the descriptor. For each descriptor, values of the CDF above the cut-off value were transformed to the LDF likelihood scoreS= 1. Values below the cut-off were transformed to the likelihood scoreS= (CDF/cut-off). Table1and Additional file2list for each descriptor whether the lower tail of the CDF, upper tail or both were transformed. As a result of this transformation, each descriptor in the taxonomic set has a likelihood distributionScomprising anS< 1 and anS= 1 fraction.S= 1 indicates a descriptor for which characteristics of the individual sequence are in 95% of the distribution.
To obtain a single metric for a given taxonomic set, the likelihood scoresSfor all descriptors were multiplied to obtain the combined likelihood scoreL. The ensemble of likelihood scoresSfor a given set of hairpin sequences is referred to as the scoring model.Lis the outcome of the scoring model and functions as classifier for miRNA hairpin sequences.Lranges between 0 and 1 and represents the likelihood of a hairpin sequence to be a true miRNA hairpin given the underlying descriptors used in the scoring model. It is possible to incorporate additional expert knowledge in the scoring model by assigning a relative weight to theSscore of an individual descriptor. In the default setting reported here, no difference between descriptors is made (assigned weight = 1). AnLscore of 1.0 for a hairpin sequence indicates thatS= 1 for each descriptor of the set.Lis only affected by descriptors with a valueS< 1.
To prevent potential over-penalization of hairpin sequences when combining correlated descriptors, we determined the independence (orthogonality) of all descriptors in theS< 1 fraction by calculating Cohen's kappa  for each combination of descriptors. The valueκ= 0 indicates that there is no more correlation between descriptors than expected by chance alone, andκ= 1 indicates that the descriptors are fully dependent. The discriminative power of a descriptor, i.e. its ability to distinguish true miRNA hairpins from non-miRNA hairpins, was calculated as the ratio of percentages of miRNA hairpins and genomic hairpins that comply with a given threshold for this descriptor. As threshold the descriptor's limiting value betweenS= 1 andS< 1 of the LDF was chosen (95% of the CDF). Discriminative power was calculated using known miRNA hairpins from the taxonomic set Metazoa and genomic hairpins from a set of 3,526,115 hairpins identified inC. elegans(see section Identification of putative miRNA hairpin structures). It is calculated with the formula for selectivity (see below), but for the sake of clarity we will here use the term 'discriminative power' for the performance of a single descriptor and the term 's electivity' for the performance of a scoring model.
To select a subset of descriptors that was most informative for the combined assessment of miRNA hairpins by theLclassifier, descriptors that either correlated with a more discriminative descriptor (κ>0.4) or that showed low discriminative power (< 1.1) were discarded from the initial set. The resulting subset was used to evaluate the impact of different settings of variables. For all models,Lscores were calculated for 100,000 randomly selected hairpins from theC. elegansgenome. We evaluated (1) the effect of the size of the input set, which refers to the number of miRNA hairpins in a given taxonomic set; (2) the impact of evolutionary distance between taxa; (3) the impact of different combinations of descriptors in a scoring model; (4) the effect of parameterization of descriptors and (5) the effect of weighting of descriptors.
TP and FN were counted from taxonomic sets of known miRNA hairpins, TN and FP were determined as a fraction of genome-wide identified hairpins. Although these sets of genomic hairpins contained an unknown number of true miRNAs (so FN and TP), this number was expected to be sufficiently small to be ignored. For uniform comparison, we benchmarked selectivity at discrete values of sensitivity (95% and/or 75%). Discrete points on the ROC curve correspond to pairs of sensitivity and specificity values, and as such describe the shape of the curve.
The performance of the classifierLwas compared with sensitivity, specificity and selectivity of threshold filtering on descriptors of miRNA and genomic hairpins. For the 18 most informative descriptors, the threshold used did represent the same cut-off value between theS= 1 andS< 1 fraction of the LDF, at 95% of the CDF at the side(s) of the distribution as listed in Table1. This cut-off was used as a binary decision criterion: below the threshold, the (miRNA) hairpin was included; above it was rejected.
A 10-fold cross-validation was performed on 200,000 randomly chosen genomic hairpins and repeated ten times. As input set, a non-redundant variant of the taxonomic set Metazoa was constructed. This involved clustering miRNA hairpins with identical mature miRNA seed sequences and an overall hairpin sequence identity larger than 90%. All but a single representative for each cluster were then removed, yielding a subset of 2,033 sequences.
The suffix-tree based tool VMatch  was used to identify small genomic hairpin structures in the genomes ofC. elegansand four viruses, using a sliding window of 1,000 nt with an overlap of 200 nt. The latter value exceeds the length of the largest known metazoan miRNA hairpin (153 nt). Each sequence window was stored as a VMatch database (index) and its reverse complement was used as query sequence in a VMatch search for degenerate palindromic sequences, allowing GU-base pairing. Parameter settings that allowed exhaustive retrieval of known miRNA hairpins were found empirically (data not shown). Such a palindromic sequence consists of two inverse complementary sequences for the stem, at a physical distance representing the loop of a putative hairpin. Palindromes were discarded if the distance was larger than 50 nt, which represents the upper limit of loop size in the vast majority of metazoan miRNAs. Overlapping palindromes were merged if they had at most 8 non-overlapping nucleotides on either side. With these parameter settings, only a small number of known miRNA hairpins was missed. The remaining set of palindromic sequences was used for secondary structure prediction using RNAfold version 1.6 with the constrained folding option (-C) to enforce the stem structure in the folding of the molecule . All hairpin structures were filtered for five threshold values: (1) minimal hairpin length = 45 nt; (2) minimal number of base pairs in the stem = 15; (3) minimal number of paired bases in the most stringently paired window of 24 positions in the hairpin stem = 15; (4) maximum length of a bulge in the stem = 29 nt; (5) minimal ratio of the number of paired positions divided by all positions in the stem (match-ratio) = 0.45.
Many of the genomic hairpins identified were overlapping or nested. Such hairpins were grouped into unique loci when the centers of their loops were less than 20 nt apart, regardless of the strand on which the hairpins were located.
Area Under the (ROC) Curve
cumulative distribution function
Kaposi sarcoma-associated herpesvirus
likelihood distribution function
Mareks disease Virus
minimal folding energy (kcal/mol)
Receiver Operating Characteristic
RNA secondary structure prediction
Skewed Normal distribution
Support Vector Machine
We thank prof. Cajo ter Braak (Biometris, WUR) for helpful discussions regarding the statistics of scoring models and their interpretation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.