In silico miRNA prediction in metazoan genomes: balancing between sensitivity and specificity
© van der Burgt et al; licensee BioMed Central Ltd. 2009
Received: 12 September 2008
Accepted: 30 April 2009
Published: 30 April 2009
MicroRNAs (miRNAs), short ~21-nucleotide RNA molecules, play an important role in post-transcriptional regulation of gene expression. The number of known miRNA hairpins registered in the miRBase database is rapidly increasing, but recent reports suggest that many miRNAs with restricted temporal or tissue-specific expression remain undiscovered. Various strategies for in silico miRNA identification have been proposed to facilitate miRNA discovery. Notably support vector machine (SVM) methods have recently gained popularity. However, a drawback of these methods is that they do not provide insight into the biological properties of miRNA sequences.
We here propose a new strategy for miRNA hairpin prediction in which the likelihood that a genomic hairpin is a true miRNA hairpin is evaluated based on statistical distributions of observed biological variation of properties (descriptors) of known miRNA hairpins. These distributions are transformed into a single and continuous outcome classifier called the L score. Using a dataset of known miRNA hairpins from the miRBase database and an exhaustive set of genomic hairpins identified in the genome of Caenorhabditis elegans, a subset of 18 most informative descriptors was selected after detailed analysis of correlation among and discriminative power of individual descriptors. We show that the majority of previously identified miRNA hairpins have high L scores, that the method outperforms miRNA prediction by threshold filtering and that it is more transparent than SVM classifiers.
The L score is applicable as a prediction classifier with high sensitivity for novel miRNA hairpins. The L- score approach can be used to rank and select interesting miRNA hairpin candidates for downstream experimental analysis when coupled to a genome-wide set of in silico-identified hairpins or to facilitate the analysis of large sets of putative miRNA hairpin loci obtained in deep-sequencing efforts of small RNAs. Moreover, the in-depth analyses of miRNA hairpins descriptors preceding and determining the L score outcome could be used as an extension to miRBase entries to help increase the reliability and biological relevance of the miRNA registry.
MicroRNAs (miRNAs) are ~21-nucleotide (nt) short, single stranded RNA molecules involved in post-transcriptional regulation of gene expression . They are present in higher eukaryotes and some viral genomes . Because the miRNA and small-interfering RNA (siRNA) pathways partly overlap, current understanding of miRNA biogenesis has gained from advances made in the field of RNA interference. Mature, functional miRNAs develop from degenerate palindromic repeats with a characteristic hairpin-like secondary structure [1, 3, 4]. Initially, experimental identification of miRNAs was achieved through direct cloning and sequencing of small RNAs [5, 6]. However, such relatively low-throughput screenings were biased towards abundantly or ubiquitously expressed miRNAs  and missed many miRNAs with restricted temporal or tissue-specific expression patterns . Recently, strategies using PCR , microarrays [8, 9] or ultra high-throughput sequencing [10, 11] have expanded the list of known miRNAs. Many of these show tissue-specific expression [9, 12] or appear to be species-specific [12, 13]. In both Arabidopsis thaliana [14, 15] and Caenorhabditis elegans [10, 16], for example, high-throughput sequencing of small RNAs shows moderate overlap in detected miRNAs between experiments in each species, indicating that also in well studied genomes many new miRNAs remain to be discovered. In addition, the observation that many miRNA loci exhibit compelling hairpin structures on both sense and antisense strands led to the discovery of anti-sense miRNA transcription . Antisense miRNA transcription and processing yield distinct mature miRNAs. This contributes to the functional diversification of miRNA genes for a considerable fraction of the known miRNA loci [17, 18].
Various methods for the in silico prediction of miRNAs have been developed to aid in experimental studies of miRNA discovery [19, 20]. These methods generally consider the hairpin-like secondary structure of the miRNA precursor, the miRNA hairpin, as the most important characteristic of a miRNA gene. They use RNA secondary structure prediction (RSSP) algorithms, such as RNAfold  or Mfold , to predict the secondary structure and thermodynamic stability of the RNA hairpin structures. Current bioinformatics approaches for the prediction of miRNAs [19, 20] generally include three steps: (1) genome-wide prediction of hairpin structures; (2) filtering or scoring of those hairpins on the basis of their similarity in physical and sequence features to known miRNA hairpins and (3) experimental validation of putative candidates.
A common approach for the first step is to search for hairpin structures using a sliding window and perform RSSP on each window [5, 8, 23, 24]. An improvement in overall calculation time over this approach is to first identify degenerate palindromic sequences and to analyse only these further with RSSP [25, 26]. Unfortunately, these approaches detect vast numbers of hairpin structures in complete eukaryotic genomes. Depending on the method used, 1E3 to 4E3 hairpins per Mb of genomic sequence are found, resulting in about 1E7 hairpins identified in the human genome [8, 24–27]. The challenge is, therefore, to devise an appropriate filtering method to separate the chaff from the wheat.
Different criteria for filtering candidate miRNA sequences have been proposed, generally with the aim to reduce the search space and/or to increase the specificity of prediction . Evolutionary conservation is considered an important feature of the hairpin sequence  and analysis thereof is often used to identify and focus comparisons on the conserved non-coding sequence space in different genomes [5, 23]. An evolutionary approach known as phylogenetic shadowing has been used for combined selection and filtering of miRNA candidates . This study revealed a characteristic camel-shaped conservation pattern of putatively orthologous miRNAs that was useful as a criterion for finding conserved miRNA candidates in primate genomes. Other filtering criteria include intragenomic matching of candidate miRNAs and their potential targets , evidence for expression, thresholds on structural properties of hairpins, e.g. minimal folding energy (MFE), absence of repetitive or low-complexity sequences, occurrence in introns or intergenic regions , or proximity to known miRNA loci [2, 24, 30].
Stringent filtering is performed to attain high specificity, that is, to minimize the number of false positive predictions of miRNA genes [20, 25]. Obviously, maximizing specificity increases the number of false negative predictions, that is, a decreased sensitivity . This implies that with stringent filtering, relevant miRNAs will be missed. The success of any filtering procedure depends however on the validity of the underlying assumptions. For example, filtering on evolutionary conservation will miss species-specific or fast evolving miRNAs  and low-complexity filtering is likely to miss miRNAs originating from transposable elements [32, 33]. Several recent methods employ machine learning techniques such as a Support Vector Machine (SVM) [24, 27, 34, 35] for classification. Such SVMs evaluate differences in hairpin properties between true-positive and true-negative examples of miRNAs for a given taxon to generate a prediction classifier. SVMs have been successfully used for miRNA gene prediction [24, 34, 35] and for the prediction of 5' Drosha processing sites in miRNA hairpins [26, 11]. Although the SVM approach is claimed to outperform earlier methods , SVM-based classifications combine many features in a single kernel function and therefore do not provide direct insight into the biological significance of these features. Such insight can only be obtained through expert analyses and dedicated feature selection procedures . Moreover, the set of true-negatives which is required for training an SVM is often very difficult to define.
Here, we present an innovative strategy for miRNA prediction that focuses on attaining optimal sensitivity. We define and combine 40 different filtering criteria and, using a set of genomic hairpins identified in the genome of Caenorhabditis elegans, show that 18 of these characteristics capture the biological variation of miRNA features present in sets of known miRNA hairpins. These 18 criteria are used to establish a combined likelihood score L that assesses the likelihood that a predicted hairpin structure in a genome contains a genuine miRNA. L is a continuous classifier that allows user-adjustable thresholds for sensitivity and specificity in ab initio miRNA prediction and miRNA analysis. Good performance of L for large sets of hairpins from the genomes of C. elegans and four viruses demonstrates the added value of the new analytical strategy for future miRNA discovery and selection.
Subset of 18 most informative miRNA hairpin descriptors
ratio asymmetrical bulges vs. stem length
adjusted base pairing propensity (dP)
longest bulge in stem (nt)
longest match-stretch in stem (nt)
central loop length (nt)
max match count
matches in 24 nt
MFEahl index 
MFEahl corrected for GC- Content
Normalized Shannon entropy (Q)
surplus of GA over CU in sequence
1.12 (u) 1.05 (l)
surplus of G over C in sequence
longest poly-A stretch (nt)
longest mono-nucleotide stretch (nt) in the hairpin
longest poly-U stretch (nt)
Di-nucleotide Sequence Complexity (-)
Mono-nucleotide Sequence Complexity (-)
ratio of GU-matches vs. all matches
MFE Adjusted for hairpin length
Descriptor data fit
Likelihood score S for miRNA hairpin descriptors
The CDF of the fitted distribution was transformed into an LDF with outcome S. The range of S (between 0 and 1) has an S < 1 and an S = 1 fraction which are separated by the cut-off value derived from the 95% confidence interval of the descriptor's CDF. The S < 1 fraction contains miRNAs hairpins with descriptor values in the tail(s) of the distribution. These have a low probability of occurring in true miRNA hairpins. The S = 1 fraction contains miRNA hairpins with values in the remainder of the distribution and corresponds to likely properties of miRNA hairpins. Descriptors were treated differently with respect to the transformation of the tails of the CDF (Table 1; [see Additional file 2]). For example, for the descriptor "minimal folding energy" (MFE) of a miRNA, there is in principle no need to impose a lower bound, even though the fitted distribution (Figure 1A) indicates that very low MFE values occur rarely in known miRNAs. The LDF of the descriptor MFE assigns a score S = 1 for all values below -23.72 kcal/mol. Higher MFEs are penalized proportional to the LDF and therefore assigned a score S < 1. MFE is an example of a descriptor where the S < 1 fraction represents the upper 5% tail of the confidence interval. For other descriptors, the S < 1 fraction is represented by the lower 5% tail of the distribution (e.g. match ratio) or by both the lower 5% and the upper 5% tail (e.g. GC-content, Figure 1B). Table 1 and Additional file 1 list the unlikely tails for each descriptor.
Correlation of descriptors
The 40 descriptors here defined were either based on the sequence of the hairpin, the structure of the hairpin or on a combination of both (Table 1; [see Additional file 1]). Correlations among these were obvious, for instance, between stem length or GC-content on the one hand and the MFE of a hairpin on the other hand. Correlated descriptors will overemphasize the importance of a more general feature of a miRNA hairpin and affect the usefulness of L. To assess the correlation among descriptors in their S < 1 fractions, we calculated Cohen's kappa coefficient κ  for all 780 possible pairs of descriptors, using the miRNA hairpins of the taxonomic set Metazoa [see Additional file 3]. For each descriptor the most strongly correlated descriptor, as determined by the highest observed κ is listed in Table 1 (column κ) and Additional file 3. The highest κ was found between descriptors 'GCratio' and 'GsurplusC' (κ = 0.995), followed by 'D' and 'Q' (κ = 0.844) and the pair 'P' and 'Z' (κ = 0.784).
Discriminative power of descriptors
Delimiting a subset of most informative descriptors
The correlation and discriminative power of descriptors were used to select a non-redundant subset of most informative descriptors. Descriptors that either correlated with a more selective descriptor, using a threshold for correlation of κ>0.4, or that had a discriminative power smaller than 1.1 were omitted. This resulted in a subset of 18 descriptors, seven of which were sequence related, nine structure related, and two descriptors with mixed properties (Table 1). Remarkably, the descriptors 'GC-content' and the MFE randomization descriptors 'P'  and 'Z' , which are often used in miRNA prediction studies, were not included in the subset [see Additional file 1]. The latter two ranked among the most selective descriptors, but were excluded because of their strong correlation with the most selective descriptor MFEahl index, in which the MFE is adjusted for hairpin length and GC-content [see Additional file 3].
Assessment of scoring model performance
The 18 most selective descriptors (Table 1) were used to define and analyze different scoring models. The CDF of the fitted distribution of all 18 descriptors was transformed into an LDF with outcome S and the L score for a given miRNA sequence was calculated as the product of all S values. Scoring model performance, defined as the power to distinguish (potential) miRNA hairpins from other (or random) genomic hairpins, was compared for different models that were built using varying settings for five parameters (see below). Scoring model performance was measured as AUC performance and, where appropriate, with selectivity measured at two values of sensitivity (95% and 75%), using genomic hairpins in C. elegans and the collection of miRNAs hairpins in the taxonomic set Metazoa.
(1) Size of taxonomic set
(2) Composition of taxonomic set
(3) Composition of descriptor subset
(4) Parameterization of the LDF
(5) Weighting of descriptors
The effect of weighting individual descriptors was studied by assigning weights to each of the 18 previously selected descriptors. Weights were equal to the square root of a descriptor's discriminative power as measured at a sensitivity of 95% and ranged from 1.06 for GAsurplusCU to 3.65 for MFEahl (Table 1; [see Additional file 1]). Figure 6 shows that a scoring model with weighted descriptors (W95% and W90%) had a significantly increased AUC performance and selectivity index relative to their unweighted models (95% and 90%). Obviously, when weighted on the basis of their discriminative power, descriptors with a high discriminative power contribute stronger to the overall L score than descriptors with low discriminative power. As such, a better separation between miRNA hairpins and random genomic hairpins is accomplished.
Finally, combinations of weighting and varying the parameterization of descriptors were tested (models W90% and W80%, Fig. 6). Scoring model W90% had the highest AUC performance on metazoan miRNA hairpins of all tested models. When weighting and parameterization were compared, weighting showed a stronger effect on selectivity measured at 95% sensitivity, whereas parameterization had a stronger effect on selectivity at 75% sensitivity. This implies that both variables have a distinct effect on the shape of the ROC curve.
Building optimal scoring models
Scoring model performance
Scoring model description
Chi – Square d
Default models g
Optimized models g
Combined likelihood score L for miRNA hairpin descriptors
Comparison of the L score approach to threshold filtering
Comparison of threshold filtering of miRNA hairpins with the L-score classifier
Scoring model Metazoa
Fixed at specificity
Fixed at sensitivity
Predicting hairpins from a genomic sequence
Identified (miRNA) hairpins in genomes of C. elegans and four viruses
Identified miRNA hairpins a
Identified genomic hairpins
miRNA hairpins % (number) b
genomic hairpin loci % (number) b
L = 1.0
L ≥ 0.05
L ≥ 1e-5
L = 1.0
L ≥ 0.05
L ≥ 1e-5
In the four viral genomes, 25,599 hairpins were identified, including all 55 known miRNAs hairpins of these viruses. In C. elegans, 3,526,115 hairpins were predicted and only four out of 132 known miRNAs hairpins were missed (cel-mir-262, cel-mir-260, cel-mir-272 and cel-mir-256). When benchmarking the performance of the hairpin identification with the miRBase entries  of all metazoan miRNA hairpins, 3,803 out of 3,902 (97.5%) miRNA hairpins were recovered. The hairpin prediction algorithm is independent of sequence context (data not shown). This benchmark is therefore an estimate of the algorithm's good performance on metazoan genomes. Similar performance has been reported for other edit-distance based hairpins detection methods . On a genome scale, all these methods yield around 10,000–20,000 hairpins per single stranded Mb of sequence. These hairpins are predominantly overlapping and nested.
Analyses of miRNA hairpin candidates in viral genomes
The data in Table 4 show that only 1.3–2.6% of the indentified hairpin loci in four viral genomes have a high L score (L ≥ 0.05). This corresponds to 69–247 loci per genome. In addition, 83–100% of all known miRNA hairpins comply with this threshold for L. Out of 6,182 hairpins predicted in the genome of EBV, only 23 hairpins, originating from 20 unique loci, had an L score of 1.0. Ten of these were experimentally validated miRNA loci  [see Additional file 5]. Further support for our miRNA prediction and scoring method comes from recently discovered miRNAs in the MDV genome  that were not included in miRBase 9.0. All five novel miRNAs (mdv1-mir-M9 to mdv1-mir-M13) were present in the set of here predicted hairpins and three of these had L score of 1.0 [see Additional file 6]. The 15 unique loci in MDV with an L score of 1.0 collapsed into eight unique sequences due to a large inverted repeat. Out of these eight, all five loci that did not overlap with annotated exons corresponded to known miRNA loci.
The two examples show that most hairpins in viral genomes with high L score are true miRNA hairpins and that the absolute number of hairpin loci with high L scores is small. This allowed us to manually examine the remaining, non-miRNA loci with high L scores, using additional filtering criteria for genomic location such as proximity to known miRNAs  and intronic position . Among the remaining loci in MDV with L ≥ 0.05, one appeared to be located in the transcribed strand of an intron and two others closely flanked (0.3 kb) the mdv1-mir-M1 gene in the same orientation [see Additional file 6]. Similarly, out of the remaining ten EBV loci with L = 1.0, two candidate miRNAs were located directly upstream and amidst a cluster of eleven known miRNAs in an intronic region of the BART gene  [see Additional file 5].
Mining the C. elegans genome for putative miRNA hairpins
Mining the C. elegans genome for putative miRNA hairpins
Rejected miRNA hairpins a
5 kb flanking sequence
Similarity to mature miRNA c
No overlap with exons d
L = 1.0
L >= 1e-5
Exclude TRF overlap e
Filter on 7 descriptors
Exclude known miRNA loci
similarity positioned correctly f
The data presented for the filtering protocols "Clustered" and "Similar" (Table 5) show that combined filtering on L score, genomic context and threshold filtering allows for compilation of a priority list of candidate miRNAs that is amenable to manual inspection and experimental verification. Apart from filtering on genomic clustering or similarity to known miRNAs, filtering on L score attains the largest data reduction. This shows that the L score was important in compilation of the priority list and demonstrates the added value of our approach for in silico miRNA prediction.
We here present a new computational strategy for the in silico prediction of miRNA hairpins and show the applicability of the method for predicting new candidate miRNA hairpins in four viral genomes and in the genome of C. elegans. While using the latter as an example for model construction, the L score method as here optimized for C. elegans is well usable for other metazoan genomes. However, further improvement of performance of the method on different taxonomic groups can be achieved by constructing dedicated scoring models [see Additional file 9]. Our strategy aims at minimizing the number of false negative predictions (optimal sensitivity), rather than at minimizing the number of false positive predictions (optimal specificity), as proposed in previous studies. Focusing on sensitivity rather than specificity should help uncover new classes of miRNA molecules in biological systems. Hairpins in genomic sequences are identified with the help of an adjusted suffix-tree based method. When the performance of the hairpin prediction was benchmarked on sets of all known miRNAs hairpins from viruses and Metazoa, all 55 viral miRNAs were recovered (100%), 128 of 132 (97%) C. elegans miRNAs were recovered and 3,803 out of 3,902 (97.5%) metazoan miRNAs were shown to be recoverable when considered in their genomic context. Similar performance was reported for another edit distance-based hairpin identification method .
Four C. elegans miRNAs (cel-mir-262, cel-mir-260, cel-mir-272 and cel-mir-256) remained undetected due to the absence of a stringently base-pairing area in their stems. They all had extremely poor L scores (2.7e-41, 2.5e-27, 3.5e-20 and 3.8e-10), ranking first, third, fourth and eight among the C. elegans miRNA hairpins with lowest L scores. None of the four was found in recent high-throughput sequencing datasets, which otherwise retrieved the vast majority of known C. elegans miRNAs [10, 16]. This suggests that these four may not be genuine miRNAs. If so, the reported performance of our hairpin prediction method is underestimated.
The biological variation and evolutionary diversity of various properties of miRNA hairpins were captured in a likelihood score L, based on statistics derived from accurately fitted (generally skewed normal) distributions of hairpin characteristics derived from known miRNA hairpins. In total 40 hairpin characteristics were defined and analyzed. The L score is a measure for a single hairpin sequence: a descriptor that captures evolutionary conservation in another species is not included. Although conservation has proven to be an extremely selective miRNA detection criterion , it conflicts with the aim to maximize sensitivity, because of the existence of species-specific miRNAs. The strategy was evaluated on genomic hairpins and known miRNAs from the C. elegans genome. Although details of analyses and results are likely to differ when applied to other genomes and other negative sets of genomic hairpins, major trends and results were shown to be similar [see Additional file 9].
Based on analyses of correlation and discriminative power, 18 hairpin characteristics were identified as most selective. Comparison of the 18 descriptor scoring model with a binary threshold filtering protocol using the same 18 descriptors (Table 3) shows that a 17% higher selectivity is achieved with the L score strategy. Binary decision thresholds on all or even a few descriptors can easily result in a major decrease of sensitivity. In the strategy developed here, L < 1 represents individual sequences that have one or more descriptors with S < 1, indicating that these descriptors have a relatively low probability of occurrence in miRNA hairpins because they occur in the tail(s) of their respective distribution. When a descriptor value falls outside the observed range of biological variation, the continuous likelihood score L allows for the compensation of an unlikely score for a single descriptor by likely scores for other descriptors. In such a case, the sequence is not a priori rejected as a miRNA hairpin candidate. The L score thus allows for more deviations from 'genuine' miRNA characteristics than binary selection (yes/no) on the basis of the same characteristics. This is an important improvement over the use of pre-defined thresholds for filtering on single or multiple descriptors published previously . Assigning scores to descriptors, as opposed to binary selection on pre-defined thresholds, has also been used in previous work. For example, MIRscan [5, 6] employs an heuristic score assignment to seven features and assigns weights based on the relative entropy between known miRNA hairpins and genomic hairpins. PalGrade  uses a statistical distribution by arbitrarily binning the ordered vector of descriptor values. SVM kernels achieve descriptor scoring and weighting as part of the SVM . Novel to the approach here developed is that the score assignment is based entirely on the statistical evaluation of the variation in physical and sequence properties observed in known miRNAs and no binary selection prior to building a scoring model is used.
The L score is a relative measure of the likelihood that a given hairpin is a miRNA hairpin candidate. An important parameter contributing to useful L scores is the number of miRNAs used to derive the discriminating statistics of the individual characteristics. A minimum of about a thousand miRNA hairpins is required, but results indicate that the larger the available data set, the better the scoring model captures the variation in miRNA hairpin characteristics and performs. In the context of machine learning, the input miRNA set could be considered a training set, although in the L score derivation no formal 'training' is included. It is noteworthy that the evolutionary distance between the species contained in the taxonomic set for a scoring model influences L considerably. In general, the scoring model based on the set that is taxonomically closest to the organism being analyzed performs best. This indicates that over a wide range of characteristics, miRNA hairpins within (related) species are substantially more alike than miRNA hairpin sequences between less-related species. This should be taken into account when searching for similarity between miRNA hairpins from distantly related species. However, we observed that, for example, taxonomic sets that do not contain human miRNA hairpins can yield scoring models that accurately identify human miRNA hairpins. In addition, descriptor weighting and parameterization appeared to considerably influence the performance of a scoring model. The analyses presented allow the selection of optimal scoring models, with maximal discriminative power to distinguish true miRNA hairpins from other genomic (or random) hairpins for a selected set and a given data set. However, each set of data, for example in case of a new genome sequence, will require its own analysis to build an optimal scoring model.
Whereas the analyses started with 40 descriptors, based on extensive correlation and selectivity analyses, a subset model based on 18 descriptors was performing better than the model comprising all 40 descriptors. In view of the importance of the taxonomic composition of the set used in the model, it should be pointed out, that this may reflect a taxonomic bias for metazoan sequences that may not be valid for other taxonomic groups, such as, for example plant miRNA hairpins. In the set of 18 most informative descriptors selected, it is remarkable that three descriptors that are generally considered important are not represented: GC content and the MFE randomization descriptors P and Z. GC-content is widely used as a pre-filtering step of in silico miRNA prediction methods , but is not among the 18 descriptors used in the final scoring models. The GC-content showed the highest variability in the (skewed-normal) mean over different taxonomic sets (data not shown), reflecting the apparently large variation in GC-content found among miRNA hairpins from different species. Any scoring model that includes the fitted distribution of GC-content would disqualify hairpin structures in species containing miRNAs with relatively high or low GC-content. Also the descriptors P and Z, based on MFE randomization shown to be significantly lower for miRNAs hairpins than for randomized sequences [37, 38], were not among the18 descriptors selected. Although P and Z ranked among the most selective descriptors, they were excluded because of their strong correlation [see Additional file 3] with the most selective descriptor MFEahl index, in which the MFE is adjusted for hairpin length and GC-content.
SVMs have similar aims as the L-score strategy and perform well in miRNA classification . In assessing and comparing the performance of different methods, however, several caveats should be considered. First, methods that use evolutionary conservation perform well on conserved miRNAs [25, 11] but fail to detect species-specific or fast evolving miRNAs . The importance of the latter should not be underestimated. Second, the particular data set(s) on which the performance is achieved is important to the evaluation and comparison of the results from different methods. Unfortunately, there is no benchmark set of both positive and negative examples of miRNA hairpins available. Many methods are tailored on a specific organism and are likely to perform best in that exact context. Assembling a set of true negative sequences is particularly challenging: hairpins in non-coding RNAs, e.g. the set of tRNAs as used in , are likely to possess different and more diverse hairpin features than miRNA hairpins, whereas a set of genomic hairpins might contain bona fide miRNA hairpins. SVMs explicitly require negative examples for training and testing, whereas the L scoring method uses such a set only for benchmarking purposes. Third, data on prediction performance are not reported consistently in literature. Our method enables reporting of AUC performance as well as sensitivity and specificity values over the entire range of the ROC curve. Comparison with binary classifiers from other methods therefore requires transformation of the continuous outcome to a binary outcome by choosing an arbitrary threshold for L and using the associated sensitivity and specificity values as measure of the performance.
With these caveats in mind, we compared the performance of our method to three leading SVM-based methods miPred , RNAmicro  and miRNA SVM  Note that RNAmicro is based on multiple sequence alignments. Using different positive and negative datasets, these methods report the following values of sensitivity and specificity; 1) MiPred: 86.69% and 97.68%, using 323 human miRNAs as positive and 646 human genomic hairpins as negative set; 2) MiPred: 87.65% and 97.75%, using 1,918 Metazoan, non-human miRNAs as positive and 3,836 human genomic hairpins as negative set; 3) RNAmicro: 90% and 99%, using 147 Metazoan miRNA hairpin alignments as positive and 383 shuffled miRNA hairpin and tRNA alignments as negative set, and; 4) miRNA SVM: 90% and 95%, using 322 human miRNAs as positive and 3,000 random human genomic hairpins as negative set. These performances compare well to the values of 87.26% and 97.02% obtained in the 10 fold-cross validated performance of our L score model, using 203 Metazoan miRNA hairpins as positive and 200,000 randomly selected genomic hairpins from C. elegans as negative set. The performance of the L scoring model is most similar to that of miPred, which does not include sequence conservation as a parameter. An analysis of the performance of our model on sets of genomic hairpins other than those derived for the C. elegans genome is provided as Additional file [see Additional file 9].
A major challenge of any SVM is understanding its behavior in, for example, a biological context. While SVMs are known to produce classifiers that perform well in case of unseen data , SVMs are essentially black-box classifiers. This makes it difficult to judge the relative importance of individual descriptors or to translate results in biologically relevant understanding. As all parameters are embedded in the kernel function of the SVM, SVM classifiers are also difficult to adjust, although they do not require the pre-selection of parameters required for the L score strategy here presented. Classifier selection based on the detailed descriptor analysis presented here may improve future SVM approaches. A key advantage of the L score strategy over SVMs is that the contribution of individual descriptors to a scoring model can be analyzed in a straightforward way by adding or removing a descriptor or changing the parameterization or weight of descriptors. This way the scoring model becomes better tailored to the biologist's needs in a particular research environment.
At the laboratory bench, the criterion that is of most interest is simply how many putative hairpins should be evaluated by experimentation. With current developments in microarray analysis and high-throughput sequencing, the numbers of potential candidates that can be screened with relative ease will increase dramatically. Still large numbers of new miRNAs may be identified. Yet, for the time being, the individual laboratory would like to see as little putative candidates as possible with as high a success rate as feasible. The highest L score is 1, implying that the given hairpin scores are maximal for all descriptors in the model. In the 100 Mb large genome of C. elegans, still 3,110 hairpin loci remain that cannot be ranked further on the basis of L. It implies that relatively large numbers of genomic hairpins (3,110 loci from 3,526,115 hairpins, i.e. 0.09%; see Table 4) comply with all miRNA hairpin descriptors, whereas it is unlikely that they all generate mature miRNAs, given the relative small number of 132 currently known C. elegans miRNAs. It is likely that this situation will occur in most genomic contexts. If so, several strategies are open. The model parameters could be adjusted, so that the individual descriptor is less likely to get the maximal score. This way, the L score approach will convert to more traditional threshold filtering. Given that the analysis requires high sensitivity, it would however be more advantageous to incorporate more biological expert knowledge in the selection process, such as the presence of the hairpin in an intron, sequence similarity to known miRNAs, etc. When following this strategy, one has to be aware that the number of novel miRNAs that could be discovered is constrained by the filtering on genomic context. For example, in the set of 132 miRNAs in C. elegans (miRBase 9.0), 87 occur in singletons when clustered on distance, 56 have no other family representative and 102 are not located in an intron . These numbers can thus be considered as indicative for the fraction of true miRNAs that will remain concealed when filtering on genomic context. With the filtering protocols ("Clustered" and "Similar") we show that a combination of filtering on L score, genomic context and threshold filtering allows for compilation of a priority list of manageable size for manual inspection and further experimentation.
In addition to good performance in comparison with other leading (SVM-based) methods and a user-defined selectivity, an additional advantage of the L score approach over threshold filtering and support vector machine classifiers is that the prior analysis of taxonomically defined sets and fitted distributions, correlations, and discriminative power of descriptors gives detailed insight in the behavior of a scoring model and can accommodate expert knowledge. It should therefore appeal to the experimental biologist, despite the fairly time-consuming construction of a suitable scoring model.
The scoring model proposed here is independent of the hairpin prediction step and can therefore be coupled to any in silico or experimental miRNA prediction method. It can facilitate the analysis of large sets of putative miRNA hairpin loci obtained in deep-sequencing efforts of small RNAs [10, 14–16]. The L score approach can be used to rank and select interesting miRNA hairpin candidates for downstream experimental analysis in search for novel miRNAs. Moreover, our in-depth analyses of known miRNA hairpins from miRBase , our detailed descriptor analyses (Figure 9) and the L score approach here presented are likely to increase the reliability and evidence of miRBase entries and will help to further increase the biological relevance of the miRBase repository.
Sequence and annotation data
The complete set of 3,498 non-plant miRNA hairpin sequences were retrieved from the web resource miRBase version 9.0 . In addition, 474 miRNA hairpin sequences from human and chimpanzee  and 18 from C. elegans  were obtained from the supplementary material of the respective publications. Secondary structures of the sequences were predicted using RNAfold version 1.6  with the constrained folding option (-C) used to position the mature miRNA sequence(s) in the stem of the hairpin. The hairpin structure of six sequences, three from miRBase and three from the human/chimpanzee set , deviated considerably from the predicted characteristics of miRNA hairpins. These six sequences were therefore excluded from all subsequent analyses [see Additional file 10]. The resulting 3,984 miRNA hairpin sequences were included in this study, 3,902 from Metazoa and 82 from virus genomes. Sequence and annotation of the C. elegans genome (build 150) was obtained from Ensembl C. elegans unigenes (build 28) were downloaded from NCBI UniGene . Viral genome data for the Epstein-Barr virus [EMBL: AJ507799], Human cytomegalovirus [EMBL: X17403], Kaposi sarcoma-associated herpesvirus [EMBL: U75698] and Mareks disease virus [EMBL: AF243438] were obtained from EMBL .
Informatics and statistics
Supplemental data are available through the Additional data files and the accompanying web document μRNALL, which can be downloaded at http://appliedbioinformatics.wur.nl/murnall/. All statistical analyses were performed using the package R , as integrated in python through Rpy .
Definition of miRNA descriptors
A set of 40 potentially discriminative features of miRNA hairpins, hereafter referred to as descriptors, was defined based on the set of 3,984 miRNA hairpins. The descriptors include both physical and sequence characteristics of miRNA hairpins [Table 1; see Additional files 1 and 2]. A subset of descriptors is given in Table 1. To take the evolutionary diversity of the descriptors into account in the statistical analyses, miRNA sequences were divided in hierarchically organized subsets based on their taxonomic relationships. Taxonomic sets that comprised at least 100 sequences were used for analysis. In total, 23 taxonomic sets were defined [see Additional file 4], including one set containing all metazoan miRNA sequences, eleven sets representing metazoan taxa and eleven species-specific sets. The virus set was not used because it contained only 82 sequences. Unless stated otherwise, all results presented in this paper use the combined taxonomic set 'Metazoa' (3,902 miRNAs).
Individual likelihood score S for each descriptor
For all 3,902 sequences, descriptor values were calculated and their distributions within each of the 23 taxonomic sets were fitted to an appropriate probability distribution. Goodness-of-fit was determined by a Chi-square test. For each distribution, the probability that the descriptor takes a value less than or equal to a specified value was calculated as the cumulative distribution function (CDF) and transformed into a likelihood distribution function (LDF). For the LDF, a default cut-off value was set at 0.05, corresponding to the 95% confidence interval of the fitted distribution of the descriptor. For each descriptor, values of the CDF above the cut-off value were transformed to the LDF likelihood score S = 1. Values below the cut-off were transformed to the likelihood score S = (CDF/cut-off). Table 1 and Additional file 2 list for each descriptor whether the lower tail of the CDF, upper tail or both were transformed. As a result of this transformation, each descriptor in the taxonomic set has a likelihood distribution S comprising an S < 1 and an S = 1 fraction. S = 1 indicates a descriptor for which characteristics of the individual sequence are in 95% of the distribution.
Likelihood score L for the combined descriptor values
To obtain a single metric for a given taxonomic set, the likelihood scores S for all descriptors were multiplied to obtain the combined likelihood score L. The ensemble of likelihood scores S for a given set of hairpin sequences is referred to as the scoring model. L is the outcome of the scoring model and functions as classifier for miRNA hairpin sequences. L ranges between 0 and 1 and represents the likelihood of a hairpin sequence to be a true miRNA hairpin given the underlying descriptors used in the scoring model. It is possible to incorporate additional expert knowledge in the scoring model by assigning a relative weight to the S score of an individual descriptor. In the default setting reported here, no difference between descriptors is made (assigned weight = 1). An L score of 1.0 for a hairpin sequence indicates that S = 1 for each descriptor of the set. L is only affected by descriptors with a value S < 1.
Correlation and discriminative power of descriptors
To prevent potential over-penalization of hairpin sequences when combining correlated descriptors, we determined the independence (orthogonality) of all descriptors in the S < 1 fraction by calculating Cohen's kappa  for each combination of descriptors. The value κ = 0 indicates that there is no more correlation between descriptors than expected by chance alone, and κ = 1 indicates that the descriptors are fully dependent. The discriminative power of a descriptor, i.e. its ability to distinguish true miRNA hairpins from non-miRNA hairpins, was calculated as the ratio of percentages of miRNA hairpins and genomic hairpins that comply with a given threshold for this descriptor. As threshold the descriptor's limiting value between S = 1 and S < 1 of the LDF was chosen (95% of the CDF). Discriminative power was calculated using known miRNA hairpins from the taxonomic set Metazoa and genomic hairpins from a set of 3,526,115 hairpins identified in C. elegans (see section Identification of putative miRNA hairpin structures). It is calculated with the formula for selectivity (see below), but for the sake of clarity we will here use the term 'discriminative power' for the performance of a single descriptor and the term 'selectivity' for the performance of a scoring model.
Descriptor selection and model evaluation
To select a subset of descriptors that was most informative for the combined assessment of miRNA hairpins by the L classifier, descriptors that either correlated with a more discriminative descriptor (κ>0.4) or that showed low discriminative power (< 1.1) were discarded from the initial set. The resulting subset was used to evaluate the impact of different settings of variables. For all models, L scores were calculated for 100,000 randomly selected hairpins from the C. elegans genome. We evaluated (1) the effect of the size of the input set, which refers to the number of miRNA hairpins in a given taxonomic set; (2) the impact of evolutionary distance between taxa; (3) the impact of different combinations of descriptors in a scoring model; (4) the effect of parameterization of descriptors and (5) the effect of weighting of descriptors.
Performance of L
TP and FN were counted from taxonomic sets of known miRNA hairpins, TN and FP were determined as a fraction of genome-wide identified hairpins. Although these sets of genomic hairpins contained an unknown number of true miRNAs (so FN and TP), this number was expected to be sufficiently small to be ignored. For uniform comparison, we benchmarked selectivity at discrete values of sensitivity (95% and/or 75%). Discrete points on the ROC curve correspond to pairs of sensitivity and specificity values, and as such describe the shape of the curve.
The performance of the classifier L was compared with sensitivity, specificity and selectivity of threshold filtering on descriptors of miRNA and genomic hairpins. For the 18 most informative descriptors, the threshold used did represent the same cut-off value between the S = 1 and S < 1 fraction of the LDF, at 95% of the CDF at the side(s) of the distribution as listed in Table 1. This cut-off was used as a binary decision criterion: below the threshold, the (miRNA) hairpin was included; above it was rejected.
A 10-fold cross-validation was performed on 200,000 randomly chosen genomic hairpins and repeated ten times. As input set, a non-redundant variant of the taxonomic set Metazoa was constructed. This involved clustering miRNA hairpins with identical mature miRNA seed sequences and an overall hairpin sequence identity larger than 90%. All but a single representative for each cluster were then removed, yielding a subset of 2,033 sequences.
Identification of putative miRNA hairpin structures
The suffix-tree based tool VMatch  was used to identify small genomic hairpin structures in the genomes of C. elegans and four viruses, using a sliding window of 1,000 nt with an overlap of 200 nt. The latter value exceeds the length of the largest known metazoan miRNA hairpin (153 nt). Each sequence window was stored as a VMatch database (index) and its reverse complement was used as query sequence in a VMatch search for degenerate palindromic sequences, allowing GU-base pairing. Parameter settings that allowed exhaustive retrieval of known miRNA hairpins were found empirically (data not shown). Such a palindromic sequence consists of two inverse complementary sequences for the stem, at a physical distance representing the loop of a putative hairpin. Palindromes were discarded if the distance was larger than 50 nt, which represents the upper limit of loop size in the vast majority of metazoan miRNAs. Overlapping palindromes were merged if they had at most 8 non-overlapping nucleotides on either side. With these parameter settings, only a small number of known miRNA hairpins was missed. The remaining set of palindromic sequences was used for secondary structure prediction using RNAfold version 1.6 with the constrained folding option (-C) to enforce the stem structure in the folding of the molecule . All hairpin structures were filtered for five threshold values: (1) minimal hairpin length = 45 nt; (2) minimal number of base pairs in the stem = 15; (3) minimal number of paired bases in the most stringently paired window of 24 positions in the hairpin stem = 15; (4) maximum length of a bulge in the stem = 29 nt; (5) minimal ratio of the number of paired positions divided by all positions in the stem (match-ratio) = 0.45.
Grouping identified genomic hairpins into unique loci
Many of the genomic hairpins identified were overlapping or nested. Such hairpins were grouped into unique loci when the centers of their loops were less than 20 nt apart, regardless of the strand on which the hairpins were located.
Area Under the (ROC) Curve
cumulative distribution function
Kaposi sarcoma-associated herpesvirus
likelihood distribution function
Mareks disease Virus
minimal folding energy (kcal/mol)
Receiver Operating Characteristic
RNA secondary structure prediction
Skewed Normal distribution
Support Vector Machine.
We thank prof. Cajo ter Braak (Biometris, WUR) for helpful discussions regarding the statistics of scoring models and their interpretation.
- Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004, 116 (2): 281-297. 10.1016/S0092-8674(04)00045-5.View ArticlePubMedGoogle Scholar
- Pfeffer S, Sewer A, Lagos-Quintana M, Sheridan R, Sander C, Grasser FA, van Dyk LF, Ho CK, Shuman S, Chien M, et al: Identification of microRNAs of the herpesvirus family. Nature methods. 2005, 2 (4): 269-276. 10.1038/nmeth746.View ArticlePubMedGoogle Scholar
- Zeng Y, Cullen BR: Efficient processing of primary microRNA hairpins by Drosha requires flanking nonstructured RNA sequences. The Journal of biological chemistry. 2005, 280 (30): 27595-27603. 10.1074/jbc.M504714200.View ArticlePubMedGoogle Scholar
- Berezikov E, Plasterk RH: Camels and zebrafish, viruses and cancer: a microRNA update. Human molecular genetics. 2005, 14 (Spec No 2): R183-190. 10.1093/hmg/ddi271.View ArticlePubMedGoogle Scholar
- Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP: The microRNAs of Caenorhabditis elegans. Genes & development. 2003, 17 (8): 991-1008. 10.1101/gad.1074403.View ArticleGoogle Scholar
- Lagos-Quintana M, Rauhut R, Meyer J, Borkhardt A, Tuschl T: New microRNAs from mouse and human. RNA (New York, NY). 2003, 9 (2): 175-179.View ArticleGoogle Scholar
- Hertel J, Lindemeyer M, Missal K, Fried C, Tanzer A, Flamm C, Hofacker IL, Stadler PF: The expansion of the metazoan microRNA repertoire. BMC genomics. 2006, 7: 25-10.1186/1471-2164-7-25.PubMed CentralView ArticlePubMedGoogle Scholar
- Bentwich I, Avniel A, Karov Y, Aharonov R, Gilad S, Barad O, Barzilai A, Einat P, Einav U, Meiri E, et al: Identification of hundreds of conserved and nonconserved human microRNAs. Nature genetics. 2005, 37 (7): 766-770. 10.1038/ng1590.View ArticlePubMedGoogle Scholar
- Beuvink I, Kolb FA, Budach W, Garnier A, Lange J, Natt F, Dengler U, Hall J, Filipowicz W, Weiler J: A novel microarray approach reveals new tissue-specific signatures of known and predicted mammalian microRNAs. Nucleic acids research. 2007, 35 (7): e52-10.1093/nar/gkl1118.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP: Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell. 2006, 127 (6): 1193-1207. 10.1016/j.cell.2006.10.040.View ArticlePubMedGoogle Scholar
- Stark A, Kheradpour P, Parts L, Brennecke J, Hodges E, Hannon GJ, Kellis M: Systematic discovery and characterization of fly microRNAs using 12 Drosophila genomes. Genome research. 2007, 17 (12): 1865-1879. 10.1101/gr.6593807.PubMed CentralView ArticlePubMedGoogle Scholar
- Berezikov E, Thuemmler F, van Laake LW, Kondova I, Bontrop R, Cuppen E, Plasterk RH: Diversity of microRNAs in human and chimpanzee brain. Nature genetics. 2006, 38 (12): 1375-1377. 10.1038/ng1914.View ArticlePubMedGoogle Scholar
- Lindow M, Krogh A: Computational evidence for hundreds of non-conserved plant microRNAs. BMC genomics. 2005, 6: 119-10.1186/1471-2164-6-119.PubMed CentralView ArticlePubMedGoogle Scholar
- Rajagopalan R, Vaucheret H, Trejo J, Bartel DP: A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes & development. 2006, 20 (24): 3407-3425. 10.1101/gad.1476406.View ArticleGoogle Scholar
- Fahlgren N, Howell MD, Kasschau KD, Chapman EJ, Sullivan CM, Cumbie JS, Givan SA, Law TF, Grant SR, Dangl JL, et al: High-throughput sequencing of Arabidopsis microRNAs: evidence for frequent birth and death of MIRNA genes. PLoS ONE. 2007, 2 (2): e219-10.1371/journal.pone.0000219.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang L, Ding L, Cheung TH, Dong MQ, Chen J, Sewell AK, Liu X, Yates JR, Han M: Systematic identification of C. elegans miRISC proteins, miRNAs, and mRNA targets by their interactions with GW182 proteins AIN-1 and AIN-2. Molecular cell. 2007, 28 (4): 598-613. 10.1016/j.molcel.2007.09.014.PubMed CentralView ArticlePubMedGoogle Scholar
- Tyler DM, Okamura K, Chung WJ, Hagen JW, Berezikov E, Hannon GJ, Lai EC: Functionally distinct regulatory RNAs generated by bidirectional transcription and processing of microRNA loci. Genes & development. 2008, 22 (1): 26-36. 10.1101/gad.1615208.View ArticleGoogle Scholar
- Stark A, Bushati N, Jan CH, Kheradpour P, Hodges E, Brennecke J, Bartel DP, Cohen SM, Kellis M: A single Hox locus in Drosophila produces functional microRNAs from opposite DNA strands. Genes & development. 2008, 22 (1): 8-13. 10.1101/gad.1613108.View ArticleGoogle Scholar
- Yoon S, De Micheli G: Computational identification of microRNAs and their targets. Birth Defects Res C Embryo Today. 2006, 78 (2): 118-128. 10.1002/bdrc.20067.View ArticlePubMedGoogle Scholar
- Lindow M, Gorodkin J: Principles and limitations of computational microRNA gene and target finding. DNA and cell biology. 2007, 26 (5): 339-351. 10.1089/dna.2006.0551.View ArticlePubMedGoogle Scholar
- Hofacker IL: Vienna RNA secondary structure server. Nucleic acids research. 2003, 31 (13): 3429-3431. 10.1093/nar/gkg599.PubMed CentralView ArticlePubMedGoogle Scholar
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic acids research. 2003, 31 (13): 3406-3415. 10.1093/nar/gkg595.PubMed CentralView ArticlePubMedGoogle Scholar
- Lai EC, Tomancak P, Williams RW, Rubin GM: Computational identification of Drosophila microRNA genes. Genome biology. 2003, 4 (7): R42-10.1186/gb-2003-4-7-r42.PubMed CentralView ArticlePubMedGoogle Scholar
- Sewer A, Paul N, Landgraf P, Aravin A, Pfeffer S, Brownstein MJ, Tuschl T, van Nimwegen E, Zavolan M: Identification of clustered microRNAs using an ab initio prediction method. BMC bioinformatics. 2005, 6: 267-10.1186/1471-2105-6-267.PubMed CentralView ArticlePubMedGoogle Scholar
- Hertel J, Stadler PF: Hairpins in a Haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics (Oxford, England). 2006, 22 (14): e197-202. 10.1093/bioinformatics/btl257.View ArticleGoogle Scholar
- Helvik SA, Snove O, Saetrom P: Reliable prediction of Drosha processing sites improves microRNA gene prediction. Bioinformatics (Oxford, England). 2007, 23 (2): 142-149. 10.1093/bioinformatics/btl570.View ArticleGoogle Scholar
- Nam JW, Shin KR, Han J, Lee Y, Kim VN, Zhang BT: Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic acids research. 2005, 33 (11): 3570-3581. 10.1093/nar/gki668.PubMed CentralView ArticlePubMedGoogle Scholar
- Berezikov E, Guryev V, Belt van de J, Wienholds E, Plasterk RH, Cuppen E: Phylogenetic shadowing and computational identification of human microRNA genes. Cell. 2005, 120 (1): 21-24. 10.1016/j.cell.2004.12.031.View ArticlePubMedGoogle Scholar
- Lindow M, Jacobsen A, Nygaard S, Mang Y, Krogh A: Intragenomic matching reveals a huge potential for miRNA-mediated regulation in plants. PLoS computational biology. 2007, 3 (11): e238-10.1371/journal.pcbi.0030238.PubMed CentralView ArticlePubMedGoogle Scholar
- Baskerville S, Bartel DP: Microarray profiling of microRNAs reveals frequent coexpression with neighboring miRNAs and host genes. RNA (New York, NY). 2005, 11 (3): 241-247.View ArticleGoogle Scholar
- Brennecke J, Cohen SM: Towards a complete description of the microRNA complement of animal genomes. Genome biology. 2003, 4 (9): 228-10.1186/gb-2003-4-9-228.PubMed CentralView ArticlePubMedGoogle Scholar
- Llave C, Xie Z, Kasschau KD, Carrington JC: Cleavage of Scarecrow-like mRNA targets directed by a class of Arabidopsis miRNA. Science (New York, NY). 2002, 297 (5589): 2053-2056.View ArticleGoogle Scholar
- Piriyapongsa J, Marino-Ramirez L, Jordan IK: Origin and evolution of human microRNAs from transposable elements. Genetics. 2007, 176 (2): 1323-1337. 10.1534/genetics.107.072553.PubMed CentralView ArticlePubMedGoogle Scholar
- Xue C, Li F, He T, Liu GP, Li Y, Zhang X: Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC bioinformatics. 2005, 6: 310-10.1186/1471-2105-6-310.PubMed CentralView ArticlePubMedGoogle Scholar
- Ng KL, Mishra SK: De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics (Oxford, England). 2007, 23 (11): 1321-1330. 10.1093/bioinformatics/btm026.View ArticleGoogle Scholar
- Grad Y, Aach J, Hayes GD, Reinhart BJ, Church GM, Ruvkun G, Kim J: Computational and experimental identification of C. elegans microRNAs. Molecular cell. 2003, 11 (5): 1253-1263. 10.1016/S1097-2765(03)00153-9.View ArticlePubMedGoogle Scholar
- Freyhult E, Gardner PP, Moulton V: A comparison of RNA folding measures. BMC bioinformatics. 2005, 6: 241-10.1186/1471-2105-6-241.PubMed CentralView ArticlePubMedGoogle Scholar
- Bonnet E, Wuyts J, Rouze P, Peer Van de Y: Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics (Oxford, England). 2004, 20 (17): 2911-2917. 10.1093/bioinformatics/bth374.View ArticleGoogle Scholar
- Ng Kwang Loong S, Mishra SK: Unique folding of precursor microRNAs: quantitative evidence and implications for de novo identification. RNA. 2007, 13 (2): 170-187. 10.1261/rna.223807.PubMed CentralView ArticlePubMedGoogle Scholar
- Azzalini A, Capitanio A: Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999, 61 (3): 579-602. 10.1111/1467-9868.00194.View ArticleGoogle Scholar
- Cohen J: A Coefficient of Agreement for Nominal Scales. Educational and psychological measurement. 1960, 20 (1): 37-10.1177/001316446002000104.View ArticleGoogle Scholar
- Applied Bioinformatics (PRI/WUR). [http://appliedbioinformatics.wur.nl/murnall/]
- Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R: REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic acids research. 2001, 29 (22): 4633-4642. 10.1093/nar/29.22.4633.PubMed CentralView ArticlePubMedGoogle Scholar
- Griffiths-Jones S: miRBase: the microRNA sequence database. Methods in molecular biology (Clifton, NJ). 2006, 342: 129-138.Google Scholar
- Yao Y, Zhao Y, Xu H, Smith LP, Lawrie CH, Watson M, Nair V: MicroRNA profile of Marek's disease virus-transformed T-cell line MSB-1: predominance of virus-encoded microRNAs. Journal of virology. 2008, 82 (8): 4007-4015. 10.1128/JVI.02659-07.PubMed CentralView ArticlePubMedGoogle Scholar
- Cai X, Schafer A, Lu S, Bilello JP, Desrosiers RC, Edwards R, Raab-Traub N, Cullen BR: Epstein-Barr virus microRNAs are evolutionarily conserved and differentially expressed. PLoS pathogens. 2006, 2 (3): e23-10.1371/journal.ppat.0020023.PubMed CentralView ArticlePubMedGoogle Scholar
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research. 1999, 27 (2): 573-580. 10.1093/nar/27.2.573.PubMed CentralView ArticlePubMedGoogle Scholar
- Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, et al: Database resources of the National Center for Biotechnology. Nucleic acids research. 2003, 31 (1): 28-33. 10.1093/nar/gkg033.PubMed CentralView ArticlePubMedGoogle Scholar
- Schölkopf B: Support Vector Learning. 1997, R. Oldenbourg Verlag, MunichGoogle Scholar
- Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T: Ensembl 2008. Nucleic acids research. 2008, D707-714. 36 Database
- Kulikova T, Akhtar R, Aldebert P, Althorpe N, Andersson M, Baldwin A, Bates K, Bhattacharyya S, Bower L, Browne P: EMBL Nucleotide Sequence Database in 2006. Nucleic acids research. 2007, D16-20. 10.1093/nar/gkl913. 35 Database
- The R Project for Statistical Computing. [http://www.R-project.org/]
- RPy (R from Python). [http://rpy.sourceforge.net/index.html]
- Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982, 143 (1): 29-36.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.