Characterization of statistical features for plant microRNA prediction
© Thakur et al; licensee BioMed Central Ltd. 2011
Received: 7 August 2010
Accepted: 16 February 2011
Published: 16 February 2011
Skip to main content
© Thakur et al; licensee BioMed Central Ltd. 2011
Received: 7 August 2010
Accepted: 16 February 2011
Published: 16 February 2011
Several tools are available to identify miRNAs from deep-sequencing data, however, only a few of them, like miRDeep, can identify novel miRNAs and are also available as a standalone application. Given the difference between plant and animal miRNAs, particularly in terms of distribution of hairpin length and the nature of complementarity with its duplex partner (or miRNA star), the underlying (statistical) features of miRDeep and other tools, using similar features, are likely to get affected.
The potential effects on features, such as minimum free energy, stability of secondary structures, excision length, etc., were examined, and the parameters of those displaying sizable changes were estimated for plant specific miRNAs. We found most of these features acquired a new set of values or distributions for plant specific miRNAs. While the length of conserved positions (nucleus) in mature miRNAs were relatively longer in plants, the difference in distribution of minimum free energy, between real and background hairpins, was marginal. However, the choice of source (species) of background sequences was found to affect both the minimum free energy and miRNA hairpin stability. The new parameters were tested on an Illumina dataset from maize seedlings, and the results were compared with those obtained using default parameters. The newly parameterized model was found to have much improved specificity and sensitivity over its default counterpart.
In summary, the present study reports behavior of few general and tool-specific statistical features for improving the prediction accuracy of plant miRNAs from deep-sequencing data.
MicroRNAs (miRNAs) are ~21 nucleotides long sequences, which are endogenously generated both in plants and animals. They are one of the key players in gene regulation, typically inhibitory in nature, and act either at the post-transcriptional level (by triggering target messenger RNA degradation) or at the translational level (by inhibition of translation) (see review by ). In plants, miRNA genes are initially transcribed as a primary miRNA sequence (i.e., pri-miRNA), which folds into a hairpin loop with small overhanging regions at both ends. The pri-miRNAs are then processed into hairpin sequences (i.e., precursor miRNA) by a riboendonuclease, named DICER. The DICER protein removes the loop region from the hairpin, and the remaining duplex is transported out of the nucleus, where the complementary sequence (also called star) is removed to allow mature miRNA to be ready for action .
Various tools have been developed to discover conserved and/or novel miRNAs. Initially, the miRNAs have been discovered by direct cloning and sequencing [3, 4], which suggested a high degree of sequence conservation of miRNA across species . With the availability of a complete genome sequence, it is possible to use computational approaches to discover the miRNA homologs. Recent progress in high-throughput sequencing further enabled genome-wide discovery of miRNAs, including novel miRNAs, and their expression profiling.
Over the past few years, several web-servers and standalone applications, analyzing deep sequencing data for miRNA discovery and/or expression profiling, have been reported; the web-servers include miRCat , miRAnalyzer , miRTools , whereas the standalone tools include miRDeep , miRExpress , MiroPipeline , etc. In addition to the above categories, there have been few miRNA discovery studies, reported in species, like Arabidopsis  rice , maize , and human , largely carried out using in-house scripts. There are two major limitations with these tools or protocols: 1) several of them use known miRNAs to identify only the homologs present in deep-sequencing data, thus limiting the scope of discovery of novel miRNAs, despite the availability of complete genome sequence; 2) several others are available as web-servers, which constrains the analysis in terms of upload/analysis time for larger datasets and also by limitations in the choice of reference genomes. In contrast, miRDeep does not have these constraints. In addition, miRDeep scores the predictions, which makes it easier to pick the better candidates from the rest .
The scoring used in miRDeep is based on computation of posterior probabilities of statistical features, like minimum free energy (MFE), conservation of the core region of mature miRNA, etc.. Some of these features are commonly used by other tools/analyses (see Methods for details). The posterior probabilities are usually computed based on parameters derived from a known set of real and background precursors; the current parameter set used in miRDeep is from a nematode, C. elegans, and reported to be effective for other animals as well. However, there are major differences between the properties of plant and animal miRNAs. For example, the maximum length of plant miRNA precursors can be about ~900 nt long; the extent of pairing and bulge size of duplex (of mature and star) in plants is very different from that of animals . Therefore, the differences in properties between plant and animal miRNAs can affect the parameters required for statistical scoring used in miRDeep.
In the current study we examined the key differences between plant and animal miRNAs, and their effects on general and miRDeep-specific statistical features, and further estimated the plant specific parameters of these features. With these new miRDeep parameters, we validated the prediction by applying it to a set of newly discovered miRNA in maize seedlings. The results were compared to that obtained using default miRDeep parameters.
In order to examine if the source (species) of precursor sequences has any bias on the above observation, the MFE distribution of real and background precursors from four additional plant species (with complete genome sequence), namely sorghum (S. bicolor), rice (O. sativa), Arabidopsis thaliana, and Medicago trancatula, were compared. While the MFE distributions of background precursors of sorghum and rice converged to that of maize, the distributions of Arabidopsis and Medicago were very distinct (Figure 2B). This led us to speculate that differences possibly exists at the level of monocots and dicots. For real precursors, although differences did exist between distributions of monocot and dicot species, however, these were statistically insignificant (at confidence level ≤0.05) (data not shown). These observations imply that the miRDeep training, in a plant species, is largely independent of the choice for the real precursors, but sensitive for the background precursors. This also affects the log-odds of MFE in dicots, as the MFE distributions show large differences between real and background (Additional file 2: Figure 1), as against the monocots (Figure 2A).
List of miRDeep parameters estimated for plant specific miRNAs.
Original parameters of miRDeep
Plant specific (monocot)
Dicot specific (if any)
Cumulative Distribution Function:
Location = 32; Scale = 5.5
Cumulative Distribution Function:
Location = 23; Scale = 4.8
a = 1.339e-12
b = 2.778e-13
c = 45.843
a = 4.46e-4
b = 9.125e-5
c = 26.929
Nucleus conservation (log-odds)
could be as high as 5 nt #
Maximum multiple hits of deep-seq read
Another feature which distinguishes real precursors from background precursors is the marked difference in secondary structure when the sequence of a candidate precursor is shuffled (referred henceforth as stability). Bonnet et al.  reported that MFE of real miRNA precursors significantly differ from their shuffled counterparts. The miRDeep implements this feature through the use of a tool RANDfold  to distinguish real and background precursors by examining the stability of secondary structure on shuffling. RANDfold first generates an ensemble of shuffled sequences, either by mononucleotide or dinucleotide shuffling, and computes the frequency (p-value) of shuffled sequences exceeding MFE of the candidates in question. So a p-value of 0.01, for instance, signifies that only 1% of the shuffled sequences have similar or better folding than the candidates. Typically, those candidates with a p-value of up to 0.05 are categorized as stable.
We further examined whether the choice of source of background precursors affected the stability. As in the case of MFE, the stability of the precursors differs substantially between monocots and dicots (Additional file 4: Figure 1). The dicot background precursors were found to have a higher fraction of stable structures (for example, 50-55% for 300 nt size range), as opposed to their monocot counterparts (only 20-25% for the corresponding length). Therefore, in dicots, the higher frequency of stable background precursors will make the distinction between real and background precursors more blurred.
Duplex (miRNA-miRNA*) associated parameters in miRDeep compared to plants miRNAs.
Up to 8 (for 22nt long miRNA)
Up to 5
Up to 8
Up to 5-6
Consecutive bases in bulge
Up to 5-6
While computational structure prediction yields minimum free energy structures according to the Turner model , the way miRNA precursors fold in their cellular context may differ from the one predicted by this model. This could be more likely for longer miRNA precursors as they may have higher degrees of freedom in which to be folded. Therefore, we examined how faithfully the software RNAfold predicted the secondary structure of real plant precursors longer than the mean length. We used RNAfold  to generate structures of 11 real plant miRNA precursors, of size in the range of 375-425 nt. The predicted structures were consistent with the secondary structures reported in miRBase .
The probability distribution of mapping of reads to the real and background precursors modeled by geometric distribution in miRDeep, was assumed to be same in plants, as the nature of mapping of reads to mature-loop and star-region are likely to be the same in both plants and animals. However, a minor difference (in the distribution) cannot be ruled out due to the longer loop region in plants.
Validation of miRNAs identified by parameterized miRDeep.
Confirmation by RACE
If identified by parameterized miRDeep
miRNA overlapping with CDS
poor abundance of reads
miRNA homologous to CDS
miRNA homologous to CDS
MiRNA overlapping with CDS
miRNA homologous to CDS
miRNA overlapping with CDS
The miRDeep algorithm with default parameters, on the contrary, predicted only two candidates from chromosome 5 at a log-odds score of >0, and both failed to match known miRNAs. On a whole genome scale, it predicted a total of sixteen candidates, out of which only two overlapped with known miRNAs. When checked for the presence of basic features typical to plant miRNAs, we found the majority of them failed to meet the desired metrics. Most of the discrepancies are related to the number of mismatches in the duplex region, the bulge size, non-specific homology, or precursor length, etc. Therefore, the newly parameterized miRDeep consistently identified known miRNAs (from chromosome 5, expressed in maize seedlings) with higher specificity and sensitivity than the original parameterization.
The plant miRNAs differ from animal miRNAs in several aspects, mainly in the hairpin length and in the nature of complementarity with the star sequence . Although the length of mature sequences largely remains the same, the length of the loop region differs substantially in plants, owing to their recent evolution . These differences strongly influence the statistical features used for their prediction.
The minimum free energy (MFE), a commonly used measure for characterizing secondary structure of different types of RNA [20, 21], is also being used for characterization and/or prediction of miRNAs [3, 4, 22–24]. For screening miRNA candidates, the majority of previous studies have either used a fixed MFE threshold (for example, -18 kcal/mol) , or a variable threshold . The miRDeep, however, involves comparison of (posterior probabilities of) MFE of real and background hairpins, enabling a more robust discrimination between them. This comparison in plants, however, did not prove to be so straightforward, as diversity in miRNA hairpin length results in multiple distributions of MFE.
Although, the effect of hairpin length on MFE in plants has been reported earlier , these reports did not give a systematic evaluation of potential impacts of this relationship on prediction of plant miRNAs. In the present study, we observed that the MFE distributions become length-free by normalizing the MFE of precursor with its length, which renders the MFE of hairpins, of different length, comparable.
We further observed only minor differences between the MFE distributions of real and background precursors of comparable length. This suggested that MFE alone may not be a good discriminator between real and background miRNAs, and more weight should be placed on other measures besides MFE. These findings however do not hold for dicot species, as they do show substantial differences between real and background, rendering MFE a more important discriminating feature.
Differences between the secondary structures of candidate precursors and their shuffled counterparts is another important feature exploited for miRNA prediction . The real precursors generally display substantial difference in the nature of folding (as well as in MFE) from their shuffled counterparts . This difference is quantified by the p-value, which is the fraction of shuffled sequences with MFE lower than original precursors; a candidate with p-value ≤0.05 can be statistically considered as stable. Although, plant and animal miRNA precursors, of the same length, are expected to have a similar frequency of stable precursors, due to the diversity in length of plant miRNA precursors, the parameter estimated for animals becomes inapplicable to plants. In the latter case, while p-values of real precursors remains almost the same even with increased length, that of background precursors declined substantially with length. These criteria become more effective when it comes to predicting longer candidate precursors, which is often the case with plants.
Furthermore, the conservation of mature miRNAs is yet another important feature for miRNA discovery, exploited by miRDeep and several other tools, where the former takes into account the conservation in the nucleus region of mature miRNAs . The positions which constitute the nucleus in animals are 7-8 nt in length, starting from position 2 [27–29], whereas in plants, there is near-perfect complementarity all along its length. We examined the nature of the positional conservation pattern in plants, and found a relatively longer conserved motif, wherein two conservation blocks were apparent: positions 2-13, and 16-19, with position 4 completely conserved. Implementation of positional conservation pattern in plants has improved the specificity of miRNA homolog prediction.
Since plant miRNA precursors show a relatively broader distribution of length compared to animals, this in turn, necessitates a different choice of excision length(s) for candidate prediction. While Sunkar et al.  considered 200 nt as a threshold at which 90% of the real precursors in rice were covered, Jones-Rhoades et al.  took a higher precursor length (500 nt) for prediction in A. thalina/O. sativa. In another study by Adai et al. , in A. thaliana again, the maximum precursor size was set to 400 nt. Based on the distribution of length of plant miRNA precursors from miRBase database (release 14), length(s) which covered the maximum number of miRNAs, and at the same time, had an adequate number of precursors (30, for instance, which have properties of a normally distributed population) for parameter estimation, were chosen. We observed two thresholds satisfying the above constraints, 276 and 336, covering 96% and 98% of the population, respectively.
To show how much the new parameterization improves the prediction accuracy, we used the miRDeep with the default parameters to predict miRNA candidates. Results suggested that a major fraction of miRNAs, predicted using the default parameters, did not match with experimentally identified miRNAs. The observed values of key features, such as number of total mismatches, bulges, nucleus conservation, excision length, etc., of the predicted candidates were atypical for plant miRNAs. Moreover, shorter nucleus size in default miRDeep led to identification of several false miRNA homologs. However, prediction using new parameters on the same dataset showed very high prediction accuracy, with good sensitivity and even better specificity.
Notably, any proposed improvement in plant miRNA discovery must meet the criteria laid out for miRNA annotation in plants [12, 31]. Despite the parameter adjustments in the miRDeep algorithm, the primary criterion for miRNA annotation, namely precise excision of mature miRNA from the stem of a stem-loop precursor, is implemented faithfully. The parameterization doesn't interfere in miRDeep's core method. Further, two of the miRDeep's statistical features, namely characterization of stem-loop and mapping of reads onto precursors, are enough to prevent a siRNA being misclassified as miRNA. Besides, there have been recent reports of few plant miRNAs being processed by riboendonucleases other than DCL1 , therefore, the predictive methods should also be capable of their identification. This however does not pose much problem to the tools based on deep-sequencing reads, as their methods are guided primarily by the sequences. So, a DCL3 processed miRNA, for instance, will be analyzed just like the DCL1 generated miRNAs, despite the longer product size of the former. Furthermore, there have been rare reports of multi-functional stem-loops , which poses challenges to the tools available for miRNA discovery. We are skeptical about the ability of the current form of miRDeep algorithm to handle such complexity.
This study also brings forward some issues that can be studied in the future. Increasing the number of plant genomes can allow researchers to further test whether MFE distributions of monocots and dicots truly differ and if so, study the underlying mechanisms. Furthermore, improved genome annotation will also improve the discovery of miRNAs missed out due to overlap with an otherwise incorrectly annotated CDS. Other desired advancements include modules for identification of other kinds of sRNA and the ability to characterize multi-functional stem-loops.
The sequences of precursor and mature miRNA of plant and animal species were downloaded from miRBase (release 14)  and pooled into respective sets. Additionally, plant specific miRNA sequences were also extracted (as on April 15th 2010) from the 'Plant miRNA Database' , which contains additional numbers of precursors, which are largely computationally predicted.
This dataset was further processed to remove redundancy, as in several instances, the precursor sequences are almost identical except for a few changes in nucleotides. Such sequences are likely to create bias towards the over-represented members of a miRNA family. This bias may affect some of the important analysis such as cumulative distribution of MFE and the effect of length on MFE. For obtaining non-redundant sets, the precursors were first divided into individual families and were then subjected to multiple alignment. Based on the alignment, the highly similar sequences were manually discarded, resulting in a set of 1904 precursors out of 2034.
A background miRNA precursor is one that exhibits similar physical properties as that of real precursors, however, they are never transcribed into a miRNA. Since it is known that plant miRNAs are transcribed largely from the intergenic region and coding sequences are unlikely to contain any miRNA. Therefore, we used protein-coding sequences of five different species namely maize , sorghum, rice, Medicago, and Arabidopsis (sequences of remaining four species were obtained from ), for generation of background precursors (in the present study, maize was the default choice for background sequences unless otherwise specified). We used de novo miRNA discovery program, miRCheck , for identification of background sequences, over a broad length range, with default parameter settings, except the excision length was increased to 500 nt.
where P (pre|data) is posterior probability of a test precursor being a 'real precursor' (pre) given the values of multiple statistical features (data), and P (bgr|data) is posterior probability of a test precursor being 'background precursor' (bgr) given the data.
where P(data|pre) is conditional probability of observing data in real precursors, P(data|bgr) is conditional probability of observing data in background precursors, and P(pre) and P(bgr) are prior probabilities of real and background precursors, respectively.
Out of these five probability distributions, P(abs|pre) and P(sig|pre) are continuous distributions (takes any value within specified range), while the remaining three are discrete distributions (takes one of the values from a set). The above also applies to the probability distributions for background (bgr) data. The cumulative frequency distribution of MFE, P(abs|pre) or P(abs|bgr), shows gumbel (or extreme value) distribution, while that of signature, P(sig|pre) or P(sig|bgr), shows geometric distribution . The stability (rel) takes either of the two values: 0 if frequency of unstable is more than 5%, otherwise 1. Similarly, if nucleus regions is non-conserved at the specified positions, the value of nuc will be 0, otherwise 1. The value of star will be 0, if any read fails to map with star region, otherwise 1.
The miRDeep essentially involves determination of these probability distributions, mentioned in Eq. (3), for known datasets of real (pre) and background (bgr) precursor/mature sequences. Here in current study, these distributions will be estimated for plants miRNAs.
The frequency distribution of MFE, plotted in Figure 1, was obtained from secondary structure prediction (by RNAfold ) of all known animal and plant miRNA precursors (available in miRBase release-14). To obtain a relationship between precursor length and MFE (as in Additional file 1: Figure 1), samples of plant precursors of different lengths, with sample size 60 (however for highest length, the sample size dropped to 21), each sample being homogeneous in length with a variation of 5 nucleotides (however 10 for the samples having members less than said sample size) were obtained. The MFE was computed by RNAfold and mean MFE values were plotted against the average precursor length of each sample.
For comparison of distributions of MFE of background precursors from multiple plant species (shown in Figure 2B), two-sample Kolmogorov-Smirnov tests were conducted using 'ks.test()' function of R statistical package .
where a, b, and c are fitting parameters.
To study the positional conservation in mature miRNA families of plants, we obtained all members of 18 miRNA families, sampled randomly. It was however insured that each family must have representatives from at least four species. Members of each miRNA family were aligned using CLUSTALX  and a position with ≥ 0.9 conservation was marked conserved. Then positional conservation profile of all 18 families were summed. That is, count of families conserved at each of the positions, starting from 1 to 23, was obtained and plotted. Since higher evolutionary divergence is likely across families (than within family), therefore threshold for defining a position as conserved was further relaxed, that is, ≥ 0.75 (Figure 4).
In order to obtain the frequency of nucleus conservation in real miRNAs, a strategy of 10-fold cross-validation was applied. That is, the complete set of known mature miRNAs were first shuffled, and divided into two fractions with ratio 9:1. The smaller fraction was used as a 'test set' and its conservation (at positions 2-12) was examined in sequences of larger fraction, and the frequency of sequences from test-set hitting 'training-set' was recorded. This was repeated 10 times, and an average of all ten frequencies was obtained. For background, the conservation of a large set of putative mature sequences, against the training-set described above, were examined in a similar way.
The log-odds of observing conserved (nuc = 1) and non-conserved (nuc = 0) nucleus were and , respectively (Table 1).
where R is the number of shuffled sequences with MFE greater than that of original sequence, N is the number of iterations.
For background precursors, the analysis was carried on three samples of lengths: 100, 200, and 300 nt, each sample having 100 sequences. They were subjected to 499 iterations.
The estimated log-odd scores for stable (rel = 1) and unstable (rel = 0) were given by, and , respectively (Table 1).
The Illumina reads, generated from Maize seedlings transcriptome, were downloaded from NCBI Gene Expression Omnibus (ID: GSM448856) . The reads, already trimmed for adaptors, were 7.922 millions in number.
The pre-processing of Illumina reads was done using MiroPipeline , which involved low complexity filtering, inclusion of reads without adaptor, and replacing identical sequences with single representatives. For mapping reads in miRDeep, the default choice of mapper, namely mega-blast , was replaced with one of faster mapper, namely SOAP-v2.2 . Mapping was done onto unmasked Maize genome , with repeated hits allowed, and the maximum number of mismatches was set to 1. The hits were later filtered for number of multiple best hits, a maximum of 20 was set as a threshold (so that reads repetitive in nature get excluded, at the same time those most likely mapping to multiple members of a miRNA gene family are considered), and then converted into miRDeep compatible format. The hits with protein coding sequences and with various types of non-coding RNA (eg., rRNA, tRNA) were also filtered out [44, 45]. The excision length was set to a maximum of 300. In order to speed up the mapping of filtered reads onto precursors (default program: auto_blast.pl of miRDeep package), we used MiroPipeline (configured for using seqmap as an alignment tool) . The core program (miRDeep.pl) was run with score threshold 0 and stability check on. One of the miRDeep's filtering criteria, to exclude candidates with bifurcation in secondary structure, was set off to improve sensitivity.
The key syntactic changes for incorporating plant specific parameters in PERL scripts of miRDeep are listed in Additional file 5.
The difference in properties of plant and animal miRNAs has large impact on the statistical features used for miRNA prediction. The parameters used for animal miRNA prediction cannot be used to predict plant miRNAs. Among the statistical features, the minimum free energy was found to have marginal difference between real and background in monocots. However dicots showed a different behavior wherein MFE scoring is potentially a key discriminator. The stability pattern in plant miRNAs was different to animals, in particular among the background sequences. The positional conservation profile was relatively longer in plants, so does the associated frequencies. The new set of parameters identified in this study will substantially improve our capacity to predict plant miRNAs.
The authors declare that they do not have competing interests.
The current study was supported by Bill and Melinda Gates Foundation.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.