An integrative approach to identify hexaploid wheat miRNAome associated with development and tolerance to abiotic stress

Background Wheat is a major staple crop with broad adaptability to a wide range of environmental conditions. This adaptability involves several stress and developmentally responsive genes, in which microRNAs (miRNAs) have emerged as important regulatory factors. However, the currently used approaches to identify miRNAs in this polyploid complex system focus on conserved and highly expressed miRNAs avoiding regularly those that are often lineage-specific, condition-specific, or appeared recently in evolution. In addition, many environmental and biological factors affecting miRNA expression were not yet considered, resulting still in an incomplete repertoire of wheat miRNAs. Results We developed a conservation-independent technique based on an integrative approach that combines machine learning, bioinformatic tools, biological insights of known miRNA expression profiles and universal criteria of plant miRNAs to identify miRNAs with more confidence. The developed pipeline can potentially identify novel wheat miRNAs that share features common to several species or that are species specific or clade specific. It allowed the discovery of 199 miRNA candidates associated with different abiotic stresses and development stages. We also highlight from the raw data 267 miRNAs conserved with 43 miRBase families. The predicted miRNAs are highly associated with abiotic stress responses, tolerance and development. GO enrichment analysis showed that they may play biological and physiological roles associated with cold, salt and aluminum (Al) through auxin signaling pathways, regulation of gene expression, ubiquitination, transport, carbohydrates, gibberellins, lipid, glutathione and secondary metabolism, photosynthesis, as well as floral transition and flowering. Conclusion This approach provides a broad repertoire of hexaploid wheat miRNAs associated with abiotic stress responses, tolerance and development. These valuable resources of expressed wheat miRNAs will help in elucidating the regulatory mechanisms involved in freezing and Al responses and tolerance mechanisms as well as for development and flowering. In the long term, it may help in breeding stress tolerant plants. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1490-8) contains supplementary material, which is available to authorized users.


Additional file 1. Supplementary Methods from S1 to S4
Method S1: Plant treatment To identify miRNAs associated with stress responses and tolerance, a control library from plants grown under normal conditions was constructed except for the Al treated library from spring wheat Bounty (sensitive genotype).
We used the control library from winter wheat Atlas66 (the tolerant genotype) as a control since in our previous study; the expression level of most genes was similar in control plants from Atlas66 and Bounty. Only 1.26% of genes were differentially expressed in the microarray analysis between Atlas66 and Bounty control demonstrating that most genes are expressed at a similar level (Houde and Diallo, 2008). For cold and salt treatments, seeds were grown in 3 L pots containing mixed soil (compost, black earth, vermiculite, 1: 1: 1) at 20°C under a 16 h photoperiod with a light intensity of 250 µmol m -2 s -1 and 70% relative humidity. To control the soil water content, one week before starting treatment, all plants were watered with the same amount of water, so they were exposed to the same soil moisture content. . Seedlings were exposed to Al stress as described in Hamel et al., (1998) using 5 µM Al for Bounty and 50 µM Al for Atlas66.
To cluster the ESTs, each one has been aligned with the program blastx (Gish and States 1993) with default parameters. Each EST is annotated with a Uniref100 protein if it has at least 60% identity, 60% query coverage and/or hit coverage and E-value inferior to 5E-5.

Method S3: Removing adaptor and read mapping
All sequences are from SOLID sequencing and had a length of 35 with T as first character. Since expected mature miRNA length is around 21nt, reads should contain a part of the adaptor P2. Adapters were removed in color space (represented by numbers between 0 and 3) using the program cutadapt v0.9 (Schulte et al., 2010) with the following parameters: -c -e 0.12 -a 330201030313112312 -maq. Those parameters accept at most 12% of mismatches over the adaptor, therefore adjusting the number of mismatch depending on sequence length. The output format was adapted to the program MAQ (Ondov et al., 2008) for mapping the adaptor free reads to the ESTs. Identical or different reads with the same mapping results (same region in the EST) have then been grouped as a small RNA.
We computed the abundance of a small RNA as the sum of all reads that exhibit the given small RNA after mapping.

S4.1 Collected data
All miRNAs and pre-miRNAs were downloaded from miRBase release 18, which contains 21643 miRNAs or miRNAs* for 18226 pre-miRNAs. The negative dataset was generated within the same data, except the start instances.

S4.2 Model creation
To perform the feature selection and create models, we used Weka and its libraries (Hall et al., 2009). For all trained models, we applied a 10 fold cross validation.
In MiRdup*, we chose the combination of Adaboost M1 method (Freund & Schapire 1995) and forest of random trees (Breiman 2001). Adaboost is a machine learning algorithm specialized in problem optimisation and the search for the best global optimum (Osman 1996). It is used in combination with many other machine learning algorithms in order to improve their performances. We trained it with 10 iterations, reweighting, and a weight threshold of 100.
Concerning other classifiers, the SVM classifier, working with libSVM library (Chang and Lin 2001), was trained with radial kernel and the C4.5 tree (Quinlan 1993) was trained with Adaboost.
Sensitivity and specificity, which measures the proportion of positives and negative instances which are correctly classified as such respectively, were calculated with the well-known formulas: We set 35 features that can characterize the duplex miRNA and its complement sequence on the hairpin (Additional file 2: Table S3). To calculate them, our model requires the mature sequence of the miRNA, the pre-miRNA hairpin loop sequence and its secondary structure. Few are based on the position and length of particular structural elements in the miRNA, like bulges and bases pairs, or the situation of the miRNA itself in the hairpin, like the distance relative to the loop. We also included features that characterize the nucleotide sequence, where we check for missing nucleotides in the miRNA, percentage of presence of certain nucleotides and GC content. Furthermore, we calculate the minimum folding energy of the duplex.
Those features reveal some configuration errors found in pre-miRNA prediction results. In Figure SM1b, we show examples of misplaced miRNAs (red) on a predicted pre-miRNA (black), where the miRNA sequence might extend in the loop or in a big bulge. We assume that could affect the biological process and thus, conclude that the miRNA cannot be handled by the given pre-miRNA. Those should be invalidated. On the opposite, miRNAs (yellow) which have a good position on the hairpin are validated, depending of the training set characteristics.

S4.3 Features ranking
Not all features have the same impact on classifiers. Some are too similar between the positive and negative datasets and can influence badly the classifier, especially in over fitting and computational time (Zhou et al., 2006).
Then, in order to get the most important features, we ranked them, as shown in Additional file 2: Table S3 for all the miRBase dataset. We observe that the most influent feature concerns the distance from the terminal loop and the weakest is the presence of U. In addition, presence and percentage in the miRNA of a certain nucleotide tend to be less influential features. The miRNA length is a positive control and has no difference between positive and negative datasets, because generation of the negative dataset is done by moving the miRNA while preserving its length. As shown by Leclercq et al., (2013), depending on the kingdom of species chosen for the study, the rank of important features will change. Hence, to ensure quality, we trained the model on all miRBase, all plants and monocot datasets.

S4.4 Training models on classifiers
Various classifiers were tested in order to find the best one which could discriminate all experimentally validated miRNAs from miRBase dataset depending on the class (

S4.5 Evaluation of the designed models
After training the classifiers on all features, few datasets were submitted to construct the models. These include miRBase and plant miRNAs (PMRD) databases, and published results from MIRcheck and miRDeep. Results are shown in Table SM2, where we have the total instances of every dataset, the portion of validated instances and the mean absolute error. An instance is a miRNA and its corresponding precursor. The mean absolute error informs about the average error per predicted instance made by the model. We note in the

S4.6 Conclusion of the MiRdup* model
We trained several classifiers, and boosting on forest of random trees showed best overall results. It has the best classification score of 80.29% and ability to identify negatives results (specificity) 86.5%. Its sensitivity is not the top, but still gives a high score of 74.1%. The evaluation of miRBase on this model has shown a high score of 99.75% validated instances of mature miRNAs and corresponding pre-miRNA(s).
It's not the first time where SVM, Adaboost, trees and random forest are used in this field to classify real or false miRNAs and pre-miRNAs (Guan et al., 2011;Helvik et al., 2007;Jiang et al., 2007;Yan et al., 2007;Zhou et al., 2007). Adaboost on tree kept a high score level in this study, but is always lower than forest of random trees.
Compared to these classifiers, the SVM has the limit of the training time, where several hours are needed to train the model. This is a problem to try the classifier with diverse parameters, and would be a disadvantage to the usage of MiRdup*. Perhaps several tests on parameters could improve its overall results.
We know that relying only on miRBase could limit the discovery of novel miRNAs. However, we can assume that this database contains many examples of what characterizes a miRNA on its pre-miRNA. Machine learning algorithms are very useful for this kind of problem. They can accept certain diversity around usual characteristics in order to assess new members that fulfill the majority of the statistics. Also, the majority of hairpins in miRBase are single loop hairpins, but several hundred contain multiloops. We designed this algorithm so that we can accept multiloop hairpins, since we only need to focus on the duplex containing the miRNA and its complement.
We noticed in this study that general statistics on features change between set of species, especially between different groups of plants. Those differences denote that specific datasets should be used depending on the hairpin predictions we want to validate. The objective of MiRdup* here is to reduce a noise as post treatment of predicted hairpins.