Employing machine learning for reliable miRNA target identification in plants
BMC Genomics volume 12, Article number: 636 (2011)
miRNAs are ~21 nucleotide long small noncoding RNA molecules, formed endogenously in most of the eukaryotes, which mainly control their target genes post transcriptionally by interacting and silencing them. While a lot of tools has been developed for animal miRNA target system, plant miRNA target identification system has witnessed limited development. Most of them have been centered around exact complementarity match. Very few of them considered other factors like multiple target sites and role of flanking regions.
In the present work, a Support Vector Regression (SVR) approach has been implemented for plant miRNA target identification, utilizing position specific dinucleotide density variation information around the target sites, to yield highly reliable result. It has been named as p-TAREF (plant-Target Refiner). Performance comparison for p-TAREF was done with other prediction tools for plants with utmost rigor and where p-TAREF was found better performing in several aspects. Further, p-TAREF was run over the experimentally validated miRNA targets from species like Arabidopsis, Medicago, Rice and Tomato, and detected them accurately, suggesting gross usability of p-TAREF for plant species. Using p-TAREF, target identification was done for the complete Rice transcriptome, supported by expression and degradome based data. miR156 was found as an important component of the Rice regulatory system, where control of genes associated with growth and transcription looked predominant. The entire methodology has been implemented in a multi-threaded parallel architecture in Java, to enable fast processing for web-server version as well as standalone version. This also makes it to run even on a simple desktop computer in concurrent mode. It also provides a facility to gather experimental support for predictions made, through on the spot expression data analysis, in its web-server version.
A machine learning multivariate feature tool has been implemented in parallel and locally installable form, for plant miRNA target identification. The performance was assessed and compared through comprehensive testing and benchmarking, suggesting a reliable performance and gross usability for transcriptome wide plant miRNA target identification.
miRNAs have emerged as a major regulatory components of cell system, which are active in almost all of the multicellular organisms. These noncoding RNA elements are around 21 bp long and bind the target mRNA sequences which share complementarity with the targeting miRNA sequences. However, for a long time it has been believed that miRNA targeting in plants requires almost complete complementarity while in animal it is incomplete complementarity where seed regions play the critical role in binding and subsequent targeting [1, 2]. Some recent studies have emerged out where translational repression and some inexact complementarity have been suggested to be existent in plant miRNA targeting too [3–5]. Some groups, encouraged with these findings, have started looking into such aspects in more detail, studying interactions which may not display exact complementarity as well as instances which are left undetected by existing plant miRNA target prediction tools [5, 6]. Li et al conducted an experiment, where they suggested that complementarity and homology based target identification tools, which compose the major approach of target identification in plants, may miss out several valid targets in plants. Such targets actually may not obey conservation, homology or exact complementarity . The major drawbacks of most of the existing plant miRNA target prediction tools have been that they follow the exact complementarity, most of them do not consider any flanking region sequence contribution to better the target prediction, they hardly leverage from machine learning like powerful approaches to handle multiple features for target prediction more accurately. Most of them lack the realistic time approach to handle the genome or transcriptome wide data to facilitate faster target predictions as most of them are serially coded and web-server based. A major reason could be a predominant belief that unlike animal system, targeting in plants has been not much complex. Pertaining to this, exact complementarity search centered tools were used for plant target predictions while animal target identification witnessed large number of innovations . Few of the most frequently used plant miRNA target prediction tools relied strongly upon exact pattern search and local alignments. PatScan  was a tool developed to look for exact similar matching patterns for target, where users could modify the match and mismatch values as well as select for wobble. However this tool did not consider bulge or seed specific scoring and its use has been nonspecific as it is used for other pattern match based purposes too, besides target finding. Another tool, miRNAassist, used BLAST search for complementary regions of miRNAs . Using BLAST, already known miRNAs from other species were used as a database to search against Brassica EST sequences. Following almost similar approach, Carrington group proposed another protocol where BLAST was replaced by FASTA34 . They also introduced some scoring rules of alignment to separate the seed region from rest of the regions as well as relaxed values for mismatches and wobbles. However BLAST based approaches are good for instances where the query length is longer as for smaller sequences, hits come up with very low significance making a random hit case. Considering this Zhang  developed a new tool, miRU, which replaced BLAST with Smith-Waterman local alignment, weighting more for seed regions and allowed bulges. These all tools were centered around complementarity search. Acknowledgement for limitations of exact complementarity and alignment based methods was conspicuous with release of new generation tools like TAPIR . TAPIR worked with two different options: 1) Scan for targets using FASTA program based alignment or 2) By applying more sensitive approach of running RNAhybrid  and considered thermodynamic and mismatch factors together. Use of RNAhybrid in the back-end also ensured that unlike previously employed tools, TAPIR was able to detect multiple target sites in a given mRNA sequence. Contemporary to this, Xie and Zhang developed a novel tool Target-align . Target-align was implemented by considering some rules while performing alignments. These rules were about the number of allowed mismatches, consecutive mismatches, number of allowed gaps and strict mismatch conditions in the seed region. However unlike TAPIR, focus of Target-align was on Smith-Waterman based alignment for complementarity search with several conditions. An advantage with Target-align has been its availability as local standalone version, unlike majority of plant miRNA target identification tools. Very recently, Dai et al acknowledged about the various lacunae in existing plant miRNA target identification tools, including centrality of alignment based approach, no proper consideration for imperfect complementarity, no consideration for role of flanking regions, inability to detect multiple sites as well as unavailability of locally downloadable standalone version to perform large and genomic scale studies . Considering the various existing demerits, this group implemented the role of target site accessibility and flanking regions by using RNAup . RNAup is a tool to predict RNA-RNA interaction, considering single strandedness of a given RNA sequence while deriving partition function for various nucleotides in secondary structures. RNAup and similar approaches have been used frequently in animals for miRNA target identification with likes of Sfold , PITA  and MicroTAR . However, applications of such tools have some limitations, as they are based on single sequence secondary structure and energy based features, whose accuracy and reliability drop drastically with increase in the length of sequences [21, 22]. Considering this, Heikham and Shankar  had proposed a novel approach to consider the flanking region sequence information, bypassing the chances of getting trapped into the issues arising from limitations of thermodynamics and structure based modeling. It successfully applied varying dinucleotide density profile with respect to putative target positions to decipher the role of flanking region in miRNA targeting in animal system. In case of plants, considering such approach becomes more relevant as unlike animals, where targeting is preferred in the 3' UTR regions, in plants miRNA targeting can occur to any region of the full length mRNA.
In the present work, these findings have been extended with flanking regions sequence information role in determining miRNA targets , by applying and assessing the theory on plant system too. Here a machine learning based reliable approach with multiple features oriented statistical learning has been applied, having a clear edge over rule based approaches. Arabidopsis thaliana has been used as the source to derive plant specific features which were modeled using Support Vector Regression to classify as well as to implement an effective scoring scheme through regression score. Besides this, a concurrent architecture with multi-threads has been implemented, making the tool application easily deployable even on simple desktop machine in concurrent mode, enabling it to scan plant mRNA sequences for targets in transcriptome wide manner.
Basic working approach
p-TAREF has been designed specifically to detect plant miRNA targets, applying the following basic steps: A) Conversion of target:miRNA pairing into single dimension encoded pattern for interactions, which retains the various interaction features combinations found in plant system. This is done for experimentally validated as well as predicted RNA:miRNA interactions. An uniform alignment step precedes it to maintain a common alignment approach B) Using target:miRNA binding thermodynamic, implemented through RNAhybrid, initial set of targets are generated. This is followed by an optional filtering step. The library of experimentally known encoded patterns is scanned against the predicted interaction patterns as combinations of match, mismatch, wobble and bulge may hold interaction state information for target:miRNA C) Evaluation of putative miRNA target site as a potential target site based on plant specific flanking region dinucleotide density profile variation in position specific manner with respect to the possible target site. It uses multivariate classifier with capability to transform between non-linear and linear spaces. When applied with Support Vector Regression, the position specific dinucleotide density profile variation patterns were found to possess strong discriminative power to precisely classify targets and non-targets. Dinucleotide density variation pattern also retains nearest neighbor information for nucleotides, a property useful in determining the accessibility and structural conditions of RNAs. The entire process has been implemented in a parallel mode. Figure 1 shows the working implementation of p-TAREF along-with concurrency. The following sections give more details about the implementation of the entire approach.
The present work has used several sequence resources. miRNA sequences for plants were downloaded from Mirbase version 16 . 243 mature miRNA sequences were retrieved for Arabidopsis, 414 for Oryza, 234 for Populus, 51 for Medicago and 37 for Tomato. All these miRNAs have been integrated in the presented tool. Experimentally validated Arabidopsis thaliana miRNA targets and their corresponding targeting miRNAs were retrieved from ASRP database  as well as from the list of miRNA:target pairs validated through RACE PCR as reported in the supplementary material provided by Beauclair et al. Arabidopsis sequences were downloaded from TAIR, version 10. Experimentally validated targets for Medicago and Rice were retrieved from various literatures [7, 26]. Negative instances of false targets were built from the dataset used previously as well as random sequences [13, 23].
Plant specific encoded interaction pattern generation
Instances were extracted, using the list of RACE PCR validated miR:target interactions for Arabidopsis, submitted by Beauclair et al in their supplementary material. Experimentally validated miRNA and target interactions for other plants species like Rice, Medicago, Tomato, Populus, were also derived from various literatures [7, 26, 27]. All miRNAs and target partners were retrieved for a separate run of RNAhybrid. RNAhybrid predicts miRNA:Target interaction by considering thermodynamic parameters for interactions and multiple-sites while applying information from statistical distribution in its backdrop. Also RNAhybrid run is a common step between encoded interaction pattern generation for experimentally validated instances as well as during the prediction run over any unknown query sequences. This way, it maintains a common approach. Output of RNAhybrid over experimental datasets provided exact binding pictures of interactions, which was further refined by applying Needleman-Wunch global alignment algorithm based local alignment tool, Stretcher, from Emboss-package. In order to consider the G:U wobble, the scoring matrix was adjusted accordingly with +1 advantage for G:U wobble, gap opening penalty of -15 and extension penalty of -5. Through this, sequence similarity as well as thermodynamic considerations was implemented to derive the interaction patterns. Using local scripts, all such interactions were converted into single encoded patterns, where information was reduced to single dimension alone, with match states of nucleotides i.e. bulge on miRNA strand, bulge on target strand, mismatch, match and wobble. All experimentally validated interactions were finally represented into only this form. Same protocol was used by the tool to generate interaction patterns for the predicted targets automatically. For every predicted target, the entire library of experimentally validated encoded patterns is scanned for similarity with scope to look for inexactness. This step defines the primary filtering step based on similarity of interaction patterns with experimentally known interactions. At present, total 268 different interaction patterns have been included considering miRNA:target interaction cases from Arabidopsis (157), Medicago (7), Populus (42), Tomato (11) and Rice (51).
Support Vector Regression (SVR) model building for plants
Unlike rule based approaches of identification and classification, machine learning approaches have emerged much superior for the process of classification. Among them, Support Vector Machine (SVM) has appeared as highly reliable one as it can handle large number of features together to derive a suitable classifier using multivariate statistical learning, which is comparatively tough to achieve by rule based approach of classification. Another advantage of SVM has been that unlike other machine learning approaches it concentrates upon evolving a classifier boundary with maximum margins, lowering the chance of misclassification and error drastically. This property is also controlled by the type of kernel selected for training and classification purpose, as linear kernel applies linear boundary, Gaussian kernel applies normal distribution boundary while polynomial kernel has capability to evolve convolute boundary to handle the cases where instances from different classes are very mixed up for the given set of features. The final classification by SVM assigns the classified instances their respective class as either 1, 0 or -1. However, this does not come with any clear confident value for the classification. This degree of confidence could be derived through some scoring scheme, which is provided by the Support Vector Regression (SVR). In the current study, a more evolved Support Vector approach, the SVR, has been used to implement training and classification along with a scoring scheme, regressions score. For training purpose, a sequence dataset comprising 104 experimentally validated Arabidopsis sequence instances reported by Beauclair et al (Supplementary Material, 2010) as well as negative target instances used by Heikham and Shankar  was formed. The negative target sequences has randomly generated sequences as well as some experimentally validated negative targets which were predicted as targets but experimentally validated as false positives. 75 bases flanking regions around the target sites in negative as well as positive instances are considered through 20 bases long sliding windows, estimating the dinucleotide density and its variations with respect to the target-site. Discrimination through dinucleotide density variation with respect to position was found to be the best for window size of 20. Mean distribution based feature selection procedure was applied to learn about the most discriminating features in plants. The Support Vector Regression Machine was applied through SVMTorch , where every learning instance was converted into position specific dinucleotide density variation profile with respect to the (possible) target sites. Training and model generation were performed separately for three different Kernel classes: Linear, Gaussian and Polynomial. The best emerging models for plant systems for each Kernel class were saved and integrated into the plant target identification tool developed. This way the user gets three choices of plant models to select from.
Expression data support integration and visualization
Various array expression experiments and data (Affymetrix Rice Genome Array, Affymatrix Arabidopsis Tilling Array 1.0 R and AT-TAX) were used in the present study. Data normalization was done using gcRMA method implemented in "R" Statistical Package. The expression data ('.CEL' format) was downloaded from GEO for Oryza sativa. Expression studies and data for 17 Oryza miRNA families (156, 159, 160, 166, 168, 172, 396, 444, 528, 806, 810, 820, 1318, 1875, 2055, 2906, 395) and 57,359 RNA sequences (excluding miRNAs) were used. For Arabidopsis miRNAs, the available expression related studies and data for 31 miRNA families (156, 157, 159, 163, 164, 165, 166, 167, 169, 171, 172, 319, 390, 391, 393, 394, 396, 398, 399, 401, 403, 404, 405, 406, 407, 413, 414, 417, 447, 824, 834) and 30,166 mRNA transcripts were considered. For several of these array based experimental data, RT-PCR based validations for sets of associated representative genes were reported by the submitting authors. The RNA sequences for Arabidopsis were downloaded from TAIR and Oryza RNA sequences from RiceGE.
To calculate correlation coefficient, the submitted target(s) is first searched in the locally installed database of Oryza or Arabidopsis (to be opted by the user) using BLASTn. The top most hit amongst all the hits, found by BLASTn, is extracted. The identifier of best hit is scanned across the inbuilt library of expression data files to finally calculate the Pearson Correlation Coefficient for co-expression. Modules for scanning and data parsings for expression correlation analysis part were implemented through codes developed in PERL, PHP and Java. miRNA:target association graph was generated using graphviz and Java libraries, JgraphT and JGraph.
Introduction of Concurrency
Concurrency enables the system to perform the same task with higher speed by harnessing the available logical processors on a given machine. Currently, even a simple desktop or laptop comes with multicore CPUs, having two or more processors/cores, which can go upto more than 50 in current generation servers. Implementation of concurrency was done using Java Concurrent Library (JCL) while applying multi-threaded processing of tasks. The developed tool provides the user an option to select the total number of processors to be used for target scanning. Accordingly, multithreads are created to process the query sequences. A single query sequence is chopped into several small subsequences with minimum 50 bp length (considering that usually a miRNA:target interaction stays below 50 bp), in overlapping manner and distributed across the number of processors selected, to run the following steps of target identification. For every such processor and batch of allocated sequences, RNAhybrid is run separately; output is manipulated and parsed for coordinates, separately and concurrently. Similarly, the alignment step is run concurrently. Only the Support Vector Regression step is not concurrent as it is quite faster. The RNAhybrid, alignment, parsing and union steps are quite time consuming and application of concurrency saves the time by providing manifolds acceleration while performing analysis on large amount of data.
Standalone and Server Implementation
The entire tool has been developed as a web-server as well as Linux based standalone GUI version. The web-server version has been developed using Linux-Apache-PHP, along with concurrency. The standalone version has core programs and scripts written in Python, PERL, Java and C, while its GUI wrapper has been developed using QT C++ GUI library. The standalone version, too, supports concurrency.
Six major different tests were done to assess the performance of the developed tool, p-TAREF, for miRNA target-identification: 1) Testing for performance on dataset containing training set (total 104 positive and 119 negative instances) 2) Dataset containing 287 Arabidopsis positive instances from ASRP database. 3) Dataset containing experimentally validated targets from Rice, Medicago, Populus and Tomato 4) Comparison of p-TAREF with TAPIR and Target-align, for performance over TAPIR/Target-align reference dataset 5) Performance comparison between p-TAREF, Target-align and psRNATarget . 6) Comparison between Target-align and p-TAREF for time performance on a given set of sequences. The performance measure terms, Sensitivity (Sn), Specificity (Sp), Accuracy (Ac) and Mathew Correlation Coefficient (MCC) were calculated using the following equations:
ROC curve based on 10 fold cross validation was done to estimate the performance and robustness of the classifier models and associated tests.
Gene Ontology and enrichment studies
Gene Ontology information for Rice transcriptome was derived from Ensemble Plants. Enrichment analysis for gene categories predominant in miRNA target system was conducted through two different ways: A) Using multiple Binomial tests. B) Using Hyper-geometric exact tests. The null hypothesis was derived using the distribution of various GO categories and their terms in whole transcriptome of rice. For multiple Binomial tests, we developed in-house script in "R", while hyper-geometric tests were conducted using Cytoscape module of Bingo .
Result and Discussion
Web interface of p-TAREF server and GUI Standalone
p-TAREF comes as a server as well as standalone version. The web-server takes single as well as batch mode submission of the query sequences. However, considering the connectivity dependence upon network, it is quite advisable to use the web-server version for single sequence or small number of sequences. The input of sequence requires FASTA manner entry where the first line starts with ">" followed by "AT" and accession ID or numeric digits to identify the sequence, without any gap, followed by next line having the sequence. Query could be either pasted directly or uploaded through some text file. The users are given with three choices 1) Type-I: Just submit the query sequence and run the tool from beginning, starting from RNAhybrid step. 2) Type-II: Submit the target mRNA sequence along-with predicted target sequence. 3) Type-III: Choose some miRNA from a drop down menu to identify targets on the submitted query sequences. Type-I facilitates the user to perform all tasks on the given query sequence, while Type-II is more for confirmation and validation of already predicted target by some other method, applying support vector regression module directly. Unlike Type-II, Type-I is more computationally intensive as it involves time consuming step of RNAhybrid, dynamic programming based alignment step, pattern encoding and search as well as large amount of parsing. Considering this, the option of concurrency has been given to the user for Type-I, where the user could choose the number of processors to be used to run the server concurrently and get results faster. Type-I also provides the user with options to select the allowed number of mismatches while estimating similarity between the predicted and experimentally validated encoded patterns for interactions between miRNA and targets. The maximum allowed level goes upto four mismatches. Higher the mismatch level cut-off, more number of total targets may emerge out. There is an option to set the threshold energy cut-off parameter for RNAhybrid run, which is -10 kcal/mol by default. A decisive step in parameter selection is the selection of plant model according to the Kernel (Choice of Kernel). Here, p-TAREF provides three options to choose from: Linear Kernel, Radial Basis Function (Gaussian) and Polynomial function. Linear function runs straight with least accommodative power, Gaussian is moderately stringent and Polynomial tries to cover more spread and deviating instances correctly. The Type-II option is more for validation purpose in case if a user wishes to confirm the predicted target by some other tool or method, by applying SVR approach. In this case, the user has to paste the predicted target region sequence as well as the sequence in which the target was predicted. Based upon the dinucleotide density profile variation method for refinement, SVR scores will be generated for the query. Figure 2 provides the look for Type-I form of the server. Besides this the server provides Type-II option to perform SVR validation for already predicted targets. It takes the predicted target sequence as well as the mRNA sequence, to which the target region belongs. Type-III option provides a list of miRNAs to opt from and perform analysis on the user submitted query sequence.
The server version also provides a provision to scan for possible expression data based expression correlation measurement for the given user query and associated miRNA, found targeting it. The user is asked to select the species to which the sequence belonged or is expected to share a homologous sequence. The server has inbuilt, normalized, expression data for plant miRNAs as well as genes, currently for Arabidopsis and Rice. Along-with the expression data, the associated mRNA sequences are also formated for similarity search tools like BLAST, which is enabled to run on multiple processors. The user opts for the species to be scanned for the target gene, in turn, the server preforms a BLAST run to consider the longest and most identical hit, most similar to the query sequence. The corresponding expression data for the target and targeting miRNA is retrieved for expression correlation measurement, which is displayed to the user. The publicly available expression data for all known plant miRNAs and genes will be continuously updated with every release and for various species. It needs to be mentioned that array expression data could be not of much use in case of translational repression by miRNA. A possible analogous facility may be provided in future for targeting cases where translational repression could be involved. The final output page displays the target sequence ID, targeting miRNA, the predicted interaction pattern and closest experimentally validated pattern along-with the partner miRNA, SVR score and choice to scan for expression analysis based validation across different species. The SVR score comes positive for potential miRNA targets while it is negative for non-targets. Higher the absolute value of the SVR score better is the confidence of classification.
The Standalone GUI version of p-TAREF was developed considering the realistic approach to scan large amount of transcriptome data for miRNA targeting. Performing such task on web-server is a time consuming approach which takes lots of time in loading the data itself and fetching it back, while both of the processes are connection and INTERNET availability dependent. In that way, Standalone GUI version could be very helpful in running p-TAREF locally and in user friendly manner. The entire interface has been developed using QT C++ Library which is also available for download from the server page's download section. The Standalone GUI is easily deployable even on a simple laptop or desktop machine as well as on high-end servers. In case, if the required dependencies are not pre-installed on user's machine, p-TAREF installation system verifies this and automatically installs all the per-requisites itself. The standalone GUI version provides an option to load sequence query file in a batch mode, adjust the mismatch level for experimentally validated interactions similarity search, options to adjust the binding free energy cut-off, options to choose for kernel dependent plant models as well as adjust the number of processors to be used to implement concurrency. A progress bar appears to display the running job status and amount of sequences scanned. Figure 3 shows the running state of Standalone GUI version of p-TAREF.
As already discussed in the introduction section, unlike the animal system based miRNA target identification tools, plant miRNA target identification tools have witnessed limited growth till recently. Many of them revolved around complementarity search, using either heuristics like BLAST and FASTA or Smith-Waterman in their core. Most of them are web-server based and barring psRNAtarget, none of them provides the scope of concurrency to enable analysis of large amount of sequence data. Considering the revolutions made by next generation sequencing and systems biology approach, it becomes imperative to analyze transcriptome/genome level data at one go, with high accuracy as well as speed. BLAST and FASTA dependent methods do not require concurrency due to innate advantage of FASTA and BLAST to be much faster, though at the cost of accuracy and reliability. For that, some authors tried Smith-Waterman local alignment to detect complementarity, which becomes sharply slower with increment in the number and length of sequences to be searched and more so if all to all search has to be performed without the prior knowledge of the miRNA. We compared one such tool, Target-align, with p-TAREF, for time performance as among the very few tools available as the standalone version, Target-align is a recently published software with widespread use. We executed Target-align and p-TAREF on 205 plant genes from Arabidopsis and recorded the time taken to finish the job. Though p-TAREF run could be accelerated through concurrency and use of more processors, no such facility was available with Target-align, making us to run it with single processor and compare the performance for time taken. Table 1 summarizes the time performance and impact of introduction of concurrency. Figure 4 displays the plot showing reduction in execution period on introduction of concurrency when run over 790 mRNA sequences associated with plant secondary metabolite pathway. The processing speed of p-TAREF shot up with inclusion of more processors, making it a better choice to look for whole transcriptome wide scanning. Besides p-TAREF, only psRNAtarget provides the advantage of concurrency. However, comparison between them for execution speed was not possible as psRNAtarget is available only as a web-server and its concurrency has been implemented through cluster computers having several processors and large volumes of memory. Unlike psRNAtarget, p-TAREF is easily deployable on any level of machines and can run concurrently even on simple desktop machine.
Contrary to the strictly similarity search based tools, as described above, p-TAREF, takes advantage from three different approaches: Similarity based, Thermodynamics based (RNAhybrid) and Machine Learning based. Barring psRNATarget, hardly any of the existing plant miRNA target identification tools consider the thermodynamic aspect as well as contribution of flanking region in deciding the target. In animal system it is now well proven that flanking regions reasonably participate in determination of the target site [23, 30]. However unlike psRNAtarget, which measures the miRNA-RNA interaction and gross secondary structure of the mRNA using RNAup program  at its back-end, p-TAREF prefers to apply dinucleotide density variation profile around the target site as the multivariate feature set for discrimination through statistical machine learning approach, Support Vector Regression. Our previous work with animal system had already shown the effectiveness of detecting miRNA targets in animal system without getting trapped into the limitations of RNA structure prediction [21, 23]. In overall, p-TAREF was compared with psRNATarget , Target-align  as well as TAPIR , through several tests. In the first two tests, p-TAREF was compared with Target-align and psRNAtarget, over the dataset created using experimentally validated targets reported in the supplementary material of Beauclair et al as well as experimentally validated instances reported in ASRP database  along-with the mentioned negative dataset. Performance of p-TAREF was measured for three different kernel based models. It was found that even the lowest performing linear kernel based plant model performed better than psRNAtarget as well as Target-align for the given datasets. Table 2 presents the result of this performance comparison along with Additional File 1 (Table 2). This is to mention that while performing this assessment, the instances taken in training sets were entirely different from the one used in training, keeping it clear to test its performance on a dataset with never seen before instances.
In the next test, performance of p-TAREF was compared with Target-align and TAPIR, for the reference dataset used by TAPIR  and Target-align . Both the tools had used a common dataset for their performance benchmarking. On the same dataset, p-TAREF was run and found performing better than TAPIR and Target-align with sensitivity level of 100%, which is higher than the ones observed for TAPIR (93.14%) and Target-align (97.05%). Though it does not affect much even if same set is used for training and testing, in the present study it has been tried all over to keep the two sets entirely different and unseen. Here also, the number of instances which were common between the sets used for testing by TAPIR and Target-align were searched. Only seven out 105 instances were found common between TAPIR/Target-align reference test set and training set of p-TAREF. For this test, the benchmarking protocol applied by Bonnet et al for TAPIR  and Target-align , was followed exactly, where the predicted targets falling outside the experimentally validated regions were considered as the negative instances to calculate the false positive rates. Compared to these two tools, Target-align and TAPIR, p-TAREF had lower false positive rate. Table 3 presents the result for this benchmarking exercise.
Besides this, p-TAREF was also compared with psRNAtarget for experimentally validated dataset, which was used previously for performance benchmarking of psRNAtarget [31–34]. For all experimentally validated 46 instances of targets, p-TAREF identified 45 of them. Further experimentally validated target instances specific for Tomato, Populus and Medicago were collected and the performance of p-TAREF was measured on them. For available nine experimentally validated target instances in Medicago truncatula specific miRNAs, p-TAREF scored 100%. For all of the available eight experimentally validated targets from tomato, p-TAREF attained 100% accuracy. For Populus trichocarpa, 17 out of 21 experimentally validated and submitted instances were available, out of which 16 targets were identified successfully, notching an accuracy of 94.11%. For Populus euphratica 21 targets out of 24 known instances, were successfully identified (Accuracy% = 87.5%). All the details regarding performance, benchmarking and associated tests are explained elaborately on the performance page of the server as well as in Additional File 1.
Using 10-fold cross validation, the performance robustness of all the three kernels and different tests was estimated. The Area Under Curve (AUC) values for most of the tests scored above 0.9, suggesting the robustness of the working theory, model built under the three kernels and their reliable performance. Figure 5 shows the Receiver Operating Characteristic curve (ROC) for the three models along-with their respective AUC values. The last two ROC plots are about performance of Target-align and p-TAREF, for the reference dataset used for benchmarking of Target-align by its authors, following the same protocol which they used to judge the false positive rate. The recorded AUC for p-TAREF was reasonably higher than the one observed for Target-align, suggesting more consistent performance by p-TAREF.
Target identification in Rice transcriptome and emergence of miR156 as a prominent regulator
In the beginning of this part of the study, the validation and performance benchmarking process over the already known and experimentally validated miRNA target instances in rice transcriptome was done. Recently, Sunkar had group performed a degradome sequencing based study to report 153 miRNA targets . For 29 rice specific miRNAs, the authors had reported 56 targets. For validation work the same experiment was used to validate targets identified by p-TAREF in the rice transcriptome. The sequence data was found available for 52 such target genes and p-TAREF identified most of the targets with overall accuracy of 97.33%. Encouraged by this, whole transcriptome analysis for miRNA targets in rice transcriptome sequences was carried out, excluding those sequences on which the above mentioned analysis had been performed already in order to avoid redundancy, looking for new targets and save time.
p-TAREF was run over 57,995 mRNA sequences from rice transcriptome dataset, with upto 4 mismatch level between experimental and predicted interaction patterns and polynomial kernel plant model. Initially, total 36,916 targets were identified for upto four differences from experimentally validated interaction patterns for target:miR interactions. Total 7,996 unique genes were found being targeted. Additional File 2 contains details of all identifications made at different mismatch levels. To validate the predicted targets with support of experimental data, the microarray expression data for all of the predicted target:miRNA pairs was checked. Out of 36,916 predicted miRNA targets, the expression data was available for 33,709 pairs to estimate the expression correlation between the target gene and corresponding miRNAs. After performing the expression correlation analysis, for 27,586 predicted target:miRNA pairs inverse expression correlation was observed, for different experimental conditions and tissue types, suggesting strong concordance with the predicted targets (81.8%). The expression correlation was compared with their respective SVR scoring and a reasonable agreement between the two was found with Pearson correlation coefficient of 0.7. The remaining 18.2% of identified targets had no agreement with expression correlation, which may also include condition like translational repression by miRNAs, which can't be interpreted well through inverse correlation estimation. While discussing this, it needs to be mentioned that expression data has certain limitations for inferences. It could be useful in case of transcript disruption, which is mostly prevalent in the plants. Though unlike animal system where translational repression has been reported more prevalent than transcript decay during miRNA targeting, recent studies have reported existence of translational repression in plants too, as discussed above. In such condition array expression data could not be much of help in inferring the process of targeting by microRNAs.
For this run, targets were found mainly for ~20 different miRNA families, with over-representation by certain miRNA families (Figure 6). All such targeting miRNAs and associated target genes, along-with their available expression correlation, mismatch level for nearest interaction patterns and SVR scores are listed in Additional File 3. For the identified miRNA targets, showing strong inverse correlation with the associated miRNAs, the targets were grouped according to the miRNAs targeting them, and separate Gene Ontology analyses were performed over every such group. The related data is given in the Additional File 4.
From this study, miR156 family emerged as an important miRNA in Oryza system, with largest number of targets (526 unique genes), many of which also scored high for negative expression correlation with miR156. One of the possible reasoning for observing such high number of targets for miR156 could be attributed to existence of purine richness (GA/AG tract) in miR156 sequence, causing poly-pyrimidine regions to be counted as the targets due to complementarity. Though the algorithm design of p-TAREF has capacity minimize the noise, especially those arising through mere complementarity, yet a couple of analyses were performed to verify the above mentioned possibility. Maintaining the constant dinucleotide composition, a permuted miR156 sequence was generated. If the polypyrimidine tracts could influence the result significantly, one may expect to see the frequency of targets for such permuted miRNA with identical dinculeotide composition as almost of same level. However, when p-TAREF was run with most liberal parameters to find the permuted miR156 targets, only 105 genes were found being targeted and with consideration of only miR156 specific encoded interaction pattern comparisons, absolutely no hit was found for the permuted miR156. The same test was repeated with few more permuted miRNAs and almost similar pattern of lower number of random targets were observed, with absolutely no targets reported when miR156 specific encoded interaction patterns were considered. This suggests high reliability of identifications done by p-TAREF, where the user could also apply the different options parameters to limit the result of interest. Further, a search for polypyrimidine SSR regions in the rice transcriptome reported ~1000 genes with polypyrimidine tracts. When mapped for the target genes for miR156, only 56 genes were found common between these two sets of genes. For several of these 56 genes the target site was found non-overlapping with the polypyrimidine tracts. Therefore, these findings suggest a very limited possible role of repetitiveness/randomness in the observed abundance of miR156 targets. Also, this needs to be mentioned that the mentioned number of target genes for miR156 is the gross number of targets for miR156 obtained with the parameters described in the beginning of this section. Search could be refined further by applying various filters and options provided with p-TAREF, including SVR score cut-off, interaction pattern differences and expression correlation score, etc. Additional Files 2 and 3 hold all such details for rice, which could be used to refine the results further, based upon filters like SVR score/Correlation Score/Differences in encoded pattern/Selection of miRNA specific encoded patterns etc. Applying one of such cut-offs for inverse correlation for expression, we performed an analysis upon the top scoring targets for miR156, as demonstrated below.
Figure 7 shows a group of identified targets for miR156 and their corresponding inverse correlation scores while Table 4 details about the possible functions and identification of top fifty of these targets along-with their SVR scores. All these targets scored inverse expression correlation values higher than 0.8 (i.e between -0.8 to -1). Gene Ontology studies over the identified miR156 target genes and associated statistical testing for enrichment provided some interesting informations. For miR156 certain biological terms were found enriched. Table 5 shows the top 20 significantly enriched GO terms found associated with miR156 targets in Oryza sativa. Figure 8 shows the result of statistical enrichment analysis using hyper-geometric tests, for molecular function categories of genes. For this part of the study, it can be seen that miR156 targeting was found significantly enriched for genes associated with process of transcription, nucleotide transfer process during transcription and catalytic activities. For further analysis, targets were searched for instances where a miR156 target was found being targeted by other miRNAs too. For such pairs some enriched miRNA pair instances were found. miR156-miR160 coexisted in 110 unique genes and transcripts, enriched for molecular functions like RNA polymerase activity (P-value: 1.7E-03). miR156-miR166 coexisted in 25 unique genes and 36 unique transcripts, where genes associated with molecular functions like Beta-Galactosidase activity (P-value: 1.6E-03), were found enriched. miR156-miR396 coexisted in 208 unique genes and 263 unique transcripts, showing enrichment for genes associated with molecular functions like Brassinosteriod-sulfotransferase activities (P-value: 4.9E-04), Fructokinase activities (P-value: 7.3E-04), Glucokinase activities (P-value: 7.3E-04) and UDP-gluco-4-aminobenzoate activities (p-value: 7.3E-04).
Previously done studies have reported critical role of miR156 in plant growth and developmental stage transitions like flowering, fruit ripening and shoot development, controlling some important transcription factors like SPL [35, 36]. Some recent studies now suggest that miR156 could be an eternal regulator of vegetative growth in plants and found critical in growth phase transitions . The present study found strong affinity of miR156 towards targeting genes involved in the process of transcription, growth and development which goes in sync with findings made previously with mentioned studies for miR156.
Like animals, in plant systems too, the role of flanking regions in determining miRNA targets appears as critical one. This was successfully tested for plants and implemented through the developed tool, p-TAREF. It works on statistical machine learning principle, deriving maximum margin classification decision boundary while considering multiple variables, which in the present work has been plant specific dinucleotide density profiles variations with respect to the possible target position. The confidence over that assigned class is derived through the scoring scheme of Support Vector Regression score. Besides this, implementation of concurrency provides p-TAREF an accelerated processing capability to harness multiple processors even on simple desktop machine as well as on big servers. p-TAREF web-server provides scope for expression based evidence for predicted targets, providing confidence on prediction, besides SVR scoring system to gather confidence on identification. The expression data and other associated publicly available information will be updated regularly with release of new data sources. The expression analysis and data in the present work were mainly based upon array experiments, which have some innate limitations. Though such array experiments may not produce the most accurate expression results, they have been used extensively for expression and abundance analysis at genome wide level and may provide a reasonable estimation of expression. For several of these experiments, RT-PCR based validation had been reported for the representative genes. More sensitive expression data from NGS and RT/q-PCR could be added in the upcoming versions of p-TAREF, depending upon the kind of experiments performed on these platforms and their public availability. For performance assessment, one of the most comprehensive performance measurements and comparisons with most recent and contemporary tools for miRNA target identification in plant system has been done, suggesting better performance by p-TAREF. Using p-TAREF, whole transcriptome level targets for rice transcriptome have been identified where miR156 was found as a critical miRNA in rice system. The reported targets were validated in two ways: using support from co-expression data as well as accurate identification of degradome analysis based targets. The identified targets could be an important resource to get clearer picture of regulation in rice. With all this, p-TAREF could be very helpful for the study of gene regulation and becomes more relevant considering the amount of data being produced by next generation sequencing projects, where p-TAREF could be applied over novel plant transcriptomes to discover miRNA targets.
Availability and requirements
Project name: p-TAREF
Operating system(s): Platform independent web-server version as well as Linux specific standalone version.
Programming language: Python, PERL, Java, R
Other requirements: Web-server is recommended for single or small number of sequences. For batch mode analysis, prefer to use the standalone GUI version.
Any restrictions to use by non-academics: None
Receiver Operating Characteristic Curve
Mathew Correlation Coefficient
Area Under Curve
Next Generation Sequencing
Graphical User Interface
Java Concurrent Library.
Rhoades MW, Reinhart BJ, Lim LP, Burge CB, Bartel B, Bartel DP: Prediction of plant microRNA targets. Cell. 2002, 110: 513-520. 10.1016/S0092-8674(02)00863-2.
Dugas DV, Bartel B: Sucrose induction of Arabidopsis miR398 represses two Cu/Zn superoxide dismutases. Plant Mol Biol. 2008, 67: 403-417. 10.1007/s11103-008-9329-1.
Brodersen P, Sakvarelidze-Achard L, Bruun-Rasmussen M, Dunoyer P, Yamamoto YY, Sieburth L, Voinnet O: Widespread Translational Inhibition by Plant miRNAs and siRNAs. Science. 2008, 320: 1185-1190. 10.1126/science.1159151.
Lanet E, Delannoy E, Sormani R, Floris M, Brodersen P, Cre' te' P, Voinnet O, Robaglia C: Biochemical Evidence for Translational Repression by Arabidopsis MicroRNAs. Plnat cell. 2009, 21: 1762-1768. 10.1105/tpc.108.063412.
Beauclair L, Yu A, Bouché N: microRNA-directed cleavage and translational repression of the copper chaperone for superoxide dismutase mRNA in Arabidopsis. Plant J. 2010, 62: 454-462. 10.1111/j.1365-313X.2010.04162.x.
Brodersen P, Voinnet O: Revisiting the principles of microRNA target recognition and mode of action. Nat Rev Mol Cell Biol. 2009, 10: 141-148.
Li Y, Zheng Y, Addo-Quaye C, Zhang L, Saini A, Jagadeeswaran G, Axtell MJ, Zhang W, Sunkar R: Transcriptome-wide identification of microRNA targets in rice. Plant J. 2010, 62: 742-759. 10.1111/j.1365-313X.2010.04187.x.
Mendes ND, Freitas AT, Sagot MF: Current tools for the identification of miRNA genes and their targets. Nucleic Acids Res. 2007, 8: 2419-2433.
Dsouza M, Larsen N, Overbeek R: Searching for patterns in genomic data. Trends Genet. 1997, 13: 497-498.
Xie FL, Huang SQ, Guo K, Xiang AL, Zhu YY, Nie L, Yang ZM: Computational identification of novel microRNAs and targets in Brassica napus. FEBS Lett. 2007, 581: 1464-1474. 10.1016/j.febslet.2007.02.074.
Fahlgren N, Carrington JC: miRNA Target Prediction in Plants. Methods Mol Biol. 2010, 592: 51-57. 10.1007/978-1-60327-005-2_4.
Zhang Y: miRU: an automated plant miRNA target prediction server. Nucleic Acids Res. 2005, 33: W701-W704. 10.1093/nar/gki383.
Bonnet E, He Y, Billiau K, Peer YV: TAPIR, a web server for the prediction of plant microRNA targets, including target mimics. Bioinformatics. 2010, 12: 1566-1568.
Kruger J, Rehmsmeier M, RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res. 2006, 34: 451-454. 10.1093/nar/gkj455.
Xie F, Zhang B: Target-align: a tool for plant microRNA target identification. Bioinformatics. 2010, 23: 3002-3003.
Dai X, Zhuang Z, Zhao PX: Computational analysis of miRNA targets in plants: current status and challenges. Brief Bioinform. 2011, 12: 115-121. 10.1093/bib/bbq065.
Mückstein U, Tafer H, Hackermüller , Bernhart SH, Stadler PF, Hofacker IL: Thermodynamics of RNA-RNA binding. Bioinformatics. 2006, 22: 1177-1182. 10.1093/bioinformatics/btl024.
Ding Y, Chan CY, Lawrence CE: Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res. 2004, 32: W135-W141. 10.1093/nar/gkh449.
Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nat. Genet. 2007, 39: 1278-1284. 10.1038/ng2135.
Thadani R, Tammi MT: MicroTar: predicting microRNA targets from RNA duplexes. BMC Bioinformatics. 2006, 7: S20.
Gardner PP, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics. 2004, 5: 140-10.1186/1471-2105-5-140.
Andronescu M, Zhang Z C, Condon A: Secondary structure prediction of interacting RNA molecules. J Mol Biol. 2005, 4: 987-1001.
Heikham R, Shankar R: Flanking region sequence information to refine microRNA target predictions. J Biosci. 2010, 35: 105-118. 10.1007/s12038-010-0013-7.
Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011, 39: D152-D157. 10.1093/nar/gkq1027.
Tyler Backman, Christopher Sullivan, Jason Cumbie, Zachary Miller, Elisabeth Chapman, Noah Fahlgren, Scott Givan, James Carrington, Kristin Kasschau: Update of ASRP: the Arabidopsis Small RNA Project database. Nucleic Acids Res. 2008, 36: D982-D985.
Jagadeeswaran G, Zheng Y, Li Y, Shukla LI, Matts J, Hoyt P, Macmil SL, Wiley GB, Roe BA, Zhang W, Sunkar R: Cloning and characterization of small RNAs from Medicago truncatula reveals four novel legume-specific microRNA families. New Phytol. 2009, 184: 85-98. 10.1111/j.1469-8137.2009.02915.x.
Li B, Qin Y, Duan H, Yin W, Xia X: Genome-wide characterization of new and drought stress responsive microRNAs in Populus euphratica. J Exp Bot. 2011, 10.1093/jxb/err051
Collobert R, Bengio S: SVMTorch: support vector machines for large-scale regression problems. The Journal of Machine Learning Research. 2001, 1: 143-160.
Dai X, Zhao PX: psRNATarget; a plant small RNA target analysis server. Nucleic Acids Res. 2011, 1-5.
Kertesz M, Iovino N, Unnerstall U, Gaul U, Eran Segal E: The role of site accessibility in microRNA target recognition. Nature Genetics. 2007, 39: 1278-1284. 10.1038/ng2135.
Jones-Rhoades MW, Bartel DP, Bartel B: MicroRNAS and their regulatory roles in plants. Annu Rev Plant Biol. 2006, 57: 19-53. 10.1146/annurev.arplant.57.032905.105218.
Wang XJ, Reyes JL, Chua NH, Gaasterland T: Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 2004, 5: R65-10.1186/gb-2004-5-9-r65.
Moldovan D, Spriggs A, Yang J, Pogson BJ, Dennis ES, Wilson IW: Hypoxia-responsive microRNAs and trans-acting small interfering RNAs in Arabidopsis. J Exp Bot. 2010, 61: 165-77. 10.1093/jxb/erp296.
Jones-Rhoades MW, Bartel DP: Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol Cell. 2004, 18: 787-99.
Gang Wu, Scott Poethig R: Temporal regulation of shoot development in Arabidopsis thaliana by miR156 and its target SPL3. Development. 2006, 133: 3539-47. 10.1242/dev.02521.
Moxon S, Jing R, Szittya G, Schwach F, Rusholme Pilcher RL, Moulton V, Dalmay T: Deep sequencing of tomato short RNAs identifies microRNAs targeting genes involved in fruit ripening. Genome Res. 2008, 18: 1602-1609. 10.1101/gr.080127.108.
Wang Jia-Wei, Mee Park, Wang Ling-Jian, Koo Yeonjong, Chen Xiao-Ya, Weigel Detlef, Poethig RS: MiRNA Control of Vegetative Phase Change in Trees. PLoS Genet. 7: e1002012.
We thank Heikham Russiachand Singh, Vandna Chawla and Mrigaya Mehra for helping us in this study. Ashwani Jha is thankful to Department of Biotechnology ( DBT, Govt. of India ) for his fellowship. The MS has IHBT communication ID: 2212.
The work was supported by Department of Biotechnology(DBT), Government of India, through project grant: BTPR/11098/BID/07/261/2008.
AJ developed the codes, implemented the multi-core parallel versions, developed the web-server and GUI versions of the tool and conducted the entire study and performed the analysis part. RS planned and designed the entire study, developed the computational protocols, algorithms and theory, core basic codes, performed analysis and supervised the entire study. AJ and RS drafted the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1:Performance tests and benchmarking related details. This additional file contains the details about the performance benchmarking and tests done for p-TAREF. In overall six different major tests were done for performance benchmarking. (PDF 277 KB)
Additional file 2:miRNA target predictions made on rice transcriptome. This file contains result data on Rice transcriptome specific miRNA targets, with corresponding targeting miRNA, encoded interaction pattern differences and SVR score details. (XLS 5 MB)
Additional file 3:Expression correlation between miRNAs and targets. The file contains details about the miRNA targets found in Rice transcriptome, along with expression correlation values between the target and targeting miRNA. (XLS 5 MB)
Additional file 4:miRNA groups and their corresponding functional category enrichments with p-values. miRNA targets in Rice transcriptome were grouped according to the miRNA targeting them and their associated GO functional categories for Molecular function and Biological processes. (PDF 31 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Jha, A., Shankar, R. Employing machine learning for reliable miRNA target identification in plants. BMC Genomics 12, 636 (2011). https://doi.org/10.1186/1471-2164-12-636