Employing machine learning for reliable miRNA target identification in plants
© Jha and Shankar; licensee BioMed Central Ltd. 2011
Received: 8 September 2011
Accepted: 29 December 2011
Published: 29 December 2011
miRNAs are ~21 nucleotide long small noncoding RNA molecules, formed endogenously in most eukaryotes, which mainly control their target genes post-transcriptionally by interacting with and silencing them. While many tools have been developed for animal miRNA target identification, plant miRNA target identification has witnessed limited development. Most of the plant tools are centered on exact complementarity matching. Very few of them consider other factors such as multiple target sites and the role of flanking regions.
In the present work, a Support Vector Regression (SVR) approach has been implemented for plant miRNA target identification, utilizing position-specific dinucleotide density variation information around the target sites, to yield highly reliable results. It has been named p-TAREF (plant-Target Refiner). The performance of p-TAREF was compared rigorously with other prediction tools for plants, and p-TAREF was found to perform better in several aspects. Further, p-TAREF was run over experimentally validated miRNA targets from species like Arabidopsis, Medicago, Rice and Tomato, and detected them accurately, suggesting general usability of p-TAREF across plant species. Using p-TAREF, target identification was done for the complete Rice transcriptome, supported by expression and degradome based data. miR156 was found to be an important component of the Rice regulatory system, where control of genes associated with growth and transcription looked predominant. The entire methodology has been implemented in a multi-threaded parallel architecture in Java, to enable fast processing in the web-server version as well as the standalone version. This also allows it to run in concurrent mode even on a simple desktop computer. In its web-server version, it also provides a facility to gather experimental support for the predictions made, through on-the-spot expression data analysis.
A machine learning based multivariate feature tool has been implemented in a parallel and locally installable form for plant miRNA target identification. Its performance was assessed and compared through comprehensive testing and benchmarking, suggesting reliable performance and general usability for transcriptome-wide plant miRNA target identification.
miRNAs have emerged as major regulatory components of the cell system and are active in almost all multicellular organisms. These noncoding RNA elements are around 21 nucleotides long and bind target mRNA sequences that share complementarity with the targeting miRNA sequences. For a long time it has been believed that miRNA targeting in plants requires almost complete complementarity, while in animals complementarity is incomplete and seed regions play the critical role in binding and subsequent targeting [1, 2]. Some recent studies have suggested that translational repression and some inexact complementarity exist in plant miRNA targeting too [3–5]. Encouraged by these findings, some groups have started looking into such aspects in more detail, studying interactions which may not display exact complementarity as well as instances which are left undetected by existing plant miRNA target prediction tools [5, 6]. Li et al conducted an experiment from which they suggested that complementarity and homology based target identification tools, which constitute the major approach to target identification in plants, may miss several valid targets. Such targets may not actually obey conservation, homology or exact complementarity. The major drawbacks of most existing plant miRNA target prediction tools are that they follow exact complementarity, that most of them do not consider any contribution of flanking region sequence to improve target prediction, and that they hardly leverage powerful approaches like machine learning to handle multiple features for more accurate target prediction. Most of them are also unable to handle genome or transcriptome wide data in realistic time, as they are serially coded and web-server based. A major reason could be the predominant belief that, unlike in animal systems, targeting in plants is not very complex.
Owing to this, tools centered on exact complementarity search were used for plant target prediction, while animal target identification witnessed a large number of innovations. A few of the most frequently used plant miRNA target prediction tools relied strongly upon exact pattern search and local alignments. PatScan was a tool developed to look for exactly matching patterns for targets, where users could modify the match and mismatch values as well as allow for wobble. However, this tool did not consider bulges or seed-specific scoring, and its use has been nonspecific, as it is also used for other pattern match based purposes besides target finding. Another tool, miRNAassist, used BLAST to search for regions complementary to miRNAs. Using BLAST, already known miRNAs from other species were used as a database to search against Brassica EST sequences. Following an almost similar approach, the Carrington group proposed another protocol where BLAST was replaced by FASTA34. They also introduced some alignment scoring rules to separate the seed region from the rest of the regions, as well as relaxed values for mismatches and wobbles. However, BLAST based approaches work well only when the query is long; for short sequences, hits come up with very low significance and are hard to distinguish from random hits. Considering this, Zhang developed a new tool, miRU, which replaced BLAST with Smith-Waterman local alignment, weighted seed regions more heavily and allowed bulges. All of these tools were centered on complementarity search. Acknowledgement of the limitations of exact complementarity and alignment based methods became conspicuous with the release of new generation tools like TAPIR. TAPIR works with two different options: 1) scanning for targets using FASTA program based alignment, or 2) applying the more sensitive approach of running RNAhybrid, which considers thermodynamic and mismatch factors together.
The use of RNAhybrid in the back-end also ensured that, unlike previously employed tools, TAPIR was able to detect multiple target sites in a given mRNA sequence. Around the same time, Xie and Zhang developed a novel tool, Target-align. Target-align performs alignments under a set of rules concerning the number of allowed mismatches, consecutive mismatches, the number of allowed gaps and strict mismatch conditions in the seed region. However, unlike TAPIR, the focus of Target-align was on Smith-Waterman based alignment for complementarity search under several conditions. An advantage of Target-align has been its availability as a local standalone version, unlike the majority of plant miRNA target identification tools. Very recently, Dai et al acknowledged the various lacunae in existing plant miRNA target identification tools, including the centrality of the alignment based approach, no proper consideration of imperfect complementarity, no consideration of the role of flanking regions, the inability to detect multiple sites, and the unavailability of locally downloadable standalone versions for large and genomic scale studies. Considering these demerits, this group incorporated target site accessibility and flanking regions by using RNAup. RNAup is a tool to predict RNA-RNA interactions, which considers the single strandedness of a given RNA sequence while deriving the partition function for various nucleotides in secondary structures. RNAup and similar approaches have been used frequently for miRNA target identification in animals, with the likes of Sfold, PITA and MicroTAR. However, applications of such tools have some limitations, as they are based on single sequence secondary structure and energy based features, whose accuracy and reliability drop drastically with increasing sequence length [21, 22].
Considering this, Heikham and Shankar had proposed a novel approach to consider the flanking region sequence information, bypassing the issues arising from the limitations of thermodynamics and structure based modeling. It successfully applied varying dinucleotide density profiles with respect to putative target positions to decipher the role of flanking regions in miRNA targeting in animal systems. In plants, such an approach becomes even more relevant because, unlike in animals, where targeting is preferred in the 3' UTR regions, miRNA targeting in plants can occur in any region of the full length mRNA.
In the present work, these findings on the role of flanking region sequence information in determining miRNA targets have been extended by applying and assessing the theory on plant systems too. Here, a reliable machine learning based approach with statistical learning over multiple features has been applied, which has a clear edge over rule based approaches. Arabidopsis thaliana has been used as the source to derive plant specific features, which were modeled using Support Vector Regression to classify as well as to implement an effective scoring scheme through the regression score. Besides this, a concurrent architecture with multiple threads has been implemented, making the tool easily deployable in concurrent mode even on a simple desktop machine and enabling it to scan plant mRNA sequences for targets in a transcriptome wide manner.
Basic working approach
The present work has used several sequence resources. miRNA sequences for plants were downloaded from miRBase version 16. In total, 243 mature miRNA sequences were retrieved for Arabidopsis, 414 for Oryza, 234 for Populus, 51 for Medicago and 37 for Tomato. All these miRNAs have been integrated into the presented tool. Experimentally validated Arabidopsis thaliana miRNA targets and their corresponding targeting miRNAs were retrieved from the ASRP database as well as from the list of miRNA:target pairs validated through RACE PCR, as reported in the supplementary material provided by Beauclair et al. Arabidopsis sequences were downloaded from TAIR, version 10. Experimentally validated targets for Medicago and Rice were retrieved from the literature [7, 26]. Negative instances of false targets were built from the dataset used previously as well as from random sequences [13, 23].
Plant specific encoded interaction pattern generation
Instances were extracted using the list of RACE PCR validated miRNA:target interactions for Arabidopsis, submitted by Beauclair et al in their supplementary material. Experimentally validated miRNA and target interactions for other plant species like Rice, Medicago, Tomato and Populus were also derived from the literature [7, 26, 27]. All miRNAs and target partners were retrieved for a separate run of RNAhybrid. RNAhybrid predicts miRNA:target interactions by considering thermodynamic parameters for interactions and multiple sites, while applying information from statistical distributions in the background. The RNAhybrid run is also a common step between encoded interaction pattern generation for experimentally validated instances and the prediction run over any unknown query sequences, which keeps the approach consistent. The output of RNAhybrid over the experimental datasets provided exact binding pictures of the interactions, which were further refined by applying Stretcher, the Needleman-Wunsch algorithm based global alignment tool from the EMBOSS package. In order to consider the G:U wobble, the scoring matrix was adjusted with a +1 advantage for G:U wobble, a gap opening penalty of -15 and a gap extension penalty of -5. Through this, sequence similarity as well as thermodynamic considerations were implemented to derive the interaction patterns. Using local scripts, all such interactions were converted into single encoded patterns, where the information was reduced to a single dimension holding the match state of each position, i.e. bulge on the miRNA strand, bulge on the target strand, mismatch, match or wobble. All experimentally validated interactions were finally represented in this form only. The same protocol is used by the tool to generate interaction patterns for the predicted targets automatically. For every predicted target, the entire library of experimentally validated encoded patterns is scanned for similarity, with scope to allow for inexactness.
This step defines the primary filtering step, based on the similarity of interaction patterns with experimentally known interactions. At present, a total of 268 different interaction patterns have been included, considering miRNA:target interaction cases from Arabidopsis (157), Medicago (7), Populus (42), Tomato (11) and Rice (51).
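The encoding and library scan described above can be illustrated with a minimal sketch. It assumes the duplex is already available as two gapped strings of equal length, arranged so that column i of the miRNA faces column i of the target. The state alphabet (`M`, `W`, `X`, `b`, `B`), the function names and the way differences are counted are hypothetical choices made for illustration; the text does not specify the symbols or the similarity measure actually used by p-TAREF.

```python
# Watson-Crick pairs and G:U wobble pairs between a miRNA base and a target base.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def encode_interaction(mirna_aln, target_aln):
    """Reduce a gapped miRNA:target duplex to a one-dimensional state string.

    States (hypothetical alphabet): M match, W wobble, X mismatch,
    b bulge on the miRNA strand, B bulge on the target strand.
    """
    if len(mirna_aln) != len(target_aln):
        raise ValueError("aligned strands must have equal length")
    mirna = mirna_aln.upper().replace("T", "U")
    target = target_aln.upper().replace("T", "U")
    states = []
    for m, t in zip(mirna, target):
        if m == "-" and t == "-":
            raise ValueError("column with gaps on both strands")
        if t == "-":
            states.append("b")  # miRNA base has no partner
        elif m == "-":
            states.append("B")  # target base has no partner
        elif (m, t) in WATSON_CRICK:
            states.append("M")
        elif (m, t) in WOBBLE:
            states.append("W")
        else:
            states.append("X")
    return "".join(states)


def pattern_differences(p, q):
    """Count differing positions; any length difference is added on top."""
    differing = sum(1 for a, b in zip(p, q) if a != b)
    return differing + abs(len(p) - len(q))


def closest_patterns(query, library, max_diff=4):
    """Return (differences, pattern) pairs within max_diff, best first."""
    hits = [(pattern_differences(query, ref), ref) for ref in library]
    return sorted(h for h in hits if h[0] <= max_diff)


if __name__ == "__main__":
    encoded = encode_interaction("UGACAG-AAG", "ACUGUUCUGC")
    print(encoded)
    print(closest_patterns(encoded, ["MMMMMMBMXM", "XXXXXXXXXX"]))
```

The `max_diff` argument plays the role of the allowed number of differences from an experimentally validated pattern; a position-by-position count is only one of several plausible ways to measure such inexactness.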
Support Vector Regression (SVR) model building for plants
Unlike rule based approaches to identification and classification, machine learning approaches have emerged as much superior for classification. Among them, the Support Vector Machine (SVM) has proven highly reliable, as it can handle a large number of features together to derive a suitable classifier using multivariate statistical learning, which is comparatively hard to achieve with a rule based approach to classification. Another advantage of SVM is that, unlike other machine learning approaches, it concentrates on evolving a classifier boundary with maximum margins, drastically lowering the chance of misclassification and error. This property is also controlled by the type of kernel selected for training and classification: a linear kernel applies a linear boundary, a Gaussian kernel applies a normal distribution based boundary, while a polynomial kernel is capable of evolving a convoluted boundary to handle cases where instances from different classes are heavily intermixed for the given set of features. The final classification by SVM assigns the classified instances their respective class as either 1, 0 or -1. However, this does not come with any clear confidence value for the classification. This degree of confidence can be derived through a scoring scheme, which is provided by Support Vector Regression (SVR). In the current study, a more evolved Support Vector approach, SVR, has been used to implement training and classification along with a scoring scheme, the regression score. For training, a sequence dataset was formed comprising 104 experimentally validated Arabidopsis sequence instances reported by Beauclair et al (Supplementary Material, 2010) as well as negative target instances used by Heikham and Shankar. The negative target sequences include randomly generated sequences as well as some instances which were predicted as targets but experimentally shown to be false positives.
Flanking regions of 75 bases around the target sites in negative as well as positive instances are considered through 20 bases long sliding windows, estimating the dinucleotide density and its variations with respect to the target site. Discrimination through dinucleotide density variation with respect to position was found to be the best for a window size of 20. A mean distribution based feature selection procedure was applied to learn the most discriminating features in plants. The Support Vector Regression machine was applied through SVMTorch, where every learning instance was converted into a position specific dinucleotide density variation profile with respect to the (possible) target sites. Training and model generation were performed separately for three different kernel classes: linear, Gaussian and polynomial. The best emerging models for plant systems for each kernel class were saved and integrated into the plant target identification tool developed. This way, the user gets three plant models to choose from.
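As a rough illustration of the feature extraction described above, the following sketch computes dinucleotide densities in 20-base sliding windows over the 75-base flanks on either side of a candidate site. The function names, the `step` parameter, the 0-based end-exclusive coordinates and the use of plain fractions as "density" are assumptions made for this example; the exact feature definition and the feature selection used in p-TAREF are not reproduced here.

```python
from itertools import product

# The 16 possible RNA dinucleotides, in a fixed order.
DINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=2)]


def dinucleotide_density(window):
    """Fraction of each dinucleotide among the overlapping pairs of a window."""
    total = len(window) - 1
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(total):
        pair = window[i:i + 2]
        if pair in counts:  # pairs with ambiguous symbols such as N are skipped
            counts[pair] += 1
    return {d: (counts[d] / total if total > 0 else 0.0) for d in DINUCLEOTIDES}


def flank_density_profile(seq, site_start, site_end, flank=75, window=20, step=1):
    """Position-specific dinucleotide densities in both flanks of a site.

    site_start and site_end are 0-based, end-exclusive coordinates of the site.
    Returns (offset, densities) tuples, where offset is the window start
    relative to the site start (upstream, negative) or to the site end
    (downstream, non-negative). Flanks are truncated at the sequence ends.
    """
    seq = seq.upper().replace("T", "U")
    profile = []
    upstream_lo = max(0, site_start - flank)
    for start in range(upstream_lo, site_start - window + 1, step):
        densities = dinucleotide_density(seq[start:start + window])
        profile.append((start - site_start, densities))
    downstream_hi = min(len(seq), site_end + flank)
    for start in range(site_end, downstream_hi - window + 1, step):
        densities = dinucleotide_density(seq[start:start + window])
        profile.append((start - site_end, densities))
    return profile


if __name__ == "__main__":
    toy = "AC" * 100  # toy sequence, not real data
    prof = flank_density_profile(toy, 90, 110)
    print(len(prof), prof[0][0], round(prof[0][1]["CA"], 3))
```

Each window contributes 16 values, so a full profile can be flattened into one feature vector per candidate site before it is passed to a regression model.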
Expression data support integration and visualization
Various array expression experiments and data (Affymetrix Rice Genome Array, Affymetrix Arabidopsis Tiling Array 1.0 R and AT-TAX) were used in the present study. Data normalization was done using the gcRMA method implemented in the "R" statistical package. The expression data ('.CEL' format) was downloaded from GEO for Oryza sativa. Expression studies and data for 17 Oryza miRNA families (156, 159, 160, 166, 168, 172, 396, 444, 528, 806, 810, 820, 1318, 1875, 2055, 2906, 395) and 57,359 RNA sequences (excluding miRNAs) were used. For Arabidopsis miRNAs, the available expression related studies and data for 31 miRNA families (156, 157, 159, 163, 164, 165, 166, 167, 169, 171, 172, 319, 390, 391, 393, 394, 396, 398, 399, 401, 403, 404, 405, 406, 407, 413, 414, 417, 447, 824, 834) and 30,166 mRNA transcripts were considered. For several of these array based experimental datasets, RT-PCR based validations for sets of associated representative genes were reported by the submitting authors. The RNA sequences for Arabidopsis were downloaded from TAIR and the Oryza RNA sequences from RiceGE.
To calculate the correlation coefficient, the submitted target(s) is first searched in the locally installed database of Oryza or Arabidopsis (as chosen by the user) using BLASTn. The topmost hit amongst all the hits found by BLASTn is extracted. The identifier of the best hit is scanned across the inbuilt library of expression data files to finally calculate the Pearson correlation coefficient for co-expression. Modules for scanning and data parsing for the expression correlation analysis were implemented through code developed in PERL, PHP and Java. The miRNA:target association graph was generated using Graphviz and the Java libraries JGraphT and JGraph.
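The co-expression measure named above is the standard Pearson correlation coefficient. A minimal sketch follows, assuming the expression values of the miRNA and of the best-hit transcript have already been extracted as two equally long lists over the same samples; the function name and the example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical normalized expression values across four samples.
    mirna_expr = [1.0, 2.0, 3.0, 4.0]
    target_expr = [8.0, 6.0, 4.0, 2.0]
    print(pearson(mirna_expr, target_expr))
```

A coefficient close to -1 indicates inverse expression of miRNA and target, which is the pattern expected when targeting leads to transcript degradation.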
Introduction of Concurrency
Concurrency enables the system to perform the same task at higher speed by harnessing the available logical processors on a given machine. Currently, even a simple desktop or laptop comes with a multicore CPU having two or more processors/cores, and the count can go up to more than 50 in current generation servers. Concurrency was implemented using the Java Concurrent Library (JCL), applying multi-threaded processing of tasks. The developed tool provides the user an option to select the total number of processors to be used for target scanning. Accordingly, multiple threads are created to process the query sequences. A single query sequence is chopped, in an overlapping manner, into several small subsequences of minimum 50 bp length (considering that a miRNA:target interaction usually stays below 50 bp), which are distributed across the selected number of processors to run the subsequent steps of target identification. For every such processor and batch of allocated sequences, RNAhybrid is run separately, and its output is manipulated and parsed for coordinates separately and concurrently. Similarly, the alignment step is run concurrently. Only the Support Vector Regression step is not concurrent, as it is already quite fast. The RNAhybrid, alignment, parsing and union steps are quite time consuming, and the application of concurrency saves time by providing manifold acceleration when analyzing large amounts of data.
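The chop, distribute and union scheme described above can be sketched as follows. This is an illustration in Python rather than the Java implementation of the tool. The chunk length of 100 and overlap of 50 are assumed values, chosen so that any site of up to 51 bases lies completely inside at least one chunk, and `scan_chunk` is only a placeholder (an exact motif search) standing in for the per-chunk RNAhybrid run, parsing and alignment.

```python
from concurrent.futures import ThreadPoolExecutor


def overlapping_chunks(seq, chunk_len=100, overlap=50):
    """Split seq into overlapping (offset, subsequence) pieces."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    step = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, seq[start:start + chunk_len]))
        if start + chunk_len >= len(seq):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks


def scan_chunk(offset, chunk, motif):
    """Placeholder for the per-chunk work; reports absolute hit coordinates."""
    hits = []
    pos = chunk.find(motif)
    while pos != -1:
        hits.append(offset + pos)
        pos = chunk.find(motif, pos + 1)
    return hits


def scan_concurrently(seq, motif, n_workers=4, chunk_len=100, overlap=50):
    """Scan all chunks on a pool of worker threads and merge the hits."""
    chunks = overlapping_chunks(seq, chunk_len, overlap)
    merged = set()  # union step: hits seen in two overlapping chunks count once
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for hits in pool.map(lambda c: scan_chunk(c[0], c[1], motif), chunks):
            merged.update(hits)
    return sorted(merged)


if __name__ == "__main__":
    query = "A" * 120 + "GATTACA" + "C" * 200 + "GATTACA" + "A" * 30
    print(scan_concurrently(query, "GATTACA", n_workers=3))
```

Because neighbouring chunks overlap, the same site can be reported twice, which is why the merge works on absolute coordinates. In Python, threads mainly pay off when the per-chunk work is an external program or I/O bound; for pure Python computation a process pool would be the more natural choice.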
Standalone and Server Implementation
The entire tool has been developed as a web-server as well as a Linux based standalone GUI version. The web-server version has been developed using Linux-Apache-PHP, along with concurrency. The standalone version has core programs and scripts written in Python, PERL, Java and C, while its GUI wrapper has been developed using the Qt C++ GUI library. The standalone version, too, supports concurrency.
ROC curve analysis based on 10-fold cross-validation was performed to estimate the performance and robustness of the classifier models and associated tests.
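The evaluation named above can be illustrated with a small, generic sketch: one helper splits instance indices into k folds, and two others turn labels and regression scores into ROC points and the area under the curve. The names and the toy scores are hypothetical, and model training inside each fold is left out; this is not the evaluation code used for p-TAREF.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the score threshold stepwise.

    Instances with label 1 are positives; every other label is a negative.
    """
    pos = sum(1 for label in labels if label == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are needed for a ROC curve")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    labels = [1, 1, -1, -1]
    scores = [0.9, 0.4, 0.6, -0.8]  # toy regression scores
    print(auc(roc_points(labels, scores)))
```

In a 10-fold setting, each fold in turn serves as the held-out set, the model is trained on the remaining nine, and the held-out scores are pooled or averaged before the curve is drawn.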
Gene Ontology and enrichment studies
Gene Ontology information for the Rice transcriptome was derived from Ensembl Plants. Enrichment analysis for gene categories predominant in the miRNA target system was conducted in two different ways: A) using multiple Binomial tests and B) using Hyper-geometric exact tests. The null hypothesis was derived using the distribution of various GO categories and their terms in the whole transcriptome of rice. For the multiple Binomial tests, we developed an in-house script in "R", while the hyper-geometric tests were conducted using the BiNGO module of Cytoscape.
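Both enrichment tests mentioned above reduce to upper-tail probabilities, which can be sketched with the standard library alone. The function names and the numbers in the usage example are hypothetical; the in-house R script and the BiNGO settings used in the study are not reproduced here.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for the hypergeometric distribution.

    N: genes in the background transcriptome, K: background genes annotated
    with the GO term, n: genes in the target set, k: target genes annotated
    with the GO term.
    """
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for a binomial distribution with n trials and success rate p.

    Here p is the background frequency of the GO term in the transcriptome.
    """
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical counts: 30 of 500 target genes carry a term that
    # annotates 800 of 50000 background genes.
    print(hypergeom_upper_tail(30, 500, 800, 50000))
    print(binomial_upper_tail(30, 500, 800 / 50000))
```

When many GO terms are tested at once, the resulting p-values are normally corrected for multiple testing before terms are reported as enriched.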
Results and Discussion
Web interface of p-TAREF server and GUI Standalone
The server version also provides a provision to measure expression correlation, based on expression data, between the given user query and the associated miRNA found targeting it. The user is asked to select the species to which the sequence belongs or in which it is expected to have a homologous sequence. The server has inbuilt, normalized expression data for plant miRNAs as well as genes, currently for Arabidopsis and Rice. Along with the expression data, the associated mRNA sequences are also formatted for similarity search tools like BLAST, which is enabled to run on multiple processors. Once the user opts for the species to be scanned for the target gene, the server performs a BLAST run to find the longest and most identical hit, i.e. the one most similar to the query sequence. The corresponding expression data for the target and the targeting miRNA are retrieved for expression correlation measurement, which is displayed to the user. The publicly available expression data for all known plant miRNAs and genes will be continuously updated with every release and for various species. It needs to be mentioned that array expression data may not be of much use in cases of translational repression by miRNA. An analogous facility may be provided in future for targeting cases where translational repression could be involved. The final output page displays the target sequence ID, the targeting miRNA, the predicted interaction pattern and the closest experimentally validated pattern along with its partner miRNA, the SVR score and the choice to scan for expression analysis based validation across different species. The SVR score is positive for potential miRNA targets and negative for non-targets. The higher the absolute value of the SVR score, the better the confidence of classification.
Impact of concurrency in p-TAREF.
# of processor/Mismatches
1 Hour 43 min
3 Hours 21 min
5 Hours 01 min
8 Hours 37 min
1 Hours 17 min
3 Hours 00 min
4 Hours 34 min
6 Hours 07 min
2 Hours 21 min
3 Hours 53 min
5 Hours 42 min
1 Hours 52 min
3 Hours 14 min
4 Hours 21 min
1 Hours 14 min
2 Hours 05 min
3 Hours 01 min
92 Hours 26 min
Performance comparison between psRNA-target, Target-align and p-TAREF.
P-TAREF (polynomial kernel)
Beauclair et al.
Beauclair et al.
Beauclair et al.
Performance comparison between TAPIR, Target-align and p-TAREF for Target-align/TAPIR Reference dataset for benchmarking.
TP Rate %
FP Rate %
Besides this, p-TAREF was also compared with psRNAtarget on the experimentally validated dataset which was used previously for performance benchmarking of psRNAtarget [31–34]. Of all 46 experimentally validated target instances, p-TAREF identified 45. Further, experimentally validated target instances specific to Tomato, Populus and Medicago were collected and the performance of p-TAREF was measured on them. For the nine available experimentally validated target instances of Medicago truncatula specific miRNAs, p-TAREF scored 100%. For all eight available experimentally validated targets from tomato, p-TAREF attained 100% accuracy. For Populus trichocarpa, 17 out of 21 experimentally validated and submitted instances were available, and 16 of these targets were identified successfully, giving an accuracy of 94.11%. For Populus euphratica, 21 targets out of 24 known instances were successfully identified (accuracy = 87.5%). All the details regarding performance, benchmarking and associated tests are explained on the performance page of the server as well as in Additional File 1.
Target identification in Rice transcriptome and emergence of miR156 as a prominent regulator
This part of the study began with validation and performance benchmarking over the already known, experimentally validated miRNA target instances in the rice transcriptome. Recently, Sunkar's group performed a degradome sequencing based study and reported 153 miRNA targets. For 29 rice specific miRNAs, the authors had reported 56 targets. The same experiment was used to validate the targets identified by p-TAREF in the rice transcriptome. Sequence data was available for 52 such target genes, and p-TAREF identified most of the targets, with an overall accuracy of 97.33%. Encouraged by this, whole transcriptome analysis for miRNA targets in rice transcriptome sequences was carried out. The sequences on which the above mentioned analysis had already been performed were excluded, in order to avoid redundancy, focus on new targets and save time.
p-TAREF was run over 57,995 mRNA sequences from the rice transcriptome dataset, allowing up to four mismatches between experimental and predicted interaction patterns and using the polynomial kernel plant model. Initially, a total of 36,916 targets were identified with up to four differences from experimentally validated target:miRNA interaction patterns. A total of 7,996 unique genes were found to be targeted. Additional File 2 contains details of all identifications made at different mismatch levels. To validate the predicted targets with the support of experimental data, the microarray expression data for all of the predicted target:miRNA pairs was checked. Out of the 36,916 predicted miRNA targets, expression data was available for 33,709 pairs to estimate the expression correlation between the target gene and the corresponding miRNA. For 27,586 of these predicted target:miRNA pairs (81.8%), inverse expression correlation was observed across different experimental conditions and tissue types, suggesting strong concordance with the predictions. The expression correlation was compared with the respective SVR scores, and a reasonable agreement between the two was found, with a Pearson correlation coefficient of 0.7. The remaining 18.2% of identified targets showed no agreement with expression correlation; these may include conditions like translational repression by miRNAs, which cannot be interpreted well through inverse correlation estimation. It needs to be mentioned here that expression data has certain limitations for inference. It is useful in cases of transcript disruption, which is the most prevalent mode in plants. However, although translational repression has been reported to be more prevalent than transcript decay during miRNA targeting in animal systems, recent studies have reported the existence of translational repression in plants too, as discussed above.
In such conditions, array expression data may not be of much help in inferring the process of targeting by microRNAs.
From this study, the miR156 family emerged as an important miRNA family in the Oryza system, with the largest number of targets (526 unique genes), many of which also scored high for negative expression correlation with miR156. One possible reason for observing such a high number of targets for miR156 could be the purine richness (GA/AG tracts) of the miR156 sequence, causing poly-pyrimidine regions to be counted as targets due to complementarity. Though the algorithm design of p-TAREF has the capacity to minimize noise, especially that arising through mere complementarity, a couple of analyses were performed to verify the above mentioned possibility. A permuted miR156 sequence was generated while keeping the dinucleotide composition constant. If the polypyrimidine tracts could influence the result significantly, one would expect the frequency of targets for such a permuted miRNA with identical dinucleotide composition to be at almost the same level. However, when p-TAREF was run with the most liberal parameters to find targets of the permuted miR156, only 105 genes were found to be targeted, and when only miR156 specific encoded interaction patterns were considered for comparison, absolutely no hit was found for the permuted miR156. The same test was repeated with a few more permuted miRNAs and an almost similar pattern of a lower number of random targets was observed, with absolutely no targets reported when miR156 specific encoded interaction patterns were considered. This suggests high reliability of the identifications made by p-TAREF, where the user can also apply the different option parameters to limit the results to those of interest. Further, a search for polypyrimidine SSR regions in the rice transcriptome reported ~1000 genes with polypyrimidine tracts. When these were mapped against the target genes of miR156, only 56 genes were found common between the two sets. For several of these 56 genes, the target site was found to be non-overlapping with the polypyrimidine tracts.
Therefore, these findings suggest a very limited possible role of repetitiveness/randomness in the observed abundance of miR156 targets. It also needs to be mentioned that the stated number of target genes for miR156 is the gross number of targets obtained with the parameters described at the beginning of this section. The search could be refined further by applying the various filters and options provided with p-TAREF, including SVR score cut-off, interaction pattern differences and expression correlation score. Additional Files 2 and 3 hold all such details for rice, which could be used to refine the results further, based upon filters like SVR score, correlation score, differences in encoded pattern, selection of miRNA specific encoded patterns, etc. Applying one such cut-off for inverse expression correlation, we performed an analysis of the top scoring targets for miR156, as listed in the table of identified miR156 targets below.
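The permutation control discussed above relies on shuffled sequences that keep the dinucleotide composition of the original miRNA. The text does not state how the permuted sequences were generated, so the following is only one simple way to obtain such a sequence: two non-adjacent blocks are swapped at random and the swap is kept only if the dinucleotide counts are unchanged. The function names, the number of attempts and the input sequence (an illustrative purine-rich 20-mer) are assumptions of this sketch.

```python
import random
from collections import Counter


def dinucleotide_counts(seq):
    """Counts of all overlapping dinucleotides in seq."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def permute_preserving_dinucleotides(seq, attempts=2000, seed=None):
    """Shuffle seq by random block swaps that keep dinucleotide counts intact.

    A candidate swap is accepted only if the dinucleotide counts of the
    result equal those of the input, so the returned sequence always has
    the same dinucleotide (and mononucleotide) composition as seq.
    """
    n = len(seq)
    if n < 4:
        return seq  # too short for two separated blocks
    rng = random.Random(seed)
    target = dinucleotide_counts(seq)
    current = seq
    for _ in range(attempts):
        i, j, k, m = sorted(rng.sample(range(n + 1), 4))
        # Swap block current[i:j] with block current[k:m]; the stretch
        # current[j:k] between them stays in the middle.
        candidate = (current[:i] + current[k:m] + current[j:k]
                     + current[i:j] + current[m:])
        if dinucleotide_counts(candidate) == target:
            current = candidate
    return current


if __name__ == "__main__":
    original = "UGACAGAAGAGAGUGAGCAC"  # illustrative purine-rich input
    shuffled = permute_preserving_dinucleotides(original, seed=1)
    print(shuffled)
    print(dinucleotide_counts(shuffled) == dinucleotide_counts(original))
```

This rejection scheme is easy to verify but does not sample all valid permutations uniformly; Euler-path based shuffling algorithms are the usual choice when uniform sampling matters.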
Identified targets of miR156 in the rice transcriptome.
SAM domain containing protein, putative, expressed
retrotransposon protein, putative, unclassified
transposon protein, putative, CACTA, En/Spm sub-class, expressed
LSM domain containing protein, expressed
FAD-linked sulfhydryl oxidase ALR, putative, expressed
conserved hypothetical protein
kinase, pfkB family, putative, expressed
dof zinc finger domain containing protein, putative, expressed
MYB family transcription factor, putative, expressed
ethylene-responsive transcription factor ERF020, putative, expressed
DNA-directed RNA polymerases I, II, and III subunit RPABC1, putative, expressed
h/ACA ribonucleoprotein complex subunit 1-like protein 1, putative, expressed
OsPP2Ac-3 - Phosphatase 2A isoform 3 belonging to family 1, expressed
dehydrogenase, putative, expressed
amine oxidase, putative, expressed
bHelix-loop-helix transcription factor, putative, expressed
RNA recognition motif containing protein, putative, expressed
RNA recognition motif containing protein, putative, expressed
zinc finger C-x8-C-x5-C-x3-H type family protein, expressed
zinc finger DHHC domain-containing protein, putative, expressed
transposon protein, putative, CACTA, En/Spm sub-class, expressed
transposon protein, putative, unclassified, expressed
microtubule-binding protein TANGLED1, putative, expressed
growth-regulating factor, putative, expressed
protein kinase domain containing protein, expressed
protein kinase domain containing protein, expressed
RNA polymerase subunit, putative, expressed
ubiquitin carboxyl-terminal hydrolase 14, putative, expressed
transposon protein, putative, unclassified, expressed
ribosomal protein L51, putative, expressed
CGMC_GSK.8 - CGMC includes CDA, MAPK, GSK3, and CLKC kinases, expressed
Ser/Thr protein phosphatase family protein, putative, expressed
adenylate kinase, putative, expressed
plant protein of unknown function domain containing protein, expressed
amine oxidase, putative, expressed
L1P family of ribosomal proteins domain containing protein, expressed
CPuORF8 - conserved peptide uORF-containing transcript, expressed
N-rich protein, putative, expressed
RNA recognition motif containing protein, putative, expressed
RNA pseudouridine synthase, putative, expressed
ribosomal protein L24, putative, expressed
2-aminoethanethiol dioxygenase, putative, expressed
Sad1/UNC-like C-terminal domain containing protein, putative, expressed
hyaluronan/mRNA binding family domain containing protein, expressed
STRUBBELIG-RECEPTOR FAMILY 7 precursor, putative, expressed
ribosomal protein S2, putative
conserved hypothetical protein
Top 20 most significant GO terms found associated with miR156 targets in the rice transcriptome.
cellular protein metabolic process
cytosolic large ribosomal subunit
copper ion binding
aspartic-type endopeptidase activity
response to cadmium ion
aspartate kinase activity
mitochondrial inner membrane
DNA-directed DNA polymerase activity
zinc ion binding
cellular amino acid biosynthetic process
cytosolic small ribosomal subunit
ubiquitin thiolesterase activity
microtubule motor activity
cellular amino acid metabolic process
triose-phosphate isomerase activity
intracellular protein transport
branched-chain-amino-acid transaminase activity
protein import into nucleus, docking
structural constituent of ribosome
nucleic acid binding
translation initiation factor activity
branched chain family amino acid metabolic process
ubiquitin-dependent protein catabolic process
glyceraldehyde-3-phosphate dehydrogenase activity
embryo development ending in seed dormancy
COPI vesicle coat
glyceraldehyde-3-phosphate dehydrogenase (NAD+) (phosphorylating) activity
unfolded protein binding
small ribosomal subunit
hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides
response to hormone stimulus
Previous studies have reported a critical role of miR156 in plant growth and in developmental stage transitions such as flowering, fruit ripening and shoot development, through control of important transcription factors like SPL [35, 36]. Some recent studies suggest that miR156 could be an eternal regulator of vegetative growth in plants and is critical in growth phase transitions. The present study found a strong affinity of miR156 for genes involved in transcription, growth and development, which agrees with the findings of these earlier studies.
As in animals, the flanking regions appear to play a critical role in determining miRNA targets in plants. This was successfully tested for plants and implemented in the developed tool, p-TAREF. The tool works on a statistical machine learning principle, deriving a maximum-margin decision boundary over multiple variables, which in the present work are plant-specific dinucleotide density profile variations relative to the possible target position. Confidence in the assigned class is derived from the Support Vector Regression (SVR) score. In addition, the concurrent implementation lets p-TAREF use multiple processors for accelerated processing, on a simple desktop machine as well as on large servers. The p-TAREF web-server also offers expression-based evidence for predicted targets, which adds confidence to a prediction beyond the SVR score. The expression data and other associated publicly available information will be updated regularly as new data sources are released. The expression analysis and data in the present work were mainly based on array experiments, which have some innate limitations. Although such array experiments may not produce the most accurate expression results, they have been used extensively for genome-wide expression and abundance analysis and may provide a reasonable estimate of expression. For several of these experiments, RT-PCR based validation had been reported for representative genes. More sensitive expression data from NGS and RT/q-PCR could be added in upcoming versions of p-TAREF, depending on the kind of experiments performed on these platforms and their public availability.
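To make the idea of flanking-region dinucleotide density features concrete, the following is a minimal sketch under stated assumptions. It computes the relative frequency of each of the 16 dinucleotides in a fixed-length window upstream and downstream of a candidate site and processes several candidates concurrently. The function names, the 30-nucleotide flank length and the use of a thread pool are illustrative choices; the actual p-TAREF feature encoding and its Java implementation are not reproduced here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# The 16 RNA dinucleotides in a fixed order: AA, AC, AG, AU, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=2)]


def dinucleotide_density(window):
    """Fraction of each dinucleotide among the overlapping pairs in window."""
    pairs = [window[i:i + 2] for i in range(len(window) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    # Pairs with ambiguous bases (e.g. N) count toward total but get no slot.
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def flank_features(seq, site_start, site_end, flank=30):
    """32 features: densities for the upstream flank, then the downstream flank.

    site_start and site_end are 0-based, end-exclusive site coordinates;
    the flank length is an assumed value for illustration only.
    """
    seq = seq.upper().replace("T", "U")
    upstream = seq[max(0, site_start - flank):site_start]
    downstream = seq[site_end:site_end + flank]
    return dinucleotide_density(upstream) + dinucleotide_density(downstream)


def featurize_all(candidates, workers=4):
    """Compute features for (seq, site_start, site_end) tuples concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: flank_features(*c), candidates))


features = featurize_all([("AAAACCCCGGGG", 4, 8), ("ACGTACGTACGT", 4, 8)])
```

The resulting vectors could serve as input to any support vector regression implementation. Note that Python threads illustrate the concurrent structure but do not give the multi-core speed-up that a Java thread pool provides for CPU-bound work; a process pool would be the closer Python analogue.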
For performance assessment, p-TAREF was compared comprehensively with the most recent contemporary tools for miRNA target identification in plants, and the comparison suggested better performance by p-TAREF. Using p-TAREF, targets were identified across the whole rice transcriptome, where miR156 emerged as a critical miRNA in the rice system. The reported targets were validated in two ways: through support from co-expression data and through accurate identification of targets based on degradome analysis. The identified targets could be an important resource for obtaining a clearer picture of regulation in rice. Altogether, p-TAREF could be very helpful for the study of gene regulation, and it becomes more relevant given the amount of data being produced by next generation sequencing projects, where it could be applied to novel plant transcriptomes to discover miRNA targets.
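As a reminder of how such performance comparisons are typically quantified, the sketch below computes sensitivity, specificity and the Matthews correlation coefficient from confusion-matrix counts. These are the standard definitions; the function names and the example counts are hypothetical and are not results from this study.

```python
from math import sqrt


def sensitivity(tp, fn):
    """Fraction of true targets that were predicted as targets."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-targets that were predicted as non-targets."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a perfect, an inverted and an uninformative predictor.
perfect = matthews_corrcoef(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=5, fn=5)
uninformative = matthews_corrcoef(tp=5, tn=5, fp=5, fn=5)
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), which makes it a useful single summary when positive and negative sets differ in size.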
Availability and requirements
Project name: p-TAREF
Operating system(s): Platform independent web-server version as well as Linux specific standalone version.
Programming language: Python, PERL, Java, R
Other requirements: The web-server is recommended for a single sequence or a small number of sequences. For batch mode analysis, the standalone GUI version is preferred.
Any restrictions to use by non-academics: None
List of Abbreviations
Receiver Operating Characteristic Curve
Matthews Correlation Coefficient
Area Under Curve
Next Generation Sequencing
Graphical User Interface
Java Concurrent Library.
We thank Heikham Russiachand Singh, Vandna Chawla and Mrigaya Mehra for helping us in this study. Ashwani Jha is thankful to Department of Biotechnology (DBT, Govt. of India) for his fellowship. The MS has IHBT communication ID: 2212.
The work was supported by Department of Biotechnology (DBT), Government of India, through project grant: BTPR/11098/BID/07/261/2008.
- Rhoades MW, Reinhart BJ, Lim LP, Burge CB, Bartel B, Bartel DP: Prediction of plant microRNA targets. Cell. 2002, 110: 513-520. 10.1016/S0092-8674(02)00863-2.
- Dugas DV, Bartel B: Sucrose induction of Arabidopsis miR398 represses two Cu/Zn superoxide dismutases. Plant Mol Biol. 2008, 67: 403-417. 10.1007/s11103-008-9329-1.
- Brodersen P, Sakvarelidze-Achard L, Bruun-Rasmussen M, Dunoyer P, Yamamoto YY, Sieburth L, Voinnet O: Widespread Translational Inhibition by Plant miRNAs and siRNAs. Science. 2008, 320: 1185-1190. 10.1126/science.1159151.
- Lanet E, Delannoy E, Sormani R, Floris M, Brodersen P, Crété P, Voinnet O, Robaglia C: Biochemical Evidence for Translational Repression by Arabidopsis MicroRNAs. Plant Cell. 2009, 21: 1762-1768. 10.1105/tpc.108.063412.
- Beauclair L, Yu A, Bouché N: microRNA-directed cleavage and translational repression of the copper chaperone for superoxide dismutase mRNA in Arabidopsis. Plant J. 2010, 62: 454-462. 10.1111/j.1365-313X.2010.04162.x.
- Brodersen P, Voinnet O: Revisiting the principles of microRNA target recognition and mode of action. Nat Rev Mol Cell Biol. 2009, 10: 141-148.
- Li Y, Zheng Y, Addo-Quaye C, Zhang L, Saini A, Jagadeeswaran G, Axtell MJ, Zhang W, Sunkar R: Transcriptome-wide identification of microRNA targets in rice. Plant J. 2010, 62: 742-759. 10.1111/j.1365-313X.2010.04187.x.
- Mendes ND, Freitas AT, Sagot MF: Current tools for the identification of miRNA genes and their targets. Nucleic Acids Res. 2007, 8: 2419-2433.
- Dsouza M, Larsen N, Overbeek R: Searching for patterns in genomic data. Trends Genet. 1997, 13: 497-498.
- Xie FL, Huang SQ, Guo K, Xiang AL, Zhu YY, Nie L, Yang ZM: Computational identification of novel microRNAs and targets in Brassica napus. FEBS Lett. 2007, 581: 1464-1474. 10.1016/j.febslet.2007.02.074.
- Fahlgren N, Carrington JC: miRNA Target Prediction in Plants. Methods Mol Biol. 2010, 592: 51-57. 10.1007/978-1-60327-005-2_4.
- Zhang Y: miRU: an automated plant miRNA target prediction server. Nucleic Acids Res. 2005, 33: W701-W704. 10.1093/nar/gki383.
- Bonnet E, He Y, Billiau K, Peer YV: TAPIR, a web server for the prediction of plant microRNA targets, including target mimics. Bioinformatics. 2010, 12: 1566-1568.
- Kruger J, Rehmsmeier M: RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res. 2006, 34: 451-454. 10.1093/nar/gkj455.
- Xie F, Zhang B: Target-align: a tool for plant microRNA target identification. Bioinformatics. 2010, 23: 3002-3003.
- Dai X, Zhuang Z, Zhao PX: Computational analysis of miRNA targets in plants: current status and challenges. Brief Bioinform. 2011, 12: 115-121. 10.1093/bib/bbq065.
- Mückstein U, Tafer H, Hackermüller J, Bernhart SH, Stadler PF, Hofacker IL: Thermodynamics of RNA-RNA binding. Bioinformatics. 2006, 22: 1177-1182. 10.1093/bioinformatics/btl024.
- Ding Y, Chan CY, Lawrence CE: Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res. 2004, 32: W135-W141. 10.1093/nar/gkh449.
- Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nat Genet. 2007, 39: 1278-1284. 10.1038/ng2135.
- Thadani R, Tammi MT: MicroTar: predicting microRNA targets from RNA duplexes. BMC Bioinformatics. 2006, 7: S20.
- Gardner PP, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics. 2004, 5: 140. 10.1186/1471-2105-5-140.
- Andronescu M, Zhang ZC, Condon A: Secondary structure prediction of interacting RNA molecules. J Mol Biol. 2005, 4: 987-1001.
- Heikham R, Shankar R: Flanking region sequence information to refine microRNA target predictions. J Biosci. 2010, 35: 105-118. 10.1007/s12038-010-0013-7.
- Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011, 39: D152-D157. 10.1093/nar/gkq1027.
- Tyler Backman, Christopher Sullivan, Jason Cumbie, Zachary Miller, Elisabeth Chapman, Noah Fahlgren, Scott Givan, James Carrington, Kristin Kasschau: Update of ASRP: the Arabidopsis Small RNA Project database. Nucleic Acids Res. 2008, 36: D982-D985.
- Jagadeeswaran G, Zheng Y, Li Y, Shukla LI, Matts J, Hoyt P, Macmil SL, Wiley GB, Roe BA, Zhang W, Sunkar R: Cloning and characterization of small RNAs from Medicago truncatula reveals four novel legume-specific microRNA families. New Phytol. 2009, 184: 85-98. 10.1111/j.1469-8137.2009.02915.x.
- Li B, Qin Y, Duan H, Yin W, Xia X: Genome-wide characterization of new and drought stress responsive microRNAs in Populus euphratica. J Exp Bot. 2011, 10.1093/jxb/err051.
- Collobert R, Bengio S: SVMTorch: support vector machines for large-scale regression problems. The Journal of Machine Learning Research. 2001, 1: 143-160.
- Dai X, Zhao PX: psRNATarget: a plant small RNA target analysis server. Nucleic Acids Res. 2011, 1-5.
- Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nat Genet. 2007, 39: 1278-1284. 10.1038/ng2135.
- Jones-Rhoades MW, Bartel DP, Bartel B: MicroRNAs and their regulatory roles in plants. Annu Rev Plant Biol. 2006, 57: 19-53. 10.1146/annurev.arplant.57.032905.105218.
- Wang XJ, Reyes JL, Chua NH, Gaasterland T: Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 2004, 5: R65. 10.1186/gb-2004-5-9-r65.
- Moldovan D, Spriggs A, Yang J, Pogson BJ, Dennis ES, Wilson IW: Hypoxia-responsive microRNAs and trans-acting small interfering RNAs in Arabidopsis. J Exp Bot. 2010, 61: 165-77. 10.1093/jxb/erp296.
- Jones-Rhoades MW, Bartel DP: Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol Cell. 2004, 18: 787-99.
- Wu G, Poethig RS: Temporal regulation of shoot development in Arabidopsis thaliana by miR156 and its target SPL3. Development. 2006, 133: 3539-47. 10.1242/dev.02521.
- Moxon S, Jing R, Szittya G, Schwach F, Rusholme Pilcher RL, Moulton V, Dalmay T: Deep sequencing of tomato short RNAs identifies microRNAs targeting genes involved in fruit ripening. Genome Res. 2008, 18: 1602-1609. 10.1101/gr.080127.108.
- Wang Jia-Wei, Mee Park, Wang Ling-Jian, Koo Yeonjong, Chen Xiao-Ya, Weigel Detlef, Poethig RS: MiRNA Control of Vegetative Phase Change in Trees. PLoS Genet. 2011, 7: e1002012.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.