MMpred: functional miRNA – mRNA interaction analyses by miRNA expression prediction

Background MicroRNA (miRNA) directed gene repression is an important mechanism of posttranscriptional regulation. Comprehensive analyses of how microRNA influence biological processes requires paired miRNA-mRNA expression datasets. However, a review of both GEO and ArrayExpress repositories revealed few such datasets, which was in stark contrast to the large number of messenger RNA (mRNA) only datasets. It is of interest that numerous primary miRNAs (precursors of microRNA) are known to be co-expressed with coding genes (host genes). Results We developed a miRNA-mRNA interaction analyses pipeline. The proposed solution is based on two miRNA expression prediction methods – a scaling function and a linear model. Additionally, miRNA-mRNA anti-correlation analyses are used to determine the most probable miRNA gene targets (i.e. the differentially expressed genes under the influence of up- or down-regulated microRNA). Both the consistency and accuracy of the prediction method is ensured by the application of stringent statistical methods. Finally, the predicted targets are subjected to functional enrichment analyses including GO, KEGG and DO, to better understand the predicted interactions. Conclusions The MMpred pipeline requires only mRNA expression data as input and is independent of third party miRNA target prediction methods. The method passed extensive numerical validation based on the binding energy between the mature miRNA and 3’ UTR region of the target gene. We report that MMpred is capable of generating results similar to that obtained using paired datasets. For the reported test cases we generated consistent output and predicted biological relationships that will help formulate further testable hypotheses.


Assessing the quality of data and accuracy of prediction
For both datasets both the t-test and ANOVA analyses have been performed.
The blunt dataset expression seems to be ambiguousno strong signal determining the experimental groups have been detected. The first attempt utilized one-way ANOVA with multiple level factor indicating therapeutic outcomes of each patient plus control.   Better results were achieved using simpler, sample vs. control experimental design. The array experiments clusters into two groups: in one the control (unaffected individuals) are strongly overrepresented (Figure 0.3). However, the clustering is still far from perfect, clearly clustering into two groups, which do not overlap with the experimental design, indicates that there is some factor indicating much stronger expression signal than blunt damage itself.   Experiments with more complex designs (multi-way ANOVA, ANOVA with other factors) produced much ambiguous and confusing result than t-test sample/control design this design have been decided to be utilized as a baseline for the pipeline.
The final mean correlation achieved in the correlation table was relatively high (Figure   0.5). The correlation cut-off has to be raised to -0.4 in order to obtain sufficient group of genes for overrepresentation testing. Finally 2105 interaction between miRNA and its targets have been identified. 27 miRNAs are controlling 376 putative target gens. Much more consistent results have been achieved for the burn injury dataset. It seems that signal from burn damage must be much stronger and more specific than blunt damage one. The first design utilised the original experimental design with small modification. The two-way ANOVA had been performed. The first factor indicates time after treatment addition (as design in section 1.1.1) and control group. The second factor determines if patient or control individual is infant or adult. Utilizing this design brought the same clustering (Figure 0.6) and very good p-value level, much lower than in the case of permutated design (Figure 0.7).  The fold change is marked on the X-axis and p-value on the Y-axis. Actual experimental data (blue points) are characterised with significantly lower p-values than randomized ones (red points).
The interesting fact is that the clustering based on predicted miRNA activity may be considered better than the one based on the experimental data. Although the results obtained for each predictors differs for both scaling function predictor (Figure 0.8) and linear model predictor (Figure 0.9) good clustering had been achieved. Particularly the clustering of the control group (red) seems much better for predicted miRNA based results.

Functional analyses
Since only the t-test sample/control design returned reliable results in the case of blunt damage dataset the functional prediction output was much smaller than in case of the burn injury dataset. Hoverer, in both datasets some interesting insights may be formed basing on miRNA activity predictions.
The ten most numerous Gene Ontology categories for blunt damage dataset are:  cellular process  primary metabolic process  cellular metabolic process  macromolecule metabolic process  cellular macromolecule metabolic process  protein metabolic process  cellular protein metabolic process  catabolic process  macromolecule modification  protein modification process Very many genes fall into overrepresented categories. All the top five categories (sorted by hypergeometric test p-value) contain more than 100 genes (Figure 0.14). For example 277 out of 376 total genes found to be under significant genes fall into the "cellular process" category -this number represents a vast majority of all genes falling into this category in whole human genome. On the other hand the "cellular process" is not specific so it can be interpreted in different ways. Analysing more specific overrepresented categories reveals that most of those genes are connected with metabolism or catabolism.   (Figure 0.16). The gene concept network shows very poor connectivity between the groups connected with different terms (Figure 0.17).  The ten most overrepresented terms in Disease Ontology light are: Again, the expression pattern for genes supporting DOLight is similar to GO and KEGG two clusters of lower gene expression corresponding to unaffected control individuals.
The diseases are not consistent with each other and the categories are supported by relatively low number of genes (less than 10 in each category). The only positive thing supporting the previous output is that all these diseases are connected to inflammation.
Finally the user determined terms "Entrez" identified strong overrepresentation of genes connected with apoptosis (63 genes) and significant shift in expression of those genes (expression pattern is consistent with the GO and KEGG output). Inflammation (11 genes) and necrosis (10 genes) connected genes are present, but not strongly overrepresented. The top five overrepresented categories are supported by at least 200 genes (more than 500 for "Cellular process") (Figure 0.19). The general expression of those genes is similar to the blunt damage dataset. However, the heatmap presenting more specific GO categories reveals significant differences between control and affected group (Figure   0.20). Those expression differences are very small, which makes them hard to notice on the heatmap presenting direct values. For that reason also the z-score standardized heatmap have been provided (Figure 0.21).
The GO terms obtained from level II ontology, as well as terms obtained using different experimental designs are consistent with two-way ANOVA output.  It should be noted, that those terms are supported by very few genes (Figure 0.22

 Focal adhesion
The expression profile is similar to the profile obtained for genes supporting overrepresented GO categories (Figure 0.23). Again, the differences in expression between the groups are very small, so z-score standardization has to be performed to achieve human readable output (Figure 0.24   This can be explained as observed previously for GO terms and KEGG pathways overrepresentation of neurotransmitters connected genes. These categories may also come with signal determining infant patients from adult ones (see experimental design).
From user defined Entrez terms the "Apoptosis" seems to be very strongly overrepresented. The inflammatory gene seems to be fairy overrepresented, while necrosis is almost not present. The detailed output of hypergeometric test is presented in   Please note low values of p-value obtained from hypergeometric test (Table 0.3). The reason for that is as many as 115 found apoptotic genes are 85% of the genes considered to be significantly overrepresented. The genes falling into "apoptosis" category are only 5.8% of all annotated genes. The chance of shuffling about 134 genes from all annotated and getting 115 falling into the same category, which is as small as ~5% of all genes is actually very small. Please note characteristic clusters corresponding to affected patients (the left side and the middle part of the heatmap) and control unaffected control group (the right side). The genes for each category are equally distributed between over-and under-expressed.

Discussion
Comparing to the blunt damage dataset, the general quality of the signal was much better in burn injury datasets. This observation is supported both by clustering within the design groups and level of achieved p-value (Figure 0.6, Figure 0 Further interesting differences between datasets are revealed by examining the top GO terms sorted by p-value obtained from hypergeometric test rather than the number of genes falling into each category. Those terms are more specific and all of them are connected with cellular biosynthetic process. The expression profile of those genes, visualised on the heatmap (Figure 0.21) shows that most of the genes falling into supporting terms like "neurotransmitter biosynthetic process" and "glutamine family amino acid catabolic process" are over-expressed in sample (burn injury affected) group.
This may indicate very specific gene regulation during burn damage induced inflammatory process and subsequent repair process, which may suggest miRNA regulation involvement. Additional factor supporting advanced miRNA regulation comes with very small differences between expression indexes in that groupthe differences become visible only after z-score standardization of gene expression (please compare The output generated by overrepresentation testing for KEGG pathways seems to be consistent with GO terms sorted by strict hypergeometric testing. The identified KEGG pathways using two-way ANOVA are quite specific and supported by 10 -20 genes only. Other experimental design ware able to return more general pathways like "MAPK signaling pathway", "Leukocyte transendothelial migration" and "T cell receptor signaling pathway" supporting the assumption of inflammatory process happening on the organism level. Other terms including "Lipoic acid metabolism", "RNA degradation" and "Ubiquitin mediated proteolysis" supports the hypothesis of metabolic activation of immune system cells and macromolecule metabolism.
The expression profile of specific KEGG pathways genes is very consistent with the one Similarly to blunt damage dataset the strong overrepresentation of apoptosis connected genes have been detected for overrepresentation testing of user provided Entrez terms.
Also the inflammatory genes have been found to be under putative miRNA control in the burn injury dataset. The important difference is much lower overrepresentation of necrotic process genes comparing to blunt datasets. It seems that in case of burn damage the preferred method of cleaning survived cells in damaged areas is apoptosis. This may seem right, since most of the compounds in burned areas are not reusable, so the big deficit of macromolecules should accrue. In these circumstances "programmed cell death" is the most efficient way to reuse the resources of survived, but damaged cells to rebuild missing tissues. No significant expression pattern has been detected for those genes (Figure 0.25). The overview of comparative blunt damage and burn injury datasets analyses have been summarised in