### The mathematical bases of the predictors and correlation analyses

The scaling function predictor can be summarized as set of vector equations and implemented as required in the model:

Estimated microRNA expression index

Vector of mRNAs expression indexes

E
_{
affy
} Expression matrix of mRNA obtained from Affymetrix microarray

E
_{
miRNA
} Expression matrix of miRNA obtained from Agilent microarray

FC
_{
i
} Fold change of i-th mRNA present on Affymetrix array

p-value (from Student’s t-test or ANOVA) statistic of i-th mRNA present on Affymetrix array

k
_{
sense
} Strain dependant coefficient (*1.2* for sense *0.8* for antisense)

k
_{
overlap
} Overlap dependant coefficient (*2* for intronic *0.8* for exonic, 3’UTR, 5’UTR)

k
_{
evidence
} Evidence dependant coefficient (*1.2* for experimentally determined *0.8* for predicted)

**Equation 1** - General formula for the scaling function predictor

**Equation 2** - Mapping function

**Equation 3** - Weight vector

**Equation 4** - Scaling coefficients determining weight vector elements

**Equation**
**1** represents the general form of the predictor, which calculates the estimated microRNA expression index by averaging elements of the experimentally observed mRNA expression vector multiplied by a weight vector. The expression vector is created by a mapping function, which selects expression values corresponding to host genes from the messenger RNA expression matrix (**Equation**
**2**). Simultaneously a weights vector of the same length is created (**Equation**
**3**). Each value in this vector is calculated by multiplying the absolute fold change (FC) and reverse scaled p-value (1-*Pval*) obtained for each gene during pre-processing in addition to three coefficients (*k*
_{sense}, *k*
_{
overlap
} and *k*
_{
evidence
}) that combined describe the nature of predicted edge between miRNA and mRNA (**Equation**
**4**). Values for these coefficients have been arbitrarily assigned, using biological knowledge and computational tests performed prior to building the function. For example the intronic regions are extracted from coding sequences during splicing, which theoretically makes them available to the Drosha enzyme. However, both the 3’UTR, and 5’UTR are incorporated into mature mRNA, so they can only be processed to miRNA if the maturation process and transportation of mRNA out of the nucleus is interrupted. These scenarios dictate that the model preferentially promotes intronic sequences.

For linear model predictor the principal mathematical problem encountered while constructing the optimal regression formula was the variable number of the independent values describing each dependent value. The mapping function assigned every miRNA from 1 to 32 mRNA expression indexes. Parameters such as the p-value, fold change and genomic context of transcripts that were used successfully in the previous predictor were again incorporated into the linear model. In addition, the regression model includes additional ordinal (categorical) and continuous descriptive parameters:

e
_{
i
} Messenger RNA expression values from the microarray experiment

FC
_{
i
} Fold Change in expression between sample and control

p-value from t-test or ANOVA on mRNA expression data

overlay Categorical parameter of levels: *intron, exon, 3’UTR, 5’UTR*

strand Categorical parameter of levels: *sense* or *antisense*

evidence Categorical parameter e.g. *clone based, curated transcript, automatic transcript*

The following equations describe how starting with the simplest scenario (*i.e.* one microRNA’s expression dependent on only 1 mRNA transcript) we can implement a general regression formula based on these assumptions:

**Equation 5** - The regression formula when predicting the miRNA expression of 1 microRNA when dependent on 1 mRNA transcript

**Equation 6** - The regression formula when predicting the expression of a miRNA when dependent on 2 mRNA transcripts

**Equation 7** - A general regression formula for predicting the expression of a miRNA expression value when dependent on n transcripts

Implementing the iterative formula into the linear model is mathematically impossible. Instead the model predicts miRNA expression with each transcript separately and then calculates a median value as the final prediction for each miRNA. However, using the model described by **Equation**
**5** with this method resulted in poor prediction power – the Pearson’s correlation coefficient between the measured values and our predictions was 0.324. As solution the factor containing the names of miRNAs was introduced into the model. This allowed the fitting function to select different linear equation coefficients for unique miRNAs (**Equation**
**8**).

**Equation 8** - The regression formula for predicting the expression value of a microRNA after introducing miRNAs’ name factor

This model achieved a high performance, with an estimated correlation value of 0.945 between the experimental values and our predicted values. Additional analyses indicated that miRNAs located on antisense strand, exonic, 3’UTR and 5’UTR are weakly correlated and may introduce noise rather than add to the signal in the model. Pre-filtering these transcripts marginally increased the correlation to 0.949. The ambiguous nature of the evidence (*i.e.* origin of the entry in miRBase) also introduced the noise. Discarding this independent variable (**Equation**
**9**) further increased prediction power to 0.956. This simplification of the model (**Equation**
**10**), based only on mRNA expression values and miRNA ID factor resulted in a correlation coefficient of 0.955. Despite the larger computational complexity the best performing regression formula described by **Equation**
**9** was implemented in the pipeline (Figure 4).

**Equation 9** - Final regression formula characterised by the highest prediction power and moderate resource consumption.

**Equation 10** - Simplified regression formula for the linear model predictor.

Finally the miRNA-mRNA correlation analyses can be simplified to the following formula:

**Equation 11** - The mathematical bases of miRNA-mRNA correlation analyses:

### Correlation matrix

The idea of creating correlation matrices has been inspired by mathematical procedures present in regression analyses. The independent variables are being correlated against each other to assess their independence. The important differences are that regression analyses method operates on vectors, creates square matrices and aims to minimize the absolute value of correlation: correlation close to 0 indicates that independent variables are not biased to describe each other. The method that we have developed operates on arrays – though can be treated as reducing the dimensionality of the data. The basic assumption is that the expression matrices calculated using every paired dataset have the same number of columns – the same quantity of arrays must be used to assay miRNA and mRNA, and different number of rows – there is much more coding genes than miRNAs. Every row of the miRNA array is correlated against each row of the mRNA array and the correlation coefficient is captured – this way two matrices are collapsed into one, which shares the number of rows with miRNA’s expression matrix. The number of columns is equal to the number of rows present in mRNA array.

The most correlation comprehensive investigation has been made on the *“Integrative genomic profiling of human prostate cancer”* (GSE21032) dataset. 1,411,189 exons are represented on the Affymetrix Human Exon 1.0 ST array. Agilent Human miRNA Microarray 2.0 captures the expression of 821 different miRNAs and control quality sets. In constructing the correlation matrix quality control probesets and viral miRNA have also been correlated to mRNA for negative control. The output matrix, size of 821 × 1,411,189, has captured 1,158,586,169 correlation coefficients.