The SBSE algorithm was implemented as an R script http://www.r-project.org/ and is free to download for use and further evaluation (See Additional File 4). All data and supporting data files required to replicate the reported results have also been included as additional information (Additional File 3). Development and testing was completed using version R-2.9.1 on both Red Hat^{®} Linux and Microsoft Windows XP.

### Concept

Assume that as part of a miRNA functional characterisation effort we have determined the, relative to control, fold-changes of six gene transcripts. These respective expression values are used to generate a simple ordered (*i.e*. from most up-regulated to most down-regulated) list. We hypothesis that if the observed differential expression profile has been caused by the transfected miRNA, the down-regulated transcripts (*i.e*. those to the right) will be enriched with target sites for the miRNA seed region relative to that observed with the up-regulated transcripts (*i.e*. those to the left). Let's suppose that our list is populated with the following expression values [4, 2, 0.5,-0.2,-1.5,-3]. Each of the six transcript identifiers is then associated with its respective 3'UTR sequence via a one-to-one mapping matrix, and using a simple pattern matching function we determine the presence or absence of a specified nucleotide hexamer, in each of the respective 3'UTR sequences. The pattern matching results are used to transform our ordered list into an 'ordered' binary list indicating a match, or no match, of the nucleotide query (seed) sequence in each of the respective 3'UTR sequences of our ordered list. Assume that this takes the form [0, 0, 1, 1, 0, 1].

A second list is derived from the ordered fold-change data, but in this instance the data indicates the absolute ranking of each differentially expressed transcript and for our illustration takes the form [1, 6, 2, 5, 3, 4]. The absolute ranking is used to direct a step-wise analysis of the differential distribution of 1's as they are encountered in the binary vector. That is, for each increment of the absolute list we calculate the lowest likelihood that the binary profile observed to-date can be best explained by the differential distribution of 1's between the leftmost and rightmost transcripts. For the simple case described we begin with an up-regulated gene not containing a seed sequence match and hypothesise that there is a one-half chance that the differential expression profile observed at this point is due to differences in miRNA distribution. Next, we observe a down-regulated gene containing a seed sequence match fall to the right, and update our hypothesis accordingly. We next observe another up-regulated gene not containing a seed sequence match *etc*. The algorithm continues to update the hypotheses until all of the data has been evaluated. On completion we are left with five values (*i.e*. one for each hypothesis) corresponding to each division between the six observations. This analysis procedure is represented in cartoon format in Figure 1A.

On completion of the analysis the estimated probabilities are plotted alongside a simple graphical representation (See Figure 1B) that summarises how the algorithm navigated the dataset and the estimated likelihood at each interval. These combined observations are used to determine the maximum enrichment score of the query (seed) motif in the dataset and, by inference, to quantify the likely magnitude of the miRNA repression of gene transcription given the observed fold-change dataset. Using our simplified example each row of the main plot corresponds to an additional observation. The vertical black lines indicate the optimal division as the data is processed. The black line of the uppermost plot (U) summarises the -log of the estimated p-value for each division and is used to determine the optimal (lowest) value (i*). The blue line summarises a *post hoc* calculation of the p-value for each division when the number of observations is optimal (j*). The black line of rightmost plot (R) also summarises the -log of the estimated p-value associated with each hypothesis update and is used to determine optimal division (j*) of the dataset (*i.e*. the number of observations needed to estimate the lowest p-value). In our illustrative example we propose that only the four most differentially expressed observations are required to estimate the optimum partitioning of the data (*i.e*. the most likely location of miRNA repression). The blue line summarises a *post hoc* calculation to estimate the p-value using the optimal partitioning of the dataset.

The following assumptions are made with regard to the dataset: (1) that the 3'UTRs represented the full length transcript, and (2) that only one query (seed) match per 3'UTR was of relevance to the transcript repression mechanism (3) that RNAi transcript targets are down-regulated post-transfection.

### Algorithm details

Let "D" denote our ordered binary list of length N (the total number of gene transcripts). In this list a one corresponds to the *i*^{th} gene's UTR having a seed sequence match, while a zero corresponds to the *i*^{th} gene's UTR not having a seed sequence match. Now let "A" denote our absolute data list, also of length N. Initially consider the top j transcripts of list A. This value is used to extract and partition the *j*-most differentially expressed transcripts from D.

Now let *H*_{i, j} denote our hypothesis that the differential distribution of 1's observed between the division D_{1...j} (the "left") and D_{j+1...N} (the "right") can be best explained by the distribution of the miRNA seed motif on either side of this division. Let denote the number of ones in the left set and the number of zeros in the left set. Similarly, is used to denote the number of ones in the righst set, and the number of zeros in the right set.

Given the above definitions we define an updating mechanism that allows the complete dataset to be traversed and our hypothesis to be incrementally evaluated. First, an initial estimate is assigned to the hypothesis *H*_{i,j-1}, that is, the probability that our hypothesis is correct given that we have observed D_{j-1}(*i.e*. *j*-

1) transcripts. This probability is updated for each subsequent A[*j*] increment of the dataset and can be succinctly defined using the Bayes' formula.

Note that P{H_{0}|D_{1...j-1}} = 1- P{H_{i, j-1}|D_{1...j-1}} is the probability that the differential expression at this division cannot be explained by the observed distribution of the miRNA seed motif (*i.e*. the seed motif distribution is random).

Should the next most differentially expressed transcript of encode a miRNA seed motif then a logical assumption is that the probability of the next transcript falling to the left of the division is the current ratio of seed sequences matches to the left of the division to the total number of seed sequences matches.

Further, the probability of the next transcript falling to the right of the division is the current ratio of seed sequences matches to the right of the division to the total number of seed sequences matches observed.

However, if the next transcript does not have a seed sequence match, then the probability of the next transcript falling to the left of the division is the current ratio of transcripts without a seed sequence match to the left of the division to the total number of transcripts without a seed sequence match observed.

By extension, the probability of this transcript falling to the right of the division is the current ratio of transcripts without a seed sequence match to the right of the division to the total number of transcripts without a seed sequence match observed.

Under our null hypothesis all of the observed differential expression is assumed to be independent of the miRNA seed motif distribution. Hence the probability that the next transcript falls to the left of the division is the ratio of transcripts to the left of the division to the total number of transcripts.

Likewise, under the null hypothesis the probability that the next transcript falls to the right of the division is quite simply the ratio of transcripts to the right of the division to the total number of transcripts.

The P{H_{i, 0}} is given an initial value of 0.5 and equation 1.1 updated until the dataset has been traversed. On completion the highest value of corresponds to the optimum partitioning of the data. This will be referred to as the optimum enrichment score henceforth and is the most likely estimate that the observed differential expression profile is best explained in terms of the miRNA seed motif distribution by dividing the top *j*_{*} genes at division *i*_{*}. The estimated probabilities, P{H_{i, j}}, for each D[*j*] are plotted as black lines as indicated in Figure 1B. The estimates and are plotted as blue lines in the uppermost (U) and rightmost (R) plots of the summary plots but are not utilised further in this report

### Miniscule values

Our Bayes formula while mathematically correct, may prove problematic as both P{H_{i, j}|D_{1..j}} and P{H_{0}|D_{1...j}} become miniscule for large datasets. It is therefore pragmatic to work on a logarithmic scale, that is:

And likewise,

### Binning

The algorithm as defined requires N(N-1) iterations of formula 1.1 to complete an analysis. Given that a typical microarrays dataset summarises the expression data of several thousand genes, it is a valuable option that the dimension of the dataset be reduced to enable a more rapid execution of the analysis. One proven approach is to group the ordered gene list into M bins and apply equation seven for M(M-1) iterations. Under this scenario most of the utilised formulae remain unmodified. However, the functions used to estimate P{D_{1...j}|H_{i, j-1}} and P{D_{1...j}| H_{0}} must be updated to accommodate this additional option. This can be achieved as follows. Let *x* denote the number of genes with a seed sequence match and *y* the number of genes without a seed sequence match in bin D_{j}. Under the hypothesis H_{i, j-1}, the probability that we will observe *x* genes with a seed sequence match falling to the left of the division, and *y* genes without a seed sequence match falling to the left of the division is P{D_{1...j}|H_{i;j-1}} = p^{x}q^{y}, where

If the bin falls to the right of the division, then P{D_{1...j}|H_{i;j-1}} = p ^{x} q ^{y} , where

Under the null hypothesis P{D_{1...j}|H_{0}} = p^{x+y}, where

should the bin fall to the left of the division and

should the bins fall to the right of the division.

### Composite hexamer plot

The summary plot of the query (seed) distribution (See Figure 1B) is a useful representation of the differential distribution of a query sequence and, by inference, an estimate of the magnitude and location of transcript repression in a given dataset. However, an obvious extension of such an estimate is to compare the distribution of a specific query motif relative to that of all other possible query sequences of the same length (*i.e*. to evaluate our specific seed query estimate in context with all other putative explanatory seed sequences). To address this requirement the SBSE algorithm was extended to iteratively query a given dataset with a comprehensive library of 4096 (*i.e*. 4^^{6}) unique hexamer nucleotides and plot each of the resulting estimates on a composite graphical representation. Such plots allow a simple and succinct graphical representation of how our estimate of a given hexameric nucleotide query motif compares relative to all other hexameric sequences (see Figure 3).

### Datasets

To develop and validate SBSE public microarrays datasets were retrieved from the EBI's ArrayExpress [41] public archive http://www.ebi.ac.uk/arrayexpress. For each study relevant cel files were quality assessed using standard metrics and subsequent expression values RMA normalised [42] before differential expression profiles were generated using the LIMMA library [43]. Human 3'-UTRs were retrieved from BioMart [44] and mapped to Affymetrix probeset identifiers. The longest 3'-UTR was selected when many-to-one UTR mappings occurred. Complex nucleotide repeat patterns were masked using DUST [45].

Brief summaries of selected case studies used in the development and evaluation of SBSE are as follows:

(1) The E-GEOD-6207 dataset is comprised of 14 Affymetrix GeneChip^{®} Human Genome U133A Plus 2.0 cel files. In this study hsa-miR-124 was over expressed in HepG cells and RNAs extracted at time points 0, 4, 8, 16, 24, 32, 72 and 120 h post-transfection [7]. This time course dataset was used extensively to develop several aspects of the SBSE algorithm

(2) Six Affymetrix GeneChip^{®} Human Genome U133 Plus 2.0 cel files were retrieved from E-MEXP-875. This dataset was originally generated to investigate the effects of FAM33A RNAi knockdown on the gene expression profile of a lung carcinoma cell line [30]. Two unique siRNA oligonucleotides were used in separate transfections along with a non-silencing oligonucleotides control.

(3) The E-MEXP-456 dataset consists of six Affymetrix GeneChip^{®} Human Genome U133 Plus 2.0 cel files. In this investigation the effect of siRNA duplex knock-down of the human miR-30a-3p miRNA precursor was evaluated in HepG2 cells in an attempt to identify hsa-miR-30a-3p target transcripts [31].

(4) The dataset E-GEOD-16097 is comprised of six Human Genome U133Plus 2.0 cel files [32]. Briefly, the author used a cocktail of three siRNAs to knockdown the BAHD1 transcript. In each instance HEK293 cells were transfected with either BAHD1 siRNA or control siRNA. Total RNA from cells transfected for 72 h were extracted and purified before hybridization on GeneChip Human Genome U133Plus 2.0 chips.

(5) The E-GEOD-9264 dataset is comprised of 12 Affymetrix GeneChip^{®} Human Genome U133 Plus 2.0 cel files. Four of these were control replicates (pCDNA3.1), four samples transfected with hsa-miR-155 and four samples transfected with the KSHV-miR-K12-11 miRNA, a proposed ortholog of hsa-miR-155 [33].