DotcodeR overview
The purpose of the method DotcodeR is to locate structured RNAs conserved between two genomic sequences. To this end, subsequences are extracted from the genomic sequences, and these subsequences are compared to each other by calculating the dot product of binary vectors that reflect local ensemble secondary structures. The score is used to predict whether the subsequences have a similar RNA structure or not (see Fig. 1).
Secondary structure dot plot
The binary vectors used in the dot product are based on the base-pairing probabilities. This section describes how the base-pairing probabilities are calculated.
Let x=x
1
x
2…x
L
be an RNA sequence, where x
i
∈{A,C,G,U} for 1≤i<j≤L. The RNA sequence x can fold into a secondary structure y, which consists of canonical base pairs such as A-U, G-C and G-U. The posterior probability p
ij
that x
i
is paired with x
j
given the RNA sequence x can be calculated by
$$p_{ij} = \frac{1}{Z} \sum_{y \in \mathcal{S}_{ij}(x)} e^{-\frac{G(y)}{RT}}, $$
where \(Z = \sum _{y \in \mathcal {S}(x)} e^{-\frac {G(y)}{RT}}\) is the partition function over the set \(\mathcal {S}(x)\) of all secondary structures of x, \(\mathcal {S}_{ij}(x)\) is the set of all secondary structures of x with x
i
and x
j
paired, G(y) is the free energy of y, R is the gas constant, and T is the temperature. In the actual case, p
ij
is calculated by dynamic programming that employs additional partition functions including Z
ij
defined over all secondary structures on a sequence interval [i,j] (see [32] for further details).
When addressing a genomic screen for structured RNAs with a sliding window of fixed size w, we need to know ‘local’ base-pairing probabilities within that window. A good solution is to consider the partition function \(Z_{ij}^{u,w}\) over all secondary structures on the interval [i,j] with the window [u,u+w] folded. More specifically, \(Z_{ij}^{u,w} = Z_{ij}\) if [i,j]⊆[u,u+w], and \(Z_{ij}^{u,w} = 0\) otherwise. The local base-pairing probability \(p_{ij}^{u,w}\) can also be calculated by dynamic programming using these partition functions (see [33] for details).
A secondary structure dot plot for an RNA sequence is a base-pairing probability matrix for that sequence, where each (i,j) element is p
ij
(or \(p_{ij}^{u,w}\)). Note that we have only to consider upper triangular part of the matrix because of the relationship i<j. A dot plot for a given RNA sequence can be computed by the partition function-based approach stated above, which is implemented by RNAfold [23] for global base-pairing probabilities and RNAplfold [33] for local pairing probabilities in the ViennaRNA package [24]. In this work, we used RNAplfold to compute dot plots that correspond to local base-pairing probabilities.
Binary vectors
The binary vector is constructed from a given dot plot using a sliding window of fixed size 2d+1. It should be noted that the window introduced here is different from the previous one of size w for genomic screen, and it is arranged in the diagonal-like way on the dot plot (see ‘Moving window’ in Fig. 2 as an example of d=1). It is also to be noted that the moving window has two shapes depending on the first position of the window center put on the diagonal (e.g. ‘L’ shape and its inverse for d=1 as shown in Fig. 2). Let p
ij
denote a local base-pairing probability in the dot plot, which corresponds to the central position of the window, and θ be a threshold. Here we consider the following rule that converts pairing probabilities within the window into a binary digit b∈{0,1}:
-
1.
if the next of cell (i,j) in the window is (i+1,j):
-
b=1 if \(p_{ij} + \sum _{\delta =1}^{d} \left (p_{i-\lfloor \frac {\delta }{2}\rfloor,\ j-\lceil \frac {\delta }{2}\rceil } + p_{i+\lceil \frac {\delta }{2}\rceil,\ j+\lfloor \frac {\delta }{2}\rfloor } \right) > \theta \);
-
b=0 if \(p_{ij} + \sum _{\delta =1}^{d} \left (p_{i-\lfloor \frac {\delta }{2}\rfloor,\ j-\lceil \frac {\delta }{2}\rceil } + p_{i+\lceil \frac {\delta }{2}\rceil,\ j+\lfloor \frac {\delta }{2}\rfloor } \right) \le \theta \);
-
2.
else if the next of cell (i,j) in the window is (i,j+1):
-
b=1 if \(p_{ij} + \sum _{\delta =1}^{d} \left (p_{i-\lceil \frac {\delta }{2}\rceil,\ j-\lfloor \frac {\delta }{2}\rfloor } + p_{i+\lfloor \frac {\delta }{2}\rfloor,\ j+\lceil \frac {\delta }{2}\rceil } \right) > \theta \);
-
b=0 if \(p_{ij} + \sum _{\delta =1}^{d} \left (p_{i-\lceil \frac {\delta }{2}\rceil,\ j-\lfloor \frac {\delta }{2}\rfloor } + p_{i+\lfloor \frac {\delta }{2}\rfloor,\ j+\lceil \frac {\delta }{2}\rceil } \right) \le \theta \).
Note that in the above conditions, ‘next’ means a cell in the fixed window, and we set the parameters as d=1 and θ=0.1 in the subsequent computational experiments.
Based on this conversion rule, a coarse-grained dot plot can be defined as a binary vector obtained by performing the following pseudo code (see also Fig. 2):
For windows that do not contain exactly 2d+1 probabilities on the boundary of the dot plot, the summing operation is performed with probabilities available in that window.
DotcodeR
Figure 1 illustrates how DotcodeR works when a pair of genomic sequences is given. Two potential structured RNA genes (subsequences) are picked up by the sliding window in the two genomic sequences. Note that the window is to be moved along the genomes by some step size s, which is smaller than the window size w. First, the binary vectors are calculated for all windows in the two genomes. Once these coarse-grained dot plots are computed, dot products are calculated in all-against-all comparison of the vectors from the two genomes to quantify the structural similarity of the subsequences.
The theoretical run-time of this genomic scan algorithm is evaluated as follows. We let L denote the length of the longer genomic sequence of the two. The first step that computes the local base-pairing probability matrices by RNAplfold, requires O(L
w
2) time for each genome [33]. The second step needs \(O\left (\frac {(L-w)^{2}}{s^{2}}\right)\) comparisons of windows between the two genomic sequences. The binary vectors are O(w
2) long, and hence it takes O(w
2) to calculate the dot product between two vectors. Therefore, the run-time of DotcodeR can be evaluated as \(O\left (Lw^{2} + \frac {(L-w)^{2}w^{2}}{s^{2}}\right)\).
Datasets
Benchmark data
To benchmark the method, a positive dataset consisting of known structured RNAs was created. It is based on sequences from the Rfam 12.0 database [34]. The sequences in the dataset were selected in the following way:
-
1.
Remove sequences that have unpaired nucleotides in columns with at least 80% gaps;
-
2.
Remove sequences that have more than 20% gaps;
-
3.
Remove sequences from families with fewer than 20 members;
-
4.
Redundancy is reduced to at most 90% identity within a family;
-
5.
Randomly select five sequences from each clan.
The first two steps are aimed at removing outlier sequences in the alignments. The consensus structure of some families in Rfam contains very few base pairs. Many of the structures with a low number of base pairs belong to families with few member sequences (data not shown). Step 3 removes a large part of these lightly structured non-coding RNA families.
Sequences were split into two disjoint subsets for training and testing. The purpose of the training is to determine the optimal score cutoff as described in the next section. Sequences from the same clan/family are all either in the training or the test dataset to limit over-fitting. Rfam contains many families from the miRNAs. These families have the same structure to a large extent, and all miRNA sequences were therefore grouped into one family. This was also done for the snoRNAs for similar reasons.
The training set has 73 families and 347 sequences, whereas the test set contains 47 families and 210 sequences. Details of these RNA families can be found in Additional file 1: Tables S1 and S2.
Next, two negative datasets for each family were created as follows:
-
The ‘gene-shuffled’ dataset consists of shuffled sequences from the positive dataset. This dataset has the same number of negative sequences as that of the positive dataset, and the GC-content is the same.
-
The ‘genome-shuffled’ dataset consists of shuffled sequences taken randomly from human chromosome 22 (GRCh38 [35]). The sequences in this dataset have a GC-content similar to that of the chromosome rather than that of the positive dataset.
All these negative sequences were shuffled while preserving the dinucleotide distribution [36].
For each positive/negative sequence in the dataset, we next created the corresponding ‘simulated short genomic sequence’ by adding dinucleotide-shuffled sequences of the original RNA gene to both ends of that gene. These added sequences can be considered as flanking regions. In the tests, the sliding window must overlap with the gene regions.
Genomic sequences
To evaluate the performance of DotcodeR on real genomic sequences, GRCh38 human chromosome 21 and GRCm38 mouse chromosome 19 were used. The chromosomes as well as their gene annotations were taken from Ensembl [35]. The annotated snoRNAs were further subdivided into H/ACA box and C/D box snoRNAs, since these two classes have different structures. These chromosomes were selected because chromosome 21 is the smallest human chromosome, and mouse chromosome 19 is the mouse chromosome with the least amount of sequence similarity to human chromosome 21.
Before running DotcodeR, regions annotated as repeats were removed using the repeat-masking available from the UCSC Genome Browser [37]. Window pairs from regions covered by human–mouse chained BLASTZ alignments [38] from the UCSC Genome Browser were also removed because these regions can be investigated using alignment-based methods.
Evaluation metrics
We evaluated the predictive performance by calculating sensitivity (SEN), specificity (SPC), positive predictive value (PPV), negative predictive value (NPV) and Matthews correlation coefficient (MCC), defined by
$$\begin{array}{@{}rcl@{}} \text{SEN} &=& \frac{\text{TP}}{\text{TP}+\text{FN}},\quad \text{SPC} = \frac{\text{TN}}{\text{TN}+\text{FP}},\\ \text{PPV} &=& \frac{\text{TP}}{\text{TP}+\text{FP}},\quad \text{NPV} = \frac{\text{TN}}{\text{TN}+\text{FN}},\\ \text{MCC} &=& \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN} +\text{FN})}}, \end{array} $$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
Software
The program DotcodeR is implemented in C++ and available along with the benchmark data at https://github.com/ykat0/dotcoder/. To compute secondary structure dot plots, we used the ViennaRNA package 2.1.9.