Alignment-free comparative genomic screen for structured RNAs using coarse-grained secondary structure dot plots

Background: Structured non-coding RNAs play many different roles in the cell, but the annotation of these RNAs is incomplete even within the human genome. The currently available computational tools are either too computationally heavy for use in full genomic screens or rely on pre-aligned sequences.

Methods: Here we present a fast and efficient method, DotcodeR, for detecting structurally similar RNAs in genomic sequences by comparing their corresponding coarse-grained secondary structure dot plots at the string level. This allows us to perform an all-against-all scan of all window pairs from two genomes without alignment.

Results: Our computational experiments with simulated data and real chromosomes demonstrate that the presented method has good sensitivity.

Conclusions: DotcodeR can be useful as a pre-filter in a genomic comparative scan for structured RNAs.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-4309-y) contains supplementary material, which is available to authorized users.


S1 Supplementary Tables
Sensitivity for non-coding RNA (ncRNA) detection on the pair of human chromosome 21 and mouse chromosome 19 using DotcodeR. TP and FN denote the number of true positives and the number of false negatives, respectively, as defined in Section 3.3 of the main article. Note that ncRNA types with a sensitivity of N/A are not shown in the bar plot in Figure 5 of the main article. The result for snRNA is also not shown in Figure 5, as it consists of just one example.

Figure S1: Receiver operating characteristic (ROC) curves showing the discriminative power of DotcodeR on snoRNAs in the training set of simulated short genomes whose negatives are 'gene-shuffled' sequences. In this test, we used a window size of 120 nt, a step size of 30 nt and d = 1.

Figure S2: ROC curves showing the discriminative power of DotcodeR on snoRNAs in the training set of simulated short genomes whose negatives are 'genome-shuffled' sequences. In this test, we used a window size of 120 nt, a step size of 30 nt and d = 1.

Figure S7: ROC curves for DotcodeR on the respective RNA families in the test set of simulated short genomes whose negatives are 'genome-shuffled' sequences. In this test, we used a window size of 120 nt, a step size of 30 nt and d = 1.

Figure S12: DotcodeR score as a function of GC content on the training set of simulated short genomes. The scores on the y-axis and the min cutoff can be interpreted in the same way as in Figure S11. Note that GC content was calculated only on real window pairs in the dataset, since it should be the same for real and shuffled sequences.

S3.1 Parameter settings
We used the Needleman-Wunsch global alignment algorithm to compute similarity scores in DotcodeR with alignment, with the following parameters:
• Match score between two binary digits: 9;
• Mismatch penalty between two binary digits: 2;
• Gap penalty: 1;
• Threshold for the sum of the neighboring probabilities in a dot plot: 0.1.
Note that the last parameter was also used in DotcodeR with dot product.
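As an illustration of how the scoring parameters above interact, the following is a minimal sketch of a Needleman-Wunsch global alignment score on two binary strings. It is not the DotcodeR implementation: the function name is hypothetical, a linear gap cost is assumed, and the two penalties are assumed to be subtracted from the score. How the binary strings are derived from a dot plot (including the 0.1 threshold) is not shown here.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 9,
                           mismatch: int = 2,
                           gap: int = 1) -> int:
    """Global alignment score of two binary strings.

    Assumptions (not confirmed by the source): linear gap cost, and
    `mismatch` and `gap` are penalties that are subtracted.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else -mismatch
            curr[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] - gap,       # gap in b
                          curr[j - 1] - gap)   # gap in a
        prev = curr
    return prev[len(b)]
```

For example, `needleman_wunsch_score("101", "11")` aligns the two 1s and opens one gap. Only two rows of the dynamic programming table are kept, which suffices when only the score (and not the alignment itself) is needed.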

S3.2 Calculating the number of pairs of windows in input
The number of pairs of windows between two chromosomes is calculated by counting the number of windows in each sequence and taking the product of the two counts. In particular, the number of pairs of windows in cleaned input in Table 3 in the main article was calculated as follows: #{pairs of windows in cleaned input} where reduced alignments mean the ones obtained by removing pairs of overlapping repeat regions from the original alignments.
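The basic product rule described in the first sentence above can be sketched as follows. This is an illustration only: the function names are hypothetical, trailing partial windows are assumed to be dropped, and the correction for cleaned input and reduced alignments is not modeled. The defaults of 120 nt and 30 nt are the window and step sizes quoted elsewhere in this supplement.

```python
def count_windows(length: int, window: int = 120, step: int = 30) -> int:
    """Number of full sliding windows in a sequence of the given length.

    Assumption: a trailing window shorter than `window` is discarded.
    """
    if length < window:
        return 0
    return (length - window) // step + 1


def count_window_pairs(len_a: int, len_b: int,
                       window: int = 120, step: int = 30) -> int:
    """All-against-all number of window pairs between two sequences."""
    return count_windows(len_a, window, step) * count_windows(len_b, window, step)
```

For long sequences the count per sequence is close to (L − w)/s, so the number of pairs grows as (L − w)²/s² when both sequences have length L.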

S3.3 Estimating run-time for chromosomal screen and genomic screen
The chromosomal screen by DotcodeR on the "original" input took 14.2 CPU months, or approximately four days of run-time on a small computer cluster. Taking this and Table 3 in the main article into account, the estimated run-time for the chromosomal screen on the "cleaned" input is

$$\text{run-time for original input} \times \frac{\#\{\text{pairs of windows in cleaned input}\}}{\#\{\text{pairs of windows in original input}\}} = 14.2 \times \frac{2.296058 \times 10^{12}}{3.184135 \times 10^{12}} = 14.2 \times 0.72 = 10.2 \ \text{(CPU months)},$$

or $4 \times 0.72 = 2.9$ (days) on the small computer cluster.
Next, let us consider the full genomic screen. The number of window comparisons between the human and mouse genomes, each of size 3G bases, is estimated as

$$\frac{\{3 \times 10^{9} \times (1 - 0.5) - 120\}^{2}}{30^{2}} = 2.5 \times 10^{15},$$

following the theoretical $O\!\left(\frac{(L-w)^{2}}{s^{2}}\right)$ comparisons described in the main text. Note that approximately 50% of the genomes are assumed to be repeats [1], and thus we remove such regions in the above calculation. The estimated run-time for the full genomic screen is

$$14.2 \ \text{CPU months} \times \frac{2.5 \times 10^{15}}{3.2 \times 10^{12}} = 11100 \ \text{CPU months} = 925 \ \text{CPU years},$$

which would take

$$4 \ \text{days} \times \frac{11100 \ \text{CPU months}}{14.2 \ \text{CPU months}} = 3130 \ \text{days} = 8.6 \ \text{years}$$

to run on the current cluster.
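The arithmetic in this section can be checked with a short script. This is a sketch under the assumption that run-time scales linearly with the number of window pairs. All variable and function names are illustrative, and the genome-scale estimate uses the rounded values 2.5 × 10^15 and 3.2 × 10^12 quoted in the text, so small rounding differences from the printed figures are expected.

```python
CPU_MONTHS_ORIGINAL = 14.2     # measured chromosomal screen, "original" input
WALL_DAYS_ORIGINAL = 4.0       # approximate wall-clock time on the cluster
PAIRS_ORIGINAL = 3.184135e12   # window pairs in original input (Table 3)
PAIRS_CLEANED = 2.296058e12    # window pairs in cleaned input (Table 3)


def scale_runtime(base: float, pairs_target: float, pairs_base: float) -> float:
    """Scale a measured run-time linearly with the number of window pairs."""
    return base * pairs_target / pairs_base


def genome_comparisons(genome_length: float = 3e9, repeat_fraction: float = 0.5,
                       window: int = 120, step: int = 30) -> float:
    """Estimated window comparisons between two genomes of equal length."""
    effective = genome_length * (1 - repeat_fraction) - window
    return effective ** 2 / step ** 2


# Chromosomal screen on the cleaned input.
cleaned_cpu_months = scale_runtime(CPU_MONTHS_ORIGINAL, PAIRS_CLEANED, PAIRS_ORIGINAL)
cleaned_wall_days = scale_runtime(WALL_DAYS_ORIGINAL, PAIRS_CLEANED, PAIRS_ORIGINAL)

# Full genomic screen, using the rounded counts from the text.
genome_cpu_months = scale_runtime(CPU_MONTHS_ORIGINAL, 2.5e15, 3.2e12)
genome_cpu_years = genome_cpu_months / 12
genome_wall_days = scale_runtime(WALL_DAYS_ORIGINAL, 2.5e15, 3.2e12)
genome_wall_years = genome_wall_days / 365

print(f"cleaned: {cleaned_cpu_months:.1f} CPU months, {cleaned_wall_days:.1f} days")
print(f"genome:  {genome_cpu_months:.0f} CPU months, {genome_wall_years:.1f} years")
```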

S3.4 Criterion of determining repeat and aligned regions in annotation
Assume that a known annotated region in a genome (e.g., an exon) overlaps with a known repeat region. Let $r_a$ be the non-overlapping region in the annotation and $|r_a|$ be the length of that region. If $|r_a| < 2s$, where $s$ is the step size of the sliding window, we judge this annotated region as "repeat." A known aligned region can be interpreted similarly but in a pairwise way.
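The single-sequence criterion above can be sketched as follows. This is an illustration rather than the authors' code: the function names are hypothetical, intervals are assumed to be half-open `(start, end)` coordinate pairs, and handling several repeat intervals by summing their covered length is an assumption that generalizes the single-repeat case described in the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def covered_length(annotation: Interval, repeats: List[Interval]) -> int:
    """Number of annotation positions covered by at least one repeat."""
    start, end = annotation
    covered = 0
    cursor = start  # everything left of cursor has already been accounted for
    for r_start, r_end in sorted(repeats):
        r_start = max(r_start, cursor)
        r_end = min(r_end, end)
        if r_start < r_end:
            covered += r_end - r_start
            cursor = r_end
    return covered


def is_repeat(annotation: Interval, repeats: List[Interval], step: int = 30) -> bool:
    """Judge an annotation as "repeat" if it overlaps a repeat and |r_a| < 2s."""
    covered = covered_length(annotation, repeats)
    if covered == 0:
        return False  # the criterion only applies to overlapped annotations
    non_overlapping = (annotation[1] - annotation[0]) - covered
    return non_overlapping < 2 * step
```

With the step size of 30 nt, an annotation is labeled "repeat" when fewer than 60 nt of it lie outside repeat regions.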