Efficient calculation of exact probability distributions of integer features on RNA secondary structures

Background Although the need for analyses of RNA secondary structures is increasing, predictions of RNA secondary structures are not always reliable. Because an RNA may have a complicated energy landscape, comprehensive representations of the whole ensemble of secondary structures, such as the probability distributions of various features of RNA secondary structures, are required.
Results We propose a general method to efficiently compute the distribution of any integer scalar/vector function on the secondary structure. We also show two concrete algorithms, one for the Hamming distance from a reference structure and one for the 5ʹ−3ʹ distance, which can be constructed by following our general method. These practical applications show the effectiveness of the proposed method.
Conclusions The proposed method provides a clear and comprehensive procedure for constructing algorithms for distributions of various integer features. In addition, distributions of integer vectors, that is, combinations of different integer scores, can also be described by applying our 2D expanding technique.

S2 A method to obtain g_k(·) in constant time for distributions of Hamming distance from a reference structure
First, we show the full description of g_k(·).
This description shows that evaluating g_k(·), which obtains the Hamming distance between two structures, requires O(n^2) calculation. This is one of the bottlenecks, since these functions are embedded in the recursive process. If we pre-calculate a vector C before the recursive process, we can obtain g_k(·) in O(1). The vector C is defined for a structure vector S; we call it the cumulative structure vector. C can be computed efficiently by dynamic programming, and this pre-calculation requires O(n^2) time. The O(1) procedure for g_k(·) is given in equations (S16), (S18), and (S21).
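The definition of C is not reproduced in this text, so the following is only a minimal sketch under an assumption: C is taken to be a table in which `C[i][j]` counts the reference base pairs lying entirely inside the interval [i, j]. The names `parse_dot_bracket`, `cumulative_pair_counts`, and `hairpin_distance` are hypothetical, and the hairpin case is just one illustrative example of a g_k-like term that becomes an O(1) lookup once the table exists.

```python
def parse_dot_bracket(db):
    """Return the set of base pairs (i, j), 0-based, of a dot-bracket string."""
    stack, pairs = [], set()
    for idx, ch in enumerate(db):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            pairs.add((stack.pop(), idx))
    return pairs


def cumulative_pair_counts(ref_pairs, n):
    """Assumed form of the cumulative table: C[i][j] is the number of
    reference pairs (p, q) with i <= p < q <= j. Filled in O(n^2) time
    by inclusion-exclusion over shorter intervals."""
    C = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            inner = C[i + 1][j - 1] if span >= 2 else 0
            C[i][j] = C[i + 1][j] + C[i][j - 1] - inner + ((i, j) in ref_pairs)
    return C


def hairpin_distance(C, ref_pairs, i, j):
    """Hypothetical O(1) term: Hamming distance, restricted to [i, j],
    between the reference and a structure whose only pair in [i, j] is
    the hairpin-closing pair (i, j)."""
    in_ref = 1 if (i, j) in ref_pairs else 0
    # Every reference pair in [i, j] other than (i, j) is lost; (i, j) itself
    # adds 1 when it is not a reference pair.
    return C[i][j] + 1 - 2 * in_ref


ref = parse_dot_bracket("((...))")
C = cumulative_pair_counts(ref, 7)
print(hairpin_distance(C, ref, 0, 6), hairpin_distance(C, ref, 0, 5))
```

Without the table, each such term would require rescanning the reference pairs inside the interval at every step of the recursion.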

S3 Pre-calculating the maximum distance between structures
We can find the exact maximum value d_max of the Hamming distance, although it never exceeds the sequence length n: d_max = max_{S ∈ ς} d(S, S_r), where ς is the set of all possible candidate structure vectors and S_r is the reference structure vector. We construct an O(n^3) dynamic programming procedure over a table D to obtain d_max, using the cumulative structure vector C defined in Section S2. We finally obtain d_max as D_{1,n}.
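The recursion for D is not shown in this text. As a hedged sketch of how d_max can be obtained in O(n^3), the code below uses an equivalent reformulation rather than the authors' recursion: since d(S, S_r) = |S| + |S_r| − 2|S ∩ S_r|, the maximum is reached by a structure that avoids every reference pair and contains as many non-reference pairs as possible. The pairing rules (Watson-Crick plus G-U), the minimum hairpin loop of 3, and the function names are assumptions made for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_dot_bracket(db):
    """Return the set of base pairs (i, j), 0-based, of a dot-bracket string."""
    stack, pairs = [], set()
    for idx, ch in enumerate(db):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            pairs.add((stack.pop(), idx))
    return pairs


def max_hamming_distance(seq, ref_db, min_loop=3):
    """Maximum base-pair (Hamming) distance between the reference and any
    pseudoknot-free structure of seq, via a Nussinov-style O(n^3) table."""
    n = len(seq)
    ref = parse_dot_bracket(ref_db)
    if n == 0:
        return 0
    # M[i][j]: maximum number of non-reference pairs in a structure on seq[i..j]
    M = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = M[i + 1][j]  # case: position i is unpaired
            for k in range(i + min_loop + 1, j + 1):  # case: i pairs with k
                if (seq[i], seq[k]) in PAIRS and (i, k) not in ref:
                    right = M[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + M[i + 1][k - 1] + right)
            M[i][j] = best
    # All reference pairs are missing, and M[0][n-1] new pairs are present.
    return len(ref) + M[0][n - 1]


print(max_hamming_distance("GGGAAACCC", "(((...)))"))
```

For this toy input the maximum is attained, for example, by the structure with pairs (0, 7) and (1, 6): three reference pairs are lost and two new pairs are gained.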

S4 A time-saving method for calculating exact distributions of d_5ʹ−3ʹ
We show a time-saving procedure for calculating exact distributions of the 5ʹ−3ʹ distance d_5ʹ−3ʹ.
Algorithm S2 Exact calculation of a d_5ʹ−3ʹ distribution by the DFT approach
The recursions implied above are as follows:
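The body of Algorithm S2 and its recursions are not reproduced in this text. The following toy sketch illustrates only the general DFT idea named in the algorithm title, and it is not the authors' procedure. It makes several assumptions: every structure has weight 1 instead of a Boltzmann factor, the 5ʹ−3ʹ distance is the shortest-path length along the exterior loop (backbone edges and base-pair edges each count 1), and pairing follows Watson-Crick plus G-U with a minimum hairpin loop of 3. All function names are hypothetical.

```python
import cmath

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_table(seq, min_loop=3):
    """Return q(i, j): the number of structures on seq[i..j] (1 if empty)."""
    n = len(seq)
    Q = [[1] * n for _ in range(n)]

    def q(i, j):
        return Q[i][j] if i <= j else 1

    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            total = q(i, j - 1)  # j unpaired
            for k in range(i, j - min_loop):  # j paired with k
                if (seq[k], seq[j]) in PAIRS:
                    total += q(i, k - 1) * q(k + 1, j - 1)
            Q[i][j] = total
    return q


def exterior_value(seq, q, x, min_loop=3):
    """Evaluate Z(x) = sum over structures of x**distance at one complex x."""
    n = len(seq)
    F = [0j] * n
    F[0] = 1  # a single base: distance 0
    for j in range(1, n):
        val = x * F[j - 1]  # j unpaired: one more backbone edge
        for k in range(0, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = 1 if k == 0 else x * F[k - 1]  # reach k from the 5' end
                val += left * x * q(k + 1, j - 1)  # pair edge k-j costs 1
        F[j] = val
    return F[n - 1]


def end_to_end_distribution(seq, min_loop=3):
    """counts[d] = number of structures whose 5'-3' distance equals d."""
    n = len(seq)  # assumes n >= 1; distances lie in 0..n-1
    q = count_table(seq, min_loop)
    # DP phase: evaluate Z at the n-th roots of unity (independent runs).
    values = [exterior_value(seq, q, cmath.exp(2j * cmath.pi * k / n), min_loop)
              for k in range(n)]
    # DFT phase: recover the coefficients of Z from those evaluations.
    counts = []
    for d in range(n):
        c = sum(values[k] * cmath.exp(-2j * cmath.pi * k * d / n)
                for k in range(n)) / n
        counts.append(round(c.real))
    return counts


print(end_to_end_distribution("GGAAACC"))
```

Each evaluation works with plain complex numbers instead of polynomials, which is where the saving over carrying a full distance index through the recursion comes from in this sketch. Rounding to integers is only safe here because the toy counts are small.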

S5 An algorithmic framework for the distribution of a two-dimensional integer vector
In Algorithm S3, we show a method to expand our original algorithm to two dimensions.
Algorithm S3 2D expansion of the original model in Algorithm 4.
/* DP phase (distributed processing is available) */
for S_1 = 0 to S_1max do
    ...
    end for
end for
/* DFT phase */
for S_1 = 0 to S_1max do
    for S_2 = 0 to S_2max do
        ...
where x = (x_1, x_2); the subscripts 1 and 2 indicate variables or functions for the first and second components of the two-dimensional score vector, respectively; and P_S is the probability that the RNA sequence folds into a structure whose score is (S_1, S_2).
We obtain p_S, the probability of obtaining score S, by the following equation:
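The equation and the inner steps of Algorithm S3 are not reproduced in this text. The sketch below shows only the generic two-phase pattern suggested by the surviving comments, under stated assumptions: a hypothetical function `evaluate(x1, x2)` stands in for the DP phase and returns the weighted sum of x1**S_1 * x2**S_2 over all structures, and the toy list `TOY` of (weight, S_1, S_2) triples replaces a real partition-function recursion.

```python
import cmath


def root(k, n):
    """k-th power of the primitive n-th root of unity."""
    return cmath.exp(2j * cmath.pi * k / n)


def joint_distribution(evaluate, s1_max, s2_max):
    """Recover P[S1][S2] from evaluations of a bivariate generating function."""
    n1, n2 = s1_max + 1, s2_max + 1
    # DP phase: every grid point is an independent run, so it can be distributed.
    Z = [[evaluate(root(a, n1), root(b, n2)) for b in range(n2)]
         for a in range(n1)]
    # DFT phase: inverse two-dimensional DFT of the grid of evaluations.
    coeff = [[0.0] * n2 for _ in range(n1)]
    for s1 in range(n1):
        for s2 in range(n2):
            acc = 0j
            for a in range(n1):
                for b in range(n2):
                    acc += Z[a][b] * root(-a * s1, n1) * root(-b * s2, n2)
            coeff[s1][s2] = acc.real / (n1 * n2)
    total = sum(sum(row) for row in coeff)
    return [[c / total for c in row] for row in coeff]


# Hypothetical stand-in for the DP phase: three "structures" with weights.
TOY = [(2.0, 0, 1), (1.0, 2, 0), (1.0, 2, 3)]


def toy_partition_function(x1, x2):
    return sum(w * x1 ** s1 * x2 ** s2 for w, s1, s2 in TOY)


P = joint_distribution(toy_partition_function, s1_max=2, s2_max=3)
print(round(P[0][1], 6), round(P[2][0], 6), round(P[2][3], 6))
```

The naive inverse transform above costs O((n1·n2)^2); a fast Fourier transform would reduce this, but the explicit sums keep the sketch dependency-free.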

S6 A concrete description of recursions for the distribution of Hamming distance from two reference structures
We show here concrete recursions to obtain the distribution of the Hamming distance from two reference structures. The naive expansion is quite simple; all we have to do is make substitutions such as:
However, the naive expansion performs many meaningless calculations, which can be cut by exploiting the sparseness of the distribution. From constraints such as the triangle inequality, the following expressions must be satisfied:
where N is the set of natural numbers, ς is the set of all possible secondary structure vectors, S_Ri (i = 1, 2) is the structure vector of the i-th reference, and d(S_1, S_2) denotes the Hamming distance between S_1 and S_2. Equation (S34) is derived from the triangle inequality, and equations (S35) and (S36) originate from the definitions of d_1max and d_2max. The reason for equation (S37) is slightly more complicated. Put simply, a 1-bit transition of the structure vector S always changes its Hamming distance from any other structure by exactly 1, and every structure vector can be reached from any other by repeated 1-bit transitions. We modified our algorithm to reduce meaningless calculations by converting the axes. The abstract form is shown in Algorithm S4.
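Equations (S34) to (S37) are not reproduced in this text. The sketch below encodes one plausible reading of the constraints described in the prose: a triangle inequality, the two range bounds, and a parity condition that follows from the 1-bit transition argument. The change of variables in `to_dense` is a hypothetical example of an axis conversion that maps the sparse feasible cells onto a dense grid; it is not claimed to be the one used in Algorithm S4.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit vectors."""
    return sum(x != y for x, y in zip(a, b))


def is_feasible(d1, d2, d12, d1_max, d2_max):
    """Can some structure have distances (d1, d2) to two references that are
    themselves d12 apart? (Necessary conditions only.)"""
    return (0 <= d1 <= d1_max and 0 <= d2 <= d2_max   # range bounds
            and abs(d1 - d2) <= d12 <= d1 + d2        # triangle inequality
            and (d1 + d2 - d12) % 2 == 0)             # parity condition


def to_dense(d1, d2, d12):
    """Hypothetical axis conversion: feasible cells map to integers
    a >= 0 and 0 <= b <= d12."""
    return (d1 + d2 - d12) // 2, (d1 - d2 + d12) // 2


def from_dense(a, b, d12):
    """Inverse of to_dense."""
    return a + b, a + d12 - b


r1, r2 = (1, 0, 1, 0), (1, 1, 0, 0)
d12 = hamming(r1, r2)
full = [(d1, d2) for d1 in range(5) for d2 in range(5)]
kept = [c for c in full if is_feasible(c[0], c[1], d12, 4, 4)]
print(len(kept), "of", len(full), "cells can be non-zero")
```

In this reading, a counts the mismatches at positions where the two references agree and b counts the mismatches with the first reference at positions where they differ, which is why both are non-negative integers for every reachable cell.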

S7 Explosion of the number of possible structures
The number of possible structures at each distance from the reference structure is massive, owing to the combinatorial explosion of possible base pairs. Fig. S1 shows the number of structures at each Hamming distance from the reference.
Figure S1: The number of possible structures