- Open Access
Efficient calculation of exact probability distributions of integer features on RNA secondary structures
© Mori et al.; licensee BioMed Central Ltd. 2014
- Published: 12 December 2014
Although the needs for analyses of secondary structures of RNAs are increasing, prediction of the secondary structures of RNAs are not always reliable. Because an RNA may have a complicated energy landscape, comprehensive representations of the whole ensemble of the secondary structures, such as the probability distributions of various features of RNA secondary structures are required.
A general method to efficiently compute the distribution of any integer scalar/vector function on the secondary structure is proposed. We also show two concrete algorithms, for Hamming distance from a reference structure and for 5ʹ − 3ʹ distance, which can be constructed by following our general method. These practical applications of this method show the effectiveness of the proposed method.
The proposed method provides a clear and comprehensive procedure to construct algorithms for distributions of various integer features. In addition, distributions of integer vectors, that is a combination of different integer scores, can be also described by applying our 2D expanding technique.
- RNA secondary structure
- integer score distribution
- structural landscape
Recent investigations of coding and non-coding RNAs have proved that RNA molecules have more important roles in the regulation of living cells than those of our previous knowledge. It has also become clear that the structures of RNAs, especially the secondary structures, are one of the important features to identify the functions of RNAs. While the high-throughput methods to determine the secondary structures of RNAs are spreading, the importance of computational analyses of RNA sequences including prediction of secondary structures is increasing [1, 2].
where PSt and ESt are respectively the existence probability and the free energy of the structure St, k B is the Boltzmann constant and T is the temperature constant. Z is the normalizing factor known as the partition function, which is the summation of Boltzmann factor e−ESt /(k B T) among all the possible structures. The partition function of an RNA sequence and the free energy of each structure can be obtained by dynamic programming algorithms on the parameters determined experimentally [3, 4].
The equation (1) shows that the structure with the highest existence probability is the structure of the minimum free energy. Therefore, it is natural to treat the secondary structure of the minimum free energy as the estimate of the secondary structure. The probability that an RNA folds into a particular structure is, however, generally extremely low even if it is the structure of the minimum free energy, because of the combinatorial explosion . For example, the probability of a particular secondary structure of some rRNAs are less than 10−22 no matter which structure is chosen. This means that the prediction of the secondary structure of an RNA and subsequent analyses based on the predicted secondary structure are not always reliable. It is therefore desirable to investigate properties of the probability distributions of the whole ensemble of the possible structures.
We propose in this paper a general method to efficiently compute the exact distribution of any integer quantity of the feature on each secondary structure. The proposed method has been motivated by the framework and its application to sequence alignments by . Their framework is generally valid for integer functions on Boltzmann distributions whose partition function can be calculated by a linear dynamic programming. For the case of secondary structures of RNAs, however, the recursions in the dynamic programming of the partition function have more complicated forms including the products of combinations of DP matrix elements, which inhibits direct application of their framework. We have overcome the difficulty by expanding McCaskill algorithm, which is a well-known dynamic programming of the partition function of the secondary structures of RNAs using the energy parameters experimentally determined .
Naive implementations of our proposed method requires computational complexities of in time and in space, where n is the length of the sequence, |S| is the size of the integer score variation, which depends on the objective distribution while they never exceed n in the case of example problems in this paper, and α and β is the costs depending on the objective score. By adapting Discrete Fourier Transform, we can reduce those complexities to in time and . The DFT in our method on RNA structures achieves an order-level improvement of the complexity, which could not achieved by the DFT on linear dynamic programmings in . We can further reduce time complexity to by parallel computing using U computational units.
We demonstrate the effectiveness of the proposed method in several practical problems. The first example is the distribution of the Hamming distances from a reference structure. A practically equivalent algorithm and its acceleration have been implemented as RNAbor by  and , while we have reconstructed the algorithm by deducing from our general principle. The second example is the exact distributions of 5ʹ - 3ʹ distance. Conventional methods for analysing 5ʹ - 3ʹ distance only calculate mean length or assume over-simplified models. We propose here a novel algorithm to compute the whole distribution of 5ʹ - 3ʹ distance considering the thermodynamic properties of the RNAs. The final example is acceleration of RNA2Dfold, which is included in ViennaRNA package .In this example, the distribution of the Hamming distances from two specified reference structures are calculated. We show our method reduces computational complexity from O(n7) in time and O(n4) in re-space to less than O(n5/U ) in time and O(n2U ) in space, which is a similar idea proposed recently . These examples indicate that our method offers a way to obtain a wide variety of distributions of integer quantities.
We first show the fundamental concepts of our proposed method in this section.
Definition of integer score distribution
In this paper, we discuss on how to efficiently compute integer score distributions in general and in the specific cases for RNA secondary structures. Our proposed method for RNA secondary structures efficiently computes the exact distribution when p(x) and p(s) can be calculated by the dynamic programming algorithms sharing a same form.
A conventional model for integer score distribution
For a certain class of problems, including distributions of integer score of each sequence alignment, the partition function of the objective distribution can be calculated abstractly by Algorithm 1. Z is the partition function shown in equation (2). Z is a scalar array of length N representing the partition function of the problem size N , whose components for the dynamic programming are aligned in the computing order. t(k|i) is a quantity proportional to the probability of the transition from state i to state k, which can be quite sparse in values.
Algorithm 1 An abstract form of calculating the partition function
Z = 1
for k = 1 to N do
Z = Z[N ]
 showed that if the partition function can be computed by Algorithm 1, integer score distributions are obtained by Algorithm 2, where Z(x) is an array of polynomials of x, and s(i, k) is the gain of the integer score in the transition from i to k.
Algorithm 2 A polynomial approach to integer score distributions proposed in 
Z(x) = 1
for k = 1 to N do
Z = a sum of coefficients of polynomial Z(x)[N ]
where S max is the maximum score.
A general model for integer score distribution of RNA secondary structure
In the case of RNA secondary structures, the dynamic programming for the partition function does not match to Algorithm 1. Therefore, we have to construct an algorithm different from Algorithm 2 for the calculation of integer score distributions on RNA secondary structures. As pseudo-code is shown in Algorithm 3, products of combinations between DP matrix elements and constant term c k are required for the computation. The detailed description of this derivation is shown in the additional file 1 (Section S1).
Algorithm 3 A general polynomial approach to integer score distributions for the ensemble of RNA secondary structures
Z(x) = 1
for k = 1 to N do
Z = a sum of coefficients of polynomial Z(x)[N ]
The partition function is dispersed according to the score of each secondary structure included in the whole ensemble. In other words, the coefficient of x S in Z(x)[N] represents proportional to the probability that the RNA structure has score S. After the calculation by Algorithm 3, p S can be derived from equation (5).
Algorithm 3 requires computational complexities of in time and in memory, where α and β is the complexities in time and in space respectively for the calculation of each integer score.
Adopting Discrete Fourier Transform (DFT)
Discrete Fourier Transform (DFT) is a Fourier Transform on a discrete sampling interval, which is employed in improving the efficiency of various computational problems as well as frequency analysis. According to , by applying DFT distributed processing is available for computing integer score distributions on sequence alignments. On RNA secondary structures, DFT reduces time complexity of computations in order-level as well as merely decentralize the procedure.
After ζis obtained, DFT extracts z from ζby time.
Algorithm 4 shown below is the modification of our naive Algorithm 3 by adopting DFT approach.
Algorithm 3 suffers from heavy computations of in time for products of polynomials if the degree S max is large. In the recursions for ζin Algorithm 4, however, each computation for polynomial products is replaced to a computation of products of complex numbers, which requires only a constant time. While we still need to extract z from ζby time, the total time complexity is reduced from to . In addition, each ζ k can be calculated
Algorithm 4 DFT-adopted approach for integer score distribution
/* DP phase (distributed processing is available) */
for S = 0 to S max do
Z[S] = 1
for k = 1 to N do
ζ S = Z[S][N]
/* DFT phase*/
for S = 0 to S max do
Required time and space
DFT with U* units
According to the above approach, we next construct and implement concrete formulas of computing a general integer score distribution for RNA secondary structures based on McCaskill model. McCaskill model is a standard procedure for computing partition function in equation (2) by a dynamic programming based on energy parameters. In this model, the partition function is obtained as Z1,nfrom the following recursive scheme of polynomial order:
Although the second factor in the right hand side of the equation (15) indicates that this procedure requires O(n4) in time, it is usually reduced to O(n3) by assuming a reasonable threshold of the length of the internal loops.
Score accumulable McCaskill model
In this section, we show three examples to demonstrate how to construct algorithms for practical problems. The first and the second examples are the case to which our general model is directly applicable, where all we have to do is defining scoring functions. In the third example, we expand our model into two dimensions in order to describe a distribution of two dimensional integer vector.
We practically implemented and evaluated the concrete algorithms for those three examples with a distributed processing application by OpenMP on a dual quad-core Intel® Xeon® E5540 @2.53GHz with 17.6 GB RAM. The run time was measured as a mean of 10 random sequences by single or eight cores.
A distribution of the Hamming distance from a reference structure
Conventional RNA secondary structure estimation produces the most stable and possible structure or the representative structure such as a centroid in the whole ensemble. Those point estimations of the secondary structures, however, have a risk to neglect the information on the thermal fluctuations or significant suboptimal structures . Some local structures might be relatively stable only at certain global structures, and some structures such as ribo-switches might have multiple distinct stable global structures besides the predicted structures . RNAbor  is a software which exactly calculates the probability that RNA folds into the structures that have the same distance from a given structure. It informs us concentration of existence probability around a structure predicted by conventional software, which will help deeper understanding about biological behavior of RNA molecules. Our model is applicable for this problem since the distance between RNA secondary structures can be defined as an integer function. Here we reconstruct the algorithm from a viewpoint of our general principle described in the Approach section, motivated by the work by Newberg et al., while practically equivalent algorithm has been independently presented in .
Definition of distance
Let us call S a structure vector. The dimension of a structure vector is for an RNA of length n.
The Hamming distance between RNA structures never exceeds its sequence length n in spite of the high dimensions of structure vectors, we obtain d max ≤ n as the exact maximum of d by cubic time (See the Section S3 in the additional file 1).
A distribution of RNA 5ʹ- 3ʹdistance
Recently, Yoffe et al. found that the distance of 5ʹ end and 3ʹ end of the RNA molecule tended to be short, largely independent of molecule lengths or sequence patterns . They pointed out the relevance of these observations and biological interpretation especially about in viral RNA evolution. Clote et al. proposed a method for calculating an expected distance , but it might be helpful for RNA structure analysis to reveal the population of structures shorter than some threshold as well as mean length. A method for counting the 5ʹ-3ʹ distances over all secondary structures has been proposed by , but their method assumes that all structures occur by the same probability and every base can make pairs with an arbitrary base except for pseudoknots. We propose the first algorithm for computing the exact probability distribution of the 5ʹ-3ʹ distances based on the McCaskill model.
Definition of 5ʹ - 3ʹ distance
The g1(i, j) is the 5ʹ - 3ʹ distance of the chain structure, which contains no base pairs. The g2(i, j, k) is the newly accumulated 5ʹ - 3ʹ distance, that is the junction of k-th and k+1-th bases. The g3(i, j, k) represents the sum of a hydrogen bridge by i-th and k-th bases and length of a chain structure from the k + 1-th base. Other functions g m (·), m = 4, . . . , 9) are irrelevant to 5ʹ - 3ʹ distance because their corresponding transitions for internal structures.
Total computational complexity using DFT with U parallel computing units, is O(n4/U ) in time and O(n2U ) in space (S max = n − 1, α = β = 1). In addition, since , , and do not contain variable x, therefore we can reduce the total amount of calculation by pre-computing them (See the Section S4 in the additional file 1).
A distribution of 2D RNA folding landscapes
RNA2Dfold is an application for 2D projections of RNA folding landscapes which are the two-dimensional probability distributions whose axis correspond to the Hamming distances from two independent given reference structures . Such distributions provide profound information on the whole ensemble through the medium of landscapes depending on the given structures. The RNA2Dfold, however, has difficulty of computational cost; it requires O(n7) in time and O(n4) in space though the computational time can be drastically improved by utilizing sparse matrices. On the other hand, extension of our proposed method reduces the complexity to less than O(n5) in time and O(n2) in space. Our method only calculate the distribution though RNA2Dfold also computes the minimum free energy structure of every combination of distances from the given structures. While a similar simplified algorithm has been proposed by , we construct an effective algorithm using DFT by expanding general principle described in previous sections.
Expanding the original model to two dimensions
The problem of computing the 2D folding landscape of an RNA, is defined as a natural expansion to two dimensions of the algorithm mentioned in the section. In this case, the objective distribution is defined on the two-dimensional discrete sample space which represents the Hamming distances from two given reference structures. Accordingly, we expand original model in equations (19)-(23) to two dimensions for the purpose of corresponding to two-dimensional score vectors. As shown in Algorithm S2, a vector variable x = (x1, x2) is employed to accumulate each component of a score vector S = (S1, S2) instead of applying a scalar variable x. The computational complexity of this model is O(n3S1maxS2maxα1α2/U ) in time and O((n2 + β1 + β2)U ) in space, where U (≤ S1maxS2max) is the number of parallel processing units, and α k and β k are time and space complexity for computing scoring function of k-th component.
Now we can construct a model for the distribution of the Hamming distance from the two given structures by assigning S1 and S2 to the Hamming distance from the first and the second structures respectively. The total cost of this algorithm is O(n3d1maxd2max/U) in time and O(n2U) in space. A concrete description is not shown here but in the Section S5 and S6 in the additional file 1.
Run time evaluation and sample outputs
Next we show the run time of the above three algorithms. We adopt the minimum free energy (MFE) structures as the reference structures for the algorithms in the section 4.1 and 4.3. The other reference structure for the algorithm in the section 4.3 is the open chain structure, that is a completely no base pairing structure. We measured the run time by single or 8 cores, though theoretically we can distribute the process up to S max or S1maxS2max.
Unreliable predictions of the RNA secondary structure have been prevented us from integrated analysis of RNA based on the estimated RNA structures. The energy model of the RNA secondary structure, however, offers rich information about the target RNA if we use appropriate algorithms to extract it. Such information is useful for analyzing not only the 3D structure prediction as a natural extension of secondary structure, but also the stabilities, the interactions with the other molecules, and so on.
In this paper, we proposed a general method to construct fast and accurate algorithms to compute the exact probability distributions of integer-valued features on the energy model of RNA secondary structures. We have shown that two useful algorithms, for Hamming distance from a reference structure and for 5ʹ- 3ʹ distance, can be constructed by just assigning the score functions gk (·). We extended the general method of an integer score to the method of an integer vector (2D), for the distributions of Hamming distances from two reference structures. We also applied those algorithms to tRNA as an example, and demonstrated the usefulness of observing the landscapes of probability distributions of the features. Although in some applications there have been proposed practically equivalent algorithms, the proposed method offers a clear and comprehensive guideline to design algorithms for a wide variety of integer features. The web server for the distributions of the Hamming distances is available at http://rtools.cbrc.jp/cgi-bin/index.cgi. We don't show the precise implementations for the other applications, but the proposed method is applicable to the integer features such as number of base pairs, or frequency of specific structure motifs by a little modification of original McCaskill model. In addition, distributions of combination of different integer scores can be also visualized by applying the 2D expanding technique described in the previous section.
The authors thank to Toutai Mituyama and Yukiteru Ono for their help in integration of the software to the web page. The authors also thank to Hisanori Kiryu, Tomoshi Kameda, Junichi Iwakiri for useful discussions. The authors also thank to Ivo Hofacker et al. who developed Vienna RNA Package. This work was supported by JSPS KAKENHI Grant Numbers 13J06668, 24680031, 25240044, and MEXT KAKENHI Grant Number 221S0002.
The publication charges for this work were funded by a Grant-in-Aid for Young Scientists (13J06668).
This article has been published as part of BMC Genomics Volume 15 Supplement 10, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S10.
- Contrant M, Fender A, Chane-Woon-Ming B, Randrianjafy R, Vivet-Boudou V, Richer D, Pfeffer S: Importance of the RNA secondary structure for the relative accumulation of clustered viral microRNAs. Nucleic Acids Research. 2014, 42 (12): 7981-7996. 10.1093/nar/gku424.PubMedPubMed CentralView ArticleGoogle Scholar
- Wan Y, Qu K, Zhang QC, Flynn RA, Manor O, Ouyang Z, Zhang J, Spitale RC, Snyder MP, Segal E, Chang HY: Nature. 7485, 706-709.Google Scholar
- Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA. 2004, 101 (19): 7287-7292. 10.1073/pnas.0401799101.PubMedPubMed CentralView ArticleGoogle Scholar
- Turner DH, Mathews DH: NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 2010, 38 (Database): 280-282. 10.1093/nar/gkp892.View ArticleGoogle Scholar
- Carvalho LE, Lawrence CE: Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc Natl Acad Sci USA. 2008, 105 (9): 3209-3214. 10.1073/pnas.0712329105.PubMedPubMed CentralView ArticleGoogle Scholar
- Newberg LA, Lawrence CE: Exact calculation of distributions on integers, with application to sequence alignment. J Comput Biol. 2009, 16 (1): 1-18. 10.1089/cmb.2008.0137.PubMedPubMed CentralView ArticleGoogle Scholar
- McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6-7): 1105-1119. 10.1002/bip.360290621.PubMedView ArticleGoogle Scholar
- Freyhult E, Moulton V, Clote P: RNAbor: a web server for RNA structural neighbors. Nucleic Acids Res. 2007, 35 (Web Server): 305-309. 10.1093/nar/gkm255.View ArticleGoogle Scholar
- Senter E, Sheikh S, Dotu I, Ponty Y, Clote P: Using the fast fourier transform to accelerate the computational search for RNA conformational switches. PLoS ONE. 2012, 7 (12): 50506-10.1371/journal.pone.0050506.View ArticleGoogle Scholar
- Lorenz R, Flamm C, Hofacker IL: 2D Projections of RNA Folding Landscapes.Google Scholar
- Senter E, Dotu I, Clote P: Efficiently computing the 2D energy landscape of RNA. Math Biol. 2014Google Scholar
- Ding Y, Chan CY, Lawrence CE: RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA. 2005, 11 (8): 1157-1166. 10.1261/rna.2500605.PubMedPubMed CentralView ArticleGoogle Scholar
- Serganov A, Polonskaia A, Phan AT, Breaker RR, Patel DJ: Structural basis for gene regulation by a thiamine pyrophosphate-sensing riboswitch. Nature. 2006, 441 (7097): 1167-1171. 10.1038/nature04740.PubMedView ArticleGoogle Scholar
- Yoffe AM, Prinsen P, Gelbart WM, Ben-Shaul A: The ends of a large RNA molecule are necessarily close. Nucleic Acids Res. 2011, 39 (1): 292-299. 10.1093/nar/gkq642.PubMedPubMed CentralView ArticleGoogle Scholar
- Clote P, Ponty Y, Steyaert JM: Expected distance between terminal nucleotides of RNA secondary structures. J Math Biol. 2012, 65 (3): 581-599. 10.1007/s00285-011-0467-8.PubMedView ArticleGoogle Scholar
- Han HS, Reidys CM: The 5ʹ-3ʹ distance of RNA secondary structures. J Comput Biol. 2012, 19 (7): 867-878. 10.1089/cmb.2011.0301.PubMedView ArticleGoogle Scholar
- Hamada M, Kiryu H, Sato K, Mituyama T, Asai K: Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics. 2009, 25 (4): 465-473. 10.1093/bioinformatics/btn601.PubMedView ArticleGoogle Scholar
- Puton T, Rother K, Kozlowski L, Tkalinska E, Bujnicki J: A Server for Continuous Benchmarking of Automated Methods for RNA Structure Prediction. 2011Google Scholar
- Copela LA, Chakshusmathi G, Sherrer RL, Wolin SL: The La protein functions redundantly with tRNA modification enzymes to ensure tRNA structural stability. RNA. 2006, 12 (4): 644-654. 10.1261/rna.2307206.PubMedPubMed CentralView ArticleGoogle Scholar
- Tuorto F, Liebers R, Musch T, Schaefer M, Hofmann S, Kellner S, Frye M, Helm M, Stoecklin G, Lyko F: RNA cytosine methylation by Dnmt2 and NSun2 promotes tRNA stability and protein synthesis. Nat Struct Mol Biol. 2012, 19 (9): 900-905. 10.1038/nsmb.2357.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.