RMalign: an RNA structural alignment tool based on a size independent scoring function

Predicting the 3D structure of RNA-protein complexes remains challenging. Our team recently proposed PRIME, a template-based approach that builds RNA-protein complex 3D structure models with a higher success rate than computational docking software. However, the scoring function of SARA, the RNA alignment algorithm used in PRIME, is size-dependent, which limits its ability to detect templates in some cases. Here we present RMalign, a novel RNA 3D structural alignment approach based on a size-independent scoring function, RMscore. The parameter in RMscore is optimized on one set of randomly selected RNA pairs, and the phase transition points (from dissimilar to similar) are determined on another set of randomly selected RNA pairs. In tRNA benchmarking with the phase transition points as cut-offs, the precision of RMscore is higher than that of SARAscore (0.8771 and 0.7766, respectively). In balance-FSCOR benchmarking, RMalign performs as well as ESA-RNA, which measures RNA structural similarity with a non-normalized score. In balance-x-FSCOR benchmarking, RMalign performs much better than SARA, a state-of-the-art RNA 3D structural alignment approach, owing to its size-independent scoring function. Taking advantage of RMalign, we update our RNA-protein modeling approach PRIME to version 2.0. PRIME 2.0 significantly improves the success rate, by about 10%, over PRIME.

Author summary
RNA structures are important for RNA functions. As the number of RNA structures in the PDB has increased, RNA 3D structure alignment approaches have been developed. However, the scoring functions used to measure RNA structural similarity are still length-dependent. This shortcoming limits their ability to detect RNA structure templates when modeling RNA structures or RNA-protein 3D complex structures. We therefore developed a length-independent scoring function, RMscore, to enhance the ability to detect RNA structure homologs. The benchmarking data show that RMscore can effectively distinguish similar from dissimilar RNA structures. RMscore should be a useful scoring function for modeling RNA structures. Based on RMscore, we developed an RNA 3D structure alignment tool, RMalign. In both RNA structure and function classification benchmarking, RMalign performs as well as or better than state-of-the-art approaches. With the length-independent RMscore, RMalign should be useful for modeling RNA structures. Based on these results, we update PRIME to PRIME 2.0, a more accurate RNA-protein 3D complex structure modeling tool that should be useful for the biological community.

RNA plays important roles in many biological processes such as gene regulation, subcellular localization and splicing. High-throughput global mapping of RNA duplexes with near base-pair resolution reveals that RNA interacts with RNA and RNA-binding proteins using higher

randomly selected from all-to-all pairs of x-FSCOR. These datasets are named balance-x-FSCOR. If fewer than 1000 positive or negative pairs are available, the dataset contains fewer pairs. The structural classes of the 419 RNA chains in FSCOR at various cut-offs can be downloaded from www.rnabinding.com/RMalign/RMalign.html. Various cut-offs are tried because it is still unknown which value is appropriate for clustering RNA structures.
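The balancing step can be sketched in Python as follows. This is a minimal illustration, assuming that up to 1000 positive and 1000 negative pairs are drawn per dataset; the function name `build_balanced_set`, its arguments and the fixed seed are hypothetical and are not part of the RMalign distribution.

```python
import random


def build_balanced_set(pairs, labels, max_per_class=1000, seed=0):
    """Randomly pick up to max_per_class positive and negative pairs.

    pairs  : list of (chain_a, chain_b) identifiers (hypothetical format)
    labels : list of bools, True when both chains fall in the same class
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    positives = [p for p, same in zip(pairs, labels) if same]
    negatives = [p for p, same in zip(pairs, labels) if not same]
    # When a class has fewer than max_per_class pairs, all of them are kept.
    pos = rng.sample(positives, min(max_per_class, len(positives)))
    neg = rng.sample(negatives, min(max_per_class, len(negatives)))
    return pos, neg


# Toy usage: 60 chains assigned to 3 made-up classes by index.
chains = list(range(60))
all_pairs = [(a, b) for a in chains for b in chains if a < b]
same_class = [(a % 3) == (b % 3) for a, b in all_pairs]
pos_pairs, neg_pairs = build_balanced_set(all_pairs, same_class)
```

In this toy case only 570 positive pairs exist, so the positive set is smaller than the negative set, which mirrors the behaviour described for classes with fewer than 1000 pairs.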
Unbound protein-RNA docking set. The unbound set is used to compare the performance of PRIME [22] and PRIME 2.0 in predicting protein-RNA complex structures. This set includes 49 protein-RNA structures from the protein-RNA docking benchmark [23].

[26] on Rg of RNA shows that this scaling law is $R_g \propto N^{1/3}$. In their work, the data are fitted with a fixed exponent (1/3), although Flory theory [27] limits the exponent to between 1/3 and 3/5. In contrast to the result of Hyeon et al. [26], our study shows that the exponent is 0.39.
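An exponent of this kind can be estimated with a straight-line fit in log-log space. The sketch below is illustrative only: it assumes one representative atom per nucleotide with equal weights, and the helper names `radius_of_gyration` and `fit_power_law` are hypothetical rather than taken from the authors' code.

```python
import math


def radius_of_gyration(coords):
    """Rg of a list of (x, y, z) points, all weighted equally."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum(
        (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2 for p in coords
    ) / n
    return math.sqrt(mean_sq)


def fit_power_law(lengths, rgs):
    """Least-squares fit of log(Rg) = log(a) + nu * log(N); returns (a, nu)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(r) for r in rgs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    nu = sxy / sxx               # slope of the log-log line = scaling exponent
    a = math.exp(my - nu * mx)   # prefactor recovered from the intercept
    return a, nu


# Synthetic data generated with exponent 0.39 (the prefactor 5.0 is made up).
lengths = [20, 50, 100, 200, 400]
rgs = [5.0 * n ** 0.39 for n in lengths]
prefactor, exponent = fit_power_law(lengths, rgs)
```

Leaving the exponent free, rather than fixing it at 1/3, is what allows a value such as 0.39 to emerge from the data.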

Relationship between the ACC and the RNA length
The aligned correlation coefficient (ACC) has a functional relationship with the Rg of a protein and the RMSD [17]. The Rg of the protein and the RMSD also depend on the number of residues and the aligned length. As for proteins, the relationship between the RMSD of two aligned RNA structures and their Rg can be written as (1), where $L_N$ is the length of the fragment in the fragment pairs.

RMscore
Inspired by TM-score, a scoring function for protein 3D structures, we introduce a size-independent scoring function, RMscore, to describe the similarity of two RNA structures. For an RNA alignment, RMscore is defined as

$$\mathrm{RMscore} = \mathrm{Max}\left[\frac{1}{L_N}\sum_{i=1}^{L_T}\frac{1}{1+\left(d_i/d_0\right)^2}\right] \qquad (3)$$

where $L_N$ is the average length of the target and query RNA, $L_T$ is the number of nucleic acids aligned to the target structure, $d_i$ is the distance between the $i$th pair of aligned nucleic acids and $d_0$ is a scale that normalizes the length effect. 'Max' denotes the maximum after optimal superposition. Different scoring strategies adopt different $d_0$; RMscore adopts a length-dependent $d_0$. The relation between $d_0$ and the length can be estimated from $R_g \propto N^{0.39}$ and eq. (2), yielding eq. (4). Here a constant compensation (set to 0.6) is introduced to smooth the curve when RMscore is optimized on random pairs-0.3M (Figure S1). Eq. (4) can be well approximated by a simple formula.
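For one fixed superposition, the score can be evaluated as in the sketch below. It assumes the TM-score-like form implied by the definitions of $L_N$, $L_T$, $d_i$ and $d_0$, and it takes `d0` as an input because the explicit length-dependent formula is not reproduced here. The function name and the coordinate format (one point per nucleotide) are hypothetical, and the maximization over superpositions is not shown.

```python
import math


def rmscore_for_superposition(target_xyz, query_xyz, aligned_pairs, d0):
    """Score one already-superimposed alignment.

    target_xyz, query_xyz : lists of (x, y, z), one point per nucleotide
    aligned_pairs         : list of (i, j) index pairs (target i, query j)
    d0                    : distance scale, supplied by the caller
    """
    # L_N: average length of target and query.
    l_n = (len(target_xyz) + len(query_xyz)) / 2.0
    total = 0.0
    for i, j in aligned_pairs:
        d_i = math.dist(target_xyz[i], query_xyz[j])
        total += 1.0 / (1.0 + (d_i / d0) ** 2)
    return total / l_n


# Toy usage: four points, then the same points shifted by 3 along y.
target = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
shifted = [(x, y + 3.0, z) for x, y, z in target]
pairs = [(k, k) for k in range(4)]
score_identical = rmscore_for_superposition(target, target, pairs, d0=3.0)
score_shifted = rmscore_for_superposition(target, shifted, pairs, d0=3.0)
```

Each aligned pair contributes between 0 and 1, and a pair separated by exactly `d0` contributes 0.5, so `d0` sets the distance at which a pair counts as half matched.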

Searching Engine of RMscore
To find the spatially optimal superposition of the query and target structures that maximizes RMscore according to eq. (3) and eq. (5), we use the iterative search algorithm of TM-score.

Benchmarking RMscore on tRNA pairs
All tRNAs are selected to benchmark RMscore because the RNA homology modelling method modeRNA was also benchmarked on a tRNA dataset [16]. To compare the ability of RMscore and SARAscore (normalized SARAscore) to select similar RNA structures, we determine the phase transition points of RMscore and SARAscore on random pairs-0.1M. All alignments of RNA pairs are generated by SARA. The SARAscore of a target is normalized by dividing it by the SARAscore obtained when the target is aligned to itself. After the phase transition point from dissimilar to similar pairs is determined, RMscore and SARAscore are tested on tRNA pairs for their ability to distinguish similar (RMSD <= 5 Å) from dissimilar (RMSD > 5 Å) pairs. The alignments of tRNA pairs are also generated by SARA. A possible application of RMscore is to measure the similarity between native RNA structures and RNA models [16].
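The test described above can be sketched as a thresholded classification. The exact definition of precision used in the benchmark is not given in this section, so the code assumes the standard one: among pairs whose score reaches the cut-off, the fraction with RMSD <= 5 Å. The function name and the toy numbers are hypothetical.

```python
def precision_at_cutoff(scores, rmsds, score_cutoff, rmsd_cutoff=5.0):
    """Fraction of pairs predicted similar that are truly similar.

    A pair is predicted similar when score >= score_cutoff and is
    truly similar when RMSD <= rmsd_cutoff (in angstroms).
    Returns None when no pair reaches the score cut-off.
    """
    true_pos = 0
    false_pos = 0
    for score, rmsd in zip(scores, rmsds):
        if score >= score_cutoff:
            if rmsd <= rmsd_cutoff:
                true_pos += 1
            else:
                false_pos += 1
    predicted = true_pos + false_pos
    if predicted == 0:
        return None
    return true_pos / predicted


# Toy usage with made-up scores and RMSDs; 0.5 is used as the cut-off.
toy_scores = [0.9, 0.6, 0.55, 0.3, 0.2]
toy_rmsds = [1.0, 4.0, 7.5, 9.0, 3.0]
toy_precision = precision_at_cutoff(toy_scores, toy_rmsds, score_cutoff=0.5)
```

The same function applies to normalized SARAscore by passing its own phase transition point as `score_cutoff`.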

RMalign
We developed RMalign, an RNA structural alignment tool based on RMscore. The strategy of RMalign is similar to that of TM-align (see Figure S4), with two modifications. First, for the initial alignment based on secondary structure, the RNA secondary structure is calculated by X3DNA [30]. Second, in the final scoring step, all aligned nucleic acids are scored instead of only those within a distance cut-off.

Benchmarking RMalign on balance-FSCOR and balance-x-FSCOR
RMalign and ESA-RNA are tested on balance-FSCOR for RNA function classification. Because the purpose of an RNA structure alignment approach is to detect structural similarity, they are also tested on balance-x-FSCOR. The AUC is used as the metric to measure performance.
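As a reminder of what the AUC measures here, the sketch below computes it directly as the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half. It is a plain pairwise implementation for illustration; the function name is hypothetical and the benchmark itself may have used a different tool.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Higher scores are assumed to mean greater similarity. For a
    distance-like measure, negate the values before calling.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# Toy usage with made-up similarity scores.
toy_auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3])
```

With at most 1000 positive and 1000 negative pairs per dataset, the quadratic loop stays at about one million comparisons, which is small enough not to need a sort-based variant.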

Predicting protein-RNA 3D structure
We previously developed PRIME [22], an approach for predicting protein-RNA 3D structures. PRIME was tested on an unbound protein-RNA docking benchmark, and the results showed that it performs better than 3dRPC [31]. Because RMalign performs better than SARA on balance-x-FSCOR, we update PRIME to v2.0. An approach similar to that of PRIME is adopted to build the protein-RNA complex structure model. The transformation matrices from TM-score and RMscore are applied to superimpose the target protein and RNA onto the templates. The quality of a model is measured by the ligand RMSD, calculated over the RNA C3' atoms between the model and the native structure. A prediction is defined as "acceptable" if the ligand RMSD is <= 10 Å [32].
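The evaluation criterion can be sketched as follows. The code assumes that the model and the native complex are already in a common frame and that the C3' atoms are listed in matching order; how the frames are brought together is not specified here. Both function names are hypothetical.

```python
import math


def ligand_rmsd(model_c3, native_c3):
    """RMSD over matched RNA C3' atoms given as lists of (x, y, z)."""
    if not model_c3 or len(model_c3) != len(native_c3):
        raise ValueError("need two non-empty coordinate lists of equal length")
    sq_sum = sum(math.dist(a, b) ** 2 for a, b in zip(model_c3, native_c3))
    return math.sqrt(sq_sum / len(model_c3))


def success_rate(ranked_rmsds_per_target, top_n=1, cutoff=10.0):
    """Fraction of targets with an acceptable model among the top_n.

    ranked_rmsds_per_target : one list per target, holding the ligand
    RMSDs of its models in ranked order (best-ranked first).
    """
    hits = sum(
        1
        for rmsds in ranked_rmsds_per_target
        if any(r <= cutoff for r in rmsds[:top_n])
    )
    return hits / len(ranked_rmsds_per_target)


# Toy usage with made-up ligand RMSDs for three targets.
toy_ranked = [[12.0, 8.0], [4.0, 20.0], [15.0, 30.0]]
top1 = success_rate(toy_ranked, top_n=1)
top2 = success_rate(toy_ranked, top_n=2)
```

Raising `top_n` can only keep or increase the success rate, which is why results are typically reported for several values such as top 1 and top 300.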

Principle and benchmarking of RMscore
Figure 3A shows that the raw RMscore changes with the length. To overcome this shortcoming of a length-dependent scoring function (Fig. 3A) when aligning RNA structures, we propose a size-independent scoring function, RMscore, to measure RNA structural similarity (all RMscore values discussed in this manuscript are normalized by the average length). To obtain a formula for RMscore analogous to TM-score, we first show that the Rg of RNA has a linear relation ($R^2$ = 0.91) with the logarithm of the RNA length (Fig. 1). Second, we find that the aligned correlation coefficient has a complex functional relation ($R^2$ = 0.95) with the number of aligned nucleic acids (Fig. 2). A compensation value of 0.6 is then introduced to flatten the average RMscore on random pairs-0.3M (Fig. S1). The final average RMscore shows a slight dependence on RNA length, with a maximum standard error of 0.2 (Fig. 3B). To compare RMscore with normalized SARAscore, we investigate the relationship between RMSD and RMscore or normalized SARAscore in 0.1 million RNA pairs randomly selected from all pairs. Fig. S2 shows that the phase transitions (from noise to similar RNA pairs) occur at 0.5 and 0.78 (at cumulative fraction = 0.5) for RMscore and normalized SARAscore, respectively. With the phase transition as the cut-off for selecting similar RNA pairs, RMscore and SARAscore achieve precisions of 0.8771 (Fig. 4A) and 0.7766 (Fig. 4B), respectively, in the all-to-all pairwise structure comparison of 172 tRNA structures. The result shows that RMscore can distinguish similar (RMSD <= 5 Å) from dissimilar (RMSD > 5 Å) RNA pairs (Fig. S3). These results indicate that RMscore is a better metric of RNA structural similarity than SARAscore.

Benchmark of RMalign and comparison with other state-of-the-art approaches
To benchmark RMalign in RNA function classification, FSCOR is downloaded and a new dataset, balance-FSCOR, is constructed. RMalign obtains an AUC of 0.95 on balance-FSCOR, which is as good as ESA-RNA (Fig. 5). However, RMalign has two advantages over ESA-RNA. First, RMalign is written in C++, whereas ESA-RNA is written in the commercial software Matlab. Second, the geodesic distance that describes RNA structural similarity in ESA-RNA is not normalized, whereas RMscore is a size-independent score. On balance-x-FSCOR, the AUC of RMalign is higher than that of SARA. The performance on balance-FSCOR and balance-x-FSCOR shows that RMalign can be used to predict RNA functions based on RNA structural similarity.

Predicting protein-RNA 3D structure with PRIME 2.0 and comparison with PRIME
RMalign can find more remote RNA structural homologs than SARA, which could enhance the search for RNA templates in the template-based RNA-protein structure modeling approach PRIME. To test its ability to detect more potential RNA-protein complex structure templates, we update PRIME to PRIME 2.0 by replacing SARA with RMalign.
PRIME 2.0 was tested on the unbound RNA-protein docking benchmark containing 49 complexes. Fig. 6 shows the RNA-protein docking results. For the top 1 prediction, the success rate of PRIME 2.0 is about 10% higher than that of PRIME, which indicates that RMscore can select more potential templates than SARAscore in protein-RNA 3D complex structure prediction. For the top 300 predictions, the success rate of PRIME 2.0 is also higher than that of PRIME. Fig. 7 shows an example that succeeds with PRIME 2.0 but fails with PRIME. These results indicate that RMalign can detect more templates for protein-RNA complex structure modeling.

DISCUSSION
In conclusion, we introduce an RNA structure alignment approach, RMalign, which uses RMscore as the similarity score. The definition of RMscore is derived from TM-score, which has been applied successfully to protein structural alignment. However, RMscore shows a slight dependence on RNA length, which may be caused by the flexible structure of RNA. It is hard to benchmark RMscore in the way TM-score was benchmarked, because research on RNA structure lags behind that on proteins. For example, the best way to benchmark RMscore would be to compare the similarity between RNA models and native structures in RNA structure modelling. However, no such studies have been reported. Moreover, the RNA homology modelling method modeRNA employs RMSD or LG-score, which is introduced as an auxiliary metric, to measure RNA structural similarity without any modification [16]. Given the current situation, we study the relationship between RMscore and RMSD in RNA. The result shows that RMscore = 0.5 can discriminate similar from dissimilar structures.

CONFLICT OF INTEREST
None declared.

3D structure. Nucleic acids research. 2011;39(10):4007-22. doi: 10.1093/nar/gkq1320. PubMed PMID: 21300639; PubMed Central PMCID: PMC3105415.
17. Kolinski A, Skolnick J. Assembly of protein structure from sparse experimental data: an efficient Monte Carlo model. Proteins. 1998;32(4):475-94. PubMed PMID: 9726417.
18. Betancourt MR, Skolnick J. Universal similarity measure for comparing protein structure