MONSTER v1.1: a tool to extract and search for RNA non-branching structures

Background: Detection of RNA structure similarities is still one of the major computational problems in the discovery of RNA functions. A case in point is the study of the newly appreciated long non-coding RNAs (lncRNAs), which are emerging as new players involved in many cellular processes and molecular interactions. Among several mechanisms of action, some lncRNAs show specific substructures that are likely to be instrumental for their functioning. For instance, it has been reported in the literature that some lncRNAs have a guiding or scaffolding role by binding chromatin-modifying protein complexes. Thus, a functionally characterized lncRNA (reference) can be used to infer the function of others that are functionally unknown (target), based on shared structural motifs.

Methods: In our previous work we presented a tool, MONSTER v1.0, able to identify structural motifs shared between two full-length RNAs. Our procedure is mainly composed of two ad-hoc developed algorithms: nbRSSP_extractor, which characterizes the folding of an RNA sequence by means of a sequence-structure descriptor (i.e., an array of non-overlapping substructures located on the RNA sequence and coded in dot-bracket notation); and SSD_finder, an effective search engine for groups of matches (i.e., chains) common to the reference and target RNA, based on a dynamic programming approach with a new score function. Here, we present an updated version (MONSTER v1.1) that accounts for a peculiar feature of lncRNAs: they are not expected to have a unique fold, but appear to fluctuate among a large number of equally stable folds. In particular, we improved our SSD_finder algorithm to take into account all the alternative equally stable structures.

Results: We present an application of MONSTER v1.1 to lincRNAs, a specific class of lncRNAs located in genomic regions that do not overlap protein-coding genes. In particular, we provide reliable predictions of the shared chains between HOTAIR, ANRIL and COLDAIR. These lincRNAs interact with the same protein complexes of the Polycomb group and hence are expected to share structural motifs.

Software availability: the software package is provided as additional file 1 ("archive_updated.zip").

1 Getting Started

MONSTER_v1.0
MONSTER is a procedure to extract and search for RNA non-branching structures in order to identify common structural motifs.

Pipeline overview
The pipeline is composed of three parts (Figure UG1):
1. Structure prediction and encoding of the reference into a secondary structural descriptor (SSD).
   a) prediction of the reference structure through RNALfold;
   b) extraction of the non-overlapping non-branching structures (NBSs) through the nbRSSP_extractor module;
   c) encoding of the SSD through the nbRSSP_extractor module.
2. Match searching and filtering.
   a) searching for matches between the reference SSD and the target sequence through Structator;
   b) filtering out of matches through the match_filter module.
3. Chain building.
   a) building of chains of matches through the SSD_finder module.
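To make the notion of a non-branching structure in part 1 concrete, the following minimal Python sketch tests a dot-bracket string for branching. It assumes that "non-branching" means a single stem-loop (possibly with bulges and internal loops), so that no opening bracket appears after the first closing bracket. The function name is hypothetical, and this is not the code of nbRSSP_extractor.

```python
def is_non_branching(structure: str) -> bool:
    """Return True if a dot-bracket string describes a single stem-loop.

    Assumption: a structure is non-branching when no '(' occurs after the
    first ')', i.e. it contains exactly one hairpin, possibly interrupted
    by bulges or internal loops.
    """
    depth = 0
    seen_close = False
    has_pair = False
    for ch in structure:
        if ch == "(":
            if seen_close:
                # A new stem opens after one started closing: branching.
                return False
            depth += 1
            has_pair = True
        elif ch == ")":
            seen_close = True
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced dot-bracket string")
        elif ch != ".":
            raise ValueError(f"unexpected character: {ch!r}")
    if depth != 0:
        raise ValueError("unbalanced dot-bracket string")
    return has_pair


if __name__ == "__main__":
    print(is_non_branching("((..((...))..))"))  # hairpin with an internal loop
    print(is_non_branching("((...))((...))"))   # two adjacent hairpins
```

A fully unpaired string is reported as not non-branching here, since it contains no stem at all; that choice is also an assumption of the sketch.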

Figure UG1 Pipeline overview
The pipeline is composed of three parts: (1) structure prediction and SSD encoding of the reference (steps 1-4 in the manuscript); (2) match searching and filtering (steps 5-6 in the manuscript); (3) chain building (step 7 in the manuscript). More details are given in the paper. This flowchart is specific to the example of two RNA sequences (HOTAIR and ANRIL) explained in the tutorial of chapter 2. Legend: orange circles represent published, available tools; green circles represent software developed by us; rectangles represent software input and output (I/O), colored water blue for the reference and yellow for the target.

System requirements
MONSTER version 1.1 has been tested on the following operating systems:

When all packages are downloaded:

Install MONSTER_v1.0 4
- choose a directory and let "rootdir" be the path to this directory; unzip the archive.zip file in "rootdir", which will create a folder named "archive";
- type:
- go to the subfolder "bin" of the unzipped "Structator1.1-linux-gnu.amd64" file; copy the executable afsearch 5 that matches your operating system into "MONSTER v1.0/bin".

Finally, in the directory "archive/MONSTER/bin" you should find the following executable files:

To test whether the executables have been correctly built, run:

Additional files:
The "data" subfolder of "archive" contains the following additional files, needed to run MONSTER:
- dna_rna.comp 7;
- rna.alphab 8.

5 afSearch is a program for matching RNA sequence-structure patterns in a precomputed index or directly in a plain FASTA file.
6 RNALfold is a program for calculating locally stable secondary structures of RNAs.
7 File specifying the Watson-Crick and wobble complementarity rules.
8 File specifying an alphabet to which characters are mapped, so that the sequences are alphabetically transformed; it is needed to run the afsearch program (see the user's manual v1.01 of the Structator package for details).
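As an illustration of the complementarity rules that footnote 7 refers to, here is a minimal Python sketch of a pairing test covering the Watson-Crick pairs (A-U, G-C) and the G-U wobble pair. It does not reproduce the format or content of the dna_rna.comp file; the function name, the `allow_wobble` parameter and the mapping of T to U are assumptions made for the example.

```python
# Canonical Watson-Crick pairs and the G-U wobble pair for RNA.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def can_pair(a: str, b: str, allow_wobble: bool = True) -> bool:
    """Return True if nucleotides a and b may form a base pair.

    Assumption: DNA-style input is accepted by treating T as U.
    """
    a = a.upper().replace("T", "U")
    b = b.upper().replace("T", "U")
    if (a, b) in WATSON_CRICK:
        return True
    return allow_wobble and (a, b) in WOBBLE


if __name__ == "__main__":
    for left, right in [("A", "U"), ("G", "U"), ("A", "C")]:
        print(left, right, can_pair(left, right))
```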

2 Basic Usage

After installing the MONSTER software, you can follow a sample run by executing the following tutorial 9.

Aim of the tutorial:
The user wishes to search for chains (groups of matches) of a reference lncRNA in a target lncRNA. In this example, HOTAIR is the reference and ANRIL is the target.

Preliminary step:
Go to the "archive/example_data" subfolder, which stores all the files needed to execute the tutorial. You can run the tutorial step by step by following the "tutorial step" section. Otherwise, you can run the script "MONSTER.sh" on unix platforms to execute the whole tutorial procedure.

Preliminary files:
- HOTAIR_human.fasta: a FASTA file with the RNA sequence of HOTAIR
- ANRIL_human.fasta: a FASTA file with the RNA sequence of ANRIL

9 We provide the command lines to run the tutorial on unix platforms (i.e., GNU/Linux, Mac OS).

Steps:
Details of each step are explained as follows:

1. Run RNALfold of the Vienna Package to obtain the secondary structure predictions for the HOTAIR sequence in dot-bracket notation 10.

Synopsis:
RNALfold.exe [-L span]

Description:
RNALfold reads RNA sequences from stdin and prints local structure predictions to stdout.

Options:
-L span
    Set the maximum allowed separation of a base pair to span, i.e. no pairs (i,j) with j-i>L will be allowed. In the present example, we used L = 150.

10 Dots represent unpaired nucleotides; matched brackets (opened/closed) represent paired nucleotides.

The format of the output file is as follows:

- Sequence header
- Local predicted structure (dot-bracket notation)
- Nucleotide sequence
- Minimum free energy

2. Run nbRSSP_extractor to extract the NBSs using the "HOTAIR_human_RNALfold_150_pred.txt" file as input. The software returns the HOTAIR SSD, comprising 67 RSSPs.
Command line example:
../MONSTER_v1.0/bin/nbRSSP_extractor -i HOTAIR_human_RNALfold_150_pred.txt \
    -o HOTAIR_human_RNALfold_150_pred_ssd.pat

The format of the output file is as follows:

3. Run the afSearch program of the Structator package to search for the reference SSD of HOTAIR in the target ANRIL sequence, using the global mode. The software returns the matches found.
The Structator output is in the following format:

4. Run RNALfold (with a span L equal to 150) to obtain the secondary structure predictions of the ANRIL sequence.
Command line example:

The output format is the same as in step 1.
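The dot-bracket strings produced in steps 1 and 4 can be decoded into base pairs with a simple stack, which also makes it easy to check the span constraint j-i <= L set by the -L option. The Python sketch below is only an illustration; the function names are hypothetical and positions are 0-based within the local structure.

```python
def base_pairs(structure: str) -> list[tuple[int, int]]:
    """Return the sorted (i, j) base pairs encoded in a dot-bracket string."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)           # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket string")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character: {ch!r}")
    if stack:
        raise ValueError("unbalanced dot-bracket string")
    return sorted(pairs)


def max_span(structure: str) -> int:
    """Return the largest separation j - i over all base pairs (0 if none)."""
    return max((j - i for i, j in base_pairs(structure)), default=0)


if __name__ == "__main__":
    local_structure = "((..((...))..))"
    print(base_pairs(local_structure))
    # With the tutorial setting L = 150, every local structure should satisfy:
    print(max_span(local_structure) <= 150)
```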
5. Run nbRSSP_extractor to extract the non-branching structures (NBSs) from the "ANRIL_human_RNALfold_150_pred.txt" file, this time also retaining overlapping RSSPs. This yields a wide array of possible structure predictions for the target.
The output format is the same as in step 2; the only difference is the higher number of extracted RSSPs, because the option --RNAlfold_out makes it possible to keep overlapping predictions as well.
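The difference between a non-overlapping set of RSSPs (as used for the reference SSD) and the full set with overlaps (as kept for the target) can be illustrated on plain intervals. The Python sketch below uses a generic greedy interval selection, which is a stand-in chosen for the example and not necessarily the rule applied by nbRSSP_extractor; the (start, end) representation with inclusive ends and the function name are likewise assumptions.

```python
Interval = tuple[int, int]  # (start, end), both inclusive; hypothetical representation


def select_non_overlapping(intervals: list[Interval]) -> list[Interval]:
    """Greedily keep intervals that do not overlap any interval already kept.

    Intervals are visited by increasing end position (classic interval
    scheduling); this is an illustrative rule, not the tool's own.
    """
    chosen: list[Interval] = []
    for start, end in sorted(intervals, key=lambda iv: (iv[1], iv[0])):
        if not chosen or start > chosen[-1][1]:
            chosen.append((start, end))
    return chosen


if __name__ == "__main__":
    predicted = [(0, 10), (5, 15), (12, 20), (21, 30)]
    print(len(predicted), "predictions when overlaps are retained")
    print(select_non_overlapping(predicted))
```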
6. Run match_filter to discard the unlikely matches obtained in step 3, based on the predicted RSSPs of step 5. The software returns the filtered matches between HOTAIR and ANRIL.
The output format is the same as in step 3, but with a lower number of matches because of the filtering.
7. Run SSD_finder to perform the chaining. It returns the chains of matches that represent the structural motifs shared between ANRIL and HOTAIR.
Output files: "HOTAIR_chains.txt". The format of the output file is as follows:

The first line (starting with "#") contains the number of the target sequence (in this case <seqID = 0>, because ANRIL is the only target sequence analyzed); then there is one line for each chain of matches found. Each line starts with the computed score of the chain, followed by (i) the pattern ID (pID) of the reference RSSPs found in the target sequence; (ii) the positions (pos) at which the RSSPs have been found in the target; (iii) the weight (w) of each RSSP; and (iv) the pair-wise relative distances (dist). The dist parameter consists of two comma-separated numbers enclosed in brackets: the first gives the distance between the found RSSPs in the reference, and the second gives the corresponding distance in the target. The highest scores indicate the most likely structural motifs shared between the reference and the target.

--, --ignore_rest
    Ignores the rest of the labeled arguments following this flag
--version
    Displays version information and exits
-h, --help

Matches Filtering (match_filter)
Match_filter filters out matches that cannot actually fold. It writes a file of matches (following the output format of Structator) containing, for each sequence, the matches that are supported by a structure prediction. The current implementation considers a match predicted if it is a substructure of some predicted RSSP. In particular, the external loop of the match must coincide with that of the predicted RSSP.
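The substructure criterion can be sketched as a subset test on base pairs. The Python example below assumes that a match and a predicted RSSP are each given as a (start, dot-bracket) pair in target coordinates, and that "substructure" means every base pair of the match is also a base pair of the RSSP. These representations and names are hypothetical; the additional condition on the external loop is not modeled, since its exact definition is not given here, and the code is not the match_filter implementation.

```python
Placed = tuple[int, str]  # (start position on the target, dot-bracket string); hypothetical


def absolute_pairs(start: int, structure: str) -> set[tuple[int, int]]:
    """Return base pairs in target coordinates; assumes a balanced structure."""
    stack: list[int] = []
    pairs: set[tuple[int, int]] = set()
    for offset, ch in enumerate(structure):
        if ch == "(":
            stack.append(start + offset)
        elif ch == ")":
            pairs.add((stack.pop(), start + offset))
    return pairs


def is_substructure(match: Placed, predicted: Placed) -> bool:
    """True if the match has base pairs and all of them occur in the prediction."""
    match_pairs = absolute_pairs(*match)
    return bool(match_pairs) and match_pairs <= absolute_pairs(*predicted)


def filter_matches(matches: list[Placed], predictions: list[Placed]) -> list[Placed]:
    """Keep the matches supported by at least one predicted RSSP."""
    return [m for m in matches if any(is_substructure(m, p) for p in predictions)]


if __name__ == "__main__":
    predictions = [(100, "(((...)))")]
    matches = [(101, "((...))"), (101, "((....))")]
    print(filter_matches(matches, predictions))
```

Working on sets of absolute base pairs keeps the test independent of how much unpaired flanking sequence a match or a prediction carries.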