Systematic identification of stem-loop containing sequence families in bacterial genomes

Background Analysis of non-coding sequences in several bacterial genomes brought to the identification of families of repeated sequences, able to fold as secondary structures. These sequences have often been claimed to be transcribed and fulfill a functional role. A previous systematic analysis of a representative set of 40 bacterial genomes produced a large collection of sequences, potentially able to fold as stem-loop structures (SLS). Computational analysis of these sequences was carried out by searching for families of repetitive nucleic acid elements sharing a common secondary structure. Results The initial clustering procedure identified clusters of similar sequences in 29 genomes, corresponding to about 1% of the whole population. Sequences selected in this way have a substantially higher aptitude to fold into a stable secondary structure than the initial set. Removal of redundancies and regrouping of the selected sequences resulted in a final set of 92 families, defined by HMM analysis. 25 of them include all well-known SLS containing repeats and others reported in literature, but not analyzed in detail. The remaining 67 families have not been previously described. Two thirds of the families share a common predicted secondary structure and are located within intergenic regions. Conclusion Systematic analysis of 40 bacterial genomes revealed a large number of repeated sequence families, including known and novel ones. Their predicted structure and genomic location suggest that, even in compact bacterial genomes, a relatively large fraction of the genome consists of non-protein-coding sequences, possibly functioning at the RNA level.


Background
The availability of a massive amount of sequence data stimulated in-depth analyses on the organization of bacterial genomes [1][2][3][4][5][6]. Although less prominent than in eukaryotic genomes, sequence repeats are found in most bacterial species. According to their sizes, sequence repeats may be roughly classified into two main classes. Large repeats (0.8-2 kb) are mostly insertion sequences (IS), and encode proteins mediating their genomic mobility. The terminal inverted repeats (TIRs) and the nature of their gene products allow sorting ISs into specific classes [7,8]. Smaller repeats (50-300 bp) make up a much less defined and more variegate set of genomic sequences. Some of them contain palindromic sequences, demonstrated or proposed to be structured as stem-loops able to function as regulatory elements at DNA or RNA level. For example, E. coli PU-BIME elements have been shown to interact with the DNA gyrase [9] and the integration host factor protein [10], but also to function as mRNA stabilizers [11] and transcriptional attenuators [12]. Similarly, palindromic sequence repeats have been shown to function as mRNA stability determinants in Neisseriae [13][14][15] and Yersiniae [16,17].
Following these observations, and given the current availability of a large number of sequenced bacterial genomes, a systematic analysis of stem-loop containing repeated sequences appeared of interest. In a previous article [18], high stability stem-loop structures (SLS) were studied within a representative set of bacterial genomes and some of them were shown to have strong similarity with each other. Here we extend this study to detect all families of SLS containing sequences in the same bacterial set. To this aim, a pipeline, combining sequence clustering and Hidden Markov Model (HMM) based searches, was developed. This strategy led to the definition of a large number of sequence families, sharing sequence similarity and, in most cases, a common predicted secondary structure.

Identification of initial SLS clusters
In a previous work a large number of SLS containing sequences were identified within 40 bacterial genomes [18]. For each bacterial species, sequences obtained from this study and predicted to fold with a free energy lower than -5 Kcal/mol were selected. In order to avoid obvious structured repeated sequences, they were filtered to eliminate those falling within either mature RNA species (tRNAs, rRNAs) or known ISs. An all-against-all BLAST comparison was performed on the selected sequences for the creation of a distance matrix, where distance is reported as the E-value of the found matches. Since SLSs are strand-specific, BLAST was run without searching for the complementary strand. Links between overlapping sequences were cut, by eliminating the corresponding matches from the matrix. The resulting matrix was then fed to a Markov Clustering algorithm (MCL) based tool [19] to produce a set of clusters. This clustering step was performed by using stringent parameters (see Methods) in order to favour the selection of more homogeneous clusters.
To avoid repeated analyses on the same genomic sequence, overlapping clustered SLSs were subsequently joined into larger SLS containing regions (SCRs). Clusters composed of at least 7 SCRs were selected and are reported in Table 1. Of the 40 analyzed genomes, 29 contain at least one cluster. The procedure led to the identification of 523 clusters, which together contain 28,904 elements, corresponding to 12,254 non-overlapping SCRs. No clusters were identified for the remaining 11 genomes: L. innocua, L. monocytogenes, S. pyogenes, C. pneumoniae, C. trachomatis, U. urealyticum, R. prowazekii, T. pallidum, Buchnera, C. jejuni and H. pylori.
The clusters identified in each positive genome range between 1 and 75, for a total of 8 to over 4,000 clustered SCRs per genome. All together they correspond to about 1.3% of the originally selected population of over 2 million sequences. In order to evaluate the quality of the described clustering procedure, grouped SCRs were aligned by using the PCMA multiple alignment tool [20], and the resulting alignments were evaluated by ALISTAT [21]. Over 80% of the clusters showed an average identity better than 60%. The established consensus was larger than 90 bp for the about half of them, while the others produced consensus sequences between 27 and 90 bp (not shown).

Clustered SLS containing sequences show high ability to form a stable secondary structure
The ability to fold into a reliable secondary structure was analyzed by using RANDFOLD [22], which compares the predicted minimum folding energy (MFE) of a sequence with those of a large number of random shuffles of the same sequence. Results are expressed as a p-value, representing the probability of the predicted MFE being truly different from the others. In this test, sequences were shuffled by preserving dinucleotide frequencies, as proposed by Workman and Krogh [23].
For each genome, the test was performed on clustered SLSs, as well as on SLSs randomly picked from the initial population and random sequences of equal size extracted from the same genome. The results are reported in Figure  1, where sequences are assigned to specific "folding aptitude" classes, according to the p-value calculated by RANDFOLD. Most clustered sequences (panel A) show a non-random probability of folding below 0.01 (dark grey bars), and, very often, also below 0.001 (black bars), whereas only about 20% of the original SLS population reach these p-values ( Figure 1, panel B). Only for M. leprae, L. johnsonii, M. genitalium and M. pneumoniae, the two SLS populations do not show statistically different folding aptitudes. A very small fraction (less than 5%) of control sequences showed a non-random folding probability higher than 0.1% (light grey bars in Figure 1, panel C).

Evaluation and refinement of the initial clustering
Various grouping procedures were used to combine the initial 523 clusters, according to sequence similarity, strand reciprocity and position on the genome. The results are reported in Table 2.
In order to group clusters sharing sequence similarity, the clustered SCRs were re-clustered by using the above described BLAST-MCL based procedure, under less stringent conditions. The initial 523 clusters could be associated into 301 new clusters, most of them characterized by a larger number of elements (see column 'sequence' in Table 2). Within each new cluster, overlapping SCRs were fused as described above.
The ability to form SLS is generally shared by the two complementary strands of a given DNA sequence, the only exception being sequences where GU pairing is essential to form a stem-loop satisfying the minimum requirements. A number of clusters should therefore be composed of elements from the opposite strands of the same genomic region. Such clusters were identified, again by using the BLAST-MCL procedure, this time allowing BLAST searches also on the complementary strand. About two thirds of the clusters could be paired in this way, thus reducing the total number to 205 'unrelated' clusters (see column 'strand' in Table 2). BLAST-MCL clustering of SLSs identified, from a representative set of bacterial genomes, as described in Petrillo et al. [18]: only species with at least one cluster of a minimum of 7 elements are listed. For each species, the number of elements within the starting population, the number of clusters and the number of clustered SLSs are reported. The number of SLS containing regions (SCRs), obtained by fusing overlapping clustered SLSs, is also reported.
A third refinement was directed to connect clusters, which might represent different parts of a larger DNA repeat. For this reason, paired clusters, whose elements resulted to be overlapping or located at short distance (< 150 bp), were identified and joined within one group. This led to a further reduction to 137 cluster groups (see column 'location' in Table 2).
The resulting set was pruned by comparing SCRs from each cluster against the IS sequences collected in the ISfinder database [8] by using BLAST, in order to remove insertion sequences, possibly missed in the initial filtering. Similarly rRNA-and tRNA-related clusters were removed by evaluating the genomic localization of their elements, relative to genes encoding stable RNAs. These tests revealed that 28 cluster groups were related to insertion sequences, mainly not known at the time of the initial filtering, and 11 cluster groups were composed of sequence elements contained within rRNA precursors. These 39 cluster groups, reported in the columns 'IS' and 'rRNA' of Table 2, were flagged and not used for further analysis.
The whole procedure above described led to the identification of 98 candidate SLS-containing repeated DNA families.

Characterization of families expanded by Hidden Markov Model searches
The candidate families were identified starting from small SLS containing sequences, which may be contained within regions of sequence similarity larger than the originally detected ones. In addition, genomic sequences may exist which, although similar, do not contain a SLS able to match the threshold used in the original search. For these reasons, a combined iterative procedure, based on HMM genome searches, was developed and applied to each family. In the procedure, a HMM built on the alignment of all family members is used to scan the parental genome and the detected sequences are aligned to the model. Alignments are extended by attaching neighboring sequences in order to define larger models, when possible. Multiple cycles of alignment, elongation, model building and genome search were performed until the borders of the repeated sequence were reached (see Methods).
A final, manual refinement was performed to combine essentially identical models. At the end of this procedure 92 models were obtained, which define the families reported in Table 3, where the length of the model and the number of detected sequences, both covering the entire model or part of it, are indicated. 67 models range in size between 31 and 200 bp, while the rest are larger, but only two extend over 1 Kb.
Fraction of sequence elements positive to RANDFOLD test Figure 1 Fraction of sequence elements positive to RANDFOLD test. RANDFOLD test was run onto groups of clustered SLSs (panel A), total SLSs (panel B) and random sequences (panel C) from the 29 genomes listed in Table 1. The fraction of elements scoring positive with the indicated probability is diagrammed. Standard deviation bars are shown in panels B and C.
The remaining 67 families are not described as such in literature. Their sizes range from 31 to over 2,000 bp for a number of elements varying between 9 and 164. Nine of these families (Bhal-2, Clot-2, Clot-3, Myt-5 Sal-2, Myt-11, Nem-4, Pam-1, Hin-1) contain known DNA sequence motifs, such as CRISPR [34], MIRU [35] and DUS [36]: the combination of two or more specific elements, matching these motifs, generates larger, SLS containing repeated sequences, not previously described. Sixteen families are made up of sequences contained within larger sequence blocks, either coding for abundant protein motifs or located within larger, ill-defined redundant intergenic sequences. Forty-two families appear to be unrelated to previously described sequence elements.

Secondary structure analyses
Three different approaches were used to evaluate the aptitude of sequences from the detected families to fold into Clusters reported in Table 1 were regrouped, according to sequence similarity, strand reciprocity and relative genomic position of their elements, as described in Methods. The number of groups, obtained by each criterion, is reported in the three columns labelled "Grouped by". Several groups are composed of sequences, contained within ISs or rRNA genes; their number is shown in the last two columns, for each genome.  Table 4): 1) ability to form conserved secondary structures, evaluated, for each family, by RNAz [37] analysis of the alignment of six representative sequences to the family HMM (column "conserved structure"); 2) presence of aligned SLSs and agreement with the structure predicted by RNAz (column "conserved SLS position"); 3) probability of non-random folding for SLSs contained within each family, calculated by using RANDFOLD [22] (column "SLS folding aptitude").
Only families with either a predicted conserved secondary structure or aligned SLSs are reported. Of the 92 described families, 57 generate a common secondary structure, when analyzed by RNAz. For most (47) of them, marked as "s", the predicted structure contains a stem-loop compatible with the original search. In all but Cod-2, the position of the originally found SLSs is in agreement with the structure predicted by RNAz. These SLSs tend to be positive also to the RANDFOLD test: in 36 of the 47 families, most members contain SLSs, able to fold into a non-random secondary structure (P <= 0.005). For ten of the 57 families, indicated by "c", a more complex common structure is predicted by RNAz, not including a stem- The final set of 92 families of repeated sequences is reported, grouped by species. For each family, the length of the model and the number of sequences fitting the model are given. The number of complete sequences, i.e. covering the model from end to end, is reported in parenthesis. Previously described sequence families have been named in column "Family", according to the current literature; for each of them, the number and typical size of its members are also provided, together with references. For novel families, a systematic name was built by fusing a shortened species name to a progressive number. In the column "type", I, G and S indicate the prevalent genomic location of the members of each families within intergenic, genic or border-spanning sequences. For some families, small previously described sequence motifs contribute to the formation of a substantially larger model; for others, their members are frequently located within larger previously described sequences. In both cases, a note is reported in the rightmost column.   Table 4). Table  3 where, in column "type", families are classified, according to the position of the vast majority of their members, relative to annotated coding sequences. 41 families are intergenic (I), 30 genic (G) and 7 tend to span the borders between coding and non coding sequences, and are therefore indicated as border spanning (S).

Genomic localization Genomic localization of the families is reported in
For 14 families no clear predominance of genic or intergenic sequences was observed, and therefore the family was not assigned to a class.
Genomic localization of the families predicted to fold in a secondary structure is reported in Table 4; for all families, genomic localization, correlated with the predicted ability of the family members to fold into a common, stable secondary structure, is summarized in Table 5. For most intergenic families a secondary structure is predicted (31 out of 41). Genic families, in contrast, are predominantly not structured: only about one third (9 out of 30) have a structure predicted by RNAz and only for 5 of them aligned SLSs support its existence. Border spanning and unclassified sequence families feature a predicted secondary structure with frequencies similar to intergenic ones.

Characterization of specific families
The described procedure led to the identification of a large number of families of repeated bacterial sequences, some already known, other previously undescribed. For many of them, a number of tests showed the potential folding of the majority of their members into a shared secondary structure. Four such families are reported in Figure 2, where the predicted secondary structure is shown along with the aligned, originally found, SLSs. One of them, the ERIC family from E. coli, had previously been described, while the other three are new. ERIC, as anticipated from literature reports [31,32], is predicted to fold into a single, long stem-loop structure. Sta-1 folds into a simple, shorter SLS. Pae-1 and Efa-1 families feature more complex structures, composed of a pair of adjacent SLSs. The structures predicted for these four families may be predicted on both strands, with complementary sequences generally, but not necessarily, folding into corresponding stems. For Pae-1, the prediction of different structures on the two strands indicates the likely presence of multiple foldings of comparable stability, which, on each strand, are alternatively selected as the best one, because of minor base pair differences.
For some of the identified families, secondary structure predictions, although supported by high RNAz scores, are not consistent with the originally found SLSs. Generally this stems from the prediction, by RNAz, of structures not including SLSs fitting with the original SLS definition. PU-BIME and dRS3, shown in Figure 3, are examples of such families: in PU-BIME the stem includes a five base internal loop, while in dRS3 the 8 bp stem is too short. Both cases Columns under "Sec. Struct. +/-" report the number of families, characterized by the presence or absence of a conserved secondary structure predicted by RNAz; the labels "SLS +/-" indicate the presence or absence of aligned SLSs; "Total" means the sum of rows or columns. The ability to form a consensus secondary structure was evaluated by RNAz: the prediction scores are reported in column "P" for each family. The type of predicted structure is indicated in column "conserved structure", where "s" indicates a stem-loop based structure, while "c" indicates a more complex structure, where a stem-loop compatible with the original search is not present. For each family, the aligned localization of the original SLSs is indicated by '+' in column "conserved SLS position"; when SLS alignment is not in agreement with the RNAz prediction, a '°' is added to the '+' symbol. The column marked "SLS folding aptitude" reports the behavior of family elements in the RANDFOLD test: the number of '+' symbols describes the percent of positive elements ('+++' if 90% or above; '++' if 70-90%; '+' if 50-70%; '-' if less than 50%). The localization of family members, as already described in Table 3, is also reported in the last column. Alignment of ERIC, Pae-1, Sta-1 and Efa-1 family members Finally, for about one third of the 92 identified families, it is unlikely that a RNA secondary structure may play a relevant role, as shown by the absence of either a common predicted structure or alignment of originally found SLSs.
Alignment of PU-BIME, dRS3 and Myt-10 family members Figure 3 Alignment of PU-BIME, dRS3 and Myt-10 family members. Panels A and B legends are as in Figure 2.
An example of such families is Myt-10, reported in Figure  3.

Discussion
In a previous study, a systematic analysis of putative SLSs found in bacterial genomes showed that they tend to be more abundant and stable than those randomly formed in shuffled sequences of comparable size and base composition [18]. This observation led to the hypothesis that, along with SLSs stochastically formed because of sequence composition, a sizeable quota is possibly the result of selective pressure, due to the need to preserve a biological function. SLS-containing secondary structures are known to play a relevant role in several aspects of gene expression and its regulation. Structured RNAs are a functional component of enzymes like RNAse P [38], or contribute to the formation of regulatory cis-acting regions such as riboswitches [39], thermosensors [40], transcriptional attenuators and terminators [41,42]. Palindromic RNA sequence repeats may also influence mRNA stability [11].
In this work, we describe a systematic procedure, schematically depicted in Figure 4, to identify and classify families of repeated sequences, characterized by a shared secondary structure, in the genomes of a representative set of bacteria, most of which of medical interest. To this aim, SLS containing sequences were first clustered by sequence similarity and subsequently evaluated for their potential to form secondary structures. In most analyzed genomes, a fraction of SLSs could be grouped into clusters, containing at least 7 non-overlapping elements. No clusters were found in 11 of the 40 analyzed genomes.
Clustering by sequence similarity resulted in selection of 523 clusters corresponding to just above 28,000 SLSs, about 1% of the whole SLS population: this figure may vary quite a lot in specific species, being sensibly larger, up to 6%, in N. meningitidis, and substantially lower in B. subtilis and P. multocida, where less than 0.1% of the SLSs fall within clusters. Clustering ended up by selecting a subset of SLSs different from the original population and characterized by a much higher probability of non-random folding (see Figure 1), indicating that selection based on sequence similarity was very effective in enriching for structured regions.
Various refinement steps produced the final set of 137 clusters, reported in Table 2. Although mature rRNA and tRNA genes were initially masked within the searched genomic sequences, some clusters were identified, which correspond to unmasked parts of ribosomal RNA precursor genes (Table 2). Similarly, some clusters correspond to SLSs contained within ISs, which escaped the initial filtering for various reasons. Removal of these two subsets and other redundancies reduced the number of identified families to 92.
Notwithstanding the starting population of SLS containing sequences, within these families regions sharing primary structure similarity, but not a common SLS, might, in principle, still be found, and 35 families with no recognizable shared secondary structure, were indeed identified. Most of these sequences are, not surprisingly, found within coding regions, where the formation of secondary structures is expected to be limited by the translation machinery. However, some of these families coincide with intergenic sequence repeats, such as the S. pneumoniae BOX and P. putida REP sequences unable to form structures compatible with the originally searched ones.

Families sharing common secondary structures
Most identified families, 57 out of 92, are predicted by RNAz to share a common secondary structure. This group includes well-known intergenic families, such as the E. coli PU-BIME and ERIC repeats, and their homologues in other species, as well as a number of less known families, most of which described in isolated reports, but not characterized in detail (see Table 3). Practically all intergenic repeats, previously shown or predicted to fold into a RNA secondary structure, have been found. The only exceptions are the S. pneumoniae RUP and the R. conorii RPE-6 repeats, which, although identified by the pipeline, do not fall into this group, because RNAz could not predict a shared secondary structure better than the defined threshold.
For known families, the sequence boundaries, as predicted by the pipeline, are essentially coincident with those previously reported in literature. Specific discrepancies were found only in two families. In the N. meningitidis NEMIS elements, the present search identified the central 46 bp core, but failed to extend the similarity to either the partial 108 or the complete 158 bp repeats described by Mazzone et al. [13]. Similarly, for the S. pneumoniae RUP family, only 63 bases were detected out of the complete 108 bp elements [26].

Known and novel families
In well characterized genomes, such as those of enterobacteria, practically all known families have been detected, along with a few new ones. In E. coli, the known PU-BIME, ERIC and BoxC families were recognized and feature shared secondary structures, while the only new one identified, the Eco-1 family, is predicted as unable to fold. PU-BIME repeats were also detected in S. typhi as two related variants (a full size and a shorter one, only the former predicted to fold) and in S. typhimurium, along with two novel families, Sal-1 and Sal-2 (Table 3). For both of them Schematic representation of the overall procedure Figure 4 Schematic representation of the overall procedure.
RNAz could predict a shared secondary structure of the complex type.
As expected, ERIC sequences were detected not only in E. coli, but also in Y. pestis and V. cholerae [16,31]: Y. pestis repeats are predicted to fold with a structure closely similar to the E. coli elements. In contrast, ERIC sequences detected in V. cholerae are not predicted to fold, being 20 bp shorter than both E. coli and Y. pestis homologues, because of selective erosion of their TIRs. Yersiniae ERIC sequences have been shown to regulate the level of expression of neighboring genes by folding into RNA harpins [16]. V. cholerae ERIC, being unable to fold, may thus not function as RNA stability determinants.
Most potentially structured new families have been found in species less analyzed experimentally or whose genome was more recently sequenced, such as pseudomonaceae, bordetellae, mycobacteria.
For both novel and known families, the predicted common secondary structure is often a stem-loop (see Sta-1 and ERIC in Figure 2). In a fraction of cases, however, RNAz analysis proposes different structures. Some families feature a double hairpin (see EFA-1 and Pae-1 in Figure 2) and others feature a complex structure containing a SLS (not shown).

Genomic localization
Genomic localization highlights the preferential tendency of repeated sequences with a predicted common secondary structure to lie within intergenic regions; this is true for both known and novel ones. In contrast, families found within coding sequences (CDSs) of genomes are often not structured. This is in agreement with the results of RAND-FOLD analysis: most (19 out of 27) intergenic families with aligned SLSs (Table 4) are enriched in highly structured SLSs, while this is true for only one genic family, Myp-2. These observations support the overall hypothesis that many of these sequence families fold in a secondary structure at the RNA level, particularly those located in intergenic regions, where the translation machinery is not expected to interfere with secondary structure formation.
Three novel intergenic structured families, Hin-1 in H. influenzae, Nem-4 in N. meningitidis and Pam-1 in P. multocida are composed of similar sequences, characterized by the repetition of short, abundant oligonucleotides, known as DUS [36]. The recurrence, at specific short distances, of this basic oligonucleotide module, shorter than the searched pattern, produces a conserved SLS larger than the required threshold. It is possible that these sequences function as transcriptional terminators, and it has been recently reported that terminator hairpins are indeed frequently formed by closely spaced, complementary instances of exogenous DNA uptake signal sequences [43].
Some novel structured families are located within CDSs. They often contain repetitive motifs of one or a few coding regions, such as Lac-1 in L. johnsonii, Pae-3 in P. aeruginosa and Efa-2 in E. faecalis. Interestingly, the Cod-2 family defines a very small repeat, found within various CDSs, encoding different peptides in different frames. Cod-2 repeats resemble repetitive sequence elements found by Claverie and coworkers in protein coding genes of R. conorii [44]. Five genic families found in M. pneumoniae are part of large (1.5-5.4 kb), possibly mobile repeated DNA sequences having coding capacity [45].
About one third of the identified families are found to be "unstructured". These sequences were not the object of the original search; a possible explanation of their detection is the incidental presence of SLSs within large repeated sequences. Most such families fall within CDSs (see Table  4, and Myt-10 in Figure 3 as an example).

Conclusion
A systematic analysis of 40 bacterial genomes is presented, aimed to identify repeated sequence families, sharing a common predicted secondary structure. This procedure identified practically all already described families meeting these constraints, as well as a larger number of novel, undescribed nucleic acid repeats.
About two thirds of the families shared a predicted conserved secondary structure, often a stem-loop based one. Interestingly, these families are mostly composed by elements located within intergenic regions. This localization reflects the hypothesis that RNA folding, within these regions, is more likely to occur, not being affected by the translation machinery.
The identification of repetitive sequence families, able to fold into secondary structures and preferentially located within intergenic regions, reinforces the notion that also in prokaryotic genomes, typically more compact than eukaryotic ones, a relatively large fraction, not coding for proteins, is likely to play a biological role, by encoding functional RNAs.

Selection of SLS clusters
SLSs previously identified in 40 bacterial genomes by Petrillo at al. [18] were taken as the starting population.
Only SLSs predicted to fold with a free energy <=-5 Kcal/ mol were used for the present study.
For each genome, selected SLSs were clustered according to a procedure, based on BLAST and MCL programs [46,19]. An all-against-all BLAST comparison was performed on the whole population, to create an E-value based distance matrix. The BLAST result matrix was pruned by removing hits linking overlapping SLSs, and subsequently fed to MCL to produce a set of clusters. BLAST was performed with an E-value cut-off of 1E-4 and only on the sequence top strand. MCL was run by setting the inflation parameter (I)equal to 4. The alignments of clustered elements were produced by PCMA [20] used with default parameters.

Aptitude to form a stable secondary structure
The aptitude of SLSs and control sequences to form a stable secondary structure was tested by running RAND-FOLD [22]. The '-d' option was used, in order to preserve dinucleotide frequencies. RANDFOLD was set to shuffle each sequence 1,000 times. In the tests reported in Figure  1, all clustered SLSs (panel A) were compared to a number of SLSs representing the 5% of initial SLS population (panel B) and to a number of genomic sequences having the same size of clustered SLSs, randomly extracted from the corresponding genomes (panel C). Control sequences analyzed in panels B and C, were selected three times, in order to evaluate average and standard deviations.

Cluster refinement
The regrouping procedures summarized in Table 2 were made as follows: • Regrouping by sequence was made by using the BLAST-MCL procedure (see above) on all SCRs, but in a less stringent way, i.e. setting parameter 'I' to 1.4.
• Regrouping by strand was performed by using the BLAST-MCL procedure, but allowing searches on the complementary strand and setting parameter 'I' to 1.4.
• Regrouping by location was obtained by merging clusters in which SCRs were partially overlapping, or within a distance of 150 bp, according to their genomic coordinates.
For each regrouping procedure, groups of clusters, sharing at least 50% of the elements, were fused into a larger one.

Identification of families by cycles of HMM searches
In order to identify all family members of each cluster, a procedure was developed, based on cycles of alignment by PCMA and search on the genome by HMMER package tools [21]. First, SCRs of clusters regrouped by sequence (see Table 2) were aligned by PCMA with option 'ave_grp_id' set to 50. The procedure can be summarized as it follows: 1. The alignment is used to build a HMM by HMMBUILD and HMMCALIBRATE, with the default options.
2. The produced HMM is used to search new elements within the genome, by using HMMSEARCH. E-value cutoff was set to 1E-10. Independent searches are run on each genomic sequence strand.
3. Identified sequences are extracted and aligned to their parental HMM by HMMALIGN. Pairs of overlapping sequences on the opposite strands are avoided by discarding the one with the worse score and E-value. 4. The aligned sequences are extended by 10% of the length of the parental HMM. Only the extensions are aligned by PCMA.
5. The alignment of the extended sequences is then used for the construction of a new model, returning to step 1.
The loop ends when one of the following criteria is met: • The detected sequences, which cover the entire model, are less than 7.
• The new model is shorter in terms of length than the previous one.
• The alignment does not extend the HMM any further (within a tolerance of 3 bp).
• The alignment contains a number of gaps higher than 30% of the aligned bases.
• The extreme value distribution, derived from the model calibration, is in the range Average_Score ± 3*Standard_Deviation, derived from HMMBUILD.
The HMM and the final alignment are used as definition of the family.

Secondary structure analyses
SLSs contained in sequences of each family were analyzed by RANDFOLD as described above and taken as positive if their p-value is < 0.005. Families were divided in four categories, according to the fraction of sequences contain-ing at least one positive SLS ('+++' if 90% or above; '++' if 70-90%; '+' if 50-70%; '-' if less than 50%). Figures  2 and 3 were chosen in the following way:

Representative sequences of the families shown in
• All sequences able to cover the entire model are sorted by the E-value determined by HMMSEARCH.
• Six sequences are picked from this population by selecting the best model-fitting one and five (if available) more with progressively increasing of the E-value.
Sequences were aligned to corresponding HMM by using HMMALIGN and the resulting alignments were analyzed by RNAz (version 0.1.1) [37]. For RNAz analysis, alignments with length <= 200 bp were used as a single block, while alignments with length >200 bp were screened in sliding windows (length 120 and slide 40), according to Washietl et al. [47].
RNAz was used with standard parameters. All alignments with RNAz classification score P > 0.5 were considered.
Overlapping hits, i.e. resulting from hits in overlapping windows, were analyzed again by using larger sliding windows able to contain structures obtained with different hits.

Conflicts of interests
The author(s) declare that they have no competing interests.