Computational prediction of splicing regulatory elements shared by Tetrapoda organisms

Background Auxiliary splicing sequences play an important role in ensuring accurate and efficient splicing by promoting or repressing recognition of authentic splice sites. These cis-acting motifs have been termed splicing enhancers and silencers and are located both in introns and exons. They co-evolved into an intricate splicing code together with additional functional constraints, such as tissue-specific and alternative splicing patterns. We used orthologous exons extracted from the University of California Santa Cruz multiple genome alignments of human and 22 Tetrapoda organisms to predict candidate enhancers and silencers that have reproducible and statistically significant bias towards annotated exonic boundaries. Results A total of 2,546 Tetrapoda enhancers and silencers were clustered into 15 putative core motifs based on their Markov properties. Most of these elements have been identified previously, but 118 putative silencers and 260 enhancers (~15%) were novel. Examination of previously published experimental data for the presence of predicted elements showed that their mutations in 21/23 (91.3%) cases altered the splicing pattern as expected. Predicted intronic motifs flanking 3' and 5' splice sites had higher evolutionary conservation than other sequences within intronic flanks and the intronic enhancers were markedly differed between 3' and 5' intronic flanks. Conclusion Difference in intronic enhancers supporting 5' and 3' splice sites suggests an independent splicing commitment for neighboring exons. Increased evolutionary conservation for ISEs/ISSs within intronic flanks and effect of modulation of predicted elements on splicing suggest functional significance of found elements in splicing regulation. Most of the elements identified were shown to have direct implications in human splicing and therefore could be useful for building computational splicing models in biomedical research.


Background
Eukaryotic genes contain intervening sequences or introns that need to be removed from precursor messenger RNA (pre-mRNA) in a complex process termed splicing. During pre-mRNA splicing, relatively short exonic sequences are recognized by spliceosome, a large RNA-protein complex. During splicing, introns are removed and exons are joined together to form mature RNA. In addition to splice site (SS) signals at the exonic 5' and 3' ends, accurate discrimination of exons and introns requires additional auxiliary elements [1][2][3]. These conserved but degenerate motifs have been termed exonic (ESEs) and intronic (ISEs) splicing enhancers and exonic (ESSs) and intronic (ISSs) splicing silencers that activate or repress splicing, respectively. These elements are thought to bind splicing regulatory factors, including the serine/arginine-rich (SR) proteins and the heterogeneous nuclear ribonucleoproteins [1]. Consistent with this concept, splicing regulatory motifs were shown to associate with a single stranded conformation that is more accessible to protein-RNA interactions [2]. Combinatorial interaction of splicing factors bound by these motifs is important for both constitutive and alternative splicing of pre-mRNAs because they contribute to the regulation of gene expression and proteomic diversity across higher eukaryotes [3][4][5][6].
Several systematic computational approaches and in vivo or in vitro selection methods have been employed to identify these motifs in the genomic sequences. For example, the RESCUE-ESE (Relative Enhancer and Silencer Classification by Unanimous Enrichment), a computational approach used in conjunction with experimental validation, predicted specific hexanucleotide sequences as candidate ESEs based on significantly higher frequency of occurrence in exons than in introns and also significantly higher frequency in exons with weak SSs than in exons with strong SSs [7]. The number of putative exonic enhancer and silencer octamers were computationally identified by their enrichment in internal non-coding exons versus unspliced pseudoexons and 5' untranslated regions of transcripts in intronless genes [8]. A cell-based fluorescence-activated screen (FAS), an in vivo splicing reporter system was used to identify ESSs that demonstrated consistent silencing results in a splicing reporter construct [9]. Evolutionary conserved intronic splicing regulatory elements were found by considering intronic boundaries surrounding orthologous exons in Homo sapiens, Canis familiaris, Rattus norvegicus and Mus musculus obtained from UCSC genome-wide multiple alignments [10]. Putative splicing regulatory sequences were reported based on evolutionary conserved wobble positions between human and mouse orthologous exons, along with overabundance of sequence motifs compared to their random expectation [11]. Exonic and intronic elements have also been predicted based on strand asymmetry [12]. Neighborhood Inference (NI) approach predicted ESEs and ESSs with activity in regulating biochemical processes based on the local density of known sites in sequence space [13]. Finally, a recent study based on deep re-sequencing of human transcriptome [14] uncovered a new repertoire of plausible intronic hexamers supporting the tissue-specific splicing events.
A large fraction of spliceosomal components are highly conserved across eukaryotes, including Tetrapoda (four-footed) organisms [1,6,[15][16][17], where the genes encoding well-known RNA binding proteins involved in splicing regulation are enriched with ultraconserved elements [18]. Three quarters of RESCUE-ESEs are shared between humans and mice [17]. Most of the human RESCUE-ESEs [7] have a pronounced bias towards exonic boundaries in more distantly related vertebrate organisms [17]. A number of experimental reports showed that genes from distantly related Tetrapoda organisms were correctly expressed and post-transcriptionally modified in transgenic animals [19,20]. These observations suggest that splicing regulatory motifs shared by tetrapods may further enrich known elements for functionally important sequences. However, no systematic studies have been carried out.
In this work, we predict an extensive set of cis-acting elements identified in a large set of Tetrapoda exons and characterize their overlap with previously identified silencers/ enhancers. Unlike in previous methods, we did not restrict the size of ESE/ISE/ESS/ISSs oligomers unless they are longer than 8 nt. Our prediction is based on the assumption that auxiliary splicing elements have pronounced statistically significant density increase/decrease towards the exonic boundaries compared to the deep intronic or exonic sequences. This assumption allows using the identified elements to improve performance of splicing prediction methods. Predicted ISEs/ISSs close to the annotated exons were examined for increased evolutionary conservation as compared to oligos with no predicted functionality. Finally, we investigated association of the elements placed in context with the single-stranded configuration of local pre-mRNA structure.  Section 6]. Among the predicted elements for the primates and outgroup 21 5'SS ISEs/ISSs and 82 3'SS ISEs/ ISSs were common between two clades. Splicing regulatory elements predicted for these two distant clades heavily overlapped with the elements ascertained for the entire Tetrapoda superclass (Table 1), suggesting a remarkable conservation of cis-acting splicing regulatory factors in vertebrate evolution.

Identification of splicing regulatory elements in tetrapods
We compared groups of the predicted exonic and intronic enhancers/silencers to better understand the "splicing code" supporting the exon definition. As could be seen in [see Additional File 1 Table S1] groups of ISEs supporting 5'SS and 3'SS sides intersect only half as expected by a random chance. This observation supports a hypothesis that independent mechanisms define neighboring exons and they do not share intronic enhancers located within common introns. On the contrary, ISSs are approximately four times more likely to be shared by the 5'SS and 3'SS sides, compared to a random chance expectation, and seem to play an active role in creating a "silencing" background within introns [21]. The group of 5'SS ISEs has substantial intersection with the 5'SS ESSs. This finding is consistent with previous observations that 5'SS ISEs frequently play silencing role if misplaced within exons [22]. This is further supported by a pronounced antagonism between 5'SS supporting ISEs and ESEs [see Additional File 1 Table  S1]. Table 2 puts the motifs we have found in retrospective context of previously reported elements shown in Table 3. Many of these elements were confirmed by the splicing reporter constructs. After excluding previously identified elements [10][11][12][14][15][16][17][18] from our Tetrapoda identified elements, a set of 373 novel oligomers was identified [see Additional File 1 Section 7]. Statistical significance for the motifs found along with LOD scores is shown in [see Additional File 1 Section 8].

Higher conservation of intronic elements
A higher evolutionary conservation of the elements found in the proximity of exonic flanks compared to the background sequences would be an important indicator of their functional importance in splicing regulation [10,14]. Within 12,000 multiple intronic flank sequence alignments we found a significantly higher conservation of the predicted intronic cis-acting octamers as compared to all other possible motifs (Table 4). Here we considered only the predicted octamers for uniform estimates of conservation scores, which would not be possible for elements of different sizes. Conservation degrees of ISSs and ISEs shown in Table 4 are similar, which suggests the importance of both enhancers and silencers in splicing definition.

Secondary structure association with the elements
According to [2] splicing enhancers and silencers are preferentially located in a single stranded region of RNA as compared to the controls, especially in the vicinity of SSs. This has been explained by the higher probability of transacting factors, such as SR proteins, to bind local singlestranded regions. We therefore determined the Probability Unpaired (PU) values for the predicted elements [see subsection Statistical analysis]. We considered only predicted octamers to obtain the PU values on the same scale which would be problematic for elements of different sizes. We examined sequences composed of predicted octamers surrounded with ± 30 nt context located in various segments of exons and introns as shown in Figure 1. PU values are known to be strongly associate with GC content of the motifs and the surrounding context [2], therefore it would be most informative to evaluate the difference in the distribution of PU values for the same group of elements surrounded by wild type and dinucleotide reshuffled contexts. Table 5 presents the average PU values for the elements located in the different segments before and after reshuffling.
Having the numerical series of PU values in various segments for different types of elements, we estimated if their distribution changes after dinucleotide reshuffling with Table 1: Intersection between putative intronic enhancers found separately for primates and outgroup clades and for the entire Tetrapoda superclass. Here is shown the ratio between the actual intersection and the expected intersection of the sets under the null hypothesis (expected intersection between the same number of randomly generated oligos). An intersection between the two sets of elements is calculated as the number of all the possible longest common substrings LCS of the two compared elements a and b, with the size | LCS| ≡ min(|a|,|b|), in ordered pairs (a, b) coming from the Cartesian product of the sets.

Vertebrates 3'SS ISEs/ ISSs
the two-sided Wilcoxon rank-sum test as shown in Table  5. Our working hypothesis was that if predicted enhancers/silencers are preferentially supported by a singlestranded configuration then average PU values should go down after contextual reshuffling as it would most probably disrupt the naturally occurring local secondary structures. We did not find statistically significant discrepancies in the distribution of PU values after reshuffling the contexts of elements located in segments associated with SS regulatory functions ('Next to 5'SS', 'Next to 3'SS' and 'Inside exon') as shown in Figure 1. The only exception was the insignificant reduction of PU values for both 5' and 3' ISSs located in deep intronic segments as could be seen in Table 5. This statistical significance is highly reproducible and holds even for reduced size subsets of 600 ISSs examined deep inside intron (P = 0.0072 for 5'SS ISSs and P = 0.027 for 3'SS ISSs).

Implication of elements found in splicing reporter experiments
In order to investigate the implications of elements found in splicing regulation, we considered systematic mutation Table 2: Intersection of predicted elements with the systematically identified elements reported in Table 1.

5'SS ISEs 5-mers [10]
Yeo et al. Here is shown the ratio between the actual intersection and the expected intersection of the sets under the null hypothesis (randomly generated oligos). An intersection between the two sets of elements is calculated as the number of all the possible longest common substrings LCS of the two compared elements a and b, with the size | LCS| ≡ min(|a|, |b|), in ordered pairs (a, b) coming from the Cartesian product of the sets.  Table S1] predicted 3'SS ISSs are three times more likely to overlap with 3'SS ESSs compared to overlap by random chance, which indicates that most of the 3'SS ISSs elements also act as 3'SS ESSs. This is further supported by Set of all other possible elements was obtained by excluding the ISEs and ISSs supporting either 5' or 3' SSs from the set of all possible octamers. We counted cases where oligonucleotides stay entirely conserved versus changing in at least one nucleotide position between the pairs of sequences from multiple sequence alignments, where only the motifs containing no gaps were considered. In case of 5'SS elements we considered window of size 20 nt starting 16 nt downstream from 5' exonic boundary in human sequence, where in case of elements supporting 3'SS we considered 20 nt window ending 63 nt upstream of 3' exonic boundary. noticing that FAS-ESS elements AGGGGT and GGAGGG [9] are similar to our predicted 3'SS ISSs GGAGGGG (A.IE -2.00) and TGGAGGG (A.IE -2.08) and a substantial overlap between predicted 3'SS ISSs and FAS-ESS decamers [24] as could be seen in Table 3. As could be seen in [see Additional File 1 Table S3]

Comparison of newly identified elements with known binding sites for RNA binding proteins
To further support the functional importance of the predicted elements we compared elements found with the oligonucleotides already known to attract RNA binding factors actively involved in splicing.
CA repeats bound by hnRNP L [25] are located in clusters D.IE.14 and A.IE.9 (here and further in this section we refer to [see Additional File 1 Section 4] listing the clusters of elements predicted). Clusters D.IE.4, A.IE.10, A.IE.12 are enriched with elements YCAY that bind the NOVA family of neuron specific splicing factors [22]. Poly-G signal has been reported simultaneously as an ISE signal [26] when located downstream of a 5' splice site (clusters D.IE.6, D.IE.12 and D.IE.15 are enriched with these elements) and play a role of an exonic silencer (cluster A.EE.5) when located inside exon [22]. The G-run-binding factor hnRNP H is known to participate in exon definition [27,28]

Conclusion
Using the orthologous exons currently available for 23 Tetrapoda organisms we have identified 2,546 unique splicing regulatory elements. Among these elements 203 (7.97%) 3'SS and 177 (6.95%) 5'SS supporting motifs are novel and have not been previously reported in systematic screens detecting such elements. Among our predicted elements, 51.81% were octamers and 41.08% of sequences were heptamers as compared to only 6.76% hexamers and 0.35% pentamers, suggesting that motifs of larger size play important role in splicing regulation. We detected intersections with some of the cis-acting elements reported in the previous studies, but not nearly as dramatic as we saw between the intronic elements predicted for primates and Tetrapoda non-eutherian (an outgroup) clades. It demonstrates high reproducibility of our results obtained for various vertebrate lineages and supports the existence of highly conserved splicing regulatory code Four segments for testing PU values for the predicted elements Figure 1 Four segments for testing PU values for the predicted elements. Next to the 5'SS and 3'SS segments were chosen to extend 50 nt context, not including ± 30 nt, inside intron from the corresponding exonic boundaries.

AG Py
Next to 5'SS segment Next to 3'SS segment

Inside intron segment
Inside exon segment

Exon Exon
Intron AG across vertebrates. This result also suggests the implications of elements found in regulating human splicing and may help explaining human hereditary disorders caused by mutations modulating such elements. We have established the higher evolutionary conservation for the predicted intronic cis-acting elements within mammalian intronic flanks which indicates their functional significance in exon definition. The elements found contain many of the known cis-acting factor binding sites with functionality supported by experiments with splicing reporter constructs. All these lines of evidence suggest active involvements of the predicted elements in control gradient directing spliceosome to the proper exons in the process of pre-mRNA splicing [23].
We did not observe statistically significant association for the predicted groups of cis-acting elements with the secondary pre-mRNA local structure in the vicinity of the SSs, except for slightly increased single strandedness detected for 5' and 3' ISSs deep inside introns. This observation is in contrast to the earlier reported [2], where known splicing regulatory motifs were identified as more single stranded compared to controls in exonic vicinity. Our result may indicate a potential mechanism of how ISSsmediated silencing background keeps spliceosomal components inactive in the deep intronic sequences by providing stronger than normal binding affinity to preferentially single-stranded ISSs.
A remarkable intersection between the 5'SS ISSs and the 5'SS ESSs [see Additional File 1 Table S1] is explained by the highly improbable chances of having elements containing a core fragment of a strong 5'SS competitor consensus in vicinity of a 5'SS. We have also established that many 3'SS ISSs act as 3'SS ESSs. These observations suggest that discovered splicing regulatory elements have broad functionality spectrum spreading beyond genomic segments where they have been originally found, such as possible regulatory role in 3'UTR [14]. . The "threaded blockset alignments" [33], built under the assumption that all matching segments occur in the same order and orientation in the given sequences, were projected onto human reference exons predicted by the spliced alignment of human reference sequences ftp:/ /ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot against reference human chromosomal assemblies http:hgdownad.cse.ucsc.edu/goldenPath/hg18/chromo somes/ using the BLAT program [34]. Having the chromosomal sequences of corresponding organisms, the blocks from the multiple genome alignments were extended to include splicing signals and 205 nt intronic flanks. We collected functionally important regions of intronic flanks normally located no further than 100 nt from the exons [14,35] and deep intronic sequences which we used as background model located beyond 100 nt from the SSs. The splicing signals flanking the extended exons have been double checked with the Bayesian SSs sensor [36] to make sure the extension yielded the correct exonic boundaries and the splicing signals flanking the exons have statistically significant score indicating their splicing competence. We kept only one isoform per gene with the largest number of predicted exons.

Collection and validation of Tetrapoda exons
This exon set was extended with exons derived from processing of 28 vertebrates multiple genome alignments obtained from UCSC genome browser [37] from the following tetrapods: Bush Baby (Otolemur garnetti), Tree Shrew (Tupaia belangeri), Guinea Pig (Cavia porcellus), Shrew (Sorex araneus), Hedgehog (Erinaceus europaeus), Cat (Felis catus), Horse (Equus caballus), Platypus (Ornithorhynchus anatinus), Lizard (Anolis carolinensis). The blocks from a total of 28 vertebrates multiple genome alignments are normally shorter than blocks from 17 vertebrates multiple genome alignments, therefore chances are higher that the block extension may not produce the correct exonic boundaries. Only the exons associated with the species not obtained through the first round should be processed in the second round.
To establish the firm ground for using sequences from distantly related organisms in predicting common SS proximal elements and to estimate implication of elements found in modeling human splicing we conducted independent search for the elements in two distantly related clades of primates and non-eutherian Tetrapoda organisms. For these purposes we have examined 489,668 extended exons in primates clade (Human (Homo sapiens), Chimpanzee (Pan troglodytes), Rhesus (Macaca mulatta), Bush Baby (Otolemur garnetti)) and 476,218 extended exons from non-eutherian Tetrapoda organisms (Opossum (Monodelphis domestica), Chicken (Gallus gallus), Frog (Xenopus tropicalis), Platypus (Ornithorhynchus anatinus), Lizard (Anolis carolinensis)) taken as the most distant outgroup (a group of species known to be phylogenetically outside the primates clade) among Tetrapoda organisms relative to primates.
Through the literature search we collected the test set of 185 human genes previously linked to autism spectrum disorder and genes implied in environmental response [38]. A set of extended exons obtained through the spliced alignment of human reference sequences for the test set against the reference chromosomal assemblies, as described previously, was used as a sample representative collection of important human genomic regions with potential implication in medical practice. The set included 4,650 canonical 5' and 3' SSs flanking internal exons and was used to estimate association of local pre-mRNA secondary structures with the predicted elements.

Statistical analysis
We measured a statistically significant bias for all the possible oligonucleotides of size equal or less than 8 bp in the vicinity of true 5' and 3' SSs compared to distant "background" locations as shown in Figures 2 and 3. Our assumption is that enhancers are more common and silencers are less frequent in vicinity of SSs compared to "background". For the convenience of representation all the elements were scored using prefix tree structure as shown in [see Additional File 1 Figure S1] to determine statistical significance. The tree structure of height 8 has 65,536 leaves associated with any possible octamer where each internal node and root have out-degree 4 corresponding to the number of possible nucleotides at a next position. When octamers get inserted in the tree the counts associated with the traversed internal nodes and the destination leaf node increase. The scores associated with an internal node of certain depth correspond to the density of an oligonucleotide of size depth present at a certain positions within genome. A significant deviation of an oligo density in the proximity of SSs as compare to background locations is strongly indicative of important functionality of elements related to splicing [10,15,33].
Comparing oligo counts at the significance level of α = 0.011 using χ 2 test involved 87,380 statistical hypotheses testing for all possible oligos of size less or equal to 8 bp located at certain position relative to a SS. Following the Bonferroni correction for multiple hypothesis testing we reduced the significance level to for individual tests. All the χ 2 tests for the SS vicinity counts ≥ 63 or deep intronic/exonic counts ≥ 63 are statistically significant under conditions or Comparative measurements between the regions shown in Figure 3 were made in 3 rounds according to experimental schemas shown in Figure 2. Every round of scoring involved all the sequences from the exonic set, elements predicted in any of these 3 scoring rounds were reported. The second comparative measurement for the Skip value 29 nt, as shown in Figure 3(A), was necessary to detect intronic enhancers/silencers that have maximum impact on splicing when located at certain optimal distance from the exonic flanks, which is the known fact in case of polyG signals [26]. Elements detected in the first comparative measurement (for Skip = 0 nt in Figure 3(A)), in the second measurement (for Skip = 29 nt in Figure 3(A)) and for the third differential measurement as shown in Figures 2(C) and 3(B)) were merged in one prediction.
Extended orthologous exons associated with the same multiple genome alignment block frequently contain identical conserved oligonucleotides at certain positions relative to SSs, especially within exons. These conserved elements violate our assumption of independence of the sequences used for analysis of statistical significance; therefore within a block we counted unique element at certain position relative to SSs only once, disregarding all other identical motifs conserved since speciation from a common ancestor. To prevent substantial element underscoring in a counting round, since many of the elements at a certain position relative to a SS were sorted out due to evolutionary conservation, we shifted an element position by one within a counting region for each consecutive sequence as shown in Figure 3(C), where element coordinate within a region were calculated according to formula where mod is a modulo operation, counting round could be 0,1 or 2 and the sequence index goes from 1 to 2,333,379. Elements shifted by one position within the region are normally different and therefore not sorted out for being similar, which allows combining more elements in the region associated with a block under the same evolutionary pressure. The PU values for the predicted octamers surrounded by ± 30 nt context were calculated as described in [39] using Position within region region start counting round sequence in Location of genomic regions used for comparative analysis Figure 2 Location of genomic regions used for comparative analysis. (A) Statistical significance tests for intronic enhancing/ silencing elements surrounding exon. Blue is the null-hypothesis region and red is the region of statistical significance associated with the exon proximity. The red region is specifically located outside the area associated with donor or acceptor signal consensuses [36]. (B) Statistical significance test for the ESEs/ESSs elements supporting the exonic definition. This strategy allows canceling the statistical biases associated with the protein coding potential best characterized by the hexamer statistics [41] and focusing at the essential difference between the exonic flanks, normally enriched with ESEs [42], and the middle section supposedly depleted of such elements. (C) The differential strategy allows detecting enhancing and silencing elements that have substantially different concentration in vicinity of a strong vs. weak SS as defined by the Bayesian SS sensor [36]. The score from the sensor is measured on a discrete scale from 1 to 5, where 1 stands for a weak signal and 5 stands for strong.  [2], a method which produced consistent results for perfect loop configuration (PU = 1), perfect stem configuration (PU = 0) and a very similar PU value for the example in [2] (Figure 1) for natural pre-mRNA structure supporting TCTCTCT element. We have also confirmed that 77 known enhancer/silencer elements are more single stranded since average PU values were going down after dinucleotide contextual reshuffling (the control) as reported in [2]. The dinucleotide reshuffling procedure [41] were making 10,000 iterations equally distributed between the non-overlapping dinucleotides swapping within or across flanking segments, excluding the ele-Location of the counting regions used for oligonucleotide scoring relative to exonic flanks Figure 3 Location of the counting regions used for oligonucleotide scoring relative to exonic flanks. All short exons that were not able to accommodate the regions are disregarded. (A) The region arrangement for the counting strategies shown in Figures 2 (A) and (B), where the Skip value is set to 0 nt for the first comparative measurement and 29 nt for the second. The second comparative measurement is necessary to predict active intronic elements that have maximum enhancing/silencing potential at certain optimal distance from the exonic boundary, such as polyG signals [26]. The second measurement also trades the smaller number of longer exons considered for the greater chance of detecting element density discrepancy between the middle of the exons and the flanks. (B) The region arrangement corresponding to differential test strategy shown in Figure 2 (C). (C) The tiling strategy within a region increases the variety of elements sampled in a counting round. Tree different colors used to show which oligo within a region gets sampled in a three consecutive statistical tests (red in the first test, green in the second test, blue in the third test). This strategy reduces chances for multiple sampling of the same oligo conserved at a certain position in closely related organisms.