
Comparative analysis of sequence features involved in the recognition of tandem splice sites

Abstract

Background

The splicing of pre-mRNAs is conspicuously often variable and produces multiple alternatively spliced (AS) isoforms that encode different messages from one gene locus. Computational studies uncovered a class of highly similar isoforms, which were related to tandem 5'-splice sites (5'ss) and 3'-splice sites (3'ss), yet with very sparse anecdotal evidence in experimental studies. To compare the types and levels of alternative tandem splice site exons occurring in different human organ systems and cell types, and to study known sequence features involved in the recognition and distinction of neighboring splice sites, we performed large-scale, stringent alignments of cDNA sequences and ESTs to the human and mouse genomes, followed by experimental validation.

Results

We analyzed alternative 5'ss exons (A5Es) and alternative 3'ss exons (A3Es), derived from transcript sequences that were aligned to assembled genome sequences to infer patterns of AS occurring in several thousands of genes. Comparing the levels of overlapping (tandem) and non-overlapping (competitive) A5Es and A3Es, a clear preference of isoforms was seen for tandem donors and acceptors, with exon extensions of four nucleotides and of three to six nucleotides, respectively. A subset of inferred A5E tandem exons was selected and experimentally validated. With the focus on A5Es, we investigated their transcript coverage, sequence conservation and base-pairing to U1 snRNA, proximal and distal splice site classification, candidate motifs for cis-regulatory activity, and compared A5Es with A3Es, constitutive and pseudo-exons, in H. sapiens and M. musculus. The results reveal a small but authentic enriched set of tandem splice site preference, with specific distances between proximal and distal 5'ss (3'ss), which showed a marked dichotomy between the levels of out-of-frame and in-frame splicing for A5Es and A3Es, respectively, identified a number of candidate NMD targets, and allowed a rough estimation of a number of undetected tandem donors based on splice site information.

Conclusion

This comparative study distinguishes tandem 5'ss and 3'ss, with extensions of three to six nucleotides, as having unusually high proportions of AS, experimentally validates tandem donors in a panel of different human tissues, highlights the dichotomy in the types of AS occurring at tandem splice sites, and shows that human alternative exons spliced at overlapping 5'ss possess features of typical splice variants that could well be beneficial for the cell.

Background

As the central intermediate between transcription and translation of eukaryotic genes, the splicing of precursors to messenger RNAs (pre-mRNAs) in the nucleus is frequently variable and produces multiple alternatively spliced (AS) mRNA isoforms. The recognition of authentic pre-mRNA splice sites out of many possible pseudo-sites, the precise excision of introns, and the ligation of exons to produce a correct message are catalyzed by a large ribonucleoprotein (RNP) complex known as the spliceosome, which is composed of several small RNPs and perhaps over two-hundred proteins [1]. Splice sites mark the boundaries between exon and intron: a 5'-splice site (5'ss or donor) at the terminus of the exon/beginning of the intron and a 3'ss (acceptor) at the terminus of the intron/beginning of the exon. In addition, introns contain a branch point signal, typically 15 to 45 nucleotides upstream of the 3'ss. During later stages of spliceosome assembly, there are mediated interactions between the 5'ss and 3'ss, as well as splicing factors that recognize them, and a basic distinction is made between the pairing of splice sites across the exon ('exon-definition') or the intron ('intron-definition') [2]. In humans, with compact exons (average length of about 120 nucleotides) and comparatively much larger introns, exon-definition is thought to be the prevalent mode of RNA splicing. When a pair of closely spaced 3'ss-5'ss signals is recognized, the exon is roughly defined by interactions between U2 snRNP:3'ss, U1 snRNP:5'ss as well as additional splicing factors, including U2AF65:branch site and U2AF35:poly-(Y) site interactions.

AS events are categorized according to their splice site choice and one can distinguish four canonical types: exon-skipping (SE), in which mRNA isoforms differ by the inclusion/exclusion of an exon; alternative 5'ss exon (A5E) or alternative 3'ss exon (A3E), in which isoforms differ in the usage of a 5'ss or 3'ss, respectively; and retention-type intron (RI), in which isoforms differ by the presence/absence of an unspliced intron [3]. These types are not necessarily mutually exclusive and more complex types of AS events can be constructed from such canonical types. Alternative splicing produces similar, yet different messages from one gene locus, thus enabling the diversification of protein sequences and function [4]. In addition, AS holds the possibility to control gene expression at the post-transcriptional level via the nonsense-mediated mRNA decay (NMD) pathway. To guard against aberrantly or deliberately incorrectly spliced transcripts that would prematurely terminate translation, NMD ensures that only correctly spliced mRNAs that contain the full (or nearly so) message are subsequently utilized for protein synthesis. To this end, NMD scans newly synthesized mRNA for the presence of one or more premature-termination codons (PTCs), and, if detected, can selectively degrade defective mRNAs [5].

Fostered by the abundant accumulation of complementary DNA (cDNA) sequences and expressed sequence tags (ESTs), genome-wide computational studies of AS have investigated its scope in metazoans and estimated that a fraction of up to two-thirds of human genes are predicted to encode or regulate protein synthesis via such pathways [6–9]. These approaches have shown SEs to be the most frequent AS event in mRNA isoforms in human and other mammalian organ systems and cell types, followed by A3Es and A5Es, in turn followed by RIs [10]. Interestingly, the sequence information of SEs and their flanking regions, and the phylogenetic conservation of such information, is sufficient to discriminate constitutive exons from SEs and can be used in computational models to start predicting AS events that have not yet been uncovered by cDNA and EST analyses [11, 12].

Compared with the skipping of about one hundred exon nucleotides or the retention of several hundred intron nucleotides, A3Es and A5Es are thought to create more subtle changes, by affecting the choice of the 3'ss or 5'ss, respectively. Here, splice site usage gives rise to two types of exon segments: the 'core' common to both splice forms and the 'extension' that is present in only the longer isoform. Both types of AS events have been shown to play decisive roles during development (e.g., sex determination and differentiation in Drosophila melanogaster [13] or developmental stage-related changes in the human CFTR gene [14]), but also in human disease (e.g., 5'ss mutations in the tau gene [15]). A3Es and A5Es are thought to be regulated by splicing-regulatory elements in exons and nearby exon-flanking regions, as well as trans-acting antagonistic splicing factors, which bind them and affect the choice of splice sites in a concentration-dependent manner [16, 17]. Interestingly, computational studies showed that for both A3Es and A5Es the distribution of extensions, f(E), is markedly skewed toward short-range splice forms [18]. In particular, alternative splice sites that are separated by the three-nucleotide long motif NAG/NAG/ (where '/' marks an inferred splice site) make up a predominant proportion of A3E events in mammals, extending to invertebrates and plants [19, 20]. Yet additional support from experimental studies is still very sparse, and the similarities and dissimilarities of overlapping against non-overlapping ("competitive") as well as constitutive splice sites remain to be delineated.

Here, we describe an effort to compare and contrast A5E, A3E, and constitutive splice sites of human exons derived from transcript sequences, of different human organ systems and cell types, which were aligned to the assembled human genome sequence. To study known sequence features involved in the recognition and distinction of splice sites, we performed large-scale but stringent alignments of cDNAs and ESTs to the human and mouse genome. Subsequently, we experimentally validated a subset of computationally inferred patterns of overlapping AS patterns, by RT-PCR and direct sequencing, analyzed implicated sequence and transcript features, and compared A5Es with constitutive and pseudo-exons, as well as A3Es, in H. sapiens and M. musculus. We found differences for sequence conservation and base-pairing to U1 snRNA, proximal/distal splice site utilization, occurrence of candidate motifs, and transcript coverage in subsets of overlapping 5'ss.

Our results distinguish a small but authentic enriched set of A5Es (A3Es), with specific distances between proximal and distal 5'ss (3'ss), which show a marked dichotomy between the levels of in- and out-of-frame tandem splice site usage, identify a number of candidate NMD targets, and allow the rough estimation of a number of unobserved tandem AS events based on splice site information. The implications for the processing of human alternative transcripts are discussed.

Results

Biased extensions of alternative 5'ss and 3'ss exons

Exon-skipping is the most prevalent AS type produced by the human spliceosome, as well as by all other mammals investigated to date, when averaged across different organ systems and cell types that can exhibit tissue-enriched splice forms [21, 22]. Internal alternative exons that involve exclusively either the 3'ss (A3Es) or the 5'ss (A5Es) are also abundantly produced, while the simultaneous alteration of 3'ss and 5'ss (producing exons that overlap but match neither splice site) are markedly less frequent. For A5Es the most distal splice site defines the exon core, while proximal sites (if more than one alternative choice is possible) are exon extensions only included in selected mRNAs.

Out of a collection of ~37,400 transcript-inferred human alternative exons maintained in the HOLLYWOOD database [23], AS events of about 10,300 A5Es and 9,200 A3Es were filtered for exon splice variants of solely one proximal/one distal 5'ss (3'ss), while being constitutively spliced at the opposite site, which resulted in 5,275 A5Es and 4,497 A3Es; neither exon set had any other inferred AS type. Stringent alignment criteria were imposed on all transcripts: 1) ESTs were required to overlap at least one co-aligned cDNA; 2) the first and last aligned segments of ESTs were required to be at least 30 nucleotides in length with 90% sequence identity; 3) the entire EST sequence alignment was required to extend over at least 90% of the length of the EST with at least 90% sequence identity; and 4) realignments of ESTs with two other algorithms were required to agree, so that all three independent alignments gave the same inference (see below, as well as Methods). The resulting dataset of identical computational inferences of three methods contained 1,868 (~18%) A5Es and 3,301 (~36%) A3Es.
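
The four alignment criteria above can be expressed as a simple filter. The following is a minimal sketch, not the pipeline used in the study: the `EstAlignment` record and its field names are hypothetical, "90% sequence identity" is read as "at least 90%", and agreement between algorithms is modeled as a set of aligner names that infer the same splice event.

```python
from dataclasses import dataclass

# Names of the three aligners mentioned in the text.
ALIGNERS = frozenset({"SIM4", "BLAT", "EXALIN"})


@dataclass(frozen=True)
class EstAlignment:
    """Hypothetical per-EST summary; field names are illustrative only."""
    est_length: int               # EST length in nucleotides
    aligned_length: int           # EST nucleotides covered by the alignment
    identity: float               # identity over the aligned part, 0..1
    first_segment: tuple          # (length, identity) of first aligned segment
    last_segment: tuple           # (length, identity) of last aligned segment
    overlaps_cdna: bool           # True if a co-aligned cDNA overlaps the EST
    agreeing_aligners: frozenset  # aligners inferring the same splice event


def passes_filters(aln, min_seg_len=30, min_identity=0.90, min_coverage=0.90):
    """Return True if the alignment satisfies criteria 1) to 4)."""
    # 1) the EST must overlap at least one co-aligned cDNA
    if not aln.overlaps_cdna:
        return False
    # 2) terminal segments: at least 30 nt, at least 90% identity (assumed)
    for seg_len, seg_identity in (aln.first_segment, aln.last_segment):
        if seg_len < min_seg_len or seg_identity < min_identity:
            return False
    # 3) whole alignment: at least 90% of the EST, at least 90% identity
    if aln.aligned_length < min_coverage * aln.est_length:
        return False
    if aln.identity < min_identity:
        return False
    # 4) all three algorithms must agree on the inferred event
    return ALIGNERS <= aln.agreeing_aligners


example = EstAlignment(500, 480, 0.97, (45, 0.98), (60, 0.95), True, ALIGNERS)
print(passes_filters(example))
```

Requiring agreement of all three aligners is the step that removes the largest share of events in the text (5,275 to 1,868 A5Es), so in practice it would be worth logging which criterion rejects each alignment.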

We subdivided alternative exons into their core and extension, where the latter is the sequence between the distal and proximal splice sites. The extension (E) included lengths up to about 250 nucleotides, with quickly decreasing transcript coverage/utilization as E increases. Larger extensions existed, albeit with barely more than a few transcripts (data not shown). For the sake of simplicity, we defined the boundary between A5E (A3E) overlapping and non-overlapping splices at E > 6 (E > 18) nucleotides and displayed the distribution f(E) for E = 1,2,...,18 nucleotides in a window across the boundary region. Noticeably, the obtained distribution f(E) for both A5Es and A3Es was highly biased for extensions with overlapping splice sites. Figure 1 shows (in the upper-left panel) that for extensions at the 5'ss the bias is caused predominantly by a peak at E = 4 nucleotides. It further shows for A5Es that short extensions exhibit a small but persistent pattern periodically occurring at E = 6, 9, 12, 15, and 18 nucleotides, all multiples of three, and thus preserving the reading-frame. These patterns of AS for short extensions were in accord, both qualitatively and in good approximation quantitatively, in an independent, comparative analysis for the mouse Mus musculus (Figure 1, lower-left panel). Overall, the median sizes of inferred alternative exons showed that SEs and A5Es tend to be shorter than CEs and A3Es, while overlapping and skewed to larger sizes [see Additional File 1, Figure S1].
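
The distribution f(E) and the reading-frame argument can be illustrated with a short sketch. The function names and the toy list of extension lengths are hypothetical and do not reproduce the study's data; the only facts taken from the text are the window E = 1,...,18 and the observation that extensions that are multiples of three preserve the reading frame.

```python
from collections import Counter


def extension_length(distal_pos, proximal_pos):
    """Extension E: number of nucleotides between the two splice sites."""
    return abs(proximal_pos - distal_pos)


def extension_distribution(extensions, max_e=18):
    """Relative frequency f(E) for E = 1..max_e (events outside are ignored)."""
    counts = Counter(e for e in extensions if 1 <= e <= max_e)
    total = sum(counts.values())
    if total == 0:
        return {e: 0.0 for e in range(1, max_e + 1)}
    return {e: counts.get(e, 0) / total for e in range(1, max_e + 1)}


def preserves_frame(e):
    """An extension keeps the reading frame only if it is a multiple of 3."""
    return e % 3 == 0


# Toy input (hypothetical): extension lengths of eight alternative exons.
toy_extensions = [4, 4, 4, 3, 6, 9, 4, 12]
f = extension_distribution(toy_extensions)
print(f[4])
print([e for e in range(1, 19) if preserves_frame(e)])
```

Under this reading, the dominant A5E peak at E = 4 shifts the frame, whereas the smaller periodic peaks at 6, 9, 12, 15, and 18 do not.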

Figure 1

Occurrence of extensions ( E = 1,2,...,18 nucleotides) for A5Es (parts A, C) and A3Es (B, D), with human and mouse exons in the top and bottom panels, respectively. Extensions were inferred from three different alignment algorithms (colored as blue, SIM4; red, BLAT; and green, EXALIN) of cDNAs/ESTs to genomic DNA. The distribution f(E) for A5Es was markedly biased for extensions (E) with overlapping splice sites, with a peak at E = 4 nucleotides. Exon extensions exhibited relatively smaller but persistent periodic peaks at E = 6, 9, 12, 15, and 18 nucleotides. f(E) for A3Es also displayed a bias for overlapping splice sites, with a peak at E = 3 nucleotides and smaller peaks at 4–6 nucleotides. The program SIM4 predicted significantly more extensions at E = 4 nucleotides as compared to BLAT and EXALIN predictions of the same initial set of cDNAs/ESTs, which was indicative of spurious alignments. A comparative analysis of alternative exons in M. musculus corroborated the above patterns.

Unexpectedly, Figure 1 was indicative that different splice-alignment algorithms gave rise to quite different outcomes, particularly when faced with alignments involving short extensions. Among several standard algorithms, SIM4 displayed a strong tendency toward E = 4 nucleotides. We took a conservative approach to substantiate the identified A5E events, by realigning all corresponding transcripts to the same genomic sequence with two other algorithms, EXALIN and BLAT (the latter lacks an explicit splice site model). The results showed that for E = 4 the proportion of A5E events derived from SIM4 (~28%) was markedly higher than alignments derived from EXALIN or BLAT – yet the bias for extensions was consistently shown at E = 4 nucleotides, though with a lower proportion of ~9% [see Additional File 1, Table S1]. Manual inspection of selected SIM4 alignments showed apparent sequence inconsistencies, when compared to the secondary alignments [see Additional File 1]. In all, 1,868 of 5,275 A5Es were taken for further analysis, where ~9% (171/1,868) accounted for E = 4 nucleotides extensions.

In order to compare these findings with A3E events, we obtained the distribution of short extensions and identified a similar, albeit distinctively different pattern (upper-right panel). Figure 1 shows that f(E) exhibits a clear peak at E = 3 nucleotides, with successively smaller peaks at E = 4, 5, and 6 nucleotides. Again, these AS patterns were corroborated in a comparative analysis for M. musculus (Figure 1, lower-right panel). The extension preference of alternative 5'ss and 3'ss exons is in accord with previous studies, where in particular E = 3 nucleotides for A3Es had been examined and found to obey the pattern NAG/NAG/ [20, 24, 25].

Tandem donors and acceptors

Patterns of A5E and A3E extensions with overlapping splice sites are interesting in their own right, because they 1) occur most abundantly; 2) are possibly regulated differently than non-overlapping, i.e. competitive, splice sites of alternative 5'ss and 3'ss exons [26, 27]; and 3) are predictive of different downstream effects of AS, resulting in different preferred modes of alternative splicing at the 5'ss (out-of-frame splicing) and the 3'ss (in-frame splicing). Because overlapping 5'ss and 3'ss are mainly characterized by extensions of four and three nucleotides, respectively, we hereafter denote by "A5EΔ4" tandem donors with E = 4 and similarly by "A3EΔ3" tandem acceptors with E = 3 nucleotides. For tandem donors, we study known sequence features involved in the recognition of the 5'ss, and compare them to the 3'ss of alternative and constitutive exons, including exons with pseudo donors.

Generally, the basic recognition and binding to 5'ss incorporates intronic (involving positions from 1 to 6) and exonic nucleotides (positions from -3 to -1). The consensus motif for 5'ss of mammalian genes is known as CAG/GTRAGT (at positions P-3, P-2, P-1 / P1, P2, ..., P6), where the purine (R) is either an adenine (A) or a guanine (G) base. This nine-nucleotide motif is highly degenerate and, in fact, in the present data set of human exons only proportions of ~0.9% (966/113,386) and ~1.3% (1,431/113,386) of inferred constitutive exons exhibited exact matches to the motifs CAG/GTAAGT or CAG/GTGAGT, respectively. Figure 2 illustrates splice sites and utilization of tandem donors for three selected human genes [see Additional file 2 for a complete list of inferred tandem donors]:

Figure 2

Illustrative examples of inferred tandem donors. White boxes denote exon and lines intron nucleotides; exon numbers (E#) correspond to 5'-to-3' enumerated REFSEQ annotations, the splice site score was measured by MAXENTSCAN, and the transcript coverage of the proximal and distal donor site corresponds to the number of aligned sequences. In A), E8 of the RAD9A gene shows a tandem donor with extension /GCAG/; in B), E9 of the ACAD9 gene shows a tandem donor with extension /GTAG/; in C), E15 of the SFRS16 gene shows a tandem donor with extension /GTCA/. Tandem donors in A) and C) were preferentially included in different transcripts. The conservation plot (PHASTCON scores, not in scale with the stated exon and intron nucleotides) covers A5EΔ4 splicing exons, as well as adjacent introns and downstream exons, and shows alternating patterns of high/low levels across all three examples.

A. The gene RAD9A (Ensembl gene-identifier ENSG00000172613) is a homolog conserved from yeast to human, which encodes a cell cycle-check point control protein that is required for cell-cycle arrest and DNA damage repair. The primary transcript sequence of RAD9A exhibited two alternative, overlapping 5'ss at exon E8, identified as CAG/GCAG/GT at the distal 5'ss and CAG/GTAGTT at the proximal 5'ss that extends E8 (the exon extension lies between the two slashes). The distal and proximal 5'ss gave rise to three and 17 mRNAs, respectively, which aligned to the primary transcript structure of RAD9A. In addition to the tandem donor pattern, Figure 2 shows the splice site strength, quantified by the MAXENT score (see Methods), and the conservation profile across exons and intron, quantified by the PHASTCON score [28] computed across several genomes (from P. troglodytes to T. rubripes). Local regions of high levels of sequence conservation for exons compared with the intron are apparent.

B. A tandem donor was detected for E9 (TTG/GTAG/GT and TAG/GTAAGT) of the ACAD9 gene (ENSG00000177646), which encodes a member of the Acyl-CoA dehydrogenase gene family and plays a role in lipid catabolism. The distal and proximal 5'ss gave rise to 13 and eight mRNAs, respectively. Figure 2 shows for E9 consistently elevated levels of sequence conservation.

C. The arginine/serine-rich splicing factor 16 (ENSG00000104859) showed a tandem donor at E15 (AAA/GTCA/GT and TCA/GTAAGA). Distal and proximal 5'ss choice gave rise to nine and six mRNAs of SFRS16, respectively. Figure 2 shows that the level of sequence conservation of E15 steadily rises toward the 3'-terminus and extends well across the exon-intron junction to I16, before it rapidly decays, which was indicative of conservation due to splicing-regulatory function [29].
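
To make the degeneracy of the nine-nucleotide 5'ss consensus CAG/GTRAGT concrete, the sketch below counts mismatches between a splice site written in the slash notation used above and the consensus. It is an illustration only: the helper name `mismatches` is hypothetical, the sequences are typed in from the three examples with the formatting spaces removed, and a simple mismatch count is not the MAXENT score used in the study.

```python
# IUPAC codes needed for the consensus; R is a purine (A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}

# Consensus CAG/GTRAGT, covering positions -3..-1 (exon) and 1..6 (intron).
CONSENSUS = "CAGGTRAGT"


def mismatches(site):
    """Count positions of a 9-mer 5'ss that deviate from CAG/GTRAGT."""
    nine_mer = site.replace("/", "").upper()
    if len(nine_mer) != len(CONSENSUS):
        raise ValueError("expected three exonic and six intronic nucleotides")
    return sum(base not in IUPAC[code]
               for base, code in zip(nine_mer, CONSENSUS))


# Proximal 5'ss of the three examples as given in the text.
for gene, site in [("RAD9A", "CAG/GTAGTT"),
                   ("ACAD9", "TAG/GTAAGT"),
                   ("SFRS16", "TCA/GTAAGA")]:
    print(gene, site, mismatches(site))
```

A count of zero corresponds to the exact matches (CAG/GTAAGT or CAG/GTGAGT) that the text reports for only about 1% of constitutive exons each.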

Experimental validation of tandem donors

Having obtained sufficient evidence from stringent transcript alignments, we sought to validate the functional utilization of tandem splice sites from independent lines of evidence. To this end, we first searched publicly available literature (see Availability and requirements section for Pubmed URL) for AS events involving short 5'ss extensions. Yet we found only a very limited number of reported cases of splice variants with short extensions that could be traced back to tandem donors. The human Clasp gene (known synonyms are SFRS16, or SWAP2 for the D. melanogaster homolog), for instance, encodes the Clk4-associating arginine/serine-rich (SR)-related protein that binds to the family of CDC2-like kinases [30, 31]. The 5'ss of E15 of Clasp/SFRS16 is an alternative tandem donor, which gives rise to the splice forms ClaspS (with the extension GTCA) and ClaspL (without). Both isoforms differ by 246 nucleotides, where ClaspS carries a PTC due to out-of-frame splicing and thereby omits a third RS-domain encoded by Clasp/SFRS16. Both isoforms were tissue-enriched in the mouse brain and testis, and displayed different intra-nuclear locations, possibly controlled by the third RS-domain [30]. Another AS event involving tandem splice sites has been detected in the human growth hormone (GH) gene cluster, whose expression is developmentally controlled. The gene GH-V differentially expressed three isoforms in the placenta and testis, one of which is due to a tandem donor splice site (/GTGG/GT) of exon E4; the tandem site was not sequence-conserved in the remaining four family members (GGGG/GT). The use of the distal out-of-frame splice site caused a reading-frame shift of E5 downstream, which, in turn, read through the original termination codon and utilized a new ("delayed") termination codon further downstream. Overall, the original splice variant and GH-V/Δ4 shared 124/219 and differed by 95/219 amino acids.

Clearly, the detection of alternative tandem splice site exons is hampered by the high similarity of the isoforms, which are often only detectable by direct sequencing and protein sequence analysis. Consequently, an experimental assay was used to explore the splicing patterns of computationally identified alternative tandem donors directly. Table 1 lists the names of a set of 14 genes with tandem donors (~8% of total), which were manually selected from known genes exhibiting a varying degree of transcript coverage (ranging from one to 35 transcripts for tandem splice site usage) and tested in a battery of human organ systems and cell types by RT-PCR primers targeted to the flanking exons; panels of nine normal tissue samples (from the brain, colon, heart, kidney, small intestine, spleen, thymus, ovary, and leukocytes) were assayed. The products of these 45 RT-PCRs were sequenced to verify their identity (see Figure 2, as well as Methods). For instance, Figure 3 shows schematically, for E15 of SFRS16, the gene structure, the proximal and distal sites of the tandem donor, and the sequence electropherogram interrogated in samples derived from the human spleen and blood. Upstream of the E15 tandem donor, both transcript sequences identically overlap and thus cannot be distinguished in the electropherogram; downstream, two nucleotide signals appear above the base line, indicating the presence of two isoforms.

Table 1 Summary of the experimental assay for validating computationally inferred human tandem donors.
Figure 3

Experimental validation of a tandem donor activated in E15 of the SFRS16 gene using RT-PCR and direct sequencing. The top shows the gene structure of SFRS16; in the middle and bottom, E14-16 are schematically extracted and the 3'-end core and full extension sequence of E15 for proximal (TCA/gtaaga) and distal (AAA/gtcagt) splicing are shown. Prior to reaching the 5'ss of E15, both mRNA isoforms cannot be distinguished and consequently the electropherogram displays, for each position, one nucleotide signal peak above the base line. After the tandem donor site, two nucleotide signals above the base line become visible, indicating the presence of two isoforms.

Table 1 lists the outcome for all 14 genes. In all, 50% (7 of 14 total) of selected A5EΔ4 splicing exons showed PCR products displaying E = 4 nucleotides for the sets of interrogated alternative exons, and the experimentally observed splice ratio between minor and major form was in agreement with the ratio suggested by EST data. Six of seven A5EΔ4 splicing exons could be mapped to protein-coding gene sequences, and all six CDS-affecting alternative exons created a PTC. Human tissue samples were chosen to match EST-associated cDNA libraries as far as possible; using a larger battery of different organ systems and cell types might validate additional A5EΔ4 splicing exons, and the conducted experiments therefore provide a lower bound on the presence of AS events involving tandem donors.

Two distinct levels of A5E proximal and distal splicing

Studies of the inclusion and exclusion of skipped exons of the human and mouse genomes have shown that SEs can be broadly subdivided into two types: SEs that are included in the majority of transcripts (termed 'major-form'), and those that are predominantly excluded ('minor-form'). Interestingly, such SEs possess different splicing and phylogenetic properties [32]. Here, we examined whether this property is more generally related to alternative exons, by analyzing the transcript coverage of 1,816 A5Es with one proximal/one distal 5'ss (no other inferred types of AS). Figure 4A shows a scatter plot of the distal against proximal 5'ss transcript coverage for both tandem and competitive donors; the individual transcript coverage of the distal (proximal) splice site is placed above (on the right-hand side). The scatter plot shows that the number of aligned transcripts ranges from a single transcript up to more than one hundred, with the average centering on ~13, and is biased toward lower coverage (median value of 2). We defined the ratio of proximal over distal 5'ss usage (R) and computed R for human, as well as mouse, A5Es. The inset of Figure 4A shows that the histogram of log(R) displays a bimodal distribution, which is indicative of the presence of two types (or subpopulations) of alternative 5'ss exons: one that is characterized by the utilization of the proximal over the distal 5'ss (type-I), and another by the utilization of the distal over the proximal 5'ss (type-II). This is reminiscent of the "major/minor form" definition of SEs, albeit here it applies to both A5E proximal and distal splice sites. We used the threshold of Rc = 2 to group all A5Es into type-I and II, or a remaining type, based on the behavior of R (see also Methods). Having two subpopulations of tandem donors, we denote by "PΔ4" ("pΔ4") the major (minor) form proximal donor of type-I, and by "DΔ4" ("dΔ4") the major (minor) form distal donor of type-II. Similarly, competitive proximal and distal 5'ss splice sites are denoted as "PΔ" ("pΔ") for type-I and as "DΔ" ("dΔ") for type-II, respectively (cf. Table 2).
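
The grouping by the ratio R can be sketched as follows. The exact decision rule is given in the Methods and is not reproduced here, so this sketch assumes a symmetric rule: R ≥ Rc is type-I, R ≤ 1/Rc is type-II, and everything else is the remaining type. The function name and the base-2 logarithm are also assumptions; the coverage values in the usage lines are the mRNA counts quoted for the three examples of Figure 2.

```python
import math


def classify_a5e(proximal_cov, distal_cov, rc=2.0):
    """Group an A5E by R = proximal/distal transcript coverage.

    Assumed rule (not taken from the Methods):
      R >= rc     -> "type-I"   (proximal 5'ss is the major form)
      R <= 1/rc   -> "type-II"  (distal 5'ss is the major form)
      otherwise   -> "remaining"
    Returns the label together with log2(R).
    """
    if proximal_cov <= 0 or distal_cov <= 0:
        raise ValueError("both splice sites need at least one transcript")
    r = proximal_cov / distal_cov
    if r >= rc:
        label = "type-I"
    elif r <= 1.0 / rc:
        label = "type-II"
    else:
        label = "remaining"
    return label, math.log2(r)


# (proximal, distal) mRNA counts as quoted in the text for Figure 2.
for gene, proximal, distal in [("RAD9A", 17, 3),
                               ("ACAD9", 8, 13),
                               ("SFRS16", 6, 9)]:
    print(gene, classify_a5e(proximal, distal))
```

With a symmetric threshold, type-I and type-II events fall on opposite sides of log(R) = 0, which is where the two modes of the histogram in the inset of Figure 4A would be expected.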

Table 2 Summary of selected features analyzed for A5Es with competitive donors (A) and A5EΔ4 splicing exons with tandem donors (B), separated into major (PΔ4, DΔ4) and minor (dΔ4, pΔ4) splice forms.
Figure 4

Scatter plot of the transcript coverage of competitive and tandem donors (A) and acceptors (B). Vertical and horizontal axes refer to the coverage of distal and proximal splice sites; solid and dotted lines mark the transcript means; A5EΔ4 and A3EΔ3 splicing exons are bolded, green and blue mark the PΔ and DΔ (major) splicing exons, respectively. The inset shows the histogram of the log-ratio (R) of the coverage of the distal over the proximal 5'ss (3'ss); curves marked in black show the smoothed distribution (splines, R package). In A) the coverage scatters mainly along the vertical or horizontal axis, which is indicative of preferentially including or excluding the exon extension from the core sequence. The coverage pattern was used to partition all A5Es into two main types, I and II, and a remaining type. The inset shows a bimodal shape for the histogram of R, which is indicative of two subpopulations of A5Es with predominant proximal or distal splice site usage. In B) the overlap between distal and proximal tandem acceptor coverage is comparatively broader, and consequently the histogram of R exhibits a unimodal shape consistent with a single population of A3Es.

Figure 4B shows the scatter plot of the distal against proximal 3'ss transcript coverage. Here, the points are more widely scattered than in Figure 4A and display an "arrowhead"-like structure. Using the same threshold as above, we find no clear distinction between splice sites for A3Es. Rather, the data are consistent with a single population of A3Es, and the inset shows the histogram of R as an approximately unimodal shape with values of R in a similar range as observed for A5Es.

In all, tandem and competitive A5Es comprise a set of 1,641 out of 1,868 (~88%); the remaining ~12% either exceeded the threshold definition or were covered by a single transcript. The proportion of PΔ and DΔ splicing exons was ~59% (type-I) and ~41% (type-II), which contrasted with PΔ4 and DΔ4, with ~26% (44/171) of type-I and ~69% (118/171) of type-II exons, respectively (P < 0.0001; Fisher's exact test). Scatter plots, populations, and histograms were corroborated in a comparative analysis of the transcript coverage for A5Es in M. musculus (data not shown).

Splice sites of A5Es score differently between type-I and type-II

We computed the 5'ss score distribution to study the relationship between different types of transcript coverage and sequence-complementarity of base pairing to U1 snRNA. To this end, we applied a maximum-entropy (MAXENT), or Markov-random field, based model, which has been shown to capture additional statistically significant dependencies of splicing signals compared with standard position-weight matrix representations [33, 34], to score the 5'ss of all A5Es (see Methods). Figure 5A shows for all PΔ and PΔ4 splicing exons of type-I the score distribution, f(S), of the distal against proximal 5'ss. The score is large (S > 0) when the splice site is 'close' to the consensus sequence, and small (S < 0) when the splice site shows marked deviations from the consensus. For type-I, we found that the scores of most PΔ and PΔ4 splicing exons were positive, ranged up to S = 12 (in bits), and clustered narrowly around a mean value of S_PΔ ≈ S_PΔ4 = 7.5 (marked by horizontal lines in Figure 5A). In contrast, scores of the corresponding dΔ and dΔ4 (the minor-forms) fluctuated more broadly, and mean values were between ΔS_PΔ4 ≈ 4.5 and ΔS_PΔ ≈ 8 lower than those of the corresponding major-form splice sites. Interestingly, this trend was reversed for exons of type-II (DΔ, DΔ4), where for S_DΔ and S_DΔ4 the score clustered between 7 and 8, yet for minor-forms was again broadly distributed and clustered around S_pΔ ≈ 4.6 and S_pΔ4 ≈ -3.9, respectively. The different pattern of narrow/broad scattering of A5EΔ4 splice site strengths depending on their type was corroborated in a comparative analysis of f(S) in M. musculus [see Additional File 1, Figure S2].

Figure 5

Scatter plots of 5'ss scores of competitive and tandem donors (cf. notation of Figure 4). The upper panel shows the individual and mean scores (the latter is marked by solid/dashed lines); the lower panel compares on the left-hand side the cumulative score distribution of PΔ4 and dΔ4 splice sites with constitutive 5'ss and dΨ4 (pseudo distal 5'ss, in black), and on the right-hand side pΔ4 and DΔ4 splice variants with pΨ4 and 5'ss (pseudo proximal 5'ss, in black). The threshold at which the curves intersect (S*) marks the accuracy (A) at which sets can be distinguished with equal classification errors on major and minor splice variants. A(S*) ≈ 78% for PΔ4 versus dΔ4 (PΔ4/dΔ4) and A(S*) ≈ 92% for pΔ4/DΔ4, and A(S*) ≈ 95% for dΨ4/5'ss and A(S*) ≈ 99% for 5'ss/pΨ4. In the bottom, tables show the number of exons of each type above and below S*; ordered table entries are: TP, FP, TN, and FN (on white background).
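
The threshold S* described in the caption, at which major and minor splice variants are separated with equal classification errors, can be located by scanning candidate thresholds. The sketch below is one possible way to do this and is not the procedure used for Figure 5: it takes the observed scores as candidates, calls a site "major" when its score is at least the threshold, and picks the candidate where the two error rates are closest. The function name and the toy scores are hypothetical.

```python
def equal_error_threshold(major_scores, minor_scores):
    """Return (S*, accuracy) where both error rates are (nearly) equal.

    A site is predicted "major" if its score is >= the threshold.
    FN rate: fraction of major-form sites scoring below the threshold.
    FP rate: fraction of minor-form sites scoring at or above it.
    """
    n_major, n_minor = len(major_scores), len(minor_scores)
    best = None
    for s in sorted(set(major_scores) | set(minor_scores)):
        fn = sum(x < s for x in major_scores)
        fp = sum(x >= s for x in minor_scores)
        gap = abs(fn / n_major - fp / n_minor)
        if best is None or gap < best[0]:
            tp, tn = n_major - fn, n_minor - fp
            accuracy = (tp + tn) / (n_major + n_minor)
            best = (gap, s, accuracy)
    return best[1], best[2]


# Toy scores in bits (hypothetical, not taken from the study).
major = [8.0, 7.5, 9.1, 6.8, 3.0]
minor = [2.0, -1.5, 4.0, 7.0, 0.5]
s_star, acc = equal_error_threshold(major, minor)
print(s_star, acc)
```

On real score distributions the candidate grid could be replaced by interpolation between the two cumulative curves, which is closer to reading off the intersection point in the lower panels of Figure 5.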

Observed patterns (/GTNN/GT) of proximal (PΔ4) and distal (DΔ4) tandem splice sites occurred with markedly different proportions (see Table 3). To what extent were the observed PΔ4 and DΔ4 splicing exons different from constitutive splicing exons (CEs) with pseudo donors having a "genomic predisposition" for tandem splicing (but were not observed)? We addressed this question by looking for constitutive 5'ss (/GT) that were flanked by another GT dinucleotide at a distance of four nucleotides either upstream (denoted as "dΨ4") or downstream of the authentic 5'ss ("pΨ4"). We searched a set of ~63,000 CEs (out of ~113,400) that exhibited proximal and/or distal pseudo tandem donors. Assuming position-independent nucleotide concentrations, the expected proportions would be ~10% (dΨ4) and ~48% (pΨ4), where the latter reflects the GT motif at positions P5 and P6 of the 5'ss consensus. We found that dΨ4 was lower than its expected occurrence and was present only in ~4% of CEs (P < 0.001; z-test), whereas pΨ4 was similar, albeit still significantly different, to the expected occurrence and present in ~47% of CEs (P < 0.001; z-test); a substantial proportion of ~5% (5,211) was comprised by GYNN/GYNNGY, but was excluded from further analysis to avoid any ambiguity. The score distribution f(S) for the above sets showed related differences. The mean scores of PΔ4 and constitutive 5'ss (downstream of dΨ4), SPΔ4 = 7.5 and S5'ss = 7.9, were about equally large (P < 0.13, Mann-Whitney test), yet SdΨ4 = -3.6 was significantly lower as compared with SdΔ4 = 2.8 (P < 2.2e-16). Similarly, the mean scores of DΔ4 and constitutive 5'ss (upstream of pΨ4), SDΔ4 = 7.9 and S5'ss = 8.7, were found to be similar, but still significantly different (P < 0.003), whereas SpΨ4 = -10.2 was significantly lower than SpΔ4 = -3.9 (P < 1.9e-13). 
In other words, minor splice variants of tandem donors (pΔ4, dΔ4) scored higher than pseudo variants (pΨ4, dΨ4) but lower than the 5'ss of constitutive splicing exons, and were consequently sufficiently different from pseudo splice sites, despite sharing the same genomic pattern.
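The search for pseudo tandem donors described above can be illustrated with a minimal sketch. It assumes a sequence string that spans the exon-intron boundary and a 0-based index of the authentic GT; the function name `pseudo_tandem_donors` and its arguments are hypothetical and are not part of the pipeline used in this study.

```python
def pseudo_tandem_donors(seq, donor, delta=4):
    """Report GT dinucleotides flanking an annotated 5'ss at a fixed distance.

    seq   : genomic sequence spanning the exon-intron boundary (assumed input)
    donor : 0-based index of the G of the authentic intronic GT (assumed input)
    delta : distance between the two sites (4 for the Delta-4 class)
    """
    seq = seq.upper()
    if seq[donor:donor + 2] != "GT":
        raise ValueError("no GT at the annotated donor position")
    up = donor - delta      # candidate site upstream of the authentic 5'ss
    down = donor + delta    # candidate site downstream of the authentic 5'ss
    return {
        # upstream GT: distal pseudo donor (the dPsi4 pattern)
        "distal": up >= 0 and seq[up:up + 2] == "GT",
        # downstream GT: proximal pseudo donor (the pPsi4 pattern)
        "proximal": seq[down:down + 2] == "GT",
    }


# A consensus-like donor (CAG/GTAAGT) carries GT at positions P5 and P6,
# so it shows the downstream (proximal) pattern but not the upstream one.
consensus_like = pseudo_tandem_donors("CCCCCCAGGTAAGTAT", donor=8)
```

In this toy case only the downstream pattern is present, which mirrors the observation above that the GT motif at P5 and P6 of the 5'ss consensus makes the pΨ4 pattern far more common than dΨ4.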

Table 3 Summary of the transcript coverage for all possible different patterns of A5EΔ4 splicing exons.

Discriminating between major and minor A5EΔ4 versus constitutive splicing exons

We used the difference between the 5'ss score distribution f(S) of major and minor A5EΔ4 splicing exons of tandem donors to test, based on the behavior of f(S) alone, how accurately PΔ4 can be distinguished from dΔ4, and DΔ4 from pΔ4 splicing exons. To this end, for type-I we computed the cumulative distribution F(S(n)), with n = 1, 2, ..., N, for the set {SPΔ4}, by 1) rank-ordering all scores S(n) from the smallest to the largest score; 2) calculating the partial sums $s_n = \sum_{m=1}^{n} S(m)$, with $s_N = \sum_{m=1}^{N} S(m)$; and 3) normalizing $F(S(n)) = s_n / s_N$. By construction, F(S(n)) is a monotonically increasing function of S and takes on its largest value at F(S(N)) = 1. Similarly, we computed G(S) = 1 - F(S) for the set {SdΔ4}, a monotonically decreasing function of S that takes on its largest value at G(S(1)) = 1. The intersection of F(S*) and G(S*) yields for each set the accuracy at which {SPΔ4} and {SdΔ4} can be distinguished, with smallest probability of error on the classification of both sets [35, 36].
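The intersection criterion can be sketched in a few lines. Note that this sketch deliberately substitutes plain empirical cumulative fractions (rank divided by set size) for the score-sum normalization described above, and defines accuracy as one minus the mean of the two error rates at the crossing point; both choices are simplifying assumptions, and the function name `equal_error_threshold` is hypothetical.

```python
import numpy as np

def equal_error_threshold(major, minor):
    """Find the threshold S* at which both classification errors are (nearly) equal.

    major : scores of the major-form sites (expected to be high)
    minor : scores of the minor-form sites (expected to be low)
    """
    major = np.sort(np.asarray(major, dtype=float))
    minor = np.sort(np.asarray(minor, dtype=float))
    candidates = np.unique(np.concatenate([major, minor]))
    # F(S): fraction of major-form scores <= S (major sites called "minor")
    F = np.searchsorted(major, candidates, side="right") / major.size
    # G(S): fraction of minor-form scores > S (minor sites called "major")
    G = 1.0 - np.searchsorted(minor, candidates, side="right") / minor.size
    i = int(np.argmin(np.abs(F - G)))          # closest approach of the two curves
    accuracy = 1.0 - 0.5 * (F[i] + G[i])       # mean of the two per-class accuracies
    return float(candidates[i]), float(accuracy)


# Toy scores (not data from this study): one major-form site falls among the minor ones.
s_star, acc = equal_error_threshold([3, 6, 7, 8], [1, 2, 4, 5])
```

Because the candidate thresholds are restricted to observed scores, the two curves may not cross exactly on small sets; the sketch then returns the score at which they come closest.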

Figure 5C shows for dΔ4/PΔ4 splicing exons the cumulative distributions F(S) and G(S) in the score range between -20 and 15, together with F(S) and G(S) for constitutive dΨ4/5'ss splicing exons for comparison. On the one hand, we find for PΔ4 and constitutive 5'ss that F(S) collapses to approximately one curve for S > 0, and that constitutive 5'ss exhibit a long range of negative scores, which was not seen for tandem donors. G(S) for dΔ4 decays similarly to dΨ4, albeit overall shifted by about ten units toward larger scores, and hence leads to a greater overlap between F(SPΔ4) and G(SdΔ4) as compared with F(S5'ss) and G(SdΨ4) for constitutive splicing exons. Consequently, the accuracy A(S* = 3.5) > 95% at which one can distinguish constitutive 5'ss from dΨ4 is larger than A(7.3) = 78% for dΔ4/PΔ4. On the other hand, in Figure 5D we find for DΔ4/pΔ4 and constitutive 5'ss/pΨ4 similar relationships for F(S) and G(S), with G(SpΔ4) overall shifted by about five units toward G(SpΨ4). Both pairs of distributions are more widely separated than those observed in Figure 5C, and thus the accuracy reached A(6) = 92% for alternative and A(4.6) = 99% for constitutive splice sites, respectively.

Note that distinguishing the sets above by means of a 5'ss score difference is closely related to using the log-likelihood difference (LLD) presented in [24]. This can most easily be seen by considering splice site scores derived from a standard position-specific weight-matrix (PSWM) model with independent nucleotide frequencies: provided the PSWM background model remains unchanged, the splice site score difference is equal to the LLD. Because the MAXENT splice site model incorporates higher-order statistical dependencies between nucleotides, this exact relationship is replaced by correlated values.
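The PSWM relationship can be checked numerically with a toy example. The sketch below assumes a uniform background and takes the LLD to be the difference of the log-likelihoods of two sites under the signal model alone; whether this matches the exact definition used in [24] is an assumption. The three-position matrix is invented for illustration and is not a fitted 5'ss model.

```python
import math

def pswm_score(site, pswm, background):
    """Log-odds score (bits) of a site under a PSWM with independent positions."""
    return sum(math.log2(pswm[i][b] / background[b]) for i, b in enumerate(site))

def log_likelihood(site, pswm):
    """Log2-likelihood of a site under the signal model alone."""
    return sum(math.log2(pswm[i][b]) for i, b in enumerate(site))


# Toy three-position matrix (illustrative values only).
toy_pswm = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
]
uniform = {b: 0.25 for b in "ACGT"}

# With an unchanged (here uniform) background, the background terms cancel
# in the difference, so the score difference equals the likelihood difference.
score_diff = pswm_score("GTA", toy_pswm, uniform) - pswm_score("CTG", toy_pswm, uniform)
lld = log_likelihood("GTA", toy_pswm) - log_likelihood("CTG", toy_pswm)
```

With a composition-dependent background the cancellation holds only if the background contribution is the same for both sites, which is one way to read the proviso that the background model remain unchanged.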

For these data, the subsets of pΨ4 and dΨ4 splice sites set an upper limit on the overall number of human tandem donors in which the pseudo splice site remained unobserved or unutilized. Using the threshold scores obtained from discriminating PΔ4 against dΔ4 (S* = 7.3), as well as DΔ4 against pΔ4 (S* = 6.0), one finds that 23 (~0.5%) exons of the dΨ4 set and 530 (~1.0%) exons of the pΨ4 set exceed these thresholds and are thus putatively unobserved tandem donors.

Nucleotide conservation around major and minor A5EΔ4 splice sites

Given the differences between tandem donors and constitutive splicing exons with either dΨ4 or pΨ4 splice sites, we compared and contrasted the nucleotide conservation around splice sites (cf. Table 4). To this end, we computed for each splice site position (Pi) the nucleotide frequencies of proximal and distal tandem donors in type-I and type-II, and represented their information score I by individual sequence logos [37] (see Methods). I is close to zero in the absence of nucleotide conservation with respect to the background, and increases with increasing conservation up to around two bits per sequence position.
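A per-position information score of this kind can be sketched as the relative entropy of the observed nucleotide frequencies against a background. The sketch assumes a uniform background by default and applies no small-sample correction, so it may differ in detail from the procedure given in Methods; the function name `information_per_position` is hypothetical.

```python
import math
from collections import Counter

def information_per_position(sites, background=None):
    """Relative entropy (bits) of each aligned position against a background.

    sites      : aligned splice site sequences of equal length
    background : nucleotide frequencies (assumed uniform if not given)
    """
    background = background or {b: 0.25 for b in "ACGT"}
    info = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        total = sum(counts.values())
        # Nucleotides that never occur contribute zero (0 * log 0 := 0).
        info.append(sum((c / total) * math.log2((c / total) / background[b])
                        for b, c in counts.items()))
    return info


# Toy alignment: position 0 is unconstrained, position 1 is an invariant G.
toy_info = information_per_position(["AG", "CG", "GG", "TG"])
```

Against a uniform background the score is zero for an unconstrained position and two bits for an invariant one, matching the range stated above; the difference ΔI between two sets is then a position-wise subtraction of two such vectors.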

Table 4 Pseudo tandem donors occurring upstream (dΨ4, distal) or downstream (pΨ4, proximal) of constitutive 5'ss.

Figure 6 shows in part A) pictograms for constitutive 5'ss and 3'ss, proximal (PΔ4) and distal (DΔ4) tandem donors, as well as A3EΔ3 splicing exons; in B) the information score difference (ΔI) between PΔ4 and DΔ4 tandem donors to constitutive 5'ss, respectively; and in C) a species comparison of splice site positions of human A5EΔ4 splicing exons that were sequence conserved at positions P-4P-3 or P3P4 in the exon of the orthologous mouse gene. We compared base frequencies of dΔ4/PΔ4 to constitutive 5'ss/pseudo dΨ4 splice sites, as well as DΔ4/pΔ4 to 5'ss/pΨ4 splice sites (data not shown), in order to identify differences in the base composition between these classes.

Figure 6

Splice site signals and sequence conservation around splice sites. A) Pictograms of 5'ss and 3'ss of constitutive, PΔ4 and DΔ4, and A3EΔ3 splicing exons. The height of a nucleotide represents the frequency of occurrence at a given position, represented in the range of 14 nucleotides around the splice junctions. Above the constitutive 5'ss, the 3'-end of the U1 snRNA is indicated. B) Information score difference (ΔI) between PΔ4 and DΔ4, respectively, and constitutive splicing exons, as well as A3EΔ3 and constitutive splicing exons. For each position, ΔI > 0 (ΔI < 0), indicates more (lack of) information of an alternative compared to a constitutive splice site. C) Sequence conservation of human PΔ4 and DΔ4 splice sites and splice sites of exons of orthologous mouse genes, 'anchored' at major splice sites and with > 80% exon sequence identity.

On the one hand, clear statistical differences were found for dΔ4/PΔ4 splicing exons with, e.g., significantly lower levels of C but higher levels of T at P-3 (P < 10⁻⁴, χ²-test) compared to dΨ4/5'ss splicing exons. Together with P-2 and P-1, which show a significant enrichment of G and A (P < 10⁻⁴, χ²-test) of dΔ4/PΔ4 over 5'ss/dΨ4 splicing exons, respectively, P-2 possibly mismatches to U1 snRNA upon binding to PΔ4, while P-3 and P-1 possibly support splicing upon binding to dΔ4 due to sequence-complementarity of base pairing with U1 snRNA. Other elevated levels of dΔ4/PΔ4 splicing exons were found for T at P-12 (P < 10⁻⁴), A at P-6 (P < 0.05), G at both P-5 and P5 (P < 0.05), and C or T at P6 (P < 10⁻⁴, χ²-tests). On the other hand, DΔ4/pΔ4 splicing exons showed a significant decrease (increase) of A (T) (P < 0.02) worsening the match with U1 snRNA for both DΔ4 and pΔ4, while an increase of A at P8 (P < 0.01) and T at P10 (P < 0.02, χ²-tests) improved the U1 snRNA sequence-complementarity of pΔ4 over pΨ4. In all, several splice site positions were differently depleted or elevated, often with the possibility to enhance the sequence-complementarity to U1 snRNA [38–41]. In particular, G at position P-1 has been attributed as crucial for U1 but not U5 snRNA base pairing, creating stacking effects to G at P1 [42], and the association of P-1 and P+5 observed for A5EΔ4 major-forms, as well as A5Es and CEs but also for dΔ4 splicing exons (type-I), was pointed out in Carmel et al. [42]. Additionally, P-7 and P-6 of dΔ4/PΔ4 splicing exons showed elevated levels of A over dΨ4/5'ss and could promote U5 snRNA-dependent base pairing via uridines in the U5 invariant loop, suggested to compensate for weaker U1 snRNA affinity [42] (neither dΨ4/5'ss nor pΔ4 splicing exons showed elevated levels).

The different levels were in accord with the average information score that takes into account the levels of all nucleotides, at a given position, against a background level. Figure 6B shows the difference ΔI between tandem and constitutive 5'ss, which is positive (negative) for higher (lower) scores of tandem against constitutive 5'ss. We found that dΔ4/PΔ4 splicing exons carried overall more information at P-12, P-6-P-2, and P-3, but as well at P-5, whereas we found that DΔ4/pΔ4 carried less information at P-2 and P-1, but more at P5 and P6. Interestingly, Figure 6B shows no marked fluctuations of ΔI between tandem and constitutive 3'ss. Figure 6C supports the above positional constraints detected for type-I and type-II, by showing the conservation around major (PΔ4, DΔ4) splice sites between human A5EΔ4 splicing exons and mouse exons of orthologous genes, 'anchored' at /GT or /GC splice sites, respectively (the major site, but not the minor site, is conserved by construction). DΔ4/pΔ4 splicing exons only conserved positions P5 and P6, whereas dΔ4/PΔ4 showed two recognizable overlapping 5'ss (positions P-4-P-2 and P1-P6) and U1 snRNA sequence-complement base pairing with extension nucleotides [42].

Exon-flanking sequences show levels of conservation in type-I, but lack of it in type-II tandem donors

Exon and flanking sequences of alternative conserved exons, or ACEs, of orthologous human and mouse genes exhibit significant levels of sequence conservation. This has most clearly been demonstrated for ACEs that undergo exon-skipping [10–12], and has also been shown for comparatively smaller sets (and thus larger statistical fluctuations) of A5Es and A3Es, including A3EΔ3 tandem acceptors [10, 19]. Such conservation could imply the utilization of splicing regulatory signals that are common to orthologous sets of genes.

We examined whether A5Es and their flanking regions exhibited comparatively higher sequence conservation when compared with constitutive exons. To this end, we mapped the set of tandem and competitive A5E exons to exons of orthologous mouse genes. Imposing a level of at least 80% sequence identity and canonical splice sites, we obtained matches for about 75% of PΔ4 and 90% of DΔ4 splice variants. For each species, we extracted the sequences of exons and up to 200 nucleotides of their flanking sequences downstream of the donor splice sites, and assessed the conservation levels for exon and intron regions (cf. Table 4 and Methods). We mapped as control sets 536/653 A3EΔ3 splicing exons (1); a randomly selected subset of CEs with 4,145/4,910 and 4,082/4,910 up- (dΨ4) and downstream (pΨ4) pseudo splice sites, respectively (2); and a randomly selected subset of 2,705/4,910 SEs (3). Note that exons of orthologous mouse genes can be constitutive or alternative and, if so, of the same or a different AS type.

Figure 7A shows for PΔ4 test and control sets the exon conservation as a combined score, and the intron conservation in the range between one and 100 nucleotides. Similarly, Figure 7B shows for DΔ4 test and control sets the exon and intron conservation. Test sets have smaller overall sizes than the controls, and therefore possess larger statistical fluctuations. We observe for both exons and introns the highest level of conservation for the control set of human SEs, which exhibit a clear enrichment over tandem donor A5Es and the remaining controls, in accord with previous analyses [11, 12, 43]. On the one hand, we found for intron flanking regions of PΔ4 splicing exons a markedly higher level of conservation as compared with CEs, ranging up to 80 nucleotides (Figure 7A), while we found for intron flanking regions of DΔ4 splicing exons a conservation level similar to CEs (Figure 7B). On the other hand, Figure 7A and 7B show no marked differences of exon conservation levels between sequences of A5EΔ4 and the control sets (except SEs), and for all investigated exon types the average conservation level was found between 80% and 85%. Previous analyses used datasets enriched by AS events that were specifically conserved between exons of orthologous human and mouse genes (also being smaller sized [10]), and a follow-up study incorporating such data did not distinguish between PΔ4 and DΔ4 splicing exons [44].

Figure 7

Sequence conservation and splicing regulatory elements of A5EΔ4, A3EΔ3, and SEs of orthologous human and mouse genes. Upper panels A) and B) show for different AS types graphs of the mean exon conservation and of the mean conservation of exon-flanking sequences up to 100 nucleotides downstream, respectively. The conservation is shown individually for PΔ4 (panel A, green) and DΔ4 (panel B, blue) splicing exons; extension regions of A5EΔ4 splicing exons were excluded. Lower panels C) and D) show plots of occurrences of different splicing regulatory elements, located within the first 200 nucleotides of exon-flanking sequences that share > 80% exon identity and splice site signals with mouse exons.

Occurrence of splicing signals in exon-flanking sequences

The above analyses suggested a higher downstream intron conservation of PΔ4 as compared to DΔ4 and constitutive splicing exons, in conjunction with a different splice site score between the major and minor splice variants. We examined whether the occurrence of splicing-regulatory elements could, to some extent, explain the observed differences (see Methods). To this end, we searched for over-representations of known oligonucleotides (six- to seven-mers) implicated in splicing regulation, which were enriched in A5EΔ4 over constitutive exon-flanking regions from one to 100 nucleotides. We made use of four sets of previously computationally and/or experimentally identified nucleic sequence elements: FAS2-ESS (set A) and PESS elements (set B), IREs (set C), as well as ESE elements (set D).

Figure 7C compares for PΔ4 splicing exons the frequency of occurrences of all four sets of sequence elements, binned into non-overlapping 20-nucleotide windows and separated for type-I and -II, against the control. Similarly, Figure 7D shows for DΔ4 splicing exons the frequency of occurrences of all four sets of sequence elements. For introns, we found for both PΔ4 and DΔ4 splicing exons a generally higher frequency of sequence elements from sets A and C, particularly from the start of the splice junction to about 40 nucleotides downstream, while elements of set B are differentially enriched in PΔ4 and suppressed in DΔ4 splicing exons.
Sequence elements in exons (set D) were indicative of a general enrichment of ESEs in PΔ4 splicing exons, particularly from about 40 nucleotides upstream to the splice junction, which was not found for DΔ4 splicing exons (with a peak at about 60 nucleotides upstream of the splice junction).
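Counting motif occurrences in non-overlapping windows can be sketched as follows. The sketch assumes an intron-flanking sequence that starts at the splice junction and a set of upper-case motifs; each occurrence is assigned to the window containing its first nucleotide, which is an arbitrary convention chosen here. The function name `windowed_motif_counts` is hypothetical, and the two motifs in the usage example are merely the SFRS16 hexamer and heptamer quoted in the following paragraph.

```python
def windowed_motif_counts(seq, motifs, window=20):
    """Count motif occurrences per non-overlapping window of a flanking sequence.

    seq    : flanking sequence, starting at the splice junction (assumed input)
    motifs : iterable of upper-case oligonucleotides (e.g. six- to seven-mers)
    window : window size in nucleotides
    """
    seq = seq.upper()
    n_bins = (len(seq) + window - 1) // window   # ceiling division
    counts = [0] * n_bins
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            counts[start // window] += 1         # bin by the motif's first nucleotide
            start = seq.find(motif, start + 1)   # allow overlapping occurrences
    return counts


# Toy 40-nt flank with one motif in each 20-nt window.
toy_flank = "GGGGGGC" + "A" * 13 + "TTGGTGGG" + "A" * 12
toy_counts = windowed_motif_counts(toy_flank, {"GGGGGGC", "GGTGGG"})
```

To compare test and control sets, such per-window counts would still need to be normalized by the number of sequences in each set.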

Exon E15 of the gene SFRS16, e.g., showed two purine-rich motifs, GGGGGGC and GGTGGG, located at 65 and 87 nucleotides downstream of the 5'ss (contained in sets A and B), respectively. Additional hexamers were located between the positions 117 and 123 nucleotides (GGGAGG), while other sequence elements (set C) occurred often closer to the E15 proximal donor of SFRS16, between five and 30 nucleotides. Poly(G)-rich sequence elements are binding sites for the family of hnRNP splicing regulators [45] and have been implicated in the control of 5'ss choice [46–48]. Interestingly, a phylogenetically conserved poly(G)-rich sequence element has previously been reported as involved in the selection of tandem /GTNNNN/GA splice sites in the splicing of the human FGFR gene [49].

A5EΔ4 splicing exons often produce NMD target substrates

Inferred AS events of A5EΔ4 and A3EΔ3 splicing exons showed a "splicing dichotomy" between the 5'ss and 3'ss – while AS events of the latter result in subtle but perhaps biologically significant in-frame variation of a single amino-acid, tandem donors result in out-of-frame shifts downstream of the tandem donor and could thus lead to a truncated protein with different function or unproductive splicing, depending on the (coding) exon position. Indeed, regulated unproductive splicing and translation (RUST) has been proposed to be a mechanistic link between AS and the NMD quality control pathway [50, 51]. What is the proportion of A5EΔ4 splicing exons in the present data that might be subjected to NMD? To address this, we 1) 'standardized' the initially obtained A5E annotation by matching it with REFSEQ-annotated sequences; 2) identified REFSEQ sequences with complete exon-intron structures and annotated start-stop codons of protein coding sequence (CDS) regions; and 3) imposed proximal and distal splice sites, and recalculated the altered reading-frame and stop codon position downstream of A5EΔ4 splicing exons, while neglecting possible compensating AS events at this step [see Additional File 1, Figure S3].

The detection of in-frame stop codons is schematically sketched in Figure 8. In all, 153/171 (~90%) inferred A5EΔ4 splicing exons were confirmed by at least one REFSEQ sequence at the distal (72%), proximal (27%) or either (1%) donor site, respectively. A large majority of A5EΔ4 splicing exons (~94%) was located in CDS regions, with only marginal proportions in the 5'-untranslated region (5'-UTR) or 3'-UTR. During splicing, choice of the out-of-frame tandem donor will create an mRNA isoform with an in-frame stop codon that introduces a premature termination codon (PTC) and shortens the C-terminus in ~97% of all considered cases. Tandem splicing of exon E8 of the human RAD9 gene at E8dΔ4, e.g., truncates the RAD9 domain by 52 amino acids (15% of total length). While possibly still maintaining the domain functionality, the loss of four C-terminal phosphoserines could prevent the interaction with the (9-1-1) cell-cycle checkpoint response complex [52]. In contexts of type-I and type-II, we found more than twice as many NMD candidates (~69%) produced by DΔ4 splicing exons (where splicing of pΔ4 produced PTCs) as by PΔ4 splicing exons (~26%, where splicing of dΔ4 produced PTCs). The remainder of about 5% of NMD candidates did not stem from type-I or type-II.

Figure 8

Annotation of A5EΔ4 splicing exons in REFSEQ genes. Percentages refer to fractions of A5EΔ4 splicing exons located in the 5'-UTR, coding sequence (CDS) region, or 3'-UTR. A black-colored "s" indicates the position of the stop codon relative to the REFSEQ transcript structure, whereas the red-colored version indicates the altered stop codon due to tandem donor splicing. A5EΔ4 splicing exons embedded within CDS regions are broken down into two categories, depending on the creation of a premature (PTC) or delayed termination codon (DTC). PTCs can mark mRNAs as substrates for nonsense-mediated decay (NMD).

In all, about three-quarters (78%) of PTCs were located more than 50 nucleotides upstream of the last exon-exon junction, and were thus predicted to produce a marked proportion of NMD substrates [5]. Interestingly, a small number of A5EΔ4 splicing exons (~3%) did not truncate the transcript as a result of the out-of-frame shift but instead extended it. In close relation to premature termination codons (PTCs), we term these "delayed" termination codons (DTCs), where all detected DTCs were produced from utilization of the minor donor (pΔ4). For instance, tandem splicing at the pΔ4 donor of exon E13 of the HNRPU gene (ENSG00000153187), which encodes the heterogeneous nuclear ribonucleoprotein (hnRNP) U, extended the CDS region by 27 amino acids. Due to the frame shift and the occurrence of synonymous and non-synonymous codons, the amino-acid sequence is changed such that the complexity at the protein level (determined by the tool SMART [53]) increases at the C-terminal end.
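The 50-nucleotide criterion can be illustrated with a minimal sketch. It assumes a spliced mRNA string, the 0-based start of the coding sequence, and the lengths of the exons in transcript order; the distance is measured here from the last nucleotide of the stop codon to the last exon-exon junction, which is one possible convention and an assumption of this sketch. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop(mrna, cds_start):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    for i in range(cds_start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None

def is_nmd_candidate(mrna, cds_start, exon_lengths, min_distance=50):
    """Apply the 50-nt rule to a spliced transcript.

    exon_lengths : lengths of the exons in transcript order (assumed input)
    """
    stop = first_stop(mrna.upper(), cds_start)
    if stop is None or len(exon_lengths) < 2:
        return False
    last_junction = sum(exon_lengths[:-1])       # mRNA coordinate where the last exon begins
    # Nucleotides between the end of the stop codon and the last junction.
    return last_junction - (stop + 3) > min_distance


# Toy transcript: the stop codon ends 60 nt upstream of the only junction.
exon1 = "ATG" + "GCT" * 5 + "TAA" + "C" * 60
exon2 = "C" * 30
toy_call = is_nmd_candidate(exon1 + exon2, 0, [len(exon1), len(exon2)])
```

To model a tandem donor, one would build the two mRNA strings obtained by splicing at the proximal and at the distal site (differing by four nucleotides at the exon end) and apply the same test to each, ignoring possible compensating AS events as in the analysis above.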

Discussion

Alternative splicing is essential for protein diversification and has recently been suggested as mechanistically linked to post-transcriptional gene regulation via nonsense mediated mRNA decay (NMD) [54]. The consequences for protein sequence and function alteration, as well as triggering of the NMD pathway, have been demonstrated for exon-skipping events in several studies [55–57]. While there is further evidence for the functioning and regulation of the remaining types of alternative exons [44], our understanding of their sequence evolution, produced AS patterns, regulation, and functioning still remains relatively vague [58]. In this paper, we analyzed differences and similarities between sets of A5Es, A3Es, and CEs, and focused on a particular type of a pair of alternative donors that are tandemly arrayed and overlapping.

Alternative 5'ss exons (A5Es) were computationally inferred from a collection of cDNA and EST sequences stringently aligned to the human genome, and their sequence features were compared to known features involved in RNA splicing. Spliced-alignments were obtained from three independent algorithms (SIM4, BLAT, and EXALIN). EXALIN detected the smallest number of subtle AS patterns, which are characteristic of tandem donors (involving extensions of just a few nucleotides), most of which were also identified by SIM4 and BLAT. Because there is no "true" method of inferring AS events, all analyses were based on the subset defined by the intersection of the predictions of all three algorithms. While one cannot rule out misalignments still arising from the three methods in some instances, care was taken to produce a confidence-enriched set. In addition, we pursued other independent lines of evidence and experimentally validated a subset of 14 human genes with tandem donors across different tissues. The outcome confirmed about 50% of the A5EΔ4 splicing exons and provided evidence that a substantial fraction of tandem donors detectable in public sequence repositories are not explained by sequence alignment ambiguities. We found that almost one tenth of all human A5Es with exactly one shorter and one longer splice variant, and no other inferred splice type (SE, A3E, or RI), were A5EΔ4 splicing exons. Interestingly, Figure 1 also shows a small but persistent pattern of higher frequencies at E = 6, 9, 12, 15 and 18 nucleotides, which indicates that competitive splice sites had biased extensions that preserve the reading-frame.

The central outcome of our study points to a splicing dichotomy between human alternative 5'ss and 3'ss exons in that they were markedly biased toward overlapping splice sites, with A5Es biased for E = 4 nucleotides (tandem donors, A5EΔ4), in contrast to A3Es biased for E = 3 nucleotides (tandem acceptors, A3EΔ3). Both A3E and A5E biases in exon length variation have been previously reported [20, 24, 25], but their pertinent features have largely remained hidden. It is important to note that AS at both the 5'ss and 3'ss gives rise to splicing variations with very subtle changes to the encoded protein sequence, but further downstream A5EΔ4 and A3EΔ3 splicing exons lead to very different consequences. While A3EΔ3 splicing exons of the form NAG/NAG/ have been analyzed in some detail, in part with several controversial interpretations [20, 24], A5EΔ4 splicing exons had not previously been confirmed experimentally and had only initially been characterized [25].

In this context, pertinent questions are whether 1) such frequently observed changes arise possibly by spliceosomal error, and 2) the eukaryotic cell has found a way to neutralize or even benefit from downstream consequences that arise from such AS events. Provided their biological authenticity, what is the nature of overlapping splice site choice? Several models for splice site choice have been proposed, including the competition between antagonistic splicing factors (e.g., ASF/SF2 and hnRNP A1) and U1 snRNP [59–61], a scanning mechanism [62], or cis-acting motifs with different free-energy for binding U1 snRNP and splice factors between competing sites [26]. These models take into account the binding property of the U1 snRNA and additional factors. Consequently, we investigated known features involved in splice site choice, as well as consequences to the post-transcriptional regulation of A5EΔ4-carrying genes, and compared A5EΔ4 splicing exons with A3EΔ3 and constitutive splicing exons in the light of existing models for 5'ss selection.

Examined features showed differences that individually came out subtle, yet taken in concert were indicative of a spliceosomal distinction of overlapping 5'ss. We found that overlapping tandem donors, but not acceptors, can be distinguished into major-form (PΔ4, type-I; DΔ4, type-II) and minor-form (dΔ4, type-I; pΔ4, type-II) splicing exons for both proximal and distal splice sites. This is further corroborated by splice site scores, which correlated with their respective major/minor-form behavior. On the one hand, splice sites deviated most from the consensus for PΔ4 splicing exons at positions P-4, P-3, and P3 (ΔI > 0) as well as P4, P5 (ΔI < 0), overlapping positions of U1 snRNA nucleotides implicated in 5'ss selection [26, 46]; some of which have also been related to codon preference [25]. Interestingly, more distant positions, such as P-12, also displayed statistically significant deviations from the consensus. Because of its close proximity to the edge of the U1 snRNA stem-loop, this position possibly contributes to U1 binding when dΔ4 is spliced. On the other hand, DΔ4 splicing exons showed different deviations from the consensus at P-2, P-1, P2 (ΔI < 0) as well as P5, P6 (ΔI > 0). Based on other experiments on position-specific stabilizing and advancing spliceosomal interactions with the 5'ss, these differences between type-I and type-II indicate that PΔ4 improves splicing compatibility with U1 snRNA above that of DΔ4.

Previous computational studies showed the conservation of sequences flanking ACEs at higher levels as compared with sequences around species-specific or constitutively spliced exons [12, 63]. We observed higher levels of conservation around PΔ4, but similar levels for DΔ4 splicing exons, when compared with constitutive exons (or the 5'ss of A3EΔ3 splicing exons). Interestingly, the higher level is in accord with a larger number of detected splicing-regulatory (ESS) elements, often positioned in proximity to A5E tandem donors. In contrast to typical AS events, however, tandem donors leave no room for regulatory elements between the alternative donors. Our data show an elevation of ESE elements near dΔ4, in conjunction with an enrichment of ESS elements in flanking introns. This could be interpreted in a model in which tandem donors restrictively exploit elements in proximal polarity (near dΔ4), to attract the U1 snRNP to this site of the tandem donor, and/or in distal polarity to dΔ4, to impair binding to PΔ4 [61].

Because the majority of tandem donors were embedded in CDS regions, Δ4 splicing was predicted to produce PTCs downstream. Splicing at pΔ4 produced putative NMD substrates in more than two-thirds of all cases, whereas dΔ4 splicing exons showed about one-quarter, suggesting that pΔ4 and dΔ4 (the minor-forms) were more likely to serve as the corresponding NMD candidates. Interestingly, a small set of A5EΔ4-carrying genes avoided PTCs, yet instead was inferred to use DTCs (delayed termination codons) positioned downstream of the original signal. Utilization of the E15 proximal tandem donor of the human SFRS16 gene, e.g., with significantly high levels of E15 flanking sequence conservation well over 120 nucleotides in I16 (typical of RNA splicing conservation across species [12]), produced a PTC that apparently avoided NMD [64]. Using differentially binding antibodies, a previous study [30] showed that SFRS16 produced two detectable isoforms, which correspond to E15 tandem splicing. In another example, a Δ4-type 5'ss change from type-I (wild-type) to type-II splicing was observed in E10 of human patients with a deficiency in the adenosine deaminase (ADA) gene, where a P+1G>A transition downstream of E10 activated splicing of a latent proximal donor [65].

A survey of gene ontology (GO) functions of the categories "molecular function" and "biological process" for genes with PΔ4 and DΔ4 splicing exons showed a significant enrichment in several proteins, while after corrections for multiple testing only the single GO-term "RNA binding" (P < 0.005, t-test) was significantly enriched, when compared between PΔ4 and dΨ4, as well as DΔ4 and pΨ4, splicing exons (see Methods).

Conclusion

This study substantially affirms the utilization of tandem donors, thus supporting and complementing earlier findings of previously undetected AS events [25, 44]. While there exist examples of cryptic Δ4-type 5'ss in the literature [33, 66], here we demonstrated that such splice variations are potentially enriched in authentic AS events, also supported by experimental studies [30, 67]. Critically, pertinent data are not yet at hand to make conclusive inferences about the specific regulation of A5EΔ4 splicing exons (e.g., controlled expression of species-specific minor/major isoforms). Here, transcript data acquisition and careful spliced-alignments have added to a higher confidence of tandem donor (and acceptor) utilization; deeper insight will require different types of data, e.g., from mini-genes in different organ systems and cell types, U1 snRNP mutants, or variations of splicing factor dosages.

In one extreme view, incorporating a mechanistic and dosage-dependent model [26, 61], the selection of AS sites depends on the binding properties of U1 and/or U6 snRNPs, interrelated with antagonistic effects mediated by splicing enhancing and suppressing factors. It was shown, e.g., that the choice of a tandem splice site of E10 of the FGFR gene can be determined by a higher sequence-compatibility of the E10 proximal splice site (pΔ6) to U6 snRNA [49]. In addition, constraints set by secondary mRNA structures [68, 69] have been shown to influence splice site choice. In the opposite extreme, suggested by the reduced difference of splice site scores, tandem donors could be the outcome of stochastic binding at overlapping 5'ss and lack implicit functional implications [24], which is supported by type-I isoforms. Either view largely requires the NMD pathway to control deliberately or aberrantly produced truncated messages.

Coming back to the question of whether there is a possible benefit of generating flawed mRNA isoforms, by deliberately or aberrantly produced AS variants with out-of-frame shifts and PTCs (either due to A5EΔ4 or other types of AS), what could be their functional utilization on the transcriptional or translational level? If such splice variants were generally produced across organ systems and cell types, in addition to their normal splice variants, cells would have means of producing low levels of imperfect proteins. Depending on the efficiency of mRNA quality control, a fraction of these variants is subjected to the NMD pathway during the first pioneer round of translation and degraded, while the products of the remaining fraction could still misfold and, depending on the quality control of protein synthesis, form defective ribosomal products (DRiPs). Ubiquitin-tagged peptide fragments that originate from DRiPs have recently been identified as a potent source of antigens for display by the MHC class I molecules on the cell surface to cognate CD8+ T-cells, in agreement with a recently suggested mechanism of "immune surveillance" [70–72]. A motivating example is given by the human Tyrosinase-related protein 1 (TYRP1), which utilizes two different reading-frames to produce the protein gp75 (recognized by IgG) and a truncated peptide of 24 amino acids. The latter was shown to be the source of an antigenic peptide specifically recognized by T-cells as a tumor rejection antigen [73]. It remains to be substantiated whether such antigenic peptides are linked to AS events that produce variants with out-of-frame shifts, such as those produced by tandem donors.

Methods

Data set of alternative exons

Exons of human and mouse genes were extracted from the HOLLYWOOD database [23]. For two different transcripts aligned to a genomic locus, alternative 5'ss exons (A5Es) matched at their 3'ss, but exhibited exactly one short and one long splice form resulting from variation at the 5'ss. Alternative 3'ss exons (A3Es) matched at their 5'ss, but exhibited exactly one short and one long splice form resulting from variation at the 3'ss. Constitutive exons (CEs) were defined as exons of multi-exon genes that to date have no transcript-supported evidence for undergoing any type of AS. In all AS events, A5Es, A3Es and CEs are "internal exons", and each exon had to obey the consensus splice sites /GT or /GC at the 5'ss and AG/ at the 3'ss. U12-type introns were excluded from this analysis because of their low fraction (less than 1% of human introns).
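As an illustration only, the pairwise definitions above can be sketched in a few lines. The sketch assumes plus-strand genomic coordinates with inclusive ends, so that the exon start abuts the 3'ss and the exon end abuts the 5'ss; the function name `classify_pair` and this coordinate convention are hypothetical and are not taken from HOLLYWOOD.

```python
from typing import Optional, Tuple

# (start, end) on the plus strand, end inclusive (assumed convention).
Exon = Tuple[int, int]


def classify_pair(exon_a: Exon, exon_b: Exon) -> Optional[Tuple[str, int]]:
    """Return (event type, extension length E) for two forms of one exon."""
    (start_a, end_a), (start_b, end_b) = exon_a, exon_b
    if start_a == start_b and end_a != end_b:
        # Shared 3'ss (exon start), variation at the 5'ss (exon end).
        return ("A5E", abs(end_a - end_b))
    if end_a == end_b and start_a != start_b:
        # Shared 5'ss (exon end), variation at the 3'ss (exon start).
        return ("A3E", abs(start_a - start_b))
    # Identical exons, or exons differing at both ends, are not classified.
    return None
```

Under these assumptions, a pair differing by four nucleotides at the exon end would be reported as an A5E with E = 4, i.e. a candidate A5EΔ4 event.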

Spliced-alignments

Manual inspection of A5Es with short extensions (E < 6 nucleotides), previously excluded in HOLLYWOOD, revealed a substantial number of putative alignment artifacts due to misaligned nucleotides close to exon-intron junctions [see Additional File 1]. Alignments were derived for ESTs by the SIM4 program [74], and were corroborated in a recent performance study of spliced-alignment algorithms [75]. In particular, we found examples where SIM4 introduces shifts of EST nucleotides between genomic donor and acceptor sites at genomic loci that encode short varying alternative exons (cf. Figure 1). To decrease the number of spurious alignments in the dataset of A5Es and A3Es, we used the original ESTs and created new transcript-to-genomic alignments by utilizing two different algorithms: 1) BLAT [76], as stored in the UCSC database (see Availability and requirements section for URL); and 2) EXALIN [75], with the parameter set (m, n, q, r, x) = (25, 25, -25, -25, -25). Manual inspection of control samples in the alignment results confirmed a clearly improved quality in the correct exon-intron boundary recognition. In all, about 35% of all initial A5E predictions (~9 %) of A5EΔ4 splicing exons could be confirmed by both BLAT and EXALIN alignments. Subsequent analyses were performed using the subset confirmed by three alignment methods.

Classification of major and minor tandem donors

The number of transcripts that aligned either to the distal donor, N(d), or to the proximal donor, N(p), was used to classify A5Es. To this end, one can 1) calculate the ratio R (0 < R ≤ 1) of the lower over the higher transcript coverage, R = N(d)/N(p) if N(d) ≤ N(p), or R = N(p)/N(d) otherwise; 2) compute the overall number of A5Es below a threshold value, R < T (0 < T < 1); and 3) define a splice site as "major" if its transcript coverage was at least twice as large as that of the corresponding "minor" splice site (T = 0.5). In this analysis, the threshold for minimal coverage was taken as a single transcript.
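The classification rule can be written down as a short sketch. The function names are hypothetical, and the boundary case R = T is treated as classified, following the wording "at least twice as large"; the source leaves this detail open, so it is an assumption of the sketch.

```python
def coverage_ratio(n_distal: int, n_proximal: int) -> float:
    """Ratio R of the lower over the higher transcript coverage (0 < R <= 1)."""
    if n_distal < 1 or n_proximal < 1:
        # Minimal coverage is a single transcript at each donor.
        raise ValueError("each donor needs at least one supporting transcript")
    return min(n_distal, n_proximal) / max(n_distal, n_proximal)


def classify_tandem_donor(n_distal: int, n_proximal: int, threshold: float = 0.5) -> str:
    """Label which donor is the major form, or 'unclassified' if R > threshold."""
    ratio = coverage_ratio(n_distal, n_proximal)
    if ratio > threshold:
        return "unclassified"
    # The donor with the higher coverage is the major form.
    return "proximal-major" if n_proximal > n_distal else "distal-major"
```

For example, ten transcripts at the proximal donor and two at the distal donor give R = 0.2, which this sketch labels as proximal-major (PΔ4, with the distal donor as dΔ4).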

Statistical analysis of splice sites

The deviation of splice sites from the consensus was quantified by a maximum-entropy scoring model, implemented in MAXENTSCAN and publicly available [34]. The 5'ss model incorporates the last three (first six) nucleotides of the exon (intron), and the 3'ss model incorporates the last 20 (first three) nucleotides of the intron (exon). Sequence logos and pictograms were computed and displayed using the WEBLOGO tool with finite-sample size correction [37].
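To make the window definitions concrete, the following sketch extracts the 9-nucleotide 5'ss window and the 23-nucleotide 3'ss window from a pre-mRNA sequence. It does not reproduce the maximum-entropy scoring itself; the function names and the 0-based indexing convention are assumptions made for illustration.

```python
def donor_window(seq: str, intron_start: int) -> str:
    """Last 3 exonic plus first 6 intronic nucleotides around a 5'ss.

    intron_start is the 0-based index of the first intronic nucleotide.
    """
    if intron_start < 3 or intron_start + 6 > len(seq):
        raise ValueError("not enough flanking sequence for a 9-nt window")
    return seq[intron_start - 3:intron_start + 6]


def acceptor_window(seq: str, exon_start: int) -> str:
    """Last 20 intronic plus first 3 exonic nucleotides around a 3'ss.

    exon_start is the 0-based index of the first exonic nucleotide.
    """
    if exon_start < 20 or exon_start + 3 > len(seq):
        raise ValueError("not enough flanking sequence for a 23-nt window")
    return seq[exon_start - 20:exon_start + 3]
```

For a tandem donor, calling `donor_window` at both candidate intron starts (four nucleotides apart for Δ4 events) yields the two windows whose scores are compared.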

P-values of splice site frequencies were calculated as follows: 1) frequencies of occurrence at the considered positions of PΔ4 and pΨ4 splicing exons, as well as DΔ4 and dΨ4 splicing exons, were compared by a 4 × 2 contingency table and χ2-test; 2) statistically significant positions were selected at P < 0.05; 3) at the same position, the nucleotide (maximally two nucleotides) with the largest difference of the frequency of occurrence between two types (e.g., PΔ4 and pΨ4) was subsequently tested against the remaining nucleotides by a 2 × 2 contingency table and χ2-test, where P < 0.05 was considered as statistically significant.
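A minimal sketch of the two tests is given below, assuming that the rows of the table are nucleotide counts and the columns are the two exon classes, and that no row or column total is zero. It computes the Pearson χ2 statistic and uses the closed-form tail probabilities for one and three degrees of freedom, which are the only cases needed for 2 × 2 and 4 × 2 tables. No continuity correction is applied; whether the original analysis used one is not stated.

```python
import math
from typing import Sequence


def chi_square_statistic(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic of an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def chi_square_p_value(statistic: float, dof: int) -> float:
    """Upper-tail probability for 1 or 3 degrees of freedom (closed forms)."""
    tail = math.erfc(math.sqrt(statistic / 2))
    if dof == 1:
        return tail
    if dof == 3:
        return tail + math.sqrt(2 * statistic / math.pi) * math.exp(-statistic / 2)
    raise ValueError("only 1 (2 x 2) or 3 (4 x 2) degrees of freedom are supported")
```

In this layout, a 4 × 2 table has (4 - 1)(2 - 1) = 3 degrees of freedom and a 2 × 2 table has one.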

The information along a sequence was calculated as the relative (or Kullback-Leibler) entropy, which estimates the "distance" between an observed frequency distribution (p) and an expected frequency distribution (q), according to [77]

$$ I = \sum_{k} p_k \cdot \log_2\left(\frac{p_k}{q_k}\right) $$

where k indexes the possible outcomes (here, the four nucleotides). The summation over all relevant sequence positions gives the total information score. The background distribution over (A, G, C, T) was taken as (q1, q2, q3, q4) = (0.2, 0.3, 0.3, 0.2).
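The calculation can be illustrated with a short sketch that uses the background distribution stated above. The function names are hypothetical, the input is assumed to be a set of equal-length aligned sequences over A, C, G and T, and no finite-sample correction is applied here.

```python
import math
from collections import Counter
from typing import Dict, Mapping, Sequence

# Background distribution q as given in the text.
BACKGROUND: Dict[str, float] = {"A": 0.2, "G": 0.3, "C": 0.3, "T": 0.2}


def relative_entropy(p: Mapping[str, float], q: Mapping[str, float] = BACKGROUND) -> float:
    """Kullback-Leibler entropy in bits; terms with p_k = 0 contribute zero."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)


def position_frequencies(sequences: Sequence[str], position: int) -> Dict[str, float]:
    """Observed nucleotide frequencies at one aligned position."""
    counts = Counter(seq[position] for seq in sequences)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def total_information(sequences: Sequence[str]) -> float:
    """Sum of per-position relative entropies over all aligned positions."""
    length = len(sequences[0])
    return sum(
        relative_entropy(position_frequencies(sequences, i)) for i in range(length)
    )
```

A position that matches the background exactly contributes zero bits, whereas an invariant G contributes log2(1/0.3), roughly 1.74 bits.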

Identification of non-sense codons

For each A5EΔ4 splicing exon, the longest cDNA that mapped to the corresponding gene with annotated CDS start and end position was taken as a reference sequence. In most cases such a reference was only available for either the proximal or distal alternative splice form. Identification of mRNAs with the potential to trigger NMD was performed by comparing the reading-frame after splicing at each tandem donor. Tandem events led to a new reading-frame, the first downstream non-sense codon of which was detected; PTCs occurring more than 50 nucleotides upstream of the last exon-exon junction were considered to elicit NMD [5, 50].
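The 50-nucleotide rule can be sketched as follows. The sketch assumes a spliced mRNA given as a DNA-alphabet string, a 0-based CDS start, and junction positions given as the index of the first nucleotide of each downstream exon. Measuring the distance from the end of the stop codon is one possible convention and is an assumption here, as are the function names.

```python
from typing import Optional, Sequence

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(mrna: str, cds_start: int) -> Optional[int]:
    """0-based index of the first in-frame stop codon, or None if absent."""
    for i in range(cds_start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None


def is_nmd_candidate(
    mrna: str, cds_start: int, junctions: Sequence[int], min_distance: int = 50
) -> bool:
    """True if the first stop lies more than min_distance nt upstream of the last junction."""
    stop = first_stop(mrna, cds_start)
    if stop is None or not junctions:
        return False
    last_junction = max(junctions)
    # Distance from the nucleotide after the stop codon to the last junction.
    return last_junction - (stop + 3) > min_distance
```

Applied to both isoforms of a tandem donor, the Δ4 form shifts the frame by four nucleotides downstream of the donor, so `first_stop` will in general return different positions for the two forms.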

Detection of sequence conservation

The core of A5EΔ4 splicing exons was matched against mouse genomic DNA (version mm03), using BLAST with parameter values -a2 -gT -W10 -q-2 -r3 -e0.001. Significant matches of similarity were filtered for canonical splice sites, and exon-flanking regions of 200 nucleotides were extracted from the genomic sequence. Subsequently, orthologous human and mouse intron regions were aligned using the DNA BLOCK ALIGNER [78], with parameter values -nomatchn -gap 0.02 -blockopen 0.2 -umatch 0.05 -pff, which detects blocks of conserved sequences located at possibly different positions relative to splice junctions. The sequence position of detected blocks of conservation was parsed and recorded with the script DBA-PARSER (Holste, unpublished data) and plotted in a region of 100 nucleotides, with a moving-average of ten nucleotides. Exon conservation was determined by the score (Sort) from CLUSTALW alignments, self-alignment of the larger exons to yield the score Sid, and calculation of the normalized score Stot = Sort/Sid.
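The two numerical steps at the end of this procedure, smoothing and normalization, are simple enough to sketch. The code below is not the DBA-PARSER script; it assumes a per-nucleotide conservation profile as input and returns one average per full window, since the source does not state whether the moving average was centered or trailing.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 10) -> List[float]:
    """Average over each full window of consecutive positions."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def normalized_score(s_ort: float, s_id: float) -> float:
    """Stot = Sort / Sid: ortholog alignment score over self-alignment score."""
    if s_id <= 0:
        raise ValueError("self-alignment score must be positive")
    return s_ort / s_id
```

With a 100-nucleotide profile and a window of ten, this yields 91 smoothed values.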

Experimental assay

1) RT-PCR amplification: For validation of splice variants, nested PCR was performed using 100 ng cDNA templates from the Human Multiple Tissue cDNA Panels I and II (BD Biosciences). Splice variants were enriched for ESTs originating from different cDNA libraries and, for a given gene, suitable tissues were chosen according to the origin of ESTs for the minor splice variant or the expression profile found in the Stanford SOURCE database [79]. Primers were obtained from Metabion. Nested RT-PCR reactions were set up with ReadyToGo PCR beads (Amersham) and 10 pmol primer in 25 μl total volume, according to the manufacturer's instructions. The thermocycle protocol was 1 min 30 sec initial denaturation at 93°C, followed by 25 cycles of 40 sec denaturation at 93°C, 40 sec annealing at 55°C, 1 min extension at 72°C, and a final 4 min extension step at 72°C. In the second round of nested PCR, 2 μl first-round product was amplified for 30 cycles. Ethanol-precipitated PCR products were directly sequenced using target-specific forward and reverse primers.

2) Sanger sequencing: Reactions were set up with 200 ng template DNA, 10 pmol primer, and BigDye v3.1 (Applied Biosystems) in 10 μl final volume, according to the supplier's instructions. The thermocycle protocol was 5 min initial denaturation at 95°C, followed by 29 cycles of 30 s denaturation at 95°C, 10 s annealing at 55°C, 4 min extension at 60°C. After ethanol precipitation, automated sequence separation and detection was done on an ABI 3730XL sequencer. Electropherograms were processed by PHRED [80]. After automated assembly (Staden package, [81]), sequence variations were verified by manual inspection using GAP4 (Staden package).

Presence of splicing-regulatory elements

Searching for splicing regulatory elements in exon-flanking regions was performed by using the following data sets (compiled in [82]): 176 predicted exonic splicing silencers identified in Wang et al. [83], 753 predicted intronic enhancers and/or silencers identified in Yeo et al. [29], and 1,013 putative exonic splicing silencers identified in Zhang et al. [84]. All elements were searched for in a region of 100 nucleotides flanking proximal tandem donors, and exact matches were counted in non-overlapping sequence windows of 20 nucleotides.
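The window-based counting can be sketched as below. Two details are assumptions of the sketch rather than statements of the source: a match is assigned to the window that contains its first nucleotide, and overlapping occurrences of the same motif are each counted.

```python
from typing import Iterable, List


def count_matches_per_window(region: str, motifs: Iterable[str], window: int = 20) -> List[int]:
    """Count exact motif matches per non-overlapping window of the region."""
    n_windows = -(-len(region) // window)  # ceiling division
    counts = [0] * n_windows
    for motif in motifs:
        if not motif:
            continue  # skip empty motifs, which would match everywhere
        start = region.find(motif)
        while start != -1:
            counts[start // window] += 1
            # Advance by one position so overlapping matches are found too.
            start = region.find(motif, start + 1)
    return counts
```

For a 100-nucleotide flanking region and the default window, the function returns five counts, one per 20-nucleotide window.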

Gene ontology (GO) annotations

GO-terms for genes with A5EΔ4 splicing exons (358 GO terms), A5Es (1,414), and CEs (3,655) were obtained from the Ensembl database (see Availability and requirements section for URL), corresponding to 129 and 1,283 genes with A5EΔ4 splicing exons or A5Es, respectively, and 8,664 genes of a control set. GO annotations for A5EΔ4 splicing exons of 129 of 166 genes (representing the total set of 171 A5EΔ4 splicing exons) were mapped, and the most frequent annotations in the categories "molecular function" and "biological process" were selected; in decreasing order: "ATP binding", "Zinc ion binding", "Regulation of transcription, DNA-dependent", "Transferase activity", "Signal transduction", "Hydrolase activity", "RNA binding", "Protein binding", "Transcription factor activity" and "DNA binding". In order to compare the GO annotations of A5EΔ4 genes against a control, 10,000 sets of genes with at least one pseudo splice site, dΨ4 or pΨ4 splicing exons (each comprising 129 genes), were sampled and the frequency of occurrence of a certain GO term was computed. The statistical significance (P-value) was calculated analogous to [29], by counting the number of samples in which a certain GO-term was present in the control more frequently than in the A5EΔ4 gene set, divided by 10,000. The outcome showed the following categories as significant at the 0.005 percent level: "Signal transduction" (PΔ4/dΔ4 vs 5'ss/dΨ4, 0.07; DΔ4/pΔ4 vs 5'ss/pΨ4, 0.15), "RNA binding" (0.0004; 0.003), "GTP binding" (0.02; 0.04), "Electron transport" (0.02; 0.03), "Protein biosynthesis" (0.01; 0.03), "Signal transducer activity" (0.04; 0.08). To correct for multiple testing, we applied a (conservative) Bonferroni correction [85], dividing the chosen P-value by the number of performed tests, and GO-terms occurring with Pc < 0.05/10 = 0.005 were considered as significant.
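A minimal sketch of such a sampling test is shown below. It assumes that each control gene is represented by the set of its GO terms and that sampling is done without replacement; the function names, the fixed random seed and the data layout are hypothetical choices made for illustration.

```python
import random
from typing import Sequence, Set


def empirical_p_value(
    observed_count: int,
    control_annotations: Sequence[Set[str]],
    term: str,
    sample_size: int = 129,
    n_samples: int = 10000,
    seed: int = 0,
) -> float:
    """Fraction of control samples in which the term is more frequent than observed."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_samples):
        # Draw one control gene set without replacement.
        sample = rng.sample(list(control_annotations), sample_size)
        count = sum(1 for terms in sample if term in terms)
        if count > observed_count:
            exceed += 1
    return exceed / n_samples


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests
```

With alpha = 0.05 and ten tests, the corrected threshold is 0.005, matching the value used above.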

Availability and requirements

Original and supplementary data files are available. Additional File 1: Figures: human exon length distribution; scatter plots of 5'ss scores of competitive and tandem donors extracted from mouse (M. musculus); occurrences of A5E and A5EΔ4 exons in REFSEQ sequences; WEBLOGO representations of A5Es and A3Es for E = 3, 4,...,15 nucleotides; Tables: statistics of human and mouse cDNA/EST-to-genome alignments; statistics of MAXENT score distributions of PΔ4/pΔ4 and DΔ4/dΔ4 splicing exons; transcript coverage of all possible dinucleotides (NN) defining donors with the motif /GTNN/GT; sequence conservation levels for exons (CLUSTALW) and their flanking regions of 14 A5EΔ4 splicing exons assayed in RT-PCR experiments; Electropherograms. Additional File 2: A5EΔ4 splicing exons with tandem donor sequences.

Pubmed: http://www.pubmed.org

UCSC: http://genome.ucsc.edu

Ensembl database: http://www.ensembl.org

Abbreviations

AS: alternative splicing or alternatively spliced

5'ss: 5' splice site

3'ss: 3' splice site

cDNA: complementary DNA

EST: expressed sequence tag

SE: skipped exon

A5E: alternative 5'ss exon

A3E: alternative 3'ss exon

PΔ4 (pΔ4): proximal-major (proximal-minor) tandem donor

DΔ4 (dΔ4): distal-major (distal-minor) donor

pΨ4: constitutive exon with sequence match to /GT 3', but not 5', of the splice site (and with lack of evidence of AS)

dΨ4: constitutive exon with sequence match to /GT 5', but not 3', of the splice site (and with lack of evidence of AS)

PTC: premature termination codon

DTC: delayed termination codon.

References

  1. Jurica MS, Moore MJ: Pre-mRNA splicing: awash in a sea of proteins. Mol Cell. 2003, 12 (1): 5-14. 10.1016/S1097-2765(03)00270-3.

  2. Berget SM: Exon recognition in vertebrate splicing. J Biol Chem. 1995, 270 (6): 2411--2414.

  3. Ladd AN, Cooper TA: Finding signals that regulate alternative splicing in the post-genomic era. Genome Biol. 2002, 3 (11): reviews0008-10.1186/gb-2002-3-11-reviews0008.

  4. Graveley BR: Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 2001, 17 (2): 100--107. 10.1016/S0168-9525(00)02176-4.

  5. Maquat LE: Nonsense-mediated mRNA decay: splicing, translation and mRNP dynamics. Nat Rev Mol Cell Biol. 2004, 5 (2): 89--99. 10.1038/nrm1310.

  6. Mironov AA, Fickett JW, Gelfand MS: Frequent alternative splicing of human genes. Genome Res. 1999, 9 (12): 1288-1293. 10.1101/gr.9.12.1288.

  7. Brett D, Hanke J, Lehmann G, Haase S, Delbruck S, Krueger S, Reich J, Bork P: EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Letters. 2000, 474 (1): 83-86. 10.1016/S0014-5793(00)01581-7.

  8. Modrek B, Resch A, Grasso C, Lee C: Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 2001, 29 (13): 2850-2859. 10.1093/nar/29.13.2850.

  9. Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD: Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science. 2003, 302 (5653): 2141-2144. 10.1126/science.1090100.

  10. Sugnet CW, Kent WJ, Ares M, Haussler D: Transcriptome and genome conservation of alternative splicing events in humans and mice. Pac Symp Biocomput. 2004, 66--77.

  11. Sorek R, Shemesh R, Cohen Y, Basechess O, Ast G, Shamir R: A non-EST-based method for exon-skipping prediction. Genome Res. 2004, 14 (8): 1617--1623. 10.1101/gr.2572604.

  12. Yeo GW, Nostrand EV, Holste D, Poggio T, Burge CB: Identification and analysis of alternative splicing events conserved in human and mouse. Proc Natl Acad Sci U S A. 2005, 102 (8): 2850--2855. 10.1073/pnas.0409742102.

  13. Graveley BR: Sex, AGility, and the regulation of alternative splicing. Cell. 2002, 109 (4): 409--412. 10.1016/S0092-8674(02)00750-X.

  14. Mouchel N, Broackes-Carter F, Harris A: Alternative 5' exons of the CFTR gene show developmental regulation. Hum Mol Genet. 2003, 12 (7): 759--769. 10.1093/hmg/ddg079.

  15. Hutton M, Lendon CL, Rizzu P, Baker M, Froelich S, Houlden H, Pickering-Brown S, Chakraverty S, Isaacs A, Grover A, Hackett J, Adamson J, Lincoln S, Dickson D, Davies P, Petersen RC, Stevens M, de Graaff E, Wauters E, van Baren J, Hillebrand M, Joosse M, Kwon JM, Nowotny P, Che LK, Norton J, Morris JC, Reed LA, Trojanowski J, Basun H, Lannfelt L, Neystat M, Fahn S, Dark F, Tannenberg T, Dodd PR, Hayward N, Kwok JB, Schofield PR, Andreadis A, Snowden J, Craufurd D, Neary D, Owen F, Oostra BA, Hardy J, Goate A, van Swieten J, Mann D, Lynch T, Heutink P: Association of missense and 5'-splice-site mutations in tau with the inherited dementia FTDP-17. Nature. 1998, 393 (6686): 702--705. 10.1038/31508.

  16. Blanchette M, Chabot B: Modulation of exon skipping by high-affinity hnRNP A1-binding sites and by intron elements that repress splice site utilization. Embo J. 1999, 18 (7): 1939-1952. 10.1093/emboj/18.7.1939.

  17. Shomron N, Alberstein M, Reznik M, Ast G: Stress alters the subcellular distribution of hSlu7 and thus modulates alternative splicing. J Cell Sci. 2005, 118 (Pt 6): 1151-1159. 10.1242/jcs.01720.

  18. Zavolan M, Kondo S, Schonbach C, Adachi J, Hume DA, Hayashizaki Y, Gaasterland T, Group RIKENG, Members GSL: Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 2003, 13 (6B): 1290--1300. 10.1101/gr.1017303.

  19. Akerman M, Mandel-Gutfreund Y: Alternative splicing regulation at tandem 3' splice sites. Nucleic Acids Res. 2006, 34 (1): 23--31. 10.1093/nar/gkj408.

  20. Hiller M, Huse K, Szafranski K, Jahn N, Hampe J, Schreiber S, Backofen R, Platzer M: Widespread occurrence of alternative splicing at NAGNAG acceptors contributes to proteome plasticity. Nat Genet. 2004, 36 (12): 1255--1257. 10.1038/ng1469.

  21. Xu Q, Modrek B, Lee C: Genome-wide detection of tissue-specific alternative splicing in the human transcriptome. Nucleic Acids Res. 2002, 30 (17): 3754-3766. 10.1093/nar/gkf492.

  22. Yeo G, Holste D, Kreiman G, Burge CB: Variation in alternative splicing across human tissues. Genome Biol. 2004, 5 (10): R74-10.1186/gb-2004-5-10-r74.

  23. Holste D, Huo G, Tung V, Burge CB: HOLLYWOOD: a comparative relational database of alternative splicing. Nucleic Acids Res. 2006, 34 (Database issue): D56--D62. 10.1093/nar/gkj048.

  24. Chern TM, van Nimwegen E, Kai C, Kawai J, Carninci P, Hayashizaki Y, Zavolan M: A simple physical model predicts small exon length variations. PLoS Genet. 2006, 2 (4): e45-10.1371/journal.pgen.0020045.

  25. Dou Y, Fox-Walsh KL, Baldi PF, Hertel KJ: Genomic splice-site analysis reveals frequent alternative splicing close to the dominant splice site. RNA. 2006

  26. Roca X, Sachidanandam R, Krainer AR: Determinants of the inherent strength of human 5' splice sites. RNA. 2005, 11 (5): 683--698. 10.1261/rna.2040605.

  27. Wang Z, Xiao X, Nostrand EV, Burge CB: General and specific functions of exonic splicing silencers in splicing control. Mol Cell. 2006, 23 (1): 61--70. 10.1016/j.molcel.2006.05.018.

  28. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005, 15 (8): 1034-1050. 10.1101/gr.3715005.

  29. Yeo GW, Nostrand EL, Liang TY: Discovery and analysis of evolutionarily conserved intronic splicing regulatory elements. PLoS Genet. 2007, 3 (5): e85-10.1371/journal.pgen.0030085.

  30. Katsu R, Onogi H, Wada K, Kawaguchi Y, Hagiwara M: Novel SR-rich-related protein clasp specifically interacts with inactivated Clk4 and induces the exon EB inclusion of Clk. J Biol Chem. 2002, 277 (46): 44220--44228. 10.1074/jbc.M206504200.

  31. Lin CL, Leu S, Lu MC, Ouyang P: Over-expression of SR-cyclophilin, an interaction partner of nuclear pinin, releases SR family splicing factors from nuclear speckles. Biochem Biophys Res Commun. 2004, 321 (3): 638-647. 10.1016/j.bbrc.2004.07.013.

  32. Xing Y, Lee C: Alternative splicing and RNA selection pressure--evolutionary consequences for eukaryotic genomes. Nat Rev Genet. 2006, 7 (7): 499--509. 10.1038/nrg1896.

  33. Roca X, Sachidanandam R, Krainer AR: Intrinsic differences between authentic and cryptic 5' splice sites. Nucleic Acids Res. 2003, 31 (21): 6321--6333. 10.1093/nar/gkg830.

  34. Yeo G, Burge CB: Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol. 2004, 11 (2-3): 377--394. 10.1089/1066527041410418.

  35. Holste D, Grosse I, Buldyrev SV, Stanley HE, Herzel H: Optimization of coding potentials using positional dependence of nucleotide frequencies. J Theor Biol. 2000, 206 (4): 525-537. 10.1006/jtbi.2000.2144.

  36. Duda RO, Hart PE, Stork DG: Pattern Classification. 2000, Wiley & Sons

  37. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.

  38. Bi J, Xia H, Li F, Zhang X, Li Y: The effect of U1 snRNA binding free energy on the selection of 5' splice sites. Biochem Biophys Res Commun. 2005, 333 (1): 64--69. 10.1016/j.bbrc.2005.05.078.

  39. Freund M, Asang C, Kammler S, Konermann C, Krummheuer J, Hipp M, Meyer I, Gierling W, Theiss S, Preuss T, Schindler D, Kjems J, Schaal H: A novel approach to describe a U1 snRNA binding site. Nucleic Acids Res. 2003, 31 (23): 6963--6975. 10.1093/nar/gkg901.

  40. Lund M, Kjems J: Defining a 5' splice site by functional selection in the presence and absence of U1 snRNA 5' end. RNA. 2002, 8 (2): 166--179. 10.1017/S1355838202010786.

  41. Staley JP, Guthrie C: An RNA switch at the 5' splice site requires ATP and the DEAD box protein Prp28p. Mol Cell. 1999, 3 (1): 55-64. 10.1016/S1097-2765(00)80174-4.

  42. Carmel I, Tal S, Vig I, Ast G: Comparative analysis detects dependencies among the 5' splice-site positions. RNA. 2004, 10 (5): 828--840. 10.1261/rna.5196404.

  43. Philipps DL, Park JW, Graveley BR: A computational and experimental approach toward a priori identification of alternatively spliced exons. Rna. 2004, 10 (12): 1838-1844. 10.1261/rna.7136104.

  44. Koren E, Lev-Maor G, Ast G: The emergence of alternative 3' and 5' splice site exons from constitutive exons. PLoS Comput Biol. 2007, 3 (5): e95-10.1371/journal.pcbi.0030095.

  45. Caputi M, Zahler AM: Determination of the RNA binding specificity of the heterogeneous nuclear ribonucleoprotein (hnRNP) H/H'/F/2H9 family. J Biol Chem. 2001, 276 (47): 43850-43859. 10.1074/jbc.M102861200.

  46. McCullough AJ, Berget SM: An intronic splicing enhancer binds U1 snRNPs to enhance splicing and select 5' splice sites. Mol Cell Biol. 2000, 20 (24): 9225--9235. 10.1128/MCB.20.24.9225-9235.2000.

  47. Martinez-Contreras R, Fisette JF, Nasim FU, Madden R, Cordeau M, Chabot B: Intronic binding sites for hnRNP A/B and hnRNP F/H proteins stimulate pre-mRNA splicing. PLoS Biol. 2006, 4 (2): e21-10.1371/journal.pbio.0040021.

  48. Wang E, Dimova N, Cambi F: PLP/DM20 ratio is regulated by hnRNPH and F and a novel G-rich enhancer in oligodendrocytes. Nucleic Acids Res. 2007, 35 (12): 4164-4178. 10.1093/nar/gkm387.

  49. Brackenridge S, Wilkie AO, Screaton GR: Efficient use of a 'dead-end' GA 5' splice site in the human fibroblast growth factor receptor genes. Embo J. 2003, 22 (7): 1620-1631. 10.1093/emboj/cdg163.

  50. Lewis BP, Green RE, Brenner SE: Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc Natl Acad Sci U S A. 2003, 100 (1): 189--192. 10.1073/pnas.0136770100.

  51. Mansilla A, López-Sánchez C, de la Rosa EJ, García-Martínez V, Martínez-Salas E, de Pablo F, Hernández-Sánchez C: Developmental regulation of a proinsulin messenger RNA generated by intron retention. EMBO Rep. 2005, 6 (12): 1182--1187. 10.1038/sj.embor.7400539.

  52. Zhang J, Zhang W, Zou D, Chen G, Wan T, Zhang M, Cao X: Cloning and functional characterization of ACAD-9, a novel member of human acyl-CoA dehydrogenase family. Biochem Biophys Res Commun. 2002, 297 (4): 1033-1042. 10.1016/S0006-291X(02)02336-7.

  53. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 2006, 34 (Database issue): D257-60. 10.1093/nar/gkj079.

  54. Green RE, Lewis BP, Hillman RT, Blanchette M, Lareau LF, Garnett AT, Rio DC, Brenner SE: Widespread predicted nonsense-mediated mRNA decay of alternatively-spliced transcripts of human normal and disease genes. Bioinformatics. 2003, 19 Suppl 1: i118--i121. 10.1093/bioinformatics/btg1015.

  55. Liu HX, Cartegni L, Zhang MQ, Krainer AR: A mechanism for exon skipping caused by nonsense or missense mutations in BRCA1 and other genes. Nat Genet. 2001, 27 (1): 55-58. 10.1038/83762.

  56. Wollerton MC, Gooding C, Wagner EJ, Garcia-Blanco MA, Smith CWJ: Autoregulation of polypyrimidine tract binding protein by alternative splicing leading to nonsense-mediated decay. Mol Cell. 2004, 13 (1): 91--100. 10.1016/S1097-2765(03)00502-1.

  57. Baek D, Green P: Sequence conservation, relative isoform frequencies, and nonsense-mediated decay in evolutionarily conserved alternative splicing. Proc Natl Acad Sci U S A. 2005, 102 (36): 12813--12818. 10.1073/pnas.0506139102.

  58. Ast G: How did alternative splicing evolve?. Nat Rev Genet. 2004, 5 (10): 773--782. 10.1038/nrg1451.

  59. Eperon IC, Ireland DC, Smith RA, Mayeda A, Krainer AR: Pathways for selection of 5' splice sites by U1 snRNPs and SF2/ASF. Embo J. 1993, 12 (9): 3607-3617.

  60. Chabot B, Blanchette M, Lapierre I, La Branche H: An intron element modulating 5' splice site selection in the hnRNP A1 pre-mRNA interacts with hnRNP A1. Mol Cell Biol. 1997, 17 (4): 1776-1786.

  61. Eperon IC, Makarova OV, Mayeda A, Munroe SH, Caceres JF, Hayward DG, Krainer AR: Selection of Alternative 5´ Splice Sites: Role of U1snRNP and Models for the Antagonistic Effects of SF2/ASF and hnRNP A1. Molecular and Cellular Biology. 2000, 20 (22): 8303-8318. 10.1128/MCB.20.22.8303-8318.2000.

  62. Borensztajn K, Sobrier ML, Duquesnoy P, Fischer AM, Tapon-Bretaudière J, Amselem S: Oriented Scanning Is the Leading Mechanism Underlying 5' Splice Site Selection in Mammals. PLoS Genet. 2006, 2 (9):

  63. Sorek R, Ast G: Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 2003, 13 (7): 1631--1637. 10.1101/gr.1208803.

  64. Buhler M, Steiner S, Mohn F, Paillusson A, Muhlemann O: EJC-independent degradation of nonsense immunoglobulin-mu mRNA depends on 3' UTR length. Nat Struct Mol Biol. 2006, 13 (5): 462-464. 10.1038/nsmb1081.

  65. Santisteban I, Arredondo-Vega FX, Kelly S, Mary A, Fischer A, Hummell DS, Lawton A, Sorensen RU, Stiehm ER, Uribe L: Novel splicing, missense, and deletion mutations in seven adenosine deaminase-deficient patients with late/delayed onset of combined immunodeficiency disease. Contribution of genotype to phenotype. J Clin Invest. 1993, 92 (5): 2291-2302. 10.1172/JCI116833.

  66. O'Neill JP, Rogan PK, Cariello N, Nicklas JA: Mutations that alter RNA splicing of the human HPRT gene: a review of the spectrum. Mutat Res. 1998, 411 (3): 179-214. 10.1016/S1383-5742(98)00013-1.

  67. Untergasser G, Hermann M, Rumpold H, Berger P: Complex alternative splicing of the GH-V gene in the human testis. Eur J Endocrinol. 1998, 139 (4): 424-427. 10.1530/eje.0.1390424.

  68. Buratti E, Baralle FE: Influence of RNA secondary structure on the pre-mRNA splicing process. Mol Cell Biol. 2004, 24 (24): 10505-10514. 10.1128/MCB.24.24.10505-10514.2004.

  69. Hiller M, Pudimat R, Busch A, Backofen R: Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. 2006, 34 (17): e117. 10.1093/nar/gkl544.

  70. Kloetzel PM: Generation of major histocompatibility complex class I antigens: functional interplay between proteasomes and TPPII. Nat Immunol. 2004, 5 (7): 661-669. 10.1038/ni1090.

  71. Yewdell JW, Reits E, Neefjes J: Making sense of mass destruction: quantitating MHC class I antigen presentation. Nat Rev Immunol. 2003, 3 (12): 952-961. 10.1038/nri1250.

  72. Shastri N, Schwab S, Serwold T: Producing nature's gene-chips: the generation of peptides for display by MHC class I molecules. Annu Rev Immunol. 2002, 20: 463-493. 10.1146/annurev.immunol.20.100301.064819.

  73. Wang RF, Parkhurst MR, Kawakami Y, Robbins PF, Rosenberg SA: Utilization of an alternative open reading frame of a normal gene in generating a novel human cancer antigen. J Exp Med. 1996, 183 (3): 1131-1140. 10.1084/jem.183.3.1131.

  74. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8 (9): 967-974.

  75. Zhang M, Gish W: Improved spliced alignment from an information theoretic approach. Bioinformatics. 2006, 22 (1): 13-20. 10.1093/bioinformatics/bti748.

  76. Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664. 10.1101/gr.229202. Article published online before March 2002.

  77. Gorodkin J, Heyer LJ, Brunak S, Stormo GD: Displaying the information contents of structural RNA alignments: the structure logos. Comput Appl Biosci. 1997, 13 (6): 583-586.

  78. Jareborg N, Birney E, Durbin R: Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res. 1999, 9 (9): 815-824. 10.1101/gr.9.9.815.

  79. Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res. 2003, 31 (1): 219-223. 10.1093/nar/gkg014.

  80. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8 (3): 175-185.

  81. Staden R, Beal KF, Bonfield JK: The Staden package, 1998. Methods Mol Biol. 2000, 132: 115-130.

  82. Holste D, Ohler U: Strategies for identifying RNA splicing regulatory motifs and predicting alternative splicing events. PLoS Comput Biol. 2008, 4 (1): e21. 10.1371/journal.pcbi.0040021.

  83. Wang Z, Rolish ME, Yeo G, Tung V, Mawson M, Burge CB: Systematic identification and analysis of exonic splicing silencers. Cell. 2004, 119 (6): 831-845. 10.1016/j.cell.2004.11.010.

  84. Zhang XHF, Chasin LA: Computational definition of sequence motifs governing constitutive exon splicing. Genes Dev. 2004, 18 (11): 1241-1250. 10.1101/gad.1195304.

  85. Glantz SA: Primer of Biostatistics. Edited by: Reinhardt S, Davis K. 2001, McGraw-Hill/Appleton & Lange, 5th edition.

Acknowledgements

We thank Juan Valcarcel for pointing out reference [50] as well as Chris Burge, Florian Losch, and Tracy Bergman for helpful discussions. This research was supported in part by the National Science Foundation, under Grant No. PHY05-51164, during a stay at the Kavli Institute for Theoretical Physics (DH), and by the BMBF Germany within the Jena Center of Bioinformatics (RB).

Author information

Corresponding authors

Correspondence to Ralf Bortfeldt or Dirk Holste.

Additional information

Authors' contributions

DH and RB conceived the study. DH, RB, SS, and KS designed the experiments. RB performed the numerical experiments, and SS and KS performed the laboratory experiments. RB, SS, KS, and DH contributed reagents/materials/analysis tools and analyzed the data. DH, RB and StS wrote the paper. All authors read and approved the final manuscript.

Electronic supplementary material

12864_2008_1395_MOESM1_ESM.doc

Additional file 1: Supplementary material. The data contain supporting figures, tables, and analysed electropherograms for the validation assay. Figures: length distribution of human exons; scatter plots of 5'ss scores of competitive and tandem donors extracted from mouse (M. musculus); occurrences of A5E and A5EΔ4 exons in REFSEQ sequences; and WEBLOGO representations of A5Es and A3Es. Tables: human and mouse cDNA and EST-to-genome alignments; MAXENT score distributions of proximal major and minor (PΔ4 and pΔ4), as well as distal major and minor (DΔ4 and dΔ4), splicing exons; transcript coverage of all dinucleotides with the motif /GTNN/GT; and sequence conservation levels for exons and their flanking regions. (DOC 2 MB)

12864_2008_1395_MOESM2_ESM.xls

Additional file 2: Supplementary material. The data contain inferred and classified human as well as mouse alternative exons used in this study. Part 1) human A5Es, tandem splicing exons; 2) human A5Es, major proximal forms; 3) human A5Es, major distal forms; and 4) mouse A5Es, tandem splicing exons. (XLS 414 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Bortfeldt, R., Schindler, S., Szafranski, K. et al. Comparative analysis of sequence features involved in the recognition of tandem splice sites. BMC Genomics 9, 202 (2008). https://doi.org/10.1186/1471-2164-9-202

  • DOI: https://doi.org/10.1186/1471-2164-9-202

Keywords