Bioinformatics prediction of overlapping frameshifted translation products in mammalian transcripts

Background: Exceptionally, a single nucleotide sequence can be translated in vivo in two different frames to yield distinct proteins. In the case of the G-protein alpha subunit XL-alpha-s transcript, a frameshifted open reading frame (ORF) in exon 1 is translated to yield a structurally distinct protein called Alex, which plays a role in platelet aggregation and neurological processes. We carried out a novel bioinformatics screen for other possible dual-frame translated sequences, based on comparative genomics.

Results: Our method searched human, mouse and rat transcripts in frames +1 and -1 for ORFs which are unusually well conserved at the amino acid level. We name these conserved frameshifted overlapping ORFs 'matreshkas' to reflect their nested character. Select findings of our analysis revealed that the G-protein coupled receptor GPR27 is entirely contained within a frame -1 matreshka, thrombopoietin contains a matreshka which spans ~70% of its length, platelet glycoprotein IIIa (ITGB3) contains a matreshka with the predicted characteristics of a secreted peptide hormone, while the potassium channel KCNK12 contains a matreshka spanning >400 amino acids.

Conclusion: Although the in vivo existence of translated matreshkas has not been experimentally verified, this genome-wide analysis provides strong evidence that substantial overlapping coding sequences exist in a number of human and rodent transcripts.


Background
Overlapping translated open reading frames (tORFs) are usually associated with genomes under selection pressure to remain compact, such as those of viruses. However, such overlaps also exist in mammals. For example in human, an exon is shared by the INK4A and ARF genes and is translated in different frames over 317 bases [1].
Similarly, a transcript fusion between the human EIF4EBP3 and MASK genes results in the translation of 172 bases in two different frames [2]. An alternative splice variant of insulin-like growth factor 1 (IGF-1), called mechano-growth-factor (MGF), contains a frameshift which leads to translation of overlapping reading frames [3]. Expression profiling of IGF-1 and MGF indicates that the variants have distinct physiological roles.
The best-characterised case of overlapping tORFs in mammals is that of XL-alpha-s. This is a splice variant of a G protein alpha subunit, derived from the GNAS complex locus, which is expressed in neuroendocrine and other tissues. The first exon of XL-alpha-s contains a downstream ORF which is frameshifted +1 relative to the XL-alpha-s initiator codon. This ORF gives rise to an entirely different protein called Alex, which is 356 amino acids long in rat [4]. Remarkably, XL-alpha-s and Alex interact, and this interaction can be disrupted by an insertion polymorphism in humans. The polymorphism leads to enhanced receptor-mediated cAMP formation in platelets and fibroblasts and to increased trauma-related bleeding tendency; in two families, neurological problems and brachydactyly were also observed [5]. Furthermore, the XL-alpha-s and Alex ORFs may extend in the 5' direction for several hundred nucleotides more [6], raising the possibility that longer variants of XL-alpha-s and Alex exist. Although the Alex termination codon lies well within 50 base pairs (bp) of the next 3' splice junction, the XL-alpha-s transcript does not appear to be degraded according to the usual rules for nonsense-mediated decay [7]. Figure 1 summarises these cases of overlapping mammalian tORFs.
During our own in silico comparative studies of entire translated human, mouse and rat genomes, we frequently observed overlapping tORFs conserved at the amino acid level. In an effort to explore this relatively uncharacterised aspect of gene structure and evolution, we screened for additional Alex-like cases in human and rodents using a bioinformatics approach. Specifically, we searched human, mouse and rat transcripts for frameshifted conserved tORFs. Conservation of such sequences at the amino acid level may reflect a functional role. A related comparative genomics approach, supported by simulation-based statistics, has recently been published [8]. Based on conservation between human and mouse, Chung et al. convincingly demonstrate that these frameshifted ORFs (which they name alternative reading frames, or ARFs) are highly unlikely to occur by chance. In our study, the term 'matreshka' was coined to describe the overlapping tORFs, in analogy with Russian dolls, as one protein can be thought of as 'hiding' another. It should be kept in mind, however, that all matreshkas reported here are observations based on mRNA sequence. Translation of these sequences in vivo has not yet been experimentally confirmed.

Figure 1. Known examples of overlapping translated ORFs. A) The second exon of the INK4A and ARF genes is shared but translated in different reading frames. B) A transcript fusion can occur between the MASK and EIF4EBP3 genes, via an intermediate exon. In this case, the next exon is translated in a different frame from that in EIF4EBP3 transcripts. C) A splice variation in IGF-1 can cause a frameshift in the translation of C-terminal residues. D) Exon 1 of the G-protein subunit XL-alpha-s contains a second, frameshifted tORF which yields a distinct protein called Alex. Grey shading indicates overlapping frameshifted ORFs; exons and tORFs are drawn approximately to scale.

Results and Discussion
Matreshkas are defined here as overlapping, frameshifted ORFs (relative to a known ORF) in transcripts, which are well conserved at the amino acid level. Matreshkas may potentially represent alternative translation products like Alex, or suggest the existence of functional frameshifted splice variants. To obtain matreshka predictions, we translated in silico all frame +1 and -1 ORFs from known human, mouse and rat "parent" transcripts, and applied a conservation filter (Fig. 2).

Open reading frame translation
The first step was to build human-mouse-rat ortholog triplets, from which frame +1 and -1 ORFs were extracted (frames +1 and -1 are also sometimes known as frames 2 and 3, respectively). Ortholog sets were based on a variation of best reciprocal BLASTP hits, which favoured hits with a higher percentage identity (see Methods). This method yielded 9163 triplets. The coding nucleotide sequence was retrieved for each ortholog, and all translated ORFs (tORFs) at least 50 amino acids (aa) long were extracted computationally.
The matreshka identification pipeline was run twice, each time varying the tORF extraction method. In an effort to identify Alex-like cases, the first run retained only tORFs beginning with methionine (labelled here 'tORF M '). The second run was somewhat more comprehensive, including translated ORFs starting with any amino acid (labelled 'tORF X '). It should be kept in mind that although the tORF M set is contained within the tORF X set, the derived matreshka sets only partially overlap, due to the properties of the conservation filter. The term 'tORF' is used here to collectively refer to tORF M and tORF X .

Figure 2. Matreshka bioinformatics pipeline. Human, rat and mouse transcripts were assembled into orthologous triplets, and all frame +1 and -1 ORFs greater than 50aa in length translated in silico (frames +1 and -1 are sometimes also known as frames 2 and 3, respectively). A conservation filter was then applied to yield matreshka predictions.

This early stage in the analysis, after tORF extraction but before the conservation filter, already yielded surprising results: tORFs beginning with methionine (tORF M ) were found to be ~10-fold rarer in frame -1 than in frame +1 (Fig. 3A). This effect was also seen for tORF X , but in a much less pronounced way (Fig. 3B). The pattern was mirrored by human codon usage frequencies (Fig. 3C): stop codons are more frequent in frame -1 than frame +1, resulting in fewer frame -1 tORFs. In addition, ATG codons are ~5-fold rarer in frame -1 compared to +1, accounting further for the observed scarcity of frame -1 tORF M . In the ARF study [8], the difference between the two frames was echoed in simulated alignments. This apparent suppression of tORF M may be part of a mechanism to minimize the impact of translation initiation errors. Why there should be a difference between frames +1 and -1, however, is unclear. Perhaps an alternative mechanism already exists to prevent erroneous frame +1 translation initiation.
Examination of amino acid sequences in the parent frame sheds some light on how stop codons can occur in other frames. If, for instance, a tyrosine (codons TAT or TAC) is followed by an aspartic acid (codons GAT or GAC), the stop codon TGA occurs in frame -1 if, and only if, the codon used for tyrosine is TAT. Overall, codon TAT is used for tyrosine in 45% of the cases in human genes. However, when tyrosine is directly followed by aspartic acid in a coding sequence, TAT occurs in 75% of the cases to code for tyrosine (Table 1). Thus, the codon which leads to a stop codon in frame -1 is favoured.
In order to explore whether this observation can be generalized, we calculated a 64 by 64 bi-codon usage matrix for a set of human coding sequences (see Additional file 1). Table 1 shows as an example the combinations of two consecutive amino acids (bi-codons), where the second amino acid is aspartic acid (both codons for aspartate begin with GA, therefore the occurrence of a stop codon in frame -1 depends entirely on the first codon). Interestingly, there is a trend for codons to be favored which lead to a stop codon in frame -1, in some cases leading to highly significant probability values (Table 1). The same effect can be observed when the second amino acid in the bi-codon is glutamic acid (Additional file 1). In contrast, if the second amino acid is lysine or asparagine, the opposite trend appears (Additional file 1). This may be due to interacting effects on the complementary nucleotide strand, and illustrates the complexity of bi-codon effects on the appearance of stop codons in frame -1. Note that our analysis did not take into account the base following the stop codon, which can have a significant effect on efficiency of termination of translation [9].
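The tyrosine-aspartate case can be checked mechanically. The following is a minimal sketch, not the code used in the study: the function names are ours, and frame -1 is taken to mean the reading frame that starts at the third base of each parent codon, as in the example above. The enrichment figure simply divides the two percentages quoted in the text.

```python
STOPS = {"TAA", "TAG", "TGA"}

def frame_minus1_codon(codon1, codon2):
    """Codon read in frame -1 across two consecutive parent codons:
    the last base of codon1 followed by the first two bases of codon2."""
    return codon1[2] + codon2[:2]

def creates_frame_minus1_stop(codon1, codon2):
    """True if the frame -1 codon spanning the two parent codons is a stop."""
    return frame_minus1_codon(codon1, codon2) in STOPS

# Tyrosine (TAT or TAC) followed by aspartic acid (GAT or GAC).
for tyr in ("TAT", "TAC"):
    for asp in ("GAT", "GAC"):
        print(tyr, asp, frame_minus1_codon(tyr, asp),
              creates_frame_minus1_stop(tyr, asp))

# Enrichment in the sense of Table 1: frequency of the codon in the
# Tyr-Asp context (75%) divided by its overall frequency for Tyr (45%).
enrichment = 0.75 / 0.45
print(round(enrichment, 2))
```

Only the TAT-containing pairs produce TGA in frame -1, in line with the statement that the outcome depends entirely on the first codon.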

Conservation filter
Having extracted all tORFs, the conservation filter was then applied. Within each ortholog triplet, all human frame +1 and -1 tORFs were compared in pairwise global alignments to all those computed from mouse and from rat. Well-conserved tORF triplets which passed the length similarity and identity criteria were retained for further analysis as candidate matreshkas. Matreshkas derived from tORF M are denoted here as matreshka M , those derived from tORF X are denoted matreshka X , while a lack of subscript refers to both matreshka sets collectively. The breakdown of tORF and matreshka statistics by frame is given in Table 2.
Of the 7679 matreshka X (beginning with any amino acid), 1853 are at least partially redundant with the matreshka M (beginning with methionine only), of which there are 1793 in total. The matreshka M total (1793) is smaller than the number of redundant matreshka X (1853), because the longer matreshka X can contain several of the shorter matreshka M . Frame -1 tORFs were filtered out more effectively than frame +1 (Table 2): the proportion of frame -1 tORF X dropped from 45% to 25% of the total after conservation filtering, and from 9% to 6% for tORF M . The greater stringency of the conservation filter on frame -1 tORFs was expected, and is a consequence of the variation in the third base of each parent codon: this can be demonstrated in a simple exercise by combining all possible nucleotide triplet pairs. For a given amino acid in frame 0, on average ~5 different amino acids can potentially be coded for in frame +1, while ~15 can be coded for in frame -1 (data not shown). The scarcity of frame -1 tORF M , combined with the greater efficacy of the conservation filter, resulted in very few (104) matreshka M in frame -1. For reference, Alex lies in frame +1.
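The 'simple exercise' of combining all nucleotide triplet pairs can be sketched as follows. This is an illustration under our own assumptions (standard genetic code, stop codons excluded from both the parent and the shifted frame, every codon pair weighted equally); the exact counting procedure behind the ~5 and ~15 figures is not given in the text, so the averages printed here need not match them exactly.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reachable(frame):
    """Map each frame 0 amino acid to the set of amino acids that can be
    encoded at the overlapping position in frame +1 or frame -1."""
    out = {}
    for c1, c2 in product(CODON_TABLE, repeat=2):
        aa0 = CODON_TABLE[c1]
        if aa0 == "*" or CODON_TABLE[c2] == "*":
            continue  # parent frame must not contain a stop
        if frame == +1:
            shifted = c1[1:] + c2[0]   # bases 2-3 of c1 plus base 1 of c2
        else:
            shifted = c1[2] + c2[:2]   # base 3 of c1 plus bases 1-2 of c2
        aa = CODON_TABLE[shifted]
        if aa != "*":
            out.setdefault(aa0, set()).add(aa)
    return out

def mean_reachable(frame):
    """Average number of shifted-frame amino acids per frame 0 amino acid."""
    sets = reachable(frame)
    return sum(len(s) for s in sets.values()) / len(sets)

print("frame +1:", mean_reachable(+1))
print("frame -1:", mean_reachable(-1))
```

The frame -1 codon depends on only the third (most degenerate) base of the parent codon, which is why far more amino acids are reachable in that frame and why conservation in frame -1 is harder to maintain.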
No matreshkas were found corresponding to Alex/XL-alpha-s [4], INK4A/ARF [1] or 4E-BP3/MASK [2] because they were not represented in the ortholog triplets (primarily because a RefSeq sequence was not available for all three organisms, human, mouse and rat), and in the case of IGF1 [3] because the region which is translated in two frames is short (16aa, the matreshka length filter being 50aa). As recovery of Alex-like entities but not Alex itself was our primary goal, we used stringent conservation filters that Alex could not have passed: human Alex is only 53% and 55% identical at the amino acid level to mouse and rat Alex, respectively [4], while our conservation threshold was set at 60%. This relatively high threshold was necessary partly because the exploratory nature of the project made clear-cut examples necessary, but was also based on separate benchmark studies we made on the chemokine family. Chemokines are typically quite poorly conserved, and as such provide an indication of how much sequences may diverge while retaining similar functions. For instance, half of mouse-human chemokine orthologs in Homologene [10,11] have an amino acid sequence identity of 60-70% (from 32 entries in total). Since a 60% cut-off seems suitable for this family, and we were interested in detecting possible new peptide ligands, it was chosen as a threshold for the matreshkas.

Figure 3. Length distribution of human tORFs, broken down by frame. A) Redundant methionine-start translated ORFs (tORF M ) and B) non-redundant tORFs beginning with any amino acid (tORF X ). White bars correspond to frame +1 tORFs, grey bars correspond to frame -1. C) Frequencies of methionine and stop codons in frames +1 and -1.
A number of matreshkas mapped to alternative, frameshifted splice variants (see Additional file 2). In some cases, there is evidence that these variants are functionally distinct. Among these are the paired box gene 8 (PAX8) isoform c [12], and X-box protein (XBP1) isoform U [13].
Given the nucleotide sequence similarity between transcripts in orthologous triplets, conserved translation products in alternative reading frames would be expected to occur by chance. Additional parameters were required to differentiate likely matreshkas from false positives. Generally, longer matreshkas are less likely to arise by chance, therefore matreshka length was chosen as a first parameter for selection. A second useful parameter gave an approximate measure of the selection pressure maintaining a potential matreshka: this was the number of amino acid positions where a stop codon could have arisen, truncating the matreshka but leaving the parent (RefSeq) protein sequence unaffected (see Methods for an explicit example), denoted here N stop . Furthermore, we estimated a probability of seeing N stop given the parent amino acid sequence, and considering codon bias (see Methods), but not taking into account the effects of bi-codon bias mentioned above (see 'Open reading frame translation' section).

Table 1: Bi-codon frequencies. The third column lists those codons c for the first amino acid (AA1) in a consecutive pair which lead to a stop in frame -1. The overall relative frequency of c as a codon for AA1 is given in the fifth column. The relative frequency of c in the specific context AA1-AA2 is given in column six. The penultimate column shows the enrichment of the codon c in this context; an enrichment > 1 indicates that this codon is more frequently used for AA1 when it leads to a stop in frame -1 than in the average case. Probability (P) values were assigned using the χ² test, and significance determined with a Bonferroni multiple testing correction (15 analyses). Single (*) and triple (***) asterisks indicate significance at the 0.05 and 0.001 levels, respectively.
To gain clues as to the possible function of matreshkas, all matreshka protein sequences were analysed with functional motif and signal peptide prediction programs. Matreshka nucleotide sequences were also screened for tandem repeats, to exclude artifactual sequences which pass the length filter (e.g., if the repeat contains no stop codon in the matreshka frame). Matreshkas with tandem repeats in any one of the three organisms were discarded from in-depth analysis, even though Alex itself was found to contain such repeats. In addition, matreshka M nucleotide sequences were scanned for consensus Kozak translation initiation motifs [14,15], to help determine whether leaky ribosomal scanning [16] might account for their translation.

The matreshka N stop versus length distribution can be seen in Fig. 4, with outliers of interest highlighted (data available in Additional file 3). Details on the highlighted matreshkas can be found in Table 3. Some matreshkas can be extended into 5' or 3' untranslated regions (UTRs) or (in the case of KCNK12 or POLG) into flanking genomic sequence (in total, 365 matreshkas extend 5' past the boundary of the parent RefSeq). Of particular interest is the THPO matreshka, which is of the rare methionine-start, frame -1 variety.
The matreshkas in Table 3 which do not begin with methionine are unlikely to represent complete proteins, since the human matreshka sequence either contains no methionines (KCNK12, GPR135, POLG, FOXA2) or methionines only very close to the 3' end of the matreshka (GPR27, TAF6). As with Alex and virtually all matreshka predictions, none of the candidates in Table 3 contain known protein functional motifs, as ascertained by pattern matching to Prosite motifs. The following sections describe matreshkas of particular interest.

Case study 1: Matreshka framing super conserved receptor expressed in brain 1 (GPR27)
The first example presented here is of a particularly large matreshka associated with the GPR27 transcript, also called super conserved receptor expressed in brain 1. GPR27 is an aminergic G-protein coupled receptor, cloned from human brain cDNA and part of a family of three members which are highly conserved between vertebrates [17]. This single-exon coding sequence is entirely contained within a frame -1 matreshka X , which begins and ends in the flanking UTRs (Fig. 5A). The matreshka X varies in length between 407 and 418 amino acids depending on the species, and in the RefSeq coding region alone has between 53 (mouse) and 60 (human) positions where a stop codon could have truncated the matreshka X , suggesting selection pressure to maintain it. The probability in human of a stop codon truncating the matreshka, P stop , was estimated at 0.999997. It seems possible from our analysis that the high level of conservation of this gene may be necessary to maintain the protein sequences of both GPR27 and the matreshka. Figure 5B shows the matreshka X amino acid sequence, which like Alex is basic (predicted pI 12.9) and enriched in proline residues (~3-fold more than average). The GPR27 matreshka is unlikely to be translated from a leaky scanning event [16], since it contains only one methionine residue 30 amino acids from its C-terminus. It may be part of the exon of a longer, unknown transcript, in which case the AG straddling its 5' end could be a canonical splice signal.

Case study 2: Matreshka contained in thrombopoietin (THPO)
Thrombopoietin is the major regulator of platelet production by megakaryocytes, with associated disorders including thrombocytosis and thrombocytopenia. Splice variant 1 of THPO contains a matreshka of the scarce frame -1 methionine-start type, which in human is 254 amino acids long, and covers 72% of the length of THPO (Fig. 6A). It contains 55 positions where a stop codon could have occurred without changing the parent RefSeq protein sequence (P stop = 0.999999). For comparison, in frame +1 there are 30 positions where a stop codon could occur, and 12 stop codons actually did. Like Alex, the THPO matreshka is proline-rich (Fig. 6B).
The genomic region upstream of the matreshka M start was computationally scanned for transcription start sites (TSS), in case it is in fact derived from a separate transcript overlapping with the THPO sequence: none was found.
The matreshka M start is 212 nucleotides downstream from the parent start, and the initiator ATG lies in a strong Kozak context (Fig. 6A) in all three organisms. This suggests that the matreshka could be produced from a leaky ribosomal scanning event [16].
An important indication, however, that the THPO matreshka M may be translated not from a leaky scanning event, but instead as part of a frameshifted splice variant, came from a BLAST search of the matreshka M against the UniProt protein database [18]. A 100% identity match was found against the translation of the 'nirs' THPO splice variant. No function specific to the nirs THPO appears to be documented. This variant contains a frameshift downstream of the matreshka M start, which effectively places the 3' half of the transcript in the same frame as the matreshka M (Fig. 6C).
A SNP [19] exists which can generate a stop codon in (and thus truncate) the matreshka M at the 45th amino acid (Fig. 6A). The SNP replaces a glycine with a glutamate at position 116 in THPO itself, on a loop between an alpha-helix and a beta sheet. Gly 116 is within the conserved, 184aa long Erythropoietin-Thrombopoietin domain in the N-terminal portion of THPO. This domain is responsible for binding to THPO's receptor, MPL (myeloproliferative leukemia virus oncogene). Three independent site-directed mutagenesis studies have been carried out to determine which residues are necessary for binding to MPL, based on 3D structural data [20-22]. A total of 14 amino acids were determined to be important for binding, none of which lie within 15aa of Gly 116. It therefore seems unlikely that the SNP affects THPO function. Also of note are two splice variants (labelled '2' and '3' in RefSeq), which retain the SNP and shorten the matreshka M by 4 and 39 aa, respectively.

Case study 3: Matreshka contained in platelet glycoprotein IIIa (ITGB3)
Platelet glycoprotein IIIa (GPIIIa; also known as integrin beta 3, ITGB3) is part of the GPIIb/GPIIIa complex, which mediates platelet aggregation by acting as a receptor for fibrinogen. A frame +1 matreshka M (the same type as Alex) spans 3 exons in the ITGB3 transcript. It is 192 amino acids long in human, and has strong Kozak sequences and a signal peptide prediction in all three organisms (Fig. 7A), coupled with dibasic cleavage sites clustered at the N and C-termini which could potentially produce conserved 'active' secreted peptides (Fig. 7B). Such characteristics are typical of GPCR ligands, for example [23], which are of great value for drug discovery. From an evolutionary perspective, it seems plausible that such a secreted peptide could 'signal' the expression of ITGB3, since any putative receptor would be able to freely coevolve with the matreshka, according to the sequence requirements of ITGB3 itself.
The significance of the matreshka's Kozak motif is underlined by the absence of a Kozak motif for the parent sequence: the first motif in the ITGB3 frame occurs 934 nucleotides downstream of the annotated start codon. This suggests that leaky ribosomal scanning is occurring in the parent frame, and may increase the likelihood that the matreshka M (a further 490 nucleotides downstream) is also translated from leaky scanning.
These are by no means the only interesting predictions. The largest matreshka discovered is 448aa long in human, although it has the potential to reach up to 500aa in mouse (Table 3). It is derived from the KCNK12 potassium channel, a member of the 2-pore domain superfamily of background K+ channels. The G protein-coupled receptor 135, the gamma subunit of a DNA directed polymerase, TAF6 RNA polymerase II, and Forkhead box A2 also contain matreshkas of substantial size (Table 3). These matreshkas may represent alternative proteins, in the way that Alex is, or be translated as part of a frameshifted splice variant (e.g., the THPO nirs splice variant, Fig. 6C). Conservation measurements have previously been used to support the functional importance of frameshifted splice variants [24]. Our findings suggest that other conserved frameshifted splice variants, such as caspase 7 isoform beta, are biologically relevant (see Additional file 2).
In a related study by Chung et al. [8], alternative reading frames (ARFs, which can be considered synonymous with matreshkas) were searched for using a comparative genomics approach in human and mouse. A conservative list of 40 ARF-containing genes was derived from elegant simulation-based statistical and modelling methods. Of these 40 genes, 21 genes were included in our matreshka set (Additional file 4), although the ARF and matreshka sequences were not compared directly, as the former have not been published. Our study had a broader scope, including non-methionine-start reading frames (Table 3). Given the stringent length threshold chosen by Chung et al., our analysis probably has a much greater representation of -1 frame matreshkas (see Table 3). All matreshka sequences have been provided as supplementary data (Additional file 5, Additional file 6), and can be filtered according to our calculated P values (Additional file 3).

Figure 5. G protein-coupled receptor 27 matreshka.

Conclusion
The genome-wide study reported here provides evidence for the existence of potentially dual coding sense-frames in a number of mammalian transcripts. Future studies should aim to determine actual proof-of-translation by raising antibodies against matreshkas, and probing cells or tissues where the mRNAs are known to be expressed. Alternatively, genetic mouse models could be generated which would knock out the putative matreshka while leaving the parent RefSeq sequence intact, thus enabling phenotypic analysis. Future studies could also explore the potential for overlaps between tORFs on the antisense strands. Indeed, genomic mapping of full-length mouse cDNAs has revealed transcriptional forests in which overlap of coding sequences on the sense and antisense strands occurs [25].

Figure 7. Integrin beta 3 matreshka. A) Amino acid sequences of the human, mouse and rat ITGB3 matreshkas, with predicted signal peptide underlined, and predicted dibasic cleavage sites in bold capitals. B) The largest fragment produced by cleavage at dibasic sites would be very well conserved, as shown by this global alignment.

Methods

Generation of matreshkas
The strategy for identifying conserved frameshifted amino acid sequences (matreshkas) in transcripts is summarised in Fig. 2. First of all, we constructed human-mouse-rat orthologous triplets. Open reading frames (ORFs) were extracted from frames +1 and -1, and translated in silico. Sequences above certain conservation thresholds were retained for further analysis.
The starting point for the analysis was the Reference Sequence (RefSeq) set [26]. RefSeq is a well-curated collection of mRNA and corresponding protein sequences maintained by the National Center for Biotechnology Information (NCBI). Protein entries for human, mouse and rat (release 9) were downloaded from the NCBI web site [10]. Orthologous triplets were built on best reciprocal matches in all vs. all BLASTP searches for human-mouse, human-rat and mouse-rat orthologs. For example, if human protein X is BLASTed against all mouse RefSeq proteins and returns protein Y as best hit, and the converse is true (when protein Y is searched against the human proteome, protein X is the best hit), then proteins X and Y are mutual best matches and are predicted to be in silico 'orthologs'. If in turn mouse protein Y has rat protein Z as best mutual match, and human protein X has rat protein Z as best mutual match, the ortholog triplet X-Y-Z can be built. 'Best' BLASTP hits were taken as the hits with the highest percentage identity (at least 60%) at the amino acid level to a query, and covering at least 70% of its length. If no hit matching these criteria was found, the ortholog triplet was not built.
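The triplet-building logic can be sketched as below. This is an illustration only, not the pipeline used in the study: the hit tuples `(subject, percent_identity, query_coverage)` and the protein names are hypothetical stand-ins for parsed BLASTP output, and ties between equally good hits are not handled.

```python
def pick_best(hits, min_identity=60.0, min_coverage=70.0):
    """Return the subject with the highest percentage identity among hits
    that meet the identity and query-coverage thresholds, else None.
    Each hit is a (subject, percent_identity, query_coverage) tuple."""
    ok = [h for h in hits if h[1] >= min_identity and h[2] >= min_coverage]
    return max(ok, key=lambda h: h[1])[0] if ok else None

def reciprocal_pairs(best_ab, best_ba):
    """Keep only pairs (a, b) where a's best hit is b and b's best hit is a."""
    return {a: b for a, b in best_ab.items() if best_ba.get(b) == a}

def build_triplets(hm, mh, hr, rh, mr, rm):
    """Build human-mouse-rat triplets from six best-hit dictionaries
    (hm = human query against mouse, mh = mouse against human, and so on)."""
    hm_pairs = reciprocal_pairs(hm, mh)
    hr_pairs = reciprocal_pairs(hr, rh)
    mr_pairs = reciprocal_pairs(mr, rm)
    triplets = []
    for h, m in hm_pairs.items():
        r = hr_pairs.get(h)
        if r is not None and mr_pairs.get(m) == r:
            triplets.append((h, m, r))
    return triplets

# Toy example with hypothetical identifiers.
hm = {"hX": pick_best([("mY", 92.0, 98.0), ("mQ", 95.0, 40.0)])}
mh, hr, rh = {"mY": "hX"}, {"hX": "rZ"}, {"rZ": "hX"}
mr, rm = {"mY": "rZ"}, {"rZ": "mY"}
print(build_triplets(hm, mh, hr, rh, mr, rm))
```

In the toy data the higher-identity hit `mQ` is rejected because it covers too little of the query, which mirrors the coverage requirement described above.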
The RNA coding sequence for each RefSeq protein in the orthologous sets was then retrieved for in silico generation of frameshifted translated ORFs (tORFs). tORFs at least 50 amino acids long were generated computationally from frames +1 and -1 on the sense strand. We used a software package called Scripter developed by ourselves, although this can be done with Getorf from the EMBOSS package [27,28]. Conservation was ascertained with global alignments between all human-mouse and human-rat ORFs within an ortholog triplet. Scripter was again employed for this step, although the EMBOSS package Needle program would be an adequate substitute. Output files were parsed using Perl scripts. In a given alignment, the shorter ORF had to be >= 90% the length of the longer, the aligned region had to cover >= 80% of the length of the shorter ORF, and the aligned region had to comprise >= 60% identical amino acid residues. The term 'matreshka' was coined to describe a conserved tORF.
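The three alignment criteria can be expressed as a small predicate. The sketch below is not Scripter; it assumes the global alignment has already been computed and is supplied as two equal-length gapped strings, and it interprets the 'aligned region' as the columns in which neither sequence has a gap, which is our assumption rather than a definition given in the text.

```python
def passes_conservation_filter(aligned_a, aligned_b,
                               min_len_ratio=0.90,
                               min_coverage=0.80,
                               min_identity=0.60):
    """Apply the length-ratio, coverage and identity thresholds to a
    pairwise global alignment given as two gapped strings ('-' = gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    shorter, longer = sorted((len_a, len_b))
    if shorter == 0:
        return False
    # Columns where both sequences carry a residue (assumed 'aligned region').
    cols = [(x, y) for x, y in zip(aligned_a, aligned_b)
            if x != "-" and y != "-"]
    if not cols:
        return False
    identical = sum(x == y for x, y in cols)
    return (shorter / longer >= min_len_ratio
            and len(cols) / shorter >= min_coverage
            and identical / len(cols) >= min_identity)

print(passes_conservation_filter("MPRSTLLAAG", "MPKSALLGAG"))  # 7/10 identical
print(passes_conservation_filter("MPRSTLLAAG", "MPKSALQGCG"))  # 5/10 identical
```

Keeping the thresholds as keyword arguments makes it straightforward to explore how the matreshka set changes with, for example, a lower identity cut-off.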
It is important to note that the bioinformatics pipeline was run twice (Fig. 8). The first time ('Method variation 1'), only tORFs beginning with methionine (tORF M ) were retained, in an effort to identify Alex-like candidates. tORF M were allowed to overlap in a redundant fashion (e.g., if a tORF M contained two methionines, two tORF M would be reported). The second run ('Method variation 2') was more comprehensive, and included all frame +1 and -1 ORF translations (tORF X ) regardless of the first predicted amino acid (this set was non-redundant). Although the tORF M sequences are contained within the tORF X set, the subsequent conservation filter produces different matreshka sets.
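The two extraction variants can be illustrated with a short sketch. It is not the Scripter implementation: the function names are ours, frames +1 and -1 are taken as offsets 1 and 2 on the sense strand (frames 2 and 3), and a segment that reaches either end of the input sequence without a stop codon is still reported, which is an assumption on our part.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(nt):
    """Translate a nucleotide string codon by codon ('X' for unknown codons)."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))

def extract_torfs(cds, frame, min_len=50, met_start=False):
    """Return translated ORFs from frame +1 or -1 of a coding sequence.

    met_start=False: one tORF per stop-free segment (tORF X, non-redundant).
    met_start=True:  one tORF per methionine in a segment, running to the
                     end of that segment (tORF M, redundant)."""
    offset = 1 if frame == +1 else 2
    protein = translate(cds[offset:].upper())
    found = []
    for segment in protein.split("*"):
        if met_start:
            for i, aa in enumerate(segment):
                if aa == "M" and len(segment) - i >= min_len:
                    found.append(segment[i:])
        elif len(segment) >= min_len:
            found.append(segment)
    return found

# Toy sequence; min_len is lowered from 50 so the example stays short.
cds = "CGCTGCTTAAATGGCTATGGCTGCT"
print(extract_torfs(cds, +1, min_len=2))
print(extract_torfs(cds, +1, min_len=2, met_start=True))
print(extract_torfs(cds, -1, min_len=2))
```

In the toy example a segment with two methionines yields two overlapping tORF M entries, illustrating the redundancy described above.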

Matreshka analysis
A key parameter which differentiated effectively between matreshkas was the number of stop codons which could have arisen to truncate a matreshka, without affecting the 'parent' (frame 0, RefSeq) protein sequence. This is essentially a very approximate measure of the selection pressure maintaining the matreshka. For example, if part of the parent sequence is a valine followed by an isoleucine residue, the possible codons could be GTA, GTT, GTC and GTG (Val) and ATT, ATC and ATA (Ile). It can be seen that there are several codon combinations which could produce a stop codon in a frame +1 matreshka in this part of the sequence (GTGATT, GTGATC, GTGATA, GTAATT, GTAATC and GTAATA). This simple search can be carried out across the length of the parent sequence which overlaps with a matreshka. The final result counts how many times a matreshka could have been truncated by a stop codon without affecting the parent sequence, but was not truncated in any of human, mouse or rat.
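The valine-isoleucine example can be reproduced with a few lines of code. This is a minimal sketch with names of our own choosing; it counts candidate positions along a single parent protein sequence and does not perform the cross-species check (no truncation in human, mouse or rat) described above.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {c for c, aa in CODON_TABLE.items() if aa == "*"}

# Amino acid -> list of synonymous codons.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)

def shifted_codon(c1, c2, frame):
    """Codon read across two parent codons in frame +1 or frame -1."""
    return c1[1:] + c2[0] if frame == +1 else c1[2] + c2[:2]

def stop_combinations(aa1, aa2, frame):
    """Synonymous codon pairs for aa1-aa2 that create a stop in the
    shifted frame while leaving the parent amino acids unchanged."""
    return [c1 + c2
            for c1 in CODONS_FOR[aa1] for c2 in CODONS_FOR[aa2]
            if shifted_codon(c1, c2, frame) in STOPS]

def count_possible_stops(parent_protein, frame):
    """Number of adjacent residue pairs at which a stop could have arisen."""
    return sum(bool(stop_combinations(a, b, frame))
               for a, b in zip(parent_protein, parent_protein[1:]))

print(sorted(stop_combinations("V", "I", +1)))
print(count_possible_stops("VI", +1))
```

For Val-Ile in frame +1 the function returns the six combinations listed in the text.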
This stop codon count was supplemented with a probability value which takes codon bias into account, and was calculated as follows. From published codon frequencies [29], we calculated the relative probability of observing each codon for a given amino acid (aa), by dividing the frequency of each codon by the sum of frequencies for all codons for aa 'A'. These relative probabilities must sum to 1 for A. Extending this, we can find the relative probability of observing a bi-codon for a pair of amino acids A 1 A 2 . Instead of the binary yes/no answer as to whether a stop codon could truncate a matreshka (described above), we can sum the relative probabilities of bi-codons which do not cause a truncation, obtaining a probability that a matreshka is not truncated at a given position, abbreviated here to P no_stop . The overall probability of a matreshka 'surviving' is derived by multiplying P no_stop across the entire matreshka sequence, abbreviated here to ∏(P no_stop ). This assumes independence of consecutive (overlapping) bi-codons. We chose to report the probability of a matreshka being truncated by a stop, equal to 1 - ∏(P no_stop ).
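The probability calculation can be sketched as follows. The uniform codon frequencies used here are a placeholder so that the example is self-contained; they are not the published human frequencies [29], which would be supplied as a codon-to-frequency dictionary in practice. Function names are ours.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {c for c, aa in CODON_TABLE.items() if aa == "*"}
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)

def relative_codon_probs(codon_freq):
    """Divide each codon frequency by the summed frequency of all codons
    for the same amino acid, so probabilities sum to 1 per amino acid."""
    totals = {}
    for codon, freq in codon_freq.items():
        aa = CODON_TABLE[codon]
        totals[aa] = totals.get(aa, 0.0) + freq
    return {c: f / totals[CODON_TABLE[c]] for c, f in codon_freq.items()}

def shifted_codon(c1, c2, frame):
    return c1[1:] + c2[0] if frame == +1 else c1[2] + c2[:2]

def p_no_stop(aa1, aa2, frame, rel):
    """Summed relative probability of bi-codons for aa1-aa2 that do NOT
    create a stop codon in the shifted frame."""
    return sum(rel[c1] * rel[c2]
               for c1, c2 in product(CODONS_FOR[aa1], CODONS_FOR[aa2])
               if shifted_codon(c1, c2, frame) not in STOPS)

def p_stop(parent_protein, frame, rel):
    """1 minus the product of P_no_stop over all consecutive residue pairs."""
    survive = 1.0
    for a, b in zip(parent_protein, parent_protein[1:]):
        survive *= p_no_stop(a, b, frame, rel)
    return 1.0 - survive

# Placeholder: every sense codon equally frequent (NOT real usage data).
uniform = {c: 1.0 for c, aa in CODON_TABLE.items() if aa != "*"}
rel = relative_codon_probs(uniform)
print(p_stop("VI", +1, rel))
print(p_stop("VIV", +1, rel))
```

Because the per-position probabilities are multiplied, even modest values accumulate quickly over a few hundred residues, which is consistent with the values close to 1 reported for the long matreshkas.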
Matreshkas were also scanned for different types of motifs. Those starting with methionine (matreshka M , derived from tORF M ), for instance, might be translated from leaky ribosomal scanning events. A strong Kozak motif could support initiation of translation at these locations [14]. We therefore scanned the nucleotide context around the ATG start for Kozak motifs. The pattern [AG]NNATGG (where N is any nucleotide, and letters in square brackets indicate alternative bases at a given position) was used, derived from published data on initiator codons from highly curated sequences [15]. The significant enrichment of Kozak motifs in the longest matreshka M was ascertained with the Exact Binomial Test using R software [30], which was also used for the χ² test in Table 1. Matreshkas were additionally scanned for signal peptide motifs using SignalP v.2.0 [31]. BLASTP searches of human matreshkas against UniProt (UniRef100, last updated 13th Sep 2005) were carried out with an e-value threshold of 0.0001. To check for repeating sequence elements, the EMBOSS command line programs Equicktandem and Etandem were used with default settings to scan matreshka nucleotide sequences for tandem repeats. SpotFire DecisionSite v. 8.0 (Spotfire Inc., MA, USA), a data visualisation tool, was used to rapidly select the most interesting matreshka candidates.
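The Kozak pattern translates directly into a regular expression. The sketch below is illustrative: the function name and the example sequences are ours, and N is restricted to A, C, G or T (ambiguity codes are not handled).

```python
import re

# [AG]NNATGG: purine at -3, any two bases, the ATG, then G at +4.
KOZAK = re.compile(r"[AG][ACGT]{2}ATGG")

def has_kozak_context(mrna, atg_index):
    """True if the ATG starting at atg_index sits in an [AG]NNATGG context.
    Returns False when the three upstream bases or the +4 base are missing."""
    start = atg_index - 3
    if start < 0:
        return False
    window = mrna[start:atg_index + 4].upper()
    return KOZAK.fullmatch(window) is not None

print(has_kozak_context("GCCACCATGGCG", 6))   # window ACCATGG
print(has_kozak_context("TTTTTTATGCAA", 6))   # window TTTATGC
```

Checking a fixed window around a known ATG position, rather than searching the whole transcript, keeps the test tied to the specific start codon of each matreshka M.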
Matreshka amino acid composition analysis was performed using the EMBOSS program pepstats [28,32].

Figure 8. ORF translation method. Two variations on the ORF translation step in Fig. 2 were performed. In an effort to identify Alex-like proteins which start with methionine, a first run of the analysis pipeline ('Method variation 1') generated redundant (overlapping) methionine-start ORFs. A second, more comprehensive run ('Method variation 2') generated all ORFs regardless of the first amino acid.