Bioinformatics prediction of overlapping frameshifted translation products in mammalian transcripts
© Ribrioux et al. 2008
Received: 14 June 2007
Accepted: 06 March 2008
Published: 06 March 2008
Exceptionally, a single nucleotide sequence can be translated in vivo in two different frames to yield distinct proteins. In the case of the G-protein alpha subunit XL-alpha-s transcript, a frameshifted open reading frame (ORF) in exon 1 is translated to yield a structurally distinct protein called Alex, which plays a role in platelet aggregation and neurological processes. We carried out a novel bioinformatics screen for other possible dual-frame translated sequences, based on comparative genomics.
Our method searched human, mouse and rat transcripts in frames +1 and -1 for ORFs which are unusually well conserved at the amino acid level. We name these conserved frameshifted overlapping ORFs 'matreshkas' to reflect their nested character. Select findings of our analysis revealed that the G-protein coupled receptor GPR27 is entirely contained within a frame -1 matreshka, thrombopoietin contains a matreshka which spans ~70% of its length, platelet glycoprotein IIIa (ITGB3) contains a matreshka with the predicted characteristics of a secreted peptide hormone, while the potassium channel KCNK12 contains a matreshka spanning >400 amino acids.
Although the in vivo existence of translated matreshkas has not been experimentally verified, this genome-wide analysis provides strong evidence that substantial overlapping coding sequences exist in a number of human and rodent transcripts.
Overlapping translated open reading frames (tORFs) are usually associated with genomes under selection pressure to remain compact, such as those of viruses. However, such overlaps also exist in mammals. For example, in human, an exon is shared by the INK4A and ARF genes and is translated in different frames over 317 bases. Similarly, a transcript fusion between the human EIF4EBP3 and MASK genes results in the translation of 172 bases in two different frames. An alternative splice variant of insulin-like growth factor 1 (IGF-1), called mechano-growth-factor (MGF), contains a frameshift which leads to translation of overlapping reading frames. Expression profiling of IGF-1 and MGF indicates that the variants have distinct physiological roles.
During our own in silico comparative studies of entire translated human, mouse and rat genomes, we frequently observed overlapping tORFs conserved at the amino acid level. In an effort to explore this relatively uncharacterised aspect of gene structure and evolution, we screened for additional Alex-like cases in human and rodents using a bioinformatics approach. Specifically, we searched human, mouse and rat transcripts for frameshifted conserved tORFs. Conservation of such sequences at the amino acid level may reflect a functional role. A related comparative genomics approach, supported by simulation-based statistics, has recently been published. Based on conservation between human and mouse, Chung et al. convincingly demonstrate that these frameshifted ORFs (which they name alternative reading frames, or ARFs) are highly unlikely to occur by chance. In our study, the term 'matreshka' was coined to describe the overlapping tORFs, in analogy with Russian dolls, as one protein can be thought of as 'hiding' another. It should be kept in mind, however, that all matreshkas reported here are observations based on mRNA sequence. Translation of these sequences in vivo has not yet been experimentally confirmed.
The first step was to build human-mouse-rat ortholog triplets, from which frame +1 and -1 ORFs were extracted (frames +1 and -1 are also sometimes known as frames 2 and 3, respectively). Ortholog sets were based on a variation of best reciprocal BLASTP hits, which favoured hits with a higher percentage identity (see Methods). This method yielded 9163 triplets. The coding nucleotide sequence was retrieved for each ortholog, and all translated ORFs (tORFs) at least 50 amino acids (aa) long were extracted computationally.
The matreshka identification pipeline was run twice, each time varying the tORF extraction method. In an effort to identify Alex-like cases, the first run retained only tORFs beginning with methionine (labelled here 'tORFM'). The second run was somewhat more comprehensive, including translated ORFs starting with any amino acid (labelled 'tORFX'). It should be kept in mind that although the tORFM set is contained within the tORFX set, the derived matreshka sets only partially overlap, due to the properties of the conservation filter. The term 'tORF' is used here to collectively refer to tORFM and tORFX.
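The two extraction modes can be sketched as follows. This is a minimal illustration and not the authors' Scripter implementation: the function names, the toy sequence, and the choice to trim a tORFM-like segment to its first methionine are assumptions. Frame +1 is taken as an offset of one base and frame -1 as an offset of two bases, following the 'frames 2 and 3' convention mentioned above.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(nt):
    """Translate codon by codon; an incomplete trailing codon is dropped,
    and codons containing unexpected characters become 'X'."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def extract_torfs(cds, offset, min_len=50, require_met=False):
    """Return stop-free translated segments of at least min_len residues.

    offset=1 reads frame +1, offset=2 reads frame -1 (assumed convention).
    require_met=True mimics the tORFM run by trimming each segment to its
    first methionine (an assumption); False mimics the tORFX run.
    """
    torfs = []
    for segment in translate(cds.upper()[offset:]).split("*"):
        if require_met:
            start = segment.find("M")
            if start == -1:
                continue
            segment = segment[start:]
        if len(segment) >= min_len:
            torfs.append(segment)
    return torfs


# Hypothetical toy sequence; min_len is lowered so the example is readable.
toy = "A" + "GCT" * 3 + "TAA" + "ATG" + "GGT" * 2
torf_x = extract_torfs(toy, offset=1, min_len=3)
torf_m = extract_torfs(toy, offset=1, min_len=3, require_met=True)
```

With the real 50 aa threshold, the same call would be applied to each coding sequence in a triplet, once per frame.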
Bi-codon frequencies. The third column lists those codons c for the first amino acid (AA1) in a consecutive pair which lead to a stop in frame -1. The overall relative frequency of c as a codon for AA1 is given in the fifth column. The relative frequency of c in the specific context AA1-AA2 is given in column six. The penultimate column shows the enrichment of the codon c in this context; an enrichment > 1 indicates that this codon is more frequently used for AA1 when it leads to a stop in frame -1 than in the average case. Probability (P) values were assigned using the χ2 test, and significance determined with a Bonferroni multiple testing correction (15 analyses). Single (*) and triple (***) asterisks indicate significance at the 0.05 and 0.001 levels, respectively.
First codon leading to stop in frame -1
Frequency when leading to stop
In order to explore whether this observation can be generalized, we calculated a 64 by 64 bi-codon usage matrix for a set of human coding sequences (see Additional file 1). Table 1 shows as an example the combinations of two consecutive amino acids (bi-codons), where the second amino acid is aspartic acid (both codons for aspartate begin with GA, therefore the occurrence of a stop codon in frame -1 depends entirely on the first codon). Interestingly, there is a trend for codons to be favored which lead to a stop codon in frame -1, in some cases leading to highly significant probability values (Table 1). The same effect can be observed when the second amino acid in the bi-codon is glutamic acid (Additional file 1). In contrast, if the second amino acid is lysine or asparagine, the opposite trend appears (Additional file 1). This may be due to interacting effects on the complementary nucleotide strand, and illustrates the complexity of bi-codon effects on the appearance of stop codons in frame -1. Note that our analysis did not take into account the base following the stop codon, which can have a significant effect on efficiency of termination of translation.
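The enrichment reported in Table 1 (relative frequency of a codon for AA1 in the context AA1-AA2, divided by its overall relative frequency for AA1) can be sketched as below. The function name and the two toy coding sequences are hypothetical, and the statistical testing step (χ2 with Bonferroni correction) is not reproduced here.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codon_enrichment(cds_list, aa1, aa2):
    """For each codon of aa1, return (relative frequency when followed by aa2)
    divided by (overall relative frequency among aa1 codons)."""
    overall, context = Counter(), Counter()
    for cds in cds_list:
        usable = len(cds) - len(cds) % 3
        codons = [cds[i:i + 3] for i in range(0, usable, 3)]
        for pos, codon in enumerate(codons):
            if CODON_TABLE.get(codon) != aa1:
                continue
            overall[codon] += 1
            nxt = codons[pos + 1] if pos + 1 < len(codons) else None
            if nxt is not None and CODON_TABLE.get(nxt) == aa2:
                context[codon] += 1
    n_all, n_ctx = sum(overall.values()), sum(context.values())
    if n_all == 0 or n_ctx == 0:
        return {}
    return {c: (context[c] / n_ctx) / (overall[c] / n_all) for c in overall}


# Hypothetical toy data: glycine codons, with aspartate as the second residue.
toy_cds = ["GGTGATGGCAAA", "GGTGACGGCGGC"]
enrichment = codon_enrichment(toy_cds, "G", "D")
```

In this toy set, GGT (which ends in T and so creates TGA in frame -1 when followed by GAT or GAC) is the only glycine codon seen before aspartate, so its enrichment exceeds 1.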
Matreshka statistics. Matreshka statistics for tORFs and met-tORFs (generated from a separate run of the analysis, see text), broken down by frame. Note that tORFM are redundant (see Methods).
# Human-Mouse-Rat ortholog triplets
Frame -1 as a percentage of total
# tORF X
# matreshka X
# tORF M
# matreshka M
Of the 7679 matreshkaX (beginning with any amino acid), 1853 are at least partially redundant with the matreshkaM (beginning with methionine only), of which there are 1793 in total. The matreshkaM total (1793) is smaller than the number of redundant matreshkaX (1853), because the longer matreshkaX can contain several of the shorter matreshkaM. Frame -1 tORFs were filtered out more effectively than frame +1 (Table 2): the proportion of frame -1 tORFX dropped from 45% to 25% of the total after conservation filtering, and from 9% to 6% for tORFM. The greater stringency of the conservation filter on frame -1 tORFs was expected, and is a consequence of the variation in the third base of each parent codon: this can be demonstrated in a simple exercise by combining all possible nucleotide triplet pairs. For a given amino acid in frame 0, on average ~5 different amino acids can potentially be coded for in frame +1, while ~15 can be coded for in frame -1 (data not shown). The scarcity of frame -1 tORFM, combined with the greater efficacy of the conservation filter, resulted in very few (104) matreshkaM in frame -1. For reference, Alex lies in frame +1.
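The 'simple exercise' of combining nucleotide triplets can be sketched as follows. The counting scheme is an assumption, since the original exercise is not shown: for each frame 0 amino acid, the sketch enumerates the shifted codon that starts inside the parent codon (at its second base for frame +1, at its third base for frame -1), fills the remaining positions with every possible base, and ignores stop codons. An unweighted average of this kind should give figures of the same order as those quoted above, but need not match them exactly.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})


def reachable(aa, frame):
    """Amino acids that the overlapping shifted codon can encode, given that
    frame 0 encodes aa. frame is +1 or -1."""
    found = set()
    for codon, residue in CODON_TABLE.items():
        if residue != aa:
            continue
        # Bases of the parent codon shared with the shifted codon.
        prefix = codon[1:] if frame == +1 else codon[2:]
        for tail in product(BASES, repeat=3 - len(prefix)):
            shifted = CODON_TABLE[prefix + "".join(tail)]
            if shifted != "*":
                found.add(shifted)
    return found


def average_reachable(frame):
    """Unweighted mean over the 20 amino acids."""
    return sum(len(reachable(aa, frame)) for aa in AMINO_ACIDS) / len(AMINO_ACIDS)


avg_plus1 = average_reachable(+1)
avg_minus1 = average_reachable(-1)
```

Because the frame -1 codon shares only the degenerate third base with its parent codon, it is far less constrained, which is why conservation in frame -1 is the more demanding criterion.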
No matreshkas were found corresponding to Alex/XL-alpha-s, INK4A/ARF or 4E-BP3/MASK because they were not represented in the ortholog triplets (primarily because a RefSeq sequence was not available for all three organisms, human, mouse and rat), nor in the case of IGF1, because the region which is translated in two frames is short (16aa, the matreshka length filter being 50aa). As recovery of Alex-like entities but not Alex itself was our primary goal, we used stringent conservation filters that Alex could not have passed: human Alex is only 53% and 55% identical at the amino acid level to mouse and rat Alex, respectively, while our conservation threshold was set at 60%. This relatively high threshold was necessary partly because the exploratory nature of the project made clear-cut examples necessary, but was also based on separate benchmark studies we made on the chemokine family. Chemokines are typically quite poorly conserved, and as such provide an indication of how much sequences may diverge while retaining similar functions. For instance, half of mouse-human chemokine orthologs in Homologene [10, 11] have an amino acid sequence identity of 60–70% (from 32 entries in total). Since a 60% cut-off seems suitable for this family, and we were interested in detecting possible new peptide ligands, it was chosen as a threshold for the matreshkas.
A number of matreshkas mapped to alternative, frameshifted splice variants (see Additional file 2). In some cases, there is evidence that these variants are functionally distinct. Among these are the paired box gene 8 (PAX8) isoform c, and X-box protein (XBP1) isoform U.
Given the nucleotide sequence similarity between transcripts in orthologous triplets, conserved translation products in alternative reading frames would be expected to occur by chance. Additional parameters were required to differentiate likely matreshkas from false positives. Generally, longer matreshkas are less likely to arise by chance, therefore matreshka length was chosen as a first parameter for selection. A second useful parameter gave an approximate measure of the selection pressure maintaining a potential matreshka: this was the number of amino acid positions where a stop codon could have arisen, truncating the matreshka but leaving the parent (RefSeq) protein sequence unaffected (see Methods for an explicit example), denoted here Nstop. Furthermore, we estimated a probability of seeing Nstop given the parent amino acid sequence, and considering codon bias (see Methods), but not taking into account the effects of bi-codon bias mentioned above (see 'Open reading frame translation' section).
To gain clues as to the possible function of matreshkas, all matreshka protein sequences were analysed with functional motif and signal peptide prediction programs. Matreshka nucleotide sequences were also screened for tandem repeats, to exclude artifactual sequences which pass the length filter (e.g., if the repeat contains no stop codon in the matreshka frame). Matreshkas with tandem repeats in any one of the three organisms were discarded from in-depth analysis, even though Alex itself was found to contain such repeats. In addition, matreshkaM nucleotide sequences were scanned for consensus Kozak translation initiation motifs [14, 15], to help determine whether leaky ribosomal scanning may be occurring. To estimate how frequently Kozak motifs appear by chance, we scanned the 9163 human coding sequences used for building matreshkas, and counted how frequently ATG occurred versus full Kozak sequences (all frames included). On average, ~30 ATGs, of which ~5 (16%) were Kozak motifs, were observed per transcript. Of the 1793 matreshkaM, only 217 (12%) have a Kozak motif exactly at the start codon (data in Additional file 3). However, of the 15 longest matreshkaM, 8 have a Kozak motif exactly at the start codon, which is a significant enrichment of true positives (exact binomial test, N = 15, P = 0.0001).
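The enrichment test can be approximated with the standard library. Two assumptions are made here that the text does not state: the null probability is taken as the observed matreshkaM background (217 of 1793), and the test is one-sided (upper tail). The original analysis was run in R, so this is only a sketch of the arithmetic.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact probability of observing k or more successes in n Bernoulli
    trials with success probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed null: the Kozak rate among all matreshkaM start codons.
background_rate = 217 / 1793

# 8 of the 15 longest matreshkaM carry a Kozak motif at the start codon.
p_value = binomial_upper_tail(8, 15, background_rate)
print(f"upper-tail P = {p_value:.2e}")
```

Under these assumptions the result is of the same order of magnitude as the reported P value; a different choice of null rate (for example the ~16% seen across all ATGs) would give a larger value.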
Characteristics of selected matreshkas. Characteristics of the selected matreshkas from Fig. 4, for human (H.s.), mouse (M.m.) and rat (R.n.). Columns from left to right detail the database accessions of RefSeq transcripts used to extract matreshkas, the RefSeq description, the frame the matreshka lies in, the length in amino acids of the matreshka, whether the matreshkas start with methionine, and the number of positions where a stop codon could have occurred to interrupt the matreshka but leave the RefSeq parent unaffected.
RefSeq parent accessions (H.s., M.m., R.n.) | RefSeq description | Matreshka aa lengths (H.s., M.m., R.n.) | Number of possible stop positions in matreshkas (H.s., M.m., R.n.)
NM_022055; NM_199251; NM_022292 | potassium channel, subfamily K, member 12 (KCNK12) | 418; 430; 430 | 108; 114; 115
NM_018971; NM_008158; NM_023099 | Superconserved receptor expressed in brain 1 (GPR27) | 375; 356; 372 | 60; 53; 55
NM_022571; NM_181752; NM_181771 | G protein-coupled receptor 135 (GPR135) | 290; 284; 268 | 59; 64; 57
NM_000460; NM_009379; NM_031133 |  | 255; 230; 230 | 55; 56; 52
NM_002693; NM_017462; NM_053528 | polymerase (DNA directed), gamma (POLG) | 241; 224; 224 | 53; 55; 56
NM_153675; NM_010446; NM_012743 | forkhead box A2 (FOXA2) | 236; 256; 213 | 52; 58; 45
NM_139315; NM_009315; XM_213729 | TAF6 RNA polymerase II (TAF6) | 235; 220; 220 | 53; 50; 49
NM_000212; NM_016780; NM_153720 | platelet glycoprotein IIIa (ITGB3) | 192; 192; 192 | 6; 6; 6
The matreshkas in Table 3 which do not begin with methionine are unlikely to represent complete proteins, since the human matreshka sequence either contains no methionines (KCNK12, GPR135, POLG, FOXA2) or methionines only very close to the 3' end of the matreshka (GPR27, TAF6). As with Alex and virtually all matreshka predictions, none of the candidates in Table 3 contain known protein functional motifs, as ascertained by pattern matching to Prosite motifs. The following sections describe matreshkas of particular interest.
The genomic region upstream of the matreshkaM start was computationally scanned for transcription start sites (TSS), in case it is in fact derived from a separate transcript overlapping with the THPO sequence: none was found. The matreshkaM start is 212 nucleotides downstream from the parent start, and the initiator ATG lies in a strong Kozak context (Fig. 6A) in all three organisms. This suggests that the matreshka could be produced from a leaky ribosomal scanning event.
An important indication, however, that the THPO matreshkaM may be translated not from a leaky scanning event, but instead as part of a frameshifted splice variant, came from a BLAST search of the matreshkaM against the UniProt protein database. A 100% identity match was found against the translation of the 'nirs' THPO splice variant. No function specific to the nirs THPO appears to be documented. This variant contains a frameshift downstream of the matreshkaM start, which effectively places the 3' half of the transcript in the same frame as the matreshkaM (Fig. 6C).
A SNP exists which can generate a stop codon in (and thus truncate) the matreshkaM at the 45th amino acid (Fig. 6A). The SNP replaces a glycine with a glutamate at position 116 in THPO itself, on a loop between an alpha-helix and a beta sheet. Gly116 is within the conserved, 184aa long Erythropoietin-Thrombopoietin domain in the N-terminal portion of THPO. This domain is responsible for binding to THPO's receptor, MPL (myeloproliferative leukemia virus oncogene). Three independent site-directed mutagenesis studies have been carried out to determine which residues are necessary for binding to MPL, based on 3D structural data [20–22]. A total of 14 amino acids were determined to be important for binding, none of which lie within 15aa of Gly116. It therefore seems unlikely that the SNP affects THPO function. Also of note are two splice variants (labelled '2' and '3' in RefSeq), which retain the SNP and shorten the matreshkaM by 4 and 39 aa, respectively.
The significance of the matreshka's Kozak motif is underlined by the absence of a Kozak motif for the parent sequence: the first motif in the ITGB3 frame occurs 934 nucleotides downstream of the annotated start codon. This suggests that leaky ribosomal scanning is occurring in the parent frame, and may increase the likelihood that the matreshkaM (a further 490 nucleotides downstream) is also translated from leaky scanning.
These are by no means the only interesting predictions. The largest matreshka discovered is 418aa long in human, although it has the potential to reach up to 500aa in mouse (Table 3). It is derived from the KCNK12 potassium channel, a member of the 2-pore domain superfamily of background K+ channels. The G protein-coupled receptor 135, the gamma subunit of a DNA directed polymerase, TAF6 RNA polymerase II, and Forkhead box A2 also contain matreshkas of substantial size (Table 3). These matreshkas may represent alternative proteins, in the way that Alex is, or be translated as part of a frameshifted splice variant (e.g., the THPO nirs splice variant, Fig. 6C). Conservation measurements have previously been used to support the functional importance of frameshifted splice variants. Our findings suggest that other conserved frameshifted splice variants, such as caspase 7 isoform beta, are biologically relevant (see Additional file 2).
In a related study by Chung et al., alternative reading frames (ARFs, which can be considered synonymous with matreshkas) were searched for using a comparative genomics approach in human and mouse. A conservative list of 40 ARF-containing genes was derived from elegant simulation-based statistical and modelling methods. Of these 40 genes, 21 genes were included in our matreshka set (Additional file 4), although the ARF and matreshka sequences were not compared directly, as the former have not been published. Our study had a broader scope, including non-methionine-start reading frames which form the bulk of the longest matreshkas (Table 3). Given the stringent length threshold chosen by Chung et al., our analysis probably has a much greater representation of -1 frame matreshkas (see Table 3). All matreshka sequences have been provided as supplementary data (Additional file 5, Additional file 6), and can be filtered according to our calculated P values (Additional file 3).
The genome-wide study reported here provides evidence for the existence of potentially dual coding sense-frames in a number of mammalian transcripts. Future studies should aim to determine actual proof-of-translation by raising antibodies against matreshkas, and probing cells or tissues where the mRNAs are known to be expressed. Alternatively, genetic mouse models could be generated which would knock out the putative matreshka while leaving the parent RefSeq sequence intact, thus enabling phenotypic analysis. Future studies could also explore the potential for overlaps between tORFs on the antisense strands. Indeed, genomic mapping of full-length mouse cDNAs has revealed transcriptional forests in which overlap of coding sequences on the sense and antisense strands occurs.
The strategy for identifying conserved frameshifted amino acid sequences (matreshkas) in transcripts is summarised in Fig. 2. First of all, we constructed human-mouse-rat orthologous triplets. Open reading frames (ORFs) were extracted from frames +1 and -1, and translated in silico. Sequences above certain conservation thresholds were retained for further analysis.
The starting point for the analysis was the Reference Sequence (RefSeq) set. RefSeq is a well-curated collection of mRNA and corresponding protein sequences maintained by the National Center for Biotechnology Information (NCBI). Protein entries for human, mouse and rat (release 9) were downloaded from the NCBI web site. Orthologous triplets were built on best reciprocal matches in all vs. all BLASTP searches for human-mouse, human-rat and mouse-rat orthologs. For example, if human protein X is BLASTed against all mouse RefSeq proteins and returns protein Y as best hit, and the converse is true (when protein Y is searched against the human proteome, protein X is the best hit), then proteins X and Y are mutual best matches and are predicted to be in silico 'orthologs'. If in turn mouse protein Y has rat protein Z as best mutual match, and human protein X has rat protein Z as best mutual match, the ortholog triplet X-Y-Z can be built. 'Best' BLASTP hits were taken as the hits with the highest percentage identity (at least 60%) at the amino acid level to a query, and covering at least 70% of its length. If no hit matching these criteria was found, the ortholog triplet was not built.
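The triplet construction described above can be sketched as below, assuming the best-hit tables have already been derived from BLASTP output and filtered by the identity and coverage criteria. The function names and the protein identifiers in the toy data are hypothetical.

```python
def reciprocal_best(best_ab, best_ba):
    """Keep pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    return {a: b for a, b in best_ab.items() if best_ba.get(b) == a}


def build_triplets(hm, mh, hr, rh, mr, rm):
    """Build human-mouse-rat triplets from six best-hit dictionaries.

    hm maps each human protein to its best mouse hit, mh is the reverse
    search, and so on for human-rat (hr, rh) and mouse-rat (mr, rm).
    A triplet is kept only if all three pairwise relations are mutual.
    """
    hm_pairs = reciprocal_best(hm, mh)
    hr_pairs = reciprocal_best(hr, rh)
    mr_pairs = reciprocal_best(mr, rm)
    triplets = []
    for human, mouse in hm_pairs.items():
        rat = hr_pairs.get(human)
        if rat is not None and mr_pairs.get(mouse) == rat:
            triplets.append((human, mouse, rat))
    return triplets


# Hypothetical identifiers: X-Y-Z is fully mutual, X2 is not.
triplets = build_triplets(
    hm={"X": "Y", "X2": "Y"}, mh={"Y": "X"},
    hr={"X": "Z", "X2": "Z2"}, rh={"Z": "X", "Z2": "X2"},
    mr={"Y": "Z"}, rm={"Z": "Y"},
)
```

Requiring all three pairwise relations to be mutual is what discards proteins such as X2 in the toy data, whose mouse best hit does not point back to it.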
The RNA coding sequence for each RefSeq protein in the orthologous sets was then retrieved for in silico generation of frameshifted translated ORFs (tORFs). tORFs at least 50 amino acids long were generated computationally from frames +1 and -1 on the sense strand. We used a software package called Scripter developed by ourselves, although this can be done with Getorf from the EMBOSS package [27, 28]. Conservation was ascertained with global alignments between all human-mouse and human-rat ORFs within an ortholog triplet. Scripter was again employed for this step, although the EMBOSS package Needle program would be an adequate substitute. Output files were parsed using Perl scripts. In a given alignment, the shorter ORF had to be >= 90% the length of the longer, the aligned region had to cover >= 80% of the length of the shorter ORF, and the aligned region had to comprise >= 60% identical amino acid residues. The term 'matreshka' was coined to describe a conserved tORF.
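The three conservation thresholds can be expressed as a single predicate over a gapped pairwise alignment. This sketch assumes the alignment is supplied as two equal-length strings with '-' for gaps (as a global aligner such as Needle would produce), and it interprets the 'aligned region' as the columns in which both sequences have a residue; both points are assumptions, as is the function name.

```python
def passes_conservation(aln_a, aln_b, min_len_ratio=0.90,
                        min_coverage=0.80, min_identity=0.60):
    """Apply the length-ratio, coverage and identity thresholds to one
    gapped global alignment of two tORFs."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have the same length")
    len_a = len(aln_a.replace("-", ""))
    len_b = len(aln_b.replace("-", ""))
    shorter, longer = sorted((len_a, len_b))
    # Columns where neither sequence has a gap (assumed 'aligned region').
    aligned = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if shorter == 0 or not aligned:
        return False
    identical = sum(x == y for x, y in aligned)
    return (shorter / longer >= min_len_ratio
            and len(aligned) / shorter >= min_coverage
            and identical / len(aligned) >= min_identity)


# Hypothetical peptides: 7 of 10 aligned residues are identical.
ok = passes_conservation("MKTAYIAKQR", "MATGYLAKQR")
```

A candidate would be retained as a matreshka only if the human-mouse and human-rat alignments within a triplet both pass.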
A key parameter which differentiated effectively between matreshkas was the number of stop codons which could have arisen to truncate a matreshka, without affecting the 'parent' (frame 0, RefSeq) protein sequence. This is essentially a very approximate measure of the selection pressure maintaining the matreshka. For example, if part of the parent sequence is a valine followed by an isoleucine residue, the possible codons could be GTA, GTT, GTC and GTG (Val) and ATT, ATC and ATA (Ile). It can be seen that there are several codon combinations which could produce a stop codon in a frame +1 matreshka in this part of the sequence (GTGATT, GTGATC, GTGATA, GTAATT, GTAATC and GTAATA). This simple search can be carried out across the length of the parent sequence which overlaps with a matreshka. The final result counts how many times a matreshka could have been truncated by a stop codon without affecting the parent sequence, but was not truncated in any of human, mouse or rat.
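The valine-isoleucine example generalises to a short search over synonymous codon pairs. In this sketch the function names are hypothetical, the input is assumed to be the stretch of parent protein that overlaps the matreshka, and frame +1 and frame -1 are taken as the codons starting at the second and third base of the first parent codon, respectively.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
STOPS = {c for c, aa in CODON_TABLE.items() if aa == "*"}
SYNONYMS = {}
for codon, residue in CODON_TABLE.items():
    SYNONYMS.setdefault(residue, []).append(codon)


def stop_combinations(aa1, aa2, frame):
    """Synonymous codon pairs for aa1-aa2 that put a stop in the shifted frame."""
    hits = []
    for c1, c2 in product(SYNONYMS[aa1], SYNONYMS[aa2]):
        shifted = c1[1:] + c2[0] if frame == +1 else c1[2] + c2[:2]
        if shifted in STOPS:
            hits.append(c1 + c2)
    return hits


def count_nstop(parent_protein, frame):
    """Number of consecutive residue pairs at which a silent change in the
    parent could have introduced a stop codon in the shifted frame."""
    pairs = zip(parent_protein, parent_protein[1:])
    return sum(bool(stop_combinations(a, b, frame)) for a, b in pairs)


val_ile = sorted(stop_combinations("V", "I", +1))
```

For the Val-Ile pair in frame +1 this returns the six combinations listed in the text, while in frame -1 the same pair offers no opportunity for a stop.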
This stop codon count was supplemented with a probability value which takes codon bias into account, and was calculated as follows. From published codon frequencies, we calculated the relative probability of observing each codon for a given amino acid (aa), by dividing the frequency of each codon by the sum of frequencies for all codons for aa 'A'. These relative probabilities must sum to 1 for A. Extending this, we can find the relative probability of observing a bi-codon for a pair of amino acids A1A2. Instead of the binary yes/no answer as to whether a stop codon could truncate a matreshka (described above), we can sum the relative probabilities of bi-codons which do not cause a truncation, obtaining a probability that a matreshka is not truncated at a given position, abbreviated here to Pno_stop. The overall probability of a matreshka 'surviving' is derived by multiplying Pno_stop across the entire matreshka sequence, abbreviated here to ∏(Pno_stop). This assumes independence of consecutive (overlapping) bi-codons. We chose to report the probability of a matreshka being truncated by a stop, equal to 1 - ∏(Pno_stop).
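The calculation of 1 - ∏(Pno_stop) can be sketched as follows. Two assumptions are made: the probability of a codon pair is taken as the product of the two codons' relative probabilities, and, because the published codon frequencies are not reproduced here, the example uses uniform synonymous codon usage as a placeholder. The names `p_no_stop`, `p_truncated` and `UNIFORM` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
STOPS = {c for c, aa in CODON_TABLE.items() if aa == "*"}
SYNONYMS = {}
for codon, residue in CODON_TABLE.items():
    SYNONYMS.setdefault(residue, []).append(codon)

# Placeholder codon usage: every synonymous codon equally likely.
# Real use would substitute relative frequencies from a codon usage table.
UNIFORM = {c: 1 / len(SYNONYMS[aa]) for c, aa in CODON_TABLE.items() if aa != "*"}


def p_no_stop(aa1, aa2, frame, rel_freq):
    """Summed probability of the codon pairs for aa1-aa2 that do not create
    a stop in the shifted frame (pair probability assumed multiplicative)."""
    total = 0.0
    for c1, c2 in product(SYNONYMS[aa1], SYNONYMS[aa2]):
        shifted = c1[1:] + c2[0] if frame == +1 else c1[2] + c2[:2]
        if shifted not in STOPS:
            total += rel_freq[c1] * rel_freq[c2]
    return total


def p_truncated(parent_protein, frame, rel_freq):
    """1 minus the product of p_no_stop over all consecutive residue pairs."""
    survive = 1.0
    for a, b in zip(parent_protein, parent_protein[1:]):
        survive *= p_no_stop(a, b, frame, rel_freq)
    return 1 - survive


p_val_ile = p_truncated("VI", +1, UNIFORM)
```

With uniform usage, half of the twelve Val-Ile codon pairs create a frame +1 stop, so the truncation probability for that single position is one half; over a long matreshka the product quickly approaches zero survival unless stop-creating codons are avoided.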
Matreshkas were also scanned for different types of motifs. Those starting with methionine (matreshkaM, derived from tORFM), for instance, might be translated from leaky ribosomal scanning events. A strong Kozak motif could support initiation of translation at these locations. We therefore scanned the nucleotide context around the ATG start for Kozak motifs. The pattern [AG]NNATGG (where N is any nucleotide, and letters in square brackets indicate alternative bases at a given position) was used, derived from published data on initiator codons from highly curated sequences. The significant enrichment of Kozak motifs in the longest matreshkaM was ascertained with the Exact Binomial Test using R software, which was also used for the χ2 test in Table 1. Matreshkas were additionally scanned for signal peptide motifs using SignalP v.2.0. BLASTP searches of human matreshkas against UniProt (UniRef100, last updated 13th Sep 2005) were carried out with an e-value threshold of 0.0001. To check for repeating sequence elements, the EMBOSS command line programs Equicktandem and Etandem were used with default settings to scan matreshka nucleotide sequences for tandem repeats. SpotFire DecisionSite v. 8.0 (Spotfire Inc., MA, USA), a data visualisation tool, was used to rapidly select the most interesting matreshka candidates.
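The Kozak scan with the pattern [AG]NNATGG can be sketched with regular expressions. The function names and the toy sequence are hypothetical; lookahead is used so that overlapping occurrences are all counted, and N is restricted to A, C, G or T.

```python
import re

# Lookahead patterns so that overlapping matches are not skipped.
KOZAK = re.compile(r"(?=[AG][ACGT]{2}ATGG)")
ATG = re.compile(r"(?=ATG)")
KOZAK_AT_START = re.compile(r"[AG][ACGT]{2}ATGG")


def count_atg_and_kozak(seq):
    """Return (number of ATGs, number of ATGs in [AG]NNATGG context),
    counted over all frames of the given strand."""
    seq = seq.upper()
    n_atg = sum(1 for _ in ATG.finditer(seq))
    n_kozak = sum(1 for _ in KOZAK.finditer(seq))
    return n_atg, n_kozak


def has_kozak_at(seq, atg_index):
    """True if the ATG starting at atg_index (0-based) sits in a Kozak context.
    Three upstream bases and one downstream base are required."""
    if atg_index < 3:
        return False
    window = seq.upper()[atg_index - 3:atg_index + 4]
    return KOZAK_AT_START.fullmatch(window) is not None


# Hypothetical sequence with two ATGs, only the first in Kozak context.
toy = "CCACCATGGCGATGTTT"
counts = count_atg_and_kozak(toy)
```

Applying `count_atg_and_kozak` to every coding sequence gives the background rate of Kozak contexts, while `has_kozak_at` answers the narrower question of whether a given matreshkaM start codon carries the motif.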
Matreshka amino acid composition analysis was performed using the EMBOSS program pepstats [28, 32]. Sequence annotation, such as the number of introns, flanking genomic sequence and SNP data, was extracted from Ensembl [33, 34], and multiple sequence alignments generated with CLUSTAL W v. 1.83 [35, 36]. To verify whether the THPO matreshka start is associated with a transcription start site, the program Eponine was used, available from the Sanger Center web site.
This study was entirely funded by the Novartis Institutes for Biomedical Research, Basel, Switzerland.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.