Frame disruptions in human mRNA transcripts, and their relationship with splicing and protein structures

Background Efforts to gather genomic evidence for the processes of gene evolution are ongoing, and are closely coupled to improved gene annotation methods. Such annotation is complicated by the occurrence of disrupted mRNAs (dmRNAs), harbouring frameshifts and premature stop codons, which can be considered indicators of decay into pseudogenes. Results We have derived a procedure to annotate dmRNAs, and have applied it to human data. Subsequences are generated from parsing at key frame-disruption positions and are required to align significantly within any original protein homology. We find 419 high-quality human dmRNAs (3% of total). Significant dmRNA subpopulations include: zinc-finger-containing transcription factors with long disrupted exons, and antisense homologies to distal genes. We analysed the distribution of initial frame disruptions in dmRNAs with respect to positions of: (i) protein domains, (ii) alternatively-spliced exons, and (iii) regions susceptible to nonsense-mediated decay (NMD). We find significant avoidance of protein-domain disruption (indicating a selection pressure for this), and highly significant overrepresentation of disruptions in alternatively-spliced exons, and 'non-NMD' regions. We do not find any evidence for evolution of novelty in protein structures through frameshifting. Conclusion Our results indicate largely negative selection pressures related to frame disruption during gene evolution.


Mapping of transcription information (mRNAs, cDNAs, ESTs, microarray data) onto the genomic DNA is an essential part of the gene annotation process. However, many transcripts appear to have frame disruptions in them, which would interfere with the formation of a stable protein product [1]. Such frame disruptions have generally been considered symptoms of decay into a pseudogene [2,3].

Previously, we have analysed the distribution of a special case of such frame-disrupted transcripts, the 'transcribed processed pseudogene' or 'transcribed retropseudogene' [1]. Retropseudogenes are copies of messenger RNAs that have been reverse-transcribed and re-integrated into the genome, probably as a by-product of LINE-1 retrotransposition [4]. These intronless copies of genes usually decay and are deleted from the genomic DNA [1,3,5-8]. However, some retropseudogenes are transcribed, perhaps through co-option of local promoter elements, as supported by their increased density near genes [2,9].

Many mammalian genes (~40-80%) make alternatively-spliced transcripts [10]. It has previously been noted that many such alternatively-spliced exons (perhaps up to ~30%) harbour premature stop codons [11,12], and thus may be considered 'pseudogenic' alternative transcripts. Other studies have demonstrated that several hundred human transcripts can be considered to harbour alternative reading frames, offset from each other by one or more frameshifts, which can be preserved for millions of years of evolution [13-15]. Frith, et al., found that about one-tenth of mouse cDNAs have apparent frame disruptions [16]. Sorek, et al., showed that ~7% of human genes generate an alternative transcript with an Alu, with the vast majority of these insertions yielding frame-disrupted mRNAs [17].

How frequent is genuine frame disruption in human mRNAs? Is it significantly associated with the positions of protein structures and exons in coding sequences? Here, to answer these and other questions, we analyze a data set of high-quality human mRNAs, using an annotation pipeline which ensures that spurious frame disruptions (due simply to bad sequence alignment) are discarded. We perform statistical calculations which demonstrate non-random distribution of the initial frame disruptions in these sequences with respect to: (i) protein domain annotations, (ii) alternative exons and (iii) rules for nonsense-mediated decay (NMD). Messenger RNA transcripts that have premature stop codons greater than fifty nucleotides 5' to the last intron-exon junction of a gene are degraded by the NMD pathway [18]. Some NMD substrates have been shown to produce functional proteins in yeast and mammalian cells [19,20]. In addition, using our pipeline, we find no evidence for a role of frameshift in protein domain evolution.


Results and discussion


Overall statistics

s, we verified 16,1
3 high-quality mRNAs from the NCBI Refseq and Unigene consensus collections, through mapping onto human genomic DNA. A small subpopulation of these mRNAs (419, or 3% of the total) harbour significant frame disruptions (either frameshifts or premature stop codons) (Table 1), which is of a similar order to previous analyses of such disruptions in sets of transcripts [2,9,16]. Most of these are disrupted by frameshifts (83% of cases), rather than premature stop codons. Using a small modification to the basic annotation pipeline, we defined a small minority of these frameshifted transcripts (17, 4% of the dmRNAs) that harbour compensating frameshifts, resulting in movement back into frame. Previous analysis of mouse cDNAs also indicated that a small fraction of them (~2%) may have such compensatory frameshifts [16]. Three examples of dmRNAs are illustrated in Figure 1. There are two multiply-disrupted examples (homologous to a cytochrome P450, and to a zinc-finger-containing transcription factor), and a frameshifted alternative mRNA transcript, from the gene C20orf59, which appears to be a transmembrane sugar transporter.

In general, the dmRNAs demonstrate functional prevalences that are typical of the population of human transcripts in general, as judged from counting up Gene Ontology functional category annotations (Additional File 1). The duplication behaviour of the genes from which the disrupted mRNAs arise is also typical of the whole human gene complement (Figure 2; median value of 5 paralogs per gene for the disrupted mRNAs versus 6 for the whole set; mean = 36 [± 62] versus 32 [± 81]). However, dmRNAs have significantly fewer exons than mRNAs in general (mean = 7.9 [± 8.6] exons, compared to 10.0 [± 11.5] exons in general, P < 0.05 using normal statistics for the distribution of the sample mean). Such shorter lengths are expected from the truncating effect of frameshifts and stop codons. A large fraction (44%) of the dmRNAs have multiple frame disruptions, with the frequencies of numbers of frame disruptions exhibiting a power-law relationship, as observed for processed pseudogenes [7,8] (Figure 3). The vast majority of frameshifts in dmRNAs (326/346, 94%) result in truncation from premature stop codons.
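The exon-count comparison above can be reproduced approximately with a z-test on the sample mean. The sketch below is illustrative only: the means and standard deviation are taken from the text, but the sample size (n = 419, the number of dmRNAs) is an assumption, as is the use of the general-population standard deviation for the standard error.

```python
import math

def z_test_sample_mean(sample_mean, pop_mean, pop_sd, n):
    """Return (z, two-sided P) for a sample mean under the normal approximation."""
    standard_error = pop_sd / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Means and SD are quoted in the text; n = 419 is an assumed sample size.
z, p = z_test_sample_mean(sample_mean=7.9, pop_mean=10.0, pop_sd=11.5, n=419)
print(f"z = {z:.2f}, two-sided P = {p:.2g}")
```

Under these assumptions the difference lies well beyond the P < 0.05 threshold quoted above.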

[Figure 1 alignment residue omitted. Recoverable labels: ENST00000263095; (c) Frame disruption from exon insertion: CT059_HUMAN aligned to ENST00000335819.]

We examined the etiology of the frame disruptions in dmRNAs in more detail. Some dmRNAs have apparent frame disruptions from 5' and 3' insertions of retrotransposons (24/325, 7%), or from an overlapping antisense gene (45/325, 14%). Interestingly, a large proportion of dmRNAs also arise from antisense homologies to other distal genes (47/325, 14%). Such antisense fragments are of potential importance in transcription regulation; for example, a functional pseudogene with antisense homology to the nitric oxide synthase gene downregulates this gene in the snail Lymnaea stagnalis [21]. These three categories of dmRNA (retroposon insertions, antisense protein homologies and overlapping gene pair homologies) comprise a subset of dmRNAs arising from 'probable UTR (untranslated region) features'. Another possible source of dmRNAs is unassigned selenoproteins [22]. We have filtered for known selenoproteins [22], but it is possible there are further cases. However, we found no indication of this, since there is no significant over-representation (relative t

ction (288/2432, 12%) of the coding exons in their transcripts. Frame-disrupted exons are, on average, significantly
tatistics for the distribution of the sample mean).

Although both frame-disrupted and non-frame-disrupted exons show a tendency for very short exon lengths (≤ 200 nucleotides), there is a greater proportion of long exons (>1000 nucleotides) in the frame-disrupted set (24% of frame-disrupted exons versus 4% of those that are not frame-disrupted; Figure 4). To analyze exon lengths we disregarded the 'probable UTR features', but their inclusion does not change the trend observed; also, the exon length trend is maintained when exons are split into subsets of constitutive and alternatively-spliced exons. We examined the exons >1000 nucleotides in detail, and found that a significant fraction of them come from zinc-finger-containing transcription factors (36/67, 54%) with >1/3 of their sequences composed of zinc finger motifs. Zinc-finger-containing transcription factors have dynamic evolution patterns in mammals, with expansions of family sizes specific to primates and rodents [23]; large numbers of dmRNAs are a signature of other dynamically evolving mammalian gene families, such as olfactory receptors and immune system genes [1]. A significantly greater proportion of disrupted exons are at the 3' terminus of mRNAs (58/67, 87%), even if the zinc-finger-containing genes are excluded. Such 3' exons have a general tendency to be longer (51% of 3' exons in multiple-exon transcripts verified by Refseq mRNAs are ≥ 1000 nucleotides in length) (Figure 4). This greater length has been suggested to arise from a greater amount of important conserved sequence in 3' UTRs, compared to 5' ones [24].


Positions of frame disruptions in dmRNAs

We analysed the distribution of the initial frame disruptions in the disrupted mRNAs with respect to the positioning of: (i) structural protein domains, (ii) alternatively-spliced exons, and (iii) the areas of the transcripts not susceptible to NMD (Tables 2, 3, 4, 5). In all of these analyses, we examine trends for the whole data set of dmRNAs, and the subset of these mRNAs for which the matching proteins have a verifying alignment in a divergent mammal or vertebrate (see Methods for details). The significant tendencies listed in Tables 2,

and 5 for the whole data set (combined stop codon and frameshift disruptions) remain significant or become more significant if those examples labelled 'probable UTR features' are removed from the data (this is illustrated for those with verifying alignments to divergent orthologs, in each case).


Protein structure disruption

Do the frame disruptions in these mRNAs avoid disruption of protein structure domains? To answer this question, we analysed the distribution of initial frame disruptions in sequences relative to the placement of protein structure domains from the SCOP data (see Methods for details). For both frameshifts and premature stop codons, we find significant underrepresentation within protein domains (P in range <0.05 to <0.001; all of the P-values quoted for Tables 2, 3, 4, 5 are for χ² tests; see Table 2 footnote for details of statistical tests). This non-random distribution of frame disruptions is observed for a wide range of margins for definition of overlap with protein domains (between 0 and 25 nucleotides) (Table 2 footnote). This avoidance of protein structure domains is evidence for selection pressures to avoid protein structure disruption and supports a significantly negative role for frame disruption in the evolution of protein structures.

[Figure 2 plot omitted. Axis labels: number of paralogs; proportion of occurrences. Series: all genes; genes yielding disrupted mRNAs.]

Figure 3 Numbers of frame disruptions. The number of frame disruptions in dmRNAs plotted versus the total occurrences of this number, on a log-log scale. This distribution is governed by a power law relationship, with the parameters for this linear relationship indicated on the plot.

Figure 4 Distribution of frame-disrupted and non-frame-disrupted exon lengths in the disrupted mRNAs. The exon lengths are in bins labelled at either end of the bin with the upper (≤) and lower (>) bounds, with occurrences in each bin on the y axis. The percentage of exons >1000 nucleotides is given for each data set. The upper left panel is for the whole set of exons; the lower left panel for 5' exons, the upper right for internal exons, and the lower right for 3' exons.

Table 2 footnote. *** The ratio stands for 'the number of frame disruptions not disrupting a protein structure domain assignment versus the number that do'. A margin for ascertaining overlap with a protein domain assignment of 15 nucleotides was used in the calculations. The expectations for the statistical tests (χ²) are calculated by adding up the total amount of coding sequence that can be assigned to a SCOP protein structure domain for the sample of transcripts analysed in each row of the table. † stands for P < 0.05, †† for P < 0.01 and ††† for P < 0.001. The significant results remain significant to at least P < 0.05 when margins for calculating overlap with protein domains of 0, 5, 10, 20 or 25 nucleotides are also used.
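The χ² test for domain avoidance can be sketched as follows, assuming a simple two-category goodness-of-fit test (one degree of freedom) in which the expected counts come from the fraction of coding sequence assigned to a SCOP domain. The function name and the counts in the example are hypothetical and are not values from Table 2.

```python
import math

def chi_square_domain_avoidance(n_in_domain, n_outside, domain_fraction):
    """Goodness-of-fit chi-square (1 d.o.f.) for disruptions inside vs outside domains.

    domain_fraction is the share of coding sequence assigned to a protein domain,
    which sets the expected share of disruptions falling inside domains.
    """
    total = n_in_domain + n_outside
    expected_in = total * domain_fraction
    expected_out = total * (1.0 - domain_fraction)
    chi2 = ((n_in_domain - expected_in) ** 2 / expected_in
            + (n_outside - expected_out) ** 2 / expected_out)
    # Upper-tail probability of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts for illustration only.
chi2, p_value = chi_square_domain_avoidance(n_in_domain=30, n_outside=70,
                                            domain_fraction=0.5)
print(f"chi2 = {chi2:.1f}, P = {p_value:.2g}")
```

With half of the coding sequence in domains, observing only 30 of 100 disruptions inside domains would count as significant avoidance in this toy case.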

Because of the proportion of dmRNAs that contain large arrays of Zn finger domains, we also checked specifically for avoidance of disruption of Zn finger motifs. Zn finger motif assignments were taken from the feature table records of the UniProt database [25]. We find significant avoidance of disruption of Zn finger motifs only for overlap margins of between 1 and 4 residues inclusive (Table 3).


Alternative splicing

We examined whether there is a relationship between the position of initial frame disruptions in mRNAs and the location of alternatively-spliced exons (Table 4). We find a highly significant twofold overrepresentation of initial frame disruptions in alternatively-spliced exons (P < 0.001; Table 4). These correspond to almost half (~46%, 191/419) of the dmRNAs. This may arise because the selection pressure on alternative splicings that are not transcribed at high levels will be considerably less, leading to increased likelihood of frame disruption as evolution progresses [10]. It is possible that many of these frame-disrupted alternative splicings have a regulatory role [11,26]. Small numbers of the alternatively-spliced frameshifted dmRNAs arise from exon skipping (4 cases), and exon insertion

1 cases). This approximately twofold over-representation is maintained (P < 0.05) in the subset of alternative splicings that contain SCOP [27] protein domain assignments within them.


Transcripts not susceptible to nonsense-mediated decay

Messenger RNA transcripts that have premature stop codons greater than fifty nucleotides 5' to the last intron-exon junction of a gene are degraded by nonsense-mediated decay (NMD) [18]. We analyzed the distribution of initial frame disruptions relative to this NMD rule (Table 5). There are significantly more transcripts with frame disruptions in the 'non-NMD' region (P < 0.001; Table 5), as would be expected logically (since these transcripts would not be degraded). However, this over-representation of initial frame disruptions in the 'non-NMD' region also arises for the subsets of transcripts in which the frame disruptions disrupt a SCOP protein structure domain (P < 0.05), and which are thus unlikely to form a stable functional protein product. Such unstable protein products are more likely for shorter truncations, and thus NMD provides an evolutionary guard against excessive expression of unstable proteins [28].
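The fifty-nucleotide NMD rule can be expressed as a small classifier. This is a minimal sketch under stated assumptions: positions are 0-based transcript coordinates, the last junction is taken as the start of the final exon, and single-exon transcripts are treated as 'non-NMD'. The function names are illustrative and are not part of the annotation pipeline.

```python
def last_junction(exon_lengths):
    """Transcript coordinate of the last junction, or None for a single-exon transcript."""
    if len(exon_lengths) < 2:
        return None
    # The last junction sits where the final exon begins.
    return sum(exon_lengths[:-1])

def classify_nmd(stop_pos, exon_lengths, min_distance=50):
    """Label a premature stop as 'NMD' or 'non-NMD' under the fifty-nucleotide rule."""
    junction = last_junction(exon_lengths)
    if junction is None:
        return "non-NMD"
    # NMD applies only when the stop lies more than min_distance nt 5' of the junction.
    return "NMD" if (junction - stop_pos) > min_distance else "non-NMD"

exons = [200, 150, 300]          # hypothetical exon lengths, 5' to 3'
print(classify_nmd(100, exons))  # stop far upstream of the last junction
print(classify_nmd(320, exons))  # stop within 50 nt of the last junction
print(classify_nmd(400, exons))  # stop in the final exon
```

A stop codon in the final exon, or within the last fifty nucleotides of the penultimate exon, falls in the 'non-NMD' region in this sketch.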


Checking for gene evolution through frame-shift formation

It is possible that analysis of this comprehensive data set of dmRNAs can provide evidence for a positive role for such protein-coding frame disruptions in gene evolution. Specifically, is there evidence that such frameshifts can produce significant structural novelties? To check this, we derived a modification of the initial pipeline (Figure 5), with matches to SCOP protein structure domains replacing those for whole protein sequences from the SWISSPROT database, finding 36 cases (9% of the dmRNAs) which produce a significant alignment for both subsequences delimited by the initial frameshift (Figur

. However, none of these (0%) overlap another protein domain assignment in a different frame, yielding no evidence for generation of protein structure novelties through single frameshifts. Nonetheless, a more thorough analysis of multiple vertebrates would be required to provide a more conclusive perspective on the role of frameshift in protein structure evolution.


Conclusion

We analyzed human mRNAs for both frameshift and stop codon frame disruptions, using a pipeline that was designed to discard spurious frame disruptions arising from alignment error. We performed statistical calculations and found non-random distributions of frame disruptions with respect to protein structures, alternatively-spliced exons, and 'non-NMD' regions. The significant avoidance of protein structure disruption and highly significant placement in alternatively-spliced exons (rather than constitutive ones), together with the observation of a lack of protein structure generation through frameshift, support largely negative selection pressures related to frame disruption during gene evolution.

masked alignments proved significant, a final third FASTX/Y alignment was generated without masking in the sequences (with e ≤ 0.01).

Given that cDNAs may have error rates as high as 1 × 10⁻², and that the genomic DNA error rate is 1 × 10⁻⁴, we can expect frame disruptions arising from sequencing error at a rate of about 1 × 10⁻⁶ (the product of the two rates). Given that we have a total of 729 frame disruptions in a total of 942,752 nucleotides, we would expect that, at most, only one of these frame disruptions has arisen from sequencing error. This implies that the data set of dmRNAs is of sufficiently high quality for in-depth bioinformatic analysis.
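The error estimate above is a short piece of arithmetic, reproduced here as a check. The sketch assumes that the two error rates are per nucleotide and independent, so that their product gives the rate of coincident errors; the variable names are illustrative.

```python
# Error rates quoted in the text (assumed here to be per nucleotide and independent).
cdna_error_rate = 1e-2
genomic_error_rate = 1e-4
total_nucleotides = 942_752

# Rate at which both sequences carry an error at the same position.
joint_error_rate = cdna_error_rate * genomic_error_rate

# Expected number of frame disruptions attributable to sequencing error.
expected_spurious = joint_error_rate * total_nucleotides
print(f"joint error rate = {joint_error_rate:.0e}")
print(f"expected spurious disruptions = {expected_spurious:.2f}")
```

The expectation is just under one, in line with the statement that at most one of the 729 frame disruptions is likely to be a sequencing artefact.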

Alu elements are a common contaminant in protein databases [17]. We obtained a large number of matches to Alu elements producing dmRNAs; since these are not a focus of our current analysis, they were removed using protein-level translations of Alu sequences, with a more accommodating BLAST threshold of e ≤ 0.01.

We removed selenoproteins from the dataset, since these are a known example of a re-coding phenomenon [22]. This was achieved through protein-level BLAST comparisons (e ≤ 10⁻⁴) to the determined human selenoproteome, downloaded from the SelenoDB database [22].

To ensure that we are not considering spurious frameshifts arising from bad protein annotations, we used an additional filter to ensure that the protein reading frames in question are well conserved in a distant mammal or vertebrate. We required that the matching protein is conserved in a rodent or non-mammal vertebrate (with BLAST e-value ≤ 10⁻⁴) over ≥ 95% of its length.

Figure 5 Pipeline for annotating dmRNAs. The steps discussed in Methods are illustrated schematically.

(iii) Realignment to remove spurious frame disruptions: Spurious frame disruptions can arise as alignment errors when comparing a protein to a nucleotide sequence [3]. Such spurious frame disruptions are more frequent in more divergent aligned sequences; they are typically near the ends of aligned subsequences, and can also arise in compositionally-biased or low-complexity regions [3]. To ensure that spurious frame disruptions are not considered in the present analysis, we parse the disrupted coding sequences at the initial frame disruption into two subsequences, and require that both of these subsequences align significantly to the original matching protein (BLAST e-value ≤ 10⁻⁴).
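The realignment filter of step (iii) can be sketched as a split-and-test routine. This is an illustration only: `align_evalue` is a hypothetical callable standing in for whatever returns the e-value of a subsequence against the matching protein, and the function names are not taken from the pipeline itself.

```python
from typing import Callable, Tuple

def split_at_disruption(coding_seq: str, disruption_pos: int) -> Tuple[str, str]:
    """Parse a coding sequence into the parts 5' and 3' of the initial frame disruption."""
    return coding_seq[:disruption_pos], coding_seq[disruption_pos:]

def is_genuine_disruption(coding_seq: str,
                          disruption_pos: int,
                          protein: str,
                          align_evalue: Callable[[str, str], float],
                          max_evalue: float = 1e-4) -> bool:
    """Keep a disruption only if both flanking subsequences align significantly."""
    upstream, downstream = split_at_disruption(coding_seq, disruption_pos)
    if not upstream or not downstream:
        return False  # a disruption at either end leaves nothing to realign
    return all(align_evalue(part, protein) <= max_evalue
               for part in (upstream, downstream))
```

Passing the aligner in as a parameter keeps the filter independent of any particular alignment tool, so the same check can be reused with different search back-ends.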

In addition, we checked for compensatory frameshifts (i.e., pairs of frameshifts that move a coding sequence out of frame, and then back into frame). It is possible that compensatory frameshifts provide a mechanism for generating sequence diversity in proteins over evolution. We checked for compensatory frameshifting, using an additional filter in the initial pipeline at step 3 (Figure 5). For every case of an initial frameshift, we checked for a second frameshift 3' to it in the transcript that corrects for the first frameshift. Then we checked whether the three subsequences delimited by these two frameshifts all align significantly with the original matching protein.
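The search for a compensating frameshift can be sketched as a running sum of frame offsets modulo 3. The representation of a frameshift as a (position, offset) pair, and the choice to accumulate offsets over any intervening frameshifts, are assumptions made for illustration; the subsequent alignment check on the three subsequences is not shown.

```python
def find_compensating_frameshift(frameshifts):
    """Find the first downstream frameshift that restores the original reading frame.

    frameshifts: list of (position, offset) pairs sorted 5' to 3', where offset is
    the net number of inserted (+) or deleted (-) nucleotides at that position.
    Returns (initial_position, compensating_position), or None if the frame is
    never restored.
    """
    if not frameshifts:
        return None
    first_pos, net_offset = frameshifts[0]
    for pos, offset in frameshifts[1:]:
        net_offset += offset
        if net_offset % 3 == 0:  # cumulative shift is a multiple of three: back in frame
            return first_pos, pos
    return None

print(find_compensating_frameshift([(120, +1), (180, -1)]))  # (120, 180)
print(find_compensating_frameshift([(120, +1), (150, +1)]))  # None
```

Each candidate pair found this way would still need the three delimited subsequences to align significantly with the matching protein before being accepted.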

(iv) Protein domain matching: The Ensembl and NCBI transcripts were searched against the ASTRAL SCOP protein domain database [27], using BLAST (e-value ≤ 10⁻⁴), and the best-matching domains at each position in a transcript were retained, as described previously [3,8].

We also extracted zinc-finger motif assignments from the feature table records of the UniProt/SWISSPROT database [25].


Checking for evolution through frameshifting

We checked for evidence of protein structure evolution through frameshift, using a modification of the initial pipeline (Figure 5). All steps were performed as before, except with SCOP protein domain sequences in lieu of SWISSPROT sequences. Then, we checked for any significant undisrupted matches to other protein domains at the same positions as these frameshifted protein domain matches, using a similar protocol of BLAST database searching, followed by refined alignment using FASTX/Y.

Figure 1 Three examples of dmRNAs. The translated dmRNA sequence is shown along with the corresponding nucleotide sequence; the aligning protein sequence is shown above these in each case. They are as follows: (a) a multiply-disrupted example (homologous to a cytochrome P450); (b) a multiply-disrupted example from a zinc-finger-containing transcription factor family; (c) an alternative splicing of the transmembrane sugar transport