- Research article
- Open Access
Evolution of alternative and constitutive regions of mammalian 5'UTRs
BMC Genomicsvolume 10, Article number: 162 (2009)
Alternative splicing (AS) in protein-coding sequences has emerged as an important mechanism of regulation and diversification of animal gene function. By contrast, the extent and roles of alternative events including AS and alternative transcription initiation (ATI) within the 5'-untranslated regions (5'UTRs) of mammalian genes are not well characterized.
We evaluated the abundance, conservation and evolution of putative regulatory control elements, namely, upstream start codons (uAUGs) and open reading frames (uORFs), in the 5'UTRs of human and mouse genes impacted by alternative events. For genes with alternative 5'UTRs, the fraction of alternative sequences (those present in a subset of the transcripts) is much greater than that in the corresponding coding sequence, conceivably, because 5'UTRs are not bound by constraints on protein structure that limit AS in coding regions. Alternative regions of mammalian 5'UTRs evolve faster and are subject to a weaker purifying selection than constitutive portions. This relatively weak selection results in over-abundance of uAUGs and uORFs in the alternative regions of 5'UTRs compared to constitutive regions. Nevertheless, even in alternative regions, uORFs evolve under a stronger selection than the rest of the sequences, indicating that some of the uORFs are conserved regulatory elements; some of the non-conserved uORFs could be involved in species-specific regulation.
The findings on the evolution and selection in alternative and constitutive regions presented here are consistent with the hypothesis that alternative events, namely, AS and ATI, in 5'UTRs of mammalian genes are likely to contribute to the regulation of translation.
Alternative splicing (AS) has emerged as a major mechanism for regulating gene expression and function in animals, particularly, in mammals. Large-scale studies based on mapping of expressed sequence data on genomic sequences have produced estimates of as many as 30–60% of human genes undergoing alternative splicing [1–5]. The impact of alternative splicing on protein function has been studied in great detail and is generally recognized as a source of protein diversity that expands the repertoire of protein function [6–8].
In contrast, little is known about the prevalence and impact of alternative events, such as AS and alternative transcription initiation (ATI), in 5'-untranslated regions (5'UTRs). AS and ATI are the primary sources of 5'UTR transcript diversity, and several reports have conjectured that these mechanisms might play an important role in orchestrating complex regulatory mechanisms within the 5'UTRs [9–13]. Estimates of the number of genes with alternative 5'UTRs vary from 12%  to 22% , while estimates of alternative promoter usage range from 10%  to 18% . Anecdotally, studies have shown that alternative events are responsible for 5'UTR transcript diversity in mammals, but to our knowledge, there have been no detailed, genome-wide studies aimed at elucidating the functional role of transcript diversity in mammalian 5'UTRs.
The bias toward studying AS in the coding regions versus 5'UTR probably reflects two obvious sources of complications in the analysis of transcript diversity in UTRs. First, it is easier to assess the functional impact of AS in protein-coding regions because elimination or disruption of known protein domains resulting from AS is readily interpretable [6–8]. Second, boundaries of the coding sequence typically are identified with relative ease because protein sequences are defined by their start and stop codons whereas precise delineation of the 5'UTR often is problematic.
Translational efficiency of eukaryotic mRNAs depends on the presence of regulatory elements within the 5'UTR. It has been shown that the occurrence of initiation codons and open reading frames upstream of the authentic start codon (uAUGs and uORFs, respectively) can affect the translation of mRNA into protein. The presence of uAUGs and uORFs in mammalian 5'UTRs is typically associated with translational repression [10, 17] but cases of increased translation efficiency also have been reported . A complementary computational study has demonstrated substantial conservation of uAUGs and uORFs in 5'UTRs of mammalian mRNAs, suggesting that at least some of these elements are functionally important . These experimental and computational findings raise the possibility that 5'UTR diversity has the potential to produce mRNA isoforms that differ with respect to their uAUG and uORF content, which could be an important facet of the regulation of translation.
Experimental evidence shows that 5'UTR transcript diversity is achieved during transcription, via ATI, and after transcription, via AS. In some instances, both mechanisms are employed. For example, AS in the 5'UTR of human axin2, a negative regulator of Wnt/B-catenin signaling, generates three isoforms with different arrangements of uAUGs and uORFs, resulting in a set of 5'UTRs that each confer different mRNA stabilities and translational efficiencies upon the respective isoforms . Similarly, an alternatively spliced exon located in the 5'UTR of neuronal nitric-oxide synthase (nNOS) has been shown to introduce a translational control element that inhibits translation of the mRNA . The diversity and complexity of the mu-opioid receptor gene expression is achieved by a combination of alternative splicing and alternative promoter usage . Translational repression of the mouse mu-opioid receptor expression using uORFs and leaky scanning has been recently reported .
The significance of the ATI and AS mechanisms in generating 5'UTR transcript diversity lies with the ability to alter the 5'UTR landscape by rearranging both the number and type of translational control elements included in each transcript. Slight differences in the arrangement of translational control elements between isoforms can lead to major changes in regulatory effects on translation. For instance, translational regulation of the multidrug resistance-associated protein 2 (Mrp2) is mediated by ATI. The 5'UTR of Mrp2 contains four different transcription start sites and three uORFs, and experimental data show that uORF3 has a much stronger inhibitory effect on translation than uORF1 and uORF2 . A combination of ATI and AS in Dicer, a ribonuclease that mediates RNA interference at the transcriptional and post-transcriptional levels, appears to regulate translational efficiency as well, resulting in long and short transcript variants. Both variants encode uAUGs (9 in the long form and five in the short form), and although both forms show decreased levels of translation, the longer form appears to exhibit greater inhibitory effects, probably, because of the increased number of uAUGs . Likewise, Tie2, an endothelium-specific receptor tyrosine kinase required for blood vessel maturation, contains multiple transcription start sites and encodes five uORFs. Apparently, the greater the number of uORFs contained in the 5'UTR, the greater the inhibitory effects on translation, suggesting an accumulative effect on the overall level of translational efficiency .
In addition, it has been shown that both evolutionarily conserved and non-conserved regulatory control elements exhibit inhibitory effects, suggesting that non-conserved control elements may regulate at a species-specific level. For example, human oncogene mdm2 contains two uORFs, both conserved in mouse. Although the inhibitory effect of uORF1 exceeds that of uORF2, both uORFs are required for maximum inhibition of translation . In contrast, the human stress response transcription factor ATF5 contains two uORFs that are conserved in mouse, as well as three non-conserved species-specific uAUGs. The alternative transcript encoding the conserved uORFs exhibits a stronger inhibitory effect than the transcript encoding the non-conserved uAUGs, but the effect of the non-conserved elements is non-negligible as well [23, 24].
We performed a genome-wide comparative study of 5'UTR sequences in primates and rodents with the principal goal of understanding how alternative events impact regulation of translation. To this end, we compared the abundance, conservation and evolution of translational control elements within alternative and constitutive regions of 5'UTR. Alternative nucleotides were not classified according to type of alternative event, because the focus of the study was to examine the broad impact of transcript diversity on mammalian 5'UTRs. Accordingly, we reasoned that the regulatory effects of uAUGs and uORFs located in alternative regions of 5'UTRs should be the same regardless of whether the existing transcript diversity for the given gene is generated via ATI, AS, or both. We find that, although alternative regions of mammalian 5'UTRs evolve faster and are subject to a weaker purifying selection than constitutive regions, they possess extensive potential for translation regulation.
Results and discussion
5'UTR statistics for human and mouse
A genome-wide comparative analysis of 5'UTRs in human and mouse was carried out to investigate the prevalence, conservation, and evolution of putative translational control elements within alternative and constitutive regions of mammalian 5'UTRs. Starting with high quality annotation from the ASD Database, we restricted our analysis of alternative events to those that involved exclusively 5'UTR sequences, and limited the pool of alternative transcript data to reliably identified, high quality isoforms with mRNA evidence and known protein-coding sequences (see Methods for details). A total of 7735 human isoforms (with 5'UTR sequences) were used to identify 2915 human genes with alternative 5'UTRs, and 2165 mouse isoforms (also with 5'UTR sequences) were similarly used to identify a set of 909 mouse genes with alternative 5'UTRs (hereinafter ALT_5'UTR sets). These stringent criteria likely yield a relatively small subset of the complete set of genes with alternative 5'UTRs, but were chosen to eliminate potentially unreliable splice predictions and alignment artifacts from biasing the results. Our ALT_5'UTR subsets represent 12% of human genes and 3.4% of mouse genes. Previous reports suggest that 10–22% of human genes [5, 14, 15] and 19–20% of mouse genes [25, 26] contain alternative 5'UTRs. Our estimates for mouse are lower than those previously reported [25, 26]; most likely, these low values reflect the fact that there is less high quality annotation for alternative events in 5' UTRs available for mouse than there is for human. Our estimates, which are based on mRNA evidence, probably, represent the lower bounds for the number of mammalian genes with ALT_5'UTRs.
Are 5'UTR lengths distributed differently between genes that exhibit 5'UTR transcript diversity versus those that do not? We address this question by identifying a set of control genes that do not contain alternative events (hereinafter referred to as the nonALT control sets). Comparison of the 5'UTR length distributions between transcripts from the ALT and nonALT sets reveals that, in humans, 5'UTR lengths are distributed differently (Additional file 1). The human ALT 5'UTRs are, on average, slightly but significantly longer than nonALT 5'UTRs, whereas, in mouse, 5'UTR lengths are approximately the same between the ALT and nonALT sets (Table 1). The 5'UTR lengths reported here are close to the generally accepted range of 160–210 nucleotides for human [27, 28], and just above the estimate of 139 nucleotides for mammals . The origin and significance of the length difference between ALT and nonALT 5'UTRs in humans but, apparently, not in mouse remain unclear. A distinct possibility seems to be that the length of mouse ALT 5'UTRs is underestimated owing to the insufficient availability of sequences of low-abundance isoforms. Should the greater length of the human ALT 5'UTRs compared to nonALT be taken as an accurate representation of the relationship in mammals, it is likely to reflect the greater opportunity for alternative events in longer 5'UTRs.
Higher prevalence of alternative sequences in 5'UTRs compared to coding regions
To assess the prevalence of alternative events within mammalian 5'UTRs, we classified 5'UTR nucleotides as ALT or CONSTIT by examining nucleotide inclusion levels within the pool of alternative transcripts for each gene. CONSTIT nucleotides are present in all isoforms, whereas ALT nucleotides are present only in a fraction of the isoforms (Figure 1). Alternative isoforms were mapped to the genomic sequence, classified as 5'UTR or CDS (based on the 5'UTR and CDS boundaries defined in Methods and Figure 1), and then annotated as either ALT or CONSTIT. Among human genes, we counted 189,778 CONSTIT and 548,760 ALT nucleotides in the 5'UTRs included in the present analysis. For purposes of comparison, we counted 2,984,181 CONSTIT and 1,072,755 ALT nucleotides in the corresponding coding sequences of the same genes (Table 2). Thus, the ratio of alternative-to-constitutive nucleotides is reversed in the 5'UTRs compared to the CDS: the average of alternative-to-constitutive nucleotides was ~3:1 in the 5'UTR, and ~1:3 in the CDS. A qualitatively similar pattern was observed in mouse, where estimates show an average ratio of ~2:1 between alternative-to-constitutive nucleotides in the 5'UTR, and ~1:4 in the CDS of the analyzed genes (Table 2). The CDS boundaries chosen for this calculation include the variable region located between the most upstream and downstream start codons (Figure 1). We define this stretch of genomic sequence as variable because it is included as 5'UTR in some isoforms, but as CDS in others. If we remove this variable region from the CDS altogether and define the upstream CDS boundary by the most downstream start codon, we estimate an average ratio of ~1:9 between alternative-to-constitutive nucleotides in the human CDS, and a ~1:12 ratio between alternative-to-constitutive nucleotides in the mouse CDS. A variable region also exists at the CDS|3'UTR boundary, but this region was removed from our analysis by simply choosing the most upstream stop codon as the downstream CDS boundary (Figure 1). These results indicate that, among genes with alternative 5'UTRs, the extent of alternative events in the 5'UTRs is much greater than that in the coding regions.
Excess of potential regulatory elements in alternative regions of 5'UTR
Given that the fraction of alternative sequence is greater in the 5'UTRs than in the coding region among genes with alternative 5'UTR events, it is natural to ask how often 5'UTRs from this set harbor putative control elements that might be involved in translation regulation. We examined the organization of putative regulatory motifs within ALT and CONSTIT regions of 5'UTR by searching for potential regulatory elements including uAUGs and uORFs.
Previously, uAUGs have been detected in 29–48% of mammalian 5'UTRs [30–33], and anecdotally, often have been found in alternatively spliced genes [34, 35]. We compared the frequency of uAUGs between ALT and CONSTIT regions of 5'UTR to see if the 5'UTR transcript diversity contributes to the overall levels of uAUG abundance. Raw counts indicate that uAUGs are 3.9 times more frequent in ALT versus CONSTIT regions in human and 2.7 times more frequent in mouse (Table 3).
We compared the relative frequency of all 64 codons within ALT and CONSTIT regions of 5'UTR and, in accord with previous observations , found that the AUG codon is significantly depleted in both regions (Additional file 1). In addition, we compared the frequency of AUG with those of the 5 triplets that represent permutations of AUG (AGU, GUA, GAU, UAG and UGA), and again observed significant depletion of AUG in both ALT and CONSTIT regions in human and mouse (Table 4). Thus, there seems to be purifying selection against uAUGs in both alternative and constitutive regions of mammalian 5'UTRs, but this selection appears to be substantially weaker in alternative regions, resulting in the observed higher frequency of AUG.
The efficiency of translation initiation depends on the arrangement of nucleotides surrounding the translational start codon, so the AUG context is thought to be an important regulatory factor (reviewed in ). The nucleotide contexts of uAUGs located in ALT and CONSTIT regions were evaluated using previously published methods , and no significant differences were detected in either human or mouse. The majority of uAUGs in ALT and CONSTIT regions exhibited weak contexts (Additional file 2), consistent with previous results [19, 28] and with the above conclusion on selection against AUGs in 5'UTRs, in that uAUGs are unlikely to efficiently initiate translation.
Next, we estimated the size and abundance of uORFs in ALT and CONSTIT regions of 5'UTR in order to assess the role of uORFs as potential control elements. The uORFs were identified in the 5'UTR sequences of transcripts, mapped to the genomic sequence, and then classified as ALT or CONSTIT using the same approach that was applied to the mapping of uAUGs. To be included in the analysis, a uORF must be fully contained in either the ALT or CONSTIT regions of the corresponding 5'UTR (uORFs that spun the boundaries between ALT and CONSTIT regions were removed from the final data set). The great majority of uORFs (81% and 79% in mouse) indeed are fully contained within ALT or CONSTIT regions, indicating that restricting the analysis to these ORFs would not significantly bias the results.
The alternative regions of 5'UTRs contained 3.6 times more uORFs than constitutive regions in humans and 2.6 times more uORFs in mouse (Table 3). A comparison of the uORF length distributions between ALT and CONSTIT regions of 5'UTR showed that ALT uORFs are significantly longer than CONSTIT uORFs (P = 1.1 × 10-26 for human and P = 6.5 × 10-9 for mouse; Student's t-test) (Figure 2 and Table 1). To control for length differences between ALT and CONSTIT regions of 5'UTR, we compared uORF length distributions between size-matched ALT and CONSTIT regions for a subset of 320 human genes taken from the original ALT_5'UTR set. The results indicate that after controlling for length, ALT uORFs are still markedly and significantly longer than CONSTIT uORFs (72.9 versus 48.6 nucleotides; P = 0.00006; Student's t-test; Additional file 1).
We further addressed the question whether uORFs are more prevalent in 5'UTRs of genes that contain alternative regions? Comparison of uORF abundance between genes in the ALT and nonALT control sets indicates that ~53% of human and ~51% of mouse ALT genes contain uORFs, whereas only ~44% of human and ~42% of mouse nonALT control genes contain uORFs, suggesting that uORFs are indeed more common in 5'UTRs containing alternative regions (Table 1). Comparison of uORF densities (the number of uORFs per gene) between genes from the ALT and nonALT control sets showed substantially greater densities in the ALT sets in both human (P = 1.46 × 10-11; Student's t-test) and mouse (P = 0.0004; Student's t-test) (Additional file 1). Genes from the ALT set contain an average of 3.1 and 2.7 uORFs per gene in human and mouse, respectively, whereas genes from the nonALT control set contain an average of 2.4 uORFs per gene in human and 2.2 in mouse. A comparison of uORF length distributions between the ALT and nonALT sets showed that uORFs from the nonALT control set are significantly longer than uORFs located in constitutive regions of ALT genes, but shorter than uORFs located in alternative regions of ALT genes (Table 1 and Additional file 1). Taken together, these observations show that genes with alternative 5'UTRs are more likely to encode uORFs than genes without such regions, and that uORFs are more abundant in alternative than in constitutive regions. The results obtained for uORFs are fully consistent with those for the uAUGs and suggest that there is a weaker selection against translation initiation upstream of the authentic AUG in alternative regions of mammalian 5'UTRs than there is in constitutive regions. From a complementary perspective, one would note that alternative regions contain a greater concentration of potential control elements that could regulate translation.
We examined the nucleotide composition of human and mouse 5'UTRs and found that GC-contents for human and mouse 5'UTRs were estimated, respectively, at 60% and 59% (Additional file 2), in good agreement with previous reports [27, 28]. No significant differences in nucleotide composition or GC-content were detected between ALT and CONSTIT regions of the 5'UTR. Thus, the excess of uORFs in alternative versus constitutive regions of 5'UTRs is not caused by differences in nucleotide compositions of these regions.
Conservation of putative control elements in constitutive and alternative regions of mammalian 5'UTRs
The evolutionary conservation of uAUGs and uORFs in ALT and CONSTIT regions of mammalian 5'UTRs was evaluated using alignments of orthologous sequences in two pairs of closely related species, namely, human-macaque and mouse-rat. In general, and in agreement with previous observations , the uAUGs were found to be highly conserved, which is consistent with their widespread regulatory roles. Between human and macaque, 81% of the CONSTIT uAUGs are conserved as compared to 74% of the ALT uAUGs, a relatively small but statistically significant difference (P = 0.000002; Fisher's exact test) (Table 5). A similar pattern was observed for the mouse-rat comparison: 70% of CONSTIT uAUGs and 59% of ALT uAUGs are conserved (P = 0.00007; Fisher's exact test). We also compared the frequency of conserved AUGs to the frequency of the five shuffled triplets that are permutations of AUG, and found that the AUGs are significantly more conserved than the other five triplets within CONSTIT regions of 5'UTR in human (P = 3.7 × 10-6) and mouse (P = 8.2 × 10-6). The fraction of conserved AUGs in ALT regions was slightly, but significantly greater than the fraction of the other five triplets in human (P = 0.05), but not in mouse (P = 0.21) (Table 5). Thus, the AUGs in CONSTIT and ALT regions are significantly conserved compared to the background levels of conservation, but that conservation is considerably more pronounced in CONSTIT regions.
Analysis of uORF conservation yielded similar results, i.e., 71.8% of CONSTIT uORFs and 60.5% of ALT uORFs were conserved between human and macaque (P = 6.5 × 10-9; Fisher's exact test), and the corresponding values for mouse-rat were 56.6% and 41.5% (P = 0.001; Fisher's exact test) (Table 5). With regard to the substantially lower conservation of uORFs in alternative regions, it is necessary to point out that, although evolutionary conservation is a strong indicator of the functional relevance of the corresponding element, it is not a strict requirement. In particular, both conserved and non-conserved control elements have been implicated in translational repression [22, 23, 37]; obviously, non-conserved elements are more likely to exert species-specific regulation. For instance, the 5'UTR of the mu-opioid receptor gene contains non-conserved control elements that inhibit translation in a species-specific fashion . Similarly, species-specific patterns of 5'UTR alternative exon usage have been linked to species-specific patterns of tissue expression [39–41].
A recent analysis of uORF conservation in four Cryptococcus species has led to the estimate that approximately one-third of the uORFs are conserved owing to their importance for the regulation of translational efficiency . Substantial conservation of uORFs has also been observed in 5'UTRs of Saccharomyces  and plants , findings that are compatible with the possibility that many uORFs are associated with biological functions. Neafsey and Galagan report that the majority of conserved uORFs in Cryptococcus do not exhibit codon usage bias or conservation at the amino acid level, effectively ruling out the possibility that a significant fraction of the uORFs encode functional peptides . We performed a similar analysis and found no evidence of codon usage bias among the human and mouse uORF sequences included in this study. We repeated this analysis after partitioning the uORFs according to ALT and CONSTIT in human, and found that, in general, profiles of relative codon frequencies are similar between all codons in ALT and CONSTIT uORFs (Additional file 1). Thus, it appears likely that, to the extent that they are functional, most of the uORFs are control elements, although a minority might encode biologically relevant peptides. Indeed, a recent proteomic analysis resulted in the identification of 54 proteins, less than 100 amino acids in length each, that are suspected of being translated from uORFs located in the 5'UTRs of human mRNAs .
Fast evolution of alternative regions in mammalian 5'UTRs
To further assess the evolutionary forces that affect mammalian 5'UTRs, we compared the evolutionary rates between ALT and CONSTIT regions and found that ALT regions diverge faster than CONSTIT regions in both primates and rodents. Significant differences between the ALT and CONSTIT regions of 5'UTRs were detected for both synonymous (Ks) (P = 0.022 for human-macaque and P = 0.004 for mouse-rat; Wilcoxon rank test) and non-synonymous (Ka) (P < 0.001 for human-macaque and mouse-rat; Wilcoxon rank test) substitution rates in uORFs (Table 6). We calculated non-synonymous (Ka) and synonymous (Ks) substitution rates for uORFs in ALT and CONSTIT regions, and then re-calculated rates for the subset with lengths ≥ 30 nts, to ensure that short uORFs did not bias the results (Table 6). The substitution rates of the uORFs did not significantly change after removal of the short subset from either the ALT or the CONSTIT regions in human-macaque and mouse-rat comparisons. Substitution rates within ALT and CONSTIT portions of 5'UTR were calculated separately for regions that encode uORFs (K5 uORFs) and regions that lack uORFs (K5 uORFs excluded), and again, ALT regions appear to diverge more rapidly than CONSTIT regions (Table 6). The uORFs in both ALT and CONSTIT regions evolve slower and, by inference, are subject to stronger selection than the rest of the sequence (K5 uORFs < K5 uORFs exluded) (P = 0.009 for human-macaque and P = 0.04 for mouse-rat; Wilcoxon rank test). Furthermore, there is a weak but significant overall trend between substitution rates, Ka < Ks, in both ALT and CONSTIT regions, suggesting that non-negligible, although weak purifying selection affects the amino acid sequences encoded in uORFs, conceivably, owing to a small fraction of the uORFs that produce functional peptides. Taken together, these observations indicate that ALT regions are subject to a weaker purifying selection than CONSTIT regions, however, a fraction of the uORFs in the ALT regions could represent conserved translational control elements. Because the evolutionary rate distributions for uORFs in ALT and CONSTIT regions substantially overlap, it is impossible to rule out the possibility that there are small subsets of highly conserved alternative uORFs, a potential target for future investigations.
The present analysis indicates that ALT regions of mammalian 5'UTRs evolve faster than CONSTIT regions, in all likelihood, owing to relaxed purifying selection. There is no consensus regarding trends in evolutionary rates between ALT and CONSTIT regions of coding sequence (CDS) . Nevertheless, most comparisons indicate a higher non-synonymous substitution rate (Ka) in alternative exons [47, 48], but a significantly lower synonymous substitution rate (Ks) , resulting in a much higher Ka/Ks ratio than in constitutive exons. The increased Ka values in alternative regions, at least in part, seem to be due to positive selection at the protein sequence level . The cause of the low Ks values in ALT regions is unclear, but might have to do with more stringent requirements for RNA secondary structure in AS [49, 51]. Here we did not observe this paradoxical relationship between Ka and Ks in ALT and CONSTIT uORFs in mammalian 5'UTRs (even after removing short uORFs from the calculation), but instead found a consistent pattern for all measured rates (Table 6). The likely explanation for this apparent difference between the evolution of alternative regions in 5'UTRs and CDS is that for uORFs, purifying selection at the amino acid level is weak at best and there is no positive selection; in contrast, selection for RNA secondary structure would almost equally affect Ka and Ks.
Tight regulation of expression in genes with alternative 5'UTRs
We characterized the biological roles of genes that contain alternative regions in 5'UTRs, by examining patterns of gene expression and functional annotation for genes and transcripts within the ALT_5'UTR sets. Gene expression patterns for ALT_5'UTRs were evaluated using Atlas2 microarray and expressed sequence tag (EST) data. Gene Ontology annotation was used to classify ALT_5'UTRs according to biological process, molecular function and cellular localization patterns.
Average probe expression levels were calculated using probe data for the human genes in the analyzed set. Atlas2 expression data were separated into two groups: the ALT_5'UTR set (2097 probes) and the nonALT CONTROL set (8060 probes). We calculated average probe expression levels across the 79 tissue types in human and found that transcripts from the ALT_5'UTR set were expressed at significantly lower levels, on average, than transcripts from the nonALT CONTROL set (P = 7.3 × 10-13; Student's t-test). Furthermore, when sets of 1611 ALT and nonALT transcripts with size-matched 5'UTRs were compared, the statistically significant difference remained (P = 5.0 × 10-5; Student's t-test). This observation indicates that the difference in expression levels could not be explained, simply, by the greater average length of the 5'UTRs in the ALT_5'UTR set.
Average probe expression levels were also calculated using EST abundance as a measure of expression. The number of gene-specific EST sequences in the EST databases gives a reasonably accurate approximation of relative gene expression . Alignments of transcript sequences from ALT_5'UTR and nonALT CONTROL sets with ESTs from the human normal tissue GenBank EST libraries were selected for analysis using thresholds given in the Methods section. Gene expression levels based on the analysis of EST database were calculated for 57 different human tissues. Significant differences between the EST data for ALT_5'UTR and nonALT CONTROL sets was demonstrated with a Monte Carlo approximation of Fisher's exact test (P < 10-5); compatible results were obtained with sets of 5733 transcripts with length-matched 5'UTRs (P < 10-3). Thus, the analysis of EST abundance data confirmed that transcripts from the ALT_5'UTR set that contain alternative regions in 5'UTRs are expressed at lower levels, on average, than transcripts from the nonALT CONTROL set, and the difference cannot be explained solely through the length differences between the 5'UTRs.
We classified ALT_5'UTR genes according to their biological roles, by searching for evidence of keyword enrichment from Gene Ontology annotation. The results suggest that the ALT_5'UTR set is enriched for genes that are strongly and tightly regulated. Human and mouse genes with alternative 5'UTRs are significantly enriched for keywords associated with biological functions such as signal transduction, receptor activity and translation (Table 7 and Additional file 2). This subset of genes also includes a large fraction of growth factors and transcription factors, which are known to be finely and strongly regulated [53, 54]. Furthermore, in accordance with this observation, we found that over 17% of the annotated genes in our data set of human genes with alternative events in the 5'UTRs are classified as "precursor" proteins, which is compatible with the tight regulation of protein expression in this set of genes.
Qualitatively, the two groups of observations, those on the lower expression level of genes containing alternative regions in 5'UTRs and those on the tight regulation of the corresponding genes, appear congruent and compatible with plausible hypothesis that these genes are subject to especially strong down-regulation at both levels, transcription and translation, with the control elements in the alternative regions involved in the latter. This hypothesis is strengthened by results from a recent study which show that uORF-containing transcripts, on average, are expressed at lower levels and have shorter half-lives than transcripts without uORFs . Thus, the effects of translational repression in uORF-containing transcripts, in part, might be achieved via an RNA decay mechanism.
All findings presented here seem to be consistent with the hypothesis that alternative events, namely, AS and ATI, in 5'UTRs of mammalian genes contribute to the regulation of translation. At least within the set of genes that was conservatively defined to include genes with reliably demonstrated alternative events within 5'UTRs, the fraction of the 5'UTRs that is involved in an alternative event is much greater than that in the corresponding coding regions. In retrospect, this finding might not be particularly unexpected considering that 5'UTRs are not bound by constraints on protein structure and function, which limit the number of alternative nucleotides that are admissible in coding regions. The ratio of alternative-to-constitutive nucleotides is much higher in 5'UTRs than in coding regions, suggesting the possibility that alternative regions play a major role in regulating translation of the respective genes. With regard to this hypothesis, the results of the present analysis are somewhat ambiguous. The alternative regions of mammalian 5'UTRs contain a greater density of potential control elements (uAUGs and uORFs) than constitutive regions, but these elements are less conserved than those in constitutive regions. Thus, the fraction of conserved control elements among the uORFs contained within the alternative regions of mammalian 5'UTRs is lower than that in constitutive regions. Nevertheless, the uORF sequences even in alternative regions are, on average, subject to a stronger selection than the rest of the sequence. Furthermore, there is anecdotal experimental evidence that implicates non-conserved uORFs in translation regulation. Therefore, such variable elements provide considerable potential for regulation that, however, can be adequately explored only by direct experimentation.
The genes containing alternative regions in the 5'UTRs are relatively lowly expressed and typically belong to functional categories such as transcription factors, receptors, and other signaling pathway molecules, whose expression is strongly and tightly controlled. Together with the observations on the patterns of evolution of potential regulatory elements, these trends indicate that AS and ATI are probably important mechanisms of translation regulation in mammals.
The 5'UTR data set
Alternative splicing data were obtained from the ASD Database for human (Ensembl v36.35i) and mouse (Ensembl v37.34e) . Those ASD isoforms that were generated on the basis of computational predictions and/or EST evidence alone were discarded, in effort to eliminate potentially unreliable splice predictions and alignment artifacts from biasing the results. In order to be included in the analysis, isoforms had to meet the following criteria: 1) isoforms must have supporting mRNA evidence, and 2) start and stop codon positions for coding sequences (CDS) must be known. Start codon positions were used to identify the subset of isoforms that included 5'UTR|CDS boundaries. The 7735 human transcripts selected by these criteria were used to identify 2915 human genes on the basis of the ASD information, and 2165 mouse transcripts were similarly used to identify 909 mouse genes. This filtering method underestimates the number of genes with alternative regions in 5'UTRs but yields a high quality dataset based upon the alignment of full-length mRNAs with protein and start codon annotations. We refer to these sets of human and mouse genes as the ALT_5'UTR sets, because the 5'UTRs of these genes contain alternative (ALT) and constitutive (CONSTIT) nucleotides. By definition, alternative nucleotides are present in a fraction of the transcripts of the respective genes, whereas constitutive nucleotides are present in all of the genes transcripts.
Determination of the boundaries of 5'UTR and CDS
The 5'UTR and CDS boundaries for genes in the ALT_5'UTR set were determined by combining start and stop codon annotations from all isoforms that mapped to a gene. The 5'UTR boundaries for each gene were determined as follows: the upstream 5'UTR boundary was selected by choosing the most upstream isoform position; the downstream 5'UTR boundary was selected by choosing the most upstream start codon position from the set of isoforms that map to the gene (if more than one authentic start codon exists among the set of isoforms, the most upstream start codon position is chosen) (Figure 1). These conservative criteria ensure that protein-coding regions are excluded from the analysis of 5'UTR sequence variability. The exclusion of fragmentary sequence data is especially important in identifying the upstream 5'UTR boundary, as short non-coding RNA sequences are often positioned immediately upstream of the 5'-end of coding genes .
The CDS boundaries were determined as follows: the upstream CDS boundary was selected by choosing the most upstream start codon position from the set of isoforms; the downstream CDS boundary was selected by choosing the most upstream stop codon position from the set of isoforms that map to the gene (if more than one authentic stop codon exists, the most upstream stop codon position is chosen (Figure 1).
Detection of potential regulatory elements in mammalian 5'UTRs
Starting with 5'UTR leader sequences from the isoforms in the ALT_5'UTR set, we calculated the uAUG and uORF abundance in ALT and CONSTIT regions of these 5'UTR sequences. The uAUGs were identified by searching for AUG triplets within the 5'UTRs. The uAUGs were then mapped to the genomic sequence, where redundant elements (uAUGs from different isoforms that mapped to the same genomic coordinates) were removed from the final set. The uORFs were also detected in the 5'UTRs and similarly mapped to the genomic sequence. The uORFs were identified using the EMBOSS program getorf , with the following criteria: 1) uORFs must be at least 6 nucleotides in length, 2) a uORF must start with methionine and end with one of the three stop codons, and 3) uORFs are identified in all reading frames.
uAUGs and uORFs were classified as ALT or CONSTIT based upon their overlap with ALT and CONSTIT regions of the corresponding genomic sequence. Only uAUGs and uORFs that are fully contained within ALT and CONSTIT regions of 5'UTRs were included in this study (i.e., uAUGs and uORFs that overlapped between constitutive and alternative regions were excluded).
Identification of orthologous genes
Human-macaque (human version NCBI36; macaque version MMUL1) and mouse-rat (mouse version NCBI36; rat version RGSC3.4) orthologous gene pairs were downloaded from Ensembl using the BioMart data mining tool . 5'UTR alignments were generated using the OWEN alignment tool  with the following parameters: a P-value < 0.001 for each hit was required, and 5'UTRs were required to be bound at the 3' ends by exons that align across > 80% of length . For the beginning of the CDS, alignment of the nucleotide sequences was guided by the amino acid sequence alignment . Putative ortholog alignments were cleaned using previously reported thresholds . In cases where greater than 40% of the gaps or unannotated regions of orthologous sequences did not align, the orthologs were removed from the final set. For example, 5'UTRs of many macaque genes are not properly annotated (contain 'NNNNN'), making it difficult to identify the upstream 5'UTR boundary. The human-macaque alignments with uncertainties of this type were discarded. In all, in a total of ~2800 human-macaque and ~900 mouse-rat whole gene alignments were generated.
Comparison of substitution rates in coding and non-coding DNA
Synonymous (Ks) and non-synonymous (Ka) substitution rates for the uORFs located in the 5'UTR were calculated using the Pamilo-Bianchi-Li method [64, 65] which takes into account transition and transversion rates. Rates of divergence were calculated separately for uORFs located in constitutive and alternative regions of 5'UTR. We calculated Ka and Ks separately for all uORFs and for the subset of uORFs ≥ 30 nucleotides in length, to ensure that short uORFs did not bias the results. Substitution rates for the 5'-untranslated regions with uORFs (K5 uORFs) and without uORFs (K5 uORFs excluded) were calculated along the 5'UTRs for each gene using Kimura's two-parameter model .
Gene expression analysis
GNF Atlas2 expression data was downloaded from the UCSC genome browser (gnfAtlas2 table) to study gene expression patterns in human (Mar.2004 assembly) and mouse (Aug.2005 assembly) [67, 68]. GenBank mRNA accession IDs were used to map isoform sequences to probe data. Probe data were partitioned into two groups: the ALT_5'UTR sets, and the nonALT CONTROL sets, which contained the full complement of probe data. The nonALT CONTROL set consists of transcripts whose 5'UTRs have not been shown to undergo alternative events such as AS or ATI. Atlas2 expression data are classified into 79 distinct tissue types in human and 61 tissues types in mouse, and duplicate intensity values for each tissue type are included in the raw data. Replicates were combined by calculating the average intensity value for each tissue type. Average expression levels were calculated separately for individual probes and individual tissue types. Average probe expression levels were calculated by summing intensity values across all tissue types in the probe, and dividing by the total number of tissues. Average tissue expression levels were calculated by summing intensity values for a set of probes and dividing by the total number of probes in the dataset.
Gene expression levels based on EST abundance
Gene expression levels were also evaluated by tallying the numbers of gene-specific EST sequences in the databases. Transcript sequences from the ALT_5'UTR set and the nonALT control set were aligned with ESTs from the human normal tissue GenBank EST libraries (~8 millions of ESTs, release 071808) using the BLASTN program . EST hits with the identity more than 95% and longer than 80% of EST sequence length were accepted as matches. Gene expression levels based on EST abundance calculated for 57 normal human tissues  were used for statistical analysis. Similar tissue-specific preferences were considered for both the ALT_5'UTR and nonALT control sets in the final classification and statistical analysis. A Monte Carlo approximation of Fisher's exact test implemented in the COLLAPSE program  was used to assess the significance of the differences between the EST data for ALT_5'UTR and nonALT control sets.
Gene Ontology Annotation
Functional annotation for human and mouse was downloaded from the Gene Ontology (GO) database . Starting with a total of 16,468 annotated human genes, GO annotations were mapped to 89% (2608) of the genes in the ALT_5'UTR set. With 17,480 annotated mouse genes, GO annotations were mapped to 94% (861) of the genes in our ALT_5'UTR set. Keyword frequencies were tabulated for the ALT and ALL sets, and normalized by the total numbers in each set. P-values were calculated using the χ2 test.
Brett D, Hanke J, Lehmann G, Haase S, Delbruck S, Krueger S, Reich J, Bork P: EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 2000, 474 (1): 83-86. 10.1016/S0014-5793(00)01581-7.
Mironov AA, Fickett JW, Gelfand MS: Frequent alternative splicing of human genes. Genome Res. 1999, 9 (12): 1288-1293. 10.1101/gr.9.12.1288.
Croft L, Schandorff S, Clark F, Burrage K, Arctander P, Mattick JS: ISIS, the intron information system, reveals the high frequency of alternative splicing in the human genome. Nat Genet. 2000, 24 (4): 340-341. 10.1038/74153.
Kan Z, Rouchka EC, Gish WR, States DJ: Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 2001, 11 (5): 889-900. 10.1101/gr.155001.
Modrek B, Resch A, Grasso C, Lee C: Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 2001, 29 (13): 2850-2859. 10.1093/nar/29.13.2850.
Black DL: Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell. 2000, 103 (3): 367-370. 10.1016/S0092-8674(00)00128-8.
Kriventseva EV, Koch I, Apweiler R, Vingron M, Bork P, Gelfand MS, Sunyaev S: Increase of functional diversity by alternative splicing. Trends Genet. 2003, 19 (3): 124-128. 10.1016/S0168-9525(03)00023-4.
Resch A, Xing Y, Modrek B, Gorlick M, Riley R, Lee C: Assessing the impact of alternative splicing on domain interactions in the human proteome. J Proteome Res. 2004, 3 (1): 76-83. 10.1021/pr034064v.
Irvin-Wilson CV, Chaudhuri G: Alternative initiation and splicing in dicer gene expression in human breast cells. Breast Cancer Res. 2005, 7 (4): R563-569. 10.1186/bcr1043.
Ji H, Zhang Y, Zheng W, Wu Z, Lee S, Sandberg K: Translational regulation of angiotensin type 1a receptor expression and signaling by upstream AUGs in the 5' leader sequence. J Biol Chem. 2004, 279 (44): 45322-45328. 10.1074/jbc.M407261200.
Pan YX: Diversity and complexity of the mu opioid receptor gene: alternative pre-mRNA splicing and promoters. DNA Cell Biol. 2005, 24 (11): 736-750. 10.1089/dna.2005.24.736.
Park EH, Lee JM, Pelletier J: The Tie2 5' untranslated region is inhibitory to 5' end-mediated translation initiation. FEBS Lett. 2006, 580 (5): 1309-1319. 10.1016/j.febslet.2006.01.049.
Zhang Y, Li W, Vore M: Translational regulation of rat multidrug resistance-associated protein 2 expression is mediated by upstream open reading frames in the 5' untranslated region. Mol Pharmacol. 2007, 71 (1): 377-383. 10.1124/mol.106.029793.
Nagasaki H, Arita M, Nishizawa T, Suwa M, Gotoh O: Automated classification of alternative splicing and transcriptional initiation and construction of visual database of classified patterns. Bioinformatics. 2006, 22 (10): 1211-1216. 10.1093/bioinformatics/btl067.
Zhang T, Haws P, Wu Q: Multiple variable first exons: a mechanism for cell- and tissue-specific gene regulation. Genome Res. 2004, 14 (1): 79-89. 10.1101/gr.1225204.
Trinklein ND, Aldred SJ, Saldanha AJ, Myers RM: Identification and functional analysis of human transcriptional promoters. Genome Res. 2003, 13 (2): 308-312. 10.1101/gr.794803.
Song KY, Hwang CK, Kim CS, Choi HS, Law PY, Wei LN, Loh HH: Translational repression of mouse mu opioid receptor expression via leaky scanning. Nucleic Acids Res. 2007, 35 (5): 1501-1513. 10.1093/nar/gkm034.
Reynolds K, Zimmer AM, Zimmer A: Regulation of RAR beta 2 mRNA expression: evidence for an inhibitory peptide encoded in the 5'-untranslated region. J Cell Biol. 1996, 134 (4): 827-835. 10.1083/jcb.134.4.827.
Churbanov A, Rogozin IB, Babenko VN, Ali H, Koonin EV: Evolutionary conservation suggests a regulatory function of AUG triplets in 5'-UTRs of eukaryotic genes. Nucleic Acids Res. 2005, 33 (17): 5512-5520. 10.1093/nar/gki847.
Hughes TA, Brady HJ: Expression of axin2 is regulated by the alternative 5'-untranslated regions of its mRNA. J Biol Chem. 2005, 280 (9): 8581-8588. 10.1074/jbc.M410806200.
Newton DC, Bevan SC, Choi S, Robb GB, Millar A, Wang Y, Marsden PA: Translational regulation of human neuronal nitric-oxide synthase by an alternatively spliced 5'-untranslated region leader exon. J Biol Chem. 2003, 278 (1): 636-644. 10.1074/jbc.M209988200.
Jin X, Turcott E, Englehardt S, Mize GJ, Morris DR: The two upstream open reading frames of oncogene mdm2 have different translational regulatory properties. J Biol Chem. 2003, 278 (28): 25716-25721. 10.1074/jbc.M300316200.
Watatani Y, Ichikawa K, Nakanishi N, Fujimoto M, Takeda H, Kimura N, Hirose H, Takahashi S, Takahashi Y: Stress-induced translation of ATF5 mRNA is regulated by the 5'-untranslated region. J Biol Chem. 2008, 283 (5): 2543-2553. 10.1074/jbc.M707781200.
Matveeva OV, Shabalina SA: Intermolecular mRNA-rRNA hybridization and the distribution of potential interaction regions in murine 18S rRNA. Nucleic Acids Res. 1993, 21 (4): 1007-1011. 10.1093/nar/21.4.1007.
Zavolan M, van Nimwegen E, Gaasterland T: Splice variation in mouse full-length cDNAs identified by mapping to the mouse genome. Genome Res. 2002, 12 (9): 1377-1385. 10.1101/gr.191702.
Nagasaki H, Arita M, Nishizawa T, Suwa M, Gotoh O: Species-specific variation of alternative splicing and transcriptional initiation in six eukaryotes. Gene. 2005, 364: 53-62. 10.1016/j.gene.2005.07.027.
Pesole G, Mignone F, Gissi C, Grillo G, Licciulli F, Liuni S: Structural and functional features of eukaryotic mRNA untranslated regions. Gene. 2001, 276 (1–2): 73-81. 10.1016/S0378-1119(01)00674-6.
Rogozin IB, Kochetov AV, Kondrashov FA, Koonin EV, Milanesi L: Presence of ATG triplets in 5' untranslated regions of eukaryotic cDNAs correlates with a 'weak' context of the start codon. Bioinformatics. 2001, 17 (10): 890-900. 10.1093/bioinformatics/17.10.890.
Lynch M, Scofield DG, Hong X: The evolution of transcription-initiation sites. Mol Biol Evol. 2005, 22 (4): 1137-1146. 10.1093/molbev/msi100.
Davuluri RV, Suzuki Y, Sugano S, Zhang MQ: CART classification of human 5' UTR sequences. Genome Res. 2000, 10 (11): 1807-1816. 10.1101/gr.GR-1460R.
Pesole G, Gissi C, Grillo G, Licciulli F, Liuni S, Saccone C: Analysis of oligonucleotide AUG start codon context in eukariotic mRNAs. Gene. 2000, 261 (1): 85-91. 10.1016/S0378-1119(00)00471-6.
Suzuki Y, Ishihara D, Sasaki M, Nakagawa H, Hata H, Tsunoda T, Watanabe M, Komatsu T, Ota T, Isogai T, et al: Statistical analysis of the 5' untranslated region of human mRNA using "Oligo-Capped" cDNA libraries. Genomics. 2000, 64 (3): 286-297. 10.1006/geno.2000.6076.
Iacono M, Mignone F, Pesole G: uAUG and uORFs in human and rodent 5'untranslated mRNAs. Gene. 2005, 349: 97-105. 10.1016/j.gene.2004.11.041.
Anant S, Mukhopadhyay D, Hirano K, Brasitus TA, Davidson NO: Apobec-1 transcription in rat colon cancer: decreased apobec-1 protein production through alterations in polysome distribution and mRNA translation associated with upstream AUGs. Biochim Biophys Acta. 2002, 1575 (1–3): 54-62.
Araud T, Genolet R, Jaquier-Gubler P, Curran J: Alternatively spliced isoforms of the human elk-1 mRNA within the 5' UTR: implications for ELK-1 expression. Nucleic Acids Res. 2007, 35 (14): 4649-4663. 10.1093/nar/gkm482.
Kozak M: Pushing the limits of the scanning mechanism for initiation of translation. Gene. 2002, 299 (1–2): 1-34. 10.1016/S0378-1119(02)01056-9.
Child SJ, Miller MK, Geballe AP: Translational control by an upstream open reading frame in the HER-2/neu transcript. J Biol Chem. 1999, 274 (34): 24335-24341. 10.1074/jbc.274.34.24335.
Shabalina SA, Zaykin DV, Gris P, Ogurtsov AY, Gauthier J, Shibata K, Tchivileva IE, Belfer I, Mishra B, Kiselycznyk C, et al: Expansion of the human mu-opioid receptor gene architecture: novel functional variants. Hum Mol Genet. 2009, 18 (6): 1037-1051. 10.1093/hmg/ddn439.
Kleene KC: Patterns, mechanisms, and functions of translation regulation in mammalian spermatogenic cells. Cytogenet Genome Res. 2003, 103 (3–4): 217-224.
Osada N, Hirata M, Tanuma R, Kusuda J, Hida M, Suzuki Y, Sugano S, Gojobori T, Shen CK, Wu CI, et al: Substitution rate and structural divergence of 5'UTR evolution: comparative analysis between human and cynomolgus monkey cDNAs. Mol Biol Evol. 2005, 22 (10): 1976-1982. 10.1093/molbev/msi187.
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-476. 10.1038/nature07509.
Neafsey DE, Galagan JE: Dual modes of natural selection on upstream open reading frames. Mol Biol Evol. 2007, 24 (8): 1744-1751. 10.1093/molbev/msm093.
Cvijovic M, Dalevi D, Bilsland E, Kemp GJ, Sunnerhagen P: Identification of putative regulatory upstream ORFs in the yeast genome using heuristics and evolutionary conservation. BMC Bioinformatics. 2007, 8: 295-10.1186/1471-2105-8-295.
Hayden CA, Jorgensen RA: Identification of novel conserved peptide uORF homology groups in Arabidopsis and rice reveals ancient eukaryotic origin of select groups and preferential association with transcription factor-encoding genes. BMC Biol. 2007, 5: 32-10.1186/1741-7007-5-32.
Oyama M, Kozuka-Hata H, Suzuki Y, Semba K, Yamamoto T, Sugano S: Diversity of translation start sites may define increased complexity of the human short ORFeome. Mol Cell Proteomics. 2007, 6 (6): 1000-1006. 10.1074/mcp.M600297-MCP200.
Artamonova , Gelfand MS: Comparative genomics and evolution of alternative splicing: the pessimists' science. Chem Rev. 2007, 107 (8): 3407-3430. 10.1021/cr068304c.
Ermakova EO, Nurtdinov RN, Gelfand MS: Fast rate of evolution in alternatively spliced coding regions of mammalian genes. BMC Genomics. 2006, 7: 84-10.1186/1471-2164-7-84.
Chen FC, Wang SS, Chen CJ, Li WH, Chuang TJ: Alternatively and constitutively spliced exons are subject to different evolutionary forces. Mol Biol Evol. 2006, 23 (3): 675-682. 10.1093/molbev/msj081.
Xing Y, Lee C: Evidence of functional selection pressure for alternative splicing events that accelerate evolution of protein subsequences. Proc Natl Acad Sci USA. 2005, 102 (38): 13526-13531. 10.1073/pnas.0501213102.
Ramensky VE, Nurtdinov RN, Neverov AD, Mironov AA, Gelfand MS: Positive selection in alternatively spliced exons of human genes. Am J Hum Genet. 2008, 83 (1): 94-98. 10.1016/j.ajhg.2008.05.017.
Xing Y, Lee C: Alternative splicing and RNA selection pressure – evolutionary consequences for eukaryotic genomes. Nat Rev Genet. 2006, 7 (7): 499-509. 10.1038/nrg1896.
Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA: Selection for short introns in highly expressed genes. Nat Genet. 2002, 31 (4): 415-418.
Kozak M: An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 1987, 15 (20): 8125-8148. 10.1093/nar/15.20.8125.
Geballe AP, Morris DR: Initiation codons within 5'-leaders of mRNAs as regulators of translation. Trends Biochem Sci. 1994, 19 (4): 159-164. 10.1016/0968-0004(94)90277-1.
Matsui M, Yachie N, Okada Y, Saito R, Tomita M: Bioinformatic analysis of post-transcriptional regulation by uORF in human and mouse. FEBS Lett. 2007, 581 (22): 4184-4188. 10.1016/j.febslet.2007.07.057.
Thanaraj TA, Stamm S, Clark F, Riethoven JJ, Le Texier V, Muilu J: ASD: the Alternative Splicing Database. Nucleic Acids Res. 2004, D64-69. 10.1093/nar/gkh030. 32 Database
Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL, et al: RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007, 316 (5830): 1484-1488. 10.1126/science.1138341.
Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.
Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004, 14 (1): 160-169. 10.1101/gr.1645104.
Ogurtsov AY, Roytberg MA, Shabalina SA, Kondrashov AS: OWEN: aligning long collinear regions of genomes. Bioinformatics. 2002, 18 (12): 1703-1704. 10.1093/bioinformatics/18.12.1703.
Kondrashov AS, Shabalina SA: Classification of common conserved sequences in mammalian intergenic regions. Hum Mol Genet. 2002, 11 (6): 669-674. 10.1093/hmg/11.6.669.
Shabalina SA, Ogurtsov AY, Lipman DJ, Kondrashov AS: Patterns in interspecies similarity correlate with nucleotide composition in mammalian 3'UTRs. Nucleic Acids Res. 2003, 31 (18): 5433-5439. 10.1093/nar/gkg751.
Shabalina SA, Ogurtsov AY, Rogozin IB, Koonin EV, Lipman DJ: Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals. Nucleic Acids Res. 2004, 32 (5): 1774-1782. 10.1093/nar/gkh313.
Li WH: Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993, 36 (1): 96-99. 10.1007/BF02407308.
Pamilo P, Bianchi NO: Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol Biol Evol. 1993, 10 (2): 271-281.
Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16 (2): 111-120. 10.1007/BF01731581.
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.
Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, et al: The UCSC Genome Browser Database. Nucleic Acids Res. 2003, 31 (1): 51-54. 10.1093/nar/gkg129.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
Ogurtsov AY, Marino-Ramirez L, Johnson GR, Landsman D, Shabalina SA, Spiridonov NA: Expression patterns of protein kinases correlate with gene architecture and evolutionary rates. PLoS ONE. 2008, 3 (10): e3599-10.1371/journal.pone.0003599.
Khromov-Borisov NN, Rogozin IB, Pegas Henriques JA, de Serres FJ: Similarity pattern analysis in mutational distributions. Mutat Res. 1999, 430 (1): 55-74.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
The authors' research is supported by the DHHS (NIH, National Library of Medicine) intramural funds.
The authors declare that they have no competing interests.
AMS contributed to the design of the study, performed the bulk of the data analysis and wrote the initial draft of the manuscript, AYO contributed to data analysis, IBR contributed to the design of the study and data analysis, SAS contributed to the design of the study and data analysis, EVK contributed to the design of the study and wrote the final version of the manuscript; all authors read and approved the final version.