Assessing the genomic evidence for conserved transcribed pseudogenes under selection
© Khachane and Harrison. 2009
Received: 11 March 2009
Accepted: 15 September 2009
Published: 15 September 2009
Skip to main content
© Khachane and Harrison. 2009
Received: 11 March 2009
Accepted: 15 September 2009
Published: 15 September 2009
Transcribed pseudogenes are copies of protein-coding genes that have accumulated indicators of coding-sequence decay (such as frameshifts and premature stop codons), but nonetheless remain transcribed. Recent experimental evidence indicates that transcribed pseudogenes may regulate the expression of homologous genes, through antisense interference, or generation of small interfering RNAs (siRNAs). Here, we assessed the genomic evidence for such transcribed pseudogenes of potential functional importance, in the human genome. The most obvious indicators of such functional importance are significant evidence of conservation and selection pressure.
A variety of pseudogene annotations from multiple sources were pooled and filtered to obtain a subset of sequences that have significant mid-sequence disablements (frameshifts and premature stop codons), and that have clear evidence of full-length mRNA transcription. We found 1750 such transcribed pseudogene annotations (TPAs) in the human genome (corresponding to ~11.5% of human pseudogene annotations). We checked for syntenic conservation of TPAs in other mammals (rhesus monkey, mouse, rat, dog and cow). About half of the human TPAs are conserved in rhesus monkey, but strikingly, very few in mouse (~3%). The TPAs conserved in rhesus monkey show evidence of selection pressure (relative to surrounding intergenic DNA) on: (i) their GC content, and (ii) their rate of nucleotide substitution. This is in spite of distributions of Ka/Ks (ratios of non-synonymous to synonymous substitution rates), congruent with a lack of protein-coding ability. Furthermore, we have identified 68 human TPAs that are syntenically conserved in at least two other mammals. Interestingly, we observe three TPA sequences conserved in dog that have intermediate character (i.e., evidence of both protein-coding ability and pseudogenicity), and discuss the implications of this.
Through evolutionary analysis, we have identified candidate sequences for functional human transcribed pseudogenes, and have pinpointed 68 strong candidates for further investigation as potentially functional transcribed pseudogenes across multiple mammal species.
Pseudogenes (derived from protein-coding genes) are gene copies that show signs diagnostic of protein-coding deficiency. Such signs commonly include premature stop codons and coding-sequence frameshifts, or neutral codon substitution patterns [1, 2]. Pseudogenes can arise in two chief ways: (i) from retrotransposition, i.e., reverse transcription of a cellular messenger RNA, followed by reintegration into the genomic DNA [3–5], or (ii) from decay of genes that originated (however long ago) from duplication [1, 6]. These genomic entities have generally been believed to be non-functional. Historically, there were some early individual reports of transcribed pseudogenes in the scientific literature [2, 7, 8]. Examples included: human leukocyte interferon (LeIFN) pseudogene , glyceraldehyde-3-phosphate dehydrogenase pseudogene , glucocerebrosidase pseudogene , steroid 21-hydrolase pseudogene , glutamine synthetase pseudogene , tumor repressor ΨPTEN .
More recently, genome-wide screens have detected transcription evidence for many retropseudogenes (>200) in humans [14–17]. In mouse oocytes, transcribed pseudogenes have been shown to play a significant role in the generation of small interfering RNAs (siRNAs) [18, 19], which regulate the expression of homologous genes.
Collectively, these reports indicate that an unknown cohort of human transcribed pseudogenes could be potentially functional in regulation of gene transcription. A key indicator of such function is significant conservation in other mammalian genomes. Svensson et al.  have explored conservation of apparent pseudogenes in three mammals (human, chimpanzee and mouse). Their analysis revealed 30 cases of transcribed pseudogenes that are preserved in mouse. Here, we analyze the distribution of transcribed pseudogene annotations (TPAs) in the human and mouse genomes, examine their conservation in an expanded panel of mammals (rhesus monkey, mouse, rat, dog and cow), and assess evidence for significant selection pressures. TPAs that are conserved in rhesus monkey show evidence of significant selection pressure, despite also displaying codon substitution patterns characteristic of non-protein-coding sequences. Also, we have discovered a short-list of 68 putative human transcribed pseudogenes that are syntenically conserved in at least two other mammals from our panel. These sequences represent a strong subset of candidates for further investigation as functional transcribed pseudogenes.
To identify transcribed pseudogenes, transcript sequences from the Unigene, RefSeq and H-InvDB databases were mapped onto the human genome and were examined for overlap with pseudogene annotations. These pseudogene annotations were taken from the VEGA http://vega.sanger.ac.uk/ and http://pseudogene.org websites (see Methods for details). We pooled these datasets with re-mappings of: (i) 'disrupted mRNAs' (dmRNAs) , and (ii) transcribed processed pseudogenes , from previous analyses [14, 16]. Overlap of the transcripts with pseudogenes was verified through using the positions of mid-sequence disablements (frameshifts and premature stop codons) as 'anchors'. That is, at least one disablement position was required to occur in both the genomic DNA and the transcript sequence (see Methods for further details).
Percentages of TPAs in human and mouse.
Total = 866/8160 (10.6%)
# Processed = 383/3737 (10.24%)
# Non-processed = 71/1078 (6.58%)
Total = 71/4187 (1.7%)
http://Pseudogene.org(excluding ambiguous pseudogenes)
Total = 1035/13354 (7.75%)
# Processed = 767/11072 (6.93%)
# Non-processed = 268/2282 (11.74%)
Total = 65/15064 (0.5%)
We performed a similar survey in the mouse genome for TPAs. Surprisingly, in the mouse genome, we found a very low percentage of TPAs, in comparison to the human genome (<2%) (P << 0.001 for the likelihood of the number in mouse, given the human percentage as an expectation, using binomial statistics). This is despite these two animals having similar amounts of pseudogene annotation data (Table 1), and transcript data (203,785 transcript sequences in total for human, and 203,550 for mouse). This indicates that transcribed pseudogenes are rarer in mice than in humans.
Transcription of pseudogenes per se does not necessarily indicate functionality. It has been shown that transcriptional activation at a particular genomic locus has a ripple effect on the neighboring loci . It is therefore possible that many transcribed pseudogenes arise simply because of this. However, of the various identified human TPAs in our present study, those that are evolutionarily conserved across mammals due to natural selection are more likely to be biologically functional. Therefore, we set out to identify a list of such orthologous transcribed pseudogenes that have conserved in ≥2 of our panel of mammals.
Certain gene families are known to spawn large numbers of pseudogenes. Examples include olfactory receptors , ribosomal-protein genes , human thioredoxin and glutaredoxin , ABC transporters , and heat shock proteins . In such cases, identifying orthologs in other mammals using the standard bi-directional best-hit procedure is problematic, since the rates of sequence evolution may vary in different lineages and genomic regions. Furthermore, such a procedure does not work well for pseudogenes, since these sequences are not evolving like protein-coding genes, which are under strong purifying selection. Because of this, the best match obtained using blastp to a pseudogene query is expectedly the parent protein-coding gene or a pseudogene recently evolved from the parent gene. Thus, the standard bi-directional best-hit procedure alone is not sufficient. Therefore, here, we have used synteny information between two organisms to pin-point pseudogene orthologs. We have used synteny maps along with homology-based searches to identify conserved orthologs in five mammals (rhesus monkey, mouse, rat, dog and cow) (see Methods for details). We identified a set of 68 human TPAs that are conserved in at least two of these mammals, representing potentially functional candidates (see Additional file 2: Table S1). In general, although approximately half (742/1750) of the human TPAs are syntenically conserved in rhesus monkey, only 3% are syntenically conserved in mouse. This suggests that a large number of human transcribed pseudogenes are primate-specific.
Our analysis indicated that TPA orthologs have higher sequence homology in comparison to their syntenic flanking regions. Average sequence identities among different syntenic regions in human and rhesus monkey are as follows: 75.0% (s.d. 12.6%) in the 5' areas, 81.0% in the 3' areas (s.d. 13.5%), and 86.7% (s.d. 9.6%) in the TPA sequences. The difference in the sequence identities between pseudogenes and the respective flanking regions is statistically significant (Wilcoxon signed rank test, P-value: < 5e-50). In the majority of cases (~86%, 293/341), the percentage sequence identity between orthologous TPA sequences is greater than that of the flanking regions. This suggests that significant selection pressure exists to maintain them. We note that similar analysis comparing the human and chimpanzee genomes is not informative because the species are too similar. Also, comparisons of the human genome with the other mammals in our panel are not informative either, because the appropriate regions cannot be aligned accurately or significantly.
Proportions of pseudogenes that have GC contents higher relative to their flanking regions in different categories of VEGA annotated human pseudogenes.
Transcribed and conserved (monkey) pseudogenes
P-value for statistical difference*
We checked whether the age of pseudogenes could be causing the GC content differences noticed above. To do this, we looked at the GCdiff in the following (VEGA) subsets of TPAs, i.e.: (i) TPAs unique to humans; (ii) TPAs conserved in rhesus monkey only; (iii) TPAs conserved in more divergent mammals such as mouse, rat, dog and cow. We found that 82.5% (381/462) of set (i) have GCdiff > 0 (i.e., GC content of pseudogene greater than that of the flanking region). Similar percentages were observed in the other classes: 84.1% (317/377) in set (ii), and 77.78% (21/27) in set (iii). There was no statistically significant difference in the GCdiff between any two of the classes (χ2 test, P-value > 0.55), suggesting that age of pseudogenes does not have any influence on the observed GC content differences.
We decided to assess the genomic evidence for a lack of protein-coding ability in the human TPAs that are syntenically conserved in rhesus monkey. We compared TPA characteristics to the characteristics of two other groupings: (i) known human protein-coding genes with orthologs in rhesus monkey; (ii) populations of simulated sequences that are randomly mutating without coding-sequence selection pressures. The human TPA sequences are used as starting sequences for these simulations. The protocol for these simulations is described in 'Methods: Ka/Ks ratio calculations'.
In addition, we calculated the distribution of Ka/Ks values for sequences that are randomly mutating without coding-sequence selection pressures. These sequences were generated using the simulation protocol described in 'Methods: Ka/Ks ratio calculations', using the human TPAs as starting sequences (green curve, Figure 5). The Ka/Ks distribution for these simulated sequences does not peak at ~1.0, as would be classically expected. This is likely due to some inaccuracy in modeling the expected frequency for the different possible nucleotide substitutions, which varies for different genomic areas . The distribution for TPADDs peaks in the range 0.6 to 1.0. This peak is similar for the randomly-mutating sequences (Figure 5). For the TPANDDs, the peak is at lower Ka/Ks values (0.4-0.6).
As a further comparison, we have calculated the Ka/Ks curve for orthologous pairs of protein-coding genes from the rhesus monkey and the human (blue curve, Figure 5). Clearly, these protein-coding sequences behave very differently from the TPAs, with a substantial mode in the range 0.0 to 0.2. In summary, these Ka/Ks trends indicate that the substitution patterns in the TPAs generally behave like non-protein-coding sequences, and not like protein-coding ones. This is despite the overall significant conservation relative to surrounding intergenic genomic DNA that was discussed in the previous section.
The Ka/Ks results suggest that these transcribed pseudogenes are relaxing to higher Ka/Ks values, since origination from their parents. But why do they not have Ka/Ks values of ~1.0? We suggest that this is chiefly because: (i) there may be some inaccuracy in modelling the expected frequency for the different possible nucleotide substitutions, which varies for different genomic areas (as noted in the previous section); (ii) in some cases, the transcribed pseudogenes were originally protein-coding, and became disabled subsequently in multiple lineages; (iii) some of them maintain an imprint of the original coding sequence because of selection pressure for regulation of homologous genes via antisense interference (e.g., through genesis of siRNAs); (iv) selection pressures on non-synonymous codon substitution rates in protein-coding genes, may have relaxed in the pseudogenes, contributing to an apparent relative increase in Ks; (v) it is also possible that some of these sequences are currently protein-coding, and have evolved through multiple coding-sequence disablements, as discussed previously .
To examine these data more closely, we calculated whether the Ka/KsΨ-ortho values are significantly less than would be expected for mutation without coding-sequence selection pressures (using the simulational analysis described in the Methods section). Several cases with such significant values (that may indicate purifying selection typical of protein-coding sequences), are observed (represented by circles in the Figure 6 plots). These Ka/Ks values (that apparently indicate protein-coding ability) may arise for the reasons listed in the preceding paragraph.
In addition, we examined whether the TPAs contain a protein domain of known three-dimensional structure, that is disabled by a frameshift or a premature stop codon (denoted 'TPA DD s'; see Methods for details of annotation of such domains). The TPA DD s are indicated by unfilled symbols in parts (a) and (b) of Figure 6. Interestingly, in the human-dog comparisons, there are three cases of TPA orthologous pairs that have such a disabled protein domain, despite Ka/Ks values that indicate apparent purifying selection. These sequences are thus of 'intermediate' character, i.e., they have evidence of both protein-coding ability and pseudogenicity.
Transcribed pseudogenes can regulate the expression of other genes by RNA interference mechanisms. For example, antisense transcribed RNA from a NOS pseudogene regulates the expression of neural nitric oxide synthase (nNOS) protein through formation of RNA duplex [34, 35]. Therefore, we investigated how many of the TPAs have antisense homology to the annotated full-length human cDNAs (E-value < 1e-10 and alignment length > = 100 nucleotides). A small proportion (8.3%, 69/828) of the human (VEGA) pseudogenic transcripts have either complete or partial, but significant, reverse complement (antisense) homology to human cDNAs. Of these, 63 have short length strong antisense homology to human cDNAs (alignment length > = 20, mismatches < = 2). However, there is no significant association of such antisense homologies to pseudogene transcription, since non-transcribed pseudogenes have similar levels of antisense homology (7.65%, χ2 test P-value = 0.5).
Human conserved TPAs that have antisense homology to human full length cDNAs.
ENSEMBL transcript id
Antisense identity (%)
Transcribed pseudogenes can also regulate the transcription of genes by producing siRNAs that have antisense homology [18, 19]. Due to unavailability of genome-wide human siRNA data, we used the siRNA data for the mouse genome from Tam et al.  and Watanabe et al.  to check how many of the small RNAs mapped to mouse transcribed pseudogenes that we identified. Interestingly, 24 out of 136 (17.6%) mouse TPAs had siRNA mappings compared to ~1% (178/18168) of the total mouse pseudogenes. The above difference is statistically significant (P-value < 0.05, using normal statistics for the distribution of the mean number of transcribed pseudogenes in a sample of 136 cases). This demonstrates that transcribed pseudogenes are significantly likely to generate siRNAs in mouse. For comparison, in Arabidopsis thaliana, ~40% of 572 pseudogenes have small RNA mappings .
In this study, we identified hundreds of cases of putative transcribed pseudogene annotations (TPAs), in the human genome. Importantly, we detected evidence for selection pressure on these transcribed elements. These findings therefore draw wider attention towards the potential functionality of these genomic elements. In addition, we found that 68 human TPAs are conserved in at least 2 other studied mammals. These human TPAs have ancient origins dating back >120 million years ago, as evidenced by their conservation patterns across distantly related mammals. These pseudogenes represent novel genomic elements of potential functional relevance.
We have shown that human TPAs that are syntenically conserved in rhesus monkey generally behave like non-protein-coding sequences, despite significant selection pressure on them, relative to the surrounding genomic DNA. Examination of Ka/Ks values for TPAs that are conserved in more divergent species (mouse and dog), indicated that some TPAs might actually be protein-coding. However, we cannot rule out other reasons for these low Ka/Ks values. For example, it is possible that some of these sequences had phases of protein-coding ability at some evolutionary stage. Also, it is possible that there is an imprint of purifying selection on these sequences because of selection pressure to form small interfering RNAs with homologous protein-coding genes. Ultimately, these questions can only be answered by detailed experimental characterization of these molecules; our analysis here provides a rich data source for prioritizing likely candidates of functional importance as transcribed pseudogenes.
Complete genome sequences of mammals were obtained from http://www.ensembl.org (Ensembl release 47 for human genome; Ensembl release 48 for other mammals, namely, rhesus monkey, mouse, rat, cow and dog). Pseudogene annotations for both processed and nonprocessed categories, were obtained from [http://www.pseudogene.org; [37, 38]] and for VEGA pseudogenes from http://vega.sanger.ac.uk/, for disrupted mRNAs (dmRNAs) from Harrison and Yu  and for other transcribed processed pseudogenes from Harrison et al. . The Blastx program  was used to determine the parent protein coding genes for VEGA pseudogenes (using E-value < 1e-09 as significance threshold), whereas for other datasets the annotations were readily available at the respective websites mentioned above.
Orthologous counterparts to a human pseudogene are detected by the presence of a homologous sequence at the syntenic position in the other mammalian genome. Based on this criterion, a search was carried out within 100 kb nucleotides distance of the exact syntenic coordinate (because genes can shuffle locally) in the target mammal as indicated in the synteny maps, to locate the orthologous pseudogenes. 'GeneWise' tool  was used, to align the above-obtained genomic DNA sequence and the human parental protein sequence, and to detect disablements in the alignment. The following mammals were included in the analysis: monkey, mouse, rat, cow and dog. The pair wise synteny map data for the various mammals were obtained from http://genome.ucsc.edu/.
Flanking sequences 5' and 3' of human pseudogenes were individually obtained, of length equal to the length of the human pseudogene, and were each globally aligned using 'needle' module of EMBOSS package http://www.ebi.ac.uk to the corresponding flanking region sequences (10000 nucleotides 5' and 3') of monkey in a sliding window of size also equal to the length of human pseudogene. The window in which best identity score was obtained was considered as the most optimum alignment between the flanking regions, representing syntenic regions. The Wilcoxon signed rank test was used for assessing the statistical significance of the difference between the degrees of homology calculated between two orthologous pseudogenes and that between the respective (orthologous) flanking regions. Cases with pair wise sequence identities <40% were excluded.
For sequence length and GC percentage calculations, only the exonic segments of pseudogenes were considered. One thousand nucleotides upstream and downstream of a pseudogenes were considered as flanking regions. GC content is calculated as the sum of guanine and cytosine nucleotides divided by the total number of nucleotides represented in terms of percentage.
'PAL2NAL'  was used to construct codon alignments between protein sequences (conceptual amino acid translation sequences in the case of pseudogenes) and corresponding DNA sequences, separately, for orthologous pseudogenes and parental protein coding genes. 'PAML 4' package  was used to calculate Ka/Ks ratios. Orthologs of human parental protein coding genes were identified using a similar approach as that for pseudogene orthologs discussed above, and also obtained from Ensembl database.
We derived a simulation protocol to calculate Ka/Ks values for evolution without coding-sequence selection pressures. This simulation protocol is as follows: (i) the nucleotide distance (Dnt) between a sequence and its ortholog was calculated, using the program DNADIST ; (ii) for each sequence, samples of 500 simulated sequences were generated, by randomly mutating the human sequence until the Dnt value was reached; (iii) Ka/Ks was calculated using PAML , for each simulated sequence compared to the original human sequence; (iv) those original human sequences that have Ka/Ks values < 95% of simulated Ka/Ks values were labeled as potentially under significant purifying selection. For these simulations, all Ka/Ks calculations are performed on the longest ORF in the sequence.
We also analysed simulated distributions of Ka/Ks for populations of sequences mutating without coding-sequence selection pressures, starting from the human TPA sequences. These were derived simply by merging the simulated distributions of Ka/Ks for each individual TPA.
Protein domains were assigned to the TPAs, using protein structure domain sequences downloaded from the ASTRALSCOP database http://astral.berkeley.edu, as described previously . Protein domains sequences were aligned to the TPA nucleotide sequences to assess for disablement by a frameshift or premature stop codon at least 15 amino acids from the end of the aligned subsequence. Disablements were required to be detected both by blast/bl2seq and by the TFASTX program [4, 39].
Transcribed human pseudogenes were aligned to full-length annotated human cDNA to examine for any antisense homology by using the sequence-searching program BlastN from the BLAST package (E-value < 1e-10).
siRNAs have been previously determined in the mouse genome [18, 19]. Using this data we mapped the siRNA sequences onto the mouse genome using GMAP software , and checked how many of these overlap with the annotations of transcribed mouse pseudogenes.
Ortholog sequences to the human transcribed ADP-ribose pyrophosphatase pseudogene (urn:lsid:pseudogene.org:9606.Pseudogene:4346; see Table 1), were obtained from the various studied mammals and were aligned using the online ClustalW tool http://www.ebi.ac.uk/clustalw/. The most conserved segment representing 257-396 positions of the human pseudogene was considered for the phylogenetic analysis. Phylogenetic tree was constructed using 'PHYLIP' software . The tree was evaluated statistically using 1000 bootstrap iterations and was visualized using the 'NJplot' tool .
A.N.K. and P.M.H. would like to thank the funding support from the National Science and Engineering Research Council of Canada (NSERC), and from Les Fonds Québécois de la Recherche sur la Nature et les Technologies (FQRNT).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.