Detecting transcription of ribosomal protein pseudogenes in diverse human tissues from RNA-seq data
© Tonner et al.; licensee BioMed Central Ltd. 2012
Received: 12 April 2012
Accepted: 10 August 2012
Published: 21 August 2012
Ribosomal proteins (RPs) have about 2000 pseudogenes in the human genome. While anecdotal reports for RP pseudogene transcription exists, it is unclear to what extent these pseudogenes are transcribed. The RP pseudogene transcription is difficult to identify in microarrays due to potential cross-hybridization between transcripts from the parent genes and pseudogenes. Recently, transcriptome sequencing (RNA-seq) provides an opportunity to ascertain the transcription of pseudogenes. A challenge for pseudogene expression discovery in RNA-seq data lies in the difficulty to uniquely identify reads mapped to pseudogene regions, which are typically also similar to the parent genes.
Here we developed a specialized pipeline for pseudogene transcription discovery. We first construct a “composite genome” that includes the entire human genome sequence as well as mRNA sequences of real ribosomal protein genes. We then map all sequence reads to the composite genome, and only exact matches were retained. Moreover, we restrict our analysis to strictly defined mappable regions and calculate the RPKM values as measurement of pseudogene transcription levels. We report evidences for the transcription of RP pseudogenes in 16 human tissues. By analyzing the Human Body Map 2.0 study RNA-sequencing data using our pipeline, we identified that one ribosomal protein (RP) pseudogene (PGOHUM-249508) is transcribed with RPKM 170 in thyroid. Moreover, three other RP pseudogenes are transcribed with RPKM > 10, a level similar to that of the normal RP genes, in white blood cell, kidney, and testes, respectively. Furthermore, an additional thirteen RP pseudogenes are of RPKM > 5, corresponding to the 20–30 percentile among all genes. Unlike ribosomal protein genes that are constitutively expressed in almost all tissues, RP pseudogenes are differentially expressed, suggesting that they may contribute to tissue-specific biological processes.
Using a specialized bioinformatics method, we identified the transcription of ribosomal protein pseudogenes in human tissues using RNA-seq data.
KeywordsRibosomal protein Pseudogene Transcription RNA-seq data
Pseudogenes are “fossil” copies of functional genes that have lost their potential as DNA templates for functional products [1–6]. While the definition of pseudogenes is still somewhat fuzzy, most of them are defined operationally by bioinformatics criteria, e.g., genomic scans of signatures of homology to known genes. Ribosomal protein (RP) pseudogenes represent the largest class of pseudogenes found in the human genome: over 2000 ribosomal protein pseudogenes are identified by bioinformatics scan of genomic sequence .
These pseudogenes are commonly thought to be non-functional due to the lack of promoters and/or the presence of loss of function mutations. Indeed, the vast majority of these pseudogenes either carry dysfunctional mutations such as in-frame stop codons, or lack of proper regulatory sequences, such as promoters, mTOP signals, and first introns . Interestingly, three RP pseudogenes, with 89%-95% sequence identity to their parent (progenitor) RP genes, were found to be transcribed and seem to be functional, by a bioinformatics scan of cDNA and expression sequence tag (EST) databases and confirmation by PCR and Northern blot . A genome-wide bioinformatics scan identified over 2000 potential pseudogenes . Moreover, it was found  that the six RP pseudogenes shared at syntenic loci between the human and the mouse genomes are more conserved than other RP pseudogenes.
However, data were lacking to experimentally validate pseudogene expression. It is unclear from the literature whether the reported cases are merely anecdotal or that pseudogenes do play some cellular roles. This is largely hindered by the lack of methods for the identification of pseudogenes transcription. The traditional method of transcriptome profiling, gene expression microarray, is not sensitive in distinguishing transcripts among very similar gene sequences.
Recent advancements of next-generation sequencing allow for direct massive transcriptome sequencing (RNA-seq), and thus providing unprecedented insights into all transcribed sequences. For example, RNA-seq has been applied to detect complex transcriptional activities such as alternative splicing [10, 11] and allelic-specific expression . Recently, RNA-seq has been applied to reveal RNA editing events . However, to the best of our knowledge, there were yet no attempts to detect the transcription of pseudogenes in RNA-seq data. The main challenge for pseudogene identification in RNA-seq data is the difficulty of high fidelity read mapping. Because sequences of pseudogenes are highly similar to the sequences of the mRNAs of the parent genes, specialized read mapping methods are required to detect reads unambiguously generated from pseudogenes.
In this study, we conduct a bioinformatics analysis of pseudogene expression using RNA-sequencing data of 16 human tissues of the Illumina Human Body Map 2.0 project. We first describe our new computational pipeline for detecting pseudogene expression that disentangles sequencing reads of pseudogenes from those of the parent genes, with consideration of possible mismatches due to SNPs and RNA-editing. This is followed by a description of our findings and a discussion of their implications.
Illumina Human Body Map 2.0 RNA-seq data
The Human Body Map 2.0 Project by Illumina generated RNA-seq data for 16 different human tissues (adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells). In our analysis we used the 75 bps single read data, with one lane of HiSeq 2000 data per tissue. Standard mRNA-seq library preps were used to extract poly-A selected mRNAs.
Discovery of pseudogene transcription in RNA-seq data
Our primary goal is to detect transcriptional activities of any of the 1709 processed RP pseudogenes. In addition, we also want to provide a preliminary quantification of their level of transcription.
The number of reads mapped to RefSeq sequences and RP pseudogenes for both the composite genome and the unaltered human genome (hg18) for each tissue
White Blood Cells
Statistics for each tissue sample
Number of reads in the sample
Number of reads mapped to the composite genome
Number of reads mapped to hg18
Number of reads mapped uniquely to the composite genome
Number of reads mapped uniquely to the hg18
White Blood Cells
Prevalent transcription of RP pseudogenes
RP pseudogenes expression identified in Human Body Map 2.0 RNA-seq data
Tissue with Max RPKM
White Blood Cells
Table of FPKM expression values of RefSeq genes in 16 human tissues
% > 1
% > 2
% > 5
% > 10
% > 15
White Blood Cells
Tissue-specificity of pseudogene transcription
Discussion and conclusions
In this work, we conducted a bioinformatics analysis of the pseudogenes of ribosomal protein genes in diverse human tissues. Using our specialized pipeline, we identified several cases of pseudogene expression. Most notably, one pseudogene in an intron of the TG gene is extremely highly expressed in thyroid. Moreover, several other pseudogenes are also highly expressed. We found that many pseudogenes are expressed in a tissue-specific fashion. Also, the expression of pseudogenes seems to often go beyond the boundaries of the annotated pseudogenes. Apparently, further experimental investigations will be needed to reveal the biological relevance of these expressions.
Transcriptome sequencing, RNA-seq, provides an unprecedented opportunity to uncover many complexities of cellular gene expression. A key computational challenge in RNA-seq data analysis is to discern reads among multiple potential sources with similar sequences. In this work we focused on the detection of evidences of pseudogene expression. We used extremely strict read mapping criteria to minimize potential false positives. Admittedly we did not utilize all potential reads, especially at regions with low uniqueness. Further research may consider using looser mapping criteria combined with sophisticated statistical algorithms to take into account the mapping ambiguity.
The bioinformatics methods presented here may find application in other RNA-seq studies dealing with high similarity in reference sequences. In particular, the same methodology may be able to identify differential expression between other homologous genome regions. Studies in other fields, such as metagenomics, dealing with high similarity DNA sequences may also find benefits from strict alignment and intersection with uniquely mappable locations.
Human tissue samples
The Human Body Map 2.0 RNA-seq data for 16 human tissue samples were provided by Gary Schroth at Illumina and can be accessible from ArrayExpress (accession no. E-MTAB-513). Reads were 75 base pairs long and came from the following samples: adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph, muscle, ovary, prostate, testes, thyroid, and white blood cells. The samples were prepared using the Illumina mRNA-seq kit. They were made with a random priming process and are not stranded.
Software and datasets
Bowtie version 0.12.7  and TopHat version 1.2.0  were used for the mapping. Cufflinks version 1.0.3  was used for differential expression calculation for RefSeq genes. BEDTools version 2.12.0  was used to analyze alignments. The uniqueome dataset was collected from the Uniqueome supplementary database  for human genome (hg18, NCBI Build 36.1) marking genome locations where reads of length 75 bps are unique within 4 mismatches (hg18_uniqueome.unique_starts.base-space.75.4.positive.BED). The 75 bps read length matches the RNA-seq data provided by Illumina. RefSeq genes and DNA sequences of spliced ribosomal protein genes were collected from NCBI (RefSeq database D32-6) . Pseudogene annotations and sequences were downloaded from pseudogene.org  database (human pseudogenes build 58). Pseudogenes whose parent genes are ribosomal protein genes were selected, totaling 1788. Among them, 79 were annotated ‘Duplicated’. As we are only interested in processed pseudogenes, our analysis focuses on the remaining 1709 pseudogenes. The human genome sequence (hg18) was collected from NCBI build 36.1.
A composite genome index was constructed with Bowtie-build using the sequences of the human genome (hg18, NCBI build 36.1) and NCBI spliced RP gene sequences.
RNA-seq data for each tissue was aligned using two distinct methodologies – one for pseudogenes and one for real genes. Pseudogene alignment protocol consists of strict alignment (Bowtie, no mismatches, report reads with only one alignment location only) to the composite genome. Real gene alignment protocol consists of strict alignment (Bowtie, no mismatches, single alignment location) to the human genome (hg18, NCBI Build 36.1).
A uniqueome data set  was obtained for Build 36.1 marking genome locations where reads of length 75 bps are unique within 4 mismatches. Alignments for all tissues for both real genes and pseudogenes were intersected with the uniqueome dataset for all genome locations (intersectBed from BEDTools ). The total number of remaining reads in each alignment was counted. The uniqueome dataset was used to filter out ambiguously mapped reads.
Comparative expression analysis
Gene expression values were calculated as reads aligned to gene per kilobase of exon per million mapped reads (RPKM) . The number of reads aligned to all gene exons and additionally aligning in unique locations was counted for each gene. Exon length for genes was calculated as the sum of unique positions as marked by the uniqueome across all gene exons. It is worth noting that RP pseudogenes appear spliced in the human genome and therefore have only a single exon for counting aligned reads and calculating exon length.
Expression percentiles of RefSeq genes were calculated using TopHat to map reads to the human genome (hg18, NCBI build 36.1) and Cufflinks was used to calculate FPKM values of all 37,681 RefSeq genes. Expression percentiles were calculated for specific tissues and for all datasets combined.
Gene reads coverage was calculated using the coverageBed program in the BEDTools software suite. Coverage represents the fraction of RP pseudogene exon covered by reads that aligned to unique genome regions.
We followed the definition of Jensen-Shannon divergence in . To avoid zero probabilities, all RPKM numbers are added by 10-10.
We are grateful for Gary Schroth and Illumina for the early sharing of their Human Body Map 2.0 RNA-seq data. This work is partly supported by a UAB NORC pilot grant funded by NIH grant 5P30DK056336.
- Balakirev ES, Ayala FJ: Pseudogenes: are they "junk" or functional DNA?. Annu Rev Genet. 2003, 37: 123-151. 10.1146/annurev.genet.37.040103.103949.View ArticlePubMedGoogle Scholar
- Harrison PM, Hegyi H, Balasubramanian S, Luscombe NM, Bertone P, Echols N, Johnson T, Gerstein M: Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res. 2002, 12 (2): 272-280. 10.1101/gr.207102.PubMed CentralView ArticlePubMedGoogle Scholar
- Mighell AJ, Smith NR, Robinson PA, Markham AF: Vertebrate pseudogenes. FEBS Lett. 2000, 468 (2–3): 109-114.View ArticlePubMedGoogle Scholar
- Vanin EF: Processed pseudogenes: characteristics and evolution. Annu Rev Genet. 1985, 19: 253-272. 10.1146/annurev.ge.19.120185.001345.View ArticlePubMedGoogle Scholar
- Zhang Z, Harrison PM, Liu Y, Gerstein M: Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 2003, 13 (12): 2541-2558. 10.1101/gr.1429003.PubMed CentralView ArticlePubMedGoogle Scholar
- Gerstein M, Zheng D: The real life of pseudogenes. Sci Am. 2006, 295 (2): 48-55. 10.1038/scientificamerican0806-48.View ArticlePubMedGoogle Scholar
- Chung S, Perry RP: Importance of introns for expression of mouse ribosomal protein gene rpL32. Mol Cell Biol. 1989, 9 (5): 2075-2082.PubMed CentralView ArticlePubMedGoogle Scholar
- Uechi T, Maeda N, Tanaka T, Kenmochi N: Functional second genes generated by retrotransposition of the X-linked ribosomal protein genes. Nucleic Acids Res. 2002, 30 (24): 5369-5375. 10.1093/nar/gkf696.PubMed CentralView ArticlePubMedGoogle Scholar
- Balasubramanian S, Zheng D, Liu YJ, Fang G, Frankish A, Carriero N, Robilotto R, Cayting P, Gerstein M: Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes. Genome Biol. 2009, 10 (1): R2-10.1186/gb-2009-10-1-r2.PubMed CentralView ArticlePubMedGoogle Scholar
- Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, et al: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321 (5891): 956-960. 10.1126/science.1160342.View ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Pastinen T: Genome-wide allele-specific analysis: insights into regulatory variation. Nat Rev Genet. 2010, 11 (8): 533-538.View ArticlePubMedGoogle Scholar
- Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG: Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011, 333 (6038): 53-58. 10.1126/science.1207018.PubMed CentralView ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Klimke W, Maglott DR: NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009, 37 (Database issue): D32-D36.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Koehler R, Issac H, Cloonan N, Grimmond SM: The uniqueome: a mappability resource for short-tag sequencing. Bioinformatics. 2011, 27 (2): 272-274. 10.1093/bioinformatics/btq640.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7 (3): 562-578. 10.1038/nprot.2012.016.PubMed CentralView ArticlePubMedGoogle Scholar
- van de Graaf SA, Ris-Stalpers C, Pauws E, Mendive FM, Targovnik HM, de Vijlder JJ: Up to date with human thyroglobulin. J Endocrinol. 2001, 170 (2): 307-321. 10.1677/joe.0.1700307.View ArticlePubMedGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.PubMed CentralView ArticlePubMedGoogle Scholar
- Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL: Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011, 25 (18): 1915-1927. 10.1101/gad.17446611.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralView ArticlePubMedGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.PubMed CentralView ArticlePubMedGoogle Scholar
- Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N, Cayting P, Harrrison P, Gerstein M: Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res. 2007, 35 (Database issue): D55-D60.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.