Reliable quantification of pseudogene expression
Reliable quantification of pseudogene expression remains a challenging problem for a number of reasons. First, since parent genes and pseudogenes are highly similar in nucleotide sequence, short RNA-seq reads derived from one may align equally well to the other one. Such reads are fundamentally ambiguous in terms of their origin. Second, some reads may have nearly identical alignment to locations in the gene and pseudogene, and their mapping is often determined by the location with the least error in alignment. However, this strategy is unreliable in the presence of subject-specific variation with respect to the reference genome, or in the event of base call errors during sequencing, since these can result in an incorrect assignment of the read. Third, some aligners may follow a parsimony strategy in which a “simple” alignment is preferred to a complex (e.g. spliced) alignment. In the case of a processed pseudogene that lacks splices, this approach may erroneously bias the alignments to the pseudogene rather than the parent gene. Finally, in some cases, aligners report only a subset of possible alignments as a result of the heuristics used. For all of these reasons, studies of gene and pseudogene expression using existing tools are likely inaccurate without additional considerations.
A first approach to reliably studying pseudogene expression is to consider only the reads that are assigned to a single location by an aligner. However, the above confounding factors can result in reads that are uniquely aligned to the wrong positions (Figure 1A). Any conclusions drawn from such reads in downstream analyses will be unreliable. One approach to addressing this problem is to identify and discard from the analysis reads that map to regions in the genome that are especially sensitive to these confounding factors. We have adopted this approach using the concept of transcriptome mappability, which we describe below.
Our approach for computing transcriptome mappability builds on the notion of genomic mappability. Mappability is a measure of the inherent distinctiveness of a genomic region; the more frequently a genomic region occurs, the less mappable it is. Although mappability can be defined as a continuous quantity (the reciprocal of k-mer frequencies, for example, as in [23]), it is generally not very useful to know the degree to which a region is unmappable. If a k-mer occurs more than once in the genome, a read aligned there will be ambiguous. For this reason, we compute mappability as a discrete quantity –that is, a region is either mappable (unambiguous) or not mappable (ambiguous). Our notion of mappability also includes a “safety margin”, so that a mappable region guarantees not only a unique alignment for the reads matching the sequence, but also that no read with one or two base call errors or SNPs relative to the reference genome could be uniquely mismapped to this region. Mappability is important even if an aligner does not use heuristics and exhaustively enumerates read alignments. As demonstrated by Figure 1A, highly similar regions can produce uniquely mismapped reads as a result of genome variation and read errors in a way that no aligner can recognize (see Methods section for details).
If we restrict our attention to alignments in mappable regions, we ensure that the downstream analysis results are robust, even if the reference genome does not match the subject genome or the reads contain sequencing errors. Mappability is thus inversely related with sensitivity to genome variation and read errors.
Since RNA-seq reads may span multiple exons, the transcriptome contains additional k-mers beyond those found in the genome. To compute transcriptome mappability, we can align k-mers to the genome sequences crossing splice junctions. This transcriptome mappability scheme allows the computation of pseudogene expression levels using only reads uniquely aligned to mappable regions. Using these reliable reads, we compute pseudogene expression levels in units of Reads per Kilobase of Uniquely mappable transcript per Million reads (RPKUM). See the Methods section for a detailed description of the transcriptome mappability and RPKUM calculations.
We tested our RPKUM metric by comparing expression levels for protein coding genes computed with both RPKUM and RSEM [24], a commonly used transcript quantification method. We computed the mean expression level across the TCGA dataset for each protein-coding gene using both methods, then calculated the correlation between the expression levels from the two methods. The result showed good agreement between RPKUM and RSEM values (Spearman correlation > 0.85), indicating that RPKUM values provide a reliable method for quantifying expression levels.
An important question is whether RPKUM values computed from few mappable bases are trustworthy. To investigate the robustness of the RPKUM metric, we simulated RPKUM values by randomly sampling positions of genes that are completely mappable and then using these sampled bases as the only mappable bases of a gene in an RPKUM calculation. Genes spanning a wide range of expression levels from 1 to 200 RPKMs were used in the simulation. We performed the simulations with 500, 100, and 50 mappable bases per gene. RPKUM values computed from genes with as few as 50 simulated mappable bases showed very strong agreement with the true RPKM expression levels across the range of expression levels (ρ = 0.95). In addition, increasing the number of mappable bases slightly increases the correlation between RPKUM and RPKM levels (ρ = 0.97 for 100 mappable bases and ρ = 0.99 for 500 mappable bases).
Figure 2A shows the distribution of transcriptome mappability for protein coding genes and GENCODE v. 17 pseudogenes. As expected, pseudogenes are much less mappable than protein-coding genes; the median protein-coding gene mappability value is nearly 100% of gene length, and the vast majority of genes are almost completely mappable. In contrast, the median pseudogene mappability value is around 80% of pseudogene length. The distribution of pseudogene mappability is approximately bimodal, with peaks near 10% and 90%. A sizable fraction of pseudogenes are completely unmappable (2169 out of 14942). Nonetheless, the majority of pseudogenes possess a significant fraction of mappable bases and are thus accurately detectable using RPKUM expression levels.
As expected, restricting the set of reads aligned to pseudogenes to only those in mappable regions leads to a dramatic reduction in the number of reads (Figure 2B). On average, each sample contains nearly 10 million reads mapped to pseudogenes, but our filtering process leaves a set of just over 360,000 pseudogene reads per sample. The surviving reads comprise a high-confidence set that can be used to assess pseudogene transcription.
High-confidence breast cancer pseudogene transcripts
Using the GENCODE v. 17 pseudogene annotations, we identified 2012 pseudogenes with evidence of transcription, defined as genes with at least 50 mappable bases, 50 reads, and 1 RPKUM in at least 1 sample (Additional file 1). The majority of these pseudogenes occurred in only a small number of samples (Figure 3A). However, a subset of the pseudogene transcripts occurs in a large number of samples, including 94 pseudogenes that are transcribed in over 95% (n = 780) of the samples. To investigate the pseudogenes that are most likely to play a role in cancer biology, we chose to focus the remainder of our analysis on pseudogenes that exhibited evidence of transcription in at least 10% (n = 80) of the samples; this set consists of 440 pseudogenes.
The GENCODE pseudogene decoration resource (psiDR v. 0), assembled from a recent genome-wide survey of pseudogenes using ENCODE data [3], provides useful information for an initial assessment of the transcriptional potential of our pseudogene set. Out of the set of 440 transcribed pseudogenes we identified, 287 pseudogenes are annotated in psiDR for a number of attributes, including pseudogene type, parent gene, transcription evidence, open chromatin, histone modifications that indicate activity, transcription factor binding, RNA polymerase II occupancy, and evolutionary constraint [3]. Although the functional genomics annotations come from the ENCODE cell lines, not from breast cancer tissue, they nonetheless serve as a reasonable starting point for assessing the transcriptional activity of the pseudogenes we identified.
Examining the collection of psiDR annotations for these 287 transcribed pseudogenes shows that they possess a number of properties that indicate transcriptional activity (Figure 3B). Nearly half (n = 125) of the 287 pseudogenes were reported by psiDR to be transcribed. The remainder (n = 162) represent potentially novel pseudogene transcripts not annotated in psiDR. The pseudogenes producing these unannotated transcripts show strong evidence of transcriptional activity. Compared to the full set of more than 11,000 pseudogenes annotated by psiDR, the set of 287 is significantly enriched for active chromatin, Pol II occupancy, and transcription factor binding (p<0.002,χ2test). In addition, 20 of these pseudogenes display fewer substitutions compared to chimp and mouse orthologs than expected by chance. Interestingly, duplicated and unitary pseudogenes are also enriched within the set of 287. This may be due in part to the fact that duplicated pseudogenes are thought to be more likely to possess upstream regulatory elements similar to those of the parent genes. Also, unitary pseudogenes are likely to be more mappable, and thus are easier to detect from short-read RNA-seq data. In short, the diverse data types from the ENCODE project provide strong support for the transcriptional activity of the pseudogenes that we have detected in breast cancer tissue.
It is worth noting that PTENP1 and KRASP1, the two initial examples of pseudogene ceRNAs, are present (though at low levels) in the breast cancer samples we study here. Our method of computing RPKUM expression levels is thus capable of detecting these important pseudogenes, but their expression levels fall below the cutoff that we used to define our set of highly-expressed pseudogene transcripts, and therefore they were not considered for further analysis. The set of 748 breast-cancer pseudogene transcripts provided by Han et al. [22] does not contain PTENP1 or KRASP1, confirming the low expression of these pseudogenes in breast cancer.
Hierarchical clustering shows association with known cancer subtypes
The four molecular subtypes of breast cancer possess a number of distinguishing characteristics, including estrogen/progesterone receptor status, response to chemotherapy drugs, and gene expression profile [25]. A common method of studying the differences among these subtypes is to use unsupervised clustering techniques to group samples together based on their gene expression patterns. Unsupervised clustering using protein-coding genes results in four distinct clusters corresponding to the subtypes [25]. To investigate the relationship between pseudogene transcription and breast cancer disease state, we performed hierarchical clustering using the high-confidence set of 440 pseudogenes. Unsupervised clustering based solely on these pseudogene expression levels effectively separates tumor and normal samples (Figure 4A). However, since the normal samples are extracted from tumor adjacent breast tissue that contains a different cell type composition than the tumor itself, the ability to distinguish tumor from normal is likely due in large part to tissue specificity rather than tumor biology. Nonetheless, this result shows that pseudogene expression varies considerably between the cell types that make up the tumor and adjacent normal samples.
We also removed the adjacent normal samples and clustered solely on the tumor samples. As Figure 4B shows, the unsupervised clustering algorithm successfully separates the basal samples from the other subtypes. However, the pseudogene expression profiles for the luminal and Her2 subtypes are not sufficiently distinct to consistently separate samples from these subtypes. Basal tumors grow more rapidly and have significantly different histology than the other subtypes [25], and this may be why basal/luminal and basal/Her2 separation stands out more clearly than the luminal/Her2 separation. The fact that pseudogene expression alone can identify the basal subtype shows that pseudogene expression has a strong, non-random association with specific pathways and cellular environments. This suggests that previous findings, such as the pseudogene ATP8A2, which is more highly expressed in luminal compared to basal samples [4], are not isolated examples.
Pseudogenes are differentially expressed among cancer subtypes
To identify the pseudogenes with the most strong subtype-specific expression profiles, we performed a multi-class differential expression analysis using the SAM tool [26]. This analysis yielded 309 pseudogenes with significant subtype-specific expression (FDR < 1%; Additional file 2). Several interesting pseudogenes are at the top of this list. For example, the second pseudogene on the list is ATP8A2-Ψ, a pseudogene that has been found to be upregulated in luminal subtypes and shown to induce tumor progression [4]. The expression profile found here reflects this pattern, showing strong upregulation in luminal samples compared to basal.
Three other interesting examples are shown in Figure 5. A pseudogene of CASP4, a member of the caspase family known to initiate apoptosis under certain conditions [27], is expressed at higher levels in basal samples and downregulated in luminal A samples (Figure 5A). Interestingly, the expression of the CASP4 pseudogene is lower in tumor samples than normal, which is the expression profile expected for a ceRNA that promotes CASP4 expression. Additionally, the CASP4 pseudogene was found to be transcribed in the ENCODE analysis [3]. Another interesting property of this unprocessed pseudogene is that it shows alternative splicing—there appear to be multiple isoforms represented in the reads covering the pseudogene locus. Intriguingly, our analysis of potential ceRNA interactions also indicated that the CASP4 pseudogene is positively correlated (ρ = 0.3) with expression of its parent gene and shares a miRNA target site for hsa-mir-203 (see next section for detailed summary of ceRNA investigation).
The CYP2F1 pseudogene is expressed at quite high levels compared to most pseudogenes in the dataset, and the average expression level in the luminal B subtype is nearly five times the average expression in the basal subtype. The pseudogene is a unitary pseudogene, with no clear parent protein-coding gene. However, it possesses strong sequence similarity with the cytochrome P450 family of genes. It was previously demonstrated that CYP2F1 is expressed in colorectal cancer and that expression in primary tumors correlated with corresponding metastatic tumors in lymph nodes [28]. Like the CASP4 pseudogene, the CYP2F1 pseudogene shows evidence for multiple isoforms.
A pseudogene of the MSL3 gene shows nearly twice the expression level in basal compared to luminal A (Figure 5C). The processed pseudogene was found to be transcribed in the ENCODE analysis. The MSL3 protein is thought to play a function in chromatin remodeling and transcriptional regulation, and it has been reported as part of a complex that is responsible for histone H4 lysine-16 acetylation [29]. Furthermore, expression of this pseudogene is correlated with the expression of its parent gene (ρ = 0.3), and it is predicted to share target sites for six different miRNAs (see next section for detailed summary of ceRNA investigation).
Analysis incorporating miRNA and gene expression levels reveals pseudogenes with ceRNA potential
A common hypothesis about ceRNA interactions is that if transcript A sequesters miRNA C away from transcript B, the expression levels of A and B will be positively correlated, while both A and B will be negatively correlated with C. To assess the possibility that the transcribed pseudogenes identified may function as ceRNAs for their parent genes, we performed an analysis integrating miRNA target prediction with pseudogene, gene, and miRNA expression levels. The miRNA expression levels (Additional file 3) were computed from sample-paired TCGA small RNA-seq data using a previously described small RNA-seq analysis pipeline [30]. We computed expression levels for the parent genes of the pseudogenes using the same RPKUM method as for the pseudogenes.
Since pseudogenes are non-coding RNAs and are not densely bound by ribosomes, the vast majority of the transcribed region of a pseudogene is likely accessible for miRNA binding. However, if a pseudogene serves as a miRNA sponge for its parent gene, it is more likely that the shared miRNA binding site occurs in the 3′ UTR of the parent gene than in the coding region. In addition, using a restricted region for prediction somewhat ameliorates the lack of specificity common to miRNA target prediction algorithms [31]. We therefore chose to restrict our target prediction analysis to the portion of the pseudogene with sequence similarity to the 3′ UTR of the parent gene—what might be termed the “pseudo-3′ UTR”. During the process of performing miRNA target prediction on pseudogenes, we noticed that the GENCODE pseudogene annotations often did not span the pseudo-3′ UTR. Therefore, we used BLAST to identify the pseudo-3′ UTRs of pseudogenes by aligning the GENCODE annotation and surrounding genomic context with the annotated 3′ UTRs of the parent gene (see Methods section for details). TargetScan version 7 [32] was used to predict target sites for only the top 100 miRNAs expressed in the TCGA breast cancer dataset. This analysis revealed 177 transcribed pseudogenes that are predicted to share at least one miRNA target site with their corresponding parent genes.
We computed Pearson correlation coefficients for each pseudogene-parent gene pair. As the plot in Figure 6 shows, the majority of pseudogene-parent gene pairs are uncorrelated. However, there is a positive skew to the distribution of correlations. To test whether the distribution of correlations differs significantly from expectation, we performed a permutation test. We constructed 5000 sets of gene-pseudogene pairs in which the genes and pseudogenes were randomly paired. The sets were of the same size as the set of pseudogene-parent gene pairs. For each random set, we computed the number of pairs with Pearson correlation above 0.3. In the 5000 random sets we generated, there were never more than 15 such pairs per set (Figure 6C). However, the set of correlations resulting from pairing pseudogenes and parent genes contains 55 pairs with correlation above 0.3. This indicates that the positive skew to the distribution of correlations shown in Figure 6A is very unlikely to be due to chance. We also tested an additional correlation threshold of 0.5 and observed a similar result, indicating that our findings are robust to the choice of correlation threshold.
We also computed the correlation between the expression level of each pseudogene and the miRNAs predicted to target it. The correlations observed for these pseudogene-miRNA pairs closely approximate a normal distribution, but show a slight negative trend (Figure 6B). A total of 180 pseudogene-miRNA pairs show strong negative correlation of less than −0.3. To test whether this number of pairs is significant, we approximated a null distribution of pseudogene-miRNA correlations using the same permutation method we applied to the pseudogene-parent gene pairs. Randomly shuffling the pseudogene-miRNA pairs to create 5000 random sets (Figure 6D) showed only 5 permutations with at least as many strongly anti-correlated pairs as we observed in the data, which corresponds to an empirical p-value of 0.001. This supports the conclusion that the extent of negative correlations observed in the data cannot be attributed solely to chance, and is likely due to genuine miRNA target repression.
Next we sought to identify the pseudogene-parent gene-miRNA triples with the strongest ceRNA potential. To do this, we first identified expressed miRNAs predicted to target both a pseudogene and its parent gene. For each such triple, we computed the correlation between pseudogene and parent gene, pseudogene and miRNA, and parent gene and miRNA (Additional file 4). We also computed p-values with Benjamini-Hochberg FDR correction for the miRNA correlations. In this way, we identified 17 pseudogene-gene pairs with strong ceRNA potential, which we defined as pseudogene-gene correlation greater than 0.3 and statistically significant miRNA anti-correlation.
Two of these pseudogenes stand out as especially interesting examples. A pseudogene of GBP1 and its parent gene show statistically significant anti-correlation with hsa-mir-199a, which has been shown to regulate autophagy in breast cancer cells [33]. This pseudogene was also found to be transcribed in the ENCODE analysis [3]. The parent gene GBP1 is known to be the mediator of the anti-proliferative effect of inflammatory cytokines in endothelial cells [34], and is implicated in several types of cancer according to GeneCards. In addition, the GBP1 pseudogene shows strong positive correlation with the expression of its parent gene across the TCGA dataset (ρ = 0.82). Another interesting pseudogene is SUZ12P1. This pseudogene and its parent gene both show strong anti-correlation to hsa-mir-28. SUZ12P1 also shows moderate positive correlation with its parent gene (ρ = 0.41). The parent gene, SUZ12, is a polycomb group protein and part of the PRC2/EED-EZH2 complex, an important epigenetic regulator that performs histone methylation [35]. This gene is also frequently translocated in endometrial stromal tumors, where it forms the JAZF1-SUZ12 oncogene [36].
An interesting question is whether the genes that have pseudogenes with ceRNA potential are functionally related. To investigate this question, we performed a Gene Ontology (GO) term enrichment analysis using three different sets of parent genes. The sets of genes used were parent genes strongly correlated with a pseudogene, parent genes whose pseudogenes was strongly anti-correlated with a shared miRNA, and parent genes participating in a putative gene-pseudogene-miRNA ceRNA interaction as defined above. For each of these sets of parent genes, we used the GOrilla tool with default settings to look for GO terms enriched in the set compared to the background list of all parent genes. No significantly enriched GO terms were found for any of the 3 sets of interest, indicating that there is no clear functional relationship among the parent genes in the sets that we have identified.