Alternative splice variants of P4HB gene
The P4HB gene has 24 annotated transcripts in the human genome (Ensembl, GRCh38.p10): the canonical isoform plus 10 protein-coding sequences, 1 nonsense-mediated decay transcript, 3 processed transcripts and 9 retained-intron transcripts. The main variants annotated as protein-coding (Ensembl) are shown in Fig. 1 and are available from http://www.ensembl.org [25]. All 10 of these isoforms are supported by The Human Protein Atlas (http://www.proteinatlas.org) and annotated in UniProt. Table S1 summarizes the information on P4HB splice variants, including Transcript ID (Ensembl), UniProt identifier, nucleotide and protein length, molecular mass and putative signal peptide.
The predicted organization of each protein-coding isoform is depicted in Fig. 1B. Of note, P4HB-02 is not predicted to have the classical ATG start codon, although CAGE tags could be detected in that region. Moreover, except for P4HB-02 and P4HB-021, the splice variants are not predicted to have a classical stop codon, while P4HB-023 (detected) and P4HB-027 (possibly) have stop codons in the 3′-UTR regions. Except for isoform P4HB-021, which contains all four thioredoxin domains (with a and b partially truncated; see below), the predicted P4HB isoform products lack one or more domains or contain incomplete forms of some domains. This generates variable domain combinations with the potential for alternative functions, since the unique thiol isomerase activity of PDIA1 requires all 4 domains (a, a’, b, b’) [26]. For example, P4HB-019 has a fragmented a domain lacking 36 amino acids, including the redox-active motif, and P4HB-027 has one truncated a domain with a redox-active motif and one truncated b domain. P4HB-021 is predicted to have the signal peptide and the 2 active CGHC domains, and lacks only 44 amino acids between the a and b domains. P4HB-02 and P4HB-021 are the only isoforms with an intact C-terminus containing the KDEL motif, indicating that any protein products generated from the other isoforms may not be retrieved to the endoplasmic reticulum.
Taking advantage of CAGE tags to estimate expression levels, we analyzed the regions upstream of P4HB-02 and of the P4HB gene. For P4HB-02, we used the 250 bp upstream of the putative coding region to identify which samples presented higher normalized tags per million (TPM). For this, we selected a subset of FANTOM5 data (FANTOM5 CAGE Phase1 CTSS human) that displayed the highest TPM and comprised samples from pancreas, Sertoli cells, smooth muscle cells (aortic), a leiomyoma cell line and fibroblasts (aortic adventitial) (Fig. S1). In addition, ENCODE CAGE data were analyzed, covering the Hep G2, K562, HUVEC and Nhek cell lines (Fig. S1). We also checked for CAGE tags upstream of the P4HB gene using the same data, with similar results. An advantage of CAGE tags is that they can reveal a range of alternative transcription initiation events, even within exonic coding sequences [27, 28]. These data guided the selection of samples from ENCODE RNA-seq (described below): we retained samples in which CAGE tags were identified, as these are more representative for such analysis.
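The window-based CAGE quantification described above can be illustrated with a minimal sketch. It assumes CAGE tag start sites are available as (position, count) pairs and that TPM is obtained by scaling the window's tag count by the library size; the function names, coordinate convention (0-based) and input layout are hypothetical and are not taken from the FANTOM5 or ENCODE tooling.

```python
from typing import Iterable, Tuple


def upstream_window(start: int, strand: str, size: int = 250) -> Tuple[int, int]:
    """Half-open [lo, hi) window of `size` bp upstream of a 0-based start position.

    On the minus strand, "upstream" lies at higher genomic coordinates.
    """
    if strand == "+":
        return max(0, start - size), start
    if strand == "-":
        return start + 1, start + 1 + size
    raise ValueError("strand must be '+' or '-'")


def window_tpm(ctss: Iterable[Tuple[int, int]],
               window: Tuple[int, int],
               library_size: int) -> float:
    """Tags per million for all CAGE tag start sites falling inside `window`.

    `ctss` holds (position, tag_count) pairs for one sample (assumed layout);
    `library_size` is the total number of mapped tags in that sample.
    """
    lo, hi = window
    tags = sum(count for pos, count in ctss if lo <= pos < hi)
    return tags * 1_000_000 / library_size


# Illustrative numbers only, not data from the study.
example_ctss = [(100, 5), (349, 10), (350, 20)]
example_tpm = window_tpm(example_ctss, upstream_window(350, "+"), 3_000_000)
```

Ranking samples by this value would reproduce the kind of selection described in the text, under the stated assumptions.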
Expression profiling of P4HB splice variants in FANTOM5 database
We next addressed an overview of P4HB gene and spliced variant expression profiling in different cell lines and tissues, using a number of distinct databases: FANTOM5, ENCODE and GTEx.
First, the FANTOM5 project provides atlases of long noncoding RNAs and microRNAs and their promoters, with accompanying RNA-seq and short RNA transcriptome data [29]. We used data from all FANTOM5 RNA-seq libraries (70 samples) [30] to prospectively analyze P4HB splice junctions. These samples comprised cell lines (n = 32), primary cells (n = 27) and tissues (n = 11). In some cases (n = 6), the average of triplicate data (whole blood samples, CD19 B cells and CD8 T cells) from the same donor was used. These 70 FANTOM5 samples were used to build Fig. 2a and b, which show expression profiles for the 10 protein-coding isoforms from Fig. 1. Figure 2b shows the percentage abundance of each splice variant in this set of samples, with P4HB-021 representing almost 30% of the total isoform fraction. Fig. 2c shows two representative examples of the most expressed isoforms (P4HB-02 and P4HB-021) in different cells and tissues.
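The two calculations mentioned above, averaging triplicates from the same donor and expressing each variant as a percentage of the total isoform signal, can be sketched as follows. This is an illustration only: the dictionary layout, the function names and the treatment of a missing isoform as zero are assumptions, and the isoform labels in the example are placeholders.

```python
from statistics import mean
from typing import Dict, List


def average_replicates(replicates: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-isoform splice-junction TPM across replicates of one donor.

    An isoform absent from a replicate is counted as 0.0 (assumption).
    """
    isoforms = set().union(*replicates)
    return {iso: mean(rep.get(iso, 0.0) for rep in replicates) for iso in isoforms}


def isoform_fractions(tpm: Dict[str, float]) -> Dict[str, float]:
    """Percentage of the summed isoform TPM contributed by each isoform."""
    total = sum(tpm.values())
    if total == 0:
        return {iso: 0.0 for iso in tpm}
    return {iso: 100.0 * value / total for iso, value in tpm.items()}


# Placeholder isoform names and values, not data from the study.
triplicate = [
    {"iso_A": 1.0, "iso_B": 3.0},
    {"iso_A": 3.0, "iso_B": 6.0},
    {"iso_A": 2.0},
]
averaged = average_replicates(triplicate)
fractions = isoform_fractions(averaged)
```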
In brief, these results provide an overview of three processed transcripts and ten protein-coding isoforms. The variant P4HB-021 was prominently represented, particularly in aortic smooth muscle cells, followed in this cell type by P4HB-019, P4HB-023, P4HB-027 and P4HB-02. The samples with the overall highest number of expressed P4HB isoforms were aortic smooth muscle cells, followed by CD19+ B cells and a mucinous adenocarcinoma cell line. The most frequently expressed splice variant across all 70 samples was P4HB-02, present in 28 samples, while P4HB-021 and P4HB-027 had the highest splice junction TPMs. The isoform P4HB-021 had its highest level of expression in aortic smooth muscle cells (Fig. 2d), but due to the relatively low number of samples, we focused this analysis on the more abundant SMC data from ENCODE and a recently published study [31] (following sections).
To visualize the BAM files in the IGV platform, we selected the splice variant P4HB-027 and checked its splice junction in 4 different cell types. Fig. S2 illustrates the splicing event in the middle of P4HB-027 exon 3. With this tool, the splice junctions of different variants could be visualized across multiple samples. Notably, the P4HB-027 splice junction in exon 3 is not present in all 4 samples analyzed, as indicated by the black arrow in Fig. S2. Fig. 2e shows a plot of the splicing event for P4HB-021. Sashimi plots were generated with IGV-Sashimi, which allows one to select a specific genomic region and to detect events of isoform usage [32].
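To make concrete what a splice junction means at the level of aligned reads, the sketch below extracts junction coordinates from alignment CIGAR strings, in which an `N` operation marks a skipped reference region (an intron), and converts junction read counts to tags per million. It is an illustration under stated assumptions, not the pipeline used here: alignments are passed as plain (position, CIGAR) pairs rather than read from a BAM file, positions are taken as 0-based, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CONSUMES_REFERENCE = set("MDN=X")  # operations that advance the reference position

Junction = Tuple[int, int]


def junctions_from_alignment(pos: int, cigar: str) -> List[Junction]:
    """Return (intron_start, intron_end) half-open intervals for each N operation.

    `pos` is the 0-based leftmost reference position of the alignment.
    """
    ref = pos
    junctions = []
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            junctions.append((ref, ref + n))
        if op in _CONSUMES_REFERENCE:
            ref += n
    return junctions


def junction_tpm(alignments: Iterable[Tuple[int, str]],
                 total_mapped_reads: int) -> Dict[Junction, float]:
    """Count reads spanning each junction and scale to tags per million."""
    counts: Counter = Counter()
    for pos, cigar in alignments:
        counts.update(junctions_from_alignment(pos, cigar))
    return {j: c * 1_000_000 / total_mapped_reads for j, c in counts.items()}


# Illustrative reads only: two spliced reads share one junction, one is unspliced.
reads = [(100, "50M200N50M"), (120, "30M200N70M"), (100, "100M")]
tpm_by_junction = junction_tpm(reads, total_mapped_reads=2_000_000)
```

A junction that is absent from a sample, as noted for P4HB-027 exon 3 in some cell types, would simply not appear as a key in the returned dictionary.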
Expression profiling of P4HB splice variants in RNA-seq ENCODE database
The Encyclopedia of DNA Elements (ENCODE) [33] hosts several types of experiments, such as exon arrays, ChIP-seq and RNA-seq, available at http://www.encodeproject.org. Here we used the ENCODE Caltech RNA-seq data and CSHL/ENCODE RNA-seq data to analyze 27 RNA-seq datasets covering 12 different cell lines, 5 of which are cancer cell lines. These datasets were chosen because of the presence of CAGE peaks [34], which are tags for gene expression, as detailed in Methods. The distribution of splice variants counted by splice junction (tags per million) in the ENCODE datasets is shown in Fig. 3a. In this graph, the most representative isoform (i.e., expressed in most samples) was P4HB-029, but the most highly expressed isoforms (in SJ TPM) were P4HB-02 and P4HB-021. The HCT-116 (human colon cancer), Gm12878 (human lymphoblastoid), Hmsc (human mesenchymal stem cell) and Hsmm (human skeletal muscle myoblast) cell lines were the most represented in this set (Fig. 3b). In addition, we performed a separate analysis focusing on endothelial cells (HUVEC and HAoEC) and aortic adventitial fibroblasts (HaoAF), shown in Fig. 3c-d. Isoform P4HB-02 is well expressed in aortic adventitial fibroblasts, P4HB-021 in fibroblasts and 2 types of endothelial cells, and P4HB-024 in two other endothelial cell types.
We next applied the same pipeline to identify and count splice junction TPM (tags per million) in order to investigate P4HB gene expression in polyA RNA-seq ENCODE human datasets (https://www.encodeproject.org/) from donors (primary cells). We focused on data from pulmonary artery smooth muscle cells derived from two male individuals. P4HB-019, P4HB-023 and P4HB-026 showed higher expression in these cells (Fig. 3e), with P4HB-019 and P4HB-023 representing around 0.7% and 0.5% of total P4HB, respectively. In all such cases, however, expression of the isoforms was low relative to the canonical isoform (Fig. 3).
Expression profiling of P4HB splice variants in GTEx database
The Genotype-Tissue Expression Project (GTEx) is one of many large cohort studies comprising a significant amount of transcriptomic data, including RNA-seq from various tissues. Here, we used 11,690 RNA-seq samples from different tissues and conditions, listed in Table S3. This dataset is highly enriched in whole blood (407 samples), blood vessel (913 samples) and heart (600 samples). Figure 4 shows the quantification of three P4HB splice variants (P4HB-02, P4HB-021 and P4HB-027) in 30 different tissues. Among these isoforms, P4HB-02 and P4HB-027 displayed slightly higher expression than P4HB-021. The fractional expression of variant P4HB-021 was higher in heart than in other tissues. For Fig. 4, we analyzed the 30 tissues by merging all sub-regions of each tissue. In Fig. S3, we separately analyzed isoform expression in different sub-regions of heart (atrial appendage and left ventricle), which showed no difference in isoform expression. In this specific subset, P4HB-027, P4HB-021 and P4HB-02 were represented in arterial cells (aorta, coronary and tibial), with slightly higher prevalence of P4HB-027.
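The merging of tissue sub-regions can be sketched as a simple grouping step. The example assumes each sample is described by a (tissue, sub-region, value) triple and that sub-regions are merged by pooling their samples and taking the mean; the summary statistic, the record layout and the function name are assumptions made for illustration, and the values are invented.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def merge_subregions(samples: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Pool samples from all sub-regions of a tissue and return the per-tissue mean."""
    pooled = defaultdict(list)
    for tissue, _subregion, value in samples:
        pooled[tissue].append(value)
    return {tissue: mean(values) for tissue, values in pooled.items()}


# Illustrative values only, not GTEx measurements.
example_samples = [
    ("Heart", "Atrial Appendage", 2.0),
    ("Heart", "Left Ventricle", 4.0),
    ("Blood Vessel", "Aorta", 1.0),
    ("Blood Vessel", "Coronary", 3.0),
    ("Blood Vessel", "Tibial", 5.0),
]
per_tissue = merge_subregions(example_samples)
```

Keeping the sub-region label in each record allows the same data to be re-split for a sub-region comparison such as the one in Fig. S3.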
P4HB splice variants are highly expressed in smooth muscle cells
Given our focus on vascular cells and the above results from the FANTOM5 and ENCODE analyses, we further pursued the P4HB isoform analysis in these cell types. A recently published study [31] produced RNA-seq data from human aortic and coronary vascular smooth muscle cells (VSMC) to investigate gene expression patterns during changes in extracellular matrix stiffness, since VSMC-extracellular matrix mechanobiological interactions are involved in disease pathogenesis. Figure 5 indicates that P4HB expression tended to be lower under pathologic than under physiologic conditions. Concerning P4HB splice variants in coronary artery VSMC, the splice junction TPM tended to be higher in physiologic conditions. Similarly, in VSMC from proximal aorta, isoforms P4HB-02 and P4HB-021 were more represented in physiologic conditions. Taken together, these data indicate that isoform expression was cell type-specific, differing, e.g., between endothelial cells (Fig. 3) and VSMC, and even between distinct VSMC locations (Fig. 4).
Validation analysis of selected P4HB splice variants in cells
We next performed the validation of P4HB splice variant expression in distinct cell types using PCR. For that, we arbitrarily selected 3 isoforms on the basis of their expression levels and tissue specificity (above data), namely P4HB-02, P4HB-021 and P4HB-027.
The cells chosen for an initial overview were the neuroblastoma (SK-N-SH) and HCT-116 (human colon cancer) cell lines, on the basis of a previous analysis that used IGV and a shell script to detect the splice junctions in BAM files. After RNA extraction and cDNA synthesis, PCR assays using specific primer sets for each splice junction were conducted (Fig. 6), yielding in each case a single amplicon, which was purified and cloned into pGEM-T (Promega). The amplicons were 89 bp (P4HB-02) and 211 bp (P4HB-027). For P4HB-021, the amplicon was cloned in pUC57 vector containing the complete isoform sequence. The three amplicons were cloned and sequenced, and the nucleotide sequence spanning the splice junction was confirmed.
We next pursued the validation of isoform detection using RT-qPCR, focusing on primary human VSMC from mammary artery and the HEK-293 (human embryonic kidney) cell line. In both cases, cells were investigated in their basal state and following serum starvation (16 h or 24 h) or exposure to CoCl2 as a mimetic of hypoxia, since some PDIs are upregulated by hypoxia [35]. Importantly, in accordance with the previous results from databanks, P4HB-021 was detected in VSMC at baseline and was upregulated, together with P4HB, after 24 h of serum starvation compared with 16 h (Fig. 7a). Exposure to CoCl2, however, did not significantly affect the expression of P4HB and its variants under the conditions of our experiments (Fig. 7b). Four genes related to the ER stress response were also analyzed with serum starvation (Fig. 7c) and CoCl2 treatment.
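For readers unfamiliar with how RT-qPCR readouts are commonly turned into relative expression, the sketch below implements the widely used 2^-ΔΔCt calculation. This section does not state which quantification method was applied, so the choice of ΔΔCt, the assumption of roughly 100% amplification efficiency, the function name and the Ct values are all illustrative assumptions.

```python
def relative_expression(ct_target_treated: float,
                        ct_reference_treated: float,
                        ct_target_control: float,
                        ct_reference_control: float) -> float:
    """Fold change of a target transcript (treated vs. control) by 2^-ddCt.

    Each Ct is first normalized to the reference gene in the same condition;
    the result assumes ~100% amplification efficiency (assumption).
    """
    delta_treated = ct_target_treated - ct_reference_treated
    delta_control = ct_target_control - ct_reference_control
    return 2.0 ** -(delta_treated - delta_control)


# Invented Ct values: the target crosses threshold one cycle earlier after treatment.
fold_change = relative_expression(24.0, 18.0, 25.0, 18.0)
```

Under this scheme a value above 1 indicates higher expression in the treated condition, and a value of 1 indicates no change relative to control.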
We also assessed the effects of tunicamycin, a potent inhibitor of N-linked glycosylation, in the HEK-293 cell line to investigate the influence of the ensuing endoplasmic reticulum stress on the expression of the P4HB gene and its splice variants. Cells were incubated for 16 h and 40 h with 3 tunicamycin concentrations, and the expression of P4HB and its splice variants was analyzed (Fig. S4). Other genes involved in the ER stress response, such as ATF6, CHOP, GRP78 and GRP94, as well as NOX1, NOX2 and NOX4, were also analyzed (Fig. S5A). Expression of GRP78 and GRP94, which are early and classical ER stress markers, was increased after 16 h and 40 h of tunicamycin, with less robust increases in the transcription factors ATF6 and CHOP. While the P4HB gene and its variants tended to increase vs. control (Fig. S4), expression of P4HB-02 and P4HB-027 (but not of P4HB or P4HB-021) decreased vs. that of the reference gene (Fig. S5A). None of these differences, however, was statistically significant, confirming that the P4HB gene and its associated splice variants are not per se directly unfolded protein response (UPR)-responsive genes [13]. As in VSMC, exposure to CoCl2 for 24 h produced a slight, but not statistically significant, difference in P4HB and GRP94.