Exome sequencing (ES) is a well-established method for diagnosing Mendelian disorders and improving precision medicine. Yet, ∼50–75% of patients remain undiagnosed after ES [1,2,3,4,5,6], although an underlying genetic disorder is highly suspected. Genome sequencing (GS) of patients offered a promising alternative, however, GS led to only a marginal increase in the yield compared to ES, with additional 10–15% of patients being diagnosed [4, 7,8,9]. The minimal boost of GS in the diagnostic yield is caused by a poor prioritization and interpretation of variants because of the lack of our current functional knowledge specifically of non-coding regions [3, 4, 8, 10, 11]. Hence, there is a growing interest of clinicians to use transcriptomics to facilitate variant interpretation [3, 11,12,13]. While DNA sequencing reveals variants in coding and non-coding regions, the investigator is still blind to the effect of regulatory variants that may affect RNA abundance and splicing. RNA-seq addresses this important gap by directly probing gene expression and splicing patterns. Yet, RNA-seq holds distinct limitations for the detection of DNA variants, made difficult by monoallelic expression or nonsense mediated decay. Thus, beginning experience with RNA-seq has proven successful in improving diagnosis of individuals with unresolved diagnosis after exome- or genome sequencing, when introduced complementary to GS or ES [3, 7, 11, 12, 14,15,16].
One major challenge of transcriptomics is tissue-specific gene expression [7, 14, 16]. In the endeavor of finding the diagnosis for the patient’s phenotype, clinicians are left with the decision of which tissue is most suitable to inquire, since biopsies to acquire the tissue of interest (TI) based on inferred pathophysiology are fairly rare. Recently, MAJIQ-CAT, a web-based tool has been designed to inform the tissue choice according to splicing pattern similarities between a clinically accessible tissue for analysis (TA) and a TI [7]. While the tool is very useful when the gene of interest is known, incorporation of human phenotype ontology [17] could prove additional utility for candidate gene identification and diagnosis [7].
We designed the online resource, Phenotype Tissue Expression and Exploration (PTEE), that incorporates data of 54 adult tissues from Genotype-Tissue Expression (GTEx) Project [18] and genes annotated to a multitude of phenotypes based on Phenomizer, human phenotype ontology (HPO) [17, 19], and expert opinion for neurodevelopmental (NDD)- [20], heart rhythm -[21, 22], and monogenic obesity-related disorders [23]. We identify tissues that are most suitable for performing RNA-seq in individuals with NDD or cardiac arrhythmia conditions. We show that genes annotated with these phenotypes are not necessarily expressed in the TI (e.g., NDD genes / inherited cardiac arrhythmia genes are not always expressed in the brain / heart). Our observations on gene expression correlations between distinct tissues could explain why, although blood is not considered to be a representative tissue for neurologic disorders, RNA-seq on blood proved successful for such conditions [12]. In summary, the present resource informs clinicians and scientists about which TA should be collected given the individual’s phenotype and a TI, provided a valid ethical and individual consent.
Implementation
Data processing
Data processing was performed in R [24] version 4.0.3. We used gene median transcript per million counts (TPM) of 54 tissues and a total of 17,382 samples from GTEx version 8 (Supplementary Material – File 1 displays number of expressed genes as a function of number of samples/individuals per tissue). The gene expression analysis appears to be most robust and the number of expressed genes plateaus when the number of samples per tissue exceeds 100. For transcript specific expression, we considered the transcript TPMs and calculated the median per tissue. A gene was required to have TPM ≥ 1.5 to be considered expressed [25] and included in subsequent analyses.
Phenotype annotation data was obtained from Phenomyzer [17], SysID Database (release 1.1:2021-04-10) for primary NDD genes [20], the gene compilation by Gray and Behr for cardiac rhythm disorders [21], and the compilation by Rhode and colleagues for monogenic obesity disorders [23].
A workflow of the analysis is presented in Supplementary Material – File 2. Briefly, analyses can be restricted to gene lists annotated for specific phenotypes, custom input, or all genes expressed in a specific tissue. Further, the tissue of interest and the tissue of analysis are defined. To calculate Pearson’s correlation based on gene expression profiles only genes with TPM ≥ 1.5 are considered and the correlation is calculated as: \( {r}_{xy}=\frac{\mathit{\operatorname{cov}}\left(x,y\right)}{SDx\times SDy} \), where cov is the covariance of the x = gene expression levels in the tissue of interest and y = gene expression levels in the tissue of analysis and SD = standard deviation of the variables [26]. To control for the influence of the tested genes on the correlation analysis we performed 100 randomization tests in which we calculated correlation coefficients based on random gene lists, which included the same number of genes as the real non-randomized gene list (e.g., 39 genes for inherited cardiac arrhythmias). We ran a binomial test to check whether the tissue that showed highest number of best correlations in the randomization tests was significantly different from the other tested TAs.
To identify which genes are expressed in the tissue of interest and the tissue of analysis, we considered only the genes included in the input list (phenotype of interest, custom genes, or all genes expressed in a specific tissue). For each considered tissue we then filtered the list of genes to include only the ones with TPM ≥ 1.5, considered to be expressed. To identify overlaps of gene expression in different tissues we used the merge function from R and identified common genes between the different tissues. If multiple tissues were input as TA/TI, genes were considered expressed if they were present in at least one of the considered tissues. For visualization of the graphs, we used ggplot [27] and venn diagram [28] packages.
In single gene analysis we display the expression of the gene in multiple samples and multiple tissues using violin graphs created with the package ggplot [27] – geom_violin from R.
The code and data for each analysis are deposited at the GitHub PTEE (https://github.com/akhilvelluva/PTEE) repository.
Web tool implementation and data access
We developed a user-friendly web interface using the R shiny package [29] – Phenotype Tissue Expression and Exploration (https://bioinf.eva.mpg.de/PTEE/). Graphics were generated using BioRender.com. Instructions for PTEE usage are presented in Supplementary Material – File 3. Users can select either a phenotype of interest based on an individual’s phenotype or input a list of genes to be inquired. The selection of the phenotype of interest restricts analyses to genes annotated with the respective HPO term, or genes that are annotated to be causative or candidates for NDD, heart rhythm-, or monogenic obesity-related disorders. Additionally, users can also upload custom gene lists according to their interests. Users can identify which genes belong to the HPO term in the table displayed online, with the possibility of download. Based on the phenotype the individual displays, users select a TI, which in general reflects the most affected organ and the disease pathophysiology.
Accessible TAs are: whole blood, skin, Epstein Barr virus (EBV)- transformed lymphocytes, cultured fibroblasts, and skeletal muscle. Users can visualize the correlation based on gene expression levels between TI and the TA, considering genes that are annotated to the individual’s phenotype. Based on random gene lists that contain the same number of genes as the one selected in the phenotype of interest, users can determine whether the tissue with best correlation coefficient generally performs best, or the correlation is influenced by the number of considered genes. Another feature of the tool allows users to visualize the overlap of expressed transcripts between TI and TA and to inquire the expression of each transcript in the two tissues. Also, the users can visualize which tissue expresses most genes included in the list they inquire.
In the gene expression analysis, users can directly visualize the expression of genes in different tissues.
Inquiry of heart rhythm disorders for tool validation
To validate the tool, we inquired genes related to inherited cardiac arrhythmias which are often induced by channelopathies. The underlying genetic defects can alter the ionic currents and change the shape and duration of the cardiac action potential [22]. Thus, most of the responsible genes are expressed in cardiomyocytes. Given the fact that the heart is a specialized muscle, among the easily accessible TAs skeletal muscle is expected to have the highest similarity to heart. Using analyses implemented within PTEE we prove this hypothesis, which also served as a sanity check for our tool.
Transcriptional profiling in the developing human brain and protein-protein interaction networks
To identify patterns of gene expression which are informative about neuronal developmental processes, we used the Allen Brain Atlas expression data and ABAEnrichment package implemented in R [30]. To this end, we identified the maximum expression in each developmental stage, followed by one-way ANOVA test to establish significance and Tukey’s HSD for the pairwise comparisons between the different groups, using the R-implemented corresponding functions [24]. The code for this analysis has been deposited under http://rpubs.com/Akhil_Velluva/ptee_aba.
We performed protein-protein interaction network (PIN) functional enrichment analysis of genes not expressed in the TI to delineate molecular processes in which these genes are involved. We incorporated the protein interaction partners of these genes to increase the power of functional module identification [31]. Functional annotations of genes were obtained from Gene Ontology (GO) [32] and protein-protein interaction data from InBio Map [33]. We then used a hypergeometric test to determine the enrichment of genes (conventional) and functional PPIs (network-wise) involved in the functional modules. The functional PPIs are interactions formed by two genes with the same GO annotation. We adjusted network-wise p-values using the Benjamini and Hochberg multiple testing procedures [34]. Functional modules with 1) adjusted conventional p-value < 0.05, 2) adjusted network-wise p-value < 0.05, and 3) at least one studied gene were considered to be significantly enriched.