Stratification of candidate genes for Parkinson’s disease using weighted protein-protein interaction network analysis

Background Genome wide association studies (GWAS) have helped identify large numbers of genetic loci that significantly associate with increased risk of developing diseases. However, translating genetic knowledge into understanding of the molecular mechanisms underpinning disease (i.e. disease-specific impacted biological processes) has to date proved to be a major challenge. This is primarily due to difficulties in confidently defining candidate genes at GWAS-risk loci. The goal of this study was to better characterize candidate genes within GWAS loci using a protein interactome based approach and with Parkinson’s disease (PD) data as a test case. Results We applied a recently developed Weighted Protein-Protein Interaction Network Analysis (WPPINA) pipeline as a means to define impacted biological processes, risk pathways and therein key functional players. We used previously established Mendelian forms of PD to identify seed proteins, and to construct a protein network for genetic Parkinson’s and carried out functional enrichment analyses. We isolated PD-specific processes indicating ‘mitochondria stressors mediated cell death’, ‘immune response and signaling’, and ‘waste disposal’ mediated through ‘autophagy’. Merging the resulting protein network with data from Parkinson’s GWAS we confirmed 10 candidate genes previously selected by pure proximity and were able to nominate 17 novel candidate genes for sporadic PD. Conclusions With this study, we were able to better characterize the underlying genetic and functional architecture of idiopathic PD, thus validating WPPINA as a robust pipeline for the in silico genetic and functional dissection of complex disorders. Electronic supplementary material The online version of this article (10.1186/s12864-018-4804-9) contains supplementary material, which is available to authorized users.


Background
In the past decade, high throughput technologies, such as DNA microarray, exome and genome sequencing, have provided investigators with an extensive catalogue of genes and genetic variations associated with complex disorders [1].
Parkinson's disease (PD) is a complex neurodegenerative condition; to date, familial PD is linked to 14 Mendelian genes (Table 1) [2], accounting for 1-5% of all cases [3], whilst for idiopathic PD (> 90% of all cases) multiple genome-wide association studies (GWAS) have identified over 30 risk loci throughout the human genome [4]. Importantly, a number of the most significant single nucleotide polymorphisms (SNPs) identified by the PD GWA studies map either directly or in close proximity to previously identified Mendelian genes (e.g. LRRK2, SNCA and GBA) indicating that such genes likely contribute to disease over a continuum of strong (Mendelian forms) and small to moderate (GWAS hits) effect size [5]. Yet, for the majority of the PD genome wide risk loci (and generally in GWAS), there is currently no consensus on how to select candidate genes in association with the most significant risk SNP(s) [6].
Interpreting GWAS results is a challenging task [6,7] as not only is it difficult to confidently define the authentic associated marker (the top identified SNP might be in linkage disequilibrium [LD] with the real associated SNP), but also it is now recognized that the practice of choosing candidate genes by proximity to a strongly associated SNP is unlikely accurate [8][9][10]. Furthermore, there are significant challenges in then characterising functional/biological effects to the top SNPs. This clearly indicates that making the step from genomic insight(s) to functional translation is currently arduous due to the limited success in progressing from genomic loci to candidate genes, coupled with the prevalent functional practice of studying one gene at a time.
In this study, we applied a systematic computational approach to investigate the interactome of the PD-Mendelian genes through weighted protein-protein interaction network analysis (WPPINA) [11] as a means to define PD-specific impacted pathways and stratify candidate genes within PD-GWAS loci. Specifically, we selected proteins encoded by PD-Mendelian genes as seeds, and harvested data of wet-lab proven protein-protein interactors (PPI) to build the PD-interactome for which we then performed functional annotation analysis to highlight impacted biological processes and to list all the proteins contributing to these systems. Finally, we mapped the genes encoding all such proteins to the loci highlighted in the PD meta-analysis [4] to assist identification of candidate genes driving risk in sporadic PD on the basis of available functional knowledge.
Using PD as a case study, we demonstrate that WPPINA, in combination with genomic approaches, is a powerful and adaptable tool to assess and prioritize candidate genes within risk-loci of complex disorders.

Design
We sought to consider the interactomes associated with the parkinsonian syndromes, which manifest in classical PD -the focus of the current studyand a subset of cases diagnosed with frontotemporal dementia (FTD) with parkinsonism and in a group of conditions classified under the umbrella term 'parkinsonian spectrum' (PS, see also Additional files 1 and 2: Table S1). We used all known Mendelian genes associated with these conditions as seeds (Table 1) to build syndrome-specific networks to: i) highlight syndrome-specific impacted biological processes and relevant nodes therein, and; ii) aid prioritization of candidate genes within the PD-GWAS loci. In this fashion, we were able to test for specificity of our method prior to carrying forward the PD-GWAS candidate genes prioritization on the basis of the PD-specific impacted biological processes and the key proteins therein.

Network construction and topological analysis
For each Mendelian gene used as a seed the WPPINA builds a protein network composed of the seeds, their direct interactors (first layer nodes) and the direct interactors of each first layer node (second layer nodes).
The 3 individual first-layer-networks (PD, FTD and PS) are shown in Fig. 1a-c. The combination of the 3 networks generated a nearly completely connected first-layer-network (Fig. 1d). Due to the seed centric approach used to construct the network, the first layer of a network is partially dependent upon hypothesis driven experiments. Moreover, the interactomes are subject to ascertainment bias resulting in interactomes of different sizes [12]. The ascertainment bias is an intrinsic property of the PPI networks, yet the seed centrality bias can be diluted by scaling up the network to the second layer, therefore including nodes that were not necessarily studied as direct interactor(s) of the seeds under investigation [11].
The construction of the second layer resulted in a comprehensive network made of 5414 nodes and 18,492 undirected edges (Fig. 1e); the seed-centrality issue was diluted as confirmed by an increase in average number of neighbors and connection density, and by a decrease of the average shortest path (Additional file 3). The analysis of the first + second layer networks led to the identification of inter-interactome hubs (IIHs) that are nodes with a degree of interactome connection (IDC) ≥ 0.6 (Additional file 4). These nodes represent the shortest path to connect more than 60% of the seed genes, therefore (based on the network parsimony hypothesis [13]) it is reasonable to expect these to reside in disease relevant pathways, congruent with the majority of the seeds, and therefore at the core of the syndrome-specific network.

Network functional analysis
We first analyzed the entire network (input of 5414 nodes) for enrichment of Gene Ontology (GO) biological processes (BPs) terms (Additional file 5); we then applied the same procedure to the IIHs for PD, FTD and PS separately (Additional files 6, 7 and 8) [12] highlighting a subset of semantic classes indicative of syndrome-specific biological processes (Additional file 9).
Although 'RNA metabolism' , 'organelles' , 'adhesion' , 'cytoskeleton' and 'chromatin' are general terms, a number of functional blocks clearly supported syndrome-specific processes ( Fig. 2 and Additional file 10) cell cycle and DNA metabolism (semantic classes: 'cell cycle checkpoints' , 'DNA damage check point' and 'repair') were specific to FTD/PS. Cell death pointed to 'intrinsic apoptosis' for both FTD/PD indicating 'DNA damage response' as specific to FTD, and 'mitochondria stressors' as specific to PD/PS. Stress indicated 'oxidative stress' as a common semantic class across all syndromes: 'ER stress' was shared across FTD and PDcomplexes that activate 'DNA damage response' were however (again) unique for FTDwhilst 'cell death via mitochondria' was unique for PD. These very links were further supported by localization Networks are reported as organic layout to highlight clustering properties. e. Seeds + first and second layer interactors combined in the final network, purple = seeds; green = IIHs that pointed at the 'nucleus' for FTD/PS and the 'mitochondria' for PS/PD. Development suggested 'cell growth' for FTD and 'neuronal/axonal development' for PS, this being in line with semantic classes indicating 'growth factors' within signaling, that were specific for FTD and PS. Conversely, the indication of 'cytokines' , 'immune receptors' , 'innate immune system' as specific to PS and PD in the immune system functional block was supported by a plethora of immune signaling semantic classes in the signaling functional block for PS/PD (Additional file 10). Finally, waste disposal was enriched in all 3 conditions; however, FTD was specifically characterized by semantic classes such as the 'ubiquitin proteasome system' and the 'unfolded protein response' , whilst PD by 'autophagy'. Functional annotation analysis performed through Panther replicated 95.5% of the g:Profiler findings (Additional file 11).

PD-GWAS candidate genes prioritization
We selected the functional blocks defined by PD-specific IIHs (Additional file 12), as they indicate conserved functions across all the Mendelian forms of PD, and extracted the list of proteins contributing to them in the complete PD-specific network (Additional file 13). We then used the significant PD-SNPs (n = 32) from the GWAS to define cis-haplotypes (Additional file 14) and map ORFs therein. Finally, we compared the ORFs with the proteins contributing to the enrichment of the PD-specific functional blocks: 19 genes were in cis-haplotypes defined by LD r 2 ≥ 0.8, 2 genes in LD r 2 ≥ 0.7, 2 genes in LD r 2 ≥ 0.6, and 4 genes in LD r 2 ≥ 0.5 with the top SNP of a risk locus, respectively, leading to the prioritization of a total of 27 candidate genes within the 32 PD-loci (Fig. 3).
To validate this analysis, we ran 100,000 random simulations. The p-values associated with the experimental analysis in comparison with the random distribution showed strong statistical significance (Additional file 17, p = 0.004 for LD r 2 ≥ 0.5 and p = 0.005 for LD r 2 ≥ 0.8). Analytic p-values values generated using the hypergeometric distribution also led to statistically significant results (p = 0.009 for LD r 2 ≥ 0.5 and p = 0.012 for LD r 2 ≥ 0.8). We undertook an additional validation step by assessing the total number of annotations reported in GO for each ORF within the analyzed haplotypes to verify

Discussion
Familial and sporadic PD cases associate with highly/ moderately penetrant variants in a number of Mendelian genes and multiple risk loci with small to moderate effect size, respectively. Despite an abundance of genetic data, translating PD genetics into understanding of the functional and molecular mechanisms underlying PD pathogenesis is a challenging task. For Mendelian genes, this is partially due to differences in the pace of genetic data generation (fast) versus functional and molecular assessments (slow) that still evaluate one gene at a time. For sporadic cases, their association with multiple risk markers (that indicate loci rather than genes) results in additional theoretical and practical issues affecting the design of functional experiments.
A number of methods to prioritize genes within GWAS loci have recently emerged: burden scoring at gene or pathway level [15,16], GWAS data integration with cis-eQTL signals or epigenetic markers [17][18][19][20], and PPIs-based methods [9,21]. However, while these hold much promise, none has yet solved the issue of confidently prioritizing genes within GWAS loci and it is still the case that in published GWAS nominated genes are assigned to the top SNPs predominantly by proximity, an approach that not only is arbitrary but also inaccurate and ineffective to gather insight on molecular mechanisms impacted in disease pathogenesis [8]. Therefore, there is an urgent requirement for additional pipelines to further and better characterize GWAS loci as well as the impacted biological processes, risk pathways and therein key functional players for potential future targeting is real [1,22].
In this view, our recently developed pipeline, WPPINA [11], is complementary to existing models given that it builds protein networks as a direct function of the disease under investigation (i.e. on the basis of their Mendelian genetics). By applying this pipeline to PD, FTD and PS we were able to highlight syndrome-specific PPI networks, impacted biological processes and associated functional players at the core of each disease (Fig. 2). We found that specific processes were differentially relevant across the 3 syndromes: 'immune response' , particularly 'signaling in immune response' , was relevant for PD and PS, 'DNA Here genes in black font confirm previous assignments by proximity, whilst genes in red font are the novel candidate genes for sporadic PD damage response' was relevant for FTD only, and 'waste disposal' was differentially relevant for PD (i.e. 'autophagy') and FTD (i.e. 'unfolded protein response' and 'ubiquitin proteasome system'). Particularly, we isolated 'mitochondria stressors' and 'mitochondria mediated cell death' , 'immune response and signaling' (i.e. 'immune system receptors' and 'cytokine signaling'), and 'waste disposal' mediated through 'autophagy' as PD-specific processes.
After verifying the suitability of our approach in highlighting syndrome-specific impacted biological processes, we then focused on the PD-interactome to prioritize candidate genes within PD-GWAS loci (Fig. 3) [4] on the basis of functional relevance rather than proximity. We used the totality of the proteins indicated by the PD-specific processes and mapped them onto the PD-GWAS loci, confirming 10 candidate genes (previously nominated by proximity) and identifying 17 novel candidate genes ( Fig. 3 and Additional file 15) for sporadic PD. Of note, for some of the loci we identified more than one candidate gene. This could be due to a number of reasons: i) the top SNP at the risk locus may mask a lower (but still biologically significant) signal of a second, independent SNP [23], and/or; ii) the same SNP(s) might influence more than one ORF [24] as, for example, in the case of rs34311866 for which classical wet-lab approaches prioritized different genes (i.e. GAK and TMEM175) [25,26]. It is also noteworthy that we were not able to assign a gene to every locus. This might reflect that: i) the PPI network is incomplete (i.e. partly because not all experimental data has been captured by manual curation, and partly because the network can only be as complete as the experiments that have been performed and published), and/or; ii) we may have lowered the number of potentially positive nodes by using a stringent filtering to only retain robustly validated interactors in the network. Finally, the risk loci may target functional elements that reside in -trans rather than in -cis (e.g. distal enhancers or silencers) or non-coding RNAs, rather than protein coding genes [6].

Conclusions
Through this in silico study of a multifactorial, complex neurodegenerative disorder, we show that our pipeline can be applied to: i) define disease-specific biological processes on the basis of known Mendelian genes, and; ii) provide a list of proteins involved in disease-specific processes that can be used to prioritize candidate genes in GWAS loci.
This pipeline aids in expanding on the genetic and functional architecture underlying idiopathic forms of disease. In doing so, it provides the basis for further genetic (e.g. such list of proteins/genes might be screened for rare variants within whole exome sequencing data sets) and hypothesis-driven functional studies to validate-risk pathways as well as identify targets for the development of therapies in the future.

Methods
Methods are described in detail in the Additional file 1.

Construction of the PPI network
PPIs of Mendelian-PD gene products were downloaded for each seed protein as MITAB 2.5 files (January-2016) from the IntAct [27], BioGRID [28], InnateDB, InnateD-B-all, InnateDB-IMEx [29] and MINT [30] databases by means of the PSICQUIC platform (http://www.ebi.ac.uk/ Tools/webservices/psicquic/view/main.xhtml) developed by the IMEx consortium [31]. All PPIs underwent quality control (QC) and filtering. Only human interaction and experimentally proven physical interaction were retained (predicted interactions and expanded complexes were removed). The interactions were then scored and thresholded taking into consideration the number of different publications and methods reporting the interaction (see [11] and Additional file 1 for details). Polyubiquitin-C, B and D (UBC, UBB and UBD) were excluded from the network as they may indicate unspecific binding of ubiquitin to proteins tagged for degradation.

Topological analysis
The inter-interactome degree (IID) for each single node was calculated based on the number of different interactomes that node belonged to. The interactome connection degree (ICD) of a node equates to the IID divided by the total number of seeds in the network and ranges between 1 (nodes able to bridge all the interactomes in the network) and 1/number of seeds (nodes unable to bridge any interactomes in the network). Nodes with IDC ≥ 0.6 are inter-interactomes hubs (IIHs).

Gene prioritization -GWAS
We used thirty-two relevant SNPs as per the PD meta-analysis (form discovery phase and/or joint analysis) [4]. The genes mapping to the GWAS-loci (SNPSNAP, reference EU 1000G; locus definition by linkage disequilibrium (LD) r 2 > 0.5) were matched with those encoding proteins contributing to the PD-specific risk-processes highlighted by WPPINA to aid prioritization of genes within the PD-GWAS loci. Results were statistically validated by generating 100,000 random gene-sets of sizes similar to the lists of open reading frames (ORFs) in LD with the PD-GWAS SNPs. P-values were calculated considering the total number of random matches falling above the actual number of experimental matches divided by the total number of trials. Analytic p-values were also calculated by using the hypergeometric distribution. An additional assessment was performed by evaluating the number of annotations for each gene reported in GO (GO annotation service http://amigo.geneontology.org/grebe) to verify whether differences in the annotation sizes might have influenced and driven gene prioritization.

Cell type expression
We evaluated cell specific expression profile for genes prioritized in each PD locus. The individual expression FPKM data were downloaded for human temporal lobe cortex mature astrocytes, neurons, microglia, oligodendrocytes and endothelial cells from Additional file 1 of Zhang et al. [14].