Multivariate genome wide association and network analysis of subcortical imaging phenotypes in Alzheimer’s disease

Background Genome-wide association studies (GWAS) have identified many individual genes associated with brain imaging quantitative traits (QTs) in Alzheimer’s disease (AD). However single marker level association discovery may not be able to address the underlying biological interactions with disease mechanism. Results In this paper, we used the MGAS (Multivariate Gene-based Association test by extended Simes procedure) tool to perform multivariate GWAS on eight AD-relevant subcortical imaging measures. We conducted multiple iPINBPA (integrative Protein-Interaction-Network-Based Pathway Analysis) network analyses on MGAS findings using protein-protein interaction (PPI) data, and identified five Consensus Modules (CMs) from the PPI network. Functional annotation and network analysis were performed on the identified CMs. The MGAS yielded significant hits within APOE, TOMM40 and APOC1 genes, which were known AD risk factors, as well as a few new genes such as LAMA1, XYLB, HSD17B7P2, and NPEPL1. The identified five CMs were enriched by biological processes related to disorders such as Alzheimer’s disease, Legionellosis, Pertussis, and Serotonergic synapse. Conclusions The statistical power of coupling MGAS with iPINBPA was higher than traditional GWAS method, and yielded new findings that were missed by GWAS. This study provides novel insights into the molecular mechanism of Alzheimer’s Disease and will be of value to novel gene discovery and functional genomic studies.


Background
Alzheimer's disease (AD) is a debilitating and highly heritable disease with great complexity in its genetic contributors [1]. Genome-wide association studies (GWAS) of AD or AD biomarkers have been performed at the single-nucleotide polymorphism (SNP) level [2][3][4] as well as at the higher level (e.g., gene, pathway and/or network) [5][6][7][8]. It is widely recognized that AD has a complicated genetic mechanism involving multiple genes. Different combinations of functionally related variants in genes and pathways may interact to produce the phenotypic outcomes in AD, single SNP-level and genelevel GWAS results are unlikely to completely reveal the underlying genetic mechanism in AD. GWAS have greatly facilitated the identification of genetic markers (e.g., single nucleotide polymorphisms or SNPs) associated with brain imaging quantitative traits (QTs) in AD [9,10]. As a complex disease, it is highly likely that AD is influenced by multiple genetic variants [11,12]. The identified single-SNP-single-QT associations typically have small effect sizes. To bridge this gap, exploring single-SNP-multi-QT associations may have the potential to increase statistical power and identify meaningful imaging genetic associations. With this observation, we employ the MGAS (Multivariate Gene-based Association test by extended Simes procedure) tool [13] to perform multivariate GWAS on eight AD-relevant subcortical imaging measures.
In addition, biological interactions may be important in contributing to intermediate imaging QTs and overall disease outcomes [14]. Network-based analysis guided by biologically relevant connections from public databases provides a powerful tool for improved mechanistic understanding of complex disorders [15][16][17][18]. Considering that the etiology of AD might depend on functional protein-protein interaction (PPI) network, we conduct multiple iPINBPA (integrative proteininteraction-network-based pathway analysis) [19] network analyses on MGAS findings using the PPI data, and identify Consensus Modules (CMs) based on the iPINBPA discoveries. Functional annotation and network analysis are subsequently performed on the identified CMs.
In order to enhance the ability to recognize the aggregation effect of multiple SNPs, it may be desirable to perform association analysis at the SNP set (or gene) level rather than at a single SNP level. This paper aims to reveal the relationship between genetic markers and multiple phenotypes, improve statistical power, and find GWAS missing results by MGAS. Network analysis could provide meaningful biological relationships to help interpret GWAS data to further study the genetic mechanism of AD. A schematic framework of our analysis is shown in Fig. 1.

Participant characteristics
The subjects (N = 866) consisted of 467 males (53.9%) and 399 females (46.1%) aged 48-91 years. Shown in Table 1 are the demographic and clinical characteristics of these subjects stratified by five diagnostic groups. There is no significant difference on the APOE e4 status in the five diagnostic groups. Significant differences are observed in gender (p = 0.035) and education (p = 0.037). Age is significantly different across the five groups (p < 0.001). Furthermore, eight neuroimaging phenotypes (LAmygVol,RAmygVol,LHippVol,RHippVol, LAccum-Vol,RAccumVol,LPutamVol,RPutamVol; see Table 2) show the significant difference across the five diagnostic groups (p < 0.001). Shown in Fig. 2 is the correlation matrix of these eight phenotypes. The correlation between LHippVol and RHippVol (r = 0.83) and that between LPutamVol and RPutamVol (r = 0.90) are among the highest.

Multivariate genome wide association study
In multivariate genome wide association study (MGAS) [13,20], the top SNP hit is rs769449 from the APOE and TOMM40 region (p = 1.19E-09) ( Table 3). According to the hypothesis that genes are the functional units in biology [15,21], multivariate gene-level association p-values were also obtained by MGAS which combines p-value information in regressing univariate phenotypes on common SNPs. Figure 3 shows the Manhattan plot of the genebased MGAS results. Using Bonferroni corrected p-value of 0.05 as the threshold, three genes (APOE, TOMM40, APOC1) were significantly associated with the studied eight subcortical measures. Table 4 shows that the top 10 genelevel findings identified by MGAS, where APOE (p = 2.77E-08), TOMM40 (p = 3.49E-08), and APOC1 (p = 2.09E-06) are the well-known AD risk regions. LAMA1 (p = 3.79E-05) was reported to encode the laminin alpha subunit associated with late onset AD in the Amish [22]. HSD17B7P2 (p = 8.40E-05) was reported to play an important role in brain development [23]. The other five gene-level findings in top 10 are XYLB, NPEPL1, CYP24A1, OR5B2 and MIR7160.

Consensus modules
Consensus modules (CMs) were constructed based on our previous work [24]. To search for subnetworks in the multivariate GWAS finding, we ran iPINBPA ten times by varying the random seed value from 1 to 10. Table 5 shows the top 5 subnetworks identified in each run, including the Dice's coefficient value with the most similar modules in other runs. Compared with the standard iPINBPA method, our CM-based network strategy was designed to identify more reliable modules across multiple runs.

Pathway analysis of consensus modules
In our work, we hypothesize that the identified trait prioritized CMs with high replication might have strong functional associations with the studied subcortical volume phenotypes. We clustered the relevant pathways for five CMs and plotted a heat map to summarize the relationships between these pathways and CMs (Fig. 5). Figure 5 shows that Alzheimer's disease, Apoptosis, TNF signaling pathway, Herpes simplex infection, MAPK signaling pathway are the pathways significantly enriched by one or more CMs [25]. We also observe that CM 1 , CM 3 , CM 4 enriched many interesting pathways. In particular, CM 3 demonstrates the strongest functional association with AD (p = 4.94E-05).

Discussion
In this work, we performed multivariate genome wide association study (MGAS) of eight AD-relevant subcortical ROIs, using 866 samples in the ADNI database. To the best of our knowledge, this is the first MGAS on the quantitative traits of eight subcortical ROIs. In our MGAS, we confirmed associations at multiple genes previously associated with AD, such as APOE (p = 2.77E-08, rs769449), TOMM40 (p = 3.49E-08, rs769449), APOC1 (p = 1.18E-06, rs4420638), as well as identified a few novel associations shown in Table 4. Table 4 also shows that the associations to individual subcortical QTs (e.g. APOE, TOMM40, APOC1: associated to LAmygVol, RAmygVol, LHippVol and RHippVol) have a range of different significances.
XYLB (p = 6.72E-05, rs196376) had been reported to be associated with neurological diseases such as ischemic stroke [26]. We observed that this gene is associated to   RAmygVol, LHippVol and RHippVol. LAMA1 (p = 3.79E-05, rs656734) encodes one of the alpha 1 subunits of Laminin, which has been demonstrated to be expressed in the hippocampal neuronal cell layers [27]. NPEPL1 was confirmed to be a potential direct target of miR-19a in a breast cancer study [28] and miR-19a was up-regulated in primary motor cortex and hippocampus in the brain of amyotrophic lateral sclerosis mice at late disease stage [29]. In our study, we found that NPEPL1 was associated to LPutamVol (P LPutamVol = 2.13E-05) and RPutamVol (P RPutamVol = 3.39E-03). In Table 4, the hippocampus and amygdala volumes were associated with multiple genes. While CYP24A1 was associated with none of eight studied QTs, it was identified in MGAS to have an overall association with all eight QTs. The statistical efficacy of MGAS of the detected gene associations appears to be more powerful than univariate phenotype models. Given that OR5B2 and MIR7160 have not been reported to be related to AD or AD related biomarkers, it warrants further investigation to examine their roles on AD in independent cohorts. Because we found that different subnetworks could be identified by using different random seed values, we present the consensus modules discovered by an enhanced iPINBPA strategy. The genes for the CMs might not show a direct individual statistical significance but demonstrated a collected effect on the studied phenotypes. We assessed the significance of each identified consensus module (Table 6). CM 1 (Score = 3.32, p = 9.00E-04) contains totally 8 genes, including KEGG AD genes TNFRSF1A. CM 2 (Score = 1.41, p = 0.16) contains totally 4 genes without reaching the significance level. CM 3 (Score = 4.37, p = 1.24E-05) contains 6 genes, including KEGG AD genes APOE APP, and CASP3. CM 4 (Score = 4.32, p = 1.56E-05) contains 5 genes, including KEGG AD genes APOE, APP, and CASP3. CM 5 (Score = 1.89, p = 5.88E-02) contains 6 genes with a marginal significance. The genes in the significant CMs warrant further investigation. The consensus module strategy applied to the iPINBPA framework yielded more stable results than the standard iPINBPA.
The intersection of CM 3 and CM 4 yielded five genes, including APP, APOE, CASP3, C3, and PLTP. The C3 gene was shown to contribute to the pathogenesis of demyelinating disease by directly or indirectly chemoattracting encephalitogenic cells to the CNS [30]. The PLTP gene was reported to play an important role in Aβ metabolism and it is an interesting topic to further elucidate functions of PLTP in AD susceptibility. Table 7 shows the top ten pathways enriched by the intersection genes. Among these genes, the APP, APOE, and CASP3 genes are known AD risk factors. Several significant pathways were observed, including Alzheimer's disease (p-    P MAGS is the p-value of top 10 genes identified by MGAS; Chr: Chromosome; P LAmygVol is the p-value of top 10 genes associated to LAmygVol; P RAmygVol is the pvalue of top 10 genes associated to RAmygVol; P LHippVol is the p-value of top 10 genes associated to LHippVol; P RHippVol is the p-value of top 10 genes associated to RHippVol; P LAccumVol is the p-value of top 10 genes associated to LAccumVol; P RAccumVol is the p-value of top 10 genes associated to RAccumVol; P LPutamVol is the p-value of top 10 genes associated to LPutamVol; P RPutamVol is the p-value of top 10 genes associated to RPutamVol, Bold font indicates p-value < 0.  [31]. Serotonergic neurotransmission and synapse activity are highlighted as primary pathological factors in neuropsychiatric symptoms [32,33]. Pertussis toxin inhibits the apoptosis and DNA synthesis caused by FAD APP mutants which precedes FAD APP-mediated apoptosis in neurons and inhibition of neuronal entry into the cell cycle inhibits the apoptosis [34]. Apoptotic pathways and DNA synthesis are activated in neurons in the brains of individuals with AD. Due to the limited number of samples available to us, in this work we were only able to perform a discovery study. In the future, when more data become available replication studies in independent cohorts warrant investigation to validate the identified CMs.

Conclusion
In this study, we performed MGAS analysis to explore the multivariate imaging genetic association effects for a set of AD-related subcortical measures. In addition, we conducted the iPINBPA network analysis to discover consensus modules related to these imaging phenotypes from a protein-protein interaction network. The MGAS analysis identified several genes associated with the studied  imaging phenotypes, including APOE, TOMM40, APOC1, LAMA1, XYLB, HSD17B7P2 and others. The statistical power of coupling MGAS with iPINBPA was higher than traditional GWAS method, and yielded findings missed by GWAS. In this work, we reported top five consensus modules based on MGAS results. Network-based analysis can take into account information on biological relationships to interpret GWAS data. Our results suggested several susceptible genes and network modules for further investigation and replication to better understand the genetic mechanism of Alzheimer's Disease.

Subjects and data
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see www.adni-info.org. Baseline 3 T MRI scans, demographic information, and diagnosis for the ADNI-1 and ADNI-GO/2 cohorts were downloaded [35]. MRI scans were analyzed using FreeSurfer version 5.1 for brain segmentation. We examined the volume measures of 14 subcortical ROIs; see Tables 1-2. We performed analysis of variance (ANOVA) to evaluate   Table 2) in subsequent genetic association studies. Genotyping data of both ADNI-1 and ADNI-GO/2 cohorts were downloaded, and then quality controlled and combined as described in [36]. A total of 866 non-Hispanic Caucasian participants with both complete subcortical imaging measurements and genotyping data were included in the study. The study sample (N = 866) included 183 cognitively normal (CN), 95 significant memory concern (SMC), 281 early MCI (EMCI), 177 late MCI (LMCI) and 130 AD subjects. The demographic and clinical characteristics of participants, stratified by the diagnosis, are shown in Table 1.

Multivariate genome wide association study
GWAS was performed to examine the main effects of 563, 980 SNPs on eight subcortical measures as quantitative traits (QTs). Linear regression model was performed using PLINK to examine the association between each SNP-QT pair (https://www.cog-genomics.org/plink2) [37]. An additive genetic model was tested with age, gender and brain volume as covariates. We computed the correlation matrix (8 × 8 matrix) for the QT data containing eight imaging phenotypes (Fig. 2). We applied the MGAS (Multivariate Gene-based Association test by extended Simes procedure) tool to all 563,980 SNPs and examined their multivariate gene-based associations with eight imaging QTs [13]. A Manhattan plot was generated using R (http:// www.r-project.org) to visualize the gene-level MGAS results for our work (Fig. 3).We obtained one multivariate gene-based p-value P MGAS as follows.
Here, q e represents the effective number of p-values within a gene, q ej represents the effective number of p-values among the top j p-values where j runs from 1 to 8 × 563,980, and p j represents the j-th p-value in the list of ordered p-values. P MGAS is the smallest weighted p-value within a gene associated with the null hypothesis that none of the eight phenotypes are related to the 563,980 SNPs within the gene, and the alternative hypothesis that at least one of the eight phenotypes is related to at least one of the 563,980 SNPs. We identified 1386 genes with p-value< 0.05 [13,38].

Identifying consensus models using iPINBPA
This study used the protein-protein interaction (PPI) data from the Human Protein Reference Database (HPRD, http:// www.hprd.org) [39], containing 9617 proteins and 39,240 interactions. Gene-level p-values obtained from MGAS of subcortical imaging phenotypes were mapped to the PPI network. Then the network was followed by an iPINBPA (integrative protein-interaction-network-based pathway analysis) procedure [19] to identify enriched PPI network modules. Consensus modules (CMs) were identified using the following approach based on our prior study [24].
Briefly, building on our prior study, we focus on analyzing the top 5 subnetworks (TN 1 , TN 2 , TN 3 , TN 4 , TN 5 ) in each iPINBPA run. Let TN ij be the top i-th subnetwork identified in the j-th run, where i ∈ {1, 2, …, 5} and j ∈ {1, 2, ...10}. We first find SN n (TN ij ), which is the most similar subnetwork to TN ij in the n-th run, where n ∈ {1, 2, …, 10}\{j}. Clearly, we have. SN n (TN ij ) = argmax sn DC(TN ij , sn), where sn is any subnetwork enriched in Run n, and DC(x, y) indicates the dice coefficients between two subnetworks x and y. Consequently, for Run j, we define its i-th consensus module CM ij as follows.  Namely, CM ij is the intersection of TN ij and its most similar subnetworks identified in all the other runs. In our empirical study, we will report the consensus modules based on Run 1, i.e., CM i1 as the i-th consensus module.

Functional analysis
Cytoscape 3.4 [40] was used to visualize the identified CMs. We used ToppGene online tool (https://toppgene. cchmc.org/) for functional enrichment analysis. The ToppGene suite is an advanced bioinformatics tool, it could detect and arrange candidate genes through a comprehensive assessment of a variety of factors, including gene ontology (GO) annotating, phenotype, signaling pathway and protein interactions from a specific list of genes [41]. In this case, the top 10 findings of our multivariate gene-based association analysis were analyzed for functional enrichment. For the identified CMs, we also performed functional enrichment analysis using the ToppGene Suite.