GWAS dataset
We used the GWAS dataset for schizophrenia from the Clinical Antipsychotic Trial of Intervention Effectiveness (CATIE) project [15]. The CATIE project is a multiphase randomized controlled trial initially designed to investigate and improve the use of antipsychotic medications in treating schizophrenia patients. We included the samples involving 738 schizophrenia patients and 733 controls, which were genotyped by Perlegen Sciences using the Affymetrix 500K and Perlegen's custom 164K chip, resulting in ~446 k genotyped SNPs. A detailed description of the samples can be found in reference [15]. We accessed this dataset (Distribution 7.0) from http://www.nimhgenetics.org/ through NIMH approval. Only the Caucasian samples were used. We followed the pipeline of quality control, including the selection of samples and markers, as described in references [8, 16, 17].
Gene-wise P value
To compute gene-wise P values, given multiple SNPs mapped to each gene, an ideal algorithm should account for potential confounding factors, such as gene length, SNP density, and LD structures. We incorporated the software tool VEGAS [13] to compute the gene-wise P values. VEGAS combines the information of multiple SNPs by making use of simulations from the multivariate normal distribution while explicitly accounting for the LD between markers. We followed the default settings in VEGAS; this default setting maps SNPs to genes with 50 kb extension of gene boundaries. The HapMap CEU samples (http://www.hapmap.org/, release R2) were selected to estimate the LD structure in our work, as we only included the Caucasian samples in the CATIE data.
We explored three options in VEGAS to compute gene-wise P values based on sets of SNPs: (1) using all the SNPs mapped to a gene (hereafter denoted as "VEGAS-all"); (2) using the top 10% SNPs based on SNP-level P values ("VEGAS-top"); and (3) using the most significant SNP, i.e., the SNP with the smallest P value ("minP"). Note that the minP option is the same as the smallest P value strategy widely used in many post-GWAS analysis methods. We included it here to compare with the other, more advanced approaches based on combining of multiple SNPs.
PPI dataset
Our background PPI network data were collected from six public PPI databases: MINT, IntAct, DIP, BioGRID, HPRD, and MIPS/MPact. We downloaded these data from the Protein Interaction Network Analysis (PINA) platform (March 2010) [18]. To ensure the reliability of the PPI data, we explicitly included only interactions that have experimental evidence and involve both interactors from human genes. Self-interaction and duplicates were removed. The final network included a total of 10,377 nodes and 50,109 interactions.
Module search algorithm
We first overlaid the GWAS data onto the whole human PPI network by assigning each node a z-score: , where is the inverse normal cumulative density function and P is the gene-wise P value from any one of the three methods. For a module with k genes, a module score was computed according to the Stouffer's Z-score method:. The detailed module construction process can be found in our previous work [8]. Briefly, for a given "seed node", a "best module" with the maximum module score will be returned in the context of the working parameters, i.e., d=2 and r>0.1. Here, d is distance for the nodes to be included to the seed node and r is the network score increment cutoff value.
The overall network weighted by the GWAS data serves as the working background. The restricted search strategy we proposed in this study is implemented as follows. First, we ranked all the nodes in the network according to their weights z
i
and started with the node that had the highest score as the seed to perform a module search. Once the module was generated, all the nodes present in this module, including the seed node itself, were then removed from the network, and the rest of the network constituted the new background network. A new module search round was started again, with the highest scored node in the new background network each time, until none of the nodes in the background network could generate a module with ≥5 nodes (i.e., the minimum number of nodes we required to define a module). This restricted strategy takes into account every single node in the background network but does not require using each of them as a seed node, which has been implemented in our previous work. Thus, this tactic could greatly reduce the computational duty and also avoid the heavy redundancy of resultant modules using our original algorithm.
Significance test
To estimate the significance of the resultant modules, we calculated P values based on the module scores (Z
m
). We adopted the method proposed in [19] to empirically estimate the null distribution, which is assumed to be a normal distribution. Specifically, we used the median-centered module score to estimate the location parameters δ and σ for the empirical null distribution using the R package locfdr and computed standardized module scores by Z
S
= (Z
m
'- δ)/σ, where Z
m
' is the median-centered module score. The final module P values were obtained using the standard calculation, P(Z
m
) = 1-Φ(Z
S
), where Φ is the normal cumulative density function. The module P values were then used for significance test of the resultant modules and help module selection.
Evaluation by ALIGATOR
We utilized the software tool ALIGATOR [6] to evaluate the module gene sets. The algorithm of ALIGATOR is initially designed to prioritize Gene Ontology (GO) categories using summary GWAS data at the SNP level. Building on a resampling strategy, ALIGATOR first pools all the SNPs and their P values from the GWAS data and builds the SNP collection. In each resample, the algorithm randomly selects SNPs from the collection and records the number of significant genes defined by the selected SNPs. Here, significant genes were defined as those with at least one SNP that has a P value less than a pre-defined cutoff value, e.g., 0.05. The random selection process keeps running until the significant genes targeted by the selected SNPs reaches the number of the significant genes in the real case. After the resampling SNP set is constructed, each of the GO categories are compared with the resample data to obtain the number of significant genes. This resampling process is repeated numerous times (e.g., 50,000), resulting in the null distribution of the number of significant genes in each GO category. Finally, an empirical P value can be computed for each GO category by comparing the significant genes in the category in the real case versus those in the resample sets.
In our application, instead of using GO categories, we constructed new gene sets based on the module genes. The module genes identified by each input dataset were pooled as one gene set, and 3 module gene sets were generated, corresponding to the input dataset of VEGAS-all, VEGAS-top, and minP. As a comparison, we also included the KEGG pathways (downloaded as of March, 2011) [3]. We restricted the KEGG pathway size (number of genes in a pathway) to ≥5 and ≤300. In total, there were 204 KEGG pathways plus 3 module gene sets collected for the ALIGATOR analysis. We followed the ALIGATOR default definition of significant genes, i.e., genes with at least one SNP having P<0.05, and we performed the resampling 10,000 times. Multiple testing correction by the Benjamini & Hochberg (BH) method [20] was then conducted.
Comparison with DAPPLE
We compared with another available network based GSA method, "Disease Association Protein-Protein Link Evaluator (DAPPLE)" [9]. We applied DAPPLE to the CATIE dataset. DAPPLE aims to evaluate whether genes in association loci are significantly connected by PPI, where the association loci are typically defined by the standard single marker analysis of a GWAS dataset for a complex disease/trait. Using the genes located in these association loci, DAPPLE searches for two types of subnetworks: a direct network, in which the input genes are directly connected, and an indirect network, in which the input genes are connected through a common interactor [9]. Both networks are then evaluated by a permutation test to assess their significance. Because the construction process of the resultant subnetworks starts with the input association genes/loci, the method relies heavily on the input genes and can generate different subnetworks if the input gene/locus list is changed. However, the ways to define associated genes have not been standardized yet, although all adopt a pre-defined hard cutoff. For example, the widely applied method employs the cutoff of 5 × 10-8, which is challenging in psychiatric diseases, because the association signals are typically weak in psychiatric GWAS datasets. Alternatively, a user-defined cutoff value, which can be less strict yet arbitrary, could be considered.
Functional evaluation
We performed pathway enrichment test of the three module gene sets using the canonical KEGG pathways. The hypergeometric test was implemented with the genes in the network used as the gene universe and module genes used as the genes of interest. Multiple testing correction was performed using the Bonferroni method.