Defining a SNP set
Before introducing a detailed description of the method used to perform SNP set analysis, it is important to clarify how a SNP set can be defined.
The first step in defining a SNP set is mapping SNPs to genes. SNPs may fall within coding regions of genes, noncoding regions of genes, or in the intergenic regions between genes. Each SNP $V_i$ in a GWA study, with i = 1, ..., n, is associated with a gene $G_j$, where j = 1, ..., M indexes the M genes, if the gene contains the SNP or is the closest gene to the SNP. A SNP located in a region shared by two overlapping genes is mapped to both genes. SNPs that lie more than a fixed number of kilobases (kb) away from any gene are not considered. In [13], SNPs are associated with a given gene if they are within 500 kb of it. The 500 kb threshold was chosen because most enhancers and repressors lie <500 kb away from genes and most LD blocks are <500 kb long.
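The mapping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function name `map_snps_to_genes`, the dictionary layouts, and the toy coordinates are all hypothetical, and ties between equally distant genes are resolved arbitrarily (first gene encountered), which the text does not specify.

```python
from collections import defaultdict

def map_snps_to_genes(snps, genes, max_dist=500_000):
    """Map each SNP to the gene(s) containing it, else to the closest gene.

    snps:  dict snp_id -> (chromosome, position)
    genes: dict gene_id -> (chromosome, start, end), with start <= end
    SNPs farther than max_dist base pairs from every gene are dropped.
    Returns dict gene_id -> set of snp_ids.
    """
    gene_to_snps = defaultdict(set)
    for snp_id, (chrom, pos) in snps.items():
        containing = []              # genes whose interval contains the SNP
        best, best_dist = None, None  # closest non-containing gene so far
        for gene_id, (g_chrom, start, end) in genes.items():
            if g_chrom != chrom:
                continue
            if start <= pos <= end:
                containing.append(gene_id)
                continue
            dist = start - pos if pos < start else pos - end
            if best_dist is None or dist < best_dist:
                best, best_dist = gene_id, dist
        if containing:
            # A SNP in overlapping genes is mapped to all of them.
            for gene_id in containing:
                gene_to_snps[gene_id].add(snp_id)
        elif best is not None and best_dist <= max_dist:
            gene_to_snps[best].add(snp_id)
    return dict(gene_to_snps)

# Toy data (assumed coordinates, for illustration only).
genes = {"G1": ("1", 1000, 5000), "G2": ("1", 4000, 9000), "G3": ("2", 100, 200)}
snps = {
    "rs1": ("1", 4500),       # inside both G1 and G2
    "rs2": ("1", 9500),       # intergenic, closest to G2
    "rs3": ("1", 2_000_000),  # more than 500 kb from any gene: dropped
    "rs4": ("2", 150),        # inside G3
}
mapping = map_snps_to_genes(snps, genes)
```

In this toy example `rs1` is assigned to both overlapping genes and `rs3` is discarded, mirroring the two special cases mentioned in the text.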
The second step is mapping genes to pathways. The pathways are predefined lists of genes based on a priori biological knowledge, for example genes that are coexpressed in a particular cellular mechanism or function [17–19]. We use the Molecular Signatures Database (MSigDB) [3] as a source of gene pathways.
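Combining the two steps, a SNP set can be obtained by collecting the SNPs mapped to the genes of a pathway. The sketch below assumes that a SNP set is simply the union of the SNPs of its member genes, which is our reading of the two-step definition; the function name, the pathway names, and the gene identifiers are hypothetical.

```python
def snp_set_for_pathway(pathway_genes, gene_to_snps):
    """Return the union of the SNPs mapped to any gene of the pathway.

    Genes without mapped SNPs (or absent from gene_to_snps) contribute nothing.
    """
    snp_set = set()
    for gene in pathway_genes:
        snp_set |= gene_to_snps.get(gene, set())
    return snp_set

# Toy gene-to-SNP mapping and pathway definitions (assumed, for illustration).
gene_to_snps = {"G1": {"rs1"}, "G2": {"rs1", "rs2"}, "G3": {"rs4"}}
pathways = {"PATHWAY_A": ["G1", "G2"], "PATHWAY_B": ["G3", "G9"]}

snp_sets = {
    name: snp_set_for_pathway(members, gene_to_snps)
    for name, members in pathways.items()
}
```

A SNP shared by several genes of the same pathway is counted once, because the result is a set.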
Random set methods
Random Set (RS) scoring methods were originally introduced by Efron and Tibshirani [6] to study the enrichment signal in gene set analysis of gene expression data, but the methods they proposed are more widely applicable.
The main idea behind RS methods is that any method for assessing gene sets should compare a given gene set score not only to scores from permutations of the sample labels, but also to scores from sets formed by random selections of genes.
In fact, any approach to gene set analysis begins with the computation of some enrichment score ES for each gene set $\mathcal{S}$, and computes its significance by comparison with permutation values $ES^{*}$. Efron and Tibshirani [6] argue that a second kind of comparison operation, called "row randomization", is also needed to avoid bias in the determination of significance.
To clarify the RS approach, let us consider a simplified statement of the gene set problem, proposed by Efron and Tibshirani and adapted here to the SNP data framework.
Let X indicate an $n \times \ell$ matrix of genotypic observations, where n is the total number of SNPs and $\ell$ is the total number of samples, with the first $\ell_1$ columns of X representing healthy control samples and the remaining $\ell_2$ columns representing case samples, $\ell_1 + \ell_2 = \ell$. A statistic $D_i$, i = 1, ..., n, is computed for each marker. Consider a single gene set
$\mathcal{S}$ with m genes and the hypothesis that $\mathcal{S}$ is enriched. Testing this hypothesis is equivalent to asking whether the m D-values have large magnitude (positive or negative), with "large" still to be defined. The basic idea underlying enrichment, as stated by Subramanian [3], is that a biologically related set of genes can be detected from the general effect of its constituent D-values, whether or not the individual genes are significantly nonzero. Let $D_{\mathcal{S}}$ indicate the set of m D-values in $\mathcal{S}$, and let $ES = ES(D_{\mathcal{S}})$ define an enrichment test statistic, with larger values of ES indicating greater enrichment. Testing $\mathcal{S}$ for enrichment requires a distribution for ES under the null hypothesis. Two quite different models for what the null hypothesis might mean can be considered: a randomization model, in which the set is formed by a random selection of markers from the full set, and a permutation model, in which the sample labels are permuted.
The randomization of the markers and the permutation of the labels can be combined into a method called "restandardization". Restandardization can be thought of as a method for correcting the permutation values of ES to take into account the overall null distribution of ES in the randomization model. The restandardized enrichment score (RES) used is defined as:
where $(\mu^{\dagger}, \sigma^{\dagger})$ are the mean and standard deviation of $ES^{\dagger}$, the enrichment score under the randomization model, and $(\mu^{*}, \sigma^{*})$ are the corresponding quantities based on label permutations. In general this requires two nested permutation procedures, which is computationally intensive. Fortunately, the RS method has an appealing feature: for certain choices of the summary statistic, the restandardized score can be computed easily by calculating the gene-wise means and standard deviations analytically, without having to draw random sets of genes. As a result, evaluating statistical significance requires only label permutations [6, 15].
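To illustrate why random sets need not be drawn explicitly, the following sketch compares analytic randomization moments with a Monte Carlo estimate. It assumes, purely for illustration, that the summary statistic is the mean of the D-values in the set; in that case the mean over random sets of size m equals the overall mean of the D-values, and the variance follows from the standard formula for sampling without replacement from a finite population. The function name and the simulated D-values are hypothetical.

```python
import math
import random
import statistics

def randomization_moments_mean(d_values, m):
    """Analytic mean and standard deviation of the set mean over random sets.

    Assumes sets of size m are drawn without replacement from the n values,
    so Var = (pop_var / m) * (n - m) / (n - 1).
    """
    n = len(d_values)
    mu = statistics.fmean(d_values)
    pop_var = statistics.pvariance(d_values, mu)
    sigma = math.sqrt(pop_var / m * (n - m) / (n - 1))
    return mu, sigma

# Simulated marker-level statistics (assumed data, fixed seed).
rng = random.Random(0)
d_values = [rng.gauss(0.0, 1.0) for _ in range(200)]
m = 20

# Analytic randomization moments: no random sets are drawn.
mu_dag, sigma_dag = randomization_moments_mean(d_values, m)

# Monte Carlo check: draw random sets explicitly and summarize their means.
draws = [statistics.fmean(rng.sample(d_values, m)) for _ in range(20000)]
mu_mc = statistics.fmean(draws)
sigma_mc = statistics.pstdev(draws)
```

The two pairs of moments agree up to Monte Carlo error, which is the property that allows the inner randomization loop to be replaced by a closed-form calculation.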
Random Set method for SNP data: RSSNP
RSSNP is designed for genome-wide SNP data with binary categorical phenotypes, for example cases and healthy controls.
The first step in the method is computing a correlation or association statistic $D_i$ for each SNP $V_i$, i = 1, ..., n. The association of a SNP with a trait can be assessed under five different genetic models [20]: the general, dominant, recessive, multiplicative risk and additive risk models. The first three models use a $\chi^2$ test (or Fisher's exact test) on genotype counts to compute association. The multiplicative risk model uses a $\chi^2$ test or Fisher's exact test on allele counts. The additive risk model uses a Cochran-Armitage test for trend [21] to assess the association of a SNP with disease risk.
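As an example of a single-SNP statistic, the sketch below implements the allelic $\chi^2$ test of the multiplicative risk model on a 2 × 2 table of allele counts. It is a minimal illustration under stated assumptions: genotype counts are given as (AA, Aa, aa) triples, no continuity correction is applied, and the 1-degree-of-freedom p-value is obtained from the complementary error function. The function name and the counts are hypothetical.

```python
import math

def allelic_chi2_test(case_genotypes, control_genotypes):
    """Allelic chi-square test (1 df) from genotype counts (AA, Aa, aa).

    Returns (chi2, p_value). A degenerate table (an empty row or column)
    is reported as chi2 = 0, p = 1, which is a convention assumed here.
    """
    def allele_counts(g):
        # Each AA carries two A alleles, each Aa one A and one a, and so on.
        return 2 * g[0] + g[1], g[1] + 2 * g[2]

    a, b = allele_counts(case_genotypes)      # A and a alleles in cases
    c, d = allele_counts(control_genotypes)   # A and a alleles in controls
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = total * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Toy counts: cases carry 60 A and 40 a alleles, controls 40 A and 60 a.
chi2, p_value = allelic_chi2_test((20, 20, 10), (10, 20, 20))
```

For these counts the statistic equals 8.0, which corresponds to a p-value of about 0.0047.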
After computing the single-SNP associations, RSSNP computes the enrichment of these associations in a predefined gene set $\mathcal{S}$. The mapping of each SNP $V_i$ to genes is discussed above. The relevant quantities are:

n = the number of genotyped SNPs;

d = the number of SNPs with p-value P less than or equal to a given threshold α;

m = the number of SNPs in $\mathcal{S}$;

y = the number of SNPs belonging to $\mathcal{S}$ with p-value P ≤ α.
RSSNP assesses whether the number y of SNPs that are associated with the phenotype and belong to $\mathcal{S}$ is compatible with chance or indicates an overrepresentation of associated SNPs in gene set $\mathcal{S}$. Assessing the statistical significance of y requires the distribution of y under two null hypotheses, as previously stated [6]. The first is the hypothesis of no association between genotype and phenotype (the permutation null hypothesis). In particular, the method assesses the probability of observing values of y greater than the observed one when genotype and phenotype are independent random variables. A second source of randomness for y comes from $\mathcal{S}$. Knowing that d of the n SNPs have p-value P ≤ α and that y of them fall in a gene set $\mathcal{S}$ of size m, how many significant SNPs would fall in a set composed of m SNPs drawn randomly from the n SNPs available? To take this source of randomness into account, the probability of observing values of y greater than the one observed in the actual experimental conditions has to be assessed under the hypothesis (the randomization null hypothesis) that the m loci in the gene set $\mathcal{S}$ have been chosen randomly from the full set of n SNPs. This problem is easy to solve, since under this model the distribution of y is hypergeometric, Hyp(m, d, n), with mean $\mu = m d / n$ and variance $\sigma^2 = m \, \frac{d}{n} \left(1 - \frac{d}{n}\right) \frac{n - m}{n - 1}$. To assess the statistical significance of y under the two null hypotheses simultaneously, the following procedure is carried out.
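The hypergeometric moments used above are standard and can be checked numerically. The sketch below computes the closed-form mean and variance of Hyp(m, d, n) and compares them with the values obtained by summing over the probability mass function; the function names and the toy values of n, d and m are hypothetical.

```python
import math

def hypergeom_moments(m, d, n):
    """Closed-form mean and variance of y ~ Hyp(m, d, n).

    y counts how many of the d significant SNPs fall in a set of m SNPs
    drawn at random, without replacement, from n SNPs.
    """
    mean = m * d / n
    var = m * (d / n) * (1 - d / n) * (n - m) / (n - 1)
    return mean, var

def hypergeom_pmf(k, m, d, n):
    """P(y = k): k significant SNPs among the m drawn."""
    return math.comb(d, k) * math.comb(n - d, m - k) / math.comb(n, m)

# Toy values: 1000 SNPs, 50 significant, gene set of 40 SNPs.
n, d, m = 1000, 50, 40
mean, var = hypergeom_moments(m, d, n)

# Brute-force check against the probability mass function.
support = range(max(0, m - (n - d)), min(m, d) + 1)
pmf_mean = sum(k * hypergeom_pmf(k, m, d, n) for k in support)
pmf_var = sum((k - pmf_mean) ** 2 * hypergeom_pmf(k, m, d, n) for k in support)
```

With these values a random set of 40 SNPs is expected to contain 2 significant SNPs, so an observed y well above 2 would point towards enrichment.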
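(1) Permute the labels of the samples Π times. For each permutation π = 1, ..., Π:
(i) Compute the number of significant SNPs, $d_\pi$.
(ii) Compute the number of significant SNPs belonging to $\mathcal{S}$, $y_\pi$.
(iii) Compute the mean $\mu_\pi$ and variance $\sigma_\pi^2$ under the hypergeometric distribution $\mathrm{Hyp}(m, d_\pi, n)$.
(iv) From the above $y_\pi$, $\mu_\pi$, and $\sigma_\pi$ compute $RES_\pi = (y_\pi - \mu_\pi) / \sigma_\pi$.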
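(2) Compute the p-value of $\mathcal{S}$ as the fraction of permutations whose restandardized score is at least as large as the observed score RES, $p = \frac{1}{\Pi} \sum_{\pi=1}^{\Pi} I(RES_\pi \geq RES)$, where I is the indicator function.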
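The whole procedure can be sketched as follows. This is a hedged illustration rather than the reference implementation: it assumes genotypes coded as minor-allele counts (0, 1, 2), uses an allelic $\chi^2$ test as the single-SNP statistic, takes the restandardized score to be the hypergeometric z-score $(y - \mu)/\sigma$, sets the score to 0 when the variance is 0, and counts a permutation when its score is greater than or equal to the observed one. All function names and the toy data are hypothetical.

```python
import math
import random

def allelic_p_value(genotypes, labels):
    """1-df allelic chi-square p-value for one SNP (labels: 1 = case, 0 = control)."""
    a = b = c = d = 0  # minor/major allele counts in cases (a, b) and controls (c, d)
    for g, lab in zip(genotypes, labels):
        if lab == 1:
            a += g
            b += 2 - g
        else:
            c += g
            d += 2 - g
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2.0))

def res_score(X, labels, snp_set, alpha):
    """Hypergeometric z-score of y, the number of significant SNPs in snp_set."""
    n, m = len(X), len(snp_set)
    significant = {i for i, row in enumerate(X)
                   if allelic_p_value(row, labels) <= alpha}
    d = len(significant)
    y = len(significant & snp_set)
    mean = m * d / n
    var = m * (d / n) * (1 - d / n) * (n - m) / (n - 1)
    # Assumed convention: a degenerate distribution yields a score of 0.
    return (y - mean) / math.sqrt(var) if var > 0 else 0.0

def rs_snp_p_value(X, labels, snp_set, alpha=0.05, n_perm=200, seed=0):
    """Observed score and permutation p-value for one SNP set."""
    rng = random.Random(seed)
    observed = res_score(X, labels, snp_set, alpha)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # permute the sample labels
        if res_score(X, permuted, snp_set, alpha) >= observed:
            hits += 1
    return observed, hits / n_perm

# Toy data: 5 SNPs (rows) by 20 samples (columns); only SNP 0 separates the groups.
labels = [1] * 10 + [0] * 10
X = [[2] * 10 + [0] * 10] + [[1] * 20 for _ in range(4)]
snp_set = {0, 1}
observed, p_value = rs_snp_p_value(X, labels, snp_set)
```

Because only the labels are permuted, the hypergeometric moments are recomputed in closed form inside each permutation and no random SNP sets have to be drawn.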
Since several gene sets are considered in the analysis, the false discovery rate (FDR) and the family-wise error rate (FWER) are computed as proposed by Wang et al. [13] in order to control for multiple hypothesis testing.
FDR, i.e. the expected fraction of false-positive findings, is calculated as:
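where T is the total number of gene sets. The FWER is evaluated as the fraction of all permutations whose highest standardized enrichment score over all gene sets is higher than $RES(\mathcal{S})$ for a given gene set $\mathcal{S}$:
$$\mathrm{FWER}(\mathcal{S}) = \frac{1}{\Pi} \sum_{\pi=1}^{\Pi} I\left( \max_{t = 1, \ldots, T} RES_\pi(\mathcal{S}_t) > RES(\mathcal{S}) \right)$$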
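The FWER computation described above can be sketched directly from its verbal definition. The code assumes that the scores of all T gene sets have already been computed for the observed labels and for each of the Π permutations; the function name and the toy scores are hypothetical.

```python
def fwer(observed_res, permuted_res):
    """Family-wise error rate for each gene set.

    observed_res: list of T observed scores, one per gene set.
    permuted_res: list of Pi lists, each holding the T scores of one permutation.
    For each gene set, returns the fraction of permutations whose highest
    score over all gene sets is strictly higher than the observed score.
    """
    max_per_perm = [max(scores) for scores in permuted_res]
    n_perm = len(max_per_perm)
    return [
        sum(1 for top in max_per_perm if top > obs) / n_perm
        for obs in observed_res
    ]

# Toy scores: T = 2 gene sets, Pi = 4 permutations.
observed_res = [3.0, 1.0]
permuted_res = [[0.5, 1.2], [2.0, 0.1], [3.5, 0.3], [0.2, 0.9]]
fwer_values = fwer(observed_res, permuted_res)  # [0.25, 0.75]
```

Taking the maximum over all gene sets within each permutation is what makes the resulting value a family-wise, rather than a per-set, error rate.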