Concordant gene set enrichment
In this study, we consider multiple large-scale two-sample gene expression data sets. We use K to denote the number of these data sets and m to denote the number of common genes in these data sets. For each of these data sets, we usually use a t-type test to evaluate the differential expression of each gene and a gene set enrichment analysis (GSEA) method to evaluate the enrichment level of a given gene set. In order to define and evaluate a concordant gene set enrichment when an integrative analysis is conducted for all K data sets, we categorize differential expression in each data set into three underlying (unobserved) representative categories: no change, positive change (or upregulated differential expression) and negative change (or downregulated differential expression). Due to data noise, what we observe from experiments may not indicate the underlying truth. (For example, a gene with slight downregulated differential expression may show a small positive t-type test value.) Although these categories are not observed in practice, they can be considered in a mixture model framework.
To understand the concept of concordant gene set enrichment, let us consider an artificial example. Given a pathway with 30 genes, suppose we know the underlying behavior of all these genes: 20 genes have positive changes consistently among all different data sets. Furthermore, if we randomly select 30 genes, suppose we also know that the expected number of genes with consistent positive changes among different data sets is just 5. In this case, we would conclude that the given gene set is concordantly enriched in upregulated differential expression (because 20 is clearly larger than 5). However, in practice, the underlying differential expression categories are not observed. Instead, they can be considered in a mixture model framework. Then, we need to develop a mathematical formula for the concordant enrichment score (CES) of a given gene set S that contains m_S genes, defined as the probability

CES_S = Pr(gene set S is concordantly enriched | observed data of gene set S),

which can be useful for prioritizing different gene sets in practice.
Before we derive the mathematical formula for the above probability, we need to explain the term "enriched". As suggested by Efron and Tibshirani [17], unless the test statistic for a gene set enrichment analysis (GSEA) considers the genome-wide background patterns (e.g. the statistics proposed in the original GSEA [7, 8]), it is necessary to consider the "row randomization" for genes in addition to the "column permutation" for samples. Therefore, the term "enriched" means "higher/better than expected".
Although many test statistics have been developed for GSEA with one large-scale expression data set, we still need to develop a new approach for this study. The motivation is that we need to address the component information of the genes in a gene set, i.e. whether a gene is upregulated, downregulated or non-differentially expressed. Most existing test statistics for gene set enrichment analysis are either nonparametric or functions of z-scores, which makes it difficult to analyze the component information. Therefore, based on the above discussion of the term "enriched", we propose a probability for measuring concordant gene set enrichment, defined as follows.
For a gene in a given gene set S, an event of interest can be: (1) the gene is concordantly upregulated; (2) the gene is concordantly downregulated; or (3) the gene is concordantly differentially expressed (either upregulated or downregulated). Our analysis methods for these different types of enrichment analysis are almost mathematically identical. For a mathematical notation of the above CES, we denote by U_i the indicator that the i-th gene in gene set S satisfies the event of interest. Let D be the observed data and π be the probability of the event of interest if a gene is randomly sampled. Then, we have

CES_S = Pr( U_1 + U_2 + · · · + U_{m_S} > m_S π | D ),

which is the probability that the number of events in gene set S is higher than expected by chance.
In order to calculate the CES practically, we propose a three-component multivariate mixture model in which each component is a normal distribution. The model configuration for these three components is consistent with the differential expression categories described above. This model is conceptually analogous to the simple normal mixture approach to differential expression analysis proposed by McLachlan et al. [14]. The special feature of our model is that we focus on specific combinations of components from different dimensions. We have previously used a bivariate version of this model to evaluate the concordance and discordance between two large-scale experiments with two sample groups [15] and to integrate two microarray data sets in differential expression analysis [16]. Before the model description, we first describe the related data preprocessing and differential expression test scores.
Data preprocessing
Because our proposed statistical method is developed based on the differential expression test scores, we assume that the given gene expression data sets have been preprocessed appropriately [18]. For a concordant integrative analysis of multiple data sets, we also need to select genes shared commonly by different data sets. This can be achieved using the genes' unique identifiers.
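As a minimal sketch of this gene-selection step (the identifier lists and data-set names below are hypothetical examples, not from the study), the common genes can be obtained by intersecting the identifier sets of all data sets:

```python
# Select genes shared by all K data sets via their unique identifiers.
# The data-set names and identifier lists are hypothetical.
datasets = {
    "study_A": ["TP53", "BRCA1", "EGFR", "MYC"],
    "study_B": ["TP53", "EGFR", "MYC", "KRAS"],
    "study_C": ["TP53", "EGFR", "MYC", "PTEN"],
}

# Intersect the identifier sets across all data sets.
common_genes = set.intersection(*(set(ids) for ids in datasets.values()))
```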
Differential expression test scores
For each of the two-sample gene expression data sets, we screen individual genes with the traditional two-sample Student's t-test. Several modified t-tests, such as the SAM t-test [19] and the moderated t-test [20], have been widely used in the differential expression analysis of microarray data. These test statistics can generally improve the control of false positives by "softly" filtering out genes with relatively small expression variance. However, we intend to consider all the genes equally important in the concordant integrative analysis of multiple data sets. Furthermore, a given gene can show different levels of variance in different data sets, which may make it difficult to use these modified t-tests. Therefore, we still recommend the traditional two-sample t-test as the differential expression test statistic. (In practice, other test statistics like the SAM t-test or the moderated t-test can still be considered when there is a strong reason to do so.) Because the sample size of a high-throughput study is usually not large, it is generally difficult to validate the normal distribution assumptions for the t-test. Therefore, instead of the theoretical t-distribution, we use the permutation procedure to compute the p-value of an observed t-statistic [21]. This approach has been widely adopted in the analysis of gene expression data [6].
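The permutation procedure can be sketched as follows; this is an illustrative implementation (not the authors' code), using a pooled-variance two-sample t statistic and random label permutations:

```python
import math
import random

def t_stat(x, y):
    """Two-sample Student's t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    sy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = math.sqrt(((nx - 1) * sx + (ny - 1) * sy) / (nx + ny - 2))
    return (mx - my) / (sp * math.sqrt(1 / nx + 1 / ny))

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """One-sided upper-tailed permutation p-value for the observed t statistic."""
    rng = random.Random(seed)
    observed = t_stat(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign sample labels
        if t_stat(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Hypothetical expression values for one gene in two sample groups.
p = permutation_pvalue([5.1, 5.3, 4.9, 5.2, 5.0], [1.0, 1.2, 0.9, 1.1, 1.0])
```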
For K two-sample gene expression data sets with m common genes, we compute the one-sided upper-tailed p-value p_{i,k} for gene X_i in the k-th data set, i = 1, 2, . . . , m and k = 1, 2, . . . , K. Then, we perform an inverse normal transformation to obtain a z-score: z_{i,k} = Φ^{−1}(1 − p_{i,k}), where Φ(·) is the cumulative distribution function (c.d.f.) of the standard normal distribution. This transformation has been widely used to improve the fitting of a mixture model [14]. Our proposed statistical methods for the concordant integrative analyses of multiple data sets are developed based on these sets of z-scores.
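The inverse normal transformation can be computed directly with the standard library; a small sketch:

```python
from statistics import NormalDist

def p_to_z(p):
    """Transform a one-sided upper-tailed p-value into a z-score: z = Phi^{-1}(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p)

# Small upper-tailed p-values map to large positive z-scores,
# p = 0.5 maps to 0, and p-values near 1 map to negative z-scores.
z_up = p_to_z(0.025)
z_null = p_to_z(0.5)
z_down = p_to_z(0.975)
```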
A mixture model
For each individual data set, we assume that a mixture of three normal distributions can fit the z-scores well. Let φ(z; µ, σ^2) denote the probability density function (p.d.f.) of a normal distribution with mean µ and variance σ^2. Three representative components are considered for the k-th data set (k = 1, 2, . . . , K): φ(z; µ_{0,k}, σ^2_{0,k}) for genes non-differentially expressed (no change), φ(z; µ_{1,k}, σ^2_{1,k}) for genes with upregulated differential expression (positive change) and φ(z; µ_{2,k}, σ^2_{2,k}) for genes with downregulated differential expression (negative change). Notice that µ_{0,k} = 0 and σ^2_{0,k} = 1 (a z-score under the null hypothesis follows the standard normal distribution because its associated p-value follows a standard uniform distribution). This configuration has been suggested in the analysis of gene expression data [14], although more components can be considered to improve the data fitting. Mathematically, we have the following density function for the k-th data set:

f_k(z) = Σ_{j=0}^{2} π_{j,k} φ(z; µ_{j,k}, σ^2_{j,k}),

which is a type of well-known simple normal mixture model, where π_{j,k} denotes the proportion of the j-th component.
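As a small numerical sketch of this three-component density (the component weights, means and variances below are arbitrary illustrative values, with the null component fixed at N(0, 1)):

```python
import math

def normal_pdf(z, mu, var):
    """p.d.f. of a normal distribution with mean mu and variance var."""
    return math.exp(-(z - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_density(z, weights, means, variances):
    """Three-component normal mixture density: f(z) = sum_j pi_j * phi(z; mu_j, var_j)."""
    return sum(w * normal_pdf(z, m, v) for w, m, v in zip(weights, means, variances))

# Illustrative parameters: 80% null N(0, 1), 12% upregulated, 8% downregulated.
weights = [0.80, 0.12, 0.08]
means = [0.0, 2.5, -2.5]
variances = [1.0, 1.5, 1.5]

f0 = mixture_density(0.0, weights, means, variances)  # density at z = 0
```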
When the above simple model is extended to accommodate the analysis of multiple data sets, we need to consider the combination of components from different dimensions (data sets). Then, there are 3^K different combinations. We assume that different data sets are collected independently. For the i-th gene with a list of z-scores (z_{i,1}, z_{i,2}, . . . , z_{i,K}) from different data sets, if we know all the related component information, then the joint density of these z-scores is the product of the marginal densities of the individual z-scores. Therefore, the following formula defines our basic mixture model for a concordant analysis:

f(z_{i,1}, z_{i,2}, . . . , z_{i,K}) = Σ_{j_1=0}^{2} Σ_{j_2=0}^{2} · · · Σ_{j_K=0}^{2} π_{j_1,j_2,...,j_K} Π_{k=1}^{K} φ(z_{i,k}; µ_{j_k,k}, σ^2_{j_k,k}),

where π_{j_1,j_2,...,j_K} is the probability of this gene being in a particular combination of components (j_1, j_2, . . . , j_K) in the different data sets. We call this model a partial concordance/discordance (PCD) model. Notice that a bivariate version of this model has been used to evaluate the overall concordance or discordance of two large-scale data sets and to conduct an integrative analysis of differential expression for two large-scale two-sample data sets [15, 16].
Model estimation
Our mixture model can be estimated by the well-developed EM algorithm [22]. In the model, the differential expression categories are considered as missing information. For any z-score vector (z_{i,1}, z_{i,2}, . . . , z_{i,K}), i = 1, 2, . . . , m, this information can be mathematically represented by the indicator δ_{i,(j_1,j_2,...,j_K)}, which equals one if each z_{i,k} is sampled from the j_k-th component (j_k = 0, 1 or 2 and k = 1, 2, . . . , K) and zero otherwise.
With only the observed data, the likelihood can be calculated by the following formula:

L(Θ) = Π_{i=1}^{m} [ Σ_{j_1=0}^{2} · · · Σ_{j_K=0}^{2} π_{j_1,...,j_K} Π_{k=1}^{K} φ(z_{i,k}; µ_{j_k,k}, σ^2_{j_k,k}) ],

where Θ represents the parameter space described previously. The "complete likelihood" based on the observed data and missing information can be calculated by the following formula:

L_c(Θ) = Π_{i=1}^{m} Π_{j_1=0}^{2} · · · Π_{j_K=0}^{2} [ π_{j_1,...,j_K} Π_{k=1}^{K} φ(z_{i,k}; µ_{j_k,k}, σ^2_{j_k,k}) ]^{δ_{i,(j_1,...,j_K)}}.

Then, we can derive the following E-step formula, which gives the conditional expectation of δ_{i,(j_1,...,j_K)} given the observed data and the current parameter estimates:

δ̂_{i,(j_1,...,j_K)} = π_{j_1,...,j_K} Π_{k=1}^{K} φ(z_{i,k}; µ_{j_k,k}, σ^2_{j_k,k}) / [ Σ_{j'_1=0}^{2} · · · Σ_{j'_K=0}^{2} π_{j'_1,...,j'_K} Π_{k=1}^{K} φ(z_{i,k}; µ_{j'_k,k}, σ^2_{j'_k,k}) ].

We can also derive the following M-step formulas:

π̂_{j_1,...,j_K} = (1/m) Σ_{i=1}^{m} δ̂_{i,(j_1,...,j_K)},
µ̂_{j,k} = Σ_{i=1}^{m} w_{i,j,k} z_{i,k} / Σ_{i=1}^{m} w_{i,j,k}, j = 1, 2,
σ̂^2_{j,k} = Σ_{i=1}^{m} w_{i,j,k} (z_{i,k} − µ̂_{j,k})^2 / Σ_{i=1}^{m} w_{i,j,k}, j = 1, 2,

where w_{i,j,k} = Σ_{(j_1,...,j_K): j_k = j} δ̂_{i,(j_1,...,j_K)} is the posterior weight of component j for gene i in data set k (µ_{0,k} = 0 and σ^2_{0,k} = 1 are kept fixed).
In the EM algorithm, we iterate the E-step and M-step until a numerical convergence of the likelihood (not the "complete likelihood"). Let L^{(t)} and L^{(t+1)} be the likelihood values calculated after the t-th and (t + 1)-th iterations, respectively. A numerical convergence is claimed if |L^{(t+1)} − L^{(t)}| < 0.001.
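A compact sketch of this EM algorithm for the PCD model (illustrative only: the function name `em_pcd`, initial values, variance floor and simulated data are our own assumptions; the null components are kept fixed at N(0, 1)):

```python
import itertools
import math
import random

def phi(z, mu, var):
    """Normal p.d.f. with mean mu and variance var."""
    return math.exp(-(z - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_pcd(Z, K, n_iter=100, tol=1e-3):
    """EM estimation of the PCD model for an m x K list of z-scores Z."""
    combos = list(itertools.product(range(3), repeat=K))  # 3^K component combinations
    m = len(Z)
    mu = [[0.0, 2.0, -2.0] for _ in range(K)]    # initial component means per data set
    var = [[1.0, 1.0, 1.0] for _ in range(K)]    # initial component variances per data set
    pi = {c: 1.0 / len(combos) for c in combos}  # mixture proportions
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component combination for each gene.
        delta, ll = [], 0.0
        for z in Z:
            w = {c: pi[c] * math.prod(phi(z[k], mu[k][c[k]], var[k][c[k]])
                                      for k in range(K)) for c in combos}
            tot = sum(w.values())
            ll += math.log(tot)
            delta.append({c: w[c] / tot for c in combos})
        # M-step: update proportions and the free (non-null) component parameters.
        for c in combos:
            pi[c] = sum(d[c] for d in delta) / m
        for k in range(K):
            for j in (1, 2):  # component j = 0 stays N(0, 1)
                w = [sum(d[c] for c in combos if c[k] == j) for d in delta]
                tot = sum(w)
                if tot > 1e-12:
                    mu[k][j] = sum(wi * Z[i][k] for i, wi in enumerate(w)) / tot
                    # Variance floor avoids degenerate (zero-variance) components.
                    var[k][j] = max(sum(wi * (Z[i][k] - mu[k][j]) ** 2
                                        for i, wi in enumerate(w)) / tot, 0.25)
        if abs(ll - prev_ll) < tol:  # numerical convergence of the likelihood
            break
        prev_ll = ll
    return pi, mu, var, ll

# Simulate a small two-data-set example: ~20% of genes concordantly
# upregulated (z ~ N(2.5, 1) in both data sets), the rest null N(0, 1).
random.seed(7)
Z = [([random.gauss(2.5, 1.0), random.gauss(2.5, 1.0)]
      if random.random() < 0.2 else
      [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)])
     for _ in range(300)]
pi, mu, var, ll = em_pcd(Z, 2)
```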
Concordant enrichment score
Suppose that we are interested in gene sets with concordant upregulated differential expression (the CES formulas for the other events of interest can be derived similarly). Then, we need to focus on the combination of components with (j_1 = 1, j_2 = 1, . . . , j_K = 1). Based on the mixture model, we can derive the following probability for a gene X_{S,i} in a given gene set S = {X_{S,i} : i = 1, 2, . . . , m_S}:

u_{S,i} = π_{1,1,...,1} Π_{k=1}^{K} φ(z_{S,i,k}; µ_{1,k}, σ^2_{1,k}) / f(z_{S,i,1}, z_{S,i,2}, . . . , z_{S,i,K}),

where (z_{S,i,1}, . . . , z_{S,i,K}) are the z-scores of gene X_{S,i} and f is the PCD model density defined above.
This probability u_{S,i} can be estimated as û_{S,i} by plugging in the estimated parameters of the PCD model. Let h_{S,i} be either 0 or 1. Under the assumption that the z-scores {z_{i,k} : i = 1, 2, . . . , m} from different genes are independent in each data set k, k = 1, 2, . . . , K, we can calculate the concordant enrichment score (CES) for a gene set S = {X_{S,i} : i = 1, 2, . . . , m_S}:

CES_S = Σ_{h_{S,1}=0}^{1} · · · Σ_{h_{S,m_S}=0}^{1} [ Π_{i=1}^{m_S} û_{S,i}^{h_{S,i}} (1 − û_{S,i})^{1−h_{S,i}} ] I(Σ_{i=1}^{m_S} h_{S,i} > m_S π̂),    (2)

which is the PCD model based estimate of the probability Pr(gene set S is concordantly enriched | observed z-score matrix of gene set S). In the formula, π̂ is the estimated probability of the event of interest for a randomly sampled gene, and I(·) is the indicator function: I(true statement) = 1 and I(false statement) = 0. Notice that the formula can be simplified to a well-known binomial tail probability if all the û_{S,i} are the same. However, the û_{S,i} are usually different in practice. Then, we need to calculate a tail probability for a heterogeneous Bernoulli process.
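This heterogeneous Bernoulli (Poisson-binomial) tail probability can also be computed exactly without enumerating all component combinations, via a standard dynamic-programming recursion over the distribution of the running event count (a sketch; the function name is ours, not the authors'):

```python
def hetero_bernoulli_tail(u, threshold):
    """Pr(sum_i H_i > threshold) for independent H_i ~ Bernoulli(u_i),
    computed exactly by dynamic programming in O(len(u)^2) time."""
    dist = [1.0]  # dist[r] = Pr(running event count == r)
    for p in u:
        new = [0.0] * (len(dist) + 1)
        for r, pr in enumerate(dist):
            new[r] += pr * (1.0 - p)  # this gene shows no event
            new[r + 1] += pr * p      # this gene shows the event
        dist = new
    return sum(pr for r, pr in enumerate(dist) if r > threshold)

# Homogeneous check: with equal u_i the result is a binomial tail probability.
tail = hetero_bernoulli_tail([0.5] * 4, 2)  # Pr(Binomial(4, 0.5) > 2) = 5/16
```

With heterogeneous probabilities the same recursion applies unchanged, so for gene sets of moderate size this gives the tail probability exactly rather than by simulation.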
For the calculation for gene sets with concordant downregulated differential expression, we need to focus on the combination of components with (j_1 = 2, j_2 = 2, . . . , j_K = 2). Then, we need to change the formula for u_{S,i} accordingly:

u_{S,i} = π_{2,2,...,2} Π_{k=1}^{K} φ(z_{S,i,k}; µ_{2,k}, σ^2_{2,k}) / f(z_{S,i,1}, z_{S,i,2}, . . . , z_{S,i,K}),

and CES_S is then calculated as in Equation (2) with the corresponding estimates û_{S,i} and π̂.
False discovery rate
The concordant enrichment score given in Equation (2) is an estimated conditional probability of concordant enrichment, which can be considered as the true positive probability for the gene set S. This conditional probability is closely related to the concept of false discovery rate (FDR). FDR has been widely used to evaluate the proportion of false positives among the claimed positives [6, 23]. According to the discussion by McLachlan et al. [14], among the J top gene sets {S_1, S_2, . . . , S_J} claimed significantly concordantly enriched, the false discovery rate can be estimated as:

FDR = (1/J) Σ_{j=1}^{J} (1 − CES_{S_j}).
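Under the assumption that the estimate averages the complementary CES values of the top J gene sets (following the local-FDR averaging idea discussed by McLachlan et al. [14]; the function name and CES values below are ours), a sketch:

```python
def estimated_fdr(ces_top):
    """Estimated FDR among the top J gene sets claimed concordantly enriched:
    the average of (1 - CES) over those J gene sets. Assumes CES is the
    true-positive probability of each claimed gene set."""
    return sum(1.0 - c for c in ces_top) / len(ces_top)

# Hypothetical CES values for the top three gene sets.
fdr = estimated_fdr([0.99, 0.95, 0.90])
```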
Computational approximation
Although we have derived the formula for the concordant enrichment score (CES), it is usually difficult to compute it directly in practice: the number of possible component combinations from different genes in a given gene set is usually huge (the sum in Equation (2) runs over 2^{m_S} terms, and based on our observation most gene sets contain more than 20 genes). Since different genes have different probabilities of being concordantly upregulated and/or downregulated differentially expressed, we cannot further simplify the formula (we need to calculate a tail probability for a heterogeneous Bernoulli process). However, we can consider a simulation-based approach to approximate the CES given in Equation (2).
Monte Carlo approximation
Recall that the probability of the event of interest u_{S,i} can be calculated for a gene X_{S,i} in a given gene set S = {X_{S,i} : i = 1, 2, . . . , m_S}. The simulation scheme is based on a heterogeneous Bernoulli process:

- For each X_{S,i}, simulate a Bernoulli random variable with event probability û_{S,i};
- For the gene set S, count the number R of events from different genes;
- Repeat the above two steps B times and report the approximated enrichment score as (number of replications with R > m_S π̂)/B.
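The three steps above can be sketched as follows (an illustrative implementation; the function name and example probabilities are ours):

```python
import random

def ces_monte_carlo(u_hat, pi_hat, B=2000, seed=1):
    """Monte Carlo approximation of the CES for a gene set with per-gene
    event probabilities u_hat and background event probability pi_hat."""
    rng = random.Random(seed)
    m_S = len(u_hat)
    hits = 0
    for _ in range(B):
        # One heterogeneous Bernoulli replication: count the events R.
        R = sum(1 for p in u_hat if rng.random() < p)
        if R > m_S * pi_hat:  # more events than expected by chance
            hits += 1
    return hits / B

# Hypothetical example: 10 genes, each very likely concordant, small background.
ces = ces_monte_carlo([0.9] * 10, 0.1)
```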
One related question is how large B should be in the simulation. As discussed above, the concordant enrichment score (CES) is closely related to the false discovery rate (FDR). Then, it is reasonable to require an accuracy around the 1% level at the 95% CES level (i.e. a 95% normally approximated binomial confidence interval of 0.95 ± 0.01), for which B = 2000 is adequate. Therefore, the Monte Carlo approximation is a feasible approach in practice. (In general, if we do not have a specific CES level in mind, we can simply use the upper bound B = 10000, calculated from the worst case of the 95% normally approximated binomial confidence interval; the related computing burden is still practically feasible.)