Multiple data sets
In this study, we consider a detection of gene set enrichment in discordant behaviors (or discordance gene set enrichment) for a series of two-sample genome-wide expression data sets. The term “enrichment in discordant behaviors" will be mathematically defined later. Let K be the number of data sets and let m be the number of common genes among this series of data sets. Each data set is collected for two given groups (same for all K data sets). In general, one group represents a normal status and the other group represents an abnormal status.
For a single two-sample genome-wide expression data set, differential expression analysis and gene set enrichment analysis are usually conducted. The purpose of analysis of differential expression is to identify genes showing significantly up-regulation or down-regulation when two sample groups are compared. The purpose of gene set enrichment analysis is to identify pathways (or gene sets) showing coordinate up-regulation or down-regulation, which may be considered as an extension of differential expression analysis.
Therefore, the following gene behaviors are usually of our research interest in two-sample expression data analysis: positive change (or up-regulation), negative change (or down-regulation) and null (or non-differentially expressed). However, these underlying behaviors are usually not observed and expression data are collected to make statistical inference about them.
Data pre-processing is important for both microarray and RNA-seq data and it has been well discussed in the literature [24–26]. In our study, the data can be downloaded from a well-known public database. We assume that the gene expression profiles have been appropriately pre-processed. In an analysis of multiple expression data sets, it is usually necessary to focus on common genes and gene identifiers can be useful for this purpose. In our study, we divide a relatively large data set into a series of non-overlapping subsets. Therefore, all the genes in the downloaded data are common.
z-Score
Many statistical tests have been proposed for analyzing a two-sample genome-wide expression data set [27, 28]. In this study, the traditional paired-two-sample t-test is chosen for its simplicity (although other statistics could be certainly considered, see below). For each gene in each data set (or subset), we perform the t-test to obtain a t-score. Its p-value is evaluated based on the permutation procedure (randomly switch the tumor/non-tumor labels for each pair of tissues) so that the normal distribution assumption is not assumed for the paired-difference data. All the permuted t-scores are pooled together so that tiny p-values can be calculated [29].
One-sided upper-tailed p-values are calculated so that the direction of change can be distinguished for each gene in each data set. Let p
i,k
be the p-value for the i-th gene in the k-th data set. z-scores are obtained by an inverse normal transformation
$$z_{i,k} = \Phi^{-1}(p_{i,k}), $$
where Φ(·) is the cumulative distribution function (c.d.f.) of the standard normal distribution (mean zero and variance one). This transformation has been widely used [20] and our proposed multivariate normal mixture model will be applied to the transformed z-scores.
Discordance enrichment
Our proposed method is a type of gene set enrichment analysis. As it has been discussed by Lai et al. [15], we defined “enrichment” as “the number of events of interest is larger than expected” and our “event of interest” in this study is “a list of clearly discordant behaviors” from a gene. If we know whether the expression profile of a gene is up-regulated (simplified as “up”), down-regulated (simplified as “down”) or non-differentially expressed (simplified as “null”) in a data set, then a list of concordant behaviors among K data sets for this gene could be (up, up, …, up), (down, down, …, down) or (null, null, …, null). In this study, we focused on a list with at least one “up” and at least one “down” among K data sets. For example, a list like (down, up, up, …, up) is an event of interest but a list like (null, up, up, …, up) is not. The reason is “down” and “up” can be visually distinguished by the negative (“-") and positive (“+”) signs in z-scores, respectively. However, zero z-scores are rarely observed. Therefore, it is less clear to distinguish “null” from “up” (or “null” from “down”).
Based on the expression profiles, we obtain z-scores to make statistical inference about genes’ behaviors in each data set. To evaluate “discordance enrichment" as defined above, we considered a mixture model approach that allows us to estimate the probability of a behavior (“up”, “down” or “null”) and the expected number of events of interest (notice that these are not directly observed in the data sets). Let S be the set of genes for a pathway (or gene set in general) and m
S
the number of genes in S. If the i-th gene in S is showing a list of clearly discordant behaviors, then we set an indicator variable U
S,i
=1; Otherwise, we set U
S,i
=0. Then, we can calculate the discordance enrichment score (DES) for gene set S that is a probability defined as
$$DES_{S} = \mathbf{Pr}\left(\sum_{i=1}^{m_{S}} U_{S,i} > m_{S} \theta\right), $$
in which θ is the proportion of genes with clearly discordant behaviors.
In our mixture model, we used normal distributions to model the z-scores. A novel contribution is that the parameter space of our model increases linearly when the number of data sets is increased. This is due to the two-level structure of our model. (The parameter space of a general model for this analysis increases exponentially when the number of data sets is increased). For each gene in each data set, we considered three normal distribution components that represent up-regulation (positive distribution mean), down-regulation (negative distribution mean) and null (zero mean). (Theoretically, p-values under the null hypothesis are uniformly distributed. Therefore, z-scores under the null hypothesis are normally distributed with mean zero and variance one). The mathematical details are described below.
A two-level mixture model
First, we described the basic model structure for just one data set. Then, we introduced our novel two-level mixture model. A simple three-component normal distribution mixture model [30, 31] is considered for each z-score z
i,k
(the i-th gene in the k-th data set, i=1,2,…,m and k=1,2,…,K):
$$f(z_{i,k}) = \sum_{j_{k}=0}^{2} \rho_{j_{k},k} \phi_{\mu_{j_{k},k}, \sigma^{2}_{j_{k},k}}(z_{i,k}). $$
In the above model, \(\phantom {\dot {i}\!}\phi _{\mu, \sigma ^{2}}(\cdot)\) is the probability density function (p.d.f.) of a normal distribution with mean μ and variance σ
2. Three components represent up-regulation with μ
1,k
>0, down-regulation with μ
2,k
<0 and null with μ
0,k
=0 (also recall that \(\sigma ^{2}_{0,k}=1\)). For this model, an assumption is that the p.d.f. of z
i,k
is simply \(\phi _{\mu _{j_{k},k}, \sigma ^{2}_{j_{k},k}}(z_{i,k})\) if we know the underlying component information j
k
for the i-th gene in the k-th data set. However, the component information is usually not observed in practice. Then, we have this one-dimensional mixture model after the introduction of component proportion parameters \(\left \{ \rho _{j_{k},k}, j_{k}=0,1,2 \right \}\) for the k-th data set.
When we extend the above mixture model to a higher dimension (i.e. K data sets), without a structure consideration, the parameter space increases exponentially due to the 3K different component combinations (3 components in each of K data sets). Therefore, when K is not a small number (i.e. K>4), we need a more efficient model [15]. Biologically, when different data sets are collected for the same or similar research purpose, some genes are likely to show consistent behaviors across different data sets and some genes are likely to show different behaviors. For genes likely showing consistent behaviors across K data sets, we consider a complete concordance (CC) multivariate model to approximate the distribution of {z
i,k
,k=1,2,…,K}. For genes likely showing different behaviors across K data sets, we consider a complete independence (CI) multivariate model to approximate the distribution of {z
i,k
,k=1,2,…,K}. (Notice that there is no overlap among multiple data sets. If the component information among these data sets is known, then z-scores are independent.) We first describe the CI model and CC model as below.
The CI model assumes that the behaviors of the i-th gene are independent across different data sets. Therefore, we have the following mixture model:
$$f_{CI}(z_{i,1}, z_{i,2}, \ldots, z_{i,K}) = \prod_{k=1}^{K}\left[\sum_{j_{k}=0}^{2} \rho_{j_{k},k} \phi_{\mu_{j_{k},k},\sigma^{2}_{j_{k},k}}(z_{i,k})\right]. $$
This model is simply a product of K one-dimensional three-component mixture-models.
The CC model assumes that the behaviors of the i-th gene are the same across different data sets. Although the component information is unknown, the components for different data sets must be consistent. Therefore, we have the following mixture model:
$$f_{CC}(z_{i,1}, z_{i,2}, \ldots, z_{i,K})=\sum_{j=0}^{2}\left[\pi_{j} \prod_{k=1}^{K}\phi_{\mu_{j,k},\sigma_{j,k}^{2}}(z_{i,k})\right]. $$
This model has three components and each component is a product of K normal probability density functions.
In practice, it is unknown whether the i-th gene is showing independent or consistent behaviors. Therefore, we consider CI and CC as two high-level components and propose the following two-level model for {z
i,k
,k=1,2,…,K}:
$$\begin{aligned} f(z_{i,1}, z_{i,2}, \ldots, z_{i,K}) &= \lambda f_{CC}(z_{i,1}, z_{i,2}, \ldots, z_{i,K}) \\ & \quad+ (1-\lambda) f_{CI}(z_{i,1}, z_{i,2}, \ldots, z_{i,K}). \end{aligned} $$
Notice that this two-level model is still a mixture model. We further assume that \(\{ \mu _{j_{k},k}, \sigma _{j_{k},k}^{2}, j_{k}=0,1,2, k=1,2,\ldots,K \}\) are shared by both CI and CC models. It is evident that the model parameter space increases linearly when the number of data sets (K) increases.
We can use the well-established Expectation-Maximization (E-M) algorithm [19] for parameter estimation. First, it is necessary to introduce some indicator variables (for component information) for the z-scores {z
i,k
,k=1,2,…,K} of the i-th gene. Then, we describe the E-step and M-step.
For high-level component information,
$${}\begin{aligned} \omega_{i}\,=\,\left\{\!\ \begin{array}{ll} 1 &\text{if gene's behaviors are consistent with CC model;}\\ 0 &\text{if gene's behaviors are consistent with CI model.} \end{array} \right. \end{aligned} $$
For CI model component information,
$${}\begin{aligned} \eta_{i,j_{k},k}=\left\{ \begin{array}{ll} 1 &\text{if \(z_{i,k}\) is sampled from the \(j_{k}\)-th component;}\\ 0 &\text{otherwise.} \end{array} \right. \end{aligned} $$
For CC model component information,
$${}{\begin{aligned} \xi_{i,j}=\left\{ \begin{array}{ll} 1 &\text{if all}\, \{ z_{i,k}, k=1,2,\ldots,K \}\, \text{are sampled from the}\, j\text{-th component;}\\ 0 &\text{otherwise.} \end{array} \right. \end{aligned}} $$
The E-step is the calculation of the following expected values when all the parameter values are given.
$${}{\begin{aligned} {\mathrm{E}}(\omega_{i})& = &\frac{\lambda f_{CC}(z_{i,1}, z_{i,2}, \ldots, z_{i,K})}{\lambda f_{CC}(z_{i,1}, z_{i,2}, \ldots, z_{i,K}) + (1-\lambda) f_{CI}(z_{i,1}, z_{i,2}, \ldots, z_{i,K})}, \end{aligned}} $$
$${}{\begin{aligned} {\mathrm{E}}((1&-\omega_{i}) \eta_{i,j_{k},k})\\ & = \frac{(1-\lambda)\rho_{j_{k},k}\phi_{\mu_{j_{k},k},\sigma_{j_{k},k}^{2}}(z_{i,k}) \prod_{h=1,h\neq k}^{K}\sum_{j_{h}=0}^{2}\rho_{j_{h},h}\phi_{\mu_{j_{h},h},\sigma_{j_{h},h}^{2}}(z_{i,h})}{\lambda f_{CC}(z_{i,1}, z_{i,2}, \ldots, z_{i,K}) + (1-\lambda) f_{CI}(z_{i,1}, z_{i,2}, \ldots, z_{i,K})}. \end{aligned}} $$
$${}\begin{aligned} {\mathrm{E}}&(\omega_{i} \xi_{i,j})\\ &= \frac{\lambda \pi_{j} \prod_{k=1}^{K} \phi_{\mu_{j,k},\sigma_{j,k}^{2}}(z_{i,k})}{\lambda f_{CC}(z_{i,1}, z_{i,2}, \ldots, z_{i,K})\! +\! (1-\lambda) f_{CI}(z_{i,1}, z_{i,2}, \ldots, z_{i,K})}, \end{aligned} $$
The M-step is the calculation of the following parameter values when all the component information is given:
$$\begin{array}{@{}rcl@{}} \hat{\lambda}&=&\frac{1}{m}\sum_{i=1}^{m}{\mathrm{E}}(\omega_{i}), \end{array} $$
$$\begin{array}{@{}rcl@{}} \hat{\rho}_{j_{k},k}&=&\frac{\sum_{i=1}^{m}{\mathrm{E}}\left((1-\omega_{i}) \eta_{i,j_{k},k}\right)}{\sum_{i=1}^{m}\sum_{j_{h}=0}^{2}{\mathrm{E}} \left((1-\omega_{i}) \eta_{i,j_{h},k}\right)}, \end{array} $$
$$\begin{array}{@{}rcl@{}} \hat{\pi}_{j}=\frac{\sum_{i=1}^{m}{\mathrm{E}}(\omega_{i} \xi_{i,j})}{\sum_{i=1}^{m}\sum_{h=0}^{2}{\mathrm{E}}(\omega_{i} \xi_{i,h})}, \end{array} $$
$$\begin{array}{@{}rcl@{}} \hat{\mu}_{j_{k},k}&=&\frac{\sum_{i=1}^{m}\left[{\mathrm{E}}(\omega_{i} \xi_{i,j_{k}})+{\mathrm{E}}((1-\omega_{i}) \eta_{i,j_{k},k})\right] z_{i,k}}{\sum_{i=1}^{m}[{\mathrm{E}}(\omega_{i} \xi_{i,j_{k}})+{\mathrm{E}}((1-\omega_{i}) \eta_{i,j_{k},k})]}, \end{array} $$
$$\begin{array}{@{}rcl@{}} \hat{\sigma}^{2}_{j_{k},k}\,=\,\frac{\sum_{i=1}^{m}[{\mathrm{E}}(\omega_{i} \xi_{i,j_{k}})+{\mathrm{E}}((1-\omega_{i}) \eta_{i,j_{k},k})]] (z_{i,k} - \hat{\mu}_{j_{k},k})^{2}}{\sum_{i=1}^{m}[{\mathrm{E}}(\omega_{i} \xi_{i,j_{k}})+{\mathrm{E}}((1-\omega_{i}) \eta_{i,j_{k},k})]}\!. \end{array} $$
E-step and M-step are iterated until a numerical convergence is achieved. In this study, the numerical convergence is defined as that the difference between the current log-likelihood and the previous one is within a given tolerance value (e.g. 10−4).
Enrichment score
As we have discussed in Discordance enrichment, in this study, we focus on genes’ behaviors with at least one up-regulation and at least one down-regulation among K data sets (our event of interest: a gene with clearly discordant behaviors). However, we do not need to enumerate all these combinations (among 3K in total). The related computing can be simplified if we enumerate the compliment events instead. There are three combinations for complete concordance: (up, up,..., up), (down, down,..., down) and (null, null,..., null). They will be excluded. There are \(\sum _{l=1}^{K-1}{K \choose l}\) combinations with both “null" and “up" (without “down") and there are \(\sum _{l=1}^{K-1}{K \choose l}\) combinations with both “null" and “down" (without “up"). They will also be excluded. Then, the remaining combinations are our events of interest (at least one “up" and at least one “down").
According to the above computing strategy, based on the two-level mixture model, the related proportion (θ) of genes with clearly discordant behaviors (also see “Discordance enrichment” for more details) can be calculated as follows.
$${}\begin{aligned} \theta\! =\! (1\,-\,\lambda)\left(\!1\! -\!\! \sum_{j=0}^{2} \prod_{k=1}^{K} \rho_{j,k} \,-\,\!\! \sum_{\{j_{k}\} \in A} \prod_{k=1}^{K} \rho_{j_{k},k} -\!\! \sum_{\{j_{k}\} \in B} \prod_{k=1}^{K} \rho_{j_{k},k}\! \right)\!, \end{aligned} $$
where A is the set of lists with a mix of 0’s and 2’s, and B is the set of lists with a mix of 0’s and 1’s.
Let S be a gene set with m
S
genes. As defined in Discordance enrichment, let the indicator variable U
S,i
=1 if the i-th gene in S is showing a list of clearly discordant behaviors, and U
S,i
=0 otherwise. Then, based on the two-level mixture model, the related probability can be calculated as follows.
$${}\begin{aligned} \mathbf{Pr}(U_{S,i}=1) &= (1-\lambda)[ f_{CI}(z_{S,i,1}, z_{S,i,2}, \ldots, z_{S,i,K}) \\ & \quad- \sum_{j=0}^{2} \prod_{k=1}^{K} \rho_{j,k} \phi_{\mu_{j,k},\sigma^{2}_{j,k}}(z_{S,i,k}) \\ & \quad- \sum_{\{j_{k}\} \in A \cup B} \prod_{k=1}^{K} \rho_{j_{k},k} \phi_{\mu_{j_{k},k},\sigma^{2}_{j_{k},k}}(z_{S,i,k}) ] \\ & \quad / f(z_{S,i,1}, z_{S,i,2}, \ldots, z_{S,i,K}), \end{aligned} $$
where (z
S,i,1,z
S,i,2,…,z
S,i,k
) are the related z-scores. Let ζ
S,i
=P
r(U
S,i
=1), which is a conditional probability according to the given model and observed data.
Under the assumption that z-scores from different genes are independent, the discordance enrichment score (DES) for gene set S, which has been defined in Discordance enrichment as \(DES_{S} = \mathbf {Pr}\left (\sum _{i=1}^{m_{S}} U_{S,i} > m_{S} \theta ]\right)\), can be calculated as follows.
$$\begin{aligned} DES_{S} &= \sum_{U_{S,1}=0}^{1} \sum_{U_{S,2}=0}^{1} \cdots \sum_{U_{S,m_{S}}=0}^{1} \left[I\left(\sum_{i=1}^{m_{S}} U_{S,i}\right.\right.\\ & \left.\left. \quad > m_{S} \theta{\vphantom{\sum_{0}^{0}}}\right) \prod_{i=1}^{m_{S}} \zeta_{S,i}^{U_{S,i}} (1-\zeta_{S,i})^{1-U_{S,i}} \right], \end{aligned} $$
where I(true statement)=1 and I(false statement)=0 (indicator function). Since {ζ
S,i
,i=1,2,…,m
S
} are usually different for different genes, the above formula is a calculation of a tail probability for a heterogeneous Bernoulli process. The related computing issue and the related false discovery rate have already been discussed by Lai et al. [15]. Therefore, we described them briefly as below.
False discovery rate
As discussed in the literature [15, 20], the above enrichment score is a conditional probability and a true positive proportion for gene set S. Therefore, the related false discovery rate [6, 18] for the top T gene sets {S
1,S
2,…,S
T
} identified by the above DES can be conveniently derived as below.
$$FDR = 1 - \sum_{t=1}^{T} DES_{S_{t}}/T. $$
Computational approximation
As discussed in Lai et al. [15], the exact calculation of DES can be difficult due to the complexity of heterogeneous Bernoulli process. A Monte Carlo approximation has been suggested as follows. First, set an integer variable X=0. For the i-th gene in S, simulate a Bernoulli random variable with probability of event ζ
S,i
. Then, count the number of events from all genes in S, and increase X by one if this number is larger than m
S
θ. Repeat the simulation and counting B times and report X/B as the approximated DES. B=2000 was suggested by Lai et al. [15].