 Proceedings
 Open Access
 Published:
Concordant integrative gene set enrichment analysis of multiple largescale twosample expression data sets
BMC Genomics volume 15, Article number: S6 (2014)
Abstract
Background
Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment.
Methods
We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a threecomponent multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets.
Results
We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previous published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method.
Conclusions
This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple largescale twosample gene expression data sets.
Background
The recent largescale technologies like microarrays [1–3] and RNAseq [4, 5] allow us to collect genomewide expression profiles for biomedical studies. Genes showing significant differential expression are potentially important biomarkers [6]. Furthermore, a gene set enrichment analysis enables us to identify groups of genes (e.g. pathways) showing coordinate differential expression [7, 8]. For some disease studies, multiple gene expression data sets have been collected and the related integrative analysis of multiple data sets has been investigated [9]. Since microarray and sequencing based genomewide expression data sets have been increasingly collected, it is necessary to further develop the computational and statistical methods for integrative data analysis studies.
Genes and gene sets showing consistent behavior among multiple related studies can be of great biological interest. However, since the sample sizes are usually small but the numbers of genes are large, it is difficult to identify truly differentially expressed genes and determine whether a gene or a gene set behaves concordantly among different related studies. Although the integrative analysis of multiple gene expression data sets has been well studied in recent years [10, 11], the genomewide concordance has not been well considered. Misleading results may be generated if the concordance among different data sets is not considered in an integrative analysis. Our purpose is to identify pathways or gene sets with concordant enrichment. Recently, there are several methods published for meta gene set enrichment analysis of expression data [12, 13]. However, these methods have not been specifically developed for our study purpose. Statistically, we need analysis methods that are consistent with the study purpose. There is still a lack of methods and software for the concordant integrative gene set enrichment analysis.
For a gene set enrichment analysis, an enriched gene set in one data set may also be enriched in another data set. However, this gene set is not necessarily concordantly enriched in both data sets. For an illustration, let us consider a simple artificial example: gene set S contains five genes with the first three genes strongly upregulated in the first data set (the last two genes nondifferentially expressed) and the last three genes strongly upregulated in the second data set (the first two genes nondifferentially expressed). Then, in general, gene set S is enriched in upregulated differential expression in both data sets. However, there is only one gene upregulated in both data sets; the remaining genes are showing inconsistent behavior. Therefore, unless the proportions of differentially expressed genes are small, there is a lack of evidence to conclude that gene set S is concordantly enriched in both data sets. Since a gene set concordantly enriched in several similar studies may be of great importance, it is necessary to develop statistical methods for detecting these gene sets.
It has been shown that a mixture model based approach can be an efficient approach to the differential expression analysis [14]. Furthermore, we have also demonstrated the usefulness of mixture models in concordant analysis of differential expression among largescale expression data sets [15, 16]. The advantage of the mixture model based approach is that the probability of a particular behavior (upregulated or downregulated) can be modeled and estimated for a given gene. Thus, it is feasible to address how likely this gene shows a concordant behavior. In this study, we develop a mixture model based method for a concordant integrative gene set enrichment analysis.
Methods
Concordant gene set enrichment
In this study, we consider multiple largescale twosample gene expression data sets. We use K to denote the number of these data sets and m to denote the number of common genes in these data sets. For each of these data sets, we usually use a ttype test to evaluate the differential expression of each gene and a gene set enrichment analysis (GSEA) method to evaluate the enrichment level of a given gene set. In order to define and evaluate a concordant gene set enrichment when an integrative analysis is conducted for all K data sets, we categorize differential expression in each data set into three underlying (unobserved) representative categories: no change, positive change (or upregulated differential expression) and negative change (or downregulated differential expression). Due to data noise, what we observe from experiments may not indicate the underlying truth. (For example, a gene with slight downregulated differential expression may show a small positive ttype test value.) Although these categories are not observed in practice, they can be considered in a mixture model framework.
To understand the concept of concordant gene set enrichment, let us consider an artificial example. Given a pathway with 30 genes, we know all the underlying behavior of these genes: 20 genes have positive changes consistently among all different data sets. Furthermore, if we randomly select 30 genes, we also know that the expected number of genes with consistent positive changes among different data sets is just 5. In this case, we would conclude that the given gene set is concordantly enriched in upregulated differential expression (because 30 is clearly larger than 5). However, in practice, all the underlying differential expression categories are not observed. Instead, they can be considered in a mixture model framework. Then, we need to develop a mathematical formula for the probability of concordant enrichment score (CES) of a given gene set S that contains m_{ S } genes:
which can be useful for prioritizing different gene sets in practice.
Before we derive the mathematical formula for the above probability, we need to explain the term "enriched". As suggested by Efron and Tibshirani [17], unless the test statistic for a gene set enrichment analysis (GSEA) considers the genomewide background patterns (e.g. the statistics proposed in the original GSEA [7, 8]), it is necessary to consider the "row randomization" for genes in addition to the "column permutation" for samples. Therefore, the term "enriched" means "higher/better than expected".
Although many test statistics have been developed for GSEA with one largescale expression data set, we still need to develop a new approach for this study. The motivation is: we need to address the component information of the genes in a gene set. The component information is whether a gene is upregulated, downregulated or nondifferentially expressed. Most existing test statistics for the gene set enrichment analysis are either nonparametric or functions of zscore. But it is difficult to analyze the component information with these test statistics. Therefore, based on the above discussion for the term "enriched", we propose the following probability for measuring concordant gene set enrichment:
For a gene in a given gene set S, an event of interest can be: (1) the gene is concordantly upregulated; (2) the gene is concordantly downregulated; or (3) the gene is concordantly differentially expressed (either upregulated or downregulated). Our analysis methods for these different types of enrichment analysis are almost mathematically identical. For a mathematical notation of the above CES, we denote U_{ i } the indicator that the ith gene in gene set S satisfies the event of interest. Let D be the observed data and π be the probability of event of interest if the gene is randomly sampled. Then, we have
In order to calculate CES practically, we propose a threecomponent multivariate mixture model. In the model, each component is a normal distribution. The model configuration for these three components is consistent with the differential expression categories as described above. This model is conceptually analog to a simple normal mixture approach to differential expression analysis proposed by McLachlan et al. [14]. The special feature of our model is that we focus on some specific combination of components from different dimensions. A bivariate version of this model has been used by us to evaluate the concordance and discordance between two largescale experiments with two sample groups [15] and to integrate two microarray data sets in differential expression analysis [16]. Before the model description, we need to describe the related data preprocessing and differential expression test scores as follows.
Data preprocessing
Because our proposed statistical method is developed based on the differential expression test scores, we assume that the given gene expression data sets have been preprocessed appropriately [18]. For a concordant integrative analysis of multiple data sets, we also need to select genes shared commonly by different data sets. This can be achieved using the genes' unique identifiers.
Differential expression test scores
For each of the twosample gene expression data sets, we screen individual genes with the traditional twosample Student's ttest. Several modified ttests, such as SAM ttest [19] and the moderated ttest [20], have been widely used in the differential expression analysis of microarray data. These test statistics can generally improve the control of false positives by "softly" filtering out genes with relatively small expression variance. However, we intend to consider all the genes equally important in the concordant integrative analysis of multiple data sets. Furthermore, a given gene can show different levels of variance in different data sets, which may make it difficult to use these modified ttests. Therefore, we still recommend the traditional twosample ttest as the differential expression test statistic. (In practice, other test statistics like SAM ttest or the moderated ttest can still be considered when there is a strong reason to do so.) Because the sample size of a highthroughput study is usually not large, it is generally difficult to validate the normal distribution assumptions for the ttest. Therefore, instead of the theoretical tdistribution, we use the permutation procedure to compute the pvalue of an observed ttest [21]. This approach has been widely adopted in the analysis of gene expression data [6].
For K twosample gene expression data sets with m common genes, we compute the onesided uppertailed pvalue p_{ i,k } for gene X_{ i } in the kth data set, i = 1, 2, . . . , m and k = 1, 2, . . . , K. Then, we perform an inverse normal transformation to obtain a zscore: z_{ i,k } = Φ^{1}(1  p_{ i,k }), where Φ(·) is the cumulative distribution function (c.d.f.) of the standard normal distribution. This transformation has been widely used to improve the fitting of a mixture model [14]. Our proposed statistical methods for the concordant integrative analyses of multiple data sets are developed based on these sets of zscores.
A mixture model
For each individual data set, we assume that a mixture of three normal distributions can well fit the zscores. Let ${\varphi}_{\mu ,{\sigma}^{2}}$ denote the probability density function (p.d.f.) of a normal distribution with mean µ and variance σ^{2}. Three representative components are considered for the kth data set (k = 1, 2, . . . , K): ${\varphi}_{{\mu}_{0},k,{\sigma}_{0,k}^{2}}(\cdot )$ for genes nondifferentially expressed (no change), ${\varphi}_{{\mu}_{1},k,{\sigma}_{1,\phantom{\rule{2.77695pt}{0ex}}k}^{2}}\left(\cdot \right)$ for genes with upregulated differential expression (positive change) and ${\varphi}_{{\mu}_{2},k,{\sigma}_{2,k}^{2}}\left(\cdot \right)$ for genes with downregulated differential expression (negative change). Notice that µ_{0,k}= 0 and ${\sigma}_{0,k}^{2}=1$ (a zscore under the null hypothesis follows the standard normal distribution because its associated pvalue follows a standard uniform distribution). This configuration has been suggested in the analysis of gene expression data [14] although more components can be considered to improve the data fitting. Mathematically, we have the following density function:
which is a type of wellknown simple normal mixture model.
When the above simple model is extended to accommodate the analysis of multiple data sets, we need to consider the combination of components from different dimensions (data sets). Then, there are 3^{K} different combinations. We assume that different data sets are collected independently. For the ith gene with a list of zscores ${\left\{{z}_{i,k}\right\}}_{k=1}^{K}$ from different data sets, if we know all the related component information, then the join density of these zscores is the product of marginal densities of individual zscores. Therefore, the following formula defines our basic mixture model for a concordant analysis:
where ${\pi}_{{j}_{1},{j}_{2},...,{j}_{K}}$ is the probability for this gene being in a particular combination of different components (j_{1}, j_{2}, . . . , j_{ K }) in different data sets $\left({\sum}_{{j}_{1},{j}_{2},...{j}_{K}=0}^{2}{\pi}_{{j}_{1},{j}_{2},...,{j}_{K}}=1\right)$. We call this model a partial concordance/discordance (PCD) model. Notice that a bivariate version of this model has been used to evaluate the overall concordance or discordance of two largescale data sets and to conduct an integrative analysis of differential expression for two largescale twosample data sets [15, 16].
Model estimation
Our mixture model can be estimated by the welldeveloped EM algorithm [22]. In the model, the differential expression categories are considered as missing information. For any zscore vector (z_{i,1}, z_{i,2}, . . . , z_{ i,K }), i = 1, 2, . . . , m, this information can be mathematically represented as ${w}_{{j}_{1},{j}_{2},...,{j}_{K}}^{\left(i\right)}=1$ if each z_{ i,k } is sampled from the j_{ k }th component (j_{ k } = 0, 1 or 2 and k = 1, 2, . . . , K) or zero otherwise.
With only the observed data, the likelihood can be calculated by the following formula:
where Θ represents the parameter space described previously. The "complete likelihood" based on the observed data and missing information can be calculated by the following formula:
Then, we can derive the following Estep formula:
We can also derive the following Mstep formulas:
In the EM algorithm, we iterate Estep and Mstep until a numerical convergence of likelihood (not the "complete likelihood"). Let L^{(t) }and L^{(t+1) }be the likelihood values calculated after the tth and (t + 1)th iterations, respectively. A numerical convergence is claimed if L^{(t+1) }− L^{(t)} < 0.001.
Concordant enrichment score
Suppose that we are interested in gene sets with coordinate upregulated differential expression (the CES formulas for the other events of interest can be derived similarly). Then, we need to focus on the combination of different components with (j_{1} = 1, j_{2} = 1, . . . , j_{ K } = 1). Based on the mixture model, we can derive the following probability for a gene X_{ S,i } in a given gene set S = {X_{ S,i } : i = 1, 2, . . . , m_{ S }}:
This probability u_{ S,i } can be estimated as ${\mathit{\xfb}}_{S,i}$ by pluggingin the estimated parameters in the PCD model. Let h_{ S,i } be either 0 or 1. Under the assumption that zscores {z_{ i,k } : i = 1, 2, . . . , m} from different genes are independent in each data set k, k = 1, 2, . . . , K, we can calculate the concordant enrichment score (CES) for a gene set S = {X_{ S,i } : i = 1, 2, . . . , m_{ S }}:
which is the PCD model based estimate for the probability Pr(gene set S is concordantly enriched  observed zscore matrix of gene set S). In the formula, I(true statement) = 1 and I(false statement) = 0 (indicator function). Notice that the formula can be simplified to a wellknown binomial tail probability if all ${\left\{{u}_{S,i}\right\}}_{i=1}^{{m}_{S}}$ are the same. However, ${\left\{{u}_{S,i}\right\}}_{i=1}^{{m}_{S}}$ are usually different in practice. Then, we need to calculate a tail probability for a heterogeneous Bernoulli process.
For the calculation for gene sets with coordinate downregulated differential expression, we need to focus on the combination of different components with (j_{1} = 2, j_{2} = 2, . . . , j_{ K } = 2). Then, we need to change the formulas for u_{ S,i } and CES_{ S } as follows:
False discovery rate
The concordant enrichment score given in Equation (2) is an estimated conditional probability of concordant enrichment, which can be considered as the true positive probability for the gene set S. This conditional probability is closely related to the concept of false discovery rate (FDR). FDR has been widely used to evaluate the proportion of false positives among the claimed positives [6, 23]. According to the discussion by McLachlan et al. [14], among the J top gene sets {S_{1}, S_{2}, . . . , S_{ J }} claimed significantly concordantly enriched, the false discovery rate can be estimated as:
Computational approximation
Although we have derived the formula for concordant enrichment score (CES), it is usually difficult to compute it in practice: the number of possible component combinations from different genes in a given gene set is usually huge. Based on our observation, most gene sets contain more than 20 genes. Since different genes have different probabilities of being concordantly upregulated and/or downregulated differentially expressed, we cannot further simplify the formula (we need to calculate a tail probability for a heterogeneous Bernoulli process). However, we can consider a simulation based approach to the approximation of CES given in Equation (2).
Monte Carlo approximation
Recall that the probability of event of interest u_{ S,i } can be calculated for a gene X_{ S,i } in a given gene set S = {X_{ S,i }, i = 1, 2, . . . , m_{ S }}. The simulation scheme is based on a heterogeneous Bernoulli process:

For each X_{ S,i }, simulate a Bernoulli random variable with probability of event u_{ S,i };

For the gene set S, count the number R of events from different genes;

Repeat the above two steps B times and report the approximated enrichment score as {number of $\left(R>{m}_{S}{\widehat{\pi}}_{1,1,...,1}\right)\}/B$.
One related question is how large B should be set in the simulation. As we have discussed above, the concordant enrichment score (CES) is closely related to the false discovery rate (FDR). Then, it is reasonable to require its accuracy around the 1% level for the 95% CES level (e.g. a 95% normally approximated binomial confidence interval 0.95 ± 0.01) and B = 2000 is adequate. Therefore, the Monte Carlo approximation is a feasible approach in practice. (In general, if we do not have a specific CES level, we can simply use an upper bound B = 10000 calculated based on the 95% normally approximated binomial confidence interval. Then, the related computing burden is still practically feasible.)
Results and discussion
Application #1: an integrative analysis of two data sets
To illustrate our method, we first considered two microarray gene expression data sets collected for lung cancer studies [24, 25]. The first one was collected by a research group in Boston (referred to as Boston data) and the second one was collected by a research group in Michigan (referred to as Michigan data). For an application of their Gene Set Enrichment Analysis (GSEA) method, Subramanian, Tamayo et al. [8] reorganized these two data sets, which were made freely available at http://www.broadinstitute.org/gsea. There were 62 and 86 patients for the Boston and Michigan data sets, respectively. These patients were classified as either "good" or "poor" outcomes. Expression profiles were available for 5216 genes that were common for both data sets. To compare our analysis results with the results reported by Subramanian, Tamayo et al. [8], we used an early version of gene set collection that was used by them [8]. Subramanian, Tamayo et al. [8] also suggested a moderate range of 15500 genes for the sizes of gene sets that were analyzed in their gene set analysis. A gene set was not analyzed if its number of genes was out of this range. This range was used in our analysis. To demonstrate the advantage of their GSEA, Subramanian, Tamayo et al. [8] observed several commonly significantly enriched gene sets from the analysis of each data set although no individual genes with significantly differential expression were identified.
Since no concordant integrative analysis has been conducted before for these two data sets, it is necessary to investigate whether more significant results can be achieved by such an analysis. Lai et al. [15] and Lai et al. [16] have discussed that it is necessary to evaluate the genomewide concordance before an integrative analysis to be conducted. Based on a likelihood ratio test [15, 16], we obtained pvalues < 0.01 and > 0.3 for testing hypothesis complete discordance (CD) model vs. partial concordance/discordance (PCD) model and complete concordance (CC) model vs. PCD model, respectively. This result suggested that the expression profiles of both data sets were overall concordant at a genomewide level. To avoid any possible selection bias, we still conducted our integrative analysis based on the general PCD model. (When the simplified CC model was used, we still observed similar results [not shown].) As shown in Table 1, the gene sets identified by Subramanian, Tamayo et al. [8] were also identified by our method. Furthermore, the resulting false discovery rates (FDRs) were even more significant (all below 0.08 and most of them below 0.001) by our method. [The complete gene sets (328 gene sets) with the false discovery rates (FDRs) based on our concordant integrative gene set enrichment analysis have been included in our supplementary material.]
Figure 1 gives some graphical illustrative examples for our concordant integrative gene set enrichment analysis. Proteasome degradation is a wellknown pathway in cancer studies [26]. Furthermore, proteasome inhibitors are being used clinically in lung cancer treatments [27]. Yang et al. [28] has also demonstrated that proteasome regulates the key survival factors for cells. However, this gene set had not been identified in the study by Subramanian, Tamayo et al. [8]. As shown in Figure 1, the majority of zscores from both data sets were positive for the gene set proteasome pathway. Because most of these zscores were relatively close to zero, it was difficult to identify this pathway by an analysis based on individual data sets. However, the concordant enrichment of upregulation of this gene set was identified by our integrative analysis approach (CES > 0.999 and FDR < 0.001 for the upregulation enrichment). The B cell receptor (BCR) signaling pathway has been shown to be important in immune disease and cancer studies [29]. As shown in Figure 1, the majority of zscores from both data sets were negative for this gene set although these zscores were also relatively close to zero. For the downregulation enrichment, the CES and FDR for this gene set was > 0.9 and ~ 0.05, respectively. For a comparison, we also randomly selected 30 genes as a random gene set. As shown in Figure 1, the paired zscore pattern of this random gene set was consistent with the genomewide paired zscore distribution. Therefore, this random gene set was not significantly concordantly enriched. (The corresponding upregulation and downregulation based CES were both in the range of 0.4 ~ 0.6.)
Mootha, Lindgren, et al. [7] have made their gene set collections freely available at their web site for Molecular Signatures Database. Since the introduction of gene set enrichment analysis [7, 8], the collections of gene sets have been updated to version 3.0 (at the time of our study). Therefore, based on the Boston and Michigan data, we have performed a concordant integrative analysis for this updated version of the C2 canonical pathway collection. Among 880 gene sets in the collection, there are 700 gene sets with gene number range from 15 to 500. We have also compared our results with the results calculated separately for individual data sets based on the gene set analysis (GSA) method proposed by Efron and Tibshirani [17]. Although certain statistical assumptions are required for the GSA method, Maciejewski [30] have still suggested in a recent comparison study that this method is one of the preferred methods for a gene set enrichment analysis. Figure 2 shows that the false discovery rate (FDR) curve based on our method is clearly lower than these two FDR curves based on GSA (one for Boston data and the other for Michigan data). There are 224 and 15 pathways with significant CES (FDR < 0.05) for upregulated and downregulated differential expression, respectively. These results have been included in our supplementary material.
Application #2: an integrative analysis of three data sets
In addition to the Boston and Michigan data sets, Subramanian, Tamayo et al. [8] also mentioned and reorganized another related data set collected by a Stanford study [31]. We then considered these three data sets together for a concordant integrative gene set enrichment analysis. The number of patients in the Stanford data set was much less: 24 patients were classified as either "good" or "poor" outcomes. For these three data sets, there were 2865 common genes (almost 50% reduction from the first application described above). We still used the Version 3.0 of the C2 canonical pathway collection. The GSA method was again used to analyze individual data sets separately for 700 gene sets (see above for details). Figure 3 shows that the FDR curve based on our method is still clearly lower than these three FDR curves based on GSA (one curve for each data set). There are 99 and 74 pathways with significant CES (FDR < 0.05) for upregulated and downregulated differential expression, respectively. These results have also been included in our supplementary material.
Among the gene sets with FDR < 0.05, we observed many interesting pathways. Among these 74 identified based on downregulated differential expression, there were pathways related to immune system, TCR signaling, viral myocarditis, BCR signaling, cell survival, WNTβcatenin signaling, cytokine, PI3K, VEGF signaling, interleukins and GPCR signaling. Among these 99 identified based on upregulated differential expression, there were pathways related to different metabolism, cell cycle, checkpoints, and related phases and transitions, DNA replication, synthesis damage and repair, p53, glycolysis gluconeogenesis, telomere maintenance and extension, apoptosis, TGFβ signaling, tRNA aminoacylation, gene expression, lung cancer and PDGF signaling.
Consistency between two application results
We also investigated whether the inclusion of an additional data set to our previous integrative analysis of two data sets would still generate consistent results. (Notice that the number of common genes was much reduced from 5216 to 2865 when the Stanford data set was included. This would change the number of selected pathways as shown above.) Figure 4 shows the scatter plot for the paired CES calculated based on two data sets and CES calculated based on three data sets (separately for upregulated and downregulated differential expression). For each plot, a clear correlation pattern can be observed. The Spearman correlation coefficients were both greater than 0.75 for these two plots (0.804 and 0.760). We also compared the listed of selected pathways with FDR < 0.05 (see above for details). For upregulated differential expression, there were 92 pathways in common (among 224 selected based on two data sets and 99 selected based on three data sets); for downregulated differential expression, there were 11 pathways in common (among 15 selected based on two data sets and 74 selected based on three data sets). If [(the number of commonly selected pathways)/(the number of smallest list of selected pathways)] was used as the overlap proportion, then we would have 92/99 = 92.9% and 11/15 = 73.3% as the overlap proportions for upregulated and downregulated differential expression, respectively. Therefore, a satisfactory consistency between both results was also observed.
About two pathways mentioned particularly in our first application, there were two proteasome pathways in the Version 3.0 of the C2 canonical pathway collection: one given by BioCarta and the other given by KEGG. For both pathways, their CESs and FDRs for upregulated differential expression were consistently respectively > 0.999 and < 0.001 based on our integrative analysis of two data sets, and these values were also consistently respectively > 0.95 and < 0.005 based on our integrative analysis of three data sets. There were also two BCR signaling pathways collected by KEGG and Signaling Gateway, their CESs and FDRs for downregulated differential expression were consistently respectively > 0.95 and < 0.01 based on our integrative analysis of three data sets. Based on our integrative analysis of two data sets, the CES and FDR for the pathway by KEGG were respectively > 0.7 and < 0.2 and these two values for the pathway by Signaling Gateway were respectively > 0.9 and ~ 0.05. Figure 5 shows different paired zscores from three data sets and the zscores for these two pathways are highlighted for an illustration.
KEGG cancer pathways
There is a collection of cancer pathways in the database of Kyoto Encyclopedia of Genes and Genomes (KEGG with web link http://www.genome.jp/kegg/). According to the database updated on July 24, 2013, 17 pathways are associated with lung cancer and general cancer studies. Table 2 lists 16 of these pathways that are also in the Version 3.0 of the C2 canonical pathway collection. (The KEGG PI3KAKT signaling pathway is not included since it is not listed in the C2 collection. Notice that only pathways from KEGG are included. Pathways with same or similar names from other online databases like Reactome are not considered here. This ensures the consistency between the gene sets from the C2 collection and the gene sets mentioned in the KEGG cancer pathways.) Since a pathway could be enriched in either upregulated or downregulated differential expression, we would choose the one with larger CES if the absolute difference of two CESs was greater than 0.1 (same results observed when this threshold value was set between 0.05 to 0.15), which was a conservative choice of threshold value. Otherwise, we would not present any further analysis results for this pathway. For examples, if these two CESs were 0.5 (upregulated) and 0.45 (downregulated), then no further analysis results would be presented for this pathway; if these two CESs were 0.8 (upregulated) and 0.1 (downregulated), then the analysis results based on upregulated differential expression would be presented. For these 16 pathways listed in Table 2, the results from the analysis described in our first and second applications were consistent. All the pathways except the TGFβ signaling pathway showed FDRs < 0.2 for at least one applications. Ten and eight pathways showed FDRs < 0.1 and FDRs < 0.05 respectively for at least one applications. Furthermore, all sixteen pathways showed FDRs < 0.25 for at least one applications.
Conclusions
In this study, we proposed a mixture model based statistical method for the concordant integrative gene set enrichment analysis. Our method was first applied to two published lung cancer microarray gene expression data sets. As shown in Figure 1, gene sets like the proteasome and BCR signaling pathways were identified by our method. These gene sets were not identified in the previous study [8] since the differential gene expression among these gene sets were relatively weak. However, the concordant enrichment of these gene sets was detected by our method. This comparison illustrated the advantage of our proposed concordant integrative gene set enrichment analysis. The analysis results from our second application (a concordant integrative analysis of three data sets) also showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection demonstrated the practical usefulness of our proposed method. Overall, this study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple largescale twosample gene expression data sets.
There are several advantages for our proposed method. The genomewide concordance can be statistically tested before the integrative analysis. The mixture model is estimated based on the maximum likelihood estimation procedure. Furthermore, our integrative analysis of gene sets is based on a probabilistic framework, which can be conveniently used for the calculation of false discovery rates. However, there are also limitations. Our proposed mixture model is simple and it contains only three components. Normal distributions are assumed for these components. Furthermore, we assume that different genes behave independently (Gold et al. [32] have showed that the independence assumption can be acceptable in practice). These limitations should be considered when our method is used in practice.
For our future research, it will be useful to extend our proposed method for an integrative analysis of data with multiple sample groups. This will be particularly useful for studying diseases with different progression stages. Although a major proportion of gene expression data have been collected for binary outcomes (e.g. normal vs. abnormal), data with other types of responses (e.g. survival data) have also been collected. It will also be useful to extend our method for these data. Furthermore, when our proposed method is used for an integrative analysis of more than 3 data sets, it is desirable to simplify the mixture model so that the number of model parameters (particularly for $\left\{{\pi}_{{j}_{1},{j}_{2},...,{j}_{K}}\right\}$) can be reduced to achieve statistical efficiency. Furthermore, we would also like to consider more robust approaches (e.g. a nonparametric method) to the concordant integrative gene set enrichment analysis.
References
 1.
Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467470. 10.1126/science.270.5235.467.
 2.
Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E: Expression monitoring by hybridization to highdensity oligonuleotide arrays. Nature Biotechnology. 1996, 14: 16751680. 10.1038/nbt12961675.
 3.
Network TCGA: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008, 455: 10611068. 10.1038/nature07385.
 4.
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 13441349. 10.1126/science.1158441.
 5.
Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at singlenucleotide resolution. Nature. 2008, 453: 12391243. 10.1038/nature07002.
 6.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, USA. 2003, 100: 94409445. 10.1073/pnas.1530509100.
 7.
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop L: PGC1αresponse genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003, 34: 267273. 10.1038/ng1180.
 8.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proceedings of the National Academy of Sciences, USA. 2005, 102: 1554515550. 10.1073/pnas.0506580102.
 9.
de Magalhaes JP, Curado J, Church GM: Metaanalysis of agerelated gene expression profiles identifies common signatures of aging. Bioinformatics. 2009, 25: 875881. 10.1093/bioinformatics/btp073.
 10.
Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003, 19 (Supplement 1): i8490.
 11.
Tanner SW, Agarwal P: Gene Vector Analysis (Geneva): A unified method to detect differentiallyregulated gene sets and similar microarray experiments. BMC Bioinformatics. 2008, 9: 34810.1186/147121059348.
 12.
Shen K, Tseng GC: Metaanalysis for pathway enrichment analysis when combining multiple genomic studies. Bioinformatics. 2010, 26: 13161323. 10.1093/bioinformatics/btq148.
 13.
Chen M, Zang M, Wang X, Xiao G: A powerful Bayesian metaanalysis method to integrate multiple gene set enrichment studies. Bioinformatics. 2013, 29: 862869. 10.1093/bioinformatics/btt068.
 14.
McLachlan GJ, Bean RW, Jones LB: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006, 22: 16081615. 10.1093/bioinformatics/btl148.
 15.
Lai Y, Adam BL, Podolsky R, She JX: A mixture model approach to the tests of concordance and discordance between two large scale experiments with twosample groups. Bioinformatics. 2007, 23: 12431250. 10.1093/bioinformatics/btm103.
 16.
Lai Y, Eckenrode SE, She JX: A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics. 2009, 10 (Suppl. 1): S23
 17.
Efron B, Tibshirani R: On testing the significance of sets of genes. Annals of Applied Statistics. 2007, 1: 107129. 10.1214/07AOAS101.
 18.
Amaratunga D, Cabrera J: Exploration and analysis of DNA microarray and protein array data. 2003, John Wiley & Sons, Inc
 19.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, USA. 2001, 98: 51165121. 10.1073/pnas.091062498.
 20.
Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Application in Genetics and Molecular Biology. 2004, 3: 3
 21.
Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18: 71103. 10.1214/ss/1056397487.
 22.
McLachlan GJ, Krishnan T: The EM algorithm and extensions. 2008, John Wiley & Sons, Inc., 2
 23.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995, 57: 289300.
 24.
Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, USA. 2001, 98: 1379013795. 10.1073/pnas.191502998.
 25.
Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S: Geneexpression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine. 2002, 8: 816824.
 26.
Zhang HG, Wang J, Yang X, Hsu HC, Mountz JD: Regulation of apoptosis proteins in cancer cells by ubiquitin. Oncogene. 2004, 23: 20092015. 10.1038/sj.onc.1207373.
 27.
Davies AM, Lara PNJ, Mack PC, Gandara DR: Incorporating bortezomib into the treatment of lung cancer. Clinical Cancer Research. 2007, 13: s46474651. 10.1158/10780432.CCR070334.
 28.
Yang Z, Gagarin D, St Laurent Gr, Hammell N, Toma I, Hu CA, Iwasa A, McCaffrey TA: Cardiovascular inflammation and lesion cell apoptosis: a novel connection via the interferoninducible immunoproteasome. Arteriosclerosis, Thrombosis, and Vascular Biology. 2009, 29: 12131219. 10.1161/ATVBAHA.109.189407.
 29.
Faris M: Atypical B Cell Receptor Signaling: Straddling Immune Diseases and Cancer. International Reviews of Immunology. 2013, 32: 355357. 10.3109/08830185.2013.817248.
 30.
Maciejewski H: Gene set analysis methods: statistical models and methodological differences. Briefings in Bioinformatics. 2013
 31.
Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, PacynaGengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I: Diversity of gene expression in adenocarcinoma of the lung. Proceedings of the National Academy of Sciences, USA. 2001, 98: 1378413789. 10.1073/pnas.241500798.
 32.
Gold DL, Coombes KR, Wang J, Mallick B: Enrichment analysis in highthroughput genomics  accounting for dependency in the NULL. Briefings in Bioinformatics. 2007, 8: 7177.
 33.
Web link for Rcode. [http://home.gwu.edu/~ylai/research/Concordance]
Acknowledgements
The related R code and C code are freely available at the authors' web site [33]. This work was partially supported by the NIH grant GM092963 (Y.Lai).
Declarations
Publication of this article was funded by the NIH grant GM092963 (Y.Lai).
This article has been published as part of BMC Genomics Volume 15 Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S1.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Y.Lai conceived of the study, developed the methods, performed the statistical analysis, and drafted the manuscript; F.Zhang developed the methods, performed the statistical analysis, and helped to draft the manuscript; T.K.Nayak, R.Modarres, N.H.Lee and T.A.McCaffrey helped to draft the manuscript. All authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lai, Y., Zhang, F., Nayak, T.K. et al. Concordant integrative gene set enrichment analysis of multiple largescale twosample expression data sets. BMC Genomics 15, S6 (2014). https://doi.org/10.1186/1471216415S1S6
Published:
Keywords
 False Discovery Rate
 Mixture Model
 Integrative Analysis
 Mixture Model Framework
 False Discovery Rate Curve