Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets
 Yinglei Lai^{1},
 Fanni Zhang^{1},
 Tapan K Nayak^{1},
 Reza Modarres^{1},
 Norman H Lee^{2} and
 Timothy A McCaffrey^{3}
https://doi.org/10.1186/1471-2164-15-S1-S6
© Lai et al.; licensee BioMed Central Ltd. 2014
Published: 24 January 2014
Abstract
Background
Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment.
Methods
We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets.
Results
We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previously published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also of all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method.
Conclusions
This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.
Background
Recent large-scale technologies such as microarrays [1–3] and RNA-seq [4, 5] allow us to collect genome-wide expression profiles for biomedical studies. Genes showing significant differential expression are potentially important biomarkers [6]. Furthermore, a gene set enrichment analysis enables us to identify groups of genes (e.g. pathways) showing coordinate differential expression [7, 8]. For some disease studies, multiple gene expression data sets have been collected and the related integrative analysis of multiple data sets has been investigated [9]. Since microarray and sequencing based genome-wide expression data sets have been increasingly collected, it is necessary to further develop the computational and statistical methods for integrative data analysis studies.
Genes and gene sets showing consistent behavior among multiple related studies can be of great biological interest. However, since the sample sizes are usually small but the numbers of genes are large, it is difficult to identify truly differentially expressed genes and determine whether a gene or a gene set behaves concordantly among different related studies. Although the integrative analysis of multiple gene expression data sets has been well studied in recent years [10, 11], the genome-wide concordance has not been well considered. Misleading results may be generated if the concordance among different data sets is not considered in an integrative analysis. Our purpose is to identify pathways or gene sets with concordant enrichment. Several methods have recently been published for meta gene set enrichment analysis of expression data [12, 13]. However, these methods have not been specifically developed for our study purpose. Statistically, we need analysis methods that are consistent with the study purpose. There is still a lack of methods and software for the concordant integrative gene set enrichment analysis.
For a gene set enrichment analysis, an enriched gene set in one data set may also be enriched in another data set. However, this gene set is not necessarily concordantly enriched in both data sets. For an illustration, let us consider a simple artificial example: gene set S contains five genes with the first three genes strongly upregulated in the first data set (the last two genes nondifferentially expressed) and the last three genes strongly upregulated in the second data set (the first two genes nondifferentially expressed). Then, in general, gene set S is enriched in upregulated differential expression in both data sets. However, there is only one gene upregulated in both data sets; the remaining genes are showing inconsistent behavior. Therefore, unless the proportions of differentially expressed genes are small, there is a lack of evidence to conclude that gene set S is concordantly enriched in both data sets. Since a gene set concordantly enriched in several similar studies may be of great importance, it is necessary to develop statistical methods for detecting these gene sets.
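The artificial example can be made concrete in a few lines. This is only an illustrative sketch: the gene labels g1 to g5 are hypothetical placeholders, and set membership stands in for "strongly upregulated".

```python
# Toy version of the five-gene example; labels g1..g5 are hypothetical.
gene_set = ["g1", "g2", "g3", "g4", "g5"]
up_in_first = {"g1", "g2", "g3"}    # upregulated in the first data set
up_in_second = {"g3", "g4", "g5"}   # upregulated in the second data set

# Marginal proportion of upregulated genes in each data set
frac_first = len(up_in_first) / len(gene_set)
frac_second = len(up_in_second) / len(gene_set)

# Genes upregulated in both data sets (concordant behavior)
concordant = up_in_first & up_in_second
frac_concordant = len(concordant) / len(gene_set)

print(frac_first, frac_second, sorted(concordant), frac_concordant)
```

Each data set taken alone shows three of five genes upregulated, while only one gene is upregulated in both, which is the gap between separate enrichment and concordant enrichment.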
It has been shown that a mixture model based approach can be an efficient approach to the differential expression analysis [14]. Furthermore, we have also demonstrated the usefulness of mixture models in concordant analysis of differential expression among large-scale expression data sets [15, 16]. The advantage of the mixture model based approach is that the probability of a particular behavior (upregulated or downregulated) can be modeled and estimated for a given gene. Thus, it is feasible to address how likely this gene shows a concordant behavior. In this study, we develop a mixture model based method for a concordant integrative gene set enrichment analysis.
Methods
Concordant gene set enrichment
In this study, we consider multiple large-scale two-sample gene expression data sets. We use K to denote the number of these data sets and m to denote the number of common genes in these data sets. For each of these data sets, we usually use a t-type test to evaluate the differential expression of each gene and a gene set enrichment analysis (GSEA) method to evaluate the enrichment level of a given gene set. In order to define and evaluate a concordant gene set enrichment when an integrative analysis is conducted for all K data sets, we categorize differential expression in each data set into three underlying (unobserved) representative categories: no change, positive change (or upregulated differential expression) and negative change (or downregulated differential expression). Due to data noise, what we observe from experiments may not indicate the underlying truth. (For example, a gene with slight downregulated differential expression may show a small positive t-type test value.) Although these categories are not observed in practice, they can be considered in a mixture model framework.
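For a given gene set S, we then consider the concordant enrichment score (CES), the probability Pr(gene set S is concordantly enriched | observed data of gene set S), which can be useful for prioritizing different gene sets in practice.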
Before we derive the mathematical formula for the above probability, we need to explain the term "enriched". As suggested by Efron and Tibshirani [17], unless the test statistic for a gene set enrichment analysis (GSEA) considers the genome-wide background patterns (e.g. the statistics proposed in the original GSEA [7, 8]), it is necessary to consider the "row randomization" for genes in addition to the "column permutation" for samples. Therefore, the term "enriched" means "higher/better than expected".
In order to calculate CES practically, we propose a three-component multivariate mixture model. In the model, each component is a normal distribution. The model configuration for these three components is consistent with the differential expression categories as described above. This model is conceptually analogous to a simple normal mixture approach to differential expression analysis proposed by McLachlan et al. [14]. The special feature of our model is that we focus on some specific combination of components from different dimensions. A bivariate version of this model has been used by us to evaluate the concordance and discordance between two large-scale experiments with two sample groups [15] and to integrate two microarray data sets in differential expression analysis [16]. Before the model description, we need to describe the related data preprocessing and differential expression test scores as follows.
Data preprocessing
Because our proposed statistical method is developed based on the differential expression test scores, we assume that the given gene expression data sets have been preprocessed appropriately [18]. For a concordant integrative analysis of multiple data sets, we also need to select genes shared commonly by different data sets. This can be achieved using the genes' unique identifiers.
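Selecting the shared genes amounts to intersecting the identifier sets. A minimal sketch, assuming each data set is held as a dictionary keyed by gene identifier (the function name and data layout are hypothetical):

```python
def restrict_to_common_genes(data_sets):
    """Keep only the genes present in every data set.

    data_sets: list of dicts mapping a gene identifier to its expression values.
    Returns the sorted shared identifiers and the restricted data sets.
    """
    common = set(data_sets[0])
    for ds in data_sets[1:]:
        common &= set(ds)
    ordered = sorted(common)
    return ordered, [{g: ds[g] for g in ordered} for ds in data_sets]
```

Sorting the shared identifiers gives every data set the same gene order, so row i refers to the same gene in all K data sets.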
Differential expression test scores
For each of the two-sample gene expression data sets, we screen individual genes with the traditional two-sample Student's t-test. Several modified t-tests, such as the SAM t-test [19] and the moderated t-test [20], have been widely used in the differential expression analysis of microarray data. These test statistics can generally improve the control of false positives by "softly" filtering out genes with relatively small expression variance. However, we intend to consider all the genes equally important in the concordant integrative analysis of multiple data sets. Furthermore, a given gene can show different levels of variance in different data sets, which may make it difficult to use these modified t-tests. Therefore, we still recommend the traditional two-sample t-test as the differential expression test statistic. (In practice, other test statistics like the SAM t-test or the moderated t-test can still be considered when there is a strong reason to do so.) Because the sample size of a high-throughput study is usually not large, it is generally difficult to validate the normal distribution assumptions for the t-test. Therefore, instead of the theoretical t-distribution, we use the permutation procedure to compute the p-value of an observed t-test statistic [21]. This approach has been widely adopted in the analysis of gene expression data [6].
For K two-sample gene expression data sets with m common genes, we compute the one-sided upper-tailed p-value p_{i,k} for gene X_{i} in the k-th data set, i = 1, 2, . . . , m and k = 1, 2, . . . , K. Then, we perform an inverse normal transformation to obtain a z-score: z_{i,k} = Φ^{−1}(1 − p_{i,k}), where Φ(·) is the cumulative distribution function (c.d.f.) of the standard normal distribution. This transformation has been widely used to improve the fitting of a mixture model [14]. Our proposed statistical methods for the concordant integrative analyses of multiple data sets are developed based on these sets of z-scores.
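The path from expression values to a z-score for one gene in one data set can be sketched as follows. This is a simplified illustration, not the authors' code: it permutes sample labels for a single gene, assumes the pooled variance is never zero, and uses a (hits + 1) / (n_perm + 2) correction (an assumption made here) so that the p-value stays strictly between 0 and 1 and the z-score is finite. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance

def t_statistic(x, y):
    """Traditional two-sample Student's t statistic (pooled variance)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))

def upper_tail_permutation_pvalue(x, y, n_perm=1000, seed=0):
    """One-sided upper-tailed p-value from random relabelings of the samples."""
    rng = random.Random(seed)
    observed = t_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if t_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Assumption: this correction keeps p strictly inside (0, 1) so z is finite.
    return (hits + 1) / (n_perm + 2)

def z_score(p_value):
    """Inverse normal transformation z = Phi^{-1}(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p_value)
```

With this convention a gene that is higher in the first sample group receives a small upper-tailed p-value and therefore a positive z-score, while a gene that is lower receives a negative z-score.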
A mixture model
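For each data set, the z-scores are modeled with three normal components (no change, positive change and negative change), which is a type of well-known simple normal mixture model.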
In the joint model for the K data sets, $\pi_{j_1,j_2,\ldots,j_K}$ is the probability of a gene being in a particular combination of components (j_{1}, j_{2}, . . . , j_{K}) in the different data sets $\left(\sum_{j_1,j_2,\ldots,j_K=0}^{2}\pi_{j_1,j_2,\ldots,j_K}=1\right)$. We call this model a partial concordance/discordance (PCD) model. Notice that a bivariate version of this model has been used to evaluate the overall concordance or discordance of two large-scale data sets and to conduct an integrative analysis of differential expression for two large-scale two-sample data sets [15, 16].
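A form of the model consistent with the description in this section can be written as follows. This is a sketch under stated assumptions rather than the authors' exact equations: the symbols $\rho_{j,k}$, $\mu_{j,k}$ and $\sigma_{j,k}$ (marginal proportion, mean and standard deviation of component j in data set k) are notation introduced here for illustration, and the K z-scores of a gene are assumed to be independent given its component combination.

$$
f_k(z) = \sum_{j=0}^{2} \rho_{j,k}\,\phi\!\left(z;\mu_{j,k},\sigma_{j,k}^{2}\right),
\qquad
f(z_{i,1},\ldots,z_{i,K}) = \sum_{j_1,\ldots,j_K=0}^{2} \pi_{j_1,\ldots,j_K}\prod_{k=1}^{K}\phi\!\left(z_{i,k};\mu_{j_k,k},\sigma_{j_k,k}^{2}\right)
$$

Here $\phi(\cdot;\mu,\sigma^{2})$ is the normal density. Under this form the marginal proportion $\rho_{j,k}$ is the sum of $\pi_{j_1,\ldots,j_K}$ over all combinations with $j_k = j$, so the joint weights carry all the information about concordance across data sets.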
Model estimation
Our mixture model can be estimated by the well-developed EM algorithm [22]. In the model, the differential expression categories are considered as missing information. For any z-score vector (z_{i,1}, z_{i,2}, . . . , z_{i,K}), i = 1, 2, . . . , m, this information can be mathematically represented as $w_{j_1,j_2,\ldots,j_K}^{(i)}=1$ if each z_{i,k} is sampled from the j_{k}-th component (j_{k} = 0, 1 or 2 and k = 1, 2, . . . , K) or zero otherwise.
In the EM algorithm, we iterate the E-step and the M-step until a numerical convergence of the likelihood (not the "complete likelihood"). Let L^{(t)} and L^{(t+1)} be the likelihood values calculated after the t-th and (t + 1)-th iterations, respectively. A numerical convergence is claimed if L^{(t+1)} − L^{(t)} < 0.001.
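The iteration can be sketched in code as follows. This is a minimal illustration and not the authors' R or C implementation: it assumes the z-scores of a gene are independent given its component combination, uses hypothetical names (`em_pcd`, `simulate_concordant_z`), applies the 0.001 tolerance to the observed-data log-likelihood (an assumption about what "likelihood" means numerically), and places no constraint on the component means or variances.

```python
import itertools
import math
import random

def normal_pdf(z, mu, sd):
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def em_pcd(Z, mu_start, sd_start, tol=1e-3, max_iter=200):
    """EM sketch for a K-dimensional three-component normal mixture.

    Z        : m rows, each a list of K z-scores.
    mu_start : mu_start[k][j] is the starting mean of component j in data set k.
    sd_start : same layout, starting standard deviations.
    Returns joint weights pi, means, standard deviations, the posterior
    weights W from the last E-step, and the log-likelihood history.
    """
    m, K = len(Z), len(Z[0])
    mu = [row[:] for row in mu_start]   # copy so the inputs are not modified
    sd = [row[:] for row in sd_start]
    combos = list(itertools.product(range(3), repeat=K))
    pi = {c: 1.0 / len(combos) for c in combos}
    history = []
    for _ in range(max_iter):
        # E-step: posterior probability of each component combination per gene
        W, loglik = [], 0.0
        for z in Z:
            joint = {c: pi[c] * math.prod(normal_pdf(z[k], mu[k][c[k]], sd[k][c[k]])
                                          for k in range(K))
                     for c in combos}
            total = sum(joint.values())
            loglik += math.log(total)
            W.append({c: joint[c] / total for c in combos})
        history.append(loglik)
        # Stop when the observed-data log-likelihood gain falls below tol
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: joint weights, then per-data-set component means and SDs
        pi = {c: sum(w[c] for w in W) / m for c in combos}
        for k in range(K):
            for j in range(3):
                resp = [sum(w[c] for c in combos if c[k] == j) for w in W]
                n_j = sum(resp)
                mu[k][j] = sum(r * z[k] for r, z in zip(resp, Z)) / n_j
                var = sum(r * (z[k] - mu[k][j]) ** 2 for r, z in zip(resp, Z)) / n_j
                sd[k][j] = math.sqrt(var)
    return pi, mu, sd, W, history

def simulate_concordant_z(m, K, seed):
    """Hypothetical test data: every gene has the same state in all K data sets."""
    rng = random.Random(seed)
    shift = {0: 0.0, 1: 2.5, 2: -2.5}
    rows = []
    for _ in range(m):
        state = rng.choice([0, 0, 0, 1, 2])
        rows.append([rng.gauss(shift[state], 1.0) for _ in range(K)])
    return rows
```

The posterior weight W[i][(1, 1, ..., 1)] returned by this sketch plays the role of a per-gene probability of concordant positive change, under the assumed labeling 0 = no change, 1 = positive change, 2 = negative change.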
Concordant enrichment score
The resulting score is the PCD model based estimate for the probability Pr(gene set S is concordantly enriched | observed z-score matrix of gene set S). In the formula, I(true statement) = 1 and I(false statement) = 0 (indicator function). Notice that the formula can be simplified to a well-known binomial tail probability if all $\{u_{S,i}\}_{i=1}^{m_S}$ are the same. However, $\{u_{S,i}\}_{i=1}^{m_S}$ are usually different in practice. Then, we need to calculate a tail probability for a heterogeneous Bernoulli process.
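For small examples, the tail probability of a heterogeneous Bernoulli process can be checked exactly. The sketch below is not the paper's procedure: it swaps in a standard dynamic-programming recursion for the distribution of a sum of independent Bernoulli variables (often called a Poisson binomial distribution), assuming the genes behave independently. The function names are hypothetical.

```python
from math import comb

def count_distribution(u):
    """dist[r] = Pr(exactly r events) for independent Bernoulli(u_i) events."""
    dist = [1.0]
    for p in u:
        nxt = [0.0] * (len(dist) + 1)
        for r, mass in enumerate(dist):
            nxt[r] += mass * (1.0 - p)     # gene contributes no event
            nxt[r + 1] += mass * p         # gene contributes one event
        dist = nxt
    return dist

def tail_probability(u, threshold):
    """Pr(R > threshold), where R is the number of events."""
    return sum(mass for r, mass in enumerate(count_distribution(u)) if r > threshold)

def binomial_tail(n, p, threshold):
    """Pr(R > threshold) when all n event probabilities equal p."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r)
               for r in range(n + 1) if r > threshold)
```

When all probabilities are equal, `tail_probability` agrees with `binomial_tail`, which mirrors the simplification to a binomial tail probability noted above.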
False discovery rate
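Because CES is a probability, one common way to attach a false discovery rate to a ranked list in mixture-model frameworks is to average 1 − CES over the gene sets ranked at or above a given set. The sketch below implements that estimate; whether it matches the authors' exact formula is an assumption, and the function name is hypothetical.

```python
def fdr_by_rank(ces):
    """For each gene set, average (1 - CES) over the sets ranked at or above it.

    ces: list of concordant enrichment scores, one per gene set.
    Returns a list of FDR estimates in the same order as the input.
    """
    order = sorted(range(len(ces)), key=lambda s: ces[s], reverse=True)
    fdr = [0.0] * len(ces)
    running = 0.0
    for rank, s in enumerate(order, start=1):
        running += 1.0 - ces[s]        # expected false discoveries so far
        fdr[s] = running / rank        # divided by the number of sets selected
    return fdr
```

Under this estimate the FDR is non-decreasing down the ranked list, so a cut-off such as FDR < 0.05 selects a top segment of the ranking.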
Computational approximation
Although we have derived the formula for concordant enrichment score (CES), it is usually difficult to compute it in practice: the number of possible component combinations from different genes in a given gene set is usually huge. Based on our observation, most gene sets contain more than 20 genes. Since different genes have different probabilities of being concordantly upregulated and/or downregulated differentially expressed, we cannot further simplify the formula (we need to calculate a tail probability for a heterogeneous Bernoulli process). However, we can consider a simulation based approach to the approximation of CES given in Equation (2).
Monte Carlo approximation
Recall that the probability of event of interest u_{ S,i } can be calculated for a gene X_{ S,i } in a given gene set S = {X_{ S,i }, i = 1, 2, . . . , m_{ S }}. The simulation scheme is based on a heterogeneous Bernoulli process:

For each X_{ S,i }, simulate a Bernoulli random variable with probability of event u_{ S,i };

For the gene set S, count the number R of events from different genes;

Repeat the above two steps B times and report the approximated enrichment score as $\#\{R > m_S\,\widehat{\pi}_{1,1,\ldots,1}\}/B$, the proportion of the B repetitions in which R exceeds $m_S\,\widehat{\pi}_{1,1,\ldots,1}$.
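The three steps above can be sketched directly. In this illustration `u` holds the event probabilities u_{S,i} for the genes in S and `pi_hat` stands for the estimated proportion used in the threshold; the function name, argument names and the fixed seed are assumptions made here.

```python
import random

def monte_carlo_ces(u, pi_hat, B=2000, seed=0):
    """Approximate Pr(R > m_S * pi_hat) by simulating a heterogeneous Bernoulli process."""
    rng = random.Random(seed)
    threshold = len(u) * pi_hat
    exceed = 0
    for _ in range(B):
        # Steps 1 and 2: one Bernoulli draw per gene, then count the events
        R = sum(1 for p in u if rng.random() < p)
        if R > threshold:
            exceed += 1
    # Step 3: proportion of repetitions in which the count exceeds the threshold
    return exceed / B
```

For example, `monte_carlo_ces([0.9, 0.8, 0.1, 0.7], pi_hat=0.25)` estimates the probability that more than one of the four genes shows the event of interest.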
One related question is how large B should be set in the simulation. As we have discussed above, the concordant enrichment score (CES) is closely related to the false discovery rate (FDR). Then, it is reasonable to require its accuracy around the 1% level for the 95% CES level (e.g. a 95% normally approximated binomial confidence interval 0.95 ± 0.01) and B = 2000 is adequate. Therefore, the Monte Carlo approximation is a feasible approach in practice. (In general, if we do not have a specific CES level, we can simply use an upper bound B = 10000 calculated based on the 95% normally approximated binomial confidence interval. Then, the related computing burden is still practically feasible.)
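The two values of B quoted above follow from the normal-approximation confidence interval for a binomial proportion, whose half-width is z·sqrt(c(1 − c)/B) for a CES level c. A small sketch of that arithmetic (the function name is hypothetical):

```python
import math
from statistics import NormalDist

def required_B(ces_level, half_width=0.01, confidence=0.95):
    """Smallest B whose normal-approximation interval half-width is at most half_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 1.96 for 95%
    return math.ceil(z * z * ces_level * (1.0 - ces_level) / half_width ** 2)
```

At the 95% CES level this gives 1825, so B = 2000 is adequate. The worst case is a CES level of 0.5, which gives 9604 and explains the upper bound B = 10000.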
Results and discussion
Application #1: an integrative analysis of two data sets
To illustrate our method, we first considered two microarray gene expression data sets collected for lung cancer studies [24, 25]. The first one was collected by a research group in Boston (referred to as Boston data) and the second one was collected by a research group in Michigan (referred to as Michigan data). For an application of their Gene Set Enrichment Analysis (GSEA) method, Subramanian, Tamayo et al. [8] reorganized these two data sets, which were made freely available at http://www.broadinstitute.org/gsea. There were 62 and 86 patients for the Boston and Michigan data sets, respectively. These patients were classified as either "good" or "poor" outcomes. Expression profiles were available for 5216 genes that were common for both data sets. To compare our analysis results with the results reported by Subramanian, Tamayo et al. [8], we used an early version of gene set collection that was used by them [8]. Subramanian, Tamayo et al. [8] also suggested a moderate range of 15-500 genes for the sizes of gene sets that were analyzed in their gene set analysis. A gene set was not analyzed if its number of genes was outside this range. This range was used in our analysis. To demonstrate the advantage of their GSEA, Subramanian, Tamayo et al. [8] observed several commonly significantly enriched gene sets from the analysis of each data set although no individual genes with significantly differential expression were identified.
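The size filter can be sketched as below. Counting a gene set's size after restricting it to the genes common to the data sets is an assumption made here, and the function name and data layout are hypothetical.

```python
def filter_gene_sets(gene_sets, common_genes, min_size=15, max_size=500):
    """Restrict each gene set to the common genes and keep sizes in [min_size, max_size]."""
    common = set(common_genes)
    kept = {}
    for name, genes in gene_sets.items():
        members = sorted(set(genes) & common)
        if min_size <= len(members) <= max_size:
            kept[name] = members
    return kept
```

Gene sets falling outside the 15-500 range are dropped before any enrichment score is computed.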
A comparison based on two data sets.

| Gene sets enriched in poor outcome | FDR |
| --- | --- |
| Boston data | |
| Hypoxia and p53 in the cardiovascular system | <0.001 |
| Aminoacyl tRNA biosynthesis | <0.001 |
| Insulin upregulated genes | <0.001 |
| tRNA synthetases | <0.001 |
| Leucine deprivation downregulated genes | <0.001 |
| Telomerase upregulated genes | <0.001 |
| Glutamine deprivation downregulated genes | <0.001 |
| Cell cycle checkpoint | <0.001 |
| Michigan data | |
| Glycolysis gluconeogenesis | <0.001 |
| vegf pathway | <0.001 |
| Insulin upregulated genes | <0.001 |
| Insulin signaling | 0.021 |
| Telomerase upregulated genes | <0.001 |
| Glutamate metabolism | 0.018 |
| Ceramide pathway | 0.076 |
| p53 signalling | <0.001 |
| tRNA synthetases | <0.001 |
| Breast cancer estrogen signalling | <0.001 |
| Aminoacyl tRNA biosynthesis | <0.001 |
Application #2: an integrative analysis of three data sets
Among the gene sets with FDR < 0.05, we observed many interesting pathways. Among the 74 gene sets identified based on downregulated differential expression, there were pathways related to immune system, TCR signaling, viral myocarditis, BCR signaling, cell survival, WNT-β-catenin signaling, cytokine, PI3K, VEGF signaling, interleukins and GPCR signaling. Among the 99 gene sets identified based on upregulated differential expression, there were pathways related to different types of metabolism, cell cycle, checkpoints, and related phases and transitions, DNA replication, synthesis damage and repair, p53, glycolysis gluconeogenesis, telomere maintenance and extension, apoptosis, TGF-β signaling, tRNA aminoacylation, gene expression, lung cancer and PDGF signaling.
Consistency between two application results
KEGG cancer pathways
An exploration of KEGG cancer pathways.

| KEGG cancer pathways | U/D | Two data sets: CES | Two data sets: Diff. | Two data sets: FDR | Three data sets: CES | Three data sets: Diff. | Three data sets: FDR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PPAR signaling * | down | 0.671 | >0.1 | 0.194 | 0.563 | >0.1 | 0.210 |
| MAPK signaling ** | down | 0.639 | >0.1 | 0.209 | 0.857 | >0.1 | 0.063 |
| ERBB signaling * | up | 0.629 | >0.1 | 0.119 | 0.581 | >0.1 | 0.188 |
| Calcium signaling ** | down | 0.925 | >0.1 | 0.051 | 0.694 | >0.1 | 0.153 |
| Cytokine-cytokine receptor interaction *** | down | 0.717 | >0.1 | 0.172 | 0.943 | >0.1 | 0.022 |
| Cell cycle *** | up | 0.998 | >0.1 | <0.001 | 0.959 | >0.1 | 0.012 |
| p53 signaling *** | up | 0.999 | >0.1 | <0.001 | 0.944 | >0.1 | 0.018 |
| MTOR signaling * | down | 0.724 | >0.1 | 0.167 | 0.611 | >0.1 | 0.193 |
| Apoptosis * | down | | ≤ 0.1 | | 0.776 | >0.1 | 0.102 |
| WNT signaling *** | down | | ≤ 0.1 | | 0.888 | >0.1 | 0.048 |
| TGF-β signaling | down | | ≤ 0.1 | | 0.521 | >0.1 | 0.236 |
| VEGF signaling *** | down | 0.784 | >0.1 | 0.136 | 0.919 | >0.1 | 0.033 |
| Focal adhesion *** | up | >0.999 | >0.1 | <0.001 | 0.830 | >0.1 | 0.077 |
| ECM receptor interaction *** | up | 0.996 | >0.1 | <0.001 | 0.977 | >0.1 | 0.005 |
| Adherens junction * | up | 0.646 | >0.1 | 0.114 | | ≤ 0.1 | |
| JAK-STAT signaling *** | down | 0.875 | >0.1 | 0.082 | 0.901 | >0.1 | 0.044 |
Conclusions
In this study, we proposed a mixture model based statistical method for the concordant integrative gene set enrichment analysis. Our method was first applied to two published lung cancer microarray gene expression data sets. As shown in Figure 1, gene sets like the proteasome and BCR signaling pathways were identified by our method. These gene sets were not identified in the previous study [8] since the differential gene expression among these gene sets was relatively weak. However, the concordant enrichment of these gene sets was detected by our method. This comparison illustrated the advantage of our proposed concordant integrative gene set enrichment analysis. The analysis results from our second application (a concordant integrative analysis of three data sets) also showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection demonstrated the practical usefulness of our proposed method. Overall, this study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.
There are several advantages for our proposed method. The genome-wide concordance can be statistically tested before the integrative analysis. The mixture model is estimated based on the maximum likelihood estimation procedure. Furthermore, our integrative analysis of gene sets is based on a probabilistic framework, which can be conveniently used for the calculation of false discovery rates. However, there are also limitations. Our proposed mixture model is simple and it contains only three components. Normal distributions are assumed for these components. Furthermore, we assume that different genes behave independently (Gold et al. [32] have shown that the independence assumption can be acceptable in practice). These limitations should be considered when our method is used in practice.
For our future research, it will be useful to extend our proposed method for an integrative analysis of data with multiple sample groups. This will be particularly useful for studying diseases with different progression stages. Although a major proportion of gene expression data have been collected for binary outcomes (e.g. normal vs. abnormal), data with other types of responses (e.g. survival data) have also been collected. It will also be useful to extend our method for these data. Furthermore, when our proposed method is used for an integrative analysis of more than 3 data sets, it is desirable to simplify the mixture model so that the number of model parameters (particularly for $\left\{{\pi}_{{j}_{1},{j}_{2},...,{j}_{K}}\right\}$) can be reduced to achieve statistical efficiency. Furthermore, we would also like to consider more robust approaches (e.g. a nonparametric method) to the concordant integrative gene set enrichment analysis.
Declarations
Acknowledgements
The related R code and C code are freely available at the authors' web site [33]. This work was partially supported by the NIH grant GM092963 (Y.Lai).
Declarations
Publication of this article was funded by the NIH grant GM092963 (Y.Lai).
This article has been published as part of BMC Genomics Volume 15 Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S1.
Authors’ Affiliations
References
1. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470. 10.1126/science.270.5235.467.
2. Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology. 1996, 14: 1675-1680. 10.1038/nbt1296-1675.
3. Network TCGA: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008, 455: 1061-1068. 10.1038/nature07385.
4. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 1344-1349. 10.1126/science.1158441.
5. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008, 453: 1239-1243. 10.1038/nature07002.
6. Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, USA. 2003, 100: 9440-9445. 10.1073/pnas.1530509100.
7. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop L: PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003, 34: 267-273. 10.1038/ng1180.
8. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.
9. de Magalhaes JP, Curado J, Church GM: Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics. 2009, 25: 875-881. 10.1093/bioinformatics/btp073.
10. Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003, 19 (Supplement 1): i84-90.
11. Tanner SW, Agarwal P: Gene Vector Analysis (Geneva): A unified method to detect differentially-regulated gene sets and similar microarray experiments. BMC Bioinformatics. 2008, 9: 348. 10.1186/1471-2105-9-348.
12. Shen K, Tseng GC: Meta-analysis for pathway enrichment analysis when combining multiple genomic studies. Bioinformatics. 2010, 26: 1316-1323. 10.1093/bioinformatics/btq148.
13. Chen M, Zang M, Wang X, Xiao G: A powerful Bayesian meta-analysis method to integrate multiple gene set enrichment studies. Bioinformatics. 2013, 29: 862-869. 10.1093/bioinformatics/btt068.
14. McLachlan GJ, Bean RW, Jones LB: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006, 22: 1608-1615. 10.1093/bioinformatics/btl148.
15. Lai Y, Adam BL, Podolsky R, She JX: A mixture model approach to the tests of concordance and discordance between two large scale experiments with two-sample groups. Bioinformatics. 2007, 23: 1243-1250. 10.1093/bioinformatics/btm103.
16. Lai Y, Eckenrode SE, She JX: A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics. 2009, 10 (Suppl. 1): S23.
17. Efron B, Tibshirani R: On testing the significance of sets of genes. Annals of Applied Statistics. 2007, 1: 107-129. 10.1214/07-AOAS101.
18. Amaratunga D, Cabrera J: Exploration and analysis of DNA microarray and protein array data. 2003, John Wiley & Sons, Inc.
19. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
20. Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004, 3: 3.
21. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18: 71-103. 10.1214/ss/1056397487.
22. McLachlan GJ, Krishnan T: The EM algorithm and extensions. 2008, John Wiley & Sons, Inc., 2nd edition.
23. Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995, 57: 289-300.
24. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, USA. 2001, 98: 13790-13795. 10.1073/pnas.191502998.
25. Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine. 2002, 8: 816-824.
26. Zhang HG, Wang J, Yang X, Hsu HC, Mountz JD: Regulation of apoptosis proteins in cancer cells by ubiquitin. Oncogene. 2004, 23: 2009-2015. 10.1038/sj.onc.1207373.
27. Davies AM, Lara PNJ, Mack PC, Gandara DR: Incorporating bortezomib into the treatment of lung cancer. Clinical Cancer Research. 2007, 13: s4647-4651. 10.1158/1078-0432.CCR-07-0334.
28. Yang Z, Gagarin D, St Laurent Gr, Hammell N, Toma I, Hu CA, Iwasa A, McCaffrey TA: Cardiovascular inflammation and lesion cell apoptosis: a novel connection via the interferon-inducible immunoproteasome. Arteriosclerosis, Thrombosis, and Vascular Biology. 2009, 29: 1213-1219. 10.1161/ATVBAHA.109.189407.
29. Faris M: Atypical B Cell Receptor Signaling: Straddling Immune Diseases and Cancer. International Reviews of Immunology. 2013, 32: 355-357. 10.3109/08830185.2013.817248.
30. Maciejewski H: Gene set analysis methods: statistical models and methodological differences. Briefings in Bioinformatics. 2013.
31. Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I: Diversity of gene expression in adenocarcinoma of the lung. Proceedings of the National Academy of Sciences, USA. 2001, 98: 13784-13789. 10.1073/pnas.241500798.
32. Gold DL, Coombes KR, Wang J, Mallick B: Enrichment analysis in high-throughput genomics - accounting for dependency in the NULL. Briefings in Bioinformatics. 2007, 8: 71-77.
33. Web link for R code. [http://home.gwu.edu/~ylai/research/Concordance]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.