Identification of genes associated with multiple cancers via integrative analysis

Background: Advancement in gene profiling techniques makes it possible to measure expressions of thousands of genes and identify genes associated with development and progression of cancer. The identified cancer-associated genes can be used for diagnosis, prognosis prediction, and treatment selection. Most existing cancer microarray studies have been focusing on the identification of genes associated with a specific type of cancer. Recent biomedical studies suggest that different cancers may share common susceptibility genes. A comprehensive description of the associations between genes and cancers requires identification of not only multiple genes associated with a specific type of cancer but also genes associated with multiple cancers.
Results: In this article, we propose the Mc.TGD (Multi-cancer Threshold Gradient Descent), an integrative analysis approach capable of analyzing multiple microarray studies on different cancers. The Mc.TGD is the first regularized approach to conduct "two-dimensional" selection of genes with joint effects on cancer development. Simulation studies show that the Mc.TGD can more accurately identify genes associated with multiple cancers than meta analysis based on "one-dimensional" methods. As a byproduct, identification accuracy of genes associated with only one type of cancer may also be improved. We use the Mc.TGD to analyze seven microarray studies investigating development of seven different types of cancers. We identify one gene associated with six types of cancers and four genes associated with five types of cancers. In addition, we also identify 11, 9, 18, and 17 genes associated with 4 to 1 types of cancers, respectively. We evaluate prediction performance using a Leave-One-Out cross validation approach and find that only 4 (out of 570) subjects cannot be properly predicted.
Conclusion: The Mc.TGD can identify a short list of genes associated with one or multiple types of cancers. The identified genes are considerably different from those identified using meta analysis or analysis of marginal effects.


Background
Microarrays have been extensively used to profile tissues on a genome-wide scale. Genes identified from microarray studies can be used as cancer markers for diagnosis, prognosis prediction, and treatment selection. As an example, microarray gene signatures have been used in breast cancer and lymphoma clinical practice [1]. In this article, we focus on microarray studies where gene expressions are measured along with certain cancer clinical outcomes. The goal of such studies is to identify genes with important impacts on the clinical outcomes of interest, which may include risk of developing cancer, cancer status, cancer survival, and response to treatment [2].
Analysis of cancer microarray data is challenging, first because of the high dimensionality of gene expressions. In addition, unlike simple Mendelian diseases, development and progression of cancer are affected by the joint effects of multiple genetic defects. This in turn demands modeling the joint effects of a large number of genes in a single statistical model and makes analysis of one gene at a time (i.e., marginal gene effects) suboptimal. Moreover, out of a large number of genes surveyed, only a subset is cancer-associated. To discriminate those cancer-associated genes from noise, various filter, wrapper, and embedded statistical methods have been developed [3].
In most existing studies, attention has been focused on analysis of a single dataset and identification of genes associated with a single cancer clinical outcome. Consider a hypothetical study where we are interested in identifying genes associated with development of breast cancer. Assume that there are five genes of interest: genes A-E. The goal of most existing studies corresponds to the first column of Table 1, which is to distinguish cancer-associated genes A and B from noisy genes C, D, and E. In this article, we refer to such a gene selection study as "one-dimensional". That is, selection is only carried out on the genes.
All cancer cells share two essential characteristics: uncontrolled growth and local tissue invasion or metastasis. In addition, there is strong evidence that certain cancers share common susceptibility genes. Examples include the BRCA1 and BRCA2 tumor suppressor genes, whose mutations are associated with the inherited forms of both breast and ovarian cancers [4]. Over-expression of the HER-2 oncogene has been reported in 10-40% of primary breast and ovarian tumors and is strongly associated with a poor clinical prognosis [5]. Gene WWOX is a tumor suppressor gene mutated in both breast and prostate cancers [6]. Gene ADH is associated with development of lung cancer and head/neck cancer [7,8]. The wound response signature, which is a breast cancer prognostic gene signature, also has predictive power for prognosis of lung cancer and prostate cancer [9]. Simultaneously examining multiple cancers and searching for their common genomic basis will enable us to identify more essential features of cancer and lead to a better understanding of the subtle connections among different types of cancers [10].
When studying a single type of cancer, genes can be categorized simply as either cancer-associated or not. Selection only needs to be conducted at the gene dimension. When studying multiple cancers, the categorization becomes more complicated. Consider the hypothetical study presented in Table 1. Suppose that, in addition to breast cancer, we are also interested in ovarian and lung cancers. Among the five genes, gene A is associated with all three types of cancers. Genes B and C are associated with two types of cancers. Gene D is associated with only one type of cancer, and gene E is not associated with any of the three cancers. Examination of Table 1 suggests that development of breast and ovarian cancers may share a common genomic mechanism, which likely involves the protein encoded by gene B. However, such a mechanism may have no effect on development of lung cancer. When multiple genes and multiple cancers are considered, selection needs to be carried out at two dimensions: (a) the gene dimension. For each type of cancer, genes associated with its development need to be identified. For example, for ovarian cancer, this dimension of selection amounts to differentiating genes A-C from genes D and E; and (b) the cancer dimension. For each gene, we are interested in identifying cancers it is associated with. For example, for gene B, this dimension of selection amounts to differentiating breast and ovarian cancers from lung cancer. Of note, although there are studies investigating multiple genes and multiple cancers, none of them formally considers this as a two-dimensional selection problem.
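The two selection dimensions can be made concrete with a small sketch. The association sets below are read off the hypothetical Table 1 and the surrounding description; the function names are illustrative only and are not part of the proposed method.

```python
# Hypothetical associations from Table 1: gene -> cancers it is associated with.
ASSOCIATIONS = {
    "A": {"breast", "ovarian", "lung"},
    "B": {"breast", "ovarian"},
    "C": {"ovarian", "lung"},
    "D": {"lung"},
    "E": set(),
}


def genes_for_cancer(cancer):
    """Gene dimension: genes associated with one given cancer."""
    return sorted(g for g, cancers in ASSOCIATIONS.items() if cancer in cancers)


def cancers_for_gene(gene):
    """Cancer dimension: cancers that one given gene is associated with."""
    return sorted(ASSOCIATIONS[gene])


print(genes_for_cancer("ovarian"))  # ['A', 'B', 'C']
print(cancers_for_gene("B"))        # ['breast', 'ovarian']
```

A one-dimensional analysis corresponds to calling `genes_for_cancer` for a single cancer; the two-dimensional problem asks for the whole table at once.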
Table 1 (cancer development)
| Gene | Breast | Ovarian | Lung |
|------|--------|---------|------|
| A    | X      | X       | X    |
| B    | X      | X       |      |
| C    |        | X       | X    |
| D    |        |         | X    |
| E    |        |         |      |
"X" indicates an association between the corresponding gene and cancer development.

Studies conducted to identify genes associated with multiple cancers include [11], where 218 tumor samples spanning 14 common tumor types and 90 normal tissue samples were collected and analyzed to identify a gene signature that is differentially expressed in metastatic tumors of diverse origins relative to primary cancers. A "support vector machine + recursive feature elimination" approach is proposed. Such an approach is limited to categorical clinical outcomes. We note that the data structures and scientific questions of interest in [11] and their counterparts in this article are significantly different. More specifically, [11] has one multiclass classification problem, whereas we have multiple binary classification problems. Rhodes et al. [10] examined 21 cancer microarray datasets spanning 12 distinct cancer types and identified a set of 67 genes that are universally activated in most cancer types relative to normal tissues. The approach proposed in [10] can only study the marginal effects of genes, whereas cancer development is associated with the joint effects of multiple genetic defects. Segal et al. [12] pooled 1975 human DNA microarrays spanning 22 tumor types and characterized gene expression profiles in tumors as a combination of activated and deactivated modules. An approach similar to Fisher's meta analysis approach is proposed, which can study the marginal effects of genes only. Chan and Mousavi [13] proposed a stochastic Bayesian approach to identify susceptibility genes shared by development of breast and ovarian cancers. The SHEBA approach demands selection of closely related cancers. Considering our limited knowledge of mechanisms beneath cancer development, potential applications of this approach can be limited. Yang et al. [14] analyzed 4 cancer prognosis studies involving breast cancer, leukemia, and mesothelioma and identified 42 genes that show consistent up- or down-regulation in patients with poor disease outcomes. An extension of the approach in [10] is considered, which can only study the marginal effects of genes. Xu et al. [15] collected 26 cancer datasets across 21 major human cancer types and identified a common cancer signature consisting of 46 genes. The proposed TSPG approach is limited to categorical clinical outcomes and is difficult to extend. Choi et al. [16] analyzed 10 gene expression datasets from cancers of 13 different tissues and constructed two distinct coexpression networks: a tumor network and a normal network. This study focuses on analyzing the pair-wise interactions between genes. Lê Cao et al. [17] analyzed the NCI60 datasets, where the transcriptome of 60 cancer cell lines was investigated. The sparse partial least squares (sPLS) method was used, which cannot be easily extended to other data setups or models.
Existing methods for analyzing multiple cancer microarray datasets may have one or more of the following drawbacks. First, attention has been focused on analyzing one gene at a time (i.e., the marginal effects of genes). Examples include [10,12,14,16] and others. Since development and progression of cancer are caused by the joint effects of multiple genes, analyzing individual genes separately does not make full use of information in data. In this study, we include all genes in a single statistical model and account for their joint effects. Second, the focus has been on identification of genes associated with all cancers being investigated. Such a strategy demands preselection of cancers having a significantly overlapped genomic basis. For example, in [13], only breast cancer and ovarian cancer, which are known to share a common genomic basis, are investigated. This strategy may have significant limitations given the great heterogeneity among different cancers and our limited knowledge of cancer genomics. In this study, we release this constraint, and allow the data to reveal which cancers a particular gene may be associated with. Third, multiple datasets are usually analyzed separately. Then, summary statistics (for example p-values) from analysis of each individual dataset are combined using meta analysis methods to search for overlaps of findings. Such an approach can be inefficient since microarray studies have small sample sizes, and analyzing each individual dataset separately may have insufficient power and may lead to high false positive and false negative errors. Fourth, inefficient feature selection methods are employed. For example, in [15], the number of cancer-associated genes needs to be predetermined, and the heuristic exhaustive search approach in [13] can accommodate only a small number of genes.
In this article, we propose a new statistical approach, Mc.TGD (Multi-cancer Threshold Gradient Directed), for investigation of associations between multiple genes and multiple cancers. The Mc.TGD is an integrative analysis approach in which raw data from multiple studies are pooled and analyzed. It differs significantly from meta analysis methods, which analyze each dataset separately and pool summary statistics. Unlike existing approaches, the Mc.TGD can model the joint effects of multiple genes, does not make assumptions on the genomic basis of cancers, uses effective gene selection techniques, and is broadly applicable. In this article, we analyze studies investigating the risk of developing cancer, which have binary outcomes. The Mc.TGD can also be used to analyze cancer microarray studies with survival, quantitative, and categorical outcomes.

Data collection
As shown in Table 2, we collect data from seven studies conducted by different research groups who investigated cancers of different tissues and used different profiling platforms.
The normalized datasets have been downloaded from the Stanford Microarray Database [18] and NCBI [19]. These seven datasets have also been investigated in [16], where three more datasets are analyzed. Including these three additional datasets reduces the number of genes measured in all studies from 2207 to 371. To keep a reasonable number of genes, only the seven studies described in Table 2 are analyzed. Of note, although this study and [16] analyze similar datasets, the two studies differ significantly in that ours analyzes multiple genes at a time and seeks to identify those with important joint effects. In contrast, [16] analyzes one gene at a time. Thus, the two studies are not directly comparable. Rather, they investigate different aspects of genes and complement each other.
The following data processing is conducted for each dataset separately. Negative values of Affymetrix measurements are considered as missing. Genes with more than 70 missing values are filtered out. All of the expression values are log2 transformed. Each clone is mapped to a UniGene accession based on UniGene build #162. For multiple clones matched to the same UniGene accession, the one with the fewest missing values is chosen. Missing measurements are imputed using the means of gene expressions across samples. For each dataset, each gene expression is normalized to zero mean and unit variance. A consensus set of 2207 genes is identified.
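The per-dataset processing can be sketched as follows. This is a minimal illustration under several assumptions that are not in the text: the function name `preprocess` and the parameter `max_missing` are hypothetical, zero values are treated as missing alongside negative ones (they cannot be log-transformed), the variance uses the population form (dividing by n), and the clone-to-UniGene mapping step is omitted.

```python
import math


def preprocess(matrix, max_missing):
    """matrix[g][s]: raw expression of gene g in sample s (None = missing).

    Returns (kept_gene_indices, processed_rows).
    """
    kept, processed = [], []
    for g, row in enumerate(matrix):
        # Non-positive measurements and explicit gaps are treated as missing
        # (the text mentions negative values; zero is an added assumption).
        vals = [None if (v is None or v <= 0) else v for v in row]
        if sum(v is None for v in vals) > max_missing:
            continue  # filter out genes with too many missing values
        logged = [None if v is None else math.log2(v) for v in vals]
        observed = [v for v in logged if v is not None]
        if not observed:
            continue
        # Impute missing values with the gene's mean across samples.
        mean = sum(observed) / len(observed)
        filled = [mean if v is None else v for v in logged]
        # Normalize to zero mean and unit variance.
        mu = sum(filled) / len(filled)
        sd = math.sqrt(sum((v - mu) ** 2 for v in filled) / len(filled))
        if sd == 0:
            continue  # constant genes cannot be standardized (assumption)
        processed.append([(v - mu) / sd for v in filled])
        kept.append(g)
    return kept, processed


raw = [[1.0, 2.0, 4.0, 8.0], [-1.0, -2.0, 4.0, 8.0], [2.0, -1.0, 8.0, 2.0]]
kept, rows = preprocess(raw, max_missing=1)
```

In this toy input the second gene has two missing values and is dropped, while the third has one missing value that is imputed before standardization.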
For the breast, liver, lung and stomach cancer datasets, the tumor sample sizes were much larger than the normal sample sizes. We conduct the same selection as in [16], which leads to an equal number of tumor and normal samples.

Gene identification
We analyze the seven datasets using the Mc.TGD. With 5-fold cross validation, (τ1, τ2, k) = (1.0, 0.85, 1311) are selected as the optimal tuning parameters. Gene identification results, including UniGene identifiers, gene names, and estimated coefficients, are shown in Additional File 1. With the Mc.TGD approach, we conclude an association between a gene and a cancer if and only if a nonzero estimated regression coefficient is observed. A total of 60 genes are identified to be associated with one or more types of cancers.
Gene MT1F (UniGene Hs.438737) is found to be associated with six types of cancers (all except breast cancer). The MT1F gene belongs to the metallothionein (MT) family, which encodes a family of cysteine-rich, low molecular weight proteins. Published studies on MT1F have shown an association between this gene and a protective effect against metal toxicity, involvement in the physiologic regulation of metals such as zinc and copper, and a role in protection against oxidative stress. Since MTs play an important role in transcription factor regulation, problems with MT function or expression may lead to cellular changes that ultimately result in transformation to malignant cells. Studies have found increased expressions of MTs in cancers of the breast, colon, kidney, liver, lung, nasopharynx, ovary, prostate, mouth, salivary gland, testes, thyroid, and urinary bladder. Early studies have also found lower levels of MT expressions in hepatocellular carcinoma and liver adenocarcinoma. Moreover, there is evidence to suggest that higher levels of MT expressions may lead to resistance to chemotherapeutic drugs. We refer to [20][21][22][23][24][25][26] for studies that have identified MT1F as a marker for various cancers. Although MT1F has been previously identified as a marker for breast cancer, our study is unable to identify its association with breast cancer using the data from [27]. There are multiple possible reasons, including the small sample size, quality of data, and possible limitations of the Mc.TGD.
Four genes are found to be associated with five types of cancers. Gene Hs.15154 is sushi-repeat-containing protein, X-linked (SRPX). Its role in cancer development has not been well investigated. Gene Hs.1560 is DNA crosslink repair 1A (PSO2 homolog, S. cerevisiae) with official symbol DCLRE1A. DNA interstrand cross-links prevent strand separation, thereby physically blocking transcription, replication, and segregation of DNA. DCLRE1A is one of several evolutionarily conserved genes involved in repair of interstrand cross-links [28]. It regulates BRCA1, the well-known breast cancer susceptibility gene [29]. In mouse models, it has been shown that DCLRE1A co-regulates with IGF-I. Suppression of IGF-I is associated with a low incidence of kidney disease [30]. In addition, a significant association between DCLRE1A and the development of lung cancer has been observed [31]. Gene Hs.418083 (official symbol RBP4) is retinol binding protein 4. This protein belongs to the lipocalin family and is the specific carrier for retinol in blood. It delivers retinol from the liver stores to the peripheral tissues. RBP4 level can be used as an index of cardiovascular disease risk in subclinical hypothyroidism. Retinol binding protein 4 may contribute to the pathogenesis of nonalcoholic fatty liver disease in type 2 diabetics. Gene Hs.435330 (official symbol KIAA0372) has not been well investigated.
In addition, 11 genes are found to be associated with four types of cancers. A further 9, 18, and 17 genes are found to be associated with three, two, and one types of cancers, respectively. Many of these genes have been previously identified as cancer markers in independent studies.

Evaluation
We evaluate prediction performance of the Mc.TGD identified genes. Since we do not have independent studies with comparable designs, we use the Leave-One-Out (LOO) cross validation evaluation [32].
The LOO approach consists of the following steps:
With the LOO approach, only four subjects in the lung cancer study are not properly classified, which leads to an overall error rate of 0.7% (4 out of 570 subjects). The LOO evaluation is cross validation based. Since a new set of tuning parameters and estimates is computed with each reduced dataset, the LOO approach is expected to be relatively fair.
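A generic leave-one-out loop, consistent with the remark that tuning parameters and estimates are recomputed on each reduced dataset, can be sketched as follows. The nearest-centroid classifier is only a stand-in so that the example runs; it is not the Mc.TGD, and all function names are hypothetical.

```python
def loo_error_rate(X, y, fit, predict):
    """Leave-one-out: refit on n-1 subjects, then predict the held-out one."""
    errors = 0
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # Any tuning and estimation must use the reduced data only.
        model = fit(X_train, y_train)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(y)


def fit_centroids(X, y):
    """Stand-in classifier (NOT the Mc.TGD): per-class mean vectors."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class with the closest centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


X_toy = [[0.0], [0.2], [0.1], [1.0], [0.9], [1.1]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_error = loo_error_rate(X_toy, y_toy, fit_centroids, predict_centroid)

# Reported result in the text: 4 misclassified subjects out of 570.
reported_rate_percent = round(100 * 4 / 570, 1)
```

The last line only reproduces the arithmetic behind the reported error rate.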

Meta analysis
For comparison, we consider the following meta analysis approach. We first analyze each study separately using the TGDR approach [33,34] and then search for genes that are identified in multiple studies. This meta analysis approach uses the voting method to combine analysis results from multiple studies. We are aware that the TGDR can be replaced by other regularization approaches. However, multiple studies have shown that it performs comparably to other single-dataset approaches [33][34][35]. Furthermore, unlike other regularization approaches, the TGDR has a thresholding framework similar to that of the Mc.TGD and is therefore chosen for comparison.
With this approach, a total of 181 genes are identified to be cancer-associated. However, only four genes are found to be associated with two cancers. All the other genes are found to be associated with only one type of cancer. Compared to this approach, the Mc.TGD is able to take information from multiple studies into consideration in gene selection and thus is more effective in identifying genes that are associated with multiple cancers.
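The voting step of this meta analysis amounts to counting, for each gene, the number of studies in which it was selected. A minimal sketch, with hypothetical gene and study labels and with the single-study selection (TGDR in the text) assumed to have been run already:

```python
from collections import Counter


def vote(selected_by_study):
    """Count, for each gene, the number of studies in which it was selected."""
    votes = Counter()
    for genes in selected_by_study.values():
        votes.update(set(genes))  # each study votes at most once per gene
    return votes


def genes_with_at_least(votes, n_studies):
    """Genes selected in at least n_studies separate analyses."""
    return sorted(g for g, c in votes.items() if c >= n_studies)


# Hypothetical per-study selections, for illustration only.
selected = {
    "study_1": ["g1", "g2"],
    "study_2": ["g2", "g3"],
    "study_3": ["g2", "g4", "g1"],
}
votes = vote(selected)
shared = genes_with_at_least(votes, 2)
```

Because each study is analyzed on its own, a gene with a moderate effect in several studies can be missed in each of them and therefore receive no votes, which is the inefficiency discussed above.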

Analysis of marginal associations
With the Mc.TGD, we describe effects of multiple genes using a single statistical model and thus are able to account for their joint effects. To provide a more comprehensive description of identified genes, we also conduct the following analysis of marginal associations: (a) For each gene in each study, we use the Wilcoxon rank-sum test to compare gene expressions of cancer patients with those of normal subjects; (b) We then rank genes using their p-values. The gene with rank 1 has the smallest p-value. This approach is similar in spirit to those for detection of differentially expressed genes in [10,12,14,16].
For genes identified with the Mc.TGD, we show their marginal ranks in Additional File 1. We find that genes identified as jointly associated with cancers do not necessarily have high marginal ranks. For example, gene MT1F is identified to be associated with six types of cancers. However, its marginal ranks are only 532, 71, 54, 336, 25, and 28, respectively. This finding confirms the necessity of identifying genes with joint effects beyond analysis of marginal effects.
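The marginal ranking in steps (a) and (b) can be sketched as below. As a simplifying assumption, the p-value uses the normal approximation to the rank-sum statistic without a tie correction; a library routine such as `scipy.stats.ranksums` could be substituted. Gene labels and function names are hypothetical.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def ranksum_pvalue(x, y):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def marginal_ranks(expr, is_cancer):
    """expr: gene -> expressions per subject; returns gene -> rank (1 = smallest p)."""
    pvals = {}
    for gene, values in expr.items():
        cancer = [v for v, c in zip(values, is_cancer) if c]
        normal = [v for v, c in zip(values, is_cancer) if not c]
        pvals[gene] = ranksum_pvalue(cancer, normal)
    ordered = sorted(pvals, key=pvals.get)
    return {gene: r for r, gene in enumerate(ordered, start=1)}


is_cancer = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {"g_strong": [5, 6, 7, 8, 1, 2, 3, 4], "g_weak": [1, 5, 2, 6, 3, 7, 4, 8]}
ranks = marginal_ranks(expr, is_cancer)
```

Each gene is tested on its own here, so a gene whose effect is visible only jointly with other genes can receive a poor marginal rank, as observed for MT1F.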

Discussion
When implementing the Mc.TGD, we focus on genes measured in all studies. As an alternative, when different studies have overlapped but different sets of genes, we can impute gene expressions not measured as zero, and then apply the Mc.TGD. An important objective of the Mc.TGD is to identify genes associated with multiple or all cancers investigated. The proposed analysis can be increasingly unreliable as the number of overlapped genes decreases. Focusing on genes measured in all studies may pose a limitation to the proposed analysis. However, in the very near future, when pangenomic arrays become routine, this limitation may no longer be an issue.
The Mc.TGD analyzes multiple cancer microarray datasets. The final output may be unreliable if one or more datasets are of low quality. In practical implementation, careful inspection of each individual dataset is imperative.
In this study, we evaluate the identified genes in two different ways. First, for those identified to be associated with six and five types of cancers, we manually search published literature for existing evidence of their association with cancer. Second, we use the LOO approach and evaluate the overall prediction performance of the Mc.TGD and identified genes. As one reviewer pointed out, our evaluation is still far from complete. To fully evaluate the sixty identified genes, independent biomedical studies may be needed, and that is beyond the scope of this article.
In our data analysis, we focus on studies that investigate the risk of developing cancer. Such studies have binary outcomes and can be naturally described using logistic models. As can be seen from the Methods section, the Mc.TGD is also applicable to other cancer clinical outcomes. More specifically, with continuous clinical outcomes, we can use linear regression models. With multiclass categorical outcomes, we can use generalized linear models. With censored survival outcomes, we can use the Cox proportional hazards model. Once statistical models are specified, likelihood functions can be constructed, and the Mc.TGD can be employed.

Conclusion
A large number of cancer microarray studies have been conducted to search for genes associated with development and progression of various types of cancers. Compared with genes associated with a single type of cancer, genes associated with multiple cancers can represent the more essential genomic features of cancer. In this article, we propose Mc.TGD, an integrative analysis approach that can pool and analyze raw data from multiple studies on different types of cancers. Although there are other studies investigating associations between genes and multiple cancers, the Mc.TGD is the first embedded approach to conduct "two-dimensional" selection and account for the joint effects of genes in such selection. Compared with existing approaches, the Mc.TGD can provide a much more comprehensive description of gene effects on cancer.
Seven cancer microarray studies are analyzed. A total of sixty genes are identified. For genes MT1F, DCLRE1A, RBP4, and many others, the identified associations are consistent with findings in the literature. For other genes, such as SRPX and KIAA0372, more biomedical studies are needed to fully understand their roles in cancer. The LOO evaluation suggests satisfactory prediction performance, which provides support for the identified associations. Ideally, prediction evaluation using completely independent data is needed to confirm the findings. However, this is beyond the scope of this article.

Methods
Our proposed approach for detecting genes associated with multiple cancers consists of the following steps: (a) With each dataset, model the joint effects of all genes on cancer clinical outcome using a regression model; (b) Since multiple datasets on multiple cancers are being investigated, define the overall objective function, which measures the overall association between genes and cancer clinical outcomes; and (c) Apply the Mc.TGD, which is an iterative, two-dimensional selection approach. At each iteration, for each gene, the Mc.TGD evaluates its overall effect to determine if it is associated with any cancer, as well as individual effects on each cancer to determine which cancer type(s) it is associated with.

Data and model
Consider M > 1 studies that measure clinical outcomes of possibly different cancers. For simplicity of notation, suppose that the same set of d genes is measured in all M studies. For the datasets presented in Table 2, M = 7 and d = 2207. In what follows, the log-likelihood will be used as the function to be maximized with the Mc.TGD and will be referred to as the objective function.
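For binary outcomes, the objective function can be illustrated with a logistic log-likelihood per study. The sketch below rests on assumptions not spelled out in this text: each study has its own intercept and coefficient vector, and the overall objective is taken to be the sum of the per-study log-likelihoods. The function names are hypothetical.

```python
import math


def log_likelihood(X, y, alpha, beta):
    """Logistic log-likelihood of one study.

    X: list of subjects, each a list of d gene expressions; y: 0/1 outcomes.
    """
    total = 0.0
    for x, yi in zip(X, y):
        eta = alpha + sum(b * v for b, v in zip(beta, x))
        # y*eta - log(1 + exp(eta)), written to avoid overflow for large |eta|.
        total += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total


def overall_objective(studies, alphas, betas):
    """Assumed overall objective: sum of the per-study log-likelihoods."""
    return sum(
        log_likelihood(X, y, a, b)
        for (X, y), a, b in zip(studies, alphas, betas)
    )


study = ([[1.0], [2.0], [0.5]], [1, 0, 1])
null_value = log_likelihood(study[0], study[1], 0.0, [0.0])
two_studies = overall_objective([study, study], [0.0, 0.0], [[0.0], [0.0]])
```

With all coefficients equal to zero every subject contributes -log 2, which gives a convenient check of the implementation.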

Regularized gene selection
The Mc.TGD is an embedded approach, which embeds selection in model fitting [3]. Selection amounts to properly estimating the regression coefficients in logistic models.
6. Steps 2 to 5 are repeated k times, where k is determined with cross validation.
The Mc.TGD uses thresholding to remove noisy genes and carry out gene selection. In Step 1, the Mc.TGD starts with no genes identified as cancer-associated. In Step 2, the gradients are computed. For each study, the gradients measure the strengths of associations between the genes and cancer clinical outcomes. Genes with stronger associations will have larger gradients. To make different genes comparable, their expressions have been normalized to have unit variances. In Step 3, the cross-gene gradients and the corresponding thresholding vector are computed. In this step, the overall association of a gene with all the cancer outcomes is measured. By introducing the threshold, we compare one gene with the rest of the genes. Genes with greater combined strengths of associations with all cancers will have the corresponding components of T equal to one. In Step 4, for each gene, its gradients (the strengths of associations with individual cancer clinical outcomes) are computed. By introducing the cross-cancer thresholding vector, we can identify those cancers this gene is associated with. In Step 5, the two thresholds are combined, allowing the determination of not only whether a particular gene is associated with any type of cancer at all but also which specific cancer type(s) this particular gene is associated with. The estimate is updated if and only if an association is observed. The iterations continue until terminated by cross validation.
To further illustrate, we consider the hypothetical study presented in Table 1. Genes A-D are associated with one or more types of cancers, whereas gene E is not. The cross-gene gradients for genes A-D will be larger than that for gene E. Thus, with the thresholding in Step 3, we are able to discriminate gene E from others. Furthermore, consider gene B as an example. Gene B is associated with development of breast and ovarian cancer but not lung cancer. The gradient for gene B in the lung cancer study will be considerably smaller than those in the breast and ovarian cancer studies. Thus, with the thresholding in Step 4, we can identify gene B as a susceptibility gene for breast and ovarian cancers but not for lung cancer. By combining Steps 3 and 4, we are able to construct a complete description of gene effects as shown in Table 1.
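The two-stage thresholding described above can be sketched in code. This is one plausible reading of the verbal description, not the authors' exact algorithm: the cross-gene gradient is assumed to be the sum of absolute per-study gradients, both thresholds are assumed to be relative to the largest value, the step size `step` is a hypothetical parameter, and intercepts are omitted for brevity.

```python
import math


def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def gradients(X, y, beta):
    """Gradient of one study's logistic log-likelihood with respect to beta."""
    g = [0.0] * len(beta)
    for x, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        for j, v in enumerate(x):
            g[j] += (yi - p) * v
    return g


def mc_tgd(studies, tau1, tau2, k, step=0.01):
    """Sketch of the iteration; returns beta[m][j] for study m and gene j."""
    M, d = len(studies), len(studies[0][0][0])
    beta = [[0.0] * d for _ in range(M)]  # Step 1: no gene selected
    for _ in range(k):                    # Step 6: repeat k times
        # Step 2: per-study gradients.
        g = [gradients(X, y, beta[m]) for m, (X, y) in enumerate(studies)]
        # Step 3: cross-gene gradients (assumed: sum of |gradient|) and threshold T.
        G = [sum(abs(g[m][j]) for m in range(M)) for j in range(d)]
        G_max = max(G)
        T = [1 if G[j] >= tau1 * G_max else 0 for j in range(d)]
        for j in range(d):
            # Step 4: cross-cancer threshold for gene j.
            g_max = max(abs(g[m][j]) for m in range(M))
            for m in range(M):
                t = 1 if abs(g[m][j]) >= tau2 * g_max else 0
                # Step 5: update only where both thresholds are passed.
                if T[j] and t:
                    beta[m][j] += step * g[m][j]
    return beta
```

A gene-cancer pair whose coefficient is never updated keeps an exactly zero estimate, which is how the nonzero pattern of the estimates encodes an association table like Table 1.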

Remarks: connections with existing methods
The Mc.TGD belongs to the family of embedded selection methods [3]. It shares the "computing (gradients), searching (for covariates that can increase value of the objective function), and updating (estimates of selected covariates)" framework with the gradient boosting and individual-dataset TGDR approaches [33,36]. Like many other regularization methods, the Mc.TGD determines the existence of associations by examining the estimated regression coefficients. A nonzero estimated regression coefficient indicates existence of an association, which is equivalent to a thresholding approach with zero as the threshold.
Among the many available selection methods, the TGDR [33] [...] k, τ1 and τ2, which jointly determine sparsity of the association table. More specifically, with fixed (τ1, τ2), the table is sparse with small k and gets denser as k increases. When (τ1, τ2) are small, the table can be dense even with small k. In contrast, when (τ1, τ2) are close to one, the table is sparse with small to moderate k, but eventually becomes dense as k increases.
We select tuning parameters using V-fold cross validation [36]. To facilitate computing, we search over the discrete grid τ1, τ2 = 1, 0.95, 0.9, ..., 0.05, 0. We first randomly partition each dataset into V non-overlapping subsets of equal size. Denote β^(-v) as the Mc.TGD estimate of β based on data without the vth subset of each dataset. The CV objective function is defined as
$$\mathrm{CV}(k, \tau_1, \tau_2) = \sum_{v=1}^{V} R_v\left(\beta^{(-v)}\right),$$
where R_v is the overall objective function evaluated on the vth subsets. Optimal tuning parameters are defined as the (k, τ1, τ2) that maximize the CV objective function.
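The grid search can be sketched as follows. The functions `fit` and `objective` are user-supplied placeholders (for example, an Mc.TGD fit and the overall log-likelihood); their signatures, the name `select_tuning`, and the seeded random partition are assumptions made for this illustration.

```python
import random

# Discrete grid 1, 0.95, ..., 0.05, 0 for both thresholds.
TAU_GRID = [round(1.0 - 0.05 * i, 2) for i in range(21)]


def partition(n, V, seed=0):
    """Randomly split subject indices 0..n-1 into V non-overlapping folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[v::V] for v in range(V)]


def select_tuning(studies, V, k_values, fit, objective, seed=0):
    """Return the (k, tau1, tau2) maximizing the summed held-out objective."""
    folds = [partition(len(y), V, seed) for _, y in studies]
    best, best_cv = None, float("-inf")
    for tau1 in TAU_GRID:
        for tau2 in TAU_GRID:
            for k in k_values:
                cv = 0.0
                for v in range(V):
                    train, test = [], []
                    for (X, y), f in zip(studies, folds):
                        held = set(f[v])
                        train.append((
                            [x for i, x in enumerate(X) if i not in held],
                            [t for i, t in enumerate(y) if i not in held],
                        ))
                        test.append(([X[i] for i in f[v]], [y[i] for i in f[v]]))
                    # Estimate without the v-th subsets, evaluate on them.
                    cv += objective(test, fit(train, tau1, tau2, k))
                if cv > best_cv:
                    best, best_cv = (k, tau1, tau2), cv
    return best
```

Refitting for every k is wasteful; since the estimate after k iterations lies on a single parameter path, a practical implementation would likely run each (τ1, τ2) once and evaluate the held-out objective along the path.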

Remarks: Why is cross validation needed
Although each Mc.TGD iteration increases the value of the overall objective function, the iteration needs to be terminated within a finite number of steps. Otherwise, with the number of genes larger than the sample size, there is a possibility of overfitting, where the value of the overall objective function goes to infinity. In addition, a larger value of the overall objective function does not indicate a better prediction performance of identified genes. Thus, we use cross validation and choose the tuning parameters (particularly a finite k) that maximize the cross-validated prediction.

Remarks: an ad hoc alternative
In some cases, researchers may have certain prior information on sparsity of the association

Parameter paths
To provide a graphic description of the Mc.TGD, we examine its parameter paths (estimates as a function of the number of iterations). We simulate under Scenario 3 presented in Table 3. Two datasets are generated, both with binary outcomes. In each dataset, there are 500 genes and 50 subjects with about an equal number of subjects having Y = 1 and 0. Genes 1 to 10 are associated with the first type of cancer, and genes 6 to 15 are associated with the second type of cancer. The two types of cancers share 5 common susceptibility genes. The rest are noisy genes.
The simulated datasets are analyzed with the Mc.TGD. As can be seen from Figure 1, parameter paths for different genes are significantly different. In the upper-right panel, we show the parameter paths for gene #6, which is associated with both types of cancers. We can see that the estimated coefficients are nonzero for even very small k. In the upper-left and lower-left panels, we show the parameter paths for genes #1 and #11, which are associated with only one type of cancer. We can see that the estimated coefficients are nonzero in only one study. In the lower-right panel, for gene #21, which is not associated with any cancer, the estimated coefficients are zero. Since zero estimates indicate no association, Figure 1 suggests that we are able to correctly determine associations between multiple genes and multiple cancers by investigating properties of the Mc.TGD estimates or their parameter paths.
Table 3 legend. Pos. 1 (2): number of genes identified to be associated with cancer 1 (2); TP 1 (2): number of true positives for cancer 1 (2); Overlap: number of genes identified to be associated with both cancers.

Simulation study
Simulations are conducted to evaluate performance of the Mc.TGD. We assume that there are two studies on two different types of cancers. The benefit of simulating two studies is that the definition of associations between genes and cancers is lucid. As shown in Table 3, we consider the following simulation settings: (a) number of genes: 20, 500 and 1000; (b) sample size: we set sample size equal to 50 in each study; and (c) regression coefficients: for genes associated with the outcomes, we set their regression coefficients equal to 0.25 or 0.35, which correspond to two different levels of signals. In addition, we set genes 1-10 to be associated with the first type of cancer, genes 6-15 to be associated with the second type of cancer, and the rest to be noisy genes. The two types of cancers share 5 common susceptibility genes. We generate gene expressions to be multivariate normally distributed and marginally with zero mean and unit variance. Expressions of genes i and j have correlation coefficient 0.4^|i-j|. We generate the probability of cancer presence from the logistic regression model and then the cancer status from a binomial distribution. Under the present simulation settings, there is about an equal number of subjects with Y = 1 and Y = 0. We simulate 200 replicates and show the summary statistics in Table 3. For comparison, we also consider the TGDR-based meta analysis described in the Results section. This approach is referred to as "TGDR" in Table 3.
Figure 1. Parameter paths of the Mc.TGD estimates (axes: Step, Estimate).
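One way to generate data with the stated properties is sketched below. The AR(1) recursion is a convenient device (an assumption of this sketch, not necessarily the authors' procedure) that yields unit-variance normal expressions with correlation 0.4^|i-j|; a zero intercept is also assumed, which gives roughly balanced outcomes. Function names and the seed are hypothetical.

```python
import math
import random


def simulate_study(n, d, beta, rho=0.4, rng=None):
    """Simulate n subjects with d correlated genes and a binary outcome."""
    if rng is None:
        rng = random.Random(0)
    scale = math.sqrt(1.0 - rho ** 2)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(d - 1):
            # AR(1) recursion: unit variance, corr(x_i, x_j) = rho**|i-j|.
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        # Logistic model for the probability of cancer presence.
        eta = sum(b * v for b, v in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        # Cancer status drawn from a Bernoulli (binomial, size 1) distribution.
        y.append(1 if rng.random() < p else 0)
        X.append(x)
    return X, y


def coefficients(d, associated, signal):
    """associated: 1-based indices of genes with a nonzero coefficient."""
    return [signal if j + 1 in associated else 0.0 for j in range(d)]


rng = random.Random(2024)
beta1 = coefficients(500, range(1, 11), 0.25)  # genes 1-10: first cancer
beta2 = coefficients(500, range(6, 16), 0.25)  # genes 6-15: second cancer
study1 = simulate_study(50, 500, beta1, rng=rng)
study2 = simulate_study(50, 500, beta2, rng=rng)
```

Repeating the last two calls 200 times with fresh random draws would give one replicate set per simulation setting.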
With simulated data, we investigate how many genes are identified to be associated with one or both types of cancers. We can see from Table 3