A novel method to identify gene interaction patterns

Background Gene interaction patterns, including modules and motifs, can be used to identify cancer specific biomarkers and to reveal the mechanism of tumorigenesis. Most of the existing module network inferencing methods focus on gene independent functional patterns, while the studies of overlapping characteristics between modules are lacking. The objective of this study was to reveal the functional overlapping patterns in gene modules, helping elucidate the regulatory relationship between overlapping genes and communities, as well as to explore cancer formation and progression. Results We analyzed six cancer datasets from The Cancer Genome Atlas and obtained three kinds of gene functional modules for each cancer, including Independent-Community, Dependent-Community and Merged-Community. In the six cancers, 59(3.5%) Independent-Communities were identified, while 1631(96.5%) Dependent-Communities were acquired. Compared with Lemon-Tree and K-Means, the gene communities identified by our method were enriched in more known GO categories with lower p-values. Meanwhile, those identified distinguishing communities can significantly distinguish the survival prognostic of patients by Kaplan-Meier analysis. Furthermore, identified driver genes in the gene communities can be considered as biomarkers which can accurately distinguish the tumour or normal samples for each cancer type. Conclusions In all identified communities, Dependent-Communities are the majority. Our method is more effective than the other two methods which do not consider the overlapping characteristics of modules. This indicates that overlapping genes are located in different specific functional groups, and a communication bridge is established between the communities to construct a comprehensive carcinogenesis. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-021-07628-9).


Background
Cancer is a complex disease that threatens human health, and it is mainly driven by the accumulation of a series of genetic changes that can be detected by various high -throughput sequencing technologies [1,2]. With the development of high-throughput next generation factors [10]. If we can accurately understand the pathogenesis and process of tumors, the corresponding treatment methods can be adopted to avoid the serious damage caused by tumors.
Due to the high dimensionality and heterogeneity of cancer data, researchers address an enormous challenge to obtain mutant biomarkers that promote the process of tumorigenesis proliferation from these massive highthroughput data [11][12][13]. Recent studies have shown that the detecting modular network brings new prospects for the development of biological mechanisms and tumors [14,15]. Many module network approaches have been proposed for gene module identification in the past decades. These methods can be divided into three categories, including heuristic clustering module identification method, model-based clustering method and the module network inference method based on gene-based network reconstruction, which depend on different biological assumptions and intuitions. The original module network learning algorithm relies on greedy heuristic algorithm, which was inspired by this viewpoint that co-expressed genes are likely to have similar regulatory patterns and may have the same driver or regulatory genes. For example, Eisen et al. applied a pairwise averagelinkage cluster analysis approach which is a form of hierarchical clustering to gene expression data verified that Clusters of coexpressed genes tend to be enriched for specific functional categories [16]. Although these methods have had an enormous influence, their statistical properties are generally not well understood and important parameters such as the number of clusters are not determined automatically [17]. Therefore, the attention has subsequently shifted to model-based clustering methods, Dahl et.al proposed a approach to obtain a point estimate of the clustering based on the least squares distances from posterior probabilities of two gene clusters [18]. The method groups genes with equal potential variables that control expression, and combines the least squares distances to automatically estimates the number of clusters. Anagha joshi et al. proposed a model clustering method based on Gibbs sampler [17]. This method uses Bayesian method and Gibbs sampling process to iteratively update the cluster assignment of each gene and condition. These methods assumes that the data is generated by a mixture of probability distributions (one for each cluster) and explicitly considers the noise of gene expression data. It allows for a statistical estimate of the generated clusters and gives a formal assessment of the expected number of clusters [17].
In recent years, it has been discovered that module network inference can be combined with gene-based network reconstruction approachs [19,20]. This methodological work complements research that focuses only on applying module network mehods to provide new biomedical and biological insights. For instance, Eric Bonnet et al. proposed a multi-omics module network reasoning method called Lemon-Tree [21], which uses one or more Gibbs samples to infer the initial module from gene expression data, and then use a spectrum edge clustering algorithm to establish a consensus module of genes. Although these methods have achieved good results, these proposed methods of module network recognition based on gene network reconstruction all insist that gene modules are independent and unrelated with each other. Contrary to this point of view, in real biological networks, gene modules may have overlapping characteristics [22]. For example, overlapping characteristics for gene GATA3 and TP53 in different functional pathways in breast cancer in which GATA3, TP53 and PIK3CA regulate the process of neuron apoptotic while gene GATA3, TP53 and BRCA2 response to gamma radiation [23]. This reveals that these genes with overlapping features have different functional roles in different communities. By obtaining the effects of overlapping factors in different gene communities, additional biomarkers may be identified which provide an effective complement to understand cancer progression. Therefore, we import the concept of overlapping community detection to module networks inference. The method is remarkably superior in deducing functional units and overlapping genes which can play different roles in the cell by taking part in several processes.
Here we propose a method to identify cancer function modules based on the concept of overlapping community detection (Fig. 1). The aim of this study was to reveal the overlapping characteristics of gene modules and explore the regulatory relationship between overlapping genes and modules by integrating comprehensive genomic data including gene expression data, copy number data (CNV) and protein-protein interaction (PPI) network data. We first used the Gibbs sampling model to generate initial clusters by clustering gene expression data, and then combined the PPI network with the initial clustering results to construct an interaction network. Finally, functional overlapping modules were detected from the network by overlapping community detection method. Considering overlapping scores, the gene community could be divided into three types: Independent-Community (IC), Dependent-Community (DC) and Merged-Community (MC). Experimental results indicate that ICs identified by our method have independent functions. Overlapping genes in DCs have diverse functions and bridge dependency among communities. The effect of MCs merge is found not only in the network structure but also in the corresponding known pathway function. In Go enrichment analysis, our method can identify more GO categories with lower p-values than other tools. In addition, the driver genes generated can accurately distinguish the tumor and normal samples. The identified Genes in gene expression matrix are pre-processed by gene differential expression analysis to select candidate gene subset. The generation of adjacency matrices is by running one or more instance of initial community procedure. Then, the PPI network is combined to the association between genes. bThe gene community is captured by these three models including Independent-Community, Dependent-Community and Merged-Community communities can distinguish survival and prognosis of patients by Kaplan-Meier analysis.

ICs identified for characterizing breast subtypes
We applied our method to each gene expression profile of six cancer datasets (see Table 1) from TCGA for community detection. The community was obtained based on the threshold of overlap score (ω) with 0.8 and the change of belonging factor of each node a iS in each model. The number of Independent-Community and Dependent-Community is shown in To clarify the results of the three models, we only analyzed breast cancer. Firstly, we validated the significance of functional interactions within IC communities with the Gene Ontology (GO) biology process (BP), cellular component (CC)and molecular functions (MF) with R package GOplot [24]. We captured 11 Independent-Communities for breast cancer. Figure 2a shows one Independent-Community network. The rest of Independent-Communities networks are shown in Additional file 1: Figure S1. The Independent-Community (Community NO.141) GO enrichment analysis is presented in the Fig. 2b. In terms of cellular component, the majority of genes are mainly enriched in plasma membrane process (GO:0005886, p-value = 1.536E-06), desmosome process (GO:0030057, p-value = 2.195E-23), extracellular exosome process (GO:0030057, p-value = 5.567E-5). In terms of biology process, bundle of His cell-Purkinje myocyte adhesion involved in cell communication process (GO:0086073, p-value = 2.128E-09),   [26]. The Reactome pathway analysis was performed to detect the specific function for these three communities respectively and the results are shown in Fig. 3b. Genes RAB1A, RAB34, RAB6B, RAB9B in Community NO.126 express in RAB geranylgeranylation pathway. CNIH3, F5, F8 etc genes in Community NO.172 locate in Cargo concentration in the ER pathway and so on. Since the overlapping genes are located in different communities with specific function, it indicates overlapping genes possess diverse functions. A gene can be relevant with more than one functional community (or cluster) [27,28] in cancer patients and bridge dependency between communities. In Fig. 3b, the overlapping gene NAPA is in the Asparagine N-linked glycosylation pathway which corresponds to the Community 172 function. Meanwhile, NAPA is a link in Intra-Golgi traffic pathway which associates with the Community 121 function. In addition, NAPA plays a function in COPI-dependent Golgi-to-ER retrograde traffic pathway which binds with the community 126. And the overlapping genes COG7 and NAPG enriched in the pathway associated with these three DCs too ( Fig. 3b and Additional file 1: Figure S2). The ratio of genes in each pathway and the genes contained in each pathway are shown in Additional file 1: Figure S2 and Additional file 3.

MCs worked as a functional unit for communication
In order to further understand the integration mechanism of community, overlap score (ω) was set as 0.8 and 0.9 respectively. We captured 219 gene communities at the threshold of 0.8. While the threshold of 0.9, we obtained 237 gene communities. The MC model will merge the community with overlap score over 0.8, resulting in a reduction of 18 communities (Table 3). Subsequently, we calculated the overlap score for each pair of communities in 237 communities. We found that 33 communities out of these 237 communities were merged into 15 communities. The result of community merging is shown in  . 4a), respectively, which are all higher than threshold (0.8). We merged them into one community which can perform as a common functional unit to action together (Fig. 4b). Jointly, the effect of merging can be seen not only in the network structure, but also in the corresponding known pathway function.

Prognostic predictor for cancer patients with survival analysis
Gene and modules are associated with patient survival time in cancer [29]. In order to assess the biological relevance of community network, we analyzed the prognosis value of total communities by Kaplan-Meier survival analysis (see Methods). For each community, we predicted the patient survival time, using the clinical data and gene expression data. We found that both the Independent and Dependent Community divide the patients into two groups whose survival time is significantly different (p-value <= 0.01). Figure 5 shows the most significant survival associated community in six cancers. In BLCA, patients were significantly divided into two classes by the identified functional community (p = 0.0001). The community obtained from BRCA also reveals a significantly different survival time (p-value = 0.01). In COAD, we found the patients in the given community were divided into two groups and the survival time is significantly different (p-value = 0.005). In ESCA, the detected community is significantly associated with the survival time of patients (p-value = 0.003). In HNSC, the community can significantly classify patients according to the average expression values (p-value = 0.002). The most significant survival-related community was obtained from KICH (pvalue = 1.825E-08). The analysis demonstrates that the six cancer communities could distinguish likely patient prognostic survival time according to their mean expression values of genes, with the statistical significance (p <= 0.01).

Comparison for Lemon-tree,K-Means and our method
We compared our method with Lemon-Tree and K-Means over the same large-scale real cancer data set to evaluate the performance based on the Gene Ontology (GO) enrichment. To compare the Gene Ontology (GO) categories among the three approaches, we first obtained all common categories for each cancer by a given pvalue threshold. Then, the highest score for each GO category was selected, and the number of GO categories with higher scores for K-Means, Lemon-Tree and our method was calculated. Finally, the sum of the scores for each GO category in each method is calculated (Fig. 6). This result indicates that the gene communities identified by our approach were enriched in more known GO categories with lower p-values than other methods, and our approach has a lower p-value in the global scope. Overall, our approach performs better than Lemon-tree and K-Means in terms of analyzing the GO enrichment.

In silico validation
To verify the performance of our method in the classification of cancer versus normal samples for each cancer type, we captured the candidate driver genes by mutation frequency from TCGA data. Then the cancer driver genes were identified from the candidate driver genes. In our method, we selected the genes as candidate driver genes with mutation frequency fre >0. 6. In the six cancers of BLCA, BRCA, COAD, ESCA, HNSC and KICH, the candidate regulator lists were comprised of 2077, 1100, 1910, 7312, 2230 and 9369 genes respectively. As the regulatory mechanism of communities or modules, the top 1% highest scoring genes of each community were selected as the final driver genes list. Finally, we obtained 745 cancer driver genes. For each cancer dataset, 80% of cancer versus normal samples were selested for training classifier, and the remaining datasets were used to test the classification performance. The results in The Cancer Genome Atlas data using 10-fold cross-validation are shown in Fig. 7. ROC curves and AUC values are shown for the top five cancer driver genes with the highest AUC performance for each cancer type. The genes PEX19 and SCAMP3 achieve the best AUC performance in BRCA (AUC = 0.969 and AUC = 0.954), ASXL1 and PLCG1 are the best predictors in COAD (AUC = 0.929 and AUC = 0.903), SLC4A4 and GPR87 in ESCA (AUC = 0.932 and AUC = 0.919), ATP6V0D2 and MED30 in HNSC (AUC = 0.958 and AUC = 0.957), MRAP2 and EDARADD in KICH (AUC = 0.984 and AUC = 0.954). The best performance was obtained in BLCA dataset. The top five drivers of cancer always obtained AUC >0.9, and the AUC for gene FGFR1 reaches 1 (Fig. 7).
The top-10 driver genes' regulation scores in each cancer are shown is demonstrated in Additional file 1: Table  S2. It proves that the low regulatory score of the driver genes can also accurately distinct tumor and normal samples.
MRAP2 is known to play a role in cancer [30]. the expression of MRAP2 is restricted to the adrenal gland and brain tissue, MRAP2 overexpression caused suppression of MC2R activation, and positive effects on signaling have been detected only at supraphysiological levels of ACTH. RBPMS has been reported as a part of the metastatic 79 gene characteristics observed in solid tumors [31]. Many studies implicated the RBPMS family of proteins in oocyte, retinal ganglion cell, heart, and gastrointestinal smooth muscle development. While SCAMP3, which is overexpression in the hepatocellular carcinoma [32], has been found that its modification with ubiquitin and its interaction with ESCRT coordinate the regulation of endosome pathway and affect the efficiency of receptor down-regulation [33]. SCAMP3 is a component of post-Golgi membranes, functions as a protein carrier and is critical for subcellular protein transportation. ASXL1 is widely expressed at low level in heart, brain, skeletal muscle, placenta, pancreas, spleen, prostate, small intestine, colon, peripheral blood, leukocytes, bone marrow and fetal liver. ASXL1 mutation is associated with some human cancer types such as leukemias and Bohring-Opitz syndrome [34]. GPR87 was suggested to contribute to the viability of human tumor cells and overexpression of GPR87 mRNA was detected in a number of malignant tumors, including lung cancer [35]. In addition, the high expression of TMEM40 is related to malignant behavior and tumorigenesis, which plays a key role in cell proliferation and apoptosis [36]. TMEM40 gene encodes a protein of 233 amino acids and is located on chromosome 3p25.2. TMEM40 silencing could dramatically suppressed cell proliferation, inhibited cell migration and decreased tumor growth. In invasive tumors, the overexpressed and amplified of ATAD2 has been reported [37], ATAD2 is mapped to chromosome 8q24.13, a genomic region frequently amplified in multiple cancer types,which is highly expressed in several different types of human tumors. while FSCN1, which is up-regulated by SNAI2 and promotes epithelial to mesenchymal transition in HNSC SNAI2 is key components for protein synthesis, which promotes epithelial to mesenchymal transition in HNSC [38], SNAI2 is usually upregulated and its high expression is associated with decreased cell-cell adhesion, increased motility and aggressive phenotype.

Discussion
A large number of applications on high throughput data are applied for cancer diagnosis, clinical treatment and prognosis prediction [39]. In many cases, the biological networks are applied to study intertwined signaling cascades, such as gene co-expression network, metabolic networks and protein-protein interaction networks, which are helpful to explain the tumorigenesis [40]. Particularly, it is a matter of common experience that such networks seem to have community and motifs in them: subsets of vertices within which vertex-vertex connections are dense, but between which connections are less dense. Recent insights emerged from the clustering modules, a comprehensive understanding in many biological mechanisms has been revealed through probabilistic graphical models which considered regulated genes and their regulatory processes [14]. Although many functional independent modules are identified and the module inference is effective in some methods, they are unable to exploit the dependent relation between modules. However, in the real biological network, gene modules may pose the characteristic of overlap, i.e. some genes may belong to multiple groups, where a node might have more than one function.
Hence, we introduce the concept of overlapping community detection into modular network reasoning. Based on the concept of overlapping community detection in social networks, we develop a novel co-expressed network analysis framework that can efficiently construct and analyze the large biology networks. Three network inference models were constructed, including Independent-Community, Dependent-Community and Merged-Community, for representing the relationship between diverse gene communities. By utilizing these three models, our approach is proposed to obtain the gene community with functional overlapping characteristics through integrating gene expression data and PPI data. We applied the method in six cancer data to obtain the diverse composition of gene communities, the results of identified communities are shown in Table 2. According to Table 2 we can found that DCs are the majority of all communities, this illustrating the strong communication between communities. In order to clarify the results of the three models, breast cancer was analyzed in detail. First, for the Independent-Community, it can be seen from the results (Fig. 2b) that the identified integrated circuits have independent functions. Moreover, we also validated the association of the communities with breast cancer subtypes (Fig. 2c) as well as their association with mutagenic processes independent of the subtyping through the mutation behavior of the core genes with top degrees. we can find that for a given community, each mutation event across all the samples would markedly enriched in specific subtypes to effectually categorize patients. In addition, mutations in core genes do correlate with the progression of cancer subtypes. Remarkable relation among EYA1, EYA2 and EYA3 genes with the LumB subtype are in accordance with the previous finding that EYA genes play an important role in breast cancer growth and metastasis as well as directing cells to the repair pathway upon DNA damage [41,42]. The community containing TP53INP1 is strongly mutated in the Her2 and LumB subtypes (p <0.01) mainly because of mutations in TP53INP1. Next, for the Dependent-Community, we find Biologically, genes in the same community often coincide with known functional modules and/or protein complexes [43]. And our result (Fig. 3b) shows that overlapping genes possess diverse functions and bridge dependency among communities. Moreover, overlapping characteristics for gene TP53 and GATA3 are illustrated in [23] in which TP53, BRCA2 and GATA3 can response to gamma radiation while gene TP53, PIK3CA and GATA3 can regulate the process of neuron apoptotic. Finally, for the Merged-Community, the effect of merge is found not only in the network structure but also in the corresponding known pathway function. Several methods have been described, but few approaches have considered the overlapping characteristic in modules. We compared our method with a classical clustering method K-means and an advanced method Lemon-tree (Fig. 6). Through GO enrichment analysis, the identified communities perform better than other methods not only the number of GO categories with lower p-values but also globally the p-values are lower. Two possible reasons are presented to explain why our approach has better performance. The first reason is that our method adds the functional interaction to the relationship between genes which can enhance the interaction between the low-linked and highly functional gene pairs. The second reason is that we construct our framework by considering the overlapping genes among communities. These overlapping genes can play different functions from the combinations of different communities. In addition, the driver genes generated by our approach can accurately distinguish the tumor and normal samples. Furthermore, the prognostic prediction of patients with Kaplan-Meier survival analysis was performed. We found that the significant different survival time is acquired on the identified distinguishing communities.
As a next step, we plan to extend the approach using complementary algorithms developed by other groups, including algorithms that enhance the performance of the module network approach, combining proteomics and metabolomics information as a complement to community construction.

Conclusion
In summary, it is critical to consider the overlapping characteristic in modules and their communication when analyzing cancer mechanism and process. In order to understand this module features more clearly, we develop a novel coexpressed network analysis framework, which employing the concept of overlapping community to construct and analyze the large biology networks effectively and efficiently. The result shows that our method bridges the gap between communities to find the overlapping genes that provide common functions in different communities. The validity of this method provides more useful information for tumorigenesis mechanism and complex biological network inference, and complements the existing approaches of detecting gene communities.

Datasets and pre-processing
The gene expression and copy number data for six cancer datasets were obtained from The Cancer Genome Atlas (TCGA) [4]. The gene expression values were measured with Illumina HiSeqv2 platform and 2101 tumour vs normal samples were obtained (see Table 1). The copy number data were derived from Affmetrix SNP 6.0 arrays and processed with GISTIC. The clinical data were also obtained from TCGA. Protein-protein interaction (PPI) information from the online database Search Tool for the Retrieval of Interacting Genes (STRING) [44]. Interactions in STRING are uniquely comprehensive coverage and ease of access to both experimental as well as predicted interaction information. From STRING, the known interactions, proved by biological experiments, with a combined score > 0.4 were retrieved as significant pairs for further analysis. Due to the high heterogeneous data for the six cancers, the Earth mover's distance [45] was applied to measure the overall diversity between the distributions of a gene's expression in two classes of samples, normal vs tumor. In our approach, for gene expression and copy number data, the difference between two classes (tumour/normal) in six cancers was analyzed. In gene expression data, we chose the genes with q − values < 0.1 which represents the gene is differentially expressed between class. And for the copy number data, the genes with their q − values < 0.1 were selected. Finally, we combined the filtered gene subsets.

The initial community clusters
Gibbs sampling can obtain samples from the probability distribution without having to explicitly calculate the value for their narginalizing integeals. It has a good performance to eliminate the non transmission relationship between genes, thus, we applied the Gibbs sampling [46] to infer initial community clusters from a gene expression data matrix. Therefore, due to the approximate distribution of large-scale data sets is multimodal, the mixture of traditional Gibbs sampler may be very poor, the convergence rate will be very slow, and it is difficult to consider the full posterior distribution. the Ganesh can converge the Gibbs sampler on large data sets. It performs nonheuristic reconstruction of gene clusters based on the posterior distribution of the statistical model. we used the "ganesh" software to perform Gibbs sampling [21]. For each cancer, in an expression matrix with N genes and M samples, iteratively updating the cluster assignment of each gene and samples was performed in 5 times with "ganesh" software (default parameters), which yields 5 partitions. For the k th Gibbs sampling, we can get a cluster assignment matrix C (k) , i.e. C (k) is an N × S k matrix where N is the number of genes and S k is the number of clusters from k th Gibbs sampling. In cluster assignment matrix C (k) , if gene i in cluster S, C k iS =1, otherwise, C k iS =0.

Interaction network construction Edge-weighted bipartite graph network
In this study, the gene-gene interaction network can be regarded as an edge-weighted bipartite graph G = (N, E) for each cluster, where N consists of vertices of genes (V L ) and genes (V R ), E denotes the weighted edges between gene. If the constraints between the two genes are met, there is an edge between the two genes. Let i be the vertex subscript in V L and j be the vertex subscript in V R , E ij is an edge connecting between N i and N j , An example of a gene-gene bipartite graph is shown in Fig. 8.

Edge weight calculation
We quantified the connections or weighted edges between gene and gene vertices based on the initial clustering results and STRING database, the algorithm of edge weight calculation is described in Algorithm 1. First, we defined an co-clustering matrix M (k) for the kth Gibbs sampler run. In matrix M (k) , M k ij = 1, if gene i and gene j in a same cluster, and M k ij = 0, otherwise. Then, the cooccurrence frequency matrix M was obtained from the mean of M (k) over all K runs. In matrix M, if M ij > 0.4, which means that there is an edge between gene i and gene j, and the edge weight E ij was equal to M ij . Finally, the protein-protein interaction information from the STRING database [44] was used to strengthen the correlation between vertices of genes (V L ) and genes (V R ). Let P ij be the weight between gene i and gene j from the STRING database. If the weight between gene i(i ∈ V L ) and genes j(j ∈ V R ) was less than P ij , the weight E ij between gene i and gene j was updated to P ij .

Identifying high cohesive community
Cohesiveness score (CS) is defined to measure the likelihood of the genes to form a gene community. A greedy growth process were used to construct overlapping gene communities in the gene-gene bipartite graph network G (N, E). The algorithm consists of five steps. First of all, the gene with the maximal degree (largest amount of connections) was chosen as the initial node, and a cohesive community was constructed by the greedy procedure. At the end of each growth process, the algorithm considered all genes that have not been included in any gene community to date, and select the genes with the largest number of connections as the next seed again. When there are no genes remaining to be considered, the whole process ends. The step-by-step description of the greedy growth procedure starting from the seed node s 0 is as follows: Step 1: let S 0 = {s 0 } and set the greedy step t = 0.
Step 2: calculate the cohesiveness value CS(S t ) of S t and let S t+1 = S t . Step 3: for every external gene s incident on at least one boundary edge with S t , calculate the cohe- Step 4: for every internal gene s incident on at least one boundary edge, calculate the cohesiveness value CS(S t \ {s}). If CS(S t \ {s}) > CS(S t+1 ), let S t+1 = S t \ {s}.
Step 5: if S t = S t+1 , increase t and return to step2. Otherwise, S t is regarded as a locally optimal cohesive community. The algorithm for identifying high cohesive community is described in Additional file 1: Algorithm S1. The cohesiveness score is defined as follows, Where i is the gene in cluster S, E ij represents the weight between the gene i and gene j, p|S| is a penalty term to take into account undetected interactions in genetic networks. The default setting of p|S| is 0.2.

Identification of community inference models
As we consider the value of overlapping score among the gene communities, we propose three models in our framework. Independent-Community (IC) model detects communities that act independently on cancer. Dependent community (DC) model can communicate with other communities by searching for such gene community possessing certain gene with multiple biological functions. Merged-Community (MC) model contains most common genes in two modules to play co-function in one module, which is necessary to community detection for community structure stabilization and community diversity elimination. The overview of three models from our approach is shown in Fig. 9. The overlap score is defined to measure the proportion of common genes in two communities (S1 and S2). The overlap score ω between two communities is calculated as follows, Given a threshold ω, community S1 and community S2 with high cohesiveness score (CS). If the overlap score is 0, we call it an independent community model. The community is called a dependent community when ω(S1, S2) < ω, and community S1 and community S2 are reserved as two dependent communities. Otherwise, it is called a Merged-Community module, and community S1 and Fig. 9 Overview of three overlap situations for which our method was processed. In this network, the same letter represents the same node, and different letters represent different nodes. a Independent-Community model: the overlap score of two gene communities S1 and S2, ω(S1, S2) = 0.

Survival analysis
The association of gene expression level with the patient survival in each community was analyzed by Cox proportional hazard model [47], and it was performed using the function coxph (R package survival). Only patients with fully characterized tumors and with at least 30 days of overall survival (OS) were included in this study. Cancer samples were divided into high-expression group and lowexpression group based on the average expression value of each community. The difference between these two classes of patients was acquired by using the Kaplan-Meier estimator and log-rank method.

Comparison benchmark for Lemon-Tree, K-Means and our method
Gene Ontology (GO) enrichment was performed using the BiNGO Java library [48]. The comparison of the K-Means [49], Lemon-Tree [21] and our approach was conducted by computing the p-value and corrected p-value. Enriched ontological terms with the corrected p-value <0.05 were selected. For these three methods, the K-Means method ignores that there is often a lot of noise in biological data and cannot automatically determine clustering parameters. Lemon-Tree combines multiple omics data to use the idea of gene network reconstruction to infer the initial modules, but only considers independent modules. However, in real biological networks, gene modules may have overlapping characteristics. Our method takes this characteristics into account.

In silico validation
For each cancer, we selected the regulating genes. Classification of normal tumor samples using the SVM model (https://cran.r-project.org/web/packages/kernlab/ index.html) [50][51][52]. The area under curve (AUC) and Receiver Operating Characteristic (ROC) curves were employed to measure the performance by a crossvalidation method (k-fold cross-validation, k = 10). We adopted the following parameters: type="C-bsvc", C (cost of constraints violation)=10.
Additional file 1: Figure S1. The rest of Independent-Communities networks in community network, enriched with the known Gene Ontology (GO) terms. The node and edge represent genes and interaction in community. Figure S2. Functional pathway in ClueGO. It presents the genes and information related to their associated genes. The bars represent the number of the genes from Community NO.121, 126, 172 found associated with the pathway. The different color of the bars denotes diverse pathway types. Table S1. Number of mutated samples for each subtype for each community. This table supplements the discussion of community identified in Independent-Community model. In Table 1 in S1 Text we show the number of mutated samples for each community for each subtype. Figure 3C in S1 Text shows the distribution for each gene. The total number of mutated samples for 4 subtypes are shown in the first row. For each community, we computed the number of samples with mutations in at least one gene in the community (followed by the number of mutated samples for each gene in the community). * indicates the significance of subtype enrichment relative to the overall mutations of a given community (or a gene) across all the subtypes. (*** for p < 0.01, ** for p < 0.05, and * for p < 0.1). Table S2. The list of AUC, regulation score and module about top-10 driver genes in each cancer.
Additional file 2: Table S3. The model types of identified communities in our method.
Additional file 3: Table S4. Detailed information about the enriched pathways and their associated genes.
Additional file 4: Table S5. The detailed description for the communities merging.