Data sources
Gene expression data
The RNA-Seq data in the form of Fragments Per Kilobase Million (FPKM) and clinical data of breast cancer, including 1097 breast cancer samples and 113 tumor-adjacent breast tissue samples, were downloaded from The Cancer Genome Atlas (TCGA) database (http://tcga-data.nci.nih.gov/tcga/ accessed on 24 June 2020) using TCGA-Assembler 2 (version 2.0.6, http://www.compgenome.org/TCGA-Assembler/). Additional clinical data of all samples were downloaded using the R package RTCGA (version 1.22.0). The expression data of breast cancer samples and tumor-adjacent breast tissues were assigned to a case group and reference group, respectively. The downloaded expression matrices were directly used as inputs in our method.
Somatic mutation data
Genome-wide nonsynonymous somatic mutations on 18,847 significantly mutated genes of 986 breast cancer tumors in TCGA were downloaded from UCSC Xena (version 07–18-2019, https://tcga-xena-hub.s3.us-east-1.amazonaws.com/download/mc3_gene_level%2FBRCA_mc3_gene_level.txt.gz). Only nonsynonymous mutations were used for the following analysis.
Methylation data
The DNA methylation data (level three) in TCGA generated on the HM450 platform including 485,577 total probes [18], were downloaded from the UCSC Xena (version 07–19-2019, https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-BRCA.methylation450.tsv.gz) for 794 breast cancer samples and 96 tumor-adjacent breast tissue samples. The methylation level of each probe was measured by its beta value, which ranges from 0 (unmethylated) to 1 (completely methylated). The CpG probe annotation file was downloaded from the UCSC Xena database (https://gdc-hub.s3.us-east-1.amazonaws.com/download/illuminaMethyl450_hg38_GDC).
Background network
Synthetic lethality (SL) is a phenomenon that was first observed in drosophila melanogaster experiments more than 100 years ago. Synthetic lethality is a type of genetic interaction between two genes where the simultaneous perturbations of the two genes will result in cell death or a dramatic decrease in cell viability, though a perturbation of either gene alone is not lethal [19].
In the saccharomyces cerevisiae experiment, most of the genes were found to have this SL effect with each other, and the SL relationship between these genes could also be applied to human tumor genes [20]. Passenger mutations that do not directly contribute to tumorigenesis, as well as mutations that disrupt cellular networks and promote a cancerous state of cells, could be candidates for SL cancer therapies. A novel comprehensive map of synthetic-lethal interactions between genes that are mutated in cancer may have significant clinical potential [21]. Multiple studies have indicated that SL is a promising form of gene interaction for cancer therapy, and can identify specific genes to target cancer cells without disrupting normal cells [22, 23]. In addition, an SL will help functionally interpret the vast number of mutations identified in cancer genome sequencing studies [21].
Consequently, the SL interactions built by Jing et al. from the Synthetic Lethality genes interactions Database (SynLethDB) [24], which consists of 6513 genes and 19,955 synthetic lethal gene pairs for human tumors, were downloaded as the background Network (v1.0, the last accessed date, 20–04-2021, http://synlethdb.sist.shanghaitech.edu.cn/#/).
Overview of the sample-specific driver module construction approach
We constructed sample-specific driver modules in three steps (see Fig. 1). First, the sample-specific network (SSN) for each breast cancer sample was constructed based on the gene expression profiles and background network by the SSN method introduced in [28]. Second, the somatic mutation profiles of all breast cancer samples were constructed by the somatic mutation data. In addition, based on the methylation data we obtained the aberrantly methylated genes for each tumor sample using the outlier detection method of the Hampel filter, which make up the methylation aberration profiles for breast cancer samples. Finally, the sample-specific mutation driver module (ssMutat-DM) and the sample-specific methylation aberration driver module (ssMethy-DM) were constructed using the 2-order network theory and hub-gene theory.
Construction of the sample-specific network
In the following, the SSN for each breast cancer sample was constructed based on gene expression profiles by using the SSN method introduced in reference [25] (Fig. 1A). First, using the gene expression profiles of n normal samples, (all tumor-adjacent breast cancer tissues), the reference network was constructed by calculating the correlation coefficient \({\mathrm{pcc}}_{\mathrm{n}}\)(the Pearson correlation coefficient (PCC)) of each gene pair connected in the background network. The weights of the edges in the reference network are the PCC of the corresponding gene pairs. Then, the perturbed network of the single breast cancer sample d was constructed by calculating the new correlation coefficient \({\mathrm{pcc}}_{\mathrm{n}+1}\) using the expression profiles of all the reference samples and sample d. The difference between the reference and perturbed network of d is due to the expression profile of sample d. The differential network for the single breast cancer sample d was constructed by the differential correlation coefficients of the corresponding edge between the reference and perturbed networks in terms of PCC:
$${\Delta \mathrm{pcc}}_{\mathrm{n}}={\mathrm{pcc}}_{\mathrm{n}+1}-{\mathrm{pcc}}_{\mathrm{n}}$$
for each edge. In reference [25], Liu et al. have demonstrated that \({\Delta \mathrm{pcc}}_{\mathrm{n}}\) follows a normal distribution with a mean value of 0 and a variance of \((1-{{\mathrm{pcc}}_{\mathrm{n}}}^{2})/(\mathrm{n}-1)\), when n is large enough. The significance level of each edge (\({\Delta \mathrm{pcc}}_{\mathrm{n}}\)) was calculated by the two-tailed Z-test (or the U-test). The SSN for sample d is constituted by those edges with significant edges (P-value < 0.01). The robustness of the SSNs results against the different choices of the reference samples and different reference sample sizes on breast cancer data from TCGA in [25]. These ensure the stability and reproducibility of the sample-specific driver modules.
Construction of the mutation and methylation matrices
The somatic mutation matrix and the methylation matrix can be constructed by the somatic mutation data and the methylation data (see Fig. 1B).
We first constructed somatic mutation profiles of all cancer samples. Each patient was represented as a profile of binary \((0, 1)\) states on genes, where rows represented genes and columns corresponded to cancer samples. The elements \({\mathrm{M}}_{\mathrm{ij}}\) in the matrix was defined as follows:
$${\mathrm M}_{\mathrm{ij}}=\left\{\begin{array}{c}1,\mathrm{if}\;\mathrm{gene}\;\mathrm i\;\mathrm{mutates}\;\mathrm{in}\;\mathrm{sample}\;\mathrm j,\\\\0,\;\mathrm{if}\;\mathrm{gene}\;\mathrm i\;\mathrm{does}\;\mathrm{not}\;\mathrm{mutate}\;\mathrm{in}\;\mathrm{sample}\;\mathrm j.\end{array}\right.$$
For the methylation matrix, we identified the aberrantly methylated genes for each cancer sample. First, probes with ‘NA’ values among > 10% samples were removed, and we imputed the remaining ‘NA’ values using the 10-nearest neighbors imputation procedure with the ‘impute.knn’ function in the R package ‘impute’ [26]. The information in the CpG probe annotation file includes the corresponding gene and the genomic region, among others. Using the median of the methylation values [27] of the CpG probes mapped at promoter regions (including 1500 bp upstream to the transcription start site TSS1500; 200 bp upstream to the transcription start site TSS200; 5’UTR; and 1stExon), a single methylation value was consolidated for each gene in the HM450K platform. We used the outlier detection method of the Hampel filter to identify the aberrantly methylated genes (i.e., hypermethylation and hypomethylation genes) for each sample. Specifically, each patient was represented as a profile of ternary \((-1, 0, 1)\) states on genes, where rows represented genes and columns corresponded to cancer samples. The elements \({\mathrm{MT}}_{\mathrm{ij}}\) in the methylation matrix were defined as follows:
$${\mathrm{MT}}_{\mathrm{ij}}=\left\{\begin{array}{ll}-1,\;\mathrm{represents}\;\mathrm{hypomethylation},\;&\mathrm{if}\;\mathrm v\;<\;\mathrm{MED}\;-\;3\;\mathrm{MAD},\\\\1,\;\mathrm{represents}\;\mathrm{hypermethylation},\;&\mathrm{if}\;\mathrm v\;>\;\mathrm{MED}\;+\;3\;\mathrm{MAD},\\\\0,\;&\mathrm{otherwise},\end{array}\right.$$
where \(\mathrm{v}\) was the methylation value of gene i in patient j, \(\mathrm{MED}\) and \(\mathrm{MAD}\) were the median and median absolute deviation of methylation value of gene i in all samples, respectively.
Construction of the sample-specific driver module
The ssMutat-DM and ssMethy-DM were constructed according to the following two theories. The first is the 2-order network theory [28], which states that genes with mutation or methylation aberrations could affect both their neighbors and their neighbors’ neighbors. The second is the hub-gene theory. Model biology studies have demonstrated that hub genes (genes with a high degree in a biological network) have extremely important biological functions, and also play important role in the regulation of other genes in related pathways [29, 30]. In the gene regulatory network related to human disease, hub genes tend to be cancer drivers. Based on the above two theories, we used two principles to select the subnetwork in the SSN for each breast cancer sample: i) the 2-order subnetwork centered on mutation genes, ii) the mutation genes’ neighbors and their neighbors’ neighbors should be hub genes, where the hub genes were regarded as the top 20% of genes with higher degrees in the SSN. The subnetwork was then considered to be the ssMutat-DM for the breast cancer patient, and edges in ssMutat-DM were considered to be mutation-driven edges. Similarly, we constructed the ssMethy-DM for each patient by centering on aberrantly methylated genes (see Fig. 1C). The union of the ssMutat-DM and ssMethy-DM constituted the sample-specific co-driver module (ssCo-DM) for each breast cancer sample.
Structural analysis of the driver modules
To measure the importance of different genes in the driver modules, the degree, betweenness and eigenvector centrality of genes in the constructed driver modules were analyzed. Degree centrality was determined by the number of connections. Betweenness centrality is a measure of the importance of a node by assessing the number of shortest paths through it. Eigenvector centrality is a measure of a node by assessing the importance of its adjacent nodes in a network. In addition, the similarity of ssMutat-DM and ssMethy-DM of each breast cancer sample was measured by the Jacobi similarity as follows:
$$\mathrm{Similarity}\;\mathrm{score}\;({\mathrm S}_{\mathrm{Mutat}},{\mathrm S}_{\mathrm{Methy}})=\frac{\left|\cap({\mathrm S}_{\mathrm{Mutat}},{\mathrm S}_{\mathrm{Methy}}\right)\vert}{\left|\cup({\mathrm S}_{\mathrm{Mutat}},{\mathrm S}_{\mathrm{Methy}}\right)\vert},$$
where \({\mathrm{S}}_{\mathrm{Mutat}}\) and \({\mathrm{S}}_{\mathrm{Methy}}\) represent the gene set or edge set of ssMutat-DM and ssMethy-DM, respectively.
Identifying the driver modules for breast cancer and its subtypes
A Monte Carlo simulation test was used to evaluate whether a specific edge was significantly included in the driver modules of more breast cancer samples. This process was repeated 10,000 times and 10,000 edges were selected randomly each time. We derived the statistical significance of the frequency of occurrence in the ssMutat-DM of all breast cancer samples based on empirical NULL distribution generated by 10,000 random shuffling. A cutoff of P-value = 0.05 was obtained. All edges with a frequency greater than the mean of the cutoff values made up the mutation driver module for breast cancer. Similarly, we constructed the methylation aberration driver module for breast cancer.
Furthermore, the driver modules for the PAM50 subtypes of breast cancer were analyzed. We computed the edge number of driver modules for each subtype of breast cancer, and the differential analysis for the edge numbers of different subtypes was performed using the Kruskal–Wallis sum test. In addition, pairwise comparisons between subtypes used the Nemenyi test. Edges that are present in more than 60% ssCo-DMs of breast cancer subtype samples constituted the subtype-specific driver module. The unique genes involved in each subtype-specific driver module were used to perform pathway enrichment analysis by Metascape (http://metascape.org). The pathways with P-values less than 0.01 were retained. Finally, the subtype-specific pathways were identified.
Statistical analysis
Unless stated otherwise, comparisons of two groups on one variable were determined using a Wilcoxon rank sum test with continuity correction, two-sided or one-sided, unpaired or paired, as appropriate, and multiple group comparisons using a Kruskal–Wallis rank sum test. Results with P < 0.05 were considered significant. Statistical analyses were performed with R (version 4.05) The Pearson correlation coefficients used for the construction of SSNs were calculated by parallel operation on the server with 48 cores.