In this paper, we obtained whole genome exome-seq data (WES) from TCGA for the patients with breast cancers and derived the SMG list for each patient. The list of SMGs from each patient were used as features for this patient. We then computed distance correlation of every pair of SMG lists to obtain the functional relationships between the affected genes in different patients based on the gene expression profiles. The process yielded the distance correlation matrix across the patients. Then we visualized the patients by multi-dimensional scaling, and further clustered the patients into different groups. Our workflow is summarized in Fig. 1.

The key component in this workflow is to compute the distance correlation between a pair of gene lists (in this case, expression profiles of two SMG lists from two patients). The intuition behind distance correlation can be considered as following: A gene list can be used to cluster the patient cohort of a heterogeneous disease, generating a clustering result. Two different gene lists will generate two results, and the results may be similar if the two gene lists play similar functional roles in the disease phenotype. The distance correlation measures the similarity of the two results.

In our case, we used the gene expression data (RNA-seq) of the entire cohort to compute the distance correlation, although theoretically, any gene expression dataset of a cohort with similar disease diversity can be used, and from a more general point of view, any type of data which present deep enough functional relationship among genes, even on normal people, can be used.

After we obtained the distance correlation matrix of any two SMG lists in the context of gene expression, which represents the functional relationship of any two sets of SMGs in the breast cancer disease gene expression, we use this matrix to cluster the entire breast cancer cohort, and the results should show a group of patients grouped by their common underlying perturbation resulted from seemingly different SMG lists.

### Datasets

The Cancer Genome Atlas (TCGA http://www.cancergenome.nih.gov) level-3 breast cancer patients’ somatic mutation derived from WES and RNA-seq data were downloaded from TCGA data portal in July, 2013. Among all 876 available patients at the time of download, 445 have matching SMG and RNA-seq data. The data from these patients were chosen for further analysis. 83 normal breast sample RNA-seq level 3 data were also obtained from TCGA.

### SMG selection

Somatic mutations derived from WES of the TCGA breast cancer patients were screened for significant mutation genes (SMG). SMG was defined as genes with frame-shift Indels, splice site change, non-stop mutation, or nonsense mutation. The mutation of mismatch, silent, RNA and in-frame indel were not included in SMG. For a specific group of patients, the number of SMG refers to the union of SMGs in that group of patients. For all the patients we analyzed in this study, their corresponding SMGs were listed in Additional file 1: Table S2.

### Computing distance correlation

Distance correlation is a recently developed metric with two advantages [6]. First, it can be used to calculate the “correlation” between two matrices instead of just two vectors. Essentially it calculates the similarity of effects of two “feature sets” on separating the same set of samples. Secondly, unlike Pearson correlation that is based on a linear model, it can respond to nonlinear relationships. These properties make it a good candidate for our purpose when comparing relationships between two gene lists.

In this project, the distance correlation was computed using Matlab as described in [6]. Given two lists of SMGs \( {\mathit{\mathsf{g}}}_a \) and \( {\mathit{\mathsf{g}}}_b \) with *n*
_{
a
} and *n*
_{
b
} genes respectively, we first extract their gene expression matrices across *N* patients as

$$ {E}^a=\left[\begin{array}{ccc}\hfill {e}_1^a\hfill & \hfill \cdots \hfill & \hfill {e}_N^a\hfill \end{array}\right]\in {\mathrm{\Re}}^{n_a\times N}\mathrm{and}\kern0.5em {E}^b=\left[{e}_1^b\kern1em \cdots \kern1em {e}_N^b\right]\in {\mathrm{\Re}}^{n_b\times N}, $$

where *e*
^{i}_{
j
}
(*i* ∈ {*a*, *b*}, *j* ∈ {1, 2, …, *N*}) are *n*
_{
i
}- dimensional column vectors representing the expression profiles for the j-th patient over the i-th SMG list. The distance matrices among the patients for the two sets of SMGs can be calculated as

$$ {D}^a=\left[{d}_{jk}^a\right]\in {\mathrm{\Re}}^{N\times N}\mathrm{and}\kern0.5em {D}^b=\left[{d}_{jk}^b\right]\in {\mathrm{\Re}}^{N\times N} $$

with *d*
^{i}_{
jk
}
= ‖*e*
^{i}_{
j
}
− *e*
^{i}_{
k
}
‖, *i* ∈ {*a*, *b*}, *j*, *k* = 1, 2, …, *N*. Let \( \overline{d_{j,\cdot}^i} \) and \( \overline{d_{\cdot, k}^i} \) be the average of the j-th row and k-th column for the matrix *D*
^{i} (*i* ∈ {*a*, *b*}) respectively. Also set \( \overline{d_{\cdot, \cdot}^i} \) be the grand average of all entries of *D*
^{i} (*i* ∈ {*a*, *b*}). Then set the centralized distance matrices to be

$$ \overline{D^i}=\left[\overline{d_{jk}^i}\right]=\left[{d}_{jk}^i-\overline{d_{j,\cdot}^i}-\overline{d_{\cdot, k}^i}+\overline{d_{j\cdot k}^i}\right]\in {\mathrm{\Re}}^{N\times N}\mathrm{with}\ i\ \in \left\{a,b\right\}. $$

Then the *distance covariance* between two distance matrices can be computed as

$$ dCov\left({E}^a,\ {E}^b\right)=\frac{1}{N^2}{\displaystyle {\sum}_{j,k=1}^N}\overline{d_{jk}^a}\cdot \overline{d_{jk}^b,} $$

and the *distance correlation* is defined as

$$ dCor\left({E}^a,\ {E}^b\right)=\frac{dCov\left({E}^a,{E}^b\right)}{\sqrt{dCov\left({E}^a,{E}^a\right)\cdot dCov\left({E}^b,\ {E}^b\right)}}. $$

For the 445 SMG lists obtained from the 445 patients, we compute the *distance correlation matrix*

$$ {D}_{dCor}=\left[ dCor\left({E}^i,{E}^j\right)\right]\in {\mathrm{\Re}}^{455\times 455},i,j=1,2,\dots, 445. $$

### Multidimensional scaling and clustering

In order to visualize the distribution of the patients with the proximity measurements defined by the distance correlation matrix, we applied multidimensional scaling (MDS) to embed the data points (each point represents a patient) in 3D space. Specifically we used Matlab function *cmdscale()* with its default settings. The distance correlation matrix was first transformed to a dissimilarity matrix (using 1 − *D*
_{
dCor
}) before MDS. K-means clustering was performed upon the patients using data using the same dissimilarity matrix. It was carried out using Matlab k-means function with default square-Euclidean distance and replicates of 50, *K* = 3 or 5.

### Jaccard index computing

SMGs for every pair of patients in TCGA BRCA cohort were used to calculate the similarity between the two SMG lists using Jaccard index (*J*), which is defined as:

$$ \boldsymbol{J}=\frac{\left|\boldsymbol{A}\ {\displaystyle \cap}\boldsymbol{B}\right|}{\left|\boldsymbol{A}\ {\displaystyle \cup}\boldsymbol{B}\right|}, $$

where A and B are the two groups of SMGs from any pair of patients in the TCGA BRCA cohort. A∩B is the set of overlapping genes within the two SMG groups A and B, and A∪B is the union of these two groups.

### Survival analysis

For validation, NCBI GEO breast cancer dataset GSE1456 (containing 318 patients of mixed types) [7] as well as Netherlands Kanker Instituut (NKI) NKI-295 dataset (containing 295 patients of mixed types) were used [8]. These microarray datasets (and their specific subtypes) contain gene expression data and matching survival time (years) that are needed for survival analysis. Log-rank test was performed to determine the significance of difference in survival time between two patient groups and Kaplan-Meier curves were plotted.

### Pathway analysis and gene query in TCGA database

Ingenuity Pathway Analysis (IPA) was used to analyze enriched biological functions and pathways in the identified SMGs. The prevalence of SMGs on other cancer types in TCGA database was generated using the cBioPortal online tools (http://www.cbioportal.org) [9].