Computational methods and resources for the interpretation of genomic variants in cancer

The recent improvement of high-throughput sequencing technologies is having a strong impact on the detection of genetic variations associated with cancer. Several institutions worldwide have been sequencing the whole exomes and/or genomes of thousands of cancer patients, thereby providing an invaluable collection of new somatic mutations in different cancer types. These initiatives promoted the development of methods and tools for the analysis of cancer genomes, aimed at studying the relationship between genotype and phenotype in cancer. In this article we review the online resources and computational tools for the analysis of cancer genomes. First, we describe the available repositories of cancer genome data. Next, we provide an overview of the methods for the detection of genetic variation and of the computational tools for the prioritization of cancer-related genes and causative somatic variations. Finally, we discuss future perspectives in cancer genomics, focusing on the impact of computational methods and quantitative approaches for defining personalized strategies to improve the diagnosis and treatment of cancer.


Dataset
To provide an overview of the somatic mutation data available, we analyzed the variants from the International Cancer Genome Consortium (ICGC) data portal (https://dcc.icgc.org/), release 17 (September 2014). We collected the simple somatic variants corresponding to 43 different cancer projects and, after manual inspection, removed the data from Acute Lymphoblastic Leukemia (ALL-US), which consists of only 2 samples. The final list of 42 cancer projects is reported in Supplementary Table 1. In our analysis, we merged data from the same cancer type coming from different projects. Thus, our dataset consists of 33 unique cancer types. Finally, we built a dataset, referred to as PanCancer, by pooling all the data from the 42 cancer projects.
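The dataset construction described above can be sketched in a few lines. This is only an illustrative sketch, not the pipeline used for the paper: the record layout `(project_code, donor_id, mutation_id)`, the function name `build_datasets`, and the rule that the cancer type is the part of the project code before the hyphen are all assumptions introduced here. The exclusion of ALL-US is taken from the text; the other project codes and identifiers are toy values.

```python
from collections import defaultdict

def build_datasets(records, excluded_projects=("ALL-US",)):
    """Group (project_code, donor_id, mutation_id) records by cancer type.

    Assumption: the cancer type is the project code without its country
    suffix, e.g. "BRCA-US" and "BRCA-UK" both map to "BRCA".
    """
    by_type = defaultdict(list)
    for project, donor, mutation in records:
        if project in excluded_projects:
            continue  # project dropped after manual inspection
        cancer_type = project.split("-")[0]
        # Prefix the donor with the project so IDs stay unique after merging.
        by_type[cancer_type].append((f"{project}:{donor}", mutation))
    # The pooled dataset is the concatenation of all cancer types.
    pancancer = [rec for recs in by_type.values() for rec in recs]
    return dict(by_type), pancancer

# Toy records (hypothetical donors and mutations).
records = [
    ("BRCA-US", "D1", "m1"),
    ("BRCA-UK", "D1", "m1"),
    ("CLLE-ES", "D2", "m2"),
    ("ALL-US", "D3", "m3"),
]
by_type, pancancer = build_datasets(records)
```

With the toy input, the two BRCA projects are merged into one cancer type, the ALL-US record is discarded, and the pooled list holds the three remaining observations.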

Somatic mutation recurrence analysis
A large fraction of the somatic mutations observed in cancer are passengers and do not have a significant impact on the progression of the disease. In contrast, a small percentage of mutations, defined as drivers, increase the fitness of tumor cells. On average, driver mutations are expected to be more recurrent than passengers across different cancer samples. For this reason, we analyzed all 33 cancer types to study the recurrence of mutation events. In this paper, we use the following definitions:
i. Recurrent Somatic Mutation: a variation that is observed in at least two donors of our dataset.
ii. Mutation Recurrence: the number of samples in the dataset in which a specific somatic mutation is observed. A Recurrent Somatic Mutation has Mutation Recurrence equal to or higher than 2.
iii. Fraction of Somatic Mutations: the portion of somatic mutations with Mutation Recurrence equal to or higher than a given threshold. A particular case, defined as the Fraction of Recurrent Mutations, is the number of somatic mutations observed in at least two donors divided by the total number of unique mutations.
iv. Fraction of Donors: the fraction of donors in which at least one somatic mutation with Mutation Recurrence equal to or higher than a given threshold is observed.
These values can be calculated for each cancer type separately or for the PanCancer dataset.
In the latter case, the number of Recurrent Somatic Mutations increases because a mutation can also occur in two donors affected by different cancer types. To show this difference, in Fig. 3 of the manuscript we present the complementary cumulative distributions (CCDs) observed when cancer types are considered separately or together.
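The recurrence definitions above can be made concrete with a small sketch. It assumes that observations are given as `(donor_id, mutation_id)` pairs and that recurrence is counted over distinct donors; the function names and the toy data are hypothetical and are not taken from the ICGC release.

```python
from collections import defaultdict

def mutation_recurrence(observations):
    """Map each mutation to the number of distinct donors carrying it."""
    donors_by_mutation = defaultdict(set)
    for donor, mutation in observations:
        donors_by_mutation[mutation].add(donor)
    return {m: len(d) for m, d in donors_by_mutation.items()}

def fraction_of_mutations(observations, threshold=2):
    """Fraction of unique mutations with recurrence >= threshold."""
    rec = mutation_recurrence(observations)
    return sum(1 for r in rec.values() if r >= threshold) / len(rec)

def fraction_of_donors(observations, threshold=2):
    """Fraction of donors carrying at least one mutation with recurrence >= threshold."""
    rec = mutation_recurrence(observations)
    all_donors = {d for d, _ in observations}
    covered = {d for d, m in observations if rec[m] >= threshold}
    return len(covered) / len(all_donors)

# Toy data: mutation m3 is seen once in each cancer type, so it becomes
# recurrent only when the two types are pooled.
type_a = [("a1", "m1"), ("a2", "m1"), ("a1", "m2"), ("a3", "m3")]
type_b = [("b1", "m3"), ("b2", "m4")]
pooled = type_a + type_b

frac_a = fraction_of_mutations(type_a)       # 1 of 3 unique mutations
frac_pooled = fraction_of_mutations(pooled)  # 2 of 4 unique mutations
donors_pooled = fraction_of_donors(pooled)   # 4 of 5 donors
```

In this toy case the pooled dataset contains more Recurrent Somatic Mutations than either cancer type taken alone, which is the effect described for the PanCancer dataset.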
Although the values defined above are affected by the consistency of the dataset (i.e., by biases due to batch effects), it is still useful to estimate the expected Fraction of Donors that can be recovered using a subset of the currently identified recurring variants. Thus, the recurrence analysis presented in this paper consists of plotting the Fraction of Somatic Mutations and the Fraction of Donors at different Mutation Recurrence thresholds. In Fig. 4 we report the CCDs obtained for the 27 cancer types with at least 50 donors and for which at least 4 points are available (therefore, NBL, CLLE, LICA, EOPC, LIAD and GACA are excluded). To estimate the trend of the curves, the points were fitted using the following equation, where B = 1 - A. This equation was used to estimate the fraction of somatic variants needed to recover 95% of the donors (see Supplementary Table 2).
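The points of such a curve can be computed as in the sketch below, which stops at the points and does not attempt the fit. Two assumptions are made here that the text does not state: the thresholds are taken to be the distinct recurrence values observed in the data, and a "point" is one such threshold. The function names and the toy data are hypothetical; the default cut-offs of 50 donors and 4 points come from the text.

```python
from collections import defaultdict

def recurrence_curve(observations):
    """Return (threshold, fraction_of_mutations, fraction_of_donors) points.

    observations: list of (donor_id, mutation_id) pairs.
    """
    donors_by_mutation = defaultdict(set)
    for donor, mutation in observations:
        donors_by_mutation[mutation].add(donor)
    recurrence = {m: len(d) for m, d in donors_by_mutation.items()}
    n_mutations = len(recurrence)
    n_donors = len({d for d, _ in observations})

    points = []
    for t in sorted(set(recurrence.values())):
        kept = {m for m, r in recurrence.items() if r >= t}
        # Donors carrying at least one mutation that passes the threshold.
        covered = set().union(*(donors_by_mutation[m] for m in kept))
        points.append((t, len(kept) / n_mutations, len(covered) / n_donors))
    return points

def is_eligible(observations, min_donors=50, min_points=4):
    """Apply the inclusion criteria: enough donors and enough curve points."""
    n_donors = len({d for d, _ in observations})
    return n_donors >= min_donors and len(recurrence_curve(observations)) >= min_points

toy = [
    ("d1", "m1"), ("d2", "m1"), ("d3", "m1"),
    ("d1", "m2"), ("d2", "m2"),
    ("d4", "m3"),
]
curve = recurrence_curve(toy)
```

For the toy data the curve has three points: at threshold 1 all mutations and all donors are retained, while at thresholds 2 and 3 the retained mutations still cover three of the four donors.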

Exonic mutations and gene-based analysis
For the analysis of the ICGC data, we also focused on the subset of variants in exonic regions. To select this subset, we considered only the somatic mutations with an assigned Ensembl gene code and discarded all the upstream, downstream and intronic mutations. Using this subset of somatic mutations in exonic regions, we calculated the number of variants corresponding to each donor and the relative distribution for the 33 cancer types reported in Fig. 5 of the manuscript. In addition, we used a subset of 62,206 exonic Recurrent Somatic Mutations from the PanCancer dataset to calculate a feature vector for the comparison of tumor type similarity. Thus, each cancer type is described by a vector of 17,381 elements, which corresponds to the total number of genes with at least one exonic Recurrent Somatic Mutation. Each element represents the number of donors in which the corresponding gene is affected by a Recurrent Somatic Mutation.
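The construction of the per-cancer-type feature vectors can be illustrated as follows. The sketch assumes that the exonic Recurrent Somatic Mutations have already been selected and are available as `(cancer_type, donor_id, gene_id)` records; this layout, the function name, and the placeholder gene names are assumptions made for illustration only.

```python
from collections import defaultdict

def gene_feature_vectors(records):
    """Build one vector per cancer type over a shared, sorted gene list.

    Each element counts the distinct donors of that cancer type in which
    the gene carries at least one of the selected mutations.
    """
    donors = defaultdict(set)  # (cancer_type, gene) -> set of donors
    genes, cancer_types = set(), set()
    for cancer_type, donor, gene in records:
        donors[(cancer_type, gene)].add(donor)
        genes.add(gene)
        cancer_types.add(cancer_type)
    gene_order = sorted(genes)
    vectors = {
        t: [len(donors.get((t, g), ())) for g in gene_order]
        for t in sorted(cancer_types)
    }
    return gene_order, vectors

# Toy records; a donor with two mutations in the same gene is counted once.
records = [
    ("BRCA", "d1", "GENE_A"),
    ("BRCA", "d2", "GENE_A"),
    ("BRCA", "d1", "GENE_B"),
    ("CLLE", "d3", "GENE_B"),
    ("CLLE", "d3", "GENE_B"),
]
gene_order, vectors = gene_feature_vectors(records)
```

All vectors share the same gene order, so they have the same length and can be compared element by element.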
Two cancer types A and B, described by the vectors V_A and V_B, are compared using the cosine similarity, which is defined as follows:

\[ \cos(V_A, V_B) = \frac{V_A \cdot V_B}{\lVert V_A \rVert \, \lVert V_B \rVert} \]

The values of exonic Recurrent Somatic Mutations, the affected genes and the donors for each cancer type are reported in Supplementary Table 3. The cosine similarity measure is used to build the dendrogram of tumors reported in Fig. 6. The dendrogram is obtained using the hierarchical clustering algorithm implemented in the heatmap.2 function in R.

Supplementary Table 1. Cancer sequencing projects from the ICGC data portal (https://dcc.icgc.org/) analyzed in this paper.

Supplementary Table 2. SM95 is the percentage of somatic mutations needed to recover 95% of the donors. This SM95 value is estimated using the equation fitted in the recurrence analysis.
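The cosine similarity comparison between cancer types described above can be sketched as follows. The sketch only computes the pairwise similarity matrix and does not reproduce the clustering performed with heatmap.2; the function names, the toy vectors, and the convention of returning 0 for an all-zero vector are assumptions introduced here.

```python
import math

def cosine_similarity(v_a, v_b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v_a, v_b))
    norm_a = math.sqrt(sum(a * a for a in v_a))
    norm_b = math.sqrt(sum(b * b for b in v_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise cosine similarities for a {name: vector} mapping."""
    names = sorted(vectors)
    matrix = [
        [cosine_similarity(vectors[a], vectors[b]) for b in names]
        for a in names
    ]
    return names, matrix

# Toy vectors: A and B are proportional, C shares no affected gene with them.
toy_vectors = {"A": [2, 1, 0], "B": [4, 2, 0], "C": [0, 0, 3]}
names, matrix = similarity_matrix(toy_vectors)
```

Because the measure depends only on the direction of the vectors, two cancer types with proportional donor counts are maximally similar even if one has many more donors than the other.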