Identification of somatic mutations in cancer through Bayesian-based analysis of sequenced genome pairs

Background The field of cancer genomics has rapidly adopted next-generation sequencing (NGS) in order to study and characterize malignant tumors with unprecedented resolution. In particular for cancer, one is often trying to identify somatic mutations – changes specific to a tumor and not within an individual’s germline. However, false positive and false negative detections often result from lack of sufficient variant evidence, contamination of the biopsy by stromal tissue, sequencing errors, and the erroneous classification of germline variation as tumor-specific. Results We have developed a generalized Bayesian analysis framework for matched tumor/normal samples with the purpose of identifying tumor-specific alterations such as single nucleotide mutations, small insertions/deletions, and structural variation. We describe our methodology, and discuss its application to other types of paired-tissue analysis such as the detection of loss of heterozygosity as well as allelic imbalance. We also demonstrate the high level of sensitivity and specificity in discovering simulated somatic mutations, for various combinations of a) genomic coverage and b) emulated heterogeneity. Conclusion We present a Java-based implementation of our methods named Seurat, which is made available for free academic use. We have demonstrated and reported on the discovery of different types of somatic change by applying Seurat to an experimentally-derived cancer dataset using our methods; and have discussed considerations and practices regarding the accurate detection of somatic events in cancer genomes. Seurat is available at https://sites.google.com/site/seuratsomatic.


Prior Selection for Seurat Genotype Priors
The priors used for the genotype in the normal genome are the SNP frequencies for human diploid chromosomes, as calculated by (Li et al. 2009): π het = 0.001 π var = 0.0005 π ref = 1 -(π het + π var) = 0.9985 π somatic and πLOH are high-end estimates of the frequency of somatic events, given that that the mutation profile of each individual cancer can vary wildly even within subtypes. At 0.0001, they expect 300,000 events through the human genome.

Usage
Seurat is a command-line Java application, packaged as a stand-alone JAR file. It is compatible with any operating system platform which is support by the Sun Java 1.6 runtime (including Linux, Windows, and Mac OS X).
Seurat can be executed using a command prompt (terminal) window of the operating system, by moving to the directory of the JAR file and executing the following command:
-go <gene_out> Tab-delimited text output for non-focal events. Most large event analyses require the 'refseq' argument below.

Optional:
--indels Enable somatic insertion/deletion calling. Default = false -refseq <refseq_file> Name of RefSeq transcript annotation file. If specified, gene-wide events can be detected, and SNVs/LOH events will be annotated with the gene name.
-mbq <integer> Minimum base quality required to consider a base for calling. Default = 10.
-mmq <integer> Minimum mapping quality for reads to be considered in the pileup. Default = 10.
-ref <true/false> If true, only reference-matching homozygous positions are allowed on the normal, for SNV discovery. Reduces false positives due to faulty alignments. Default = true.
-alpha <integer> Alpha parameter of the beta-distribution used for evaluating homozygosity likelihood. Default = 1.
-beta <integer> Beta parameter of the beta-distribution used for evaluating homozygosity likelihood. Default = 701.
--both_strands Whether or not variant evidence needs to appear on both strands on the tumor in order to be considered. Default = false.
-coding_only Reduces full-genome and transcript analyses to coding regions of genes. Requires therefseq argument. Default = false.
-mm Maximum number of mismatches against the reference that are allowed in a read. Reads surpassing this number are filtered out. Can be used as an attempt to salvage 'dirty' BAMS containing large numbers of problematic and unlikely alignments, usually due to bugs in the aligner software.) Default =3.
-pileup Enable full pileup output for each call in the VCF file. Default = false -mcv <integer> The minimum per-sample coverage required to attempt a call at a locus. Default = 6 Usage notes / Known Issues -We recommend the use of the GATK Indel Realigner to jointly process the normal and tumor DNA BAMs for Seurat, as we have empirically found that it reduces indel false positive counts significantly.
-We have found that the Base Quality recalibrator provides a minimal accuracy improvement.
-We do not recommend the use of the base alignment quality (BAQ) (which can be enabled if needed with the '-baq' argument). BAQ currently appears to be causing a significant drop in sensitivity.
-We do not recommend the use of the Variant call recalibrator on Seurat results, as the tool was not designed for somatic calls.
-Seurat accepts most global GATK arguments that can affect its functions. For more information on the GATK framework, please visit http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit.
-Seurat does not support the '-nt' option for running multiple threads within GATK. However, the GATK interval option ("-L") can be used to split the data into "bins" that can run simultaneously.
-If Seurat runs without any errors, but the output files do not contain any calls, please check the following: a) Read group tags ("@RG") are required for all BAMs that are provided; BAMs without RG tags will be ignored (more accurately, any reads that are not assigned in a read group will not be used for analysis). If your BAMs were not generated with RG tags, you may use the Picard tool AddOrReplaceReadGroups to add them. Please note (b) below if you have to add read groups manually.
b) The same sample name ("SM") cannot be used on read groups belonging on both the normal and the tumor samples. GATK currently uses sample names to group alignments together, so if they are identical between datasets, they will be merged in-memory.
-BAM files for analysis must match exactly on their header's sequence names, sequence order, and sequence length.
-For more information on how GATK handles BAM files, please refer to http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files The TYPE tag in the INFO field describes the somatic event that was detected.
The ALT genotype describes either the variant detected in the tumor genome (in case of a somatic SNV event), or the variant allele that is lost in the tumor (in case of an LOH event). The strings "<INS>" and "<DEL>" represent indels.

Large event list (-go):
A simple two-field tab-delimited text file in the following format: [