Inferring copy number and genotype in tumour exome data

Background Using whole exome sequencing to predict aberrations in tumours is a cost effective alternative to whole genome sequencing, however is predominantly used for variant detection and infrequently utilised for detection of somatic copy number variation. Results We propose a new method to infer copy number and genotypes using whole exome data from paired tumour/normal samples. Our algorithm uses two Hidden Markov Models to predict copy number and genotypes and computationally resolves polyploidy/aneuploidy, normal cell contamination and signal baseline shift. Our method makes explicit detection on chromosome arm level events, which are commonly found in tumour samples. The methods are combined into a package named ADTEx (Aberration Detection in Tumour Exome). We applied our algorithm to a cohort of 17 in-house generated and 18 TCGA paired ovarian cancer/normal exomes and evaluated the performance by comparing against the copy number variations and genotypes predicted using Affymetrix SNP 6.0 data of the same samples. Further, we carried out a comparison study to show that ADTEx outperformed its competitors in terms of precision and F-measure. Conclusions Our proposed method, ADTEx, uses both depth of coverage ratios and B allele frequencies calculated from whole exome sequencing data, to predict copy number variations along with their genotypes. ADTEx is implemented as a user friendly software package using Python and R statistical language. Source code and sample data are freely available under GNU license (GPLv3) at http://adtex.sourceforge.net/. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-732) contains supplementary material, which is available to authorized users.


HMM to predict copy number alterations
This step predicts the unknown copy number state sequence from the observed depth of coverage (DOC) ratios. The hidden states of the HMM represent the copy number states of each targeted region. In the default settings we have five different copy number states representing 0 to 4 copy numbers. These states are interpreted in biological context as homozygous deletion (copy 0), hemizygous deletion (copy 1), no CNV or copy neutral (copy 2), 1 copy gain (copy 3), and copy amplifications (copy 4 and above).
The observed DOC ratios (R ij ) are smoothed by applying DWT denoising before feeding them to the HMM. Here, R ij represents the DOC ratio of the i th targeted region in j th chromosome. Each chromosome j of each tumour-control samples pair is considered separately for copy number identification. The fitted discrete time HMM is given below: 1. The total number of hidden states in the model is given by K and those are denoted by S = S 1 , S 2 ,...,S K . If there are L exons in the sample of consideration, the state of l th exon (e l ) equals to S k where 1≤l≤L and 1≤k≤K.
2. The initial state distribution = { k } where π k = P(e 1 = S k ), 1 ≤ k ≤ K 3. The state transition probability distribution A = a mp where a mp = P(e l+1 = S p | e l = S m ), 1 ≤ m, p ≤ K 4. The emission probability distribution is given by Here, N represents the Gaussian distribution. Mean (µ k ) of that distribution vary with different states and the normal cell contamination percentage and ploidy. We used a common standard deviation, σ, for all states.
Above HMM can be represented compactly as = (A,B, ) where A, B and represent transition probability matrix, emission probability distribution and initial state distribution.

Emission distribution
A 100% pure diploid tumour sample would have DOC ratios (0,0.5,1,1.5) representing copy numbers (0,1,2,3) correspondingly. However, this is not essentially the case when there is normal contamination, aneuploidy and polyploidy present in the data. A polyploid sample with normal contamination will see a deviation in the relationship between copy number and DOC ratio. This relationship is given by, Where , P Tij and P T represent normal contamination, copy number of considered tumour region and ploidy of tumour sample respectively.
Therefore, we modelled the mean of emission distribution by equation 1, where P Tij depends on the copy number of each hidden state. P T either can be given as an input or a best fit can be computed if B allele frequencies (BAF) are present. This computation is described in the main article.

Parameters used in the comparison of each method
ExomeCNV (Sathirapongsasuti, et al., 2011): We observed that this method performs better with their segment merging step, which utilizes circular binary segmentation algorithm in DNACopy package (Olshen, et al., 2004). Hence, we used the CNV segments output after the merging step and converted this to exon level copy number. We then compared its performance against copy number predictions by ASCAT on SNP 6 array data. We applied ExomeCNV version 1.2 with its default parameters, except for the read length and normal cell contamination. We applied sample specific normal cell contamination values as input to ExomeCNV based on manual inspection of the data and predictions made by ASCAT.
VarScan 2 (Koboldt, et al., 2012): This method (VarScan 2 version 3.1) segments the exome based on the differences between the read depths of tumour and normal samples. We followed the recommended workflow provided in the VarScan project page in SourceForge website (http://varscan.sourceforge.net/copy-number-calling.html): 1) Run VarScan 'copynumber' routine on pileup files generated by SAMtools (Li, et al., 2009), 2) Run VarScan 'copyCaller' on the results from step (1), 3) Apply circular binary segmentation (CBS) from DNAcopy package (Olshen, et al., 2004), 4) Visualise the results and adjust baseline if necessary. If baseline is adjusted, then apply steps (3) and (4) again and 5) Merge segments and classify events using 'mergeSegments.pl' script. The result contained segmented exome based on the predicted CNV status of each genomic locus. Since we calculated the performance based on exon level predictions, we converted the output to exon level predictions by referring to the overlap of each exon with the predicted CNV segments. If an exon has two different CNV status predicted by consequent CNV segments, then the CNV status which overlaps with the most number of bases of the exon is taken as the correct call.
Control-FREEC (Boeva, et al., 2012): All the default parameters specified by the authors of the method (version 6.2) in their example on exome sequencing data analysis were used in our comparison study. The samples were tested with the 'contaminationAdjustment = TRUE' and 'ploidy = 2' options.              Table S10. Arm level copy number aberrations.