- Methodology article
- Open Access
Inferring copy number and genotype in tumour exome data
BMC Genomics volume 15, Article number: 732 (2014)
Using whole exome sequencing to predict aberrations in tumours is a cost effective alternative to whole genome sequencing, however is predominantly used for variant detection and infrequently utilised for detection of somatic copy number variation.
We propose a new method to infer copy number and genotypes using whole exome data from paired tumour/normal samples. Our algorithm uses two Hidden Markov Models to predict copy number and genotypes and computationally resolves polyploidy/aneuploidy, normal cell contamination and signal baseline shift. Our method makes explicit detection on chromosome arm level events, which are commonly found in tumour samples. The methods are combined into a package named ADTEx (Aberration Detection in Tumour Exome). We applied our algorithm to a cohort of 17 in-house generated and 18 TCGA paired ovarian cancer/normal exomes and evaluated the performance by comparing against the copy number variations and genotypes predicted using Affymetrix SNP 6.0 data of the same samples. Further, we carried out a comparison study to show that ADTEx outperformed its competitors in terms of precision and F-measure.
Our proposed method, ADTEx, uses both depth of coverage ratios and B allele frequencies calculated from whole exome sequencing data, to predict copy number variations along with their genotypes. ADTEx is implemented as a user friendly software package using Python and R statistical language. Source code and sample data are freely available under GNU license (GPLv3) at http://adtex.sourceforge.net/.
Tumourigenesis is associated with the acquisition of genomic aberrations [1, 2] including copy number alterations (CNAs) and loss of heterozygosity (LOH), which activate oncogenes or deactivate various classes of genes that play crucial roles in cancer development . Previously, such data has been generated from array comparative genomic hybridisation (aCGH) and single nucleotide polymorphism (SNP) genotyping arrays [3–5] but the implementation of massively parallel sequencing (MPS) technologies has provided novel opportunities for using sequencing data to generate equivalent genomic aberration information. In the cancer genomics field it has become a routine to perform whole genome sequencing (WGS) or whole exome sequencing (WES) on DNA extracted from tumour tissues [2, 6, 7]. WES is particularly popular for large sequencing projects seeking to identify disease-specific mutations since it is significantly cheaper than WGS and involves reduced analytical complexity, but typically seeks only to identify single nucleotide variants and small insertions-deletions [8, 9]. CNAs have been successfully detected in gene panel targeted resequencing projects , however, the bioinformatics tools for upscaling this to exomes are lacking. With the efforts of large sequencing consortia, such as The Cancer Genome Atlas (TCGA) network and International Cancer Genome Consortium (ICGC)  and individual research groups, many whole exome sequencing projects involving thousands of tumours are currently underway. However, somatic CNA identification by means of WES data is still in its early stages and needs the development of new robust computational methods and algorithms.
Computational methods have been published for detecting CNAs in targeted resequencing data including whole exome sequencing [7, 12–19], although majority of these methods are designed for analysing variations in germline DNA and perform poorly when applied to the detection of somatic CNAs in tumour samples. Control-FREEC  is a method developed for WGS tumour data and more recently supports the application to WES data. ExomeCNV  and VarScan 2  are designed for CNA identification in tumour WES data, however, they do not predict absolute copy number in non-diploid samples. Other methods can predict absolute copy numbers in non-diploid tumour samples, but only if the ploidy is known a priori[12, 20] which is rarely the case or is impractical to obtain.
Simultaneous generation of depth of coverage (DOC) ratios and B allele frequencies (BAFs) would facilitate the identification of aneuploidy and polyploidy present in cancer samples. A diploid genome would only have BAFs of 0, 0.5 and 1, corresponding to AA, AB and BB genotypes, whereas, for example, a triploid genome, assuming no tumour heterogeneity or normal contamination, would have allele ratios of 0, 0.33, 0.67 and 1 regions corresponding to AAA, AAB, ABB and BBB genotypes and a baseline DOC ratio of one which is similar to a diploid tumour genome. To generate ratios and allele frequencies from WES data, we need to align them to the reference genome and identify the SNPs. Another issue impeding the use of WGS and WES for somatic CNA identification is non-tumour cell contamination, which is present in the majority of tumour tissues. Normal DNA contamination attenuates the signal-to-noise ratio in BAF and coverage ratio signals by altering their values towards a normal diploid genome pattern. Consequently, a high normal cell contamination would make it impossible to differentially detect somatic variations in tumour cells. Among the previously published methods, ExomeCNV  requires tumour purity to be known a priori while Control-FREEC  does not. Previous studies on SNP genotyping array data [4, 21] suggested the use of Hidden Markov Models (HMM) to predict CNA and LOH events with a parameter training procedure, which inherently models the normal contamination.
When analysing exome sequencing data, it is important to overcome the intrinsic noise present in data itself, which hinders its ability to accurately predict CNAs. Programs such as XHMM  and CoNIFER , which are applicable to CNA detection in germline DNA studies, perform principle component analysis and singular value decomposition, to remove the noise present in WES data and normalise the read counts. However, these methods are not applicable in single tumour/matched normal sample pairs. A potential way to overcome this issue is to implement discrete wavelet transformation (DWT) in normalising exome ratios as we have reported previously . DWT normalisation can achieve higher precision (that is, lower number of false positives) while maintaining a comparable or superior sensitivity compared to other methods.
Overall, the three above mentioned issues, (i) noise in WES data, (ii) ploidy and (iii) normal cell contamination in tumour samples have not been simultaneously evaluated by any of the computational methods applicable to WES tumour data. Although our previously proposed method  considers these three issues, it requires prior knowledge of contamination and tumour ploidy. Therefore, in the current study, we propose a new approach named Aberration Detection in Tumour Exome (ADTEx), which automatically estimates the three aspects important to WES tumour data and predicts CNA events and genotypes of SNPs associated with these regions. Further, ADTEx makes explicit predictions on chromosome arm level CNA events, which is a pattern commonly observed across many tumour types.
Results and discussion
ADTEx for aberration detection in tumour exome
ADTEx consists of two HMMs to predict copy number alterations and genotypes in WES data of paired tumour/normal samples. Two types of signals were generated from the exome data, DOC ratios and BAFs. Copy number analysis using ratios can be complemented by the computation of BAFs to determine ploidy and zygosity. Here, we propose to apply these two types of signals to predict the zygosity state of segments in the genome targeted by exome capture.
One HMM is used to predict CNAs, which in combination with BAF signal can be used to estimate ploidy of the tumour and predict the absolute copy numbers. A second HMM is used to predict zygosity or genotype of each CNA segment. The overall framework of the method is given in Additional file 1: Figure S1. We applied our method to 17 in-house tumour samples derived from ovarian cancer patients to assess the performance of the method. We selected ten samples with different aberration types from those 17 samples to compare the performance of ADTEx against existing somatic CNA prediction methods. Further, we evaluated the performance of ADTEx on publicly available 18 paired ovarian cancer/normal samples downloaded from the TCGA project.
The parameters in our two HMMs were trained using an expectation maximisation (EM) algorithm . Given these estimated parameters, the maximum likelihood of the hidden state sequence is determined using Viterbi algorithm . In the first HMM, to detect copy number, we applied EM algorithm for each chromosome separately, while in the second HMM, we pooled all chromosomes and estimated parameters jointly. However, in the second HMM, the initial state distribution was computed separately for each chromosome.
Aberrations detection in 17 ovarian cancer samples
We used 17 paired ovarian tumour/normal samples to evaluate the performance of our proposed method (Table 1 and Additional file 1: Table S1). The samples were sequenced on an Illumina HiSeq 2000 (one pair) and the Illumina Genome Analyzer IIx (16 pairs). Exon capture was performed using the 51 Mb Agilent SureSelect Human All Exon V4 (one pair), the 36.5 Mb Roche NimbleGen EZ Exome SeqCap V2 (11 pairs) and the 26 Mb Roche NimbleGen EZ Exome SeqCap V1 platform (five pairs). Each WES sample was aligned to the reference genome, GRCh37, using BWA . The predicted aberrations in exome samples were validated by Affymetrix SNP 6.0 data generated for the same samples. ASCAT  was used to predict CNAs from the SNP array data.
Correlation between SNP array data and WES data
Manual inspection of the ratio plots between SNP 6.0 array data and whole exome sequencing data proved to be highly consistent. We also observed statistically significant positive correlation (Additional file 1: Figure S2) between SNP 6.0 data ratios and exome depth of coverage ratios. To obtain these, we partitioned the exome into windows containing five exons and computed the mean normalised DOC ratios in each partitioned window. Mean SNP 6.0 intensity ratios were calculated from the probes overlapping the exonic windows. The Spearman’s rank correlation was calculated between these two data sets for three different tumour samples (Additional file 1: Figure S2) and the Spearman’s rho ranged from 0.63 to 0.81 (p value <0.001). This evaluation demonstrates that WES is comparable with SNP 6.0 array data for the analysis of CNAs.
Polyploidy detection in exome data
To predict the copy number of each exonic locus, we first needed to establish a method for predicting the polyploidy status of each tumour sample. Additional file 1: Figure S3 shows the properties of BAF to detect correct ploidy by our method. In each case, ADTEx accurately determined the copy number status of the regions corresponding to baseline ratio, by comparing BAFs and predicted copy number. This estimation is only possible when the BAFs of the tumour sample at normal heterozygous loci are available, DOC ratios alone would not allow correcting for this.
Overall, prediction accuracies of the ploidy detection were measured by calling copy number at each exonic locus and validating them against the calls made by ASCAT on SNP 6.0 array data (Figure 1 and Additional file 1: Table S2).
Comparison with SNP genotyping array data
CNA were evident in 14 of the 17 samples, which were therefore used to evaluate the performance of CNA predictions. LOH was predicted in all 17 samples (including three samples with copy neutral LOH), which were therefore used in the genotype prediction evaluation. A representative comparison between exome results and SNP 6.0 results are shown in Figure 1a for the sample OV12.
We evaluated ADTEx predicted CNAs against those predicted by ASCAT on SNP 6.0 array data, which we assumed as the ground truths. Sensitivity, specificity, precision and accuracy were computed for each sample (Figure 1 and Additional file 1: Table S2). Each exon was treated as a point of measure for the performance calculation with true positives (TPs) considered those exons identified by both ASCAT and ADTEx as gains/losses and false positives (FPs) considered those exons predicted by ADTEx as gains/losses and copy neutral by ASCAT. True negatives (TN) and false negatives (FN) were recognised in the same manner. ADTEx had median values of 94.1% sensitivity, 98.3% specificity, 94% precision and 98% accuracy for detecting CNAs. Triploidy and tetraploidy were each present in 2 of 14 samples while most of other samples were aneuploid (Additional file 1: Figure S4 and Additional file 2). As shown in Additional file 1: Figure S5a, a high proportion of the genome was affected by duplications with an average of 26.6% of the genome amplified compared with just 4% deleted. The smallest deletion and amplification detected by ADTEx were 100pb and 80 bp long respectively, while the largest deletion and amplification were 181Mbp and 243Mbp in length respectively. The resolution of the smallest CNA detected was restricted to the smallest exon detected as a CNA.
For large scale (>0.1 of the chromosome) CNA events, we further assessed the performance of ADTEx based on the number of events detected by the method. According to the results from SNP arrays, there were 150 large scale events in all samples (Additional file 1: Table S11). ADTEx detected 145 events, which is a sensitivity of 96.7%. For the assessment, an event predicted by ASCAT is considered to be correctly detected by ADTEx when there is more than 50% overlap between the predictions made by two methods.
Figure 1c shows the performance measurements of ADTEx on LOH predictions, evaluated considering ASCAT predictions as ground truths (Additional file 1: Table S3). The heterozygous SNP loci in matched normal sample were retained for this analysis. Further, we filtered out the SNP loci that fell outside of the regions of the predicted copy number variant and copy neutral segments for the relevant sample. Each SNP locus was considered as a performance measurement point with true positives considered those SNP loci identified as having LOH by both ASCAT and ADTEx and false positive events considered those SNP loci defined by ASCAT as non-LOH but predicted by ADTEx as LOH. Median values of sensitivity and specificity were 90.1% and 99.7%.
The distribution of the total length of LOH of a sample ranged from a minimum of 0.1 Mb to maximum of 1,577 Mb with a mean of 273 Mb. Additional file 1: Figure S5b shows the distribution of the genomic proportion of allelic imbalance presents in each sample. Additional file 1: Figure S6 shows different types of LOH events identified using ADTEx on whole exome sequencing data.
Performance metrics for detecting allele specific copy number alterations (ASCNA) were reported as median sensitivity of 96.8% and specificity of 98.2% and are summarised in Additional file 1: Table S4 in terms of sensitivity, specificity, precision and accuracy.
Performance evaluation on TCGA data
Next, we evaluated the performance of ADTEx on high-grade serous ovarian adenocarcinoma samples sequenced as part of TCGA project . We downloaded BAM files of 18 paired tumour/normal samples sequenced at Washington University from the Cancer Genomics Hub (CGHub). These were sequenced using Illumina Genome Analyzer IIx and target capture was performed by Agilent SureSelect Human exome platform. All samples have been aligned to the GRCh37-lite. The number of reads mapped to the targeted regions ranged from 57,215,953 to 118,126,167 (Additional file 1: Table S5).
To evaluate the somatic CNA detection of ADTEx, we also downloaded the raw Affymetrix SNP 6.0 files of the same samples from TCGA data portal. Then, as before, we carried out CNA detection on SNP 6.0 data using ASCAT algorithm . These results were treated as the ground truths for the evaluation. Each sample has very high aberration rate with focal and large scale CNAs, typical of this tumour type.
In detecting somatic CNA, ADTEx showed median sensitivity of 93.7%, precision 79.3% and F-measure of 83.0%. F measure was computed using the following equation.
The largest detected CNA segment was 242,433,351 bp long in length and smallest detected CNA was 120 bp long. On average 1,035 CNA segments per tumour were detected by ADTEx (Additional file 3). In all samples, 76,341 exons were identified as losses and 2,357,365 exons were identified as gains. Therefore, for each tumour sample there were about 135,000 altered exons and this number is consistent with the reported values  for ovarian cancer samples. Sensitivity, precision and F-measure for detecting LOH were computed as 92.9%, 96.3% and 94.5% respectively, relative to ASCAT predictions.
Comparison with other copy number predicting algorithms
In order to demonstrate the effectiveness of our proposed method, we carried out a comparison between ADTEx and other somatic CNA detecting algorithms. We selected ExomeCNV , VarScan 2  and Control-FREEC  for the comparison as they were developed for WES data generated from paired tumour/normal samples. Further, we evaluated the performance of our previous work described in Amarasinghe et al., 2013  and details of the evaluation are given below separately in the section “Comparison with previous work”. We selected ten samples from a in-house data set, with different copy number aberrations to compare the competing methods. These samples contain focal aberrations, chromosome arm-level and full chromosomal events. In all cases we used SAMtools pileup/mpileup  to generate coverage files as inputs for the three competing exome based methods. Supplementary Methods section in Additional file 1 describes the parameter settings used with each method.
Results from the comparison study
The overall performance of the four methods is shown in Figure 2 and Table 2. Additional file 1: Table S6 gives the performance matrix of each method on each sample. VarScan2 does not predict the absolute copy number, instead it predicts gains/losses and copy neutral regions. Accordingly, in Figure 2a, VarScan2 result shows losses, neutral regions, and gains as having copy 1, 2 and 3. Figure 2b shows the performance metrics of each method in terms of sensitivity, specificity, precision and accuracy. We were particularly interested in sensitivity [no. of TP/(no. of TP + no. of FN)] and precision [no. of TP/(no. of TP + no. of FP)] as the performance measures to base our comparison. Therefore, we report the F measure values here.
The calculated mean (median) F-measure values for ADTEx, ExomeCNV, VarScan2 and Control-FREEC were 91.5% (92.3%), 76.5% (90.6%), 82.0% (84.7%) and 76.0% (82.5%) respectively. Overall, both mean and median performance scores of ADTEx were better compared to other methods. VarScan2 ranked second and ADTEx showed 12% increase in terms of mean F measure over VarScan2. As per the Table 2, ADTEx showed superior or comparable performance in terms of mean and median values of all performance measures compared to other competing methods. The main reason for superior results produced by ADTEx compared to other methods is that it simultaneously evaluates specific characteristics of tumour WES data, namely (i) noise, (ii) ploidy and (iii) normal cell contamination.Figure 2a shows the copy number predictions of the four methods on chromosome 3 for the sample OV1. Interestingly, Control-FREEC appeared to misidentify the normal regions as it misjudged the baseline of the ratios, resulting in the prediction of copy neutral regions as amplifications and deletions as copy neutral.
Comparison of small CNAs
SNP genotyping arrays cannot be used to detect smaller-size CNAs (<1Mbp) due to limited resolution. To assess the performance at this level, we compared smaller (<1Mbp) CNAs detected by ADTEx with three other exome based methods. Overall, ADTEx, ExomeCNV, VarScan 2 and Control-FREEC detected 448, 7167, 4618 and 1494 CNA events, respectively, of which 79%, 94%, 93% and 82% were identified as smaller than 1Mb. Identification of high percentages of smaller events by all four methods can be attributed to the sparse nature of WES data. As ADTEx performs noise reduction in WES data, many of the smaller CNAs predicted by other three methods are integrated into larger CNAs predicted by ADTEx. The overlap between other methods and 354 smaller CNAs identified by ADTEx is given in Additional file 1: Table S12. VarScan 2 identified 164 of those events, which has the highest concordance rate.
Comparison of LOH and ASCNA predictions
Control-FREEC predicts LOH and ASCNA events similar to ADTEx, hence we compared ADTEx LOH and ASCNA predictions against the results from Control-FREEC (Table 3 and Additional file 1: Table S7 and Table S8). Five samples with different variations were chosen to carry out the comparison. ADTEx outperformed Control-FREEC in terms of precision and F-measure. The mean F-measure and precision of ADTEx were calculated respectively as 89.5% and 89.1% while that of Control-FREEC were 46.3% and 37.0% with respect to LOH predictions. The mean F-measure and precision values for ASCNA predictions were 83.7% and 82.0% for ADTEx and 66.1% and 68.1% for Control-FREEC respectively.
Comparison with previous work
We described a copy number predicting algorithm in Amarasinghe et al., 2013 . The main differences between current work and the algorithm in Amarasinghe et al., 2013  are i) no a priori knowledge of contamination or ploidy is required and ii) genotype status of SNPs in each of the CNA segment predicted by ADTEx. We evaluated the performance of Amarasinghe et al., 2013  using the ten samples in the comparison study and also applied the method on triploid sample OV8 (Table 2 on diploid samples and Additional file 1: Table S9). The ploidy and contamination values were chosen based on manual confirmation and prediction made by ASCAT  on SNP 6.0 data of the same samples. For the ten diploid samples we saw 4% and 5% increases in median precision and F-measure values respectively for ADTEx compared to Amarasinghe et al., 2013 . Further, for the OV8 sample ADTEx showed clear improvements in terms of sensitivity, specificity, precision and accuracy.
Evaluation of the effect of normal cell contamination
We carried out computer simulations to evaluate the effect of normal cell contamination on CNA and LOH predictions by ADTEx. We use OV1 data sample to generate different combination of data sets with 0.1 to 0.7 normal cell contamination and read depths of 150X (original coverage), 60X and 400X. The original normal cell contamination is predicted as 25%, therefore the expected contamination levels can be calculated as 0.325, 0.4, 0.475, 0.55, 0.625, 0.7 and 0.775 respectively. For original coverage, ADTEx accurately predicted the normal cell contamination up to 0.7 (Pearson correlation 0.99). Additional file 1: Figure S7 shows the changes in the predicted values. The F measure performance on CNA predictions (Additional file 1: Figure S8) was reported taking the SNP 6.0 results as ground truths. Relatively high F measure value is maintained (F measure 0.66) at the 0.55 contamination level for CNA detection. The F measure performance of LOH predictions (Additional file 1: Figure S9) suggests ADTEx performed well even at 0.625 normal cell contamination level (F measure 0.96). At high levels of contamination (>0.7), prediction of CNA is difficult due to the very low level of variation in the depth of coverage ratios. Further, 150X and 400X coverage showed better performance compared to 60X coverage.
Chromosome arm level copy number aberrations
Chromosome arm level or full chromosomal CNAs are commonly observed across many tumour types [28, 29]. Different tumour types have been reported to have recurrent arm level events on different chromosomes. For example gain in chromosome 3 or 3q are more common in cervical cancer while loss of chromosome 10 is common in glioblastomas .
ADTEx explicitly predicts chromosome arm level events based on the results generated by exon level copy number predictions. The distribution of DOC ratios of copy neutral regions is calculated from the exon level result. Then, statistical confidence level (assuming no CNA) for each chromosome arm considering the mean DOC ratio is produced based on the calculated distribution. At the 0.05 confidence level we were able to detect 44 chromosome p/q level events (15 gains and 29 losses) and 30 full chromosome CNA events (15 of each gains and losses) in the 17 in-house ovarian tumour samples. According to the carefully curated list of chromosome arm level CNAs (Additional file 1: Table S10) in those ovarian cancer samples, ADTEx showed 96% sensitivity and 99% specificity. Four single copy losses and one single copy gain residing in samples with higher ploidy were not detected as significant compared to copy neutral level.
We have described a new approach to infer somatic CNAs and genotype states in WES data from tumour samples. Our method both models and evaluates tumour related attributes in WES data. Further, the proposed method explicitly predicts chromosome arm level CNA events, which are commonly found in many tumour types. We implemented this approach in a software called “Aberration Detection in Tumour Exome (ADTEx)”, which is freely available under GNU General Public Licence v3 (GPLv3). To our knowledge, ADTEx is the first attempt to computationally derive absolute copy numbers and genotypes using WES data from tumour samples without any a priori knowledge of levels of normal DNA contamination or ploidy of the tumour samples. The algorithm takes DOC ratios and BAFs as inputs and models them using Gaussian distribution. Prior to applying HMM to derive CNAs and genotypes, the DOC ratios are smoothed by discrete wavelet transformation techniques. We applied the algorithm to 35 (in-house and public data) paired ovarian tumour/normal samples captured using three different targeted capture platforms and sequenced using Illumina Genome Analyzer II or HiSeq2000 sequencers. Further, to our knowledge ADTEx is the only method that predicts chromosome arm level CNA in WES data.
We demonstrated the superior performance of ADTEx compared to existing methods. Most importantly, we compared the performance of ADTEx against the results generated by ASCAT on Affymetrix SNP 6.0 data and showed that our method can produce results consistent with SNP array data, the gold standard for detecting CNAs. We believe that the integrated CNA and LOH predictions in ADTEx will greatly improve the type and usefulness of the data generated in large WES studies.
However, We have not addressed the issue of tumour heterogeneity where some components of the tumour biopsy will have different clonal outgrows harbouring different genetic alterations. Clonal heterogeneity will result in an amalgamation of the signals present in both subpopulations which may result in a reduced sensitivity of ADTEx. This could be a future research direction that can be pursued using WES data.
ADTEx processing pipeline is shown in Additional file 1: Figure S1. Overall, the method consists of two HMMs. One HMM predicts copy number alterations using noise reduced DOC ratios and the other HMM predicts tumour zygosity states at heterozygous SNP loci in normal samples. We have proposed a two stage HMM algorithm, which ensures:
Increased performance in predicting CNAs by taking all data points that are available (depth of coverage ratios) for the evaluation of first HMM and then calling zygosity on SNP loci in the second HMM.
Computational efficiency, achieved through dimensionality reduction in the first stage of HMM.
Calculation of DOC ratio and BAF
DOC ratios are calculated as the ratios between average coverage per base of exonic regions in tumour and matched normal samples using the following steps: (i) coverage per base at targeted regions are calculated by BEDTools software , (ii) average coverage per base is calculated for each exonic region, (iii) regions with lower average coverage than a predefined threshold (=10 reads) are excluded from the analysis, (iv) mean coverage normalisation is performed for each sample and (v) ratio between mean normalised DOC of tumour and matched normal samples are calculated. These ratios showed extensive intrinsic noise particularly in low coverage regions. We applied DWT denoising on ratios generated from low coverage regions. DWT denoising helps to achieve a higher sensitivity and precision .
BAF is calculated as the ratio between number of B alleles and total number of A and B alleles. Here, ‘A allele’ refers to the reference allele and ‘B allele’ refers to the non-reference allele in DNA sequencing data when the sequence reads are aligned to the reference genome. We calculated SNP loci in both tumour and matched normal samples by applying The Genome Analysis Toolkit (GATK) software . BAF at each SNP is then calculated applying the following formula,
For example, based on the above formula BAF should be around 0, 0.5 and 1 for genotypes AA, AB and BB, respectively. If a BAF is deviated from these values, it would mean that there is a possibility of copy number alteration. For instance, a BAF value of 0.25 indicates a genotype of AAAB.
DOC ratio baseline evaluation
DOC ratios can be used as an indication of the relevant copy number present in tumour sample compared to the matched normal sample. Copy number of a particular region and the DOC ratio has a direct relationship in non-cancerous samples. For example, ratios of one and two represent copy number two and four respectively. However, due to the presence of extensive abnormalities in tumour samples, mean DOC ratio will differ from the nominal ratio of one. This change in the signal is known as the baseline shift (Additional file 1: Figure S3). We corrected for the baseline by identifying the peaks of the distribution of the DOC ratios. Ratios are normalised based on the value of the peak closest to one.
On the other hand, the baseline ratio would not correspond to copy number of two in most of the tumour samples due to the presence of polyploidy and aneuploidy. We successfully identified the copy number of the baseline by evaluating BAF of each tumour sample. Correct identification of the copy number of the baseline made it possible to predict the absolute copy number in the tumour samples. Our approach, which detects the baseline ploidy is explained in the following section.
HMM to predict copy number variations and identification of baseline ploidy
The definitions of hidden states are described in detail in Additional file 1: Supplementary Methods. We model the emission probability of DOC ratios by a Gaussian distribution. The mean of the distribution depends on the hidden state while standard deviation remains constant for all states. Standard deviation was kept constant after analysing the data and observing that there is not much impact on the end result. In the current work, we used expectation maximisation (EM) algorithm during HMM parameter training step. We trained initial state distribution, stationary transition matrix and mean of the Gaussian distributions. We selected one chromosome at a time during the training step to clearly and accurately capture the initial states and ratio variations in chromosomes. This approach helped to achieve faster computational time as well due to the reduced number of data points. Finally, we applied Viterbi algorithm  to predict the sequence of hidden states.
When the BAFs of tumour samples are present, we fitted HMM for different base ploidy values (reflecting different copy number 2, 3, and 4 states) and then applied the following steps to determine the base ploidy: (i) Select the SNPs which overlaps with the captured exonic regions, (ii) segment BAFs (bi as given in Equation (1)) of SNPs using DNACopy  circular binary segmentation algorithm, (iii) estimate B allele count (NB,i) for different values of contamination () as given in Equation (2), (iv) calculate the cost of each estimation by taking the distance between estimated NB,i and rounded value of NB,i (as B allele counts cannot be fractions) and (v) calculate the minimum summation of distances that would give the best fit for the base ploidy.
This procedure identifies the absolute copy numbers when base ploidy of the tumour sample is not known a priori.
HMM to identify zygosity states
Definition of hidden states
The definitions of the hidden states to predict tumour zygosity are shown in Table 4. ADTEx analyses positions, , with heterozygous SNPs in the matched normal sample. SNPs having BAF within 0.3 and 0.7 in normal sample were selected as heterozygous regions (thresholds were selected as in ). The removal of homozygous loci in the normal sample ensured the detection of tumour-specific somatic LOH events.
As depicted in Table 4, each hidden state can be uniquely identified with a copy number and zygosity state. The copy number of each SNP locus is calculated from the previous HMM predicting copy number.
Probability density function of observations
Mirrored BAFs (as defined below) around 0.5 are used as the observations in the proposed HMM.
Normal cell proportion () and standard deviation (b) of BAF signal are considered as main parameters of the observation distribution. We assumed mirrored BAF signal is normally distributed with state specific mean and a constant b for all states. If BAF of lth SNP is b l , then observation probability distribution function (pdf) for a hidden state Sk with associated copy number CT can be formulated as:
Here, state specific mean () is given by:
where, NB(Sk) is the expected B allele count for state Sk.
Non-stationary transition matrix
The state transition matrix is considered to be non-stationary as described in [6, 32]. The genomic distance (d) between two SNPs is non-uniform and hence we calculated the transition probabilities based on exponential function utilizing d. Further, the transition probability depends on the state specific copy number. For example, if the current observation j, is assigned with a copy number of two, then the model can only transit to ‘hidden state 2’ and ‘hidden state 3’ (Table 4) from previous state i.
In equation (5), L is chosen to be 2 Mb.
Expectation maximisation (EM) algorithm
EM algorithm has been used in various parameter estimation tasks including HMM . In the following text we describe the EM algorithm that is used to estimate HMM parameters. We pooled data points from all the chromosomes for the parameter training. The pooled training procedure helped to make sure that all hidden states would be covered during parameter selection. However, the joint estimation of parameters did not favour the determination of initial state distribution () as it can vary at the beginning of each chromosome. Therefore, we trained a separate initial state distribution for each chromosome.
We can derive the partial log likelihood function for observation distribution as:
In the above Equation (7), l (Sk) is the posterior probability of lth SNP to be in state Sk, which is calculated using forward-backward algorithm . By taking the derivative with respect to and setting it to zero, we derive the following equation regarding normal cell contamination estimation:
Where P T = 2α + (1 − α)C T (S k ) and t = EM iteration number. The algorithm is constrained to identify in the interval of 0 ≤ ≤ 0.7, so if α(t+1) is less than 0 or greater than 0.7, it will be set to 0 or 0.7 respectively.
Estimated as per Equation (8) is used to update state dependent means. Although, the presence of normal cell contamination, shrinks BAFs to 0.5, it does not affect BAFs in the ‘HET’ states of the HMM. Hence, mean of these ‘HET’ states are calculated as per the typical EM training step described in .
Aberration detection in tumour exome
The cancer ganome Atlas
Single nucleotide polymorphism
General public license
Copy number alteration
Loss of heterozygosity
Array comparative hybridization
Massively parallel sequencing
Whole genome sequencing
Whole exome sequencing
International cancer genome consortium
Depth of coverage
B allele frequency
Hidden Markov model
exome hidden Markov model
Copy number inference from exome reads
Discrete wavelet transformation
Allele specific copy number analysis of tumours
Allele specific copy number alterations
Cancer genomics hub
The genome analysis toolkit.
Albertson DG, Collins C, McCormick F, Gray JW: Chromosome aberrations in solid tumors. Nat Genet. 2003, 34 (4): 369-376. 10.1038/ng1215.
Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature. 2009, 458 (7239): 719-724. 10.1038/nature07943.
Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. J Multivariate Anal. 2004, 90 (1): 132-153. 10.1016/j.jmva.2004.02.008.
Li A, Liu Z, Lezon-Geyda K, Sarkar S, Lannin D, Schulz V, Krop I, Winer E, Harris L, Tuck D: GPHMM: an integrated hidden Markov model for identification of copy number alteration and loss of heterozygosity in complex tumor samples using whole genome SNP arrays. Nucleic Acids Res. 2011, 39 (12): 4928-4941. 10.1093/nar/gkr014.
Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array‒based DNA copy number data. Biostatistics. 2004, 5 (4): 557-572. 10.1093/biostatistics/kxh008.
Ha G, Roth A, Lai D, Bashashati A, Ding J, Goya R, Giuliany R, Rosner J, Oloumi A, Shumansky K, Chin S-F, Turashvili G, Hirst M, Caldas C, Marra MA, Aparicio S, Shah SP: Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Research. 2012, 22: 1995-2007. 10.1101/gr.137570.112.
Sathirapongsasuti JF, Lee H, Horst BAJ, Brunner G, Cochran AJ, Binder S, Quackenbush J, Nelson SF: Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics. 2011, 27 (19): 2648-2654. 10.1093/bioinformatics/btr462.
Banerji S, Cibulskis K, Rangel-Escareno C, Brown KK, Carter SL, Frederick AM, Lawrence MS, Sivachenko AY, Sougnez C, Zou L, Cortes ML, Fernandez-Lopez JC, Peng S, Ardlie KG, Auclair D, Bautista-Pina V, Duke F, Francis J, Jung J, Maffuz-Aziz A, Onofrio RC, Parkin M, Pho NH, Quintanar-Jurado V, Ramos AH, Rebollar-Vega R, Rodriguez-Cuevas S, Romero-Cordoba SL, Schumacher SE, Stransky N, et al: Sequence analysis of mutations and translocations across breast cancer subtypes. Nature. 2012, 486: 405-409. 10.1038/nature11154.
Stephens PJ, Tarpey PS, Davies H, Van Loo P, Greenman C, Wedge DC, Nik-Zainal S, Martin S, Varela I, Bignell GR, Yates LR, Papaemmanuil E, Beare D, Butler A, Cheverton A, Gamble J, Hinton J, Jia M, Jayakumar A, Jones D, Latimer C, Lau KW, McLaren S, McBride DJ, Menzies A, Mudie L, Raine K, Rad R, Spencer Chapman M, Teague J, et al: The landscape of cancer genes and mutational processes in breast cancer. Nature. 2012, 486: 400-404.
Walsh T, Lee MK, Casadei S, Thornton AM, Stray SM, Pennil C, Nord AS, Mandell JB, Swisher EM, King M-C: Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proc Natl Acad Sci. 2010, 107: 12629-12633. 10.1073/pnas.1007983107.
TheInternationalCancerGenomeConsortium: International network of cancer genome projects. Nature. 2010, 464 (7291): 993-998. 10.1038/nature08987.
Amarasinghe K, Li J, Halgamuge S: CoNVEX: copy number variation estimation in exome sequencing data using HMM. BMC Bioinformatics. 2013, 14 (Suppl 2): S2-10.1186/1471-2105-14-S2-S2.
Fromer M, Moran Jennifer L, Chambert K, Banks E, Bergen Sarah E, Ruderfer Douglas M, Handsaker Robert E, McCarroll Steven A, O’Donovan Michael C, Owen Michael J, Kirov G, Sullivan Patrick F, Hultman Christina M, Sklar P, Purcell Shaun M: Discovery and Statistical Genotyping of Copy-Number Variation from Whole-Exome Sequencing Depth. Am J Hum Genet. 2012, 91: 597-607. 10.1016/j.ajhg.2012.08.005.
Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK: VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012, 22 (3): 568-576. 10.1101/gr.129684.111.
Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, Quinlan AR, Nickerson DA, Eichler EE, NHLBI Exome Sequencing Project N: Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012, 22 (8): 1525-1532. 10.1101/gr.138115.112.
Li J, Lupat R, Amarasinghe KC, Thompson ER, Doyle MA, Ryland GL, Tothill RW, Halgamuge SK, Campbell IG, Gorringe KL: CONTRA: copy number analysis for targeted resequencing. Bioinformatics. 2012, 28 (10): 1307-1313. 10.1093/bioinformatics/bts146.
Lonigro RJ, Grasso CS, Robinson DR, Jing X, Wu Y-M, Cao X, Quist MJ, Tomlins SA, Pienta KJ, Chinnaiyan AM: Detection of somatic copy number alterations in cancer using targeted exome capture sequencing. NEOPLASIA. 2011, 13: 1019-1025.
Love M, Mysickova A, Sun R, Kalscheuer V, Vingron M: Modeling read counts for CNV detection in exome sequencing data. Stat Appl Genet Mol Biol. 2011, 10 (1): 1-
Nord A, Lee M, King M-C, Walsh T: Accurate and exact CNV identification from targeted high-throughput sequence data. BMC Genomics. 2011, 12 (1): 184-10.1186/1471-2164-12-184.
Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, Janoueix-Lerosey I, Delattre O, Barillot E: Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012, 28 (3): 423-425. 10.1093/bioinformatics/btr670.
Sun W, Wright FA, Tang Z, Nordgard SH, Loo PV, Yu T, Kristensen VN, Perou CM: Integrated study of copy number states and genotype calls using high-density SNP arrays. Nucleic Acids Res. 2009, 37 (16): 5365-5377. 10.1093/nar/gkp493.
Zucchini W, MacDonald IL: Hidden Markov Models for Time Series. 2009, Boca Raton, Fla: CRC Press
Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989, 77 (2): 257-286. 10.1109/5.18626.
Li H, Durbin R: Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.
Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, Perou CM, Børresen-Dale A-L, Kristensen VN: Allele-specific copy number analysis of tumors. Proc Natl Acad Sci. 2010, 107: 16910-16915. 10.1073/pnas.1009843107.
Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma. Nature. 2011, 474: 609-615. 10.1038/nature10166.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
DeVita VT, Hellman S, Rosenberg SA: Cancer: Principles & Practice of Oncology. 2005
Mermel C, Schumacher S, Hill B, Meyerson M, Beroukhim R, Getz G: GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011, 12 (4): R41-10.1186/gb-2011-12-4-r41.
Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20: 1297-1303. 10.1101/gr.107524.110.
Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A, Holmes CC, Ragoussis J: QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007, 35 (6): 2013-2025. 10.1093/nar/gkm076.
This work is partially funded by Australian Research Council (grant DP1096296). KCA was funded by The University of Melbourne MIFRS and MIRS scholarships.
The authors declare that they have no competing interests.
KCA and JL devised the study. KCA developed and implemented the algorithm, conducted all analytical experiments and wrote the initial manuscript. SMH, GLR and PAC gathered the tumour samples and conducted the experiments to generate whole exome sequencing and SNP 6.0 array data sets. IGC is the project leader of the data generation study and provided oversight to sequencing data generation. SKH oversaw the computational and bioinformatics methods of the study. All authors edited, read and approved the final manuscript.
Electronic supplementary material
About this article
Cite this article
Amarasinghe, K.C., Li, J., Hunter, S.M. et al. Inferring copy number and genotype in tumour exome data. BMC Genomics 15, 732 (2014). https://doi.org/10.1186/1471-2164-15-732
- Hide Markov Model
- Discrete Wavelet Transformation
- Hide State
- Whole Exome Sequencing
- Single Nucleotide Polymorphism Locus