High-throughput technologies can effectively replace cytogenetics to generate high-resolution maps of chromosomal aberrations. Cataloging potential markers at different length scales, from whole-chromosome deletions, to a few genes, down to specific nucleotide mutations, has enabled the association of important biological mechanisms with tumor formation and progression. However, the caveat in interpreting data generated by these techniques is that signals measured from tumor biopsies can be an aggregate profile of different cells. To better understand the potential of high-throughput technologies, our study addresses two issues: i) the effects of subclonal heterogeneity on CNA analysis; ii) the identification of copy number alteration measures that are robust to heterogeneity.
In our work we show that heterogeneity hinders CNA analyses. Currently there is no direct mathematical procedure to correctly infer the copy number of a heterogeneous sample when the number of homogeneous components is greater than two. Even in toy models, in which the main focus is on the aberration status rather than on the actual copy number (binary encoding in vectors of 0s and 1s), the search space of aberration profiles grows exponentially with the number of measurements.
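The ambiguity described above can be illustrated with a minimal sketch. It assumes a linear mixing model, in which the aggregate copy number at each locus is the abundance-weighted average over components. The function names, abundances, and copy number profiles are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def aggregate_profile(weights, profiles):
    """Abundance-weighted average copy number per locus (assumed linear mixing)."""
    n_loci = len(profiles[0])
    return [sum(w * p[i] for w, p in zip(weights, profiles)) for i in range(n_loci)]


# Mixture A: normal cells plus two hypothetical subclones, two loci each.
mix_a = aggregate_profile(
    [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
    [[2, 2], [4, 2], [2, 0]],
)

# Mixture B: normal cells plus a single hypothetical subclone.
mix_b = aggregate_profile(
    [Fraction(1, 2), Fraction(1, 2)],
    [[2, 2], [3, 1]],
)

# Both mixtures produce the same aggregate signal at every locus.
same_signal = mix_a == mix_b


def n_binary_profiles(n_measurements):
    """Number of candidate 0/1 aberration profiles for one component."""
    return 2 ** n_measurements
```

Under these assumptions, two biologically different samples are indistinguishable from the aggregate signal alone, and the number of binary profiles to search doubles with every added measurement.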
It is important to note, however, that most aberrations span several loci; thus, measurements from SNP arrays or from next-generation sequencing techniques will be grouped in clusters of statistically indistinguishable numerical values. As a result, the number of unique measurements is generally smaller than the total number of SNPs on an array, or than the number of base pairs sequenced using next-generation sequencing techniques. Nevertheless, the search space of aberration profiles remains too large to yield a unique solution, even with powerful regularization methods. To address this issue, one has to impose additional mathematical constraints motivated by the specific properties of the biological system under investigation.
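The reduction in the number of unique measurements can be sketched as follows. This is a simplified illustration, not the procedure used in this study: a fixed tolerance (a hypothetical parameter, `tol`) stands in for a proper test of statistical indistinguishability, and each locus is compared against the first value of the current run.

```python
def segment_runs(values, tol=0.1):
    """Group adjacent loci whose values stay within `tol` of the run's first value.

    Returns a list of (start_index, end_index, mean_value) tuples.
    The tolerance is an illustrative stand-in for a statistical test.
    """
    segments = []
    start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the data or at a jump larger than tol.
        if i == len(values) or abs(values[i] - values[start]) > tol:
            run = values[start:i]
            segments.append((start, i - 1, sum(run) / len(run)))
            start = i
    return segments


# Hypothetical per-locus copy number signal along a chromosome.
signal = [2.0, 2.05, 1.98, 3.0, 3.02, 2.01]
segments = segment_runs(signal)
```

In this toy signal, six measurements collapse into three segments, which shrinks the number of unknowns but does not by itself make the decomposition unique.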
Currently, state-of-the-art CNA detectors are model-based, i.e., they attempt to predict the exact copy number status and the genotype by fitting the measured quantities with pre-coded models [10, 11, 20]. The underlying models have been designed to improve sensitivity. However, when HMM-based approaches are employed in simulated scenarios representing optimal signals, the underlying aberrations are not properly inferred (Figure 2A and Figure 3). We propose to reduce the CNA analysis to discrimination between three distinct states (gain, loss, and normal), so as to detect the presence and type of the strongest aberrant component in the aggregate signal.
We present a simple and biologically motivated framework to design measures of CNAs based on alteration of total DNA and allelic balance at heterozygous loci. These measures can be easily implemented for three-state classification tasks using thresholding. Our M-measure is one example selected from a family of measures, and it showed unparalleled robustness to sample heterogeneity, thus leading to improved performance in detecting the presence of CNAs. Interestingly, the three-state Viterbi algorithm based on the M-measure did not show a significant improvement in performance over the M-measure alone. This reflects the fact that the decay in performance as the mixing coefficient of the stromal component increases is mostly due to the decreasing strength of the aberrant signal in the mixture, relative to the constant level of experimental noise. Clearly, the simpler goal of three-state classification based on measures such as the M-measure is easier to meet; the advantage is that this type of measure is also more robust when put to the test.
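The effect of stromal dilution on a thresholded three-state call can be sketched as follows. This sketch does not implement the M-measure; it uses the log2 ratio of total DNA as a simple stand-in, assumes diploid stroma and linear mixing, and omits experimental noise. The threshold value and function names are hypothetical.

```python
import math


def log2_ratio(tumor_copy_number, stromal_fraction):
    """Log2 ratio of total DNA for a tumor/stroma mixture (stroma assumed diploid)."""
    total = stromal_fraction * 2 + (1 - stromal_fraction) * tumor_copy_number
    if total <= 0:
        raise ValueError("total copy number must be positive")
    return math.log2(total / 2)


def classify(value, threshold=0.2):
    """Three-state call by thresholding; the threshold is an illustrative choice."""
    if value > threshold:
        return "gain"
    if value < -threshold:
        return "loss"
    return "normal"


# A three-copy gain in a pure tumor versus the same gain at 80% stroma.
pure_call = classify(log2_ratio(3, 0.0))
diluted_call = classify(log2_ratio(3, 0.8))
```

With a fixed threshold, which in practice would be set relative to the noise level, the same gain is called in the pure sample but falls below the threshold once the stromal fraction is high. This mirrors the shrinking aberrant signal described above.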
Our proposed framework for detecting CNAs in high-throughput SNP-profiling is a conceptual generalization of a class of empirical measures used to identify CNAs and sample heterogeneity employing massively parallel sequencing of genomic DNA. Here, we seek to demonstrate that these measures can be used to analyze signals in other types of sequencing experiments, such as RNA-seq, exome sequencing, or ChIP-seq. High-throughput mRNA profiling of tumors has shown a dependency between the copy number status and the expression level of the mRNA product. Allelic imbalance measured in RNA-seq experiments is partially associated with the copy number status of the underlying genomic DNA; yet, as mentioned above, it is not feasible to correctly infer the exact CNA status due to a variety of confounding effects, including heterogeneity. We therefore do not expect RNA-seq alone to be useful in inferring focal copy number aberrations in cancer samples. Next-generation sequencing signals, however, are suitable for studying heterogeneity and characterizing subclonal components by their collection of specific markers. Subclonal heterogeneity is not reflected exclusively in CNAs; it is also reflected in other somatic mutations (e.g., point mutations) and other traits, such as DNA methylation and histone modification.
We analyzed RNA-seq experiments to unravel subclonal heterogeneity. We show that some of the measured allele-specific expression patterns result from differences in the abundance of subclonal populations, each harboring different acquired mutations (Figure 5A). The reported loci are remarkable examples of novel candidate somatic polymorphisms, likely associated with subclonal populations. This approach has striking conceptual and methodological simplicity, and in the near future deviations in the distribution of allelic imbalances along exons might be used to infer the extent of the sample's heterogeneity. The possibility of identifying novel candidate somatic mutations associated with subclonal populations requires experimental strategies that will enable separating different subpopulations for further analyses.
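The link between allelic read counts and subclone abundance can be sketched under strong simplifying assumptions, all of them ours rather than results of this study: the mutation is heterozygous at a diploid locus, both alleles are expressed equally, and all cells express the gene at the same level. Under those assumptions a subclone of abundance f contributes an expected variant allele fraction of f/2. The function names and read counts are hypothetical.

```python
def alt_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads carrying the alternate allele at one site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def implied_subclone_fraction(alt_fraction):
    """Subclone abundance implied by f/2 = alt_fraction (assumptions in the text)."""
    return min(1.0, 2 * alt_fraction)


# Hypothetical site: 30 reference reads and 10 variant reads.
observed = alt_allele_fraction(30, 10)
abundance = implied_subclone_fraction(observed)
```

Allele-specific expression and copy number changes at the locus would both break these assumptions, so such an estimate is at best a rough guide to relative abundance.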
Our results uncover a large degree of heterogeneity at the level of genomic DNA, both in terms of CNA and point mutations. Interestingly enough, heterogeneity is present in both primary and metastatic tumors, suggesting that the variety of underlying mutations may already be overwhelming at diagnosis. The identification of driver mutations, which typically requires examination of DNA at high resolution, is linked to our ability to detect subclones capable of escaping the selective survival pressure and metastasizing. Detecting potentially rare subclones at this resolution requires very deep sequencing of large cohorts of patients.