### De-mixing of samples

In a homogeneous sample *s*, the exact genotype of each cell in the sample can be represented by the pair of vectors, A and B, containing respectively the number of copies of the A- and B-allele at each locus. For the sake of simplicity, we assume that each locus is at most bi-allelic, namely that only two variants, A and B, can occur. In the case of tumor biopsies, or studies in which cells from several individuals are pooled together, the measured and vectors correspond to the average copy number of the A- and B-alleles at each locus across all cells in the sample. When studying heterogeneous samples, it is very unlikely to find any cell in the sample whose genotype corresponds exactly to the measured profile.

The "de-mixing problem" is the task of recovering the fraction *x*_{i} , called the mixing coefficient, and the genotype *g*_{i} = {*A*_{i}*, B*_{i} } of each i-th homogeneous group of cells in the sample (e.g. the different subclones). This problem corresponds to solving the linear system

,

where *x*_{i} is an element of the vector X, *j* is the locus index, *a*_{ji} and *b*_{ji} are elements of the matrices *A* and *B* respectively, and *N*_{0} is the set of all non-negative integers.

### Deconvoluting genotyping data signal from heterogeneous cell population

The de-mixing problem is trivially solved when only two independent components (e.g. from two different individuals) are mixed together, and their copy number, *a*_{ji} *+ b*_{ji} , is measured using independent experimental approaches in at least two loci. However, in general, deconvolution of CNA signals from a heterogeneous sample to its homogeneous components is an ill-posed mathematical problem. Linear algebra solution approaches will encounter non-unique solutions corresponding to a non-unique combination of possible discrete vectors representing different subclonal aberration profiles.

Regularization methods (such as L1 minimization for recovery of sparse signals) are often employed to reduce the number of feasible solutions in underdetermined problems. The underlying assumption of applying regularization in this context is that a sample will not generally contain all theoretical subclones, defined by all possible combinations of aberrations. Instead, a few subclones arising during cancer progression will be selected. This assumption is also consistent with the possibility that subclones' viability may be hindered by the lethal synergy of some of their aberrations. Donoho and Tanner reported key results on regularization methods from counting faces of randomly projected polytopes when the projection radically lowers dimension [9]. It is possible to show that even in the simplified problem with *a*_{ji}∈ {0,1} *b*_{ji}∈ {0,1}, the required condition for sparse reconstruction,

is not fulfilled for any *s* > 2, where 2^{s}-1 is the size of the search space, with *s* equal to the number of loci measured, *k* is the number of components (e.g. observable subclones) and *s*+1 is the maximum number of equations, corresponding to the number *s* of unique measurements (e.g. SNPs), including the additional constraint that the sum of mixing coefficients is 1. Thus, a mixture of 3 or more components, which is typical in tumor biopsies, cannot be uniquely deconvolved solving the linear system, or using sparse reconstruction methods. In the presence of heterogeneous samples no inference can be made of the exact copy number from the aggregate signal of the subclones.

To illustrate these points, we provide a detailed example of the conditions and solution of copy number inference for a mixture of a tumor subclone component with its matched stromal component. In this scenario, de-mixing is elicited by additional constraints on the copy number of the stromal and of the tumor components. The stromal component is assumed to have no CNA, thus its number of copies is constant and equal to 2. The linear system in Eq.1 of the measured aggregate genotypes at the j-th locus is reduced to:

,

where *x* is the mixing coefficient of the stromal component, *a*_{j1} and *b*_{j1} are the A- and B-allele copy number of the tumor, *a*_{j2} is the A-allele copy number of the stromal component and *N*_{0} is the set of all non-negative integers. In addition, we include the equation *a*_{j1}*+b*_{j1} *= m*_{j} in the system in order to restrict the copy number of the tumor component at the j-th locus to an integer *m*_{j} . We consider only the loci harboring deletion events with an aggregate copy number smaller than two, where m_{j} - the copy number of the tumor component - can take one of two values: one if the deletion in the tumor cells is hemizygous, or zero if it is homozygous. We then solve the system at each locus independently for the two possible values of *m*_{j} and determine the two corresponding possible values of *x*. Since there is only one tumor component, there must exist one value of *x* to solve Eq.3 for all loci simultaneously. The mode of all values of *x* across all loci corresponds to this value. The corresponding value of *m*_{j} at the j-th locus is then the exact copy number of the tumor component.

### A simple framework for CNA analysis

CNA analysis encompasses the tasks of detecting and classifying copy number aberrations. Current HMM based applications for CNA analysis rely on the assumption that increasing the number of hidden states, thereby increasing the complexity of the underlying model, will eventually result in higher CNA classification accuracy [10–12]. As detailed above, determination of the exact number of copies is an ill-posed problem in the presence of heterogeneous samples. Therefore, we propose to detect substantial deviation from the normal diploid state by classifying the aberrations as gains or losses inferred from the dominant component in the mixture. The goal of CNA analysis then becomes to infer the most recurring aberration that would be observed in a hypothetical, high-resolution karyotyping of the same heterogeneous sample, if it were available.

Peiffer et al. [13] employed a convenient transformation of *A-* and *B-* allele copy numbers to *B-* allele frequency β and the total-DNA enrichment ρ. For clarity of discussion, we use a simpler version of this transformation for each SNP defined as:

,

where *N* is the total number of cells in the sample, are the allele counts estimated from the SNP-arrays. Peiffer and et al. suggest estimating this quantity from the signal of a normal population (e.g. HapMap samples [5]). In general, occurrence of CNA is reflected in the total-DNA enrichment. In the presence of heterogeneous samples, these changes can be difficult to detect due to the high level of noise. As expected, in heterogeneous samples, aberrations also lead to allelic imbalance at positions of heterozygous SNPs (Figure 1). We thus propose to base CNA detection in heterogeneous samples on measures of allelic imbalance.

One such measure of allelic imbalance is

,

where W corresponds to a window of appropriate size. This quantity is an aberration score for each SNP, based on the measure of allelic imbalance in its neighborhood. We denote this measure as the M-measure and use it for the remainder of the analyses with W = 20. W = 20 is an arbitrary choice of a window large enough to have a reasonably robust estimate of the M-Measure. The M-measure is insensitive to hemizygous deletion events if they occur in the majority of cells in the sample. We address this issue by testing whether the total-DNA enrichment ρ, also known as R-ratio, is significantly below its expected value of 1. We use the M-measure to classify SNP CNA profiles into three states: gain, loss and neither. The M-measure is a measure of deviation from the expected state of normal non-tumor sample. In order to have a measure that is robust to noise, we chose trigonometric functions such that their product is close to zero in a wide region around the value of the expected state of a normal, non-tumor sample. These trigonometric functions have a non-linear rapid change of the deviation score once the threshold of having a true gain or loss is crossed. As previously discussed, in this context there are numerous solutions to the problem of inferring the exact number of copies consistent with the measured aggregate cell population signal (Eq.1). As none of the possible solutions can be preferred to the others, characterization of CNAs in terms of gain and loss is a practical alternative.

The three-state M-measure classification can also be used in a three-state Viterbi algorithm whose hidden states are normal, deletion, and duplication segments. We name this classification approach 3SMM. We use 3SMM only in our simulated scenarios. In order to use the best parameters for our simulations, we determined transition and emission probabilities from the underlying true copy number status of the simulated data for each run.

### Simulated data

We randomly generated two haplotypes of 10,000 loci to simulate a mixture of a homogeneous tumor subclone and its matched stromal cells. A haplotype is a vector of zeros and ones, representing the two alleles respectively. Our random generation process ensures that at each locus, the germline had a 90.5% probability of being homozygous and a 9.5% probability of being heterozygous. The stromal (non aberrant) component of the mixture was obtained by combining the two haplotypes as described in Eq.1. The tumor (aberrant) component was obtained by adding aberrations to the two haplotypes. To have a unique profile of aberration status, we excluded simultaneous occurrence of different kinds of aberrations (e.g. deletion and duplication) at the same locus. We mixed the normal and aberrant components with proportions of α and (1-α) respectively. Gaussian noise with a signal-to-noise ratio of 30 was used to simulate experimental noise.

We then simulated samples comprising two tumoral subclones and their matched stromal component. As in the case with one subclone, we excluded simultaneous occurrence of different kind of aberrations (e.g. deletion and duplication) at the same locus in the same subclone (e.g. neutral LOH); however, the two tumor components were generated such that their aggregate signal contained all combinations of aberrations (e.g. deletion and duplication). The three components were then mixed by a linear combination with proportions of α, 2(1-α)/3 and (1-α)/3 corresponding to the normal component and the remaining two tumoral subclones respectively. As in the case with two components, Gaussian noise with a signal-to-noise ratio of 30 was used to simulate experimental noise.

We tested four different types of aberrations: homozygous deletion, hemizygous deletion, a gain of one copy, and a gain of two copies. We did not test neutral LOH since it is not an event that would affect the copy number. Each aberration covers 1,000 loci (~20 Mbps). Aberrations were separated by 1,000 aberration-free loci (~20 Mbps). Our aim was to test the classification performance of different approaches to heterogenous-like data rather than testing the sensitivity to detect short focal events. This motivated our choice of aberrations of 1000 loci each.

### Measure of performance

To assess algorithmic performance we used the Area Under the Receiver-Operator Characteristic Curve (AUCROC) measure. In the present study, which is different from the usual formulation, in which the classification task is to distinguish between a set of positive instances and a set of negative instances, we had three class labels: gain, loss and normal. In a typical machine learning scenario in which there is no parameter to vary, the AUCROC in the operative point (which is also termed balanced accuracy) is computed as

,

where *TP* is the number of true positives, *P* is the number of positives, *TN* is the number of true negatives and *N* is the number of negatives. Similarly, in our three-classes scenario, the AUCROC can be computed as

,

where *TP*_{gain} is the number of SNPs that are amplified and correctly inferred as such by the algorithm, *P*_{gain} is the total number of SNPs that are amplified in the sample, *TN*_{gain} is the number of SNPs that are not amplified and that are correctly inferred as such by the algorithm, *N*_{gain} is the total number of SNPs that are not amplified in the sample; *TP*_{loss} is the number of SNPs that are deleted and correctly inferred as such by the algorithm, *P*_{loss} is the total number of SNPs that are deleted in the sample, *TN*_{loss} is the number of SNPs that are not deleted and that are correctly inferred as such by the algorithm, *N*_{loss} is the total number of SNPs that are not deleted in the sample; *TP*_{normal} is the number of SNPs that are not aberrated and that are correctly inferred as such by the algorithm, *P*_{normal} is the total number of SNPs that are not aberrated in the sample, *TN*_{normal} is the number of SNPs that are aberrated and correctly inferred as such by the algorithm, and *N*_{normal} is the total number of SNPs that are aberrated in the sample.

### SNP profiling using microarrays

DNA from 30 melanoma cell lines were hybridized to Illumina's Human1M BeadChip (Illumina Inc. San Diego, CA). We generated B-allele frequencies and Log-R ratios using standard procedures included in the Illumina BeadStudio package. We normalized with respect to the population of western European ancestry (CEU) from the HapMap project that was analyzed on the Illumina Human1M BeadChip.

### Design, probe annotation, and data processing of the arrays for detection of genome-wide gene expression

We used NimbleGen genome-wide human expression arrays (2005-04-20_Human_60mer_1in2) with a total of < 400,000 probes for < 30,000 transcripts and < 20,000 known genes, as of NimbleGen annotations. NimbleGen provides design and probe annotation. Standard methods for one-channel and two-channel microarrays from the R statistical software were used as previously described [14].

### Transcriptome profiling using next-generation sequencing

We re-analyzed the RNA-seq sample MeWo from a recent study [15, 16]. Namely, the reads were aligned to the reference genome using Bowtie [17] with standard parameters. Nucleotide variations were determined after pileup using Samtools [18], and the frequency of the variant, *β*, was calculated as in Eq.4.

### Fluorescent in situ hybridization (FISH)

Fluorescence in situ hybridization (FISH) was performed using probes from the bacterial artificial chromosome (BAC) clones (RPCI-11 human BAC library) containing the selected genes *E2F8* (248D22 and 80B10 at 11p15.1), *ETV4* (100E5 and 147C10 at 17q21.31), *EZH2* (140E16 and 24N19 at 7q36.1) and *FAM84B* (455K11 and 90G11 at 8q24.21). All BAC clones were cultured in 100 ml LB media supplemented with chloramphenicol at 37°C shaker incubator overnight, and cell pellets collected by centrifuge were used for DNA extraction using the large-construct kit (Qiagen, Valencia, CA). Two BAC clones for the 5'-end or the 3'-end of each gene were labeled differently by SpectrumGreen-dUTP or SpectrumOrange-dUTP using the nick translation kit (Abbott Molecular, Des Plaines, IL). Probe hybridization on slides of interphase cells was performed following the laboratory's standardized protocol. Hybridization signals were visualized and captured using an Olympus BX60 fluorescence microscope with CytoVision software version 4.5.2 (Genetix, San Jose, CA). In each sample, 200 nuclei were inspected and the signal patterns were documented.