In this section, we introduce the generative mixture model of MixClone, which is an extension of our previous work on tumor purity estimation[11]. First, we introduce the notations for input data. Then, we describe the probabilistic models for sequence information of both SCNAs and allele frequencies. Finally, we combine these two types of data into a single likelihood model, and describe an algorithm to solve the model.
Basic notations
The raw input data for MixClone are two aligned whole genome sequencing read sets of paired normal-tumor samples and a genome segmentation file based on the tumor sample. Following the notations from our previous work [11], we assume the tumor genome has been partitioned into J segments. We also assume there are I
j
heterozygous SNP sites within segment j in the corresponding normal genome, and use (i, j) to index SNP site i within segment j. For each SNP site (i, j) we define the A allele to be the reference allele and the B to be the alternative allele, with respect to the reference genome. We also use a superscript N to denote data from normal samples and superscript T to denote data from tumor samples. Overall, the observed data are summarized in the following notations [13]:
= number of reads mapped to the B allele in the normal sample at site (i, j).
= reads depth of the normal sample at site (i, j).
= total number of reads mapped to segment j of the normal sample.
The notations for the observed data from tumor samples are similarly defined, e.g. denotes total number of reads mapped to segment j of the tumor sample.
Modeling SCNAs
Next, we describe the probabilistic model for SCNAs data. For each segment j, we define an allelic configuration H
j
to represent its underlying allele-specific copy number status. For example, if the absolute copy number of segment j is 2, then the compatible allelic configurations are PP, MM and PM, where P and M denotes the paternal and maternal allele of the tumor genome, respectively. Since PP and MM are not distinguishable based on sequence information alone as the reference human genome is not phased, we define the set of all possible allelic configuration as
(1)
assuming the maximum copy number for each segment is 3. The corresponding copy number associated with each allelic configuration in is then
(2)
MixClone allows the user to specify the maximum copy number and the default value is 6 in the released package [11]. We further assume there are K subclonal populations within the tumor sample, each of which has an associated cellular prevalence ϕ
k
∈ 0[1]. The subclonal type of each segment j is denoted as
(3)
representing one of the K possible subclonal populations. Given the allelic configuration H
j
= h and the subclonal type Z
j
= k, the average copy number of segment j within the tumor sample, taking into account the subclonal cellular prevalence ϕ
k
, is
(4)
Based on the Lander-Waterman model [14], the probability of sampling a read from a given segment j depends on three main factors: 1) its copy number, 2) its total genomic length, and 3) its mappability, which depends on factors such as repetitive sequence and GC content [9]. For each segment j, we associate a coefficient θ
j
to account for the effect of its mappability and genomic length. Thus the expected read counts mapped to segment j, which is denoted as λ
j
, is proportional to . For example, for segment x and segment y, we have
(5)
Because the mappability coefficients (θ
j
's) matter only in a relative sense, we take , as these segments should have the same sequence properties between the normal and tumor samples.
Additionally, to determine the absolute value of λ
j
, we curate a list of segments which contain no loss of heterozygosity according to their allele frequencies information. Based on the observed number of reads mapped to each segment, we further remove "outlier" segments from the list if their copy numbers are different from the bulk of the segments' copy numbers in the list. Finally, we call the remaining segments in the list as "baseline segments" and denote the set of these segments as S. We assume the allelic configurations of all the baseline segments are PM with copy number n
s
= 2. Other possible allelic configurations for baseline segments, which have equal copy numbers for each allele (e.g. φ, PPMM), are likely to be rare, and currently we do not model them. Then based on n
s
, we specify λ
j
as follows
(6)
where denotes the number of reads mapped to segment s of the tumor sample.
Finally, we model the number of reads mapped to segment j in the tumor sample as a Poisson distribution, given H
j
and Z
j
(7)
Details on curating the baseline segments are given in Supplementary, Additional file 1.
Modeling allele frequencies
Next, we describe the probabilistic model used for allele frequencies of heterozygous SNP data. For each SNP site i within segment j, we denote its tumor genotype as G
ij
, which is selected from the set of all possible tumor genotypes up to a maximum copy number alteration, e.g.
(8)
assuming the maximum copy number is 3. The corresponding B allele frequencies (BAF) for all the genotypes in are
(9)
in which, ε ≪ 1 is a small random deviation accounting for general sequencing errors. We choose E = 0.01, which is equivalent to a Phred quality of 20 [15].
Given the tumor genotype G
ij
= g, the allelic configuration H
j
= h, and the subclonal type Z
j
= k, the average BAF of site (i, j) within the tumor sample, taking into account the subclonal cellular prevalence φ
k
, is
(10)
in which µ0 = 0.5 is the BAF of heterozygous SNP sites in the normal sample. Finally, we model the distribution of the B allele count at site (i, j) as a binomial distribution, given G
ij
, H
j
and Z
j
(11)
Combining SCNAs and allele frequencies
Now, we combine sequence information from both SCNAs and heterozygous SNP sites. For all the heterozygous SNP sites within the same segment, their genotypes should be consistent with the underlying allelic configuration of the segment. We model this consistency through a predefined conditional probability . If the genotype g is inconsistent with the allelic configuration h, e.g. AA is inconsistent with PM, we assign a small probability σ as Q
gh
, otherwise we assign equal probabilities to genotypes that are consistent with the allelic configuration.
Conditional on the underlying allelic configuration H
j
and subclonal type Z
j
, the probability of observing B allele read count at site (i, j) is given as
(12)
We assume that conditional on the allelic configuration H
j
, the B allele read counts at different sites within the same segment j are independent of each other, and are also independent of the total read count of the segment. Then, the joint probability of observing the two types of read counts information of segment j is
(13)
Likelihood model
We have specified the joint distribution of the two types of read counts information of segment j. We then further model the allelic configuration H
j
and the subclonal type Z
j
of segment j as random variables that follow categorical distributions
(14)
(15)
ρ
j
= (ρj∅, ⋯, ρjPPM/PMM), where is the probability of observing h as the allelic configuration of segment j. π = (π1, ⋯, π
K
), where is the probability of observing subclonal type k for all the segments. The model parameters Θ is defined as
(16)
And the model likelihood of observing all the data is then
(17)
We use Expectation-Maximization (EM) algorithm [16] to find the maximum likelihood estimation of Θ. The complete details of the EM updates are given in Supplementary, Additional file 1.
Model selection
One of the key issues in subclonal analysis is to determine the number of subclonal populations K. PyClone and PhyloSub use posterior sampling methods to estimate K [5, 7], while THetA requires users to specify K as an input [9]. Since the probabilistic model of MixClone is a generative mixture model, the model complexity and the corresponding log-likelihood increases as K increases. Therefore, we use a criterion based on the increase of the log-likelihood to select K. Practically, Mix-Clone allows the user to specify K. If K is not specified, MixClone runs the mixture model five times with different K in range of 1 to 5. We denote the log-likelihoods under the five different settings as , and the total log-likelihood increase as
If |Δ/L1| < 0.01, which means the ratio of total log-likelihood increase is less than 0.01, MixClone predicts there is no subclonal event in the tumor sample and selects K = 1 as the number of subclonal populations. If |Δ /L1| ≥ 0.01, MixClone further calculates another quantity
(19)
which is the cumulative log-likelihood increase from K = 1 to K = i as a percentage regarding to the total increase Δ. If δ
i
≥ 0.9 and δi−1 < 0.9, MixClone selects K = i as the number of subclonal populations.
In practice, we suggest users use this criterion as a heuristic guide when analyzing real data, and determine the number of subclonal populations in conjunction with regard to other external information.
MixClone software package
Figure 1 is the general workflow of MixClone. MixClone is a comprehensive software package, including subclonal cellular prevalences estimation, allelic configuration estimation, absolute copy number estimation and a few visualization tools. This package is implemented in Python and is built on top of the PyLOH package, previously released by us [11]. It also utilizes some features from the software package JointSNVMix [13], which have been explicitly indicated in the source code.