MixClone: a mixture model for inferring tumor subclonal populations

Li, Yi; Xie, Xiaohui

doi:10.1186/1471-2164-16-S2-S1

Volume 16 Supplement 2

Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics

Proceedings
Open access
Published: 21 January 2015

Yi Li¹ &
Xiaohui Xie^1,2,3

BMC Genomics volume 16, Article number: S1 (2015)


Abstract

Background

Tumor genomes are often highly heterogeneous, consisting of genomes from multiple subclonal types. Complete characterization of all subclonal types is a fundamental need in tumor genome analysis. With the advancement of next-generation sequencing, computational methods have recently been developed to infer tumor subclonal populations directly from cancer genome sequencing data. Most of these methods are based on sequence information from somatic point mutations. However, the accuracy of these algorithms depends crucially on the quality of the somatic mutations returned by variant calling algorithms, and they usually require deep coverage to achieve a reasonable level of accuracy.

Results

We describe a novel probabilistic mixture model, MixClone, for inferring the cellular prevalences of subclonal populations directly from whole genome sequencing of paired normal-tumor samples. MixClone integrates sequence information of somatic copy number alterations and allele frequencies within a unified probabilistic framework. We demonstrate the utility of the method using both simulated and real cancer sequencing datasets, and show that it significantly outperforms existing methods for inferring tumor subclonal populations. The MixClone package is written in Python and is publicly available at https://github.com/uci-cbcl/MixClone.

Conclusions

The probabilistic mixture model proposed here provides a new framework for subclonal analysis based on cancer genome sequencing data. By applying the method to both simulated and real cancer sequencing data, we show that integrating sequence information from both somatic copy number alterations and allele frequencies can significantly improve the accuracy of inferring tumor subclonal populations.

Background

Tumor genomes have been known to exhibit extensive cellular heterogeneity for decades, since Nowell's original clonal theory of tumor progression [1]. Identifying tumor subclonal populations is important both for understanding the evolution of tumor cells and for designing more effective treatments, as pre-existing mutations occurring in some subclones could lead to drug resistance [2]. For example, a study of chronic lymphocytic leukemia has shown links between the presence of driver mutations within subclones and adverse clinical outcomes [3].

With the advancement of next-generation sequencing (NGS) and launch of large-scale cancer genome sequencing projects [4], computational methods have recently been developed to infer tumor subclonal populations based on cancer genome sequencing data [5–9].

Most of these methods rely on sequence information from somatic point mutations, such as PyClone [5], EXPANDS [6], PhyloSub [7] and rec-BTP [8]. Methods in this category leverage the cluster pattern of allele frequencies at somatic point mutations to detect distinct subclonal populations. However, as the determination of somatic point mutations is imperfect and the inclusion of false-positives is unavoidable [10], deep sequencing with more than 100X coverage is often required for subclonal inferences with high sensitivity and specificity [5, 7, 8].

Other approaches utilizing the read depth information from genomic segments with somatic copy number alterations (SCNAs) to infer the cellular prevalences of subclonal populations have also been developed, such as THetA [9]. THetA explores all combinations of copy number changes across all segments to infer the most likely collection of subclonal populations [9]. However, with the copy number information alone, THetA suffers from the "identifiability problem", where distinct combinations of tumor purity and ploidy are able to explain the read depth information from SCNAs equally well [9]. Additionally, the running time of THetA scales exponentially with the number of genomic segments [9], and often takes a prohibitively long time to run under certain parameter settings.

In this article, we present a novel probabilistic mixture model, MixClone, to infer the cellular prevalences of subclonal populations. MixClone integrates both read depth information from genomic segments with SCNAs and allele frequency information from heterozygous single-nucleotide polymorphism (SNP) sites within a unified probabilistic framework. Such an integrative framework has been shown to significantly improve the accuracy of tumor purity estimation in our previous work [11]. Here, we show that MixClone has two major advantages over existing methods: (i) it does not require deep sequencing data, and (ii) it resolves the identifiability problem. To demonstrate MixClone's utility, we conducted simulation studies and showed that it outperforms existing methods. We also applied MixClone to a breast cancer sequencing dataset [12], and showed that it was able to discover subclonal events not reported before.

Methods

In this section, we introduce the generative mixture model of MixClone, which is an extension of our previous work on tumor purity estimation [11]. First, we introduce the notation for the input data. Then, we describe the probabilistic models for sequence information of both SCNAs and allele frequencies. Finally, we combine these two types of data into a single likelihood model, and describe an algorithm to fit the model.

Basic notations

The raw input data for MixClone are two aligned whole genome sequencing read sets of paired normal-tumor samples and a genome segmentation file based on the tumor sample. Following the notation of our previous work [11], we assume the tumor genome has been partitioned into J segments. We also assume there are I_j heterozygous SNP sites within segment j in the corresponding normal genome, and use (i, j) to index SNP site i within segment j. For each SNP site (i, j) we define the A allele to be the reference allele and the B allele to be the alternative allele, with respect to the reference genome. We use a superscript N to denote data from normal samples and a superscript T to denote data from tumor samples. Overall, the observed data are summarized in the following notation [13]:

$b_{i j}^{N}$ = number of reads mapped to the B allele in the normal sample at site (i, j).

$d_{i j}^{N}$ = read depth of the normal sample at site (i, j).

$D_{j}^{N}$ = total number of reads mapped to segment j of the normal sample.

The notations for the observed data from tumor samples are similarly defined, e.g. $D_{j}^{T}$ denotes total number of reads mapped to segment j of the tumor sample.

Modeling SCNAs

Next, we describe the probabilistic model for SCNA data. For each segment j, we define an allelic configuration H_j to represent its underlying allele-specific copy number status. For example, if the absolute copy number of segment j is 2, then the compatible allelic configurations are PP, MM and PM, where P and M denote the paternal and maternal alleles of the tumor genome, respectively. Since PP and MM are not distinguishable based on sequence information alone, as the reference human genome is not phased, we define the set of all possible allelic configurations as

H_{j} \in H = \{\emptyset, P/M, PP/MM, PM, PPP/MMM, PPM/PMM\}

(1)

assuming the maximum copy number for each segment is 3. The corresponding copy number associated with each allelic configuration in $H$ is then

n_{h} = \{0, 1, 2, 2, 3, 3\}

(2)

MixClone allows the user to specify the maximum copy number and the default value is 6 in the released package [11]. We further assume there are K subclonal populations within the tumor sample, each of which has an associated cellular prevalence ϕ_k ∈ [0, 1]. The subclonal type of each segment j is denoted as

Z_{j} \in Z = \{1, 2, \cdots, K\}

(3)

representing one of the K possible subclonal populations. Given the allelic configuration H_j = h and the subclonal type Z_j = k, the average copy number of segment j within the tumor sample, taking into account the subclonal cellular prevalence ϕ_k, is

{\bar{C}}_{j} = ϕ_{k} n_{h} + (1 - ϕ_{k}) 2

(4)
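As a small illustration of Eq. (4), the following Python sketch computes the average copy number of a segment for a given cellular prevalence. The dictionary keys and the function name are hypothetical and are not taken from the MixClone package; the string "0" stands in for the empty configuration.

```python
# Copy numbers n_h for the allelic configurations of Eq. (1)-(2),
# assuming a maximum copy number of 3. "0" denotes the empty configuration.
COPY_NUMBER = {"0": 0, "P/M": 1, "PP/MM": 2, "PM": 2, "PPP/MMM": 3, "PPM/PMM": 3}


def average_copy_number(phi_k, n_h):
    """Eq. (4): a fraction phi_k of cells carries n_h copies, the rest carries 2."""
    if not 0.0 <= phi_k <= 1.0:
        raise ValueError("cellular prevalence must lie in [0, 1]")
    return phi_k * n_h + (1.0 - phi_k) * 2


# A hemizygous deletion (P/M) present in half of the cells:
# 0.5 * 1 + 0.5 * 2 = 1.5
print(average_copy_number(0.5, COPY_NUMBER["P/M"]))
```

Note that a prevalence of 0 always gives an average copy number of 2, whatever the allelic configuration, which is one reason a diploid segment carries no information about ϕ_k.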

Based on the Lander-Waterman model [14], the probability of sampling a read from a given segment j depends on three main factors: 1) its copy number, 2) its total genomic length, and 3) its mappability, which depends on factors such as repetitive sequence and GC content [9]. For each segment j, we associate a coefficient θ_j to account for the effect of its mappability and genomic length. Thus the expected number of reads mapped to segment j, denoted λ_j, is proportional to ${\bar{C}}_{j} θ_{j}$. For example, for segment x and segment y, we have

\frac{λ_{x}}{λ_{y}} = \frac{{\bar{C}}_{x} θ_{x}}{{\bar{C}}_{y} θ_{y}}

(5)

Because the mappability coefficients (θ_j's) matter only in a relative sense, we take $θ_{x} / θ_{y} = D_{x}^{N} / D_{y}^{N}$, as these segments should have the same sequence properties in the normal and tumor samples.

Additionally, to determine the absolute value of λ_j, we curate a list of segments that show no loss of heterozygosity according to their allele frequency information. Based on the observed number of reads mapped to each segment, we further remove "outlier" segments from the list if their copy numbers differ from those of the bulk of the segments in the list. Finally, we call the remaining segments "baseline segments" and denote the set of these segments as S. We assume the allelic configurations of all the baseline segments are PM with copy number n_s = 2. Other possible allelic configurations for baseline segments, which have equal copy numbers for each allele (e.g. ∅, PPMM), are likely to be rare, and currently we do not model them. Then, based on n_s, we specify λ_j as follows

λ_{j} = \frac{1}{| S |} \sum_{s \in S} \frac{{\bar{C}}_{j} θ_{j}}{n_{s} θ_{s}} D_{s}^{T}

(6)

where $D_{s}^{T}$ denotes the number of reads mapped to segment s of the tumor sample.

Finally, we model the number of reads mapped to segment j in the tumor sample as a Poisson distribution, given H_j and Z_j

D_{j}^{T} | H_{j}, Z_{j} ~ Poisson (λ_{j})

(7)

Details on curating the baseline segments are given in Supplementary, Additional file 1.
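The read depth model of Eqs. (6) and (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy read counts are hypothetical, the ratio θ_j / θ_s is replaced by the ratio of normal read counts as described above, and the baseline segments are taken as given rather than curated.

```python
import math


def expected_read_count(c_bar_j, d_normal_j, baseline, n_s=2):
    """Eq. (6), with theta_j / theta_s replaced by D_j^N / D_s^N.

    baseline is a list of (D_s^N, D_s^T) pairs, one per baseline segment.
    """
    terms = [
        (c_bar_j * d_normal_j) / (n_s * d_normal_s) * d_tumor_s
        for d_normal_s, d_tumor_s in baseline
    ]
    return sum(terms) / len(terms)


def poisson_log_pmf(count, lam):
    """Log of the Poisson probability in Eq. (7); requires lam > 0."""
    return count * math.log(lam) - lam - math.lgamma(count + 1)


# Hypothetical counts: two baseline segments and one segment of interest
# with 500 normal reads and an average copy number of 1.5.
baseline = [(1000, 1200), (2000, 2400)]
lam = expected_read_count(1.5, 500, baseline)
print(lam, poisson_log_pmf(450, lam))
```

In this toy case both baseline segments give the same term, 1.5 × 500 / (2 × 1000) × 1200 = 450, so λ_j = 450. The logarithm is undefined when λ_j = 0, which happens only for a homozygous deletion with ϕ_k = 1; a real implementation would need to treat that case separately.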

Modeling allele frequencies

Next, we describe the probabilistic model used for allele frequencies of heterozygous SNP data. For each SNP site i within segment j, we denote its tumor genotype as G_ij, which is selected from the set of all possible tumor genotypes up to a maximum copy number alteration, e.g.

G = \{\emptyset, A, B, AA, AB, BB, AAA, AAB, ABB, BBB\}

(8)

assuming the maximum copy number is 3. The corresponding B allele frequencies (BAF) for all the genotypes in $G$ are

μ_{g} = \{\frac{1}{2}, ϵ, 1 - ϵ, ϵ, \frac{1}{2}, 1 - ϵ, ϵ, \frac{1}{3}, \frac{2}{3}, 1 - ϵ\}

(9)

in which ϵ ≪ 1 is a small deviation accounting for general sequencing errors. We choose ϵ = 0.01, which is equivalent to a Phred quality of 20 [15].

Given the tumor genotype G_ij = g, the allelic configuration H_j = h, and the subclonal type Z_j = k, the average BAF of site (i, j) within the tumor sample, taking into account the subclonal cellular prevalence ϕ_k, is

{\bar{μ}}_{i j} = \frac{ϕ_{k} n_{h} μ_{g} + (1 - ϕ_{k}) 2 μ_{0}}{ϕ_{k} n_{h} + (1 - ϕ_{k}) 2}

(10)

in which µ₀ = 0.5 is the BAF of heterozygous SNP sites in the normal sample. Finally, we model the distribution of the B allele count $b_{i j}^{T}$ at site (i, j) as a binomial distribution, given G_ij , H_j and Z_j

b_{i j}^{T} | d_{i j}^{T}, G_{i j}, H_{j}, Z_{j} ~ Binomial (d_{i j}^{T}, {\bar{μ}}_{i j})

(11)
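To make Eqs. (10) and (11) concrete, the sketch below computes the average BAF of a site and the binomial log-probability of a B allele count. The function names and the example numbers are hypothetical; ϵ = 0.01 and µ₀ = 0.5 are the values stated in the text.

```python
import math

EPS = 0.01   # sequencing error rate, as chosen in the text
MU_0 = 0.5   # BAF of heterozygous SNP sites in normal cells


def average_baf(phi_k, n_h, mu_g):
    """Eq. (10): copy-number-weighted mix of tumor and normal BAFs.

    The denominator is zero only when phi_k = 1 and n_h = 0.
    """
    numerator = phi_k * n_h * mu_g + (1.0 - phi_k) * 2 * MU_0
    denominator = phi_k * n_h + (1.0 - phi_k) * 2
    return numerator / denominator


def binomial_log_pmf(b, d, mu):
    """Log of the binomial probability in Eq. (11), for 0 < mu < 1."""
    log_choose = math.lgamma(d + 1) - math.lgamma(b + 1) - math.lgamma(d - b + 1)
    return log_choose + b * math.log(mu) + (d - b) * math.log(1.0 - mu)


# Genotype A (mu_g = EPS) in a P/M segment (n_h = 1) at prevalence 0.5:
# (0.5 * 1 * 0.01 + 0.5 * 2 * 0.5) / (0.5 * 1 + 0.5 * 2) = 0.505 / 1.5
mu_bar = average_baf(0.5, 1, EPS)
print(mu_bar, binomial_log_pmf(34, 100, mu_bar))
```

The example shows how a deletion shifts the expected BAF away from 0.5 by an amount that depends on ϕ_k, which is the signal the allele frequency model uses to constrain the cellular prevalence.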

Combining SCNAs and allele frequencies

Now, we combine sequence information from both SCNAs and heterozygous SNP sites. For all the heterozygous SNP sites within the same segment, their genotypes should be consistent with the underlying allelic configuration of the segment. We model this consistency through a predefined conditional probability $Q_{g h} = ℙ (G_{i j} = g | H_{j} = h)$ . If the genotype g is inconsistent with the allelic configuration h, e.g. AA is inconsistent with PM, we assign a small probability σ as Q_gh, otherwise we assign equal probabilities to genotypes that are consistent with the allelic configuration.
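One possible way to build the table Q_gh is sketched below. Several details are assumptions rather than facts from the text: the mapping of consistent genotypes to each allelic configuration is inferred from the single example given (AA is inconsistent with PM), the value of σ is a placeholder, and the renormalization that makes each column sum to one is a choice made here for illustration.

```python
GENOTYPES = ["0", "A", "B", "AA", "AB", "BB", "AAA", "AAB", "ABB", "BBB"]

# Assumed mapping from allelic configuration to consistent genotypes.
CONSISTENT = {
    "0": ["0"],
    "P/M": ["A", "B"],
    "PP/MM": ["AA", "BB"],
    "PM": ["AB"],
    "PPP/MMM": ["AAA", "BBB"],
    "PPM/PMM": ["AAB", "ABB"],
}


def build_consistency_table(sigma=1e-4):
    """Return q[h][g] = P(G = g | H = h); sigma is a placeholder value."""
    table = {}
    for h, consistent in CONSISTENT.items():
        n_inconsistent = len(GENOTYPES) - len(consistent)
        # Remaining mass is shared equally among the consistent genotypes.
        p_consistent = (1.0 - sigma * n_inconsistent) / len(consistent)
        table[h] = {
            g: (p_consistent if g in consistent else sigma) for g in GENOTYPES
        }
    return table


q = build_consistency_table()
print(q["PM"]["AB"], q["PM"]["AA"])
```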

Conditional on the underlying allelic configuration H_j and subclonal type Z_j, the probability of observing B allele read count $b_{i j}^{T}$ at site (i, j) is given as

ℙ (b_{i j}^{T} | H_{j} = h, Z_{j} = k) = \sum_{g \in G} Q_{g h} ℙ (b_{i j}^{T} | G_{i j} = g, H_{j} = h, Z_{j} = k)

(12)

We assume that, conditional on the allelic configuration H_j, the B allele read counts $\{b_{i j}^{T}\}_{i = 1}^{I_{j}}$ at different sites within the same segment j are independent of each other, and are also independent of the total read count $D_{j}^{T}$ of the segment. Then, the joint probability of observing the two types of read count information for segment j is

\begin{matrix} ℙ (D_{j}^{T}, \{b_{i j}^{T}\}_{i = 1}^{I_{j}} | H_{j} = h, Z_{j} = k) \\ = ℙ (D_{j}^{T} | H_{j} = h, Z_{j} = k) \times \prod_{i = 1}^{I_{j}} \sum_{g \in G} Q_{g h} ℙ (b_{i j}^{T} | G_{i j} = g, H_{j} = h, Z_{j} = k) \end{matrix}

(13)

Likelihood model

We have specified the joint distribution of the two types of read counts information of segment j. We then further model the allelic configuration H_j and the subclonal type Z_j of segment j as random variables that follow categorical distributions

H_{j} | ρ_{j} ~ Categorical (ρ_{j})

(14)

Z_{j} | π ~ Categorical (π)

(15)

Here ρ_j = (ρ_{j∅}, ⋯, ρ_{jPPM/PMM}), where $ρ_{j h} = ℙ (H_{j} = h)$ is the probability of observing h as the allelic configuration of segment j, and π = (π₁, ⋯, π_K), where $π_{k} = ℙ (Z_{j} = k)$ is the probability of observing subclonal type k, shared across all the segments. The model parameters Θ are defined as

Θ = (\{ρ_{j}\}_{j = 1}^{J}, \{π_{k}\}_{k = 1}^{K}, \{ϕ_{k}\}_{k = 1}^{K})

(16)

And the model likelihood of observing all the data is then

\begin{matrix} ℙ (\{D_{j}^{T}\}_{j = 1}^{J}, \{b_{i j}^{T}\}_{i = 1, j = 1}^{I_{j}, J} | Θ) \\ = \prod_{j = 1}^{J} \sum_{k = 1}^{K} \sum_{h \in H} ℙ (Z_{j} = k) ℙ (H_{j} = h) ℙ (D_{j}^{T} | H_{j} = h, Z_{j} = k) \\ \times \prod_{i = 1}^{I_{j}} \sum_{g \in G} Q_{g h} ℙ (b_{i j}^{T} | G_{i j} = g, H_{j} = h, Z_{j} = k) \\ = \prod_{j = 1}^{J} \sum_{k = 1}^{K} \sum_{h \in H} π_{k} ρ_{j h} \frac{λ_{j}^{D_{j}^{T}} e^{- λ_{j}}}{D_{j}^{T}!} \\ \times \prod_{i = 1}^{I_{j}} \sum_{g \in G} Q_{g h} \binom{d_{i j}^{T}}{b_{i j}^{T}} {\bar{μ}}_{i j}^{b_{i j}^{T}} (1 - {\bar{μ}}_{i j})^{d_{i j}^{T} - b_{i j}^{T}} \end{matrix}

(17)
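The contribution of a single segment to the likelihood in Eq. (17) can be evaluated in log space, as in the following self-contained Python sketch. It is an illustration, not the MixClone implementation: all names are hypothetical, the value of σ and the genotype consistency mapping are assumptions, and `scale` stands for λ_j divided by the average copy number (the baseline-derived factor of Eq. (6)). Terms with zero average copy number are skipped as a simplification.

```python
import math

EPS, MU_0, SIGMA = 0.01, 0.5, 1e-4  # SIGMA is an assumed placeholder value
COPY_NUMBER = {"0": 0, "P/M": 1, "PP/MM": 2, "PM": 2, "PPP/MMM": 3, "PPM/PMM": 3}
GENOTYPE_BAF = {"0": 0.5, "A": EPS, "B": 1 - EPS, "AA": EPS, "AB": 0.5,
                "BB": 1 - EPS, "AAA": EPS, "AAB": 1 / 3, "ABB": 2 / 3, "BBB": 1 - EPS}
# Assumed mapping from allelic configuration to consistent genotypes.
CONSISTENT = {"0": {"0"}, "P/M": {"A", "B"}, "PP/MM": {"AA", "BB"}, "PM": {"AB"},
              "PPP/MMM": {"AAA", "BBB"}, "PPM/PMM": {"AAB", "ABB"}}


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def poisson_log_pmf(count, lam):
    return count * math.log(lam) - lam - math.lgamma(count + 1)


def binomial_log_pmf(b, d, mu):
    log_choose = math.lgamma(d + 1) - math.lgamma(b + 1) - math.lgamma(d - b + 1)
    return log_choose + b * math.log(mu) + (d - b) * math.log(1 - mu)


def q_prob(g, h):
    """Q_gh: SIGMA if inconsistent, else an equal share of the remaining mass."""
    consistent = CONSISTENT[h]
    if g not in consistent:
        return SIGMA
    return (1 - SIGMA * (len(GENOTYPE_BAF) - len(consistent))) / len(consistent)


def segment_log_likelihood(d_tumor, snps, scale, phi, pi, rho):
    """Log of the j-th factor of Eq. (17).

    snps is a list of (b, d) pairs; phi and pi are lists of length K;
    rho maps each allelic configuration to its probability.
    """
    terms = []
    for phi_k, pi_k in zip(phi, pi):
        for h, n_h in COPY_NUMBER.items():
            c_bar = phi_k * n_h + (1 - phi_k) * 2            # Eq. (4)
            if c_bar <= 0 or pi_k <= 0 or rho[h] <= 0:
                continue                                      # skipped term
            ll = math.log(pi_k) + math.log(rho[h])
            ll += poisson_log_pmf(d_tumor, c_bar * scale)     # Eq. (7)
            for b, d in snps:
                per_genotype = []
                for g, mu_g in GENOTYPE_BAF.items():
                    mu_bar = (phi_k * n_h * mu_g + (1 - phi_k) * 2 * MU_0) / c_bar
                    per_genotype.append(
                        math.log(q_prob(g, h)) + binomial_log_pmf(b, d, mu_bar))
                ll += log_sum_exp(per_genotype)               # Eq. (12)
            terms.append(ll)
    return log_sum_exp(terms)


# Toy segment resembling a P/M deletion at prevalence 0.5 (lambda = 750, BAF near 1/3).
rho = {h: 1 / len(COPY_NUMBER) for h in COPY_NUMBER}
snps = [(34, 100)] * 5
ll_half = segment_log_likelihood(750, snps, 500.0, [0.5], [1.0], rho)
ll_high = segment_log_likelihood(750, snps, 500.0, [0.9], [1.0], rho)
print(ll_half > ll_high)
```

Working in log space with a log-sum-exp avoids the numerical underflow that the product over SNP sites would otherwise cause for segments with many sites. The full log-likelihood is the sum of this quantity over all segments.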

We use the Expectation-Maximization (EM) algorithm [16] to find the maximum likelihood estimate of Θ. The complete details of the EM updates are given in Supplementary, Additional file 1.

Model selection

One of the key issues in subclonal analysis is to determine the number of subclonal populations K. PyClone and PhyloSub use posterior sampling methods to estimate K [5, 7], while THetA requires users to specify K as an input [9]. Since the probabilistic model of MixClone is a generative mixture model, both the model complexity and the corresponding log-likelihood increase as K increases. Therefore, we use a criterion based on the increase of the log-likelihood to select K. Practically, MixClone allows the user to specify K. If K is not specified, MixClone runs the mixture model five times, with K ranging from 1 to 5. We denote the log-likelihoods under the five different settings as $\{L_{K}\}_{K = 1}^{5}$, and the total log-likelihood increase as

Δ = L_{5} - L_{1}

(18)

If |Δ/L₁| < 0.01, which means the ratio of total log-likelihood increase is less than 0.01, MixClone predicts there is no subclonal event in the tumor sample and selects K = 1 as the number of subclonal populations. If |Δ/L₁| ≥ 0.01, MixClone further calculates another quantity

δ_{i} = \left| \frac{L_{i} - L_{1}}{Δ} \right|, i \in [2, 5]

(19)

which is the cumulative log-likelihood increase from K = 1 to K = i as a fraction of the total increase Δ (so that δ₁ = 0). If δ_i ≥ 0.9 and δ_{i−1} < 0.9, MixClone selects K = i as the number of subclonal populations.

In practice, we suggest users treat this criterion as a heuristic guide when analyzing real data, and determine the number of subclonal populations in conjunction with other external information.
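The selection rule of Eqs. (18) and (19) can be written as a short Python function. This is a sketch of the criterion as described in this section, with a hypothetical function name and made-up log-likelihood values; it returns the first K whose cumulative increase reaches the 0.9 threshold, which is equivalent to requiring δ_i ≥ 0.9 and δ_{i−1} < 0.9.

```python
def select_num_subclones(log_likelihoods, ratio_threshold=0.01,
                         cumulative_threshold=0.9):
    """Pick K from log-likelihoods L_1..L_n, where log_likelihoods[i] is L_{i+1}."""
    l_1 = log_likelihoods[0]
    delta = log_likelihoods[-1] - l_1                      # Eq. (18)
    if delta == 0 or l_1 == 0 or abs(delta / l_1) < ratio_threshold:
        return 1                                           # no subclonal event
    for k in range(2, len(log_likelihoods) + 1):
        delta_k = abs((log_likelihoods[k - 1] - l_1) / delta)   # Eq. (19)
        if delta_k >= cumulative_threshold:
            return k
    return len(log_likelihoods)


# Made-up values: most of the gain is reached at K = 3 (190 of 200).
print(select_num_subclones([-1000.0, -900.0, -810.0, -805.0, -800.0]))
# Made-up values: the total relative gain is about 1.4e-4, below 0.01.
print(select_num_subclones([-10000.0, -9999.5, -9999.2, -9999.0, -9998.6]))
```

In the first example δ₂ = 0.5 and δ₃ = 0.95, so the function returns 3; in the second the relative increase is below the 0.01 threshold, so it returns 1.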

MixClone software package

Figure 1 shows the general workflow of MixClone. MixClone is a comprehensive software package that provides subclonal cellular prevalence estimation, allelic configuration estimation, absolute copy number estimation and a few visualization tools. The package is implemented in Python and is built on top of the PyLOH package, which we released previously [11]. It also uses some features from the software package JointSNVMix [13], which are explicitly indicated in the source code.

Results

In this section, we evaluate the performance of MixClone on both simulated and real datasets and compare its performance with two published algorithms: (i) PyClone, a method based on somatic point mutations, and (ii) THetA, a method based on somatic copy number alterations.

Results from simulated data

To generate simulation data, we simulated ten sets of NGS reads from chromosome 1 of artificial paired normal-tumor samples, each with 60X coverage. Heterozygous SNP sites from dbSNP [17] were inserted into the reference human genome to create the artificial normal genome. Both heterozygous SNP sites and somatic point mutations from [18] were inserted into the reference human genome to create artificial tumor genomes. Five of the artificial tumor genomes contain two subclonal populations and the other five contain three subclonal populations. Each artificial tumor genome was randomly assigned segmentations, allelic configurations and subclonal cellular prevalences. We used segmentations based on both ground truth and BIC-seq [19] as the input for MixClone. We used ground truth somatic point mutation sites and copy numbers as the input for PyClone and THetA. Details on how reads were simulated and preprocessed are given in Supplementary, Additional file 1.

MixClone identifies the correct subclonal populations for all the simulated datasets when given ground truth segmentations. Figure 2 shows the result for a simulated dataset with two subclonal populations. MixClone also correctly estimates the subclonal cellular prevalences of all the segments with SCNAs, except for one small segment in tumor genome case 4 with three subclonal populations. With BIC-seq segmentations, MixClone still correctly estimates the subclonal cellular prevalences of the majority of the segments with SCNAs, except for those with copy-neutral loss of heterozygosity. This is likely due to incorrect segmentation by BIC-seq, which relies on copy number changes and is unable to detect segments with copy-neutral loss of heterozygosity when they are adjacent to diploid segments. The complete results for all the simulated datasets, based on both ground truth and BIC-seq segmentations, are available online through the GitHub website associated with MixClone.

As a comparison, we also ran PyClone and THetA on the same datasets. We were unable to obtain THetA results after running it for more than 72 hours, likely because its running time scales exponentially with the number of segments. In Figure 2, PyClone detects one of the two subclonal populations, whose ground truth cellular prevalence is 20%, but, except for a few segments, misestimates the other subclonal population, whose ground truth cellular prevalence is 80%. MixClone also significantly outperforms PyClone on the other simulated datasets. One possible reason is that the read coverage of the simulated datasets is not deep enough to support PyClone's non-parametric method [5], so PyClone tends to report more subclonal populations due to statistical variance.

Results from breast cancer sequencing data

We also applied MixClone to a whole-genome breast cancer sequencing dataset [12]. The details of data preprocessing are described in Supplementary, Additional file 1.

Figure 3a shows the subclonal inference results for sample MB-116. One estimated subclonal cellular prevalence, 32%, is consistent with the tumor purities estimated by PyLOH and THetA [11], and another estimated cellular prevalence, 66%, is consistent with the tumor purity estimated by ABSOLUTE [20] and reported in [12].

Figure 3b shows the five log-likelihoods of MB-116 under different numbers of subclonal populations. The magenta, red and yellow curves represent the log-likelihoods corresponding to K = 1, 3, and 5, respectively. Because the distance between the magenta and red curves (the cumulative log-likelihood increase from 1 to 3) is greater than 0.9 of the distance between the magenta and yellow curves (the total log-likelihood increase from 1 to 5), MixClone selected K = 3 as the number of subclonal populations for MB-116.

For samples without significant subclonal events, MixClone selected one as the number of subclonal populations, e.g. MB-106 (Figure 4). In Figure 4b, the ratio of total log-likelihood increase from 1 to 5 is 1.4 × 10⁻⁴, which is less than the threshold of 0.01. Therefore, MixClone selected K = 1 as the number of subclonal populations for MB-106. The estimated cellular prevalence of this single population is 83%, which is also consistent with the tumor purities estimated by PyLOH, ABSOLUTE and one result of THetA [11] (Figure 4a).

Besides MB-116, MixClone also detected significant subclonal events in MB-45 and MB-123. Results of MB-45 and MB-123 are given in Supplementary, Additional file 1.

Discussion

In this article, we demonstrated MixClone's utility using whole genome sequencing data. However, most existing cancer genome sequencing data come from exome sequencing. An important future direction is to extend the current methodology to handle exome sequencing data. Yet extending MixClone to whole exome sequencing data is not trivial, as read coverage on targeted exonic regions is no longer randomly distributed, owing to the variable efficiency of capture probes [21]. Instead of a Poisson distribution, a Gaussian distribution on read depth ratios between tumor and normal samples might be more appropriate to account for such additional variance, as has been demonstrated in whole exome sequencing based copy number analysis [21].

Another important future direction to extend MixClone is to implement joint analysis based on multiple samples, which is supported by PyClone and PhyloSub [5, 7]. Multiple samples have been obtained for a single heterogeneous tumor tissue both temporally and spatially, and joint analysis based on these samples may reveal additional patterns of the history of tumor progression [5].

Currently, MixClone by default runs the subclonal analysis five times, with the number of subclonal populations ranging from 1 to 5. In reality, larger numbers of subclonal populations may coexist within one tumor sample, but in this case some of the populations are very likely to share similar cellular prevalences. Since MixClone defines different subclonal populations based on distinct cellular prevalences, populations with similar cellular prevalences may not be differentiated by MixClone. To achieve finer resolution of subclonal populations, subclonal lineage information would be necessary, in addition to cellular prevalences, to further differentiate each population. Phylogenetic methods may be possible solutions for explicitly incorporating subclonal lineage information [7].

Conclusions

In summary, we have developed a new method for inferring tumor subclonal populations by integrating sequence information gathered from SCNAs and heterozygous SNP sites. We showed that our method outperforms existing ones on simulated data, and that applying it to a real breast cancer dataset reveals new subclonal events not discovered before. Compared with existing methods, our method requires no additional deep sequencing of somatic point mutation sites.

References

1. Nowell PC: The clonal evolution of tumor cell populations. Science. 1976, 194 (4260): 23-28. 10.1126/science.959840.
2. Garraway LA, Lander ES: Lessons from the cancer genome. Cell. 2013, 153 (1): 17-37. 10.1016/j.cell.2013.03.002.
3. Landau DA, Carter SL, Stojanov P, McKenna A, Stevenson K, Lawrence MS, Sougnez C, Stewart C, Sivachenko A, Wang L, et al: Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell. 2013, 152 (4): 714-726. 10.1016/j.cell.2013.01.019.
4. Hudson TJ, Anderson W, Aretz A, Barker AD, Bell C, Bernabé RR, Bhan M, Calvo F, Eerola I, Gerhard DS, et al: International network of cancer genome projects. Nature. 2010, 464 (7291): 993-998. 10.1038/nature08987.
5. Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, Ha G, Aparicio S, Bouchard-Côté A, Shah SP: Pyclone: statistical inference of clonal population structure in cancer. Nature methods. 2014, 11 (4): 396-398. 10.1038/nmeth.2883.
6. Andor N, Harness JV, Müller S, Mewes HW, Petritsch C: Expands: expanding ploidy and allele frequency on nested subpopulations. Bioinformatics. 2014, 30 (1): 50-60. 10.1093/bioinformatics/btt622.
7. Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q: Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics. 2014, 15 (1): 35. 10.1186/1471-2105-15-35.
8. Hajirasouliha I, Mahmoody A, Raphael BJ: A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Bioinformatics. 2014, 30 (12): 78-86. 10.1093/bioinformatics/btu284.
9. Oesper L, Mahmoody A, Raphael BJ: Theta: inferring intra-tumor heterogeneity from high-throughput dna sequencing data. Genome biology. 2013, 14 (7): 80-80. 10.1186/gb-2013-14-7-r80.
10. Roberts ND, Kortschak RD, Parker WT, Schreiber AW, Branford S, Scott HS, Glonek G, Adelson DL: A comparative analysis of algorithms for somatic snv detection in cancer. Bioinformatics. 2013, 29 (18): 2223-2230. 10.1093/bioinformatics/btt375.
11. Li Y, Xie X: Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity. Bioinformatics. 2014, 174:
12. Banerji S, Cibulskis K, Rangel-Escareno C, Brown KK, Carter SL, Frederick AM, Lawrence MS, Sivachenko AY, Sougnez C, Zou L, et al: Sequence analysis of mutations and translocations across breast cancer subtypes. Nature. 2012, 486 (7403): 405-409. 10.1038/nature11154.
13. Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A, Hirst M, Turashvili G, Oloumi A, et al: Jointsnvmix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics. 2012, 28 (7): 907-913. 10.1093/bioinformatics/bts053.
14. Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988, 2 (3): 231-239. 10.1016/0888-7543(88)90007-9.
15. Ewing B, Green P: Base-calling of automated sequencer traces using phred. ii. error probabilities. Genome research. 1998, 8 (3): 186-194.
16. Dempster AP, Laird NM, Rubin DB, et al: Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal statistical Society. 1977, 39 (1): 1-38.
17. Sherry ST, Ward M-H, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbsnp: the ncbi database of genetic variation. Nucleic acids research. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
18. Berger MF, Lawrence MS, Demichelis F, Drier Y, Cibulskis K, Sivachenko AY, Sboner A, Esgueva R, Pflueger D, Sougnez C, et al: The genomic complexity of primary human prostate cancer. Nature. 2011, 470 (7333): 214-220. 10.1038/nature09744.
19. Xi R, Hadjipanayis AG, Luquette LJ, Lee E, Zhang J, Johnson MD, Muzny DM, Wheeler DA, Gibbs RA, et al: Copy number variation detection in whole-genome sequencing data using the bayesian information criterion. Proceedings of the National Academy of Sciences. 2011, 108 (46): 1128-1136. 10.1073/pnas.1110574108.
20. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al: Absolute quantification of somatic dna alterations in human cancer. Nature biotechnology. 2012, 30 (5): 413-421. 10.1038/nbt.2203.
21. Sathirapongsasuti JF, Lee H, Horst BA, Brunner G, Cochran AJ, Binder S, Quackenbush J, Nelson SF: Exome sequencing-based copy-number variation and loss of heterozygosity detection: Exomecnv. Bioinformatics. 2011, 27 (19): 2648-2654. 10.1093/bioinformatics/btr462.

Acknowledgements

The work was partly supported by National Institutes of Health grant R01HG006870. The authors would also like to acknowledge the dbGaP repository for providing the cancer sequencing datasets. The accession numbers for the breast cancer and prostate cancer datasets are phs000369.v1.p1 and phs000447.v1.p1, respectively.

This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S2

Author information

Authors and Affiliations

Department of Computer Science, University of California, Irvine, CA, 92697, US
Yi Li & Xiaohui Xie
Institute for Genomics and Bioinformatics, University of California, Irvine, CA, 92697, US
Xiaohui Xie
Center for Machine Learning and Intelligent Systems, University of California, Irvine, CA, 92697, US
Xiaohui Xie

Authors

Yi Li
Xiaohui Xie

Corresponding author

Correspondence to Xiaohui Xie.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Designed the experiments: YL and XX; Performed the experiments: YL; Wrote the paper: YL and XX; All authors contributed to the analysis, and approved the paper.

Electronic supplementary material

12864_2015_6953_MOESM1_ESM.pdf

Additional file 1: Complete details of (1) detecting heterozygous SNP sites, (2) curating the baseline segments, (3) the EM updates of MixClone, (4) reads simulation for simulated data and (5) reads preprocessing for both simulated data and breast cancer sequencing data. (PDF 113 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Li, Y., Xie, X. MixClone: a mixture model for inferring tumor subclonal populations. BMC Genomics 16 (Suppl 2), S1 (2015). https://doi.org/10.1186/1471-2164-16-S2-S1
