 Research
 Open access
 Published:
Comparing copynumber profiles under multicopy amplifications and deletions
BMC Genomics volume 21, Article number: 198 (2020)
Abstract
Background
During cancer progression, malignant cells accumulate somatic mutations that can lead to genetic aberrations. In particular, evolutionary events akin to segmental duplications or deletions can alter the copynumber profile (CNP) of a set of genes in a genome. Our aim is to compute the evolutionary distance between two cells for which only CNPs are known. This asks for the minimum number of segmental amplifications and deletions to turn one CNP into another. This was recently formalized into a model where each event is assumed to alter a copynumber by 1 or −1, even though these events can affect large portions of a chromosome.
Results
We propose a general cost framework where an event can modify the copynumber of a gene by larger amounts. We show that any cost scheme that allows segmental deletions of arbitrary length makes computing the distance strongly NPhard. We then devise a factor 2 approximation algorithm for the problem when copynumbers are nonzero and provide an implementation called cnp2cnp. We evaluate our approach experimentally by reconstructing simulated cancer phylogenies from the pairwise distances inferred by cnp2cnp and compare it against two other alternatives, namely the MEDICC distance and the Euclidean distance.
Conclusions
The experimental results show that our distance yields more accurate phylogenies on average than these alternatives if the given CNPs are errorfree, but that the MEDICC distance is slightly more robust against error in the data. In all cases, our experiments show that either our approach or the MEDICC approach should preferred over the Euclidean distance.
Background
Cancer is widely recognized as an evolutionary process during which cells within a population accumulate aberrant somatic mutations and replicate indefinitely [1]. These cells are divided in subpopulations, called clones, that share common mutation traits and form tumors. A natural problem that arises is to reconstruct the evolution of a set of clones within a tumor. This question has recently led to the development of several phylogenetic algorithms tailored for cancer evolution. Most of them use either information of single nucleotide variants obtained from bulk [2–5] or singlecell [6–8] sequencing data, or copynumber alterations [9–13] (usually in the context of singlecell data). We refer the reader to [14] for a survey of these methods.
In this work, we are interested in the problem of inferring the minimum number of copynumber alteration events that explain how a cell evolved into another. In tumors, several events can make the copynumber of a gene different from the normal diploid twocopy state, thereby creating copynumber aberrations. As an example, the breakage–fusion–bridge (BFB) phenomenon [15] occurs when a region including a telomere breaks off a chromosome. During replication, two sister chromatids have unterminated ends and they fuse, leading to what is essentially a chromosome portion concatenated with a reversed copy of itself (see [15, 16] for a more thorough explanation). Afterwards, the centromeres of the fused chromatids get pulled in opposite directions, leading to another breakage. This BFB cycle repeats until the chromatids receive a telomere (usually after translocation). Each BFB event potentially doubles the copynumber of a gene, and since these events are known to occur in cycles, a gene copynumber may become significantly higher than normal (i.e. more than double) in a short evolutionary time span. Other examples of events include focal deletions [17, 18] or missegregation of chromosomes [19].
Desper et al. [20] were among the first to consider copynumber aberrations for phylogenetic reconstructions, using comparative genomic hybridization (CGH) data to reconstruct a mutation hierarchy. In [21], Liu et al. propose a distancebased approach based on CGH data to infer multicancer phylogenies. Singlecell phylogenetics then gained widespread attention in an influential paper of Navin et al. [9]. The authors applied singlenucleus sequencing on a breast cancer tumor, obtained the copynumber profile (CNP) of several cells, each represented as a vector of integers, and used the Euclidean distance to compare two CNPs. Later, Schwarz et al. [22] pointed out that a single event can amplify or delete large portions of a chromosome, thereby altering the copynumber of several genes and making the Euclidean distance overestimate the true number of events.
The authors proposed the following methodology to compare two CNPs. First, assuming diploid genomes, the copynumber for the two alleles of each gene (which can differ) is inferred from sequencing data. The correspondence between the copynumbers and the alleles is unknown, so a phasing step must be applied. This consists of assigning each copynumber to one of the two alleles (this is done under a minimumevolution principle, see [22] for details). After this step, each chromosome can be represented as a pair of CNPs, and chromosomes from two cells can be compared by computing the distances between the corresponding alleles. The distance proposed is the minimum number of segmental amplification and deletion events required to transform a given CNP into another.
In this work, we focus on the latter step. We assume that the CNP inference and the phasing steps have been performed, and must find a most parsimonious sequence of events explaining two given CNPs. This is analogous to classical rearrangements problems [23], but the main novelty (and difficulty) of CNP comparison is that only copynumbers are known, not the ordering of genes. In [22], Schwarz et al. introduced the MEDICC model, which approximates segmental events on a chromosome by events that alter a subinterval of a CNP by +1 or 1. Figure 1 shows an example turning a CNP u into another v (under our model where any amount of change is allowed). The problem of computing the minimum number of subinterval alterations to transform one CNP into another was solved in exponentialtime in [22] by modeling CNP events with a finitestate transducer. Zeira et al. [24] gave a linear time algorithm, using a clever trick for computing each row of a quadraticsize dynamic programming table in constant time (similar to the techniques used in [25]). In [11], the large phylogeny problem under this model is shown NPhard, though solvable with an ILP. They also present the copynumber triplet problem, which when given two CNPs u and v asks for a CNP whose sum of distances to u and v is minimized. The problem can be solved in pseudopolynomial time O(n^{2}N^{7}), where n is the CNP size and N the maximum copy number. Other distances and phylogenetic approaches are discussed in [10–13, 26, 27]
Our results. The above CNP comparison frameworks limit events to alter copynumbers by 1 or −1. As we exemplified with BFB, several copies of a gene can be affected by a single event. Moreover, the MEDICC software has a copynumber limit of 4, making it inappropriate for genes attaining copynumbers in the tens, twenties or even more, as has been reported for e.g. the MYC or EGFR genes [28–30]. In this work, we address these limitations by generalizing the CopyNumber Transformation problem defined in [22, 24]. We define a distance d_{f}(u,v) between two CNPs u and v which assigns a weight of f(c,δ) to an event that alters a copynumber of c by an amount of δ. We show that computing d_{f}(u,v) becomes strongly NPhard whenever we allow deletions of any amount at unit cost. In the context of our problem, “strongly” means that our hardness holds even if N, the maximum value in u and v, is polynomial in n, the number of elements in our CNPs. This is especially relevant, given that the MEDICC model was initially solved in time O(nN) and that the copynumber triplet problem can be solved in time O(n^{2}N^{7}). Our result implies that such pseudopolynomial time algorithm are impossible in our case unless P = NP. We then show that if any amount of change is permitted across an interval at unit cost, then a simple lineartime factor 2 approximation algorithm can be devised. We validate our approach by reconstructing simulated phylogenies using neighborjoining (NJ), and compare them with the MEDICC distance and Euclidean distance. We perform our experiments on errorfree data and noisy data (where the true copynumbers are altered by a random amount). Using a variety of simulation papameters, we show that both our distance and the MEDICC distance achieve significantly better accuracy than the Euclidean distance. Our distance is slightly more accurate on errorfree data, and the MEDICC distance is slightly more tolerant to error.
Results
We first provide the required preliminary notions required to state our theoretical results. We then show that computing our copynumber distance is strongly NPhard, and present our approximation algorithm. Finally, we present our experimental results on reconstructing simulated phylogenies.
Preliminary notions
Throughout the paper, we use the interval notations [n]={1,2,…,n} and [s,t]={s,s+1,…,t}. Given a vector u=(u_{1},…,u_{n}) of n integers and i∈[n], we will always write u_{i} for the value at the ith position of u. If u_{i}=0, then i is called a null position. We will assume that every vector u of dimension n has special values u_{0}=u_{n+1}=0. We denote by u^{−{i}} the vector obtained by removing position i∈[n], i.e. u^{−{i}}=(u_{1},…,u_{i−1},u_{i+1},…,u_{n}). If v is a vector of the same dimension, then u−v=(u_{1}−v_{1},…,u_{n}−v_{n}).
We assume that a reference chromosome is partitioned into contiguous subsequences, called positions, each numbered from 1 to n. A copynumber profile (CNP) is a vector u=(u_{1},…,u_{n}) of nonnegative integers representing the copynumber of each position in a clone. We consider amplification and deletion events, which respectively have the effect of increasing and decreasing the number of copies in a chromosome. As in [22, 24], we assume that events affect a set of positions that are contiguous in the reference chromosome.
An event is a triple e=(s,t,δ) where 1≤s≤t≤n and \(\delta \in \mathbb {Z} \setminus \{0\}\). Here the [s,t] interval depicts the set of affected positions, and δ is the amount of change. The event e is an amplification when δ>0 and a deletion when δ<0. A copynumber cannot drop below 0 and cannot increase from a 0 to another value (e.g. new genes cannot be created once completely lost). Applying event e=(s,t,δ) on a CNP u yields another CNP u^{′}=(u1′,…,un′) with, for i∈[n],
We denote by u〈e〉 the CNP obtained by applying event e on a CNP u. More generally, if E=(e_{1},…,e_{k}) is an ordered sequence of events, we write u〈E〉=u〈e_{1}〉〈e_{2}〉…〈e_{k}〉 to denote the CNP obtained by applying each event of E in order. We may also write u〈e_{1}…e_{k}〉 instead of u〈(e_{1},…,e_{k})〉. Given two CNPs u and v of dimension n, we say that Etransforms u into v if u〈E〉=v.
We will often use the difference vector of u and v, and usually denote w:=u−v. The representation of w as in Fig. 1 on the right provides the following intuition: if u〈E〉=v, then the events of E need to “squish” the values of w to 0 to make u equal to v (ensuring that no value u_{i} of u drops to 0 in the process unless v_{i}=0).
Minimum cost transformations
Given two CNPs u and v, our goal is to find a minimumcost sequence E that transforms u into v. In [22, 24], the cost of an event (s,t,δ) is δ. Here, we propose a generalization by defining a cost function \(f: \mathbb {N} \times \mathbb {Z} \rightarrow \mathbb {N}^{>0}\) that assigns a positive cost to altering a copynumber c by an amount of δ. That is, if we apply (s,t,δ) on u, each position i∈[s,t] has its own corresponding cost f(u_{i},δ), which could be interpreted as the plausibility of going from copynumber u_{i} to max(u_{i}+δ,0). We then define the cost cost_{f}(u,e) with respect to f of applying e=(s,t,δ) on u as the maximum cost within [s,t], i.e.
The events proposed in the MEDICC algorithm of Schwarz et al. can be decomposed into δ events of unit cost. This can be modeled under our framework with a function mdc defined as mdc(u_{i},δ)=1 if δ∈{−1,1} and mdc(u_{i},δ)=∞ otherwise. Alternatively, one could state that a position with copynumber u_{i} can hardly more than double in a single event (assuming that amplifications are duplications), but that deletions can suppress any number of copies. We call this the doubling function dbl, defined as dbl(u_{i},δ)=1 if u_{i}+δ≤2u_{i}, and dbl(u_{i},δ)=∞ otherwise.
Finally, the most permissive cost function any allows any movement without constraint: simply define cost(u_{i},δ)=1 for any \(\delta \in \mathbb {Z}\). This can, for instance, be used to model succession of events that can potentially amplify copynumbers above their double in a short time span — an example of this being BFB cycles.
In this paper, we mostly analyze the any function for its simplicity, but will sometimes use the dbl function for its relevance. Given two CNPs u and v and a cost function f, the cost of a sequence of events E=(e_{1},…,e_{k}) satisfying u〈E〉=v is equal to the sum of the cost of applying successive events of E on u, i.e.
If cost_{f}(u,E)≤cost_{f}(u,E^{′}) for any other sequence E^{′} satisfying u〈E^{′}〉=v, then E is called optimal. The fdistance between u and v, denoted d_{f}(u,v), is the cost of an optimal sequence of events transforming u into v. Observe that this “distance” is not symmetric (hence the use of doublequotes). For instance, if u=(1,1) and v=(0,0), then d_{mdc}(u,v)=1 but d_{mdc}(v,u) is undefined since v cannot be transformed into u. We will therefore usually assume that u does not have any null position. We note here that the median distance, defined as minw∈V(d_{f}(w,u)+d_{f}(w,v)) (where V ranges over \(\mathbb {Z}^{n}\)), is symmetric for all the functions mentioned above. However, no efficient algorithm is known for any median distance. Our problem is the following.
The CNPtransformation problem:
Given: a source CNP u, a target CNP v, a cost function f and an integer k;
Question: is d_{f}(u,v)≤k?
We say that f is a unitcost function if f(c,δ)∈{1,∞} for any c and δ (e.g. the functions mdc,dbl and any). A cost function f is called deletionpermissive if cost_{f}(u_{i},δ)=1 for any u_{i} and any δ<0, i.e. there is no particular constraint on deletions. We will mainly focus deletionpermissive functions, the rationale being that unlike duplications, deletions could suppress an arbitrary number of copies.
General properties
Before proceeding with our results on computing fdistances, we present some results of general interest that will be useful later on.
Proposition 1
For any two CNPs u and v of the same dimension, any position i∈[n] and any cost function f, d_{f}(u,v)≥d_{f}(u^{−{i}},v^{−{i}})
We omit the proof details. The idea is that given a sequence of events E transforming u into v, we can apply E on u^{−{i}} by ignoring position i when it is affected. A sequence of events E is called ampfirst if all amplifications appear before all deletions. An ampfirst reordering of a sequence E is an ampfirst sequence E^{′} that contains the same events as E. Notice that if E has a amplifications and d deletions, then there are a!d! ampfirst reorderings of E.
Proposition 2
Let u and v be two CNPs with no null positions. If a sequence E satisfies u〈E〉=v, then any ampfirst reordering E^{′} of E satisfies u〈E^{′}〉=v.
Proof
Denote E=((s_{1},t_{1},δ_{1}),…,(s_{k},t_{k},δ_{k})). For any position i, the sum \(\sum _{j=1}^{k} \delta _{k}\) does not change even if we reorder the events in E, so u_{i} should still become v_{i} after reordering the event and applying them on u. The only danger is that a position drops to 0 since v has no null position, but this cannot happen if all amplifications are moved in front of E. □
Given a CNP w of length n, an interval [a,b] is a staircase of w if 0<w_{a}<w_{a+1}<…<w_{b}. The length of the staircase [a,b] is b−a+1. Figure 2 depicts a staircase of length 4. The next lemma can be useful to obtain quick lower bounds on a particular instance, and plays an important role in our hardness result (proof in Supplementary material).
Lemma 1
Let u,v be two CNPs with no null positions. If u−v contains a staircase [a,b] of length k, then d_{f}(u,v)≥k for any unitcost function f.
Strong NPhardness
We show that the CNPtransformation problem is strongly NPhard. This result holds for any deletionpermissive unitcost function f, and even if u and v contain no null position (we note that in [24], null positions make the problem more complex, but not here). In particular, the hardness also holds if only deletions are allowed. We assume that we are given two CNPs u and v and we put w:=u−v.
Suppose that w contains a staircase in interval [1,k] for some k, and that d_{f}(u,v)=k. A sequence E=(e_{1},…,e_{k}) such that u〈E〉=v is called smooth if, for every i∈[k], e_{i}=(i,b_{i},w_{i−1}−w_{i}) for some b_{i}≥k. Intuitively, E removes the first step, then the second, and so on, see Fig. 2. Observe that in a smooth deletion sequence, the positions to the right of k may or may not be affected by deletions.
Lemma 2
Let u and v be two CNPs with no null positions and let f be any unitcost function. If u−v contains a staircase in interval [1,k] and d_{f}(u,v)=k, then there exists a smooth sequence transforming u into v.
Lemma 2 requires the most technical proof of the paper (by far), and we defer it to the Supplementary material. The reduction becomes relatively simple when given this lemma. Our reduction is from the 3partition problem. In this problem, we are given a multiset S={s_{1},…,s_{n}} of n=3m positive integers. Defining \(t := \frac {1}{m} \sum _{i \in [n]}s_{i}\), we are asked whether S can be partitioned into m subsets S_{1},…,S_{m}, each of size 3, such that \(\sum _{s \in S_{i}} s = t\) for all i∈[m]. This problem is known to be strongly NPhard [31] (i.e. it is hard even if the values of S are O(n^{k}) for some constant k). The proof can be found in the Supplementary material.
Theorem 1
The CNPtransformation problem is strongly NPhard for any deletionpermissive unitcost function, even if the CNPs have no null positions.
Approximation algorithm
In this section, we show that if v does not contain any null position, then d_{any}(u,v) can be approximated within a factor of 2 in linear time. We discuss practical ways of handling null positions at the end of the section. We now assume that f=any and will write d(u,v) instead of d_{any}(u,v).
As usual, u and v are the source and target CNPs, respectively, and w:=u−v. The idea of the approximation is that if two consecutive positions i and i+1 have the same difference between u and v, i.e. w_{i}=w_{i+1}, then their value needs to change by the same amount. It might then be a good idea to treat these positions as one and always affect both with the same events. In fact, a whole interval of equal w values can be treated as a single position. We show that the number of distinct equal intervals gives a good bound on d(u,v).
Approximation by flat intervals
Recall that if w is a vector of n integers, it has implicit values w_{0}=w_{n+1}=0. We say that [a,b], with 0≤a≤b<n+1, is a flat interval if w_{i}=w_{j} for every a≤i,j≤b. If no interval properly containing [a,b] is flat, then [a,b] is a maximal flat interval. In fact, in the remainder, we will omit the term “maximal” and always assume that discussed flat intervals are maximal. We write F_{w} for the set of flat intervals of w. Note that this set is welldefined and that it partitions [0,n+1], by the maximality property. The intervals that contain 0 and n+1 in F_{w} are called extreme flat intervals, and always have a value of 0 (also, these intervals are possibly [0,0] and/or [n+1,n+1], but not necessarily). The key lemma says that d_{f}(u,v) is at least about half the number of flat intervals (see Supplementary material).
Lemma 3
Let u,v be two distinct CNPs with no null positions, and let w:=u−v. Then for any unitcost function f, d_{f}(u,v)≥⌈(F_{w}−1)/2⌉.
Lemma 3 yields a very simple factor 2 approximation algorithm: compute F_{w}, and return F_{w}−2. This corresponds to a solution in which we treat each flat interval separately (ignoring the two extremities) and is guaranteed to be at most twice the optimal number of events. Computing F_{w} can be done in a single pass through w by increasing a counter whenever we encounter a position i with w_{i}≠w_{i−1}.
Theorem 2
The CNPtransformation problem can be approximated within factor 2 in linear time for cost function f=any when the CNPs contain no null position.
It is open whether this could be adapted to other functions, e.g. the dbl function.
Improvements to the approximation algorithm
We first observe that the bound in Lemma 3 is essentially tight. This can be seen with any u,v such that u−v=(1,2,3,…,k−1,k,k−1,…,3,2,1) for some k. Indeed, one can decrease F_{w} by two at each round. On the other hand, our naive 2approximation is twice as bad as optimal. We show how to improve this in a heuristic fashion by devising an algorithm that can only perform better than the naive one. We leave it as an open problem to determine the approximation guarantees of this algorithm.
Our goal is to apply events that reduce F_{w} by two as many times as possible. In a greedy fashion, we apply the following strategy for our improved 2approximation: as long as u≠v, find an event e that reduces F_{w} by 2, if one exists, and apply it to u. If no such event exists, take the leftmost nonextreme flat interval [a,b] of w and apply the event (a,b,−w_{a}). Repeat until u=v.
An event (i,j,δ) reduces F_{w} by 2 precisely when w_{i−1}−w_{i}=w_{j+1}−w_{j}=δ (i=j is possible). This way we can merge the two flat intervals at the ends of [i,j]. One can find a good interval by checking all the O(n^{2}) subintervals [i,j] and then, for each of them, checking whether w_{i−1}−w_{i}=w_{j+1}−w_{j}. Moreover, we must check whether applying the event (i,j,δ) would make a value of u go below 0. Verifying every possible event can be done in time O(n^{3}) and as there are O(n) flat intervals, the algorithm takes time O(n^{4}).
This can be improved to O(n^{2} logn) by finding good events in time O(n logn). Due to space constraints, we relegate the detailed analysis of the improved heuristic to the Supplementary material. The idea is to scan w from left to right and store in a treap data structure (see [32]) the set of flat intervals encountered so far, which allows to detect quickly whether the current flat interval could be matched with another one.
Handling null positions
Our approximation ratio is not guaranteed to hold when there are many null positions. However, we show that in many practical cases, we can simply ignore null positions and remove them. In particular, we may assume that v has no two consecutive null positions (Lemma 4) and that for any null position i in v, we have w_{i−1}<w_{i} and w_{i+1}<w_{i} (Lemma 5). Thus instances with null positions can be reduced to ones where the only null positions remaining are “sandwiched” between nonnull positions with a smaller value in w.
Note that our approximation can still perform badly with these two conditions. For instance, suppose that u=(15,2,15,2,…,15,2) and v=(14,0,14,0,…,14,0). We would solve this in about n/2 events. However, the two events (1,n,−2),(1,n,1) turn u into v. Designing a better approximation for these cases is an open problem.
Lemma 4
Suppose that v_{i}=v_{i+1}=0 for some position i. Then removing position i or i+1, whichever is smaller in u, from u and v preserves the distance between u and v. Formally, for any unitcost function f, if u_{i}≥u_{i+1}, then d_{f}(u,v)=d_{f}(u^{−{i+1}},v^{−{i+1}}). Similarly if u_{i+1}≥u_{i}, then d_{f}(u,v)=d_{f}(u^{−{i}},v^{−{i}}).
Lemma 5
Suppose v_{i}=0 for some position i and that w_{i−1}≥w_{i} or w_{i+1}≥w_{i}. Then d_{f}(u,v)=d_{f}(u^{−{i}},v^{−{i}}) for any unitcost function f.
Experiments
We tested our flattening approximation algorithm and its improved version on simulated chromosomes that evolve along a tree through segmental tandem duplications and losses. Chromosomes were represented as strings of genes. Note that we did not simulate CNP evolution under the assumptions of our model. We evolved actual sequences as opposed to integer vectors, and the initial ordering of genes could be broken after several events. Our goal was to reconstruct phylogenies from the distances between the CNPs of the chromosomes at the leaves of the tree. We used the NJ implementation in Phylip [33, 34] and compared four distances: (1) our improved approximation; (2) our flat interval count; (3) the mdc cost, as in the MEDICC model; and (4) the Euclidean distance. To compute d_{mdc}, we implemented the dynamic programming algorithm of Zeira, Zehavi and Shamir [24], hereafter called the ZZS algorithm (we could not use the MEDICC software as it only handles copynumbers up to 4). The Euclidean distance is defined as \(\sqrt {\sum _{i=1}^{n} (u_{i}  v_{i})^{2}}\), as used in [9]. For the first three distances, we took the minimum of d(u,v) or d(v,u) to get a symmetric distance, removing null positions of u and filtering null positions of v as in Lemmas 4 and 5.
Simulated tree generation
We now describe how the trees were generated. First, we select a rooted binary tree T on l leaves labeled {1,…,l} uniformly at random. This is achieved by using the recursive splitting process described by Aldous in [35], which starts with a completely unresolved tree, splits the root in two subtrees chosen uniformly at random, and repeats on these subtrees. We then assign to the root r of T an exemplar chromosome, i.e. any string in which each gene occurs exactly once (note that the initial ordering of genes does not matter for our purposes).
Then for each branch uv of T from top to bottom, we select a random number of events k chosen uniformly at random in the interval [e_{min},e_{max}], where e_{min},e_{max} are simulation parameters. To introduce some rate heterogeneity among branches, we then multiplied k by a random number chosen from a uniform distribution with mean and standard deviation 1. The chromosome string at node v is obtained by applying k random events on the chromosome string associated with its parent u. Each event is either a tandem duplication with probability Δ or a deletion with probability 1−Δ. The starting position of each event is chosen uniformly at random on the chromosome string and, to find the length t of the substring affected, we apply the following process. Start with t=1, then apply the following: as long as a random number between 0 and 1 is above a given parameter r, increment t by 1 and repeat. We stop at the first random number below r. We chose to consider only values r≤0.1 since higher values resulted in copynumbers in the hundreds, sometimes even in the thousands, a road which we did not deem necessary to explore. Setting r between 0.01 and 0.1 generally resulted in copynumbers inside [0,50]. We also note that we also experimented on a model where the event length was chosen as a random fraction of the chromosome length — this led to exponential copynumber growth and we did not investigate this model further.
We observed that this process had a tendency to produce leaf chromosomes with CNPs having between 50–60% null positions under most parameter combinations. This might be deemed unrealistic, and furthermore, our results show that no method is able to predict accurate trees under these conditions. To avoid this, we added a condition in the loop determining the length t of an event: if incrementing t implies deleting the last occurrence of a gene, we continue the procedure with probability q and stop with probability 1−q (where q is another simulation parameter). This can be seen as modeling the idea that there may be resistance when attempting to remove every copy of a gene required for survival. Using q parameter values 0.25,0.5 and 0.75, the proportion of null positions stayed in the intervals 2–5%, 6–10% and 15–25%, respectively.
Note that since each possible tree on l leaves is equally likely to be chosen, the roottoleaf distances in a tree can be significantly different, and hence the trees are not expected to be ultrametric (for instance, a caterpillar can be selected as well as a perfectly binary tree).
Since it is difficult to determine the most realistic simulation conditions, we tested several combinations of parameters for the generation of phylogenies. The summary of the simulation parameters, along with their possible and default values, are summarized here:
l∈{10,50,100} is the number of leaves in the tree. The default is l=100;
n∈{10,100,250} is the number of genes (i.e. distinct characters) in the root chromosome (n is also the number of positions in our vectors). The default is n=100;
(e_{min},e_{max})∈{(2,4),(5,10),(20,40)} is the range of the possible number of events on each branch. The default is (e_{min},e_{max})=(5,10);
Δ∈{0.25,0.5,0.75} is the probability that an event is a duplication (and 1−Δ the probability that an event is a loss). The default is Δ=0.5;
r∈{0.01,0.05,0.1} controls the length of each event: r is the probability that we stop extending the event length. The default is r=0.05;
q∈{0.25,0.5,0.75,1} is the probability that a deletion suppresses the last copy of a gene during the length extension procedure (i.e. 1−q is the probability that the extension stops if it would make a copynumber 0). The default is q=0.25.
Tree reconstruction and performance measure
We generated 50 trees for each parameter combination of interest. For each tree, we took the chromosome strings at the leaves, obtained their CNPs and provided them as input to each of the four evaluated methods. We used the normalized RobinsonFoulds (RF) distance as a measure of the performance of each algorithm [36]. That is, for each inferred tree, we compare it with the “true” tree by counting the number of clades that are present in one tree but not the other, divided by 2(l−3) (the maximum number of clades that can possibly differ, recalling that l is the number of leaves). This yields a number between 0 and 1: the lower the number, the better we consider the reconstruction.
Error tolerance
It should be noted that the above methodology ignores several sources of errors. Inferring exact copynumbers from singlecell sequencing data is a nontrivial task and is still considered an open problem. The inferred CNPs are therefore expected to be noisy, especially with genes having a high copynumber. Moreover, as discussed in [22], assigning copynumbers to their corresponding allele is also a difficult problem. Here, by only considering singleallele chromosomes, we are supposing that the aforementioned phasing step has been performed correctly, whereas copynumber assignments cannot be assumed to be errorfree.
Both of the above problems have the effect of introducing incorrect copynumbers into the CNPs. To account for this, we gave randomly altered CNPs as input to each method. More specifically, given an errorrate parameter α, for each CNP u and each position i we changed u_{i} to a value chosen at random from a normal distribution with mean u_{i} and standard deviation α·u_{i} (noninteger values were rounded). We tested parameter values α∈{0,0.1,0.25,0.5,1}.
Experimental results
We first ran experiments using the default values for all parameters except one in order to isolate the impact of each parameter. On errorfree data, the most interesting results were obtained when varying l and n, see Fig. 3. In most situations, our CNP model slightly improves upon the MEDICC model, both of which are significantly better than the Euclidean distance. The number n of genes is quite important: all the results are poor when each CNP has only n=10 positions, but when n=250, the trees more accurate. This might be because when n=10, there is not enough opportunity for positions acquire a distinct signature during the evolutionary process, making all distances very similar. This suggests that many genes or segments should be considered when analyzing copynumber variants in tumor clones. The duplication rate does not seem to affect the accuracy of the methods, whereas accuracy tends to decrease as the number of events per branch increases. We do note that when the number of events per branch is within [5,10], our model performs better, but when it is high, i.e. in [20,40], the MEDICC model performs better. This tendency is confirmed under other parameterizations (see the Supplementary Material). Figure 4 show the normalized RF distances when varying parameters q and r. As mentioned before, as q gets closer to 1, the proportion of null positions is around 50–60%, making accurate distance computation difficult for all methods. As for the r parameter, the accuracy of our approach is better when r=0.01 but worse when r=0.1. This tendency can be observed under all parameter combinations (see the Supplementary Material). One should note that accuracy is generally better if the lengths of events are smaller. The four approaches on errorfree data exhibit similar behavior on other parameter combinations — additional plots can be found in the Supplementary Materials.
The results on CNPs containing errors are summarized in Fig. 5. We observe that the ZZS algorithm achieves slightly more accurate trees whenever the error rate is above zero. One possible explanation is that a single error in a CNP can split a flat interval into three. This can significantly alter the flat interval counts, whereas the ZZS distance is less dependant on flat intervals. The accuracy of the Euclidean distance appears to be the least affected by error rates and even performs better when α≥0.5. Observe however that accuracy decays rapidly with error rates: when α≥0.25, all approaches have an average normalized distance above 0.7, casting some doubt on their practical usability in this setting. This suggests that it might be beneficial to apply a CNP errorcorrection procedure before comparing them (see the “Discussion” section). More results on noisy data can be accessed in the Supplementary Material.
To summarize, the heuristic and flat count algorithm always yield a lower average RF distance than the ZZS algorithm on errorfree data, except when n=10 (where the average is always above 0.9 anyways), and every method always outperforms the Euclidean distance. On the other hand, the ZZS approach is slightly more robust to error in the CNP counts. However, the accuracy of the heuristic, the flat count and ZZS drops quickly as error rates increase. Even though the Euclidean distance yields better trees at high error rates, their accuracy is still too poor to be able to draw meaningful conclusions from them. Whether the MEDICC model is better than ours or not, we believe that either should be preferred over the Euclidean distance when reconstructing phylogenies from distance matrices as in [9].
Discussion
The results from the “Experiments” section show that our CNP distance performs reasonably well on simulated data. The incorporation of segmental events into the model does not appear to provide a significant advantage over the unitary events of the ZZS model. However, the simulations suggest that both approaches yield better results than the traditional Euclidean distance. This demonstrates that either our method or the ZZS algorithm should be preferred as the CNP comparison component in a singlecell phylogenetic reconstruction pipeline.
It should be noted that our algorithms only approximate the true CNP distance whereas the ZZS algorithm provides an exact solution. In order to evaluate the true performance of our segmental model, exact approaches should be developed in the future, perhaps using techniques from the field of parameterized complexity. Moreover, our approaches are very sensitive to errors, even more so than ZZS/MEDICC. One possible explanation for this is that both of our algorithms derive their distance from the number of flat intervals. A single error in a copynumber can turn one flat interval into three, and thus even moderate levels of noise can lead to highly incorrect predictions. We believe that the ZZS algorithm is less sensitive to such errors because that, if copynumber differences are large enough, a small error only increases the distance by 1, which is small in comparison to all the unit events required to handle the high difference. Therefore, even if the true event distance is overestimated, in a comparative setting the relative distances might be closer to the truth. It will be interesting to consider CNP error correction procedures based on flat intervals. For instance, when performing analysis of multiple cells, one could detect a potentially incorrect copynumber of a given segment by checking whether, after altering a predicted copy number by a small amount, several flat intervals get “fixed” when comparing the cell with others.
Another point of interest is that current approaches, including ours and ZZS/MEDICC, ignore rearrangements that change the ordering of segments. Our models assume that the set of contiguous segments remains the same in all cells during evolution. However, when duplications and deletions occur, the relative ordering of genes changes and the set of contiguous genes affected by the events will differ from that in the reference. Inversions, translocations or even chromothripsis also have the same effect. This is a difficult problem to handle if only CNPs are known, since integer vectors do not contain enough information to determine which genes are contiguous or not. One possibility is to ask the following: given two CNPs C_{1} and C_{2} to compare, choose a genome G_{1} whose CNP is C_{1} and a genome G_{2} whose CNP is C_{2} such that the rearrangement distance between G_{1} and G_{2} is minimized.
Our work also leaves several questions open. From a theoretical perspective, it remains to achieve a constant factor approximation when null positions are present in the input. Moreover, it is unknown whether the dbl function admits good approximation algorithms and, more generally, whether there are other biologically plausible functions that should be studied. On another note, it might be interesting to investigate the copynumber triplet problem (see the introduction) under our model, as it allows to define a symmetric distance between CNPs.
On a practical level, phylogenetic approaches that are not distancebased should be investigated. For instance, we could consider maximum parsimony as in [11], where the objective is to minimize the number of events across branches of a tree. The recent distancebased approach of Xia et al. [12], which is based on the MEDICC model but with an extra error correction step, should also be evaluated in our setting. On another note, it remains to test our approach on real data. We have ignored the problem of calling copynumbers and the aforementioned problem of phasing. These can introduce noise in the data and, as shown in our experiments, all the evaluated methods are sensitive to errors. This motivates the need for new methods to assign copynumbers to alleles under our model. Also, our CNP comparison framework assumes a singlecell setting, where the CNP of each individual cell is known. Since bulk sequencing is still commonplace, it will be useful to develop methods that are able to compare genomes extracted from samples that contains multiple types of cells.
Conclusion
In this work, we provided a general framework for the comparison of CNPs depicting genomes that evolve by segmental amplifications and deletions. We have shown that if there is no bound on the number of copies that a deletion can affect, then computing the minimum number of events transforming one CNP into another is strongly NPhard. One important implication of this result is that unless P = NP, one cannot use the fact that copynumbers are not too large (e.g. under 100) to devise a practical pseudopolynomial time algorithm, and other solutions must be explored. On the other hand, we proposed two simple and fast approximation algorithms that were shown to perform reasonably well on simulated datasets.
Availability of data and materials
The source code and data are available at: https://github.com/AEVOlab/cnp2cnp.
Abbreviations
 BFB:

Breakagefusionbridge
 CGH:

Comparative genomic hybridization
 CNP:

Copynumber profile
 NJ:

Neighborjoining
 RF:

RobinsonFoulds
 ZZS:

Zeira, Zehavi, Shamir
References
Nowell PC. The clonal evolution of tumor cell populations. Science. 1976; 194(4260):23–8.
Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics. 2014; 15(1):35.
ElKebir M, Oesper L, AchesonField H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multisample sequencing data. Bioinformatics. 2015; 31(12):62–70.
Malikic S, McPherson AW, Donmez N, Sahinalp CS. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics. 2015; 31(9):1349–56.
Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. Bitphylogeny: a probabilistic framework for reconstructing intratumor phylogenies. Genome Biol. 2015; 16(1):36.
Jahn K, Kuipers J, Beerenwinkel N. Tree inference for singlecell data. Genome Biol. 2016; 17(1):86.
Ross EM, Markowetz F. Onconem: inferring tumor evolution from singlecell sequencing data. Genome Biol. 2016; 17(1):69.
ElKebir M. Sphyr: tumor phylogeny estimation from singlecell sequencing data under loss and error. Bioinformatics. 2018; 34(17):671–9.
Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, et al.Tumour evolution inferred by singlecell sequencing. Nature. 2011; 472(7341):90.
Abo RP, Ducar M, Garcia EP, Thorner AR, RojasRudilla V, Lin L, Sholl LM, Hahn WC, Meyerson M, Lindeman NI, et al.Breakmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res. 2014; 43(3):19.
ElKebir M, Raphael BJ, Shamir R, Sharan R, Zaccaria S, Zehavi M, Zeira R. Copynumber evolution problems: complexity and algorithms. In: International Workshop on Algorithms in Bioinformatics. Springer: 2016. p. 137–49.
Xia R, et al. Phylogenetic Reconstruction for CopyNumber Evolution Problems. IEEE/ACM transactions on computational biology and bioinformatics. 2018; 16(2):694–699.
Zhou J, et al. Maximum parsimony analysis of gene copy number changes. In: International Workshop on Algorithms in Bioinformatics. Berlin: Springer: 2015. p. 108–20.
Schwartz R, Schäffer AA. The evolution of tumour phylogenetics: principles and practice. Nat Rev Genet. 2017; 18(4):213.
Lo AW, Sabatier L, Fouladi B, Pottier G, Ricoul M, Mumane JP. Dna amplification by breakage/fusion/bridge cycles initiated by spontaneous telomere loss in a human cancer cell line. Neoplasia. 2002; 4(6):531–8.
Marotta M, Chen X, Inoshita A, Stephens R, Budd GT, Crowe JP, Lyons J, Kondratova A, Tubbs R, Tanaka H. A common copynumber breakpoint of erbb2 amplification in breast cancer colocalizes with a complex block of segmental duplications. Breast Cancer Res. 2012; 14(6):150.
Rajaram M, Zhang J, Wang T, Li J, Kuscu C, Qi H, Kato M, Grubor V, Weil RJ, Helland A, et al.Two distinct categories of focal deletions in cancer genomes. PLoS ONE. 2013; 8(6):66264.
Liu Y, Chen C, Xu Z, Scuoppo C, Rillahan CD, Gao J, Spitzer B, Bosbach B, Kastenhuber ER, Baslan T, et al.Deletions linked to tp53 loss drive cancer through p53independent mechanisms. Nature. 2016; 531(7595):471.
Holland AJ, Cleveland DW. Boveri revisited: chromosomal instability, aneuploidy and tumorigenesis. Nat Rev Mol Cell Biol. 2009; 10(7):478.
Desper R, Jiang F, Kallioniemi OP, Moch H, Papadimitriou CH, Schäffer AA. Inferring tree models for oncogenesis from comparative genome hybridization data. J Comput Biol. 1999; 6(1):37–51.
Liu J, Bandyopadhyay N, Ranka S, Baudis M, Kahveci T. Inferring progression models for cgh data. Bioinformatics. 2009; 25(17):2208–15.
Schwarz RF, Trinh A, Sipos B, Brenton JD, Goldman N, Markowetz F. Phylogenetic quantification of intratumour heterogeneity. PLoS Comput Biol. 2014; 10(4):1003535.
Fertin G, et al. Combinatorics of genome rearrangements. Cambridge: MIT press; 2009.
Zeira R, Zehavi M, Shamir R. A lineartime algorithm for the copy number transformation problem. J Comput Biol. 2017; 24(12):1179–94.
Lafond M, Swenson KM, ElMabrouk N. An optimal reconciliation algorithm for gene trees with polytomies. In: International Workshop on Algorithms in Bioinformatics. Berlin: Springer: 2012. p. 106–22.
Letouzé E, Allory Y, Bollet MA, Radvanyi F, Guyon F. Analysis of the copy number profiles of several tumor samples from the same patient reveals the successive steps in tumorigenesis. Genome Biol. 2010; 11(7):76.
Paul S, Su C, Pang J, Mizera A. A DecompositionBased Approach towards the Control of Boolean Networks. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. New York: Association for Computing Machinery: 2018. p. 11–20. https://doi.org/10.1145/3233547.3233550,.
Santarius T, Shipley J, Brewer D, Stratton MR, Cooper CS. A census of amplified and overexpressed human cancer genes. Nat Rev Cancer. 2010; 10(1):59.
Park HS, Jang MH, Kim EJ, Kim HJ, Lee HJ, Kim YJ, Kim JH, Kang E, Kim SW, Kim IA, et al.High egfr gene copy number predicts poor outcome in triplenegative breast cancer. Mod Pathol. 2014; 27(9):1212.
Campbell K, GastierFoster JM, Mann M, Naranjo AH, Van Ryn C, Bagatell R, Matthay KK, London WB, Irwin MS, Shimada H, et al.Association of mycn copy number with clinical features, tumor biology, and outcomes in neuroblastoma: A report from the children’s oncology group. Cancer. 2017; 123(21):4224–35.
Garey MR, Johnson DS. Computers and Intractability, vol 29. New York: W.H. Freeman; 2002.
Seidel R, Aragon CR. Randomized search trees. Algorithmica. 1996; 16(45):464–97.
Saitou N, Nei M. The neighborjoining method: a new method for reconstructing phylogenetic trees. Molecular Biol Evol. 1987; 4(4):406–25.
Felsenstein J. PHYLIP (phylogeny Inference Package), Version 3.5 C: Joseph Felsenstein.; 1993.
Aldous D. Probability distributions on cladograms. In: Random Discrete Structures. New York: Springer: 1996. p. 1–18.
Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981; 53(12):131–47.
Funding
Publication was funded by the Natural Sciences and Engineering Research Council (NSERC).
Author information
Authors and Affiliations
Contributions
GC and ML both participated in writing the manuscript, establishing the theoretical results, performing the experiments and implementing the algorithms. All authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary file S33S1.pdf contains all the missing proofs.
Additional file 2
Supplementary file S33S2.pdf contains all the additional experimental results.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Cordonnier, G., Lafond, M. Comparing copynumber profiles under multicopy amplifications and deletions. BMC Genomics 21 (Suppl 2), 198 (2020). https://doi.org/10.1186/s1286402066113
Published:
DOI: https://doi.org/10.1186/s1286402066113