Comparing copy-number profiles under multi-copy amplifications and deletions

Background During cancer progression, malignant cells accumulate somatic mutations that can lead to genetic aberrations. In particular, evolutionary events akin to segmental duplications or deletions can alter the copy-number profile (CNP) of a set of genes in a genome. Our aim is to compute the evolutionary distance between two cells for which only CNPs are known. This asks for the minimum number of segmental amplifications and deletions to turn one CNP into another. This was recently formalized into a model where each event is assumed to alter a copy-number by 1 or −1, even though these events can affect large portions of a chromosome. Results We propose a general cost framework where an event can modify the copy-number of a gene by larger amounts. We show that any cost scheme that allows segmental deletions of arbitrary length makes computing the distance strongly NP-hard. We then devise a factor 2 approximation algorithm for the problem when copy-numbers are non-zero and provide an implementation called cnp2cnp. We evaluate our approach experimentally by reconstructing simulated cancer phylogenies from the pairwise distances inferred by cnp2cnp and compare it against two other alternatives, namely the MEDICC distance and the Euclidean distance. Conclusions The experimental results show that our distance yields more accurate phylogenies on average than these alternatives if the given CNPs are error-free, but that the MEDICC distance is slightly more robust against error in the data. In all cases, our experiments show that either our approach or the MEDICC approach should preferred over the Euclidean distance.


Background
Cancer is widely recognized as an evolutionary process during which cells within a population accumulate aberrant somatic mutations and replicate indefinitely [1].These cells are divided in subpopulations, called clones, that share common mutation traits and form tumors.A natural problem that arises is to reconstruct the evolution of a set of clones within a tumor.This question has recently led to the development of several phylogenetic algorithms tailored for cancer evolution.Most of them use either information of single nucleotide variants obtained from bulk [2,3,4,5] or single-cell [6,7,8] sequencing data, or copy-number alter-arXiv:2002.11271v1[q-bio.GN] 26 Feb 2020 ations [9,10,11,12,13] (usually in the context of single-cell data).We refer the reader to [14] for a survey of these methods.
In this work, we are interested in the problem of inferring the minimum number of copy-number alteration events that explain how a cell evolved into another.In tumors, several events can make the copy-number of a gene different from the normal diploid two-copy state, thereby creating copy-number aberrations.As an example, the breakage-fusion-bridge (BFB) phenomenon [15] occurs when a region including a telomere breaks off a chromosome.During replication, two sister chromatids have unterminated ends and they fuse, leading to what is essentially a chromosome portion concatenated with a reversed copy of itself (see [15,16] for a more thorough explanation).Afterwards, the centromeres of the fused chromatids get pulled in opposite directions, leading to another breakage.This BFB cycle repeats until the chromatids receive a telomere (usually after translocation).Each BFB event potentially doubles the copy-number of a gene, and since these events are known to occur in cycles, a gene copy-number may become significantly higher than normal (i.e. more than double) in a short evolutionary time span.Other examples of events include focal deletions [17,18] or missegregation of chromosomes [19].
Desper et al. [20] were among the first to consider copy-number aberrations for phylogenetic reconstructions, using comparative genomic hybridization (CGH) data to reconstruct a mutation hierarchy.In [21], Liu et al. propose a distance-based approach based on CGH data to infer multi-cancer phylogenies.Single-cell phylogenetics then gained widespread attention in an influential paper of Navin et al. [9].The authors applied single-nucleus sequencing on a breast cancer tumor, obtained the copy-number profile (CNP) of several cells, each represented as a vector of integers, and used the Euclidean distance to compare two CNPs.Later, Schwarz et al. [22] pointed out that a single event can amplify or delete large portions of a chromosome, thereby altering the copy-number of several genes and making the Euclidean distance overestimate the true number of events.
The authors proposed the following methodology to compare two CNPs.First, assuming diploid genomes, the copy-number for the two alleles of each gene (which can differ) is inferred from sequencing data.The correspondence between the copynumbers and the alleles is unknown, so a phasing step must be applied.This consists of assigning each copy-number to one of the two alleles (this is done under a minimum-evolution principle, see [22] for details).After this step, each chromosome can be represented as a pair of CNPs, and chromosomes from two cells can be compared by computing the distances between the corresponding alleles.The distance proposed is the minimum number of segmental amplification and deletion events required to transform a given CNP into another.
In this work, we focus on the latter step.We assume that the CNP inference and the phasing steps have been performed, and must find a most parsimonious sequence of events explaining two given CNPs.This is analogous to classical rearrangements problems [23], but the main novelty (and difficulty) of CNP comparison is that only copy-numbers are known, not the ordering of genes.In [22], Schwarz et al. introduced the MEDICC model, which approximates segmental events on a chromosome by events that alter a subinterval of a CNP by +1 or -1. Figure 1 shows an example Figure 1: Left: two CNPs u and v, represented as integer vectors.The CNP u can be turned into v with three events: two deletions and one amplification.Right: a visual representation of the difference vectors obtained at each step.Note that a 0 remains a 0 even after amplification.turning a CNP u into another v (under our model where any amount of change is allowed).The problem of computing the minimum number of subinterval alterations to transform one CNP into another was solved in exponential-time in [22] by modeling CNP events with a finite-state transducer.Zeira et al. [24] gave a linear time algorithm, using a clever trick for computing each row of a quadratic-size dynamic programming table in constant time (similar to the techniques used in [25]).In [11], the large phylogeny problem under this model is shown NP-hard, though solvable with an ILP.They also present the copy-number triplet problem, which when given two CNPs u and v asks for a CNP whose sum of distances to u and v is minimized.The problem can be solved in pseudo-polynomial time O(n 2 N 7 ), where n is the CNP size and N the maximum copy number.Other distances and phylogenetic approaches are discussed in [12,10,13,11,26,27] Our results.The above CNP comparison frameworks limit events to alter copynumbers by 1 or −1.As we exemplified with BFB, several copies of a gene can be affected by a single event.Moreover, the MEDICC software has a copy-number limit of 4, making it inappropriate for genes attaining copy-numbers in the tens, twenties or even more, as has been reported for e.g. the MYC or EGFR genes [28,29,30].In this work, we address these limitations by generalizing the Copy-Number Transformation problem defined in [22,24].We define a distance d f (u, v) between two CNPs u and v which assigns a weight of f (c, δ) to an event that alters a copynumber of c by an amount of δ.We show that computing d f (u, v) becomes strongly NP-hard whenever we allow deletions of any amount at unit cost.In the context of our problem, "strongly" means that our hardness holds even if N , the maximum value in u and v, is polynomial in n, the number of elements in our CNPs.This is especially relevant, given that the MEDICC model was initially solved in time O(nN ) and that the copy-number triplet problem can be solved in time O(n 2 N 7 ).Our result implies that such pseudo-polynomial time algorithm are impossible in our case unless P = NP.We then show that if any amount of change is permitted across an interval at unit cost, then a simple linear-time factor 2 approximation algorithm can be devised.We validate our approach by reconstructing simulated phylogenies using neighbor-joining (NJ), and compare them with the MEDICC distance and Euclidean distance.We perform our experiments on error-free data and noisy data (where the true copy-numbers are altered by a random amount).Using a variety of simulation papameters, we show that both our distance and the MEDICC distance achieve significantly better accuracy than the Euclidean distance.Our distance is slightly more accurate on error-free data, and the MEDICC distance is slightly more tolerant to error.

Results
We first provide the required preliminary notions required to state our theoretical results.We then show that computing our copy-number distance is strongly NP-hard, and present our approximation algorithm.Finally, we present our experimental results on reconstructing simulated phylogenies.

Preliminary notions
Throughout the paper, we use the interval notations [n] = {1, 2, . . ., n} and [s, t] = {s, s + 1, . . ., t}.Given a vector u = (u 1 , . . ., u n ) of n integers and i ∈ [n], we will always write u i for the value at the i-th position of u.If u i = 0, then i is called a null position.We will assume that every vector u of dimension n has special values u 0 = u n+1 = 0. We denote by u −{i} the vector obtained by removing position We assume that a reference chromosome is partitioned into contiguous subsequences, called positions, each numbered from 1 to n.A copy-number profile (CNP) is a vector u = (u 1 , . . ., u n ) of non-negative integers representing the copy-number of each position in a clone.We consider amplification and deletion events, which respectively have the effect of increasing and decreasing the number of copies in a chromosome.As in [22,24], we assume that events affect a set of positions that are contiguous in the reference chromosome.
An event is a triple e = (s, t, δ) where 1 ≤ s ≤ t ≤ n and δ ∈ Z \ {0}.Here the [s, t] interval depicts the set of affected positions, and δ is the amount of change.The event e is an amplification when δ > 0 and a deletion when δ < 0. A copy-number cannot drop below 0 and cannot increase from a 0 to another value (e.g.new genes cannot be created once completely lost).Applying event e = (s, t, δ) on a CNP u yields another CNP u = (u 1 , . . ., u n ) with, for i ∈ [n], We denote by u e the CNP obtained by applying event e on a CNP u.More generally, if E = (e 1 , . . ., e k ) is an ordered sequence of events, we write u E = u e 1 e 2 . . .e k to denote the CNP obtained by applying each event of E in order.We may also write u e 1 . . .e k instead of u (e 1 , . . ., e k ) .Given two CNPs u and v of dimension n, we say that We will often use the difference vector of u and v, and usually denote w := u − v.
The representation of w as in Figure 1 on the right provides the following intuition: if u E = v, then the events of E need to "squish" that values of w to 0 to make u equal to v (ensuring that no value u i of u drops to 0 in the process unless v i = 0).

Minimum cost transformations
Given two CNPs u and v, our goal is to find a minimum-cost sequence E that transforms u into v.In [22,24], the cost of an event (s, t, δ) is |δ|.Here, we propose a generalization by defining a cost function f : N × Z → N >0 that assigns a positive cost to altering a copy-number c by an amount of δ.That is, if we apply (s, t, δ) on u, each position i ∈ [s, t] has its own corresponding cost f (u i , δ), which could be interpreted as the plausibility of going from copy-number u i to max(u i + δ, 0).We then define the cost cost f (u, e) with respect to f of applying e = (s, t, δ) on u as the maximum cost within [s, t], i.e.
The events proposed in the MEDICC algorithm of Schwarz et al. can be decomposed into δ events of unit cost.This can be modeled under our framework with a function mdc defined as mdc(u i , δ) = 1 if δ ∈ {−1, 1} and mdc(u i , δ) = ∞ otherwise.Alternatively, one could state that a position with copy-number u i can hardly more than double in a single event (assuming that amplifications are duplications), but that deletions can suppress any number of copies.We call this the doubling function dbl, defined as dbl(u i , δ) = 1 if u i + δ ≤ 2u i , and dbl(u i , δ) = ∞ otherwise.
Finally, the most permissive cost function any allows any movement without constraint: simply define cost(u i , δ) = 1 for any δ ∈ Z.This can, for instance, be used to model succession of events that can potentially amplify copy-numbers above their double in a short time span -an example of this being BFB cycles.
In this paper, we mostly analyze the any function for its simplicity, but will sometimes use the dbl function for its relevance.Given two CNPs u and v and a cost function f , the cost of a sequence of events E = (e 1 , . . ., e k ) satisfying u E = v is equal to the sum of the cost of applying successive events of E on u, i.e.
) for any other sequence E satisfying u E = v, then E is called optimal.The f -distance between u and v, denoted d f (u, v), is the cost of an optimal sequence of events transforming u into v.Observe that this "distance" is not symmetric (hence the use of double-quotes).For instance, if u = (1, 1) and v = (0, 0), then d mdc (u, v) = 1 but d mdc (v, u) is undefined since v cannot be transformed into u.We will therefore usually assume that u does not have any null position.We note here that the median distance, defined as min w∈V (d f (w, u) + d f (w, v)) (where V ranges over Z n ), is symmetric for all the functions mentioned above.However, no efficient algorithm is known for any median distance.Our problem is the following.
The CNP-transformation problem: Given: a source CNP u, a target CNP v, a cost function f and an integer k; We say that f is a unit-cost function if f (c, δ) ∈ {1, ∞} for any c and δ (e.g. the functions mdc, dbl and any).A cost function f is called deletion-permissive if cost f (u i , δ) = 1 for any u i and any δ < 0, i.e. there is no particular constraint on deletions.We will mainly focus deletion-permissive functions, the rationale being that unlike duplications, deletions could suppress an arbitrary number of copies.

General properties
Before proceeding with our results on computing f -distances, we present some results of general interest that will be useful later on.
Proposition 1.For any two CNPs u and v of the same dimension, any position i ∈ [n] and any cost function We omit the proof details.Proof Denote E = ((s 1 , t 1 , δ 1 ), . . ., (s k , t k , δ k )).For any position i, the sum k j=1 δ k does not change even if we reorder the events in E, so u i should still become v i after reordering the event and applying them on u.The only danger is that a position drops to 0 since v has no null position, but this cannot happen if all amplifications are moved in front of E.

Strong NP-hardness
We show that the CNP-transformation problem is strongly NP-hard.This result holds for any deletion-permissive unit-cost function f , and even if u and v contain no null position (we note that in [24], null positions make the problem more complex, but not here).In particular, the hardness also holds if only deletions are allowed.We assume that we are given two CNPs u and v and we put w := u − v.
Suppose that w contains a staircase in interval [1, k] for some k, and that Intuitively, E removes the first step, then the second, and so on, see Figure 2. Observe that in a smooth deletion sequence, the positions to the right of k may or may not be affected by deletions.
Lemma 2. Let u and v be two CNPs with no null positions and let f be any then there exists a smooth sequence transforming u into v.
Lemma 2 requires the most technical proof of the paper (by far), and we defer it to the Supplementary material.The reduction becomes relatively simple when given this lemma.Our reduction is from the 3-partition problem.In this problem, we are given a multi-set S = {s 1 , . . ., s n } of n = 3m positive integers.Defining t := s i , we are asked whether S can be partitioned into m subsets S 1 , . . ., S m , each of size 3, such that s∈Si s = t for all i ∈ [m].This problem is known to be strongly NP-hard [31] (i.e. it is hard even if the values of S are O(n k ) for some constant k).The proof can be found in the Supplementary material.
Theorem 1.The CNP-transformation problem is strongly NP-hard for any deletion-permissive unit-cost function, even if the CNPs have no null positions.

Approximation algorithm
In this section, we show that if v does not contain any null position, then d any (u, v) can be approximated within a factor of 2 in linear time.We discuss practical ways of handling null positions at the end of the section.We now assume that f = any and will write d(u, v) instead of d any (u, v).
As usual, u and v are the source and target CNPs, respectively, and w := u − v.The idea of the approximation is that if two consecutive positions i and i + 1 have the same difference between u and v, i.e. w i = w i+1 , then their value needs to change by the same amount.It might then be a good idea to treat these positions as one and always affect both with the same events.In fact, a whole interval of equal w values can be treated as a single position.We show that the number of distinct equal intervals gives a good bound on d(u, v).

Approximation by flat intervals
Recall that if w is a vector of n integers, it has implicit values In fact, in the remainder, we will omit the term "maximal" and always assume that discussed flat intervals are maximal.We write F w for the set of flat intervals of w.Note that this set is well-defined and that it partitions [0, n + 1], by the maximality property.The intervals that contain 0 and n + 1 in F w are called extreme flat intervals, and always have a value of 0 (also, these intervals are possibly [0, 0] and/or [n + 1, n + 1], but not necessarily).The key lemma says that d f (u, v) is at least about half the number of flat intervals (see Supplementary material).Lemma 3. Let u, v be two distinct CNPs with no null positions, and let w := u−v.Then for any unit-cost function f , Lemma 3 yields a very simple factor 2 approximation algorithm: compute F w , and return |F w | − 2. This corresponds to a solution in which we treat each flat interval separately (ignoring the two extremities) and is guaranteed to be at most twice the optimal number of events.Computing F w can be done in a single pass through w by increasing a counter whenever we encounter a position i with w i = w i−1 .
Theorem 2. The CNP-transformation problem can be approximated within factor 2 in linear time for cost function f = any when the CNPs contain no null position.
It is open whether this could be adapted to other functions, e.g. the dbl function.

Improvements to the approximation algorithm
We first observe that the bound in Lemma 3 is essentially tight.This can be seen with any u, v such that u − v = (1, 2, 3, . . ., k − 1, k, k − 1, . . ., 3, 2, 1) for some k.Indeed, one can decrease |F w | by two at each round.On the other hand, our naive 2-approximation is twice as bad as optimal.We show how to improve this in a heuristic fashion by devising an algorithm that can only perform better than the naive one.We leave it as an open problem to determine the approximation guarantees of this algorithm.
Our goal is to apply events that reduce |F w | by two as many times as possible.In a greedy fashion, we apply the following strategy for our improved 2-approximation: as long as u = v, find an event e that reduces |F w | by 2, if one exists, and apply it to u.If no such event exists, take the leftmost non-extreme flat interval [a, b] of w and apply the event (a, b, −w a ).Repeat until u = v.
An event (i, j, δ) reduces |F w | by 2 precisely when w i−1 − w i = w j+1 − w j = δ (i = j is possible).This way we can merge the two flat intervals at the ends of [i, j].One can find a good interval by checking all the O(n 2 ) subintervals [i, j] and then, for each of them, checking whether w i−1 − w i = w j+1 − w j .Moreover, we must check whether applying the event (i, j, δ) would make a value of u go below 0. Verifying every possible event can be done in time O(n 3 ) and as there are O(n) flat intervals, the algorithm takes time O(n 4 ).This can be improved to O(n 2 log n) by finding good events in time O(n log n).Due to space constraints, we relegate the detailed analysis of the improved heuristic to the Supplementary material.The idea is to scan w from left to right and store in a treap data structure (see [32]) the set of flat intervals encountered so far, which allows to detect quickly whether the current flat interval could be matched with another one.

Handling null positions
Our approximation ratio is not guaranteed to hold when there are many null positions.However, we show that in many practical cases, we can simply ignore null positions and remove them.In particular, we may assume that v has no two consecutive null positions (Lemma 4) and that for any null position i in v, we have w i−1 < w i and w i+1 < w i (Lemma 5).Thus instances with null positions can be reduced to ones where the only null positions remaining are "sandwiched" between non-null positions with a smaller value in w.
Lemma 4. Suppose that v i = v i+1 = 0 for some position i.Then removing position i or i + 1, whichever is smaller in u, from u and v preserves the distance between u and v. Formally, for any unit-cost function f , if Lemma 5. Suppose v i = 0 for some position i and that

Experiments
We tested our flattening approximation algorithm and its improved version on simulated chromosomes that evolve along a tree through segmental tandem duplications and losses.Chromosomes were represented as strings of genes.Note that we did not simulate CNP evolution under the assumptions of our model.We evolved actual sequences as opposed to integer vectors, and the initial ordering of genes could be broken after several events.Our goal was to reconstruct phylogenies from the distances between the CNPs of the chromosomes at the leaves of the tree.We used the NJ implementation in Phylip [33,34] and compared four distances: (1) our improved approximation; (2) our flat interval count; (3) the mdc cost, as in the MEDICC model; and (4) the Euclidean distance.To compute d mdc , we implemented the dynamic programming algorithm of Zeira, Zehavi and Shamir [24], hereafter called the ZZS algorithm (we could not use the MEDICC software as it only handles copy-numbers up to 4).The Euclidean distance is defined as n i=1 (u i − v i ) 2 , as used in [9].For the first three distances, we took the minimum of d(u, v) or d(v, u) to get a symmetric distance, removing null positions of u and filtering null positions of v as in Lemmas 4 and 5.

Simulated tree generation
We now describe how the trees were generated.First, we select a rooted binary tree T on l leaves labeled {1, . . ., l} uniformly at random.This is achieved by using the recursive splitting process described by Aldous in [35], which starts with a completely unresolved tree, splits the root in two subtrees chosen uniformly at random, and repeats on these subtrees.We then assign to the root r of T an exemplar chromosome, i.e. any string in which each gene occurs exactly once (note that the initial ordering of genes does not matter for our purposes).
Then for each branch uv of T from top to bottom, we select a random number of events k chosen uniformly at random in the interval [e min , e max ], where e min , e max are simulation parameters.To introduce some rate heterogeneity among branches, we then multiplied k by a random number chosen from a uniform distribution with mean and standard deviation 1.The chromosome string at node v is obtained by applying k random events on the chromosome string associated with its parent u.Each event is either a tandem duplication with probability ∆ or a deletion with probability 1−∆.The starting position of each event is chosen uniformly at random on the chromosome string and, to find the length t of the substring affected, we apply the following process.Start with t = 1, then apply the following: as long as a random number between 0 and 1 is above a given parameter r, increment t by 1 and repeat.We stop at the first random number below r.We chose to consider only values r ≤ 0.1 since higher values resulted in copy-numbers in the hundreds, sometimes even in the thousands, a road which we did not deem necessary to explore.Setting r between 0.01 and 0.1 generally resulted in copy-numbers inside [0, 50].We also note that we also experimented on a model where the event length was chosen as a random fraction of the chromosome length -this led to exponential copy-number growth and we did not investigate this model further.
We observed that this process had a tendency to produce leaf chromosomes with CNPs having between 50-60% null positions under most parameter combinations.This might be deemed unrealistic, and furthermore, our results show that no method is able to predict accurate trees under these conditions.To avoid this, we added a condition in the loop determining the length t of an event: if incrementing t implies deleting the last occurrence of a gene, we continue the procedure with probability q and stop with probability 1 − q (where q is another simulation parameter).This can be seen as modeling the idea that there may be resistance when attempting to remove every copy of a gene required for survival.Using q parameter values 0.25, 0.5 and 0.75, the proportion of null positions stayed in the intervals 2-5%, 6-10% and 15-25%, respectively.
Note that since each possible tree on l leaves is equally likely to be chosen, the root-to-leaf distances in a tree can be significantly different, and hence the trees are not expected to be ultrametric (for instance, a caterpillar can be selected as well as a perfectly binary tree).
Since it is difficult to determine the most realistic simulation conditions, we tested several combinations of parameters for the generation of phylogenies.The summary of the simulation parameters, along with their possible and default values, are summarized here: • l ∈ {10, 50, 100} is the number of leaves in the tree.The default is l = 100; • n ∈ {10, 100, 250} is the number of genes (i.e.distinct characters) in the root chromosome (n is also the number of positions in our vectors).The default is n = 100; • (e min , e max ) ∈ {(2, 4), (5,10), (20,40)} is the range of the possible number of events on each branch.The default is (e min , e max ) = (5, 10); • ∆ ∈ {0.25, 0.5, 0.75} is the probability that an event is a duplication (and 1 − ∆ the probability that an event is a loss).The default is ∆ = 0.5; • r ∈ {0.01, 0.05, 0.1} controls the length of each event: r is the probability that we stop extending the event length.The default is r = 0.05; • q ∈ {0.25, 0.5, 0.75, 1} is the probability that a deletion suppresses the last copy of a gene during the length extension procedure (i.e. 1 − q is the probability that the extension stops if it would make a copy-number 0).The default is q = 0.25.

Tree reconstruction and performance measure
We generated 50 trees for each parameter combination of interest.For each tree, we took the chromosome strings at the leaves, obtained their CNPs and provided them as input to each of the four evaluated methods.We used the normalized Robinson-Foulds (RF) distance as a measure of the performance of each algorithm [36].That is, for each inferred tree, we compare it with the "true" tree by counting the number of clades that are present in one tree but not the other, divided by 2(l − 3) (the maximum number of clades that can possibly differ, recalling that l is the number of leaves).This yields a number between 0 and 1: the lower the number, the better we consider the reconstruction.

Error tolerance
It should be noted that the above methodology ignores several sources of errors.Inferring exact copy-numbers from single-cell sequencing data is a non-trivial task and is still considered an open problem.The inferred CNPs are therefore expected to be noisy, especially with genes having a high copy-number.Moreover, as discussed in [22], assigning copy-numbers to their corresponding allele is also a difficult problem.Here, by only considering single-allele chromosomes, we are supposing that the aforementioned phasing step has been performed correctly, whereas copy-number assignments cannot be assumed to be error-free.
Both of the above problems have the effect of introducing incorrect copy-numbers into the CNPs.To account for this, we gave randomly altered CNPs as input to each method.More specifically, given an error-rate parameter α, for each CNP u and each position i we changed u i to a value chosen at random from a normal distribution with mean u i and standard deviation α • u i (non-integer values were rounded).We tested parameter values α ∈ {0, 0.1, 0.25, 0.5, 1}.

Experimental results
We first ran experiments using the default values for all parameters except one in order to isolate the impact of each parameter.On error-free data, the most interesting results were obtained when varying l and n, see Figure 3.In most situations, our CNP model slightly improves upon the MEDICC model, both of which are significantly better than the Euclidean distance.The number n of genes is quite important: all the results are poor when each CNP has only n = 10 positions, but when n = 250, the trees more accurate.This might be because when n = 10, there is not enough opportunity for positions acquire a distinct signature during the evolutionary process, making all distances very similar.This suggests that many genes or segments should be considered when analyzing copy-number variants in tumor clones.The duplication rate does not seem to affect the accuracy of the methods, whereas accuracy tends to decrease as the number of events per branch increases.
We do note that when the number of events per branch is within [5,10], our model performs better, but when it is high, i.e. in [20,40], the MEDICC model performs better.This tendency is confirmed under other parameterizations (see the Supplementary Material).Figure 4 show the normalized RF distances when varying parameters q and r.As mentioned before, as q gets closer to 1, the proportion of null positions is around 50-60%, making accurate distance computation difficult for all methods.As for the r parameter, the accuracy of our approach is better when r = 0.01 but worse when r = 0.1.This tendency can be observed under all parameter combinations (see the Supplementary Material).One should note that accuracy is generally better if the lengths of events are smaller.The four approaches on errorfree data exhibit similar behavior on other parameter combinations -additional plots can be found in the Supplementary Materials.
The results on CNPs containing errors are summarized in Figure 5.We observe that the ZZS algorithm achieves slightly more accurate trees whenever the error rate is above zero.One possible explanation is that a single error in a CNP can split a flat interval into three.This can significantly alter the flat interval counts, whereas the ZZS distance is less dependant on flat intervals.The accuracy of the Euclidean distance appears to be the least affected by error rates and even performs better when α ≥ 0.5.Observe however that accuracy decays rapidly with error rates: when α ≥ 0.25, all approaches have an average normalized distance above 0.7, casting some doubt on their practical usability in this setting.This suggests that it might be beneficial to apply a CNP error-correction procedure before comparing them (see the Discussion section).More results on noisy data can be accessed in the Supplementary Material.
To summarize, the heuristic and flat count algorithm always yield a lower average RF distance than the ZZS algorithm on error-free data, except when n = 10 (where the average is always above 0.9 anyways), and every method always outperforms the Euclidean distance.On the other hand, the ZZS approach is slightly more robust to error in the CNP counts.However, the accuracy of the heuristic, the flat count and ZZS drops quickly as error rates increase.Even though the Euclidean distance yields better trees at high error rates, their accuracy is still too poor to be able to draw meaningful conclusions from them.Whether the MEDICC model is better than ours or not, we believe that either should be preferred over the Euclidean distance when reconstructing phylogenies from distance matrices as in [9].

Discussion
The results from the experiments section show that our CNP distance performs reasonably well on simulated data.The incorporation of segmental events into the model does not appear to provide a significant advantage over the unitary events of the ZZS model.However, the simulations suggest that both approaches yield better results than the traditional Euclidean distance.This demonstrates that either our method or the ZZS algorithm should be preferred as the CNP comparison component in a single-cell phylogenetic reconstruction pipeline.
It should be noted that our algorithms only approximate the true CNP distance whereas the ZZS algorithm provides an exact solution.In order to evaluate the true performance of our segmental model, exact approaches should be developed in the future, perhaps using techniques from the field of parameterized complexity.Moreover, our approaches are very sensitive to errors, even more so than ZZS/MEDICC.One possible explanation for this is that both of our algorithms derive their distance from the number of flat intervals.A single error in a copy-number can turn one flat interval into three, and thus even moderate levels of noise can lead to highly incorrect predictions.We believe that the ZZS algorithm is less sensitive to such errors because that, if copy-number differences are large enough, a small error only increases the distance by 1, which is small in comparison to all the unit events required to handle the high difference.Therefore, even if the true event distance is overestimated, in a comparative setting the relative distances might be closer to the truth.It will be interesting to consider CNP error correction procedures based on flat intervals.For instance, when performing analysis of multiple cells, one could detect a potentially incorrect copy-number of a given segment by checking whether, after altering a predicted copy number by a small amount, several flat intervals get "fixed" when comparing the cell with others.
Another point of interest is that current approaches, including ours and ZZS/MEDICC, ignore rearrangements that change the ordering of segments.Our models assume that the set of contiguous segments remains the same in all cells during evolution.However, when duplications and deletions occur, the relative ordering of genes changes and the set of contiguous genes affected by the events will differ from that in the reference.Inversions, translocations or even chromothripsis also have the same effect.This is a difficult problem to handle if only CNPs are known, since integer vectors do not contain enough information to determine which genes are contiguous or not.One possibility is to ask the following: given two CNPs C 1 and C 2 to compare, choose a genome G 1 whose CNP is C 1 and a genome G 2 whose CNP is C 2 such that the rearrangement distance between G 1 and G 2 is minimized.
Our work also leaves several questions open.From a theoretical perspective, it remains to achieve a constant factor approximation when null positions are present in the input.Moreover, it is unknown whether the dbl function admits good approximation algorithms and, more generally, whether there are other biologically plausible functions that should be studied.On another note, it might be interesting to investigate the copy-number triplet problem (see the introduction) under our model, as it allows to define a symmetric distance between CNPs.
On a practical level, phylogenetic approaches that are not distance-based should be investigated.For instance, we could consider maximum parsimony as in [11], where the objective is to minimize the number of events across branches of a tree.The recent distance-based approach of Xia et al. [12], which is based on the MEDICC model but with an extra error correction step, should also be evaluated in our setting.On another note, it remains to test our approach on real data.We have ignored the problem of calling copy-numbers and the aforementioned problem of phasing.These can introduce noise in the data and, as shown in our experiments, all the evaluated methods are sensitive to errors.This motivates the need for new methods to assign copy-numbers to alleles under our model.Also, our CNP comparison framework assumes a single-cell setting, where the CNP of each individual cell is known.Since bulk sequencing is still commonplace, it will be useful to develop methods that are able to compare genomes extracted from samples that contains multiple types of cells.

Conclusion
In this work, we provided a general framework for the comparison of CNPs depicting genomes that evolve by segmental amplifications and deletions.We have shown that if there is no bound on the number of copies that a deletion can affect, then computing the minimum number of events transforming one CNP into another is strongly NP-hard.One important implication of this result is that unless P = NP, one cannot use the fact that copy-numbers are not too large (e.g.under 100) to devise a practical pseudo-polynomial time algorithm, and other solutions must be explored.On the other hand, we proposed two simple and fast approximation algorithms that were shown to perform reasonably well on simulated datasets.

List of abbreviations
BFB: breakage-fusion-bridge; CGH: comparative genomic hybridization; CNP: copy-number profile; NJ: neighbor-joining; RF: Robinson-Foulds; ZZS: Zeira, Zehavi, Shamir     Declarations make position k drop below 0).In other words, and by the same arguments as above, So far, we know that only deletions affect position k (Property 1), and all these deletions also affect position k − 1 (Property 2).Because this implies that some amplification event ê must affect position k−1 (otherwise, applying only the deletion events affecting position k on position k − 1 would make position k − 1 drop below v k−1 ).Let us assume, again using Proposition 2, that ê is the first event of E, i.e. e 1 = ê.We use the same trick for a third time.That is, let û := u ê and notice that û has a staircase in interval This contradiction forces us to conclude that l < k is false, which proves the lemma.
Lemma 2. Let u and v be two CNPs with no null positions and let f be any unit-cost function.If u−v contains a staircase in interval [1, k] and d f (u, v) = k, then there exists a smooth sequence transforming u into v.
Proof of Lemma 2. We prove the lemma by induction over k.As a base case, the statement is easy to see when k = 1 since a single step can only removed by a deletion, which is smooth.So assume k > 1 and that for any u , v such that there is an optimal smooth sequence transforming u into v .Let E be any sequence of k events such that u E = v.If E is smooth, then we are done so assume otherwise.The proof is divided in two parts.Assuming the inductive hypothesis, we first show that there is an optimal sequence Ê containing only deletions such that u Ê = v.These deletions are not necessarily smooth.We complete the induction in a second step, where we convert this deletion sequence into a smooth one.For the remainder of the proof, we will denote w := u − v.
Part 1: proof that u can be transformed into v using only deletions.
Assume that E = (e 1 , . . ., e k ) contains some amplification, otherwise we are done proving our first step.We first claim that only deletions affect positions k to n, inclusively.To see this, assume on the contrary that e i = (a, b, δ) is an amplification where b ≥ k.By Proposition 2, we may assume that e i = e 1 .But u e 1 still has a staircase in interval [1, k], and by Lemma 1, d f (u, v) ≥ k.This is a contradiction since e 1 should reduce the distance to from u to v. Hence our claim holds.
We now claim that, on the other hand, some amplification in E affects position k − 1.This is clearly true if every deletion affecting position k also affects position k − 1.Indeed, we have w k−1 < w k and without an amplification on k −1 it would be impossible that position k −1 becomes equal to v k−1 > 0. Thus if we suppose that no amplification affects position k − 1, there must be some deletion e i = (k, h, d) that affects position k but not k − 1, where here h ≥ k.
Let u := u e i .Since no amplification affects any position in [k, h], u has no position with value 0. Furthermore, u − v contains a staircase of length By induction, there is a (smooth) deletion sequence E such that u E = v.In that case, the sequence formed by e i followed by E transforms u into v and has only deletions, which is what we want.Thus we may assume that our claim saying that some amplificatio affects k − 1 holds.Moving on, let e i = (a, k − 1, δ) be an amplification in E that affects position k − 1 (but not k).Our previous claims show that e i exists.By Proposition 2, we may assume that e 1 = e i .Let u := u e 1 and w Moreover, the differences in value between the steps have not changed, except at position a.Formally, for each i By induction, u E = v for some smooth deletion sequence E = (e 1 , . . ., e k−1 ). Here be the deletion events of E that affect position k, i 1 < i 2 < . . .< i l .We distinguish two cases.
Case 1: a / ∈ {i 1 , . . ., i l }.Then the event (a, b a , w a−1 − w a ) of E does not affect position k, meaning that b a = k − 1 (by smoothness).Consider the sequence E obtained from E by replacing the event (a, k − 1, w a−1 − w a ) by the event (a, k − 1, w a−1 − w a ).Since u E − v has a 0 everywhere and w a − w a−1 = w a − w a−1 + δ, it follows that u E − v has value 0 everywhere, except at positions from a to k − 1 where it has value δ.But then, the only difference between u and u is that positions a to k − 1 are increased by δ.Thus u E − v has a value of 0 everywhere (and u never drops below 0, due to the smoothness of E ).This means that u E = v, which is a contradiction since E has k − 1 events.It follows that u has a staircase of length k − 1 in positions [2, k].No event of Ê can affect position 1 after e 1 , so we can ignore this position in u and w .That is, suppose we remove position 1 from u and v, yielding two vectors u and v of length n − 1.Let w := u − v .Then w has a staircase of length k − 1 in interval [1, k − 1].This allows us to use induction, so that there is a smooth sequence Ê of length k − 1 transforming u into v .This easily translates into a sequence Ê transforming u into v: we just "shift" every event to the right to account for position 1 in Ê .To be specific, we replace any event (s, t, ) from Ê by the event (s + 1, t + 1, ) in Ê .Since Ê is smooth, then we can write Ê = ((2, b 2 , 2 ), . . ., (k, b k , k )) where, for each i ∈ {2, . . ., k}, b i ≥ k and d i = w i − w i−1 .We have not shown smoothness yet, because ê1 might not affect the whole [1, k] interval as we wish.If indeed ê1 affects position k, i.e. if b ≥ k, then it is easy to see that applying ê1 followed by Ê is a smooth sequence transforming u into v.Thus we may assume that b < k.Observe that w i − w i−1 = w i − w i−1 for all i ∈ {2, . . ., k} \ {b + 1}, because w b+1 − w b = w b+1 − w b + w 1 (recall that −δ = w 1 ).Let (b + 1, b , w b − w b+1 − w 1 ) be the deletion of Ê that starts at position b, where b ≥ k by smoothness.Suppose that we replace it with the deletion (b + 1, b , w b − w b+1 ) in Ê , yielding an alternate sequence Ẽ.Then u Ẽ − v has a 0 everywhere, except at positions b + 1 to b where it has value w 1 .This means that if in Ê, we replace ê1 by ẽ = (1, b , −w 1 ) and follow it by Ẽ, we obtain a sequence transforming u into v.Now, let ũ := u ẽ .If we remove position 1 from ũ (recalling that ũ1 = v 1 ) and from v, we obtain a CNP with a staircase at [1, k − 1].Applying induction, we get a smooth sequence Ẽ which we can modify into Ẽ to make it applicable to u (just as we did from Ê to Ê ).It is then straightforward to see that ẽ1 followed by Ẽ is a smooth deletion sequence turning u into v.
Theorem 1.The CNP-transformation problem is strongly NP-hard for any deletion-permissive unit-cost function, even if the CNPs have no null positions.
Proof of Theorem 1. From a 3-partition instance S = {s 1 , . . ., s n }, construct u and v as follows.First define K := 100n and, for all i ∈ [n], put p i := i j=1 s j , the idea being that p i and p i−1 differ by an amount of s i .Then put v as a vector containing only 1s.For u, construct it by adding one position at a time from left to right: first insert the values i + 1 + Kp i for i = 1..n, and then the values i(Kt + 3) + 1 for i = m..This can be done in polynomial time in n (in particular, each p i is polynomial).Observe that we have w = (1 + Kp 1 , . . ., n + Kp n , m(Kt + 3), . . ., Kt + 3) In particular, w has a staircase in interval [1, n], followed by a decreasing staircase in interval [n + 1, n + m].By Lemma 1, we know that d f (u, v) ≥ n.We will show that S is a YES-instance to 3-partition if and only if d f (u, v) = n.
(⇒): Suppose that there exists m triplets S 1 , . . ., S m such that s ∈Si s = t for all i ∈ [m].We may assume that each s i ∈ S is distinguishable, so that for each s i there is a unique k such that s i ∈ S k .We construct a sequence E = (e 1 , . . ., e n ) of n deletions such that u where k if the unique integer such that s i ∈ S k .Note that the e i events are allowed because f is deletion-permissive (this is actually the only place where we need this assumption).One can check that E is a smooth deletion sequence and it is clear that positions 1 to n become equal to 1 after applying E on u.Now consider the events that end at position n + k, k ∈ [m].For each s i ∈ S k , there is such an event that decreases all the positions n + 1 to n + k by w i − w i−1 = Ks i + 1.We get si∈S k (Ks i + 1) = Kt + 3. Since this is true for every position from n + 1 to n + m, the total decrease for a position k ∈ [m] will be . ., e n ) be an optimal sequence of events transforming u into v.By Lemma 2, we may assume that E is smooth.Thus each e i is a deletion of the form (i, b i , w i−1 − w i ) = (i, b i , −(Ks i + 1)), where b i ∈ [n, n + m].Let us define S k := {s i : b i = n + k}.We claim that si∈S k (Ks i + 1) = Kt + 3.For k = m, this must be true since w n+m = Kt + 3.For k < m, we have the difference w n+k −w n+k+1 = Kt+3.This means that the deletions that affect position n + k but not n + k + 1 (i.e.those with b i = n + k) must incur a total decrease of exactly Kt + 3, as claimed.We now argue that We have therefore shown that |S k | = 3, which in turn implies that si∈S k s i = t.Therefore S is a YES instance.Lemma 3. Let u, v be two distinct CNPs with no null positions, and let Proof of Lemma 3. We prove the Lemma by induction on d f (u, v).As a base case, when d f (u, v) = 1, then F w has 3 flat intervals: the extreme ones and the flat interval that gets affected in the single event transforming u into v (recall that we have artificial positions w 0 = 0 and w n+1 = 0, which guarantee that there are always two extreme intervals plus another one somewhere in [i1, n]).The statement is clearly true in this case, as |F w | − 1)/2 = 1.Now assume that the Lemma holds for any pair of CNPs u , v satisfying d f (u , v ) < d f (u, v).Let E = (e 1 , . . ., e k ) be an optimal sequence of events Lemma 4. Suppose that v i = v i+1 = 0 for some position i.Then removing position i or i + 1, whichever is smaller in u, from u and v preserves the distance between u and v. Formally, for any unit-cost function f , if Proof of Lemma 4. Assume that u i ≥ u i+1 (the other case is identical).We know that d f (u, v) ≥ d f (u −{i+1} , v −{i+1} ), by Proposition 1.We consider the converse bound.Take any sequence E = (e 1 , . . ., e k ) of events transforming u −{i+1} into v −{i+1} .Modify E to transform u into v as follows: each event affects the same positions as before (including those that have shifted after reinserting i + 1), but we ensure that every event affecting position i also affects position i + 1.To be formal, define E = (e 1 , . . ., e k ) as follows.If e i increases interval [a, b] by δ (which is possibly negative), then make e i increase interval [a , b ] by δ, where Aside from the new position i in u and v, every position reaches the same value as before.Also because u i ≥ u i+1 , position i + 1 reaches 0 after applying E on u.Lemma 5. Suppose v i = 0 for some position i and that w i−1 ≥ w i or w i+1 ≥ w i .Then d f (u, v) = d f (u −{i} , v −{i} ) for any unit-cost function f .
Proof of Lemma 5.The proof is essentially the same as in Lemma 4. If, without loss of generality, w i−1 ≥ w i , we can take an event sequence from u −{i} to v −{i} and adapt it so that every event affecting position i − 1 also affects position i.This guarantees that position i drops to 0. We omit the technical details.

Finding good events in time O(n log n)
We say that an event e is good if applying it on u reduces |F w | by 2.Here we present the detailed version of our improved heuristic.The main algorithm that follows transforms u into v by making calls to the f indGoodEvent subroutine, which is defined afterwards.

Data: vectors u, v
Result: Find a sequence that transforms u into v compute w := u − v; initialize empty sequence S; for u = v do if findGoodEvent(u, v, w) returns (i, j, x) then add (i, j, x) to S; for k = i, ..., j do u k = max u k + x, 0 else find the first flat interval [i, j] with w i = 0; increase u i , . . .u j by −w i ; add (i, j, −w i ) to S; return S Algorithm 1: Main algorithm The algorithm f indGoodEvent below can be implemented in time O(n log n).
Our goal is to find a range of values [i, j] that verifies w i − w i−1 = w j − w j+1 := −δ.We further need that δ > 0, or that δ < 0 and ∀k ∈ [i, j], u k ≥ −δ : we can then apply the event (i, j, δ).To achieve this, the idea is simply to scan w from left to right.Each time we detect a change of w k − w k+1 , we check if we encountered the same amount of change before at some position k (this is −δ in the algorithm).If so, we can return the k, k pair since it can be part of a good event.Otherwise, we map δ = w k+1 − w k to position k + 1 to store the fact that k + 1 is the latest position that could be matched with a change of δ.
The last line of the for loop ensures that if we match two positions k < k, all positions in-between are sufficiently high to allow a deletion of amount δ.We argue two components: that f indGoodEvent does find a good event, if there is one, and that it can be implemented to take time O(n log n).
Proof that Algorithm f indGoodEvent returns an event (i, j, δ) that reduces |F w | by 2 when it exists.Consider an output (i, j, δ).Due to the construction, we had −δ ∈ R, which can only be inserted with −δ = w i − w i−1 and δ = w j+1 − w j , so w i−1 − w i = w j+1 − w j , in which case it is easy to see that F w is reduced by 2. Furthermore, if δ < 0 and we had some k ∈ [i, j] with −u k > δ, the k-th iteration would have deleted δ from E. This means that (i, j, δ) is indeed an event that reduces |F w | and does not make any u k drop to 0. Reciprocally, if there is an event (i, j, δ) to be found we want to prove that the algorithm returns something (not necessarily the same event).If the algorithm exits before iteration j, it returns some event that we have already proven must be correct.Let us assume that we do not exit the loop before iteration j : we have added −δ at rank i, and it is still in R because for every k ∈ [i, j] we did not have −δ > u k by hypothesis.Since −δ is in E and w j+1 − w j = x, the algorithm returns (i, j, δ).
Complexity.The complexity of f indGoodEvent depends on the following operations: we need to be able to test the existence of a value in a dictionary, to add a key/value pair and, a bit less usual, to filter all values lower than a certain amount (the last line of f indGoodEvent).We can use a treap structure (see [1]), which is a form of binary search tree that allows to split the values higher and lower to a certain number in log n time.This gives us a total complexity of O(n log(n)).
Given a CNP w of length n, an interval [a, b] is a staircase of w if 0 < w a < w a+1 < . . .< w b .The length of the staircase [a, b] is b − a + 1. Figure 2 depicts a staircase of length 4. The next lemma can be useful to obtain quick lower bounds on a particular instance, and plays an important role in our hardness result (proof in Supplementary material).Lemma 1.Let u, v be two CNPs with no null positions.If u−v contains a staircase [a, b] of length k, then d f (u, v) ≥ k for any unit-cost function f .

Figure 1 :
Figure 1: an example of a CNP-to-CNP transformation.

Figure 2 :
Figure 2: a visual representation of a staircase and a smooth deletion sequence.

Figure 3 :
Figure 3: average normalized RF distances of the four methods evaluated when varying l, n, ∆ and (e min , e max ).

Figure 4 :
Figure 4: average normalized RF distances of the four methods evaluated when varying q and r.

Figure 5 :
Figure 5: average normalized RF distances of the four methods evaluated with varying error rates.

Case 2 :
a = i h for some h ∈ [l].Then the deletion of E starting at a is (a, b a , −(w a − w a−1 )) = (a, b a , w a−1 − w a − δ) and affects position k, i.e. b a ≥ k.Consider the sequence E obtained from E by replacing the event (a, b a , w a−1 − w a − δ) by (a, b a , w a−1 − w a ).Then u E − v has a 0 everywhere, except at positions from a to b a where it has value δ.Also, u E − v has a 0 everywhere, except at positions from k to b a where it has value δ.We can apply the deletion (k, b a , −δ) to u E to obtain v. Since E has k−1 events, this yields a sequence of k deletions transforming u into v.This concludes the first part.That is, we have shown that if our inductive hypothesis holds, then some deletion sequence of length k transforms u into v.Part 2: construction of a smooth sequence.Now let Ê = (ê 1 , . . ., êk ) be a sequence of k deletions transforming u into v, which exists by Part 1.Let (1, b, δ) be any deletion affecting position 1.Since Ê contains only deletions, it is safe to assume that ê1 = (1, b, δ).Let u := u ê1 and w := u − v.If −δ < w 1 , then w contains a staircase of length k and we reach a contradiction since this implies d f (u , v) ≥ k.If −δ > w 1 , then w 1 < 0 and position 1 can never have the same value as v 1 since Ê has only deletions.We deduce that −δ = w 1 .

1 .
That is, let v = (1, 1, . . ., 1) u = (2 + Kp 1 , 3 + Kp 2 , . . ., n + 1 + Kp n , m(Kt + 3) + 1, . . ., (Kt + 3) + 1) such that u E = v.Let û := u e 1 and ŵ := û − v. Let e 1 = (c, d, x), where x could be negative in case of a deletion.Let F w = {[a, b] ∈ F w : [a, b] ∩ [c, d] = ∅} be the affected flat intervals.Assume that F w has l ≥ 0 intervals, say F w = {[a 1 , b 1 ], . . ., [a l , b l ]}, and that they are ordered so that b i + 1 = a i+1 for each i ∈ [l − 1].First consider [a i , b i ] with 2 ≤ i ≤ l − 1.Note that [a i , b i ] cannot be an extreme flat interval in w.We claim that [a i , b i ] must still be a non-extreme flat interval in û.To see this, observe that ŵai−1 = w ai−1 + x and ŵai = w ai + x.Since w ai−1 = w ai by maximality, we have ŵai−1 = ŵai .By a similar argument, ŵbi+1 = ŵbi .And because all values in [a i , b i ] have changed by the same amount x, [a i , b i ] is a (maximal) flat interval (note that we need the assumption of no null positions to argue that all positions change by the same amount).Moreover, [a i , b i ] cannot be extreme.If instead [a i , b i ] was in the extreme interval containing w 0 , then we would have ŵh = 0 for all 0 ≤ h ≤ b i .In particular, this would imply ŵai−1 = ŵai , contrary to what we just argued.The same occurs if we assume that [a i , b i ] is part of the extreme interval containing w n+1 .Now consider any flat interval [a, b] ∈ F w \ F w .It is easy to see that [a, b] is still a flat interval in ŵ, unless perhaps if b + 1 = a 1 or a − 1 = b l .In these cases, it is possible that ŵb = ŵa1 and/or ŵa = ŵb l .These have the effect of "merging" two flat intervals, effectively eliminating [a 1 , b 1 ] and/or [a l , b l ] (note that the argument also holds when [a 1 , b 1 ] or [a l , b l ] become part of an extreme interval).Since every flat interval except these two stays in ŵ, it follows that |F ŵ| ≥ |F w | − 2. Then using induction, d f (u, v) − 1 = d f (û, v) ≥ (|F w | − 3)/2 = (|F w | − 1)/2 − 1 and it follows that d f (u, v) ≥ (|F w | − 1)/2 .

Data:
vectors u, v, w Result: Find an event that reduces |F w | by 2 initialization of an empty dictionary R;for k = 1, ..., n − 2 do δ := w k+1 − w k ; if δ == 0 then continue ; if −δ ∈ R then return (R[−δ], k, δ); else Set R[δ] = k + 1;delete all the key/value pairs (x, y) in R with u k ≤ x; return no possible event Algorithm 2: findGoodEvent The idea is that given a sequence of events E transforming u into v, we can apply E on u −{i} by ignoring position i when it is affected.A sequence of events E is called amp-first if all amplifications appear before all deletions.An amp-first reordering of a sequence E is an amp-first sequence E that contains the same events as E. Notice that if E has a amplifications and d deletions, then there are a!d! amp-first reorderings of E. Proposition 2. Let u and v be two CNPs with no null positions.If a sequence E satisfies u E = v, then any amp-first reordering E of E satisfies u E = v.