Explaining evolution via constrained persistent perfect phylogeny
BMC Genomics volume 15, Article number: S10 (2014)
Abstract
Background
The perfect phylogeny is an often used model in phylogenetics since it provides an efficient basic procedure for representing the evolution of genomic binary characters in several frameworks, such as haplotype inference. The model, which is conceptually the simplest, is based on the infinite sites assumption, that is, no character can mutate more than once in the whole tree. A main open problem regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data when the infinite sites assumption is violated because of, e.g., back mutations. A special case of back mutations that has been considered in the study of the evolution of protein domains (where a domain is acquired and then lost) is persistency, that is, the fact that a character is allowed to return to the ancestral state. In this model characters can be gained and lost at most once. In this paper we consider the computational problem of explaining binary data by the Persistent Perfect Phylogeny model (referred to as PPP), and for this purpose we investigate the problem of reconstructing an evolution where some constraints are imposed on the paths of the tree.
Results
We define a natural generalization of the PPP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained PPP (CPPP). Based on a graph formulation of the CPPP problem, we are able to provide a polynomial time solution for the CPPP problem for matrices whose conflict graph has no edges. Using this result, we develop a parameterized algorithm for solving the CPPP problem where the parameter is the number of characters.
Conclusions
A preliminary experimental analysis shows that the constrained persistent perfect phylogeny model makes it possible to efficiently explain data that do not conform to the classical perfect phylogeny model.
Background
Character-based phylogeny is a broad notion to represent an evolutionary history describing the ancestral relationships among extant taxa or individuals. Recent applications show that the model can be applied to study the evolution of mutations related to various genomic information, such as protein domains [1] or markers in tumors. Thus in our formulation, it is not important whether we are actually studying taxa or individuals or other genomic data. We will follow the usual convention of calling species the units of study. The main element of this notion is that the instance is also made of a set of characters, and each species is in a specific state for each character [2]. The goal is to find a phylogeny where the known species are the leaves, and the internal nodes are labeled, just as the leaves, by a state for each character. For each edge (x, y) of the phylogeny, the mutated characters along the edge are those whose states are different in x and y. The simplest case is when all characters are binary, that is, only two states (0 and 1) are possible, modeling the situation when each species has or does not have a given feature, such as wings (a phenotypical trait) or the mutation encoding lactase persistence (a genotypical trait).
Moreover, we are assuming a coalescent model, that is the fact that a characteristic shared by a set of species can be traced back to a single ancestral species. Assuming that the state 1 encodes the fact that a species has a given character (for example, the fact that the species has acquired a given mutation), the coalescent model implies that the phylogeny is directed. Restrictions on the type of changes from zero to one and vice versa lead to a variety of specific models [3].
The perfect phylogeny is one of the most investigated coalescent models [2]. Conceptually the model is based on the infinite sites assumption, that is, no character can mutate more than once in the whole tree. The binary perfect phylogeny problem has received much attention, culminating with the linear time algorithm when all data is known [4] and an efficient algorithm when the input data is incomplete [5]. While the infinite sites assumption is quite restrictive, the perfect phylogeny model turned out to be splendidly coherent within the haplotyping problem [6, 7], where we want to distinguish the two haplotypes present in each individual when only genotype data is given. More precisely, the interest here is in computing a set of haplotypes and a perfect phylogeny such that the haplotypes (i) label the vertices of the perfect phylogeny and (ii) explain the input set of genotypes. This context has been deeply studied in the last decade, giving rise to a number of algorithms [8, 9]. Still, the perfect phylogeny model and the assumptions that have been central in the previous decades cannot be employed without adaptations or improvements. A first generalization in the literature allows for more states (but keeping the infinite sites assumption). In the general case, the problem is NP-hard [10], but it has an algorithm parameterized by the number of states [11, 12]. The special cases when there are three or four possible states have more efficient algorithms [13–15].
Even allowing more states cannot explain the biological complexity of real data when homoplasy events (such as recurrent mutations or back mutations) are present. Two cases where those limitations are evident are the study of carcinogenesis and protein domains. Carcinogenesis consists of the factors and mechanisms that cause the onset of cancer; it results from many combinations of mutations, but only a few, called progression pathways, seem to account for most human tumors [16]. The observation that tumors are evolving cell populations leads to phylogeny-based studies. At the same time, the intrinsic nature of quickly and degenerately proliferating cancer cells results in a relatively high number of sites with multiple mutations (i.e., in violations of the infinite sites assumption). A protein domain is a part of protein sequence and structure that can evolve independently of the rest of the protein chain. Many proteins consist of several structural domains, while a domain may appear in a variety of different proteins. In this case it is quite frequent to acquire a domain and then to lose it [17].
Thus a central goal of this paper is to find a model that is more widely applicable than the perfect phylogeny, while retaining its computational efficiency (in fact, more general models such as the Dollo and the Camin-Sokal models are NP-hard [3]). The problem of constructing phylogenies where the deviations from perfect phylogeny are small has been tackled under the name of near perfect phylogeny [11] or near perfect phylogeny haplotyping problems [18]. In particular, the impossibility of losing a character that has been previously acquired is too restrictive, which has led to more elaborate models, such as the persistent character [1] and the General Character Compatibility [19, 20].
More precisely, the Persistent Perfect Phylogeny model [21] allows each character to be lost (i.e., going from state 1 to 0) on at most one edge of the phylogeny, while the General Character Compatibility imposes some restrictions on the possible mutations (that is, on the possible states labeling the endpoints of an edge), while allowing the input data to be a set of possible states for each character of a species. In this paper we combine the Persistent Perfect Phylogeny (PPP) and the General Character Compatibility (GCC), introducing the Constrained Persistent Perfect Phylogeny problem (CPPP), which generalizes the PPP by adding a constraint for some characters c in the input data, given by the fact that they cannot be persistent for some species s (i.e., the state of c does not go from 1 to 0 on any edge lying on the path from the root to s). Since the CPPP problem is equivalent to a case of GCC whose complexity is still open [19, 22], our results also apply to GCC.
Finally, we explore some algorithmic solutions for the CPPP problem. In particular, we give a polynomial time solution of the CPPP problem over matrices whose conflict graph has no edges. This result partially answers the open problem stated in [21] of determining the computational complexity of the PPP problem. We have also run a preliminary experimental analysis showing that our method can successfully handle binary character data incorporating back mutations. The results show that the algorithm performs efficiently on simulated matrices as well as on real data taken from the HapMap project.
The persistent perfect phylogeny
Our approach follows [21], to which we refer the reader for a detailed discussion of PPP, while we give here only a cursory treatment. The input of the PPP problem is an n × m binary matrix M whose columns are associated with the set C = {c_{1}, . . . , c_{ m }} of characters and whose rows are associated with the set S = {s_{1}, . . . , s_{ n }} of species. Then M[i, j] = 1 if and only if the species s_{ i } has character c_{ j }, otherwise M[i, j] = 0. The character c is gained in the only edge where its state goes from 0 to 1 or, more formally, in the edge (x, y) such that y is a child of x and c has state 0 in x and state 1 in y. In this case the edge (x, y) is labeled by c^{+}. Conversely, c is lost in the edge (x, y) if y is a child of x and c has state 1 in x and state 0 in y. In the latter case the edge (x, y) is labeled by c^{−}. For each character c, we allow at most one edge labeled by c^{−} [21, 23].
Definition 1 (Persistent Perfect Phylogeny) Let M be an n × m binary matrix. Then a persistent perfect phylogeny, in short ppp, for M is a rooted tree T such that:
1 each node x of T is labeled by a vector l_{ x } of length m;
2 the root of T is labeled by a vector of all zeroes, while for each node x of T the value l_{ x }[j] ∈ {0, 1} represents the state of character c_{ j } in tree T;
3 each edge e = (v, w) is labeled by at least a character;
4 for each character c_{ j } there are at most two edges e = (x, y) and e' = (u, v) such that l_{ x }[j] ≠ l_{ y }[j] and l_{ u }[j] ≠ l_{ v }[j] (representing a change in the state of c_{ j }). In that case e, e' occur along the same path from the root of T to a leaf of T; if e is closer to the root than e', then l_{ x }[j] = l_{ v }[j] = 0, l_{ y }[j] = l_{ u }[j] = 1, and the edge e is labeled c_{ j }^{+}, while e' is labeled c_{ j }^{−};
5 each row r of M labels exactly one node x of T. Moreover the vector l_{ x } is equal to the row r.
Let s be a species and let c be a character such that, in a persistent perfect phylogeny T, the path from the root of T to s traverses an edge labeled c^{−}. Then c is called persistent for s in T.
The Persistent Perfect Phylogeny problem asks to find, if it exists, a persistent perfect phylogeny for a given binary matrix M. We can restate the PPP problem as a variant of the Incomplete Directed Perfect Phylogeny [5] by transforming the complete input matrix into an incomplete matrix, called extended matrix.
Definition 2 (Extended Matrix) Let M be an instance of the PPP problem. The extended matrix associated with M is an n × 2m matrix M_{ e } over alphabet {0, 1, ?} which is obtained by replacing each column c of M by a pair of columns (c^{+}, c^{−}), where ? means that the value of such cell is not given. Moreover for each row s of M if M[s, c] = 1, then M_{ e }[s, c^{+}] = 1 and M_{ e }[s, c^{−}] = 0, while if M[s, c] = 0, then M_{ e }[s, c^{+}] =? and M_{ e }[s, c^{−}] =?.
In this case the characters (c^{+}, c^{−}) are called conjugate. Informally, the assignment of the conjugate pair (?, ?) in a species row s for two conjugate characters (c^{+}, c^{−}) means that character c could be persistent in species s, i.e., it is first gained and then lost. On the contrary, the pair (1, 0) means that character c is only gained by the species s. A completion of a pair (?, ?) associated to a species s and characters (c^{+}, c^{−}) of M_{ e } consists of forcing M_{ e }[s, c^{+}] = M_{ e }[s, c^{−}] = 0 or M_{ e }[s, c^{+}] = M_{ e }[s, c^{−}] = 1, while a partial completion of M_{ e } is a completion of some of its conjugate pairs. Notice that M admits a persistent perfect phylogeny if and only if there exists a completion of M_{ e } admitting a directed perfect phylogeny [21].
A fundamental contribution of [21], building upon [5], is to frame the problem as a graph theory question. We briefly recall here the two graphs that are used in the description of the algorithm.
Let M be a binary matrix and let c_{1}, c_{2} be two characters of M. Then the set of configurations induced by the pair (c_{1}, c_{2}) in M is the set of ordered pairs (M[s, c_{1}], M[s, c_{2}]) over all species s ∈ S. Two characters c_{1} and c_{2} of M are conflicting if and only if the set of configurations induced by such pair of columns contains all possible pairs (0, 1), (1, 1), (1, 0) and (0, 0). The conflict graph G_{ c } = (C, E_{ c } ⊆ C × C) of a matrix M has vertices C and as edges the pairs (c_{ i }, c_{ j }) of conflicting characters (see Figure 1). We also need some graph-theoretic definitions. A graph without edges is called edgeless. A connected component is called nontrivial if it has more than one vertex.
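The conflict test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is a list of rows of 0/1 integers, characters are identified by column index, and the function name conflict_graph is hypothetical (it does not come from the paper or its implementation).

```python
from itertools import combinations

def conflict_graph(M):
    """Return the conflicting column pairs (i, j), i < j, of a binary matrix M.

    M is assumed to be a list of equal-length rows of 0/1 values.
    Two columns conflict when all four configurations appear.
    """
    if not M:
        return set()
    all_pairs = {(0, 0), (0, 1), (1, 0), (1, 1)}
    edges = set()
    for i, j in combinations(range(len(M[0])), 2):
        # Configurations induced by the pair of columns (i, j).
        configs = {(row[i], row[j]) for row in M}
        if configs == all_pairs:
            edges.add((i, j))
    return edges

# The first matrix shows all four configurations; the second lacks (0, 1).
conflicting = conflict_graph([[0, 1], [1, 1], [1, 0], [0, 0]])
edgeless = conflict_graph([[1, 1], [1, 0], [0, 0]])
```

A matrix has an edgeless conflict graph exactly when the returned set is empty.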
The second graph used in the algorithm provides a representation of a completion of characters of an extended matrix. The red-black graph G_{ RB } = (V, E) associated to an extended matrix M_{ e } is the edge-colored graph where (i) the vertices are the species and the conjugate pairs of M_{ e } (that is, for each two conjugate characters c^{+} and c^{−}, only c is a vertex of G_{ RB }), (ii) a pair (s, c) is a black edge iff the conjugate pairs c^{+} and c^{−} are still incomplete in matrix M_{ e } and M_{ e }[s, c^{+}] = 1 and M_{ e }[s, c^{−}] = 0, (iii) (s, c) is a red edge iff the conjugate pairs c^{+} and c^{−} are completed as M_{ e }[s, c^{+}] = M_{ e }[s, c^{−}] = 1.
An algorithm to compute a persistent perfect phylogeny
Let T be any persistent perfect phylogeny for a matrix M and consider a depth-first visit of T: the sequence of edge labels traversed during the visit is uniquely defined. The converse also holds, that is, given a sequence C of edge labels, we can reconstruct the unique persistent perfect phylogeny T (if it exists) such that C is the sequence of edge labels traversed during a depth-first visit of T [21].
The main idea is that we associate a partial phylogeny P to each prefix of C, where each leaf x of P is labeled with the submatrix M_{ x } of M_{ e } such that M_{ x } has exactly the species and the characters that will be in the subtree of T rooted at x. Recall that each matrix M_{ x } has a graph representation given by the red-black graph. Determining the next edge label to be added to the prefix of C then amounts to realizing a character in the red-black graph representing M_{ x }, as follows.
Let (c^{+}, c^{−}) be two conjugate characters of M_{ e } and let G_{ RB } be its associated red-black graph. Let \mathcal{C}(c) be the connected component of G_{ RB } containing the vertex c. A character is in one of three possible states: inactive (the initial state of all characters), active, and free. The realization of a character c in G_{ RB } consists of the following steps:
1 if c is inactive then:
(a) for each species s ∉ \mathcal{C}(c), set M_{ e }[s, c^{+}] = M_{ e }[s, c^{−}] = 0;
(b) for each species s ∈ \mathcal{C}(c), if (c, s) is not an edge of G_{ RB }, add a red edge (c, s) and complete M_{ e } by setting M_{ e }[s, c^{+}] = M_{ e }[s, c^{−}] = 1;
(c) remove from G_{ RB } all black edges (c, s) and label c active.
2 else if c is active and c is connected by red edges to all species in \mathcal{C}(c), then:
(a) all such red edges are deleted from G_{ RB } and c is labeled free.
Notice that when (i) c is free, or (ii) c is active but there exists a species s ∈ \mathcal{C}(c) that is not connected to c by a red edge, none of the stated conditions holds. In these cases the realization is impossible.
Figures 2 and 3 illustrate the realization of characters. Moreover, isolated vertices of G_{ RB } correspond to leaves of the partial phylogeny P whose associated matrix has only one species; that instance is trivially solvable, therefore isolated vertices can be removed from G_{ RB }.
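The realization steps described above can be illustrated with a small Python sketch. It is not the authors' implementation: the class name RedBlackGraph, the edge representation as (species, character) index pairs, and the boolean return value are all assumptions made for illustration. Step 1(a), which only completes matrix entries for species outside the component, is omitted because it does not change the graph.

```python
from collections import deque

class RedBlackGraph:
    """Hypothetical minimal red-black graph for an unconstrained binary matrix M."""

    def __init__(self, M):
        # Initially every 1-entry of M is a black edge and no edge is red.
        self.black = {(s, c) for s, row in enumerate(M)
                      for c, v in enumerate(row) if v == 1}
        self.red = set()
        self.state = {c: "inactive" for c in range(len(M[0]))}

    def component_species(self, c):
        """Species in the connected component containing character c."""
        edges = self.black | self.red
        seen_chars, seen_species = {c}, set()
        queue = deque([("c", c)])
        while queue:
            kind, v = queue.popleft()
            for s, d in edges:
                if kind == "c" and d == v and s not in seen_species:
                    seen_species.add(s)
                    queue.append(("s", s))
                elif kind == "s" and s == v and d not in seen_chars:
                    seen_chars.add(d)
                    queue.append(("c", d))
        return seen_species

    def realize(self, c):
        """Try to realize character c; return False when it is impossible."""
        comp = self.component_species(c)
        if self.state[c] == "inactive":
            # Step 1(b): red edges to the species of the component not adjacent to c.
            self.red |= {(s, c) for s in comp if (s, c) not in self.black}
            # Step 1(c): drop the black edges of c and mark it active.
            self.black = {(s, d) for (s, d) in self.black if d != c}
            self.state[c] = "active"
            return True
        if self.state[c] == "active" and all((s, c) in self.red for s in comp):
            # Step 2(a): delete the red edges of c and mark it free.
            self.red = {(s, d) for (s, d) in self.red if d != c}
            self.state[c] = "free"
            return True
        return False

# Species 0 has character 0, species 1 has both, species 2 has character 1.
g = RedBlackGraph([[1, 0], [1, 1], [0, 1]])
steps = [g.realize(0), g.realize(0), g.realize(1), g.realize(0), g.realize(1)]
```

In this run the second call fails, because character 0 is active but species 1 lies in its component without a red edge to it; after character 1 is realized, character 0 can be freed and the graph becomes edgeless.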
We recall that, to obtain an algorithm for PPP, it suffices to have an algorithm that finds the edge label to be added to the prefix of C computed up to that point. The sequence \mathcal{C} obtained by a depth-first visit of the tree is a sequence of edge labels whose realization results in an edgeless red-black graph [21]. Such a sequence \mathcal{C} is called a successful c-reduction of the red-black graph.
The rest of the paper is devoted to giving a formal definition of the CPPP problem and to providing an efficient algorithm to solve that problem. Moreover, we will test our algorithm on some instances that do not admit a perfect phylogeny, showing that we are able to quickly compute a persistent perfect phylogeny, hence giving a possible phylogenetic interpretation of those data.
Results and discussion
We can now formally define the Constrained Persistent Perfect Phylogeny (CPPP) problem, where the fact that a pair (c, s) (i.e., a character c and a species s) is constrained means that s and all its ancestors do not have the character c. The input of the problem is a binary matrix M and a set F = {(c_{i_1}, s_{i_1}), . . . , (c_{i_l}, s_{i_l})} of constraints such that M[s_{i_j}, c_{i_j}] = 0 for each j. A solution for such an instance is a persistent perfect phylogeny T for M such that, for each constraint (c_{i_j}, s_{i_j}), none of the edges from the root of T to the leaf labeled by s_{i_j} is labeled c_{i_j}^{+}. This implies that no edge from the root of T to the leaf labeled by s_{i_j} can be labeled c_{i_j}^{−}.
The idea of the extended matrix M_{ e } applies also to the CPPP problem. In this case, if M[s, c] = 1, then M_{ e }[s, c^{+}] = 1 and M_{ e }[s, c^{−}] = 0. If M[s, c] = 0 and (c, s) is a constraint, then M_{ e }[s, c^{+}] = M_{ e }[s, c^{−}] = 0. Finally, if M[s, c] = 0 but (c, s) is not a constraint, then M_{ e }[s, c^{+}] = ? and M_{ e }[s, c^{−}] = ?. An immediate extension of the result in [21] shows that M_{ e } has a directed perfect phylogeny if and only if (M, F) has a constrained persistent perfect phylogeny.
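The three cases above translate directly into code. The following Python sketch is illustrative only: the function name extended_matrix, the use of the string "?" for unknown entries, and the encoding of constraints as (character index, species index) pairs are assumptions, not part of the paper. With an empty constraint set it yields the extended matrix of the plain PPP problem.

```python
def extended_matrix(M, F=frozenset()):
    """Build the extended matrix of a constrained instance (M, F).

    M is a list of 0/1 rows (one per species); F is a set of
    (character, species) index pairs. Each column c of M becomes the
    pair of adjacent columns (c+, c-).
    """
    Me = []
    for s, row in enumerate(M):
        ext = []
        for c, value in enumerate(row):
            if value == 1:
                if (c, s) in F:
                    # Constraints are only defined on 0-entries of M.
                    raise ValueError(f"constraint on a 1-entry: {(c, s)}")
                ext += [1, 0]        # character gained and never lost
            elif (c, s) in F:
                ext += [0, 0]        # constrained: neither gained nor lost
            else:
                ext += ["?", "?"]    # possibly persistent, left open
        Me.append(ext)
    return Me

# Species 0 has character 0 and is constrained on character 1.
Me = extended_matrix([[1, 0], [0, 1]], {(1, 0)})
```

Here the first row becomes (1, 0, 0, 0) and the second row becomes (?, ?, 1, 0).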
Just as for the PPP problem, we first explore a graph formulation of the CPPP problem based on the equivalence of PPP to a problem of completing a matrix where each character c has two columns c^{+}, c^{−}; a value of c^{+} (respectively c^{−}) equal to 1 for a species s in the matrix corresponds to the fact that s has gained (respectively lost) the character c. The graph formulation derives again from representing a completion in terms of the red-black graph associated to extended matrices. Notice that there exists a 1-to-1 correspondence between completing entries of the matrix and realizing characters of the red-black graph. When considering the CPPP problem, some entries of a partially completed matrix are constrained, which means that some characters in the associated red-black graph cannot be realized. On the other hand, all characters in a red-black graph for the PPP problem can be realized. Thus it is quite easy to show that the main red-black graph reduction characterization stated for the PPP problem can be extended to the constrained persistent perfect phylogeny problem, by simply adding the constraint that some characters cannot be realized in a red-black graph.
Now, the red-black graph reduction turns out to be quite useful to investigate new algorithmic solutions to the PPP problem. In this paper we are able to prove that there exists a class of binary matrices that always admit a positive solution for the PPP problem, that is, they admit a persistent perfect phylogeny that can be computed in polynomial time. For this special case we also provide a polynomial algorithm that works for the general CPPP problem. Based on this polynomial time algorithm we give a fixed-parameter (in the number of characters) algorithm for the CPPP, based on the search tree technique [24], improving the exponential time algorithm given in [21].
We observe that the CPPP problem is a special case of the General Character Compatibility problem (GCC) [19]. An instance of the GCC problem is a matrix M_{ G } having rows which are species and columns that are characters. Each entry of the matrix M_{ G } is a subset of the states that character c may assume in species s. Another part of the instance is a specification of all allowed transitions between states in a solution. A feasible solution is a perfect phylogeny where for each species s and for each character c, the state is picked from the input set M_{ G }[s, c]. Given an instance (M, F) of CPPP, we obtain a matrix M_{ G } as follows. If M[s, c] = 1, then M_{ G }[s, c] = {1}. If M[s, c] = 0 and (c, s) ∈ F, then M_{ G }[s, c] = {0}. Finally, if M[s, c] = 0 and (c, s) ∉ F, then M_{ G }[s, c] = {0, 2}. The only allowed transitions are from the state 0 to 1 and from 1 to 2. This case of GCC corresponds to cases 5 and 6 of Table 1 in [19], whose complexity is reported as open. Thus the results we give in the paper also apply to those cases.
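The mapping from a CPPP instance to a GCC instance described above can be written as a short Python sketch. The names to_gcc and ALLOWED_TRANSITIONS are hypothetical, and constraints are assumed to be (character index, species index) pairs over a list-of-rows 0/1 matrix.

```python
# Allowed state changes along an edge, as stated in the text: 0 -> 1 and 1 -> 2.
ALLOWED_TRANSITIONS = {(0, 1), (1, 2)}

def to_gcc(M, F):
    """Return the GCC matrix whose entries are sets of admissible states."""
    MG = []
    for s, row in enumerate(M):
        new_row = []
        for c, value in enumerate(row):
            if value == 1:
                new_row.append({1})       # character present
            elif (c, s) in F:
                new_row.append({0})       # constrained: never gained
            else:
                new_row.append({0, 2})    # never gained, or gained and then lost
        MG.append(new_row)
    return MG

MG = to_gcc([[1, 0], [0, 1]], {(1, 0)})
```

In this encoding state 2 can only be reached from state 1, so it plays the role of a character that has been gained and then lost.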
We recall that a main result of [21] is that finding a solution of PPP is equivalent to finding a successful c-reduction, that is, a sequence of edge labels (corresponding to a depth-first visit of the tree) whose realization makes the red-black graph edgeless. For the CPPP problem a similar result holds, but we have to adapt the notion of reduction, so that there is a third case when the reduction is impossible: when for some species s, with (c, s) ∈ F (that is, M_{ e }[s, c^{+}] = M_{ e }[s, c^{−}] = 0), (c, s) is also a red edge of G_{ RB }. Notice that, in order to obtain an algorithm to compute a persistent perfect phylogeny, it suffices to have an algorithm that finds the edge label to be added to the prefix of C computed up to that point.
Solving CPPP on matrices with edgeless conflict graphs
In the following, we will exploit some properties of the red-black graph to show that a matrix M whose conflict graph is edgeless always admits a persistent perfect phylogeny. Moreover, we provide a polynomial time algorithm for the CPPP problem in this case.
Given a binary matrix M, the partial order graph for M is the partial order P obtained by ordering the columns of M under the < relation, which is defined as follows: given two characters c and c', we say that c < c' iff M[s, c] ≤ M[s, c'] for each species s. Moreover, we build a graph G = (V, E), called the adjacency graph for M: V is the set of columns of M and (u, v) is an edge of G if and only if u, v are adjacent, i.e., there is a species s that is adjacent to both u and v in the red-black graph for the extended matrix M_{ e } associated with M. Our algorithm for solving the CPPP problem finds a successful c-reduction by simply computing the maximal inactive characters in the poset P that can be realized in the red-black graph.
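As an illustration of the partial order and of the adjacency relation, the following Python sketch computes the maximal columns of a matrix and tests adjacency on the initial red-black graph, where a species is adjacent to a character exactly when the matrix entry is 1. The function names are hypothetical, and treating identical columns as both maximal is an assumption made here (the experimental section of the text assumes matrices without duplicated columns).

```python
def below(M, c, d):
    """True when column c is entrywise at most column d (c < d in the text)."""
    return all(row[c] <= row[d] for row in M)

def maximal_characters(M):
    """Columns that are not strictly below any other column."""
    m = len(M[0])
    result = []
    for c in range(m):
        strictly_below = any(
            d != c and below(M, c, d) and not below(M, d, c)
            for d in range(m)
        )
        if not strictly_below:
            result.append(c)
    return result

def adjacent(M, u, v):
    """True when some species has both characters u and v."""
    return any(row[u] == 1 and row[v] == 1 for row in M)

# Column 0 = (1, 0, 0) is strictly below column 1 = (1, 1, 0).
M = [[1, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]
maxima = maximal_characters(M)
```

For this matrix the maximal characters are columns 1 and 2; they are not adjacent, which is consistent with the fact that the red-black graph of this example is not connected.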
In the following we give some lemmas that are used to show that maximal characters in the poset P can be realized without inducing in the red-black graph any red-sigma graph: this is a graph of red edges consisting of a path of length four and having two characters and three species. Such a graph represents the forbidden matrix with configurations (0, 1), (1, 0) and (1, 1) in the completion of the extended matrix M_{ e }, and thus whenever it is present in the red-black graph the completion does not admit a directed perfect phylogeny [2]. In fact, by definition of the red-black graph associated to a completion, a red-sigma graph corresponds to two completed characters a^{+}, b^{+} in the extended matrix such that M_{ e }[s_{1}, a^{+}] = 1 = M_{ e }[s_{2}, a^{+}] and M_{ e }[s_{2}, b^{+}] = 1 = M_{ e }[s_{3}, b^{+}], while the entries of M_{ e } for the pairs (a^{+}, s_{3}) and (b^{+}, s_{1}) are 0. The following property is easily proved by induction on the length of a path in the red-black graph connecting two maximal characters.
Algorithm 1: Procedure SolveCPPPemptyconflict
Input : A constrained binary matrix (M, F) whose associated conflict graph is edgeless.
Output : A realization S_{ c } of the characters of M resulting in a constrained persistent perfect phylogeny for (M, F), if such a phylogeny exists.
S_{ c } ← empty sequence;
P ← the partial order for M;
G_{ RB } ← the red-black graph for the extended matrix M_{ e } of M;
while G_{ RB } is not edgeless do
    C_{ M } ← maximal elements in P that are in the same connected component of G_{ RB };
    D ← the subset of C_{ M } consisting of the characters that can be realized;
    if D = ∅ then
        return no solution
    else
        add to S_{ c } all characters in D;
        realize the characters of D in any order, updating G_{ RB };
        add to D the free characters in the graph G_{ RB };
Lemma 3 Let M be a binary matrix with an edgeless conflict graph. Assume that the extended matrix associated with M induces a connected red-black graph and let P be the partial order graph for M. Let C_{ M } be the set of maximal elements in P. Then C_{ M } consists of elements that are pairwise adjacent in the adjacency graph for M.
The following properties can be proved as consequences of the definition of realization of characters, assuming that the input matrix has an edgeless conflict graph.
Lemma 4 Let M be a binary matrix that has an edgeless conflict graph. Let G_{ RB } be the red-black graph for the extended matrix associated with M. The realization of two characters a and b that are adjacent in the adjacency graph for M adds at most two disjoint components consisting of red edges. In this case one connected component has the vertex a and the other one b.
Lemma 5 Let G_{ RB } be a connected red-black graph whose conflict graph is edgeless. Let C_{ M } be the set of maximal characters in G_{ RB } and let C'_{ M } be the set of maximal characters in the red-black graph G' obtained after the realization of C_{ M }. Then: (1) the elements of C_{ M } are in at most two distinct connected components of G' and (2) in each such connected component, either each maximal character c ∈ C'_{ M } is adjacent to all species of the component or all active characters of C_{ M } are free.
Notice that the absence of conflicts does not by itself guarantee that a solution of a constrained instance exists. However, we are able to provide an efficient algorithm (Algorithm 1) for this case, which will be a cornerstone for our algorithm for the general case.
Algorithm 1 builds a successful c-reduction S_{ c } by iteratively adding to S_{ c } the maximal inactive characters or free characters of the red-black graph G_{ RB }. Notice that the successful c-reduction provides a completion of the extended matrix that admits a perfect phylogeny. The latter can be built using the classical linear time algorithm [2].
Theorem 6 Let (M, F) be a constrained binary matrix whose conflict graph is edgeless. Then Algorithm 1 computes a successful c-reduction of the red-black graph associated to the extended matrix for M, if it exists. Moreover, if F is empty then M admits a solution.
Proof First observe that the correctness of Algorithm 1 is a consequence of the fact that maximal characters are realized before any character they include by the < relation. Assume that c_{1} < c_{2} and let T be a persistent perfect phylogeny. If c_{2} is not persistent for s in T, then c_{1} is not persistent for s in T either. In fact, assume to the contrary that c_{1} is persistent for s in T and c_{2} is not persistent for s. This implies that there exists a species s' that has c_{1} and such that s' and s share a common ancestor in the tree that is below the edge labeled c_{1}^{+}. Since c_{1} < c_{2}, it follows that species s' also has character c_{2}, and thus the edge labeled c_{1}^{−} is below the edge labeled c_{2}^{+}. But since s does not have character c_{2} and c_{2} cannot be persistent, we obtain a contradiction.
We show that at each iteration of Algorithm 1 each connected component of G_{ RB } either has only black edges or, if it has red edges, contains no red-sigma graph. Initially, by assumption, since no character is active, no red edge is in the connected components of the red-black graph. Then, by applying Lemmas 3 and 4, the realization of the maximal characters C_{ M } of poset P does not induce any red-sigma graph, thus proving the invariant. Now, a successive iteration of the algorithm requires adding to S_{ c } the free characters or the maximal inactive characters of the red-black graph. By applying Lemma 5, the red-black graph has connected components without red edges or at most two components having red edges, since the active characters, by statement (1), are in at most two components. For the first type of components, the invariant property is immediate since the component does not have any red edge. Consider now the second type of components. By Lemma 5, there are at most two such components; moreover, either each connected component has some maximal active characters that are free, or the maximal inactive characters are adjacent to all species of the connected component of the red-black graph. Assume that the active characters in the connected component having red edges are free. Then, by definition, these active characters are removed from the red-black graph together with all incident edges. Otherwise, the maximal active characters are all adjacent to all species and thus they are realized without adding new red edges. In both cases, the invariant property holds. Clearly, if all characters are in S_{ c } after the application of the algorithm, it is immediate that the red-black graph is edgeless since all active characters are free (indeed, no red-sigma graph is possible). Thus S_{ c } is a successful c-reduction.
Observe that in case F is empty, all characters can be realized, and consequently, the sequence S_{ c } after the iterations of the algorithm includes all characters of the red-black graph, thus implying that a solution always exists. □
An algorithm for CPPP
In this section we propose an algorithm for the CPPP problem that is based on the procedure SolveCPPPemptyconflict(M). Our algorithm is based on the search tree technique [24], where we explore the tree of all possible c-reductions. Since in a c-reduction each signed character (c^{+} or c^{−}) can appear at most once, the search tree has at most (2m)! leaves. Therefore we only need to describe a polynomial-time algorithm to compute an edge of the search tree (which mainly consists of realizing a signed character).
Just as the algorithm in [21], we transform the matrix M of the instance (M, F) into an extended matrix M_{ e } which is then analyzed to find a solution. In fact, (M, F) has a solution if and only if there exists a successful c-reduction for M_{ e } that can be associated to a constrained perfect phylogeny. The algorithm in [21] explores all feasible permutations of the set of characters (feasible permutations means that c^{−} must follow c^{+} and that all constraints are satisfied) of M_{ e } in order to find one that is a successful c-reduction, if such a c-reduction exists.
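To give a feeling for the size of this search space, the following Python sketch enumerates by brute force the orderings of the 2m signed characters in which every c^{−} follows its c^{+}. It is only an illustration: the name feasible_orders is hypothetical, constraints from F are ignored, and the enumeration is exponential by design, so it is usable only for very small m.

```python
from itertools import permutations
from math import factorial

def feasible_orders(m):
    """Yield the orderings of the 2m signed characters where c- follows c+."""
    signed = [(c, sign) for c in range(m) for sign in "+-"]
    for perm in permutations(signed):
        position = {label: i for i, label in enumerate(perm)}
        if all(position[(c, "+")] < position[(c, "-")] for c in range(m)):
            yield perm

# With m = 2 there are (2m)! = 24 orderings of the four signed characters,
# of which only those with both gains before the matching losses remain.
count_feasible = sum(1 for _ in feasible_orders(2))
count_all = factorial(4)
```

For m = 2 this leaves 6 of the 24 orderings, since each precedence requirement halves the number of admissible permutations. The count still grows factorially with m, which motivates the pruning criteria discussed next.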
Clearly computing all permutations is not efficient, therefore we implicitly build a decision tree, where at each step we fix a character in a given position of the permutation. To each node x of the decision tree, we associate the matrix M_{ e }(x), obtained from M_{ e } by realizing the characters labeling the edges from the root to x, and its associated red-black and conflict graphs (respectively G_{ RB }(x), G_{ c }(x)). When G_{ c }(x) is edgeless, instead of further exploring the decision tree, we apply Algorithm 1. At the same time, if G_{ RB }(x) contains a red-sigma graph, then M_{ e }(x) cannot admit a persistent perfect phylogeny, hence we can stop exploring that portion of the decision tree. Moreover, we can stop the search as soon as we find a solution, since we have no optimization criterion to discriminate between feasible solutions. In practice, all those criteria allow us to avoid exploring a large part of the decision tree, as shown in our experimental analysis.
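The pruning test on the red edges can be sketched as follows. This Python fragment assumes that the red edges are given as (species, character) pairs and that any species-character pair without a red edge counts as a 0 entry, as in the description of the forbidden configuration given earlier; the function name has_red_sigma is hypothetical.

```python
from itertools import combinations

def has_red_sigma(red_edges):
    """Detect a red path s1 - a - s2 - b - s3 on two characters and three species.

    Such a path exists for characters a and b exactly when their red
    neighbourhoods share a species and each has a species of its own.
    """
    species_of = {}
    for s, c in red_edges:
        species_of.setdefault(c, set()).add(s)
    for a, b in combinations(species_of, 2):
        ra, rb = species_of[a], species_of[b]
        if ra & rb and ra - rb and rb - ra:
            return True
    return False

# s1 - a - s2 - b - s3 is the forbidden pattern; dropping (3, "b") removes it.
found = has_red_sigma({(1, "a"), (2, "a"), (2, "b"), (3, "b")})
absent = has_red_sigma({(1, "a"), (2, "a"), (2, "b")})
```

A node of the decision tree whose red edges pass this test positively can be discarded without exploring its subtree.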
Experimental analysis
We have implemented our algorithm as a C++ program and we have tested it over simulated data produced by ms [25]. Moreover, we have tested our program on real data coming from the International HapMap project [26]. All tests have been performed on a standard workstation.
The two different kinds of data correspond to two separate goals. The analysis on simulated data is aimed at studying the scalability of our approach for increasing numbers of species and characters. More precisely, we have run our program for n = 10, 20, 40, 60 (recall that n is the number of species) and for values of m (the number of characters) ranging from n/2 to 3n/2. The choice of m is based on a property of persistent phylogenies. Let T be a persistent perfect phylogeny consistent with an n × m matrix, and assume that the input matrix has no duplicated rows or columns. Then we can prove that n/2 ≤ m ≤ 2n.
Moreover, ms produces matrices that have a perfect phylogeny, but can have duplicated rows and columns. To introduce back mutations, we have randomly modified at most one state of each duplicated row. For each choice of the parameters n and m we have produced 100 random instances, on which we have run our program with a 15-minute timeout, without imposing any constraint. The results are reported in Table 1.
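One possible reading of the perturbation step is sketched below in Python. The details are assumptions made for illustration only: a row counts as duplicated when it repeats an earlier row, exactly one entry of such a row is flipped (which is compatible with "at most one"), and the function name and the use of a seeded random generator are choices of the sketch rather than a description of the scripts actually used.

```python
import random


def perturb_duplicated_rows(matrix, rng):
    """Flip one random entry in every row that repeats an earlier row.

    matrix -- list of rows, each a list of 0/1 states
    rng    -- a random.Random instance, passed in for reproducibility
    Rows that do not repeat an earlier row are copied unchanged.
    """
    seen = set()
    result = []
    for row in matrix:
        new_row = list(row)
        if tuple(row) in seen and new_row:
            j = rng.randrange(len(new_row))
            new_row[j] = 1 - new_row[j]  # change a single state of the row
        seen.add(tuple(row))
        result.append(new_row)
    return result


if __name__ == "__main__":
    m = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]]
    for row in perturb_duplicated_rows(m, random.Random(0)):
        print(row)
```

Since the input matrix has a perfect phylogeny, each flipped entry is a candidate violation of the infinite sites assumption, although a single flip does not by itself guarantee that a conflict is created.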
Then, for the first 10 of the 100 instances of each parameter choice, we have modified the input matrices, by introducing some random constraints, in order to determine if constraining the set of feasible solutions can help in finding a persistent phylogeny. For each instance of the first phase, we have produced 10 instances with 1 or 16 random constraints. For both cases we determine when at least one of the 10 constrained instances is solved more quickly than the unconstrained instance. The goal is to determine when there is a sizable (in our case 10%) probability that introducing some random constraints can help in computing a persistent phylogeny. Moreover, we determine when the median of the 10 constrained instances is solved more quickly than the unconstrained instance. In this case the goal is to determine when there is a 50% probability that some random constraints can help in computing a persistent phylogeny.
The most important result of this experiment is that for instances where our implementation requires at least a second (on average), the idea of introducing random constraints is often beneficial. This fact suggests a direction for further improvements, that is, incorporating into our program some deterministic constraints, based on a cursory analysis of the conflict graph and of the red-black graph. Actually, the way we manage an edgeless conflict graph is an example of this idea. Table 2 summarizes the experiment on constrained simulated instances.
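The two criteria used in the constrained experiment can be stated compactly. The following Python sketch assumes that running times are available as plain numbers (for example, seconds) for the unconstrained instance and for its 10 constrained variants; how timeouts are encoded is not specified here and is left to the caller. The function name is hypothetical.

```python
from statistics import median


def constraint_benefit(unconstrained_time, constrained_times):
    """Return (any_faster, median_faster) for one unconstrained instance.

    any_faster    -- at least one constrained variant is solved more quickly
    median_faster -- the median constrained running time is smaller
    """
    any_faster = min(constrained_times) < unconstrained_time
    median_faster = median(constrained_times) < unconstrained_time
    return any_faster, median_faster


if __name__ == "__main__":
    # One fast variant out of ten: the first criterion holds, the second
    # does not.
    print(constraint_benefit(2.0, [3.0] * 9 + [1.0]))
```

With 10 variants per instance, the first criterion corresponds to the 10% threshold and the second to the 50% threshold discussed above.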
Finally, the algorithm has been tested on real data coming from the International HapMap project. The data are classified by type of population; we used data from the set ASW (African ancestry in Southwest USA). Each individual is described by two haplotypes (in our application the two haplotypes correspond to two different species, i.e. two different rows of the matrix). This experiment investigates the usefulness of the constrained persistent model for handling haplotype data that cannot be explained by the perfect phylogeny model. In fact, none of those instances admits a perfect phylogeny, but our model and implementation are able to find a reasonable interpretation of the data. The data set consists of binary matrices of dimensions 10 × 10, 26 × 15, 26 × 25, and 26 × 30, with 10 matrices for each size. In all cases the matrices do not admit a perfect phylogeny, and the number of conflicts ranges from a minimum of 4 to a maximum of 138. As the size of the matrix, and therefore the number of conflicts, increases, the percentage of matrices that admit a persistent perfect phylogeny decreases. In more detail, 80% of the tested 10 × 10 matrices admit a solution, only 20% of the tested 26 × 15 matrices admit a solution, and none of the 26 × 25 or 26 × 30 matrices admits a solution. The results show that haplotype data may be related by a persistent phylogeny in cases where they cannot be explained by the perfect model. It would be interesting to investigate the biological soundness of the persistent perfect phylogeny in this context.
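For reference, the conflicts of a binary matrix can be counted with a few lines of Python. The sketch assumes the usual definition for rooted binary characters with an all-zero ancestral state, namely that two characters are in conflict when the rows restricted to their columns exhibit all three configurations (0,1), (1,0) and (1,1), and it assumes that the number of conflicts reported above is the number of such pairs, i.e. of edges of the conflict graph. The function name is hypothetical.

```python
from itertools import combinations


def conflict_edges(matrix):
    """Return the pairs of columns that are in conflict.

    Two columns a < b are in conflict when some rows show (0, 1), some show
    (1, 0) and some show (1, 1) in positions (a, b). The returned list is
    the edge set of the conflict graph under this assumed definition.
    """
    m = len(matrix[0]) if matrix else 0
    edges = []
    for a, b in combinations(range(m), 2):
        configurations = {(row[a], row[b]) for row in matrix}
        if {(0, 1), (1, 0), (1, 1)} <= configurations:
            edges.append((a, b))
    return edges


if __name__ == "__main__":
    print(conflict_edges([[1, 0], [0, 1], [1, 1]]))  # one conflicting pair
    print(conflict_edges([[1, 0], [1, 1], [0, 0]]))  # no conflict
```

Under this definition a matrix with an empty edge list admits a perfect phylogeny rooted in the all-zero state, so a non-empty list is a quick certificate that the perfect model does not apply.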
Conclusions
The algorithms and models discussed in the paper may have interesting applications in the construction of evolutionary trees based on the analysis of binary genetic markers, where variants of the perfect phylogeny have already been considered, such as in the study of evolution based on introns [1], in the study of progression pathways using tumor markers, in discovering significant associations between phenotypes and single-nucleotide polymorphism markers [27], and in haplotype analysis. In this paper we have investigated the CPPP problem, which is the general problem of computing a persistent perfect phylogeny for binary matrices where some characters may be forced not to be persistent in the tree. We provide algorithmic solutions for the problem: mainly a polynomial-time algorithm when the conflict graph is edgeless and a fixed-parameter algorithm. In particular we show that when no constraint is given and the conflict graph is edgeless, a solution for PPP always exists. We experimentally show that the search tree technique, combined with the use of constraints, makes it possible to efficiently obtain solutions for matrices that would otherwise require exponential time. Future research will be devoted to the experimental investigation of possible improvements based on introducing a carefully crafted set of constraints to speed up the computation. The computational complexity of the CPPP problem is open, and it would be interesting to settle it for the unconstrained case.
References
Przytycka T, Davis G, Song N, Durand D: Graph theoretical insights into Dollo parsimony and evolution of multidomain proteins. Journal of Computational Biology. 2006, 13 (2): 351-363. 10.1089/cmb.2006.13.351.
Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. 1997, Cambridge University Press, Cambridge
Felsenstein J: Inferring Phylogenies. 2004, Sinauer Associates, Sunderland, MA (USA)
Gusfield D: Efficient algorithms for inferring evolutionary trees. Networks. 1991, 19-28.
Peer I, Pupko T, Shamir R, Sharan R: Incomplete directed perfect phylogeny. SIAM Journal on Computing. 2004, 33 (3): 590-607. 10.1137/S0097539702406510.
Bonizzoni P, Della Vedova G, Dondi R, Li J: The haplotyping problem: a view of computational models and solutions. International Journal of Computer and Science Technology. 2003, 18: 675-688. 10.1007/BF02945456.
Gusfield D: Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. Proc 6th Annual Conference on Research in Computational Molecular Biology (RECOMB 2002). 2002, 166-175.
Bonizzoni P: A linear time algorithm for the Perfect Phylogeny Haplotype problem. Algorithmica. 2007, 48 (3): 267-285. 10.1007/s00453-007-0094-3.
Ding Z, Filkov V, Gusfield D: A linear time algorithm for Perfect Phylogeny Haplotyping (pph) problem. Journal of Computational Biology. 2006, 13 (2): 522-553. 10.1089/cmb.2006.13.522.
Bodlaender HL, Fellows MR, Warnow T: Two strikes against perfect phylogeny. Automata, Languages and Programming. 1995, 937: 17-26.
Fernández-Baca D, Lagergren J: A polynomial-time algorithm for near-perfect phylogeny. SIAM J Comput. 2003, 32 (5): 1115-1127. 10.1137/S0097539799350839.
Kannan S, Warnow T: A fast algorithm for the computation and enumeration of perfect phylogenies. SIAM Journal on Computing. 1997, 26 (6): 1749-1763. 10.1137/S0097539794279067.
Dress A, Steel M: Convex tree realizations of partitions. Applied Mathematics Letters. 1992, 5 (3): 3-6. 10.1016/0893-9659(92)90026-6.
Kannan SK, Warnow TJ: Inferring evolutionary history from DNA sequences. SIAM Journal on Computing. 1994, 23 (4): 713-737.
Gysel R, Lam F, Gusfield D: Constructing perfect phylogenies and proper triangulations for three-state characters. Algorithms for Molecular Biology. 2012, 7 (1):
Subramanian A, Shackney S, Schwartz R: Inference of tumor phylogenies from genomic assays on heterogeneous samples. BioMed Research International. 2012, 2012:
Przytycka T, Davis G, Song N, Durand D: Graph theoretical insights into evolution of multidomain proteins. Journal of Computational Biology. 2006, 13 (2): 351-363. 10.1089/cmb.2006.13.351.
Satya RV, Mukherjee A, Alexe G, Parida L, Bhanot G: Constructing near-perfect phylogenies with multiple homoplasy events. ISMB (Supplement of Bioinformatics). 2006, 514-522.
Maňuch J, Patterson M, Gupta A: Towards a characterisation of the generalised cladistic character compatibility problem for non-branching character trees. ISBRA. 2011, 440-451.
Maňuch J, Patterson M, Gupta A: On the generalised character compatibility problem for non-branching character trees. Computing and Combinatorics. 2009, 268-276.
Bonizzoni P, Braghin C, Dondi R, Trucco G: The binary perfect phylogeny with persistent characters. Theoretical Computer Science. 2012, 454: 51-63.
Benham C, Kannan S, Paterson M, Warnow T: Hen's teeth and whale's feet: generalized characters and their compatibility. Journal of Computational Biology. 1995, 2 (4): 515-525. 10.1089/cmb.1995.2.515.
Zheng J, Rogozin IB, Koonin EV, Przytycka TM: Support for the Coelomata Clade of Animals from a Rigorous Analysis of the Pattern of Intron Conservation. Mol Biol Evol. 2007, 24 (11): 2583-2592. 10.1093/molbev/msm207.
Downey RG, Fellows MR: Parameterized Complexity. 1999, Springer, Berlin (Germany)
Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002, 18 (2): 337-338. 10.1093/bioinformatics/18.2.337.
The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (7164): 851-861. 10.1038/nature06258.
Pan F, McMillan L, de Villena FPM, Threadgill D, Wang W: TreeQA: Quantitative genome wide association mapping using local perfect phylogeny trees. Pacific Symposium on Biocomputing. 2009, 415-426.
Acknowledgements
The authors acknowledge the support of the MIUR PRIN 2010-2011 grant 2010LYA9RH (Automi e Linguaggi Formali: Aspetti Matematici e Applicativi), of the Cariplo Foundation grant 2013-0955 (Modulation of anticancer immune response by regulatory noncoding RNAs), and of the FA 2013 grant (Metodi algoritmici e modelli: aspetti teorici e applicazioni in bioinformatica).
Declarations
Publication charges for this work were funded by MIUR PRIN 2010-2011 grant 2010LYA9RH.
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors have contributed equally to the paper.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Bonizzoni, P., Carrieri, A.P., Della Vedova, G. et al. Explaining evolution via constrained persistent perfect phylogeny. BMC Genomics 15 (Suppl 6), S10 (2014). https://doi.org/10.1186/1471-2164-15-S6-S10
DOI: https://doi.org/10.1186/1471-2164-15-S6-S10