Proceedings of the 17th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: genomics
 Research
 Open Access
 Published:
Forcing external constraints on tree inference using ASTRAL
BMC Genomics volume 21, Article number: 218 (2020)
Abstract
Background
To account for genomewide discordance among gene trees, several widelyused methods seek to find a species tree with the minimum distance to input gene trees. To efficiently explore the large space of species trees, some of these methods, including ASTRAL, use dynamic programming (DP). The DP paradigm can restrict the search space, and thus, ASTRAL and similar methods use heuristic methods to define a restricted search space. However, arbitrary constraints provided by the user on the output tree cannot be trivially incorporated into such restrictions. The ability to infer trees that honor userdefined constraints is needed for many phylogenetic analyses, but no solution currently exists for constraining the output of ASTRAL.
Results
We introduce methods that enable the ASTRAL dynamic programming to infer constrained trees in an effective and scalable manner. To do so, we adopt a recently developed tree completion algorithm and extend it to allow multifurcating input and output trees. In simulation studies, we show that the approach for honoring constraints is both effective and fast. On real data, we show that constrained searches can help interrogate branches not recovered in the optimal ASTRAL tree to reveal support for alternative hypotheses.
Conclusions
The new algorithm is added ASTRAL to all userprovided constraints on the species tree.
Background
Phylogeny inference requires solving an optimization problem over the space of all trees. The superexponential growth of the tree topology space makes examining all trees impossible, even for moderately large datasets. As a result, tree inference algorithms have adopted several heuristics strategies, including iterative search (e.g., hillclimbing), used by most maximum parsimony and maximum likelihood methods.
An increasingly popular alternative is searching the tree space using dynamic programming (DP). For an optimization score of interest, we need a recursion formulating how the optimal tree on a subset of leaves (or similar constructs) can be computed from the optimal trees on smaller subsets. With a recursive equation, DP can be used to compute the optimal solution in the classic fashion, typically implemented using memoization. Since the powerset grows exponentially with the set cardinality, this DP requires exponential running time. However, a restricted version of DP can be designed where each set is divided into only some of its subsets; the restricted DP can have polynomial running time with respect to the number of the leaves.
Phylogenetic inference using this particular DP approach has been known at least as early as 1996 [1] and has been used for many optimization criteria, including duplication and loss [2–4], deep coalescence [5], Robinson Foulds (RF) distance [6], quartet score [7, 8], and others [9]. Among these, ASTRAL [8], which estimates a species tree from a set of gene trees by minimizing the quartet distance, has found increasing popularity [10]. DP is mostly used for problems where the input is a set of trees, and the output is a tree with the minimum total distance to the input trees. The popularity of DP for these problems is because the input set of trees create natural ways for restricting the space explored by DP. For example, the set of bipartitions observed in input trees can be used as the restriction set. In addition, ASTRAL has introduced heuristics to enrich the set of allowable bipartitions [11] while keeping the size of the search space polynomial [12].
The restrictions imposed on DP are not to be confused with the related concept of userimposed constrained inference (we use “restricted” DP instead of “constrained” used in previous publications to avoid confusion). Systematists often would like to infer the best possible tree among trees that are compatible with a constraint tree of their choice, thereby completing and resolving the constraint tree. Constrained tree inference is needed for hypothesisdriven analyses that aim to choose the best among a set of hypotheses available by prior knowledge [13–16]. Constrained searches can help in model selection, testing whether a polytomy [17] or the monophyly of a group [18] can be rejected. Similarly, they can help gauge the “hidden” support for branches not recovered in the main analysis. Moreover, constrained searches have been successfully used to combine the results of multiple methods [19]. More recently, constrained trees were used in taxonomic profiling [20]. Finally, constrained searches enable updating existing trees without recomputing trees from scratch. For these reasons, most phylogenetic inference tools allow userprovided constraints.
To our knowledge, the DP paradigm has not been adopted to perform tree search with userdefined constraints. Performing constrained searches in the DP paradigm may appear easy: one needs to make sure the restricted set of bipartitions explored by DP are consistent with the constraints. As we show, there are roadblocks when the userprovided input is allowed to be arbitrary. The challenge is to find a largeenough search space that satisfies the userprovided constraints. Here, building on two recent advances [21, 22], we propose a solution to this challenge. We implement our solution inside the ASTRAL software for species tree inference, thereby enabling it to perform constrained searches for the first time. In extensive tests, we show that the constrained searches remain as accurate as unconstrained searches while reducing the running time, can improve accuracy in the presence of external knowledge about individual relationships, and can reveal hidden support.
Method
Our goal is to extend ASTRAL so that it can honor a userprovided constraint tree. We start by reviewing ASTRAL and the RF(+) algorithm that we will use.
Notations We are given a set of k potentially multifurcating input trees \(\mathcal {T}\) on subsets of a leafset \(\mathcal {L}\) of size n and a potentially multifurcating constraint tree, \(\bar {T}\) on a subset of \(\mathcal {L}\). Let l(t) or l(u) be the set of leaves of a tree t or leaves below a node u. We use s(u) to denote sister(s) of u; i.e., the set of all nodes sharing a parent with u. Let \(\mathcal {L}'=\mathcal {L}\setminus \{o\}\) where \(o\in \mathcal {L}\) is a fixed arbitrarily chosen species. Denote \(A\subset \mathcal {L}'\) as a cluster. Each edge in an unrooted tree corresponds to a bipartition of leaves, which corresponds to a cluster (the side missing o). A cluster A (i.e. the bipartition \(A\mathcal {L}\setminus A\)) is called compatible with a tree T iff a tree exists that includes the bipartition and induces a resolution of T when restricted to same leaves as T. Two clusters are compatible iff they can be in the same tree [23].
Background: tree completion
Completing a tree based on a reference tree is a wellstudied problem and is often formulated as minimizing the distance to the reference tree while maintaining compatibility with the original tree [4, 21, 22, 24]. A natural objective is to find a complete tree with the minimum RF distance (i.e., the total number of bipartitions that differ between the two trees) to the reference tree [25]. OCTAL was the first quadratic time solution to this RF completion problem [21]. Bansal later introduced a linear time solution [22], which we call BRF(+) algorithm. Both methods take as input an incomplete backbone tree, T_{b}, and a complete and binary reference tree, T_{r}, and output a binary and complete tree compatible with T_{b} such that the RF distance to T_{r} is minimized among all allowable trees.
The BRF(+) algorithm [22] achieves linear time using constant time leastcommonancestor (LCA) lookups made possible after a linear time preprocessing using the Schieber–Vishkin technique [26]. Both trees are rooted on an arbitrary shared leaf. For every fullymissing node u of T_{r} (i.e., ∀u:l(u)∩l(T_{b})=∅), the subtree below u is added intact as the sister to the LCA of l(s(u)) inside T_{b} (Fig. 1). This placement of the subtree below u preserves all its bipartitions, the bipartition above it, and potentially the bipartition above the parent of u (we will come back to this point). The order of additions to T_{b} is determined by a preorder traversal of T_{r}, adding each fullymissing node u when we visit the parent node of u. Note that the topology of the backbone tree will not change by the addition of new subtrees. Bansal proved that this algorithm minimizes the RF distances between T_{r} and any possible binary tree that is compatible with T_{b}.
Background: DP algorithm implemented in ASTRAL
ASTRAL estimates an unrooted (species) tree given a set of unrooted (gene) trees \(\mathcal {T}\) and is statistically consistent under the multispecies coalescent model [27] of incomplete lineage sorting (ILS) given a sample of true gene trees. ASTRAL seeks the tree T with the maximum quartet score to \(\mathcal {T}\) defined as \(\sum _{t\in \mathcal {T}} Q(T)\cap Q(t)\), where Q(.) is the set of quartet topologies of a tree. Let S(A) be the score for an optimal subtree on the cluster A. Defining S({x})=0 for \(x\in \mathcal {L}\), the recursion is:
where X is a set of clusters and \(w_{\mathcal {T}}\) is a function assigning weights to tripartitions of \(\mathcal {L}\) such that the sum of all weights for any tree gives its quartet score. If X is set to \(2^{\mathcal {L}'}\), the recursion tests all ways of dividing A into two smaller clusters and under this condition, \(S(\mathcal {L}')\) (Eq. 1) gives the optimal quartet score [8] for \(\mathcal {T}\) in time growing exponentially with n (expected, as the problem is NPHard [28]).
Forming set X: heuristics and restrictions
To handle large datasets, we need X to have a manageable size, preferably growing polynomially with n and k. At the same time, we ideally want X to have all clusters of the optimal tree. An obvious way of building X is to set it to all clusters in all trees in \(\mathcal {T}\), hoping that all clusters in the optimal tree appear in at least one input. However, two difficulties emerge. Firstly, simulations under very high levels of gene tree discordance have shown this heuristic to be insufficient as biparti tions in the optimal tree can frequently be absent from gene trees [11]. To deal with this issue, starting from ASTRALII, set X is enhanced using a set of heuristic methods, and since ASTRALIII, the size of X is restricted to grow linearly with n and k [12, 29]. These heuristics (among other techniques) build consensus trees from input trees and add resolutions of polytomies of consensus trees to X.
The second difficulty is having full resolutions. Equation 1 is welldefined only if for every nonsingleton A∈X, there is A^{′}⊂A such that A^{′}∈X and A∖A^{′}∈X. More generally, a cluster A in X is useful only if there exists a fully binary tree on \(\mathcal {L}'\) that includes A and all of its clusters are in X. Including any other cluster in X is a waste of computation. Thus, set X (which needs to be nonempty) needs to satisfy this property (recall \(o\notin A\subset \mathcal {L}'\) and 2n−3 is the number of clusters in a fully resolved tree): P1: ∀A_{1}∈X,∃{A_{1},A_{2},…,A_{2n−3}}⊂X s.t.∀(i,j):A_{i} is compatible with A_{j}.
Building X using bipartitions of input trees \(\mathcal {T}\) can fail to satisfy this property unless all trees are complete and binary. Thus, starting from ASTRALII, three steps are taken. i) Before adding bipartitions from \(\mathcal {T}\) to X, it first completes each tree with respect to other trees using a distance matrix computed from quartet frequencies in \(\mathcal {T}\) and an algorithm based on the fourpoint condition [30]. ii) Polytomies in input trees are resolved once [11] or more [12] using heuristic methods that sample leaves around polytomies and use the distance matrix mentioned earlier. iii) Heuristic enhancements of set X employed in ASTRALII and ASTRALIII are all explicitly designed such that P1 is automatically satisfied, a feat that has been particularly challenging for multiindividual datasets [29]. Thus, in effect, the set X includes all clusters from each tree in a set of binary and complete trees; the set includes modified input trees and others that ASTRAL heuristically selects.
Enabling input constraints in ASTRAL
Given a constraint tree \(\bar {T}\) and a set of gene trees, \(\mathcal {T}\), our goal is to find the tree among all trees compatible with \(\bar {T}\) that has the maximum quartet score with respect to \(\mathcal {T}\). Compatibility with \(\bar {T}\) is achieved if we enforce a second property on X. P2: ∀A∈X:A is compatible with \(\bar {T}\).
Existing methods for forming X are not guaranteed to satisfy P2. One may think that we can follow standard methods of forming X and simply refuse to add clusters when they violate P2. Unfortunately, that approach, in addition to being slow, can violate property P1 and is not viable. Thus, the main challenge in building set X is maintaining P1 and P2 simultaneously and doing so in a scalable fashion.
Forming set X using tree completion
Our solution relies on completing and resolving the tree \(\bar {T}\) using every tree in X. We require a tree completion method Comp(T_{b},T_{r}) that adds to T_{b} leaves that are present in T_{r} but are absent from T_{b}. The algorithm should only add missing leaves to T_{b} and can also resolve (some of) its polytomies. In other words, the output restricted to leaves of T_{b} is a resolution of T_{r}; thus, \(Comp(\bar {T},t)\) will be compatible with \(\bar {T}\).
Tree completion/resolution methods were traditionally proposed for completing a gene tree using the species tree [4, 21, 22]. However, in our algorithm, we turn the problem on its head and complete the constraint species tree \(\bar {T}\) using individual gene trees and use these mixed trees to build the set X. This uncommon use of the completion method is the main algorithmic idea that enables us to satisfy P1 and P2.
Given a Comp(T_{b},T_{r}) method, we propose Algorithm 1 for forming set X. The first step is the primary new step and forces gene trees to be compatible with \(\bar {T}\) (by definition of Comp(T_{b},T_{r})). Step 2 is identical to ASTRALIII, where, compatible gene trees are completed in a referencefree fashion with respect to each other; any completion method such as the distancebased method used by ASTRAL [30] is valid here. Step 3, like ASTRALIII, creates a set of multifurcating consensus trees \({\mathcal {C}}\) from the compatible gene trees \(\bar {\mathcal {T}'}\). Then, in Step 4, like Step 2, we force consensus trees to be compatible with \(\bar {T}\). Thus, all trees in \(\bar {\mathcal {T}'}\) and \(\bar {\mathcal {C}}\) are compatible with \(\bar {T}\) and any resolution of these trees is also compatible with \(\bar {T}\). Step 5 uses sampling methods of ASTRALIII to resolve these trees heuristically and adds their bipartitions to X. Thus,
Claim 1
All bipartitions of X created by Algorithm 1 are part of a fully binary and complete tree (P1) and are compatible with the constraint tree \(\bar {T}\) (P2).
After forming X (Algorithm 1), DP proceeds as before, computing \(w_{\mathcal {T}}\) using the original gene trees \(\mathcal {T}\). Since all bipartitions in X are compatible with \(\bar {T}\), any tree formed by DP will be compatible with \(\circ \bar{T}\); thus, no other changes are needed.
Tree completion with nonbinary input/output
We now describe our choice for the Comp(T_{b},T_{r}) method. We base our solution on the BRF(+) algorithm [22] described earlier, which we reimplemented inside ASTRAL. However, several changes to the algorithm were necessary.
Multifurcating output. The BRF(+) algorithm and OCTAL force the output to be binary. As a result, the output can include arbitrary branches that increase false positive (FP) edges (branches in the output missing from T_{r}) without reducing false negative (FN) edges (branches in T_{r} missing from the output) (Fig. 1). Thus, if the output tree is allowed to be multifurcating, neither algorithm is optimal (shown by a counterexample; Fig. 1). As mentioned earlier, ASTRAL has several heuristics to resolve polytomies in the input trees (Step 5 of Algorithm 1), and these heuristics are preferable to an arbitrary resolution. Thus, we changed the BRF(+) algorithm so that Comp(T_{b},T_{r}) avoids adding arbitrary resolutions in Steps 2 and 4, leaving resolving polytomies to heuristics of Step 5. Our experiments show that this change substantially reduced the RF of completed trees (Fig. S1).
Recall that the BRF(+) algorithm adds each fullymissing node u in T_{r} as sister to the LCA of l(s(u)) in T_{b}; denote this LCA node as s_{b}. Let \({\mathcal {L}^{m}} = l(T_{r})\setminus l(T_{b})\) be the set of leaves missing from T_{b} and let \({\mathcal {L}^{s}} = l(T_{r})\cap l(T_{b})\) be shared leaves. Two cases arise.
Case 1.\(l(s(u))\setminus {\mathcal {L}^{m}} = l(s_{b})\). In this case, the optimal placement of the subtree below u in T_{b} is as sister to s_{b}, creating a new node above s_{b}. The reason is that this placement leads to the bipartition identified by s(u) to be identical between T_{r} and the completed tree, thereby avoiding a FN edge.
Case 2.\(l(s(u))\setminus {\mathcal {L}^{m}} \neq l(s_{b})\). Here, no placement of the subtree below u onto T_{b} can avoid the FN penalty associated with missing the bipartition associated with s(u) in T_{b}. However, placing the subtree as sister to s_{b} by creating a new internal node does lead to an unnecessary FP edge in the completed tree, separating l(s_{b}) from other leaves (Fig. 1). To avoid these FP edges, we can simply create a polytomy in the completed tree by putting the new subtree as another child of s_{b}.
Our new algorithm (called BRF(*) here) is a straightforward change of the BRF(+) algorithm. We compute the LCA mapping both ways. When inserting the u subtree into T_{b} at s_{b}, we check if the LCA of s_{b} in T_{r} matches s(u). If it does, we create a new internal node above s_{b}; otherwise, we create a polytomy in T_{b} by adding the subtree as a child of s_{b}. Let T^{∗} be the output of our updated algorithm and T^{+} be the output of the original BRF(+) algorithm. By construction, every branch in T^{∗} is also present in T^{+}, and thus, T^{∗} is a contraction of T^{+}.
Claim 2
The output of BRF(*) has the minimum RF distance to the binary tree T_{r} among all (potentially multifurcating) trees compatible with T_{b}.
We prove Claim 2 by showing that T^{∗} has the minimum possible number of FN and FP branches; the optimality of RF follows. We rely on the result that T^{+} is optimal among all binary trees [22]. By optimality of T^{+}, it has the minimum possible FN that any tree (binary or multifurcating) can achieve (a multifurcating tree cannot have a lower FN than T^{+} because binary resolution of that tree would also have a lower FN than T^{+}). Also, T^{∗} is a contraction of T^{+}, where, a branch is contracted only if it is a FP branch due to Case 2. Thus, the number of FN branches in T^{∗} equals those of T^{+} and hence is the minimum possible.
Now, let the number of FP branches in the T_{b} compared to \(T_{r}\!\!\upharpoonright _{\mathcal {L}^{s}}\) be f. Since adding \({\mathcal {L}^{m}}\) to T_{b} cannot reduce its FP branches, f is a lower bound on the number of FP branches in the optimal tree. We claim that the number of FP branches in T^{∗} is also f. Starting with T_{b}, every addition of a subtree u to the current tree (T_{i}) keeps the number of FP branches fixed. To see this, first, consider a branch b of T_{i} that matches T_{r} (restricted to common leaves) before u is added and assume u is not placed on b; then, b cannot become a FP because u is placed on the correct side of b by the proof of the BRF(+) algorithm. Next, consider the branch b where u is placed. If b matches T_{r} before the addition, u is placed here only if l(s(u)) matches leaves below b, as in Case 1, where two true positive branches are created. If b does not matches T_{r} before the addition, Case 2 ensures that we create a polytomy and avoid adding a new FP branch. Thus, every step keeps the number of FP branches fixed.
Multifurcating backbone. We defined BRF(*) for binary T_{b}, but we can have multifurcating T_{b} (here, \(\bar {T}\)). To adopt BRF(*) to multifurcating backbones, prior to completion, we need to add to T_{b} those bipartitions in T_{r} that are compatible with T_{b} (or else we will have unnecessary FN branches). This can be done using the same LCA mapping of the BRF(*) algorithm. In a postorder traversal of T_{r}, for every node u that maps to a polytomy v in T_{b}, we check whether all children of u have a LCA mapping to a child of v. If they do, we create a new node below v and move mapped children under v to be children of the new node. It is easy to see that this method adds all missing bipartitions from T_{r} to T_{b}.
Multifurcating reference. Changing BRF(*) to handle multifurcating T_{r} is straightforward. In the preorder traversal, for any polytomy node u encountered in T_{r}, when there are multiple fullymissing nodes under u, we add them as a group to the same position (as a polytomy) in T_{b}. Other cases are naturally handled by the LCA mapping used by the BRF(*) algorithm (note that we defined “sister” as the set of nodes sharing a parent with a node).
Results: simulations
We first test constrained ASTRAL on an existing [11] simulated dataset with 201 species. This SimPhy [31] dataset has three model conditions with moderate, high, or very high levels of ILS, controlled by setting the maximum species tree height to 10^{7}, 2×10^{6}, or 5×10^{5} generations; mean RF distance (normalized by total number of branches in both trees) between the true species tree and true gene trees are 15%, 34%, and 69%, respectively. We use gene trees inferred using FastTreeII [32] from sequence data in our analysis. The estimated gene trees have relatively high levels of gene tree error (the average RF distance between estimated and true gene trees are 25%, 31%, and 47% for the three model conditions). Following previous publications [11], three replicates of high ILS dataset are removed due to a extreme lack of gene tree resolution.
We compare both constrained and unconstrained ASTRAL to the true tree. We measure the topological error using the normalized RF distance, and also report the change in quartet score and running time between constrained and unconstrained ASTRAL. Note that quartet score of the constrained tree can be higher than unconstrained ASTRAL, as the default version of ASTRAL has a heuristic definition of X and is not guaranteed to find the optimal solution. We ask whether our method of forming the constrained and restricted set X is effective in providing the same level of accuracy as unconstrained (but restricted) searches while improving the running time. We then ask if the use of constraints can benefit accuracy.
Is set X restricted to a constraint tree sufficiently large?
Constraint tree \(\bar {T}\) with missing leaves
We built constraint trees that include \(\frac {1}{4}\), \(\frac {1}{2}\), or \(\frac {3}{4}\) of species by taking the ASTRALIII tree on the full dataset and pruning leaves uniformly at random. Since the unconstrained tree induces the constraint tree, the relative accuracy of constrained and unconstrained search is entirely a function of the completeness of X.
The accuracy of the constrained ASTRAL in most condition matches that of the unconstrained ASTRAL (Fig. 2). For moderate (10^{7}) and high (2×10^{6}) levels of ILS, the drop in average accuracy for different numbers of genes never exceeds 0.4%. Only with very high ILS (5×10^{5}) and constraint trees with 50 species do we start to see small but noticeable drops in the accuracy of constrained ASTRAL. For example, with 50 genes and very high ILS, the error goes up from 20% with no constraints to 26% with a \(\bar {T}\) that has 50 leaves. Consistent with this, the quartet score of the ASTRAL tree also remains largely unchanged in most cases, except, again, for the case of very high ILS and backbone trees that include only 50 leaves (Fig. S2).
Constrained searches also impact the running time and the search space X (Fig. 3). With very high ILS and \(\bar {T}\) including 150 of leaves (pruning 50), we get from 4 to 8x improvement in running times and substantial reduction in the size of the search space. This reduction explains the small reduction in the accuracy of ASTRALIII with constrained searches under these conditions. As backbone size becomes smaller, the running time converges to unconstrained ASTRAL; however, even when \(\bar {T}\) includes a quarter of the leaves, we still have 1.2 to 3x improvement in the running time. With moderate and high levels of ILS, improvements are less pronounced but still substantial. Running time improvements mirror the change in the search space size, which is dramatically reduced (Fig. 3).
We also study a scenario where missing leaves in the constraint tree form clades instead of being uniformly distributed. Results of the cladebased removal did not substantially differ from the random removal (Fig. S3). With moderate or high ILS, random and cladebased removal were indistinguishable, and for very high ILS, only small differences were observed.
To summarize, our method of forming X for constraints with missing leaves retains accuracy and reduces running time, with only small reductions in the accuracy in the most extreme conditions, namely very high ILS and small constraint trees.
Constraint tree \(\bar {T}\) with multifurcation
We next collapse randomly chosen branches from the unconstrained ASTRALIII tree to create a complete but unresolved constraint tree. With these multifurcating constraint trees, constrained ASTRAL search is as accurate as the unconstrained ASTRAL even for very high levels of ILS (Fig. 4). Differences in mean accuracy are no more than 0.1% in any of the 27 conditions we tested. Remarkably, in the case of very high ILS, we even see a small but noticeable improvement in quartet score (but not accuracy) when the constraint tree includes only 50 branches (Fig. S4). Once again, the running time and the size of the search space both reduce dramatically in the constrained searches (Figure S5). Thus, our method of forming X is effective in the face of multifurcating constraint trees.
Can a constrained search help accuracy?
Constrained searches have the power to improve accuracy if prior knowledge of parts of the tree is available. To test this proposition in simulations, we study the accuracy of constrained ASTRAL when the constraint tree \(\bar {T}\) is a subset or a contraction of the true species tree. In both cases (Figs. 2 and 4), the accuracy of the ASTRAL tree improves, and changes are dramatic when the constraint tree is missing only 50 leaves or branches. The improvements are especially strong for the case of complete but multifurcating true species trees (Fig. 4) where a constraint tree with only 50/198 branches can reduce the error from 19% to 13% with 50 genes with very high ILS. If \(\bar {T}\) includes 150/198 branches, the error reduces down to 4%.
The dramatic improvements in accuracy are perhaps not surprising given the fact that parts of the tree are fixed to match the true tree. More interesting is whether adding constraints to some part of the tree improves the accuracy of the remaining parts of the tree. We thus evaluated the accuracy of trees only restricted to the leaves that are not part of the constraint tree. We observe that the accuracy of the remaining leaves has also increased dramatically as a result of having constraints (Fig. 5). For example, in the very high ILS case with 50 genes, when 50 species are missing from the correct constraint tree, the error for the placement of these 50 species has reduced from 21% with no constraints to only 7% with constraints. Similarly strong levels of improvement are observed across all conditions, except when the constraint tree includes only 50 leaves. To summarize, the result demonstrates that given correct prior knowledge about parts of the phylogeny, a constraint ASTRAL search can improve the accuracy of the remaining parts.
Results: biological dataset
We reanalyze the avian phylogenomic dataset [33] with 48 bird species and more than 14,000 loci. The statistical binning method has been proposed to enable coalescentbased analyses of this dataset despite the low phylogenetic signal [34]. The main novel result found using this dataset is the division to Passerea and Columbea at the base of Neoaves, a relationship that was recovered in most analyses of the dataset, including concatenation (TENT), MPEST [35] run on binned gene trees (MPEST*), and ASTRAL run on unbinned gene trees with low support branches contracted [12]. However, running ASTRAL on 2022 binned gene trees failed to recover Passerea and Columbea and placed Otidimorphae (a clade within Passerea) within Columbea (Fig. 6). Nevertheless, the localPP support [36] for this alternative relationship is low. Thus, using constrained searches, we now ask whether there is support for Columbea/Passerea in the binned gene trees.
Constraining the ASTRAL tree to include Passerea results in recovering Columbea and placing Otidimorphae as sister to other Passerea. The Columbea clade, absent from unconstrained ASTRAL, has high support (0.97 localPP) in the constrained tree. Moreover, the support of the Columbimorphae, a clade universally supported in modern analyses, increases from 0.9 in the unconstrained tree to 1.0. On the other hand, the localPP support for Passerea is only 0.37, which is barely above the expected support of a random resolution (0.33), and the total quartet score of the tree is reduced by 37230 quartets (0.015%). We then performed another constrained analysis forcing Otidimorphae to be with Caprimulgimorphae (as in TENT). This constraint leads to Passerea and Columbea both becoming monophyletic with 0.99 and 0.97 localPP (Fig. 6b). However, the Otidimorphae+Caprimulgimorphae clade itself has low localPP (0.13) and total quartet score reduces by 0.045%. Overall, while the unconstrained ASTRAL tree does not recover Columbea and Passerea, gene trees strongly support Columbea (if not Passerea).
We next tested four other clades recovered in TENT but absent from the unconstrained ASTRAL (Fig. 6b). Several patterns were observed. Forcing Afroaves to be monophyletic reveals a total lack of support for the monophyly of that clade (localPP =0 and 0.07% reduction in quartet score). Forcing Cuckoo to be sister to Bustard or Hoatzin to be sister to Cursorimorphae shows a case where neither the constrained nor the unconstrained tree have strong support, and thus, results are inconclusive. Most interestingly, owl fits quite well with Coraciimorphae (localPP 0.99 in constrained analyses) as well as its unconstrained position as sister to birds of prey (localPP 1.0); this observation creates a suspicion of gene tree discordance due to processes other than ILS such as hybridization.
Discussion
The method we introduced for honoring userprovided constraints relies on an extension of a tree completion algorithm by Bansal [22]. Our choice of the RFbased tree completion was driven by the availability of the fast BRF(+) algorithm, which we adopted. Note that our method of forming X is heuristic and the appropriateness of a criterion (such as RF) is an empirical question. However, there is no reason to think that other tree completion/resolution criteria could not have worked equally well or better.
From an algorithmic perspective, constrained search can be considered an extension of the phylogenetic placement problem [37–40]. Unlike placement, constrained search also infers the relationship between query genomes and hence is more informative. Like placement, constrained searches can be used to grow existing trees in an automated fashion and regular basis, without the need to redo the analysis from scratch each time. Taking this idea one step further, constrained ASTRAL can perhaps help develop new divideandconquer metamethods that allow ASTRAL to scale to much larger datasets than what it can currently handle.
By completing and resolving the constrained species tree using input gene trees, we produce a “hybrid” between individual gene trees and the incomplete/unresolved species tree. We showed these hybrid trees are effective in defining search space but we also wondered about accuracy of these trees. Recall that our estimated gene trees have high estimation error (25% to 47% mean RF). Inspired by tree fixing methods [41], we ask if the hybrid gene trees (which, by construction, have reduced RF to the species tree) are better estimates of the true gene trees than the original gene trees. To answer this question, we estimated unconstrained ASTRALIII trees, collapsed branches with ≤0.99 support, and used the resulting tree as T_{b} and each gene tree as T_{r} as input to the tree completion algorithm. The resulting hybrid trees had mixed accuracy (Fig. 7). With moderate ILS, hybrid gene trees have reduced error compared to original estimated gene trees; here, true gene trees are similar to the species tree (15% mean RF), and reducing distance to the (collapsed) species tree reduces the error. However, with higher ILS, original gene trees are more accurate perhaps because true gene trees are dissimilar to the species tree (69% mean RF) and making gene trees similar to the species tree is not beneficial. Using the hybrid gene trees as input to ASTRAL increases the species tree error in all cases (from 4.3%, 5.6%, 10.2% mean RF to 4.9%, 8.3%, 20.5%, respectively, for moderate, high, and very high ILS and 200 genes).
The constrained ASTRAL search opens the door for several downstream biological analyses. As shown in our reanalyses of the avian dataset, we can now perform hypothesisdriven analyses, as attempted often by systematists. These analyses have the potential to reveal support for some branches that are not recovered in the main ASTRAL tree. The presence of alternative support can also be visualized using tools such as DiscoVista [42]. Moreover, with constrained searches, we can now test if fixing parts of the tree can impact the resolution or support for the other parts. These analyses can help users understand how robust their estimated species trees are. Finally, methods of interrogating the impact of individual genes, such as Partitioned Coalescent Support (PCS) [43] can also benefit from constrained search. Currently, these methods limit themselves to scoring handcurated trees but they can instead use automatically generated constrained ASTRAL trees.
Conclusions
We described an algorithm for genomewide inference of species trees from gene trees while honoring userprovided constraint tree. We have implemented and tested this approach for ASTRAL; however, the same strategy should work for other similar DP methods. Our results showed that the constrained ASTRAL is effective in searching the tree topology space. More interestingly, we showed that constrained search using ASTRAL can help biologists obtain a better understanding of the complexities of genomewide evolution by revealing support for conflicting resolutions in the species tree.
Availability of data and materials
Constrained ASTRAL code along with a brief documentation and example for running the code is available on GitHub (https://github.com/maryamrabiee/Constrainedsearch). Datasets and results are also available on GitHub (https://github.com/maryamrabiee/Constrainedsearchdata)
Abbreviations
 ASTRAL:

Accurate species tree algorithm
 DP:

Dynamic programming
 FN:

False negative
 FP:

False positive
 ILS:

Incomplete lineage sorting
 LCA:

Least common ancestor
 MPEST:

Maximum Pseudolikelihood for estimatingsSpecies trees
 RF distance:

Robinsonfoulds distance
References
 1
Bryant D. Hunting for trees in binary character sets: efficient algorithms for extraction, enumeration, and optimization. J Comput Biol. 1996; 3(2):275–88.
 2
Hallett MT, Lagergren J. New algorithms for the duplicationloss model. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology  RECOMB ’00. New York: ACM Press: 2000. p. 138–46. https://doi.org/10.1145/332306.332359.
 3
Chang WC, Górecki P, Eulenstein O. Exact solutions for species tree inference from discordant gene trees. J Bioinforma Comput Biol. 2013; 11(05):1342005. https://doi.org/10.1142/S0219720013420055.
 4
Bayzid MS, Warnow T. Gene Tree Parsimony for Incomplete Gene Trees. In: 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), volume 88. Dagstuhl: Schloss Dagstuhl–LeibnizZentrum fuer Informatik: 2017. p. 2–1213. https://doi.org/10.4230/LIPIcs.WABI.2017.2.
 5
Than C, Nakhleh L. Species Tree Inference by Minimizing Deep Coalescences. PLoS Comput Biol. 2009; 5(9):1000501. https://doi.org/10.1371/journal.pcbi.1000501.
 6
Vachaspati P, Warnow T. FastRFS: fast and accurate RobinsonFoulds Supertrees using constrained exact optimization. Bioinformatics. 2017; 33(5):631–9. https://doi.org/10.1093/bioinformatics/btw600.
 7
Bryant D, Steel M. Constructing Optimal Trees from Quartets. J Algorithm. 2001; 38(1):237–59. https://doi.org/10.1006/jagm.2000.1133.
 8
Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genomescale coalescentbased species tree estimation. Bioinformatics. 2014; 30(17):541–8. https://doi.org/10.1093/bioinformatics/btu462.
 9
Vachaspati P, Warnow T. SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Mol Phylogenet Evol. 2018; 124:122–36. https://doi.org/10.1016/j.ympev.2018.03.006.
 10
Mirarab S. Species Tree Estimation Using ASTRAL: Practical Considerations. Arxiv preprint. 2019; 1904.03826.
 11
Mirarab S, Warnow T. ASTRALII: Coalescentbased species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015; 31(12):44–52. https://doi.org/10.1093/bioinformatics/btv234.
 12
Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRALIII: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics. 2018; 19(S6):153. https://doi.org/10.1186/s128590182129y.
 13
Medina M, Collins AG, Silberman JD, Sogin ML. Evaluating hypotheses of basal animal phylogeny using complete sequences of large and small subunit rRNA. Proc Natl Acad Sci. 2001; 98(17):9707–12. https://doi.org/10.1073/pnas.171316998.
 14
Planet PJ. Tree disagreement: measuring and testing incongruence in phylogenies,. J Biomed Inform. 2006; 39(1):86–102. https://doi.org/10.1016/j.jbi.2005.08.008.
 15
Tarver JE, dos Reis M, Mirarab S, Moran RJ, Parker S, O’Reilly JE, King BL, O’Connell MJ, Asher RJ, Warnow T, Peterson KJ, Donoghue PCJ, Pisani D. The Interrelationships of Placental Mammals and the Limits of Phylogenetic Inference. Genome Biol Evol. 2016; 8(2):330–44. https://doi.org/10.1093/gbe/evv261.
 16
Arcila D, Ortí G, Vari R, Armbruster JW, Stiassny MLJ, Ko KD, Sabaj MH, Lundberg J, Revell LJ, Betancur RR. Genomewide interrogation advances resolution of recalcitrant groups in the tree of life. Nat Ecol Evol. 2017; 1(January):0020. https://doi.org/10.1038/s415590160020.
 17
Poe S, Chubb AL. Birds in a Bush : Five Genes Indicate Explosive Evolution of Avian Orders. Evolution. 2004; 58(2):404–15.
 18
Burleigh JG, Mathews S. Phylogenetic signal in nucleotide data from seed plants: implications for resolving the seed plant tree of life. Am J Bot. 2004; 91(10):1599–613. https://doi.org/10.3732/ajb.91.10.1599.
 19
Crandall KA, Fitzpatrick JF. Crayfish Molecular Systematics: Using a Combination of Procedures to Estimate Phylogeny. Syst Biol. 1996; 45(1):1–26. https://doi.org/10.1093/sysbio/45.1.1.
 20
Nguyen Np, Mirarab S, Liu B, Pop M, Warnow T. TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics. 2014; 30(24):3548–55. https://doi.org/10.1093/bioinformatics/btu721.
 21
Christensen S, Molloy EK, Vachaspati P, Warnow T. OCTAL: Optimal Completion of gene trees in polynomial time. Algorithm Mol Biol. 2018; 13(1):6. https://doi.org/10.1186/s1301501801245.
 22
Bansal MS. Lineartime algorithms for some phylogenetic tree completion problems under robinsonfoulds distance. In: RECOMB International Conference on Comparative Genomics. Springer: 2018. p. 209–26.
 23
Warnow T. Textbook for 394C : Algorithms for Computational Biology.
 24
Christensen S, Molloy EK, Vachaspati P, Warnow T. Optimal completion of incomplete gene trees in polynomial time using OCTAL. In: Leibniz International Proceedings in Informatics, LIPIcs: 2017. https://doi.org/10.4230/LIPIcs.WABI.2017.27.
 25
Kupczok A. Splitbased computation of majorityrule supertrees. BMC Evol Biol. 2011; 11(1):205. https://doi.org/10.1186/1471214811205.
 26
Schieber B, Vishkin U. On finding lowest common ancestors: Simplification and parallelization. In: VLSI Algorithms and Architectures. Berlin/Heidelberg: Springer: 1988. p. 111–23. https://doi.org/10.1007/BFb0040379. http://www.springerlink.com/index/10.1007/BFb0040379.
 27
Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988; 5(5):568–83.
 28
Lafond M, Scornavacca C. On the Weighted Quartet Consensus problem. Theoret Comput Sci. 2019; 769:1–17. https://doi.org/10.1016/j.tcs.2018.10.005.
 29
Rabiee M, Sayyari E, Mirarab S. Multiallele species reconstruction using ASTRAL. Mol Phylogenet Evol. 2019; 130:286–96. https://doi.org/10.1016/j.ympev.2018.10.033.
 30
Mirarab S. Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction. PhD thesis. 2015.
 31
Mallo D, De Oliveira Martins L, Posada D. SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees,. Syst Biol. 2016; 65(2):334–44. https://doi.org/10.1093/sysbio/syv082.
 32
Price MN, Dehal PS, Arkin AP. FastTree2 – Approximately MaximumLikelihood Trees for Large Alignments. PLoS ONE. 2010; 5(3):9490. https://doi.org/10.1371/journal.pone.0009490.
 33
Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SYW, Faircloth BC, Nabholz B, Howard JT, Suh A, Weber CC, da Fonseca RR, Li J, Zhang F, Li H, Zhou L, Narula N, Liu L, Ganapathy G, Boussau B, Bayzid MS, Zavidovych V, Subramanian S, Gabaldon T, CapellaGutierrez S, HuertaCepas J, Rekepalli B, Munch K, Schierup M, Lindow B, Warren WC, Ray D, Green RE, Bruford MW, Zhan X, Dixon A, Li S, Li N, Huang Y, Derryberry EP, Bertelsen MF, Sheldon FH, Brumfield RT, Mello CV, Lovell PV, Wirthlin M, Schneider MPC, Prosdocimi F, Samaniego JA, Velazquez AMV, AlfaroNunez A, Campos PF, Petersen B, SicheritzPonten T, Pas A, Bailey T, Scofield P, Bunce M, Lambert DM, Zhou Q, Perelman P, Driskell AC, Shapiro B, Xiong Z, Zeng Y, Liu S, Li Z, Liu B, Wu K, Xiao J, Yinqi X, Zheng Q, Zhang Y, Yang H, Wang J, Smeds L, Rheindt FE, Braun M, Fjeldsa J, Orlando L, Barker FKK, Jonsson KA, Johnson W, Koepfli KP, O’Brien S, Haussler D, Ryder OA, Rahbek C, Willerslev E, Graves GR, Glenn TC, McCormack J, Burt D, Ellegren H, Alstrom P, Edwards SV, Stamatakis A, Mindell DP, Cracraft J, Braun EL, Warnow T, Jun W, Gilbert MTPTP, Zhang G. Wholegenome analyses resolve early branches in the tree of life of modern birds. Science. 2014; 346(6215):1320–31. https://doi.org/10.1126/science.1253451.
 34
Mirarab S, Bayzid MS, Boussau B, Warnow T. Statistical binning enables an accurate coalescentbased estimation of the avian tree. Science. 2014; 346(6215):1250463. https://doi.org/10.1126/science.1250463.
 35
Liu L, Yu L, Edwards SV. A maximum pseudolikelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 2010; 10(1):302.
 36
Sayyari E, Mirarab S. Fast CoalescentBased Computation of Local Branch Support from Quartet Frequencies. Mol Biol Evol. 2016; 33(7):1654–68. https://doi.org/10.1093/molbev/msw079.
 37
Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximumlikelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010; 11(1):538. https://doi.org/10.1186/1471210511538.
 38
Balaban M, Sarmashghi S, Mirarab S. APPLES: Fast Distancebased Phylogenetic Placement. bioRxiv. 2018:475566. https://doi.org/10.1101/475566.
 39
Barbera P, Kozlov AM, Czech L, Morel B, Darriba D, Flouri T, Stamatakis A. EPAng: Massively Parallel Evolutionary Placement of Genetic Sequences. Syst Biol. 2019; 68(2):365–9. https://doi.org/10.1093/sysbio/syy054.
 40
Rabiee M, Mirarab S. INSTRAL: Discordanceaware Phylogenetic Placement using Quartet Scores. bioRxiv. 2018; 432906. https://doi.org/10.1101/432906.
 41
Wu YC, Rasmussen MD, Bansal MS, Kellis M. TreeFix: Statistically Informed Gene Tree Error Correction Using Species Trees. Syst Biol. 2013; 62(1):110–20. https://doi.org/10.5061/dryad.44cb5.
 42
Sayyari E, Whitfield JB, Mirarab S. DiscoVista: Interpretable visualizations of gene tree discordance. Mol Phylogenet Evol. 2018; 122:110–5. https://doi.org/10.1016/j.ympev.2018.01.019.
 43
Gatesy J, Sloan DB, Warren JM, Baker RH, Simmons MP, Springer MS. Partitioned coalescence support reveals biases in speciestree methods and detects gene trees that determine phylogenomic conflicts. Mol Phylogenet Evol. 2019; 139:106539. https://doi.org/10.1016/j.ympev.2019.106539.
Acknowledgements
The authors thank anonymous reviewers.
About this supplement
This article has been published as part of BMC Genomics Volume 21 Supplement 2, 2020: Proceedings of the 17th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume21supplement2.
Funding
MR and SM were both supported through the National Science Foundation (NSF) grant III1845967. Computations were performed on the San Diego Supercomputer Center (SDSC) through XSEDE allocations, which is supported by the NSF grant ACI1053575. The publication cost of this article was funded by the NSF grant IIS1845967.
Author information
Affiliations
Contributions
Both authors designed the algorithms and contributed to writing. M.R implemented the code and performed analyses. The author(s) read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Rabiee, M., Mirarab, S. Forcing external constraints on tree inference using ASTRAL. BMC Genomics 21, 218 (2020). https://doi.org/10.1186/s128640206607z
Published:
Keywords
 ASTRAL
 Constrained tree search
 Tree completion
 RF distance