Skip to main content

STELAR: a statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency

Abstract

Background

Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated due to the presence of gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets.

Results

We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast and highly accurate statistically consistent coalescent-based method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimate of the species tree under the multi-species coalescent (MSC) model. STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. We evaluated the accuracy of STELAR in comparison with SuperTriplets, which is an alternate fast and highly accurate triplet-based supertree method, and with MP-EST and ASTRAL – two of the most popular and accurate coalescent-based methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets.

Conclusions

Theoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.

Background

Estimated species trees are useful in many biological analyses, but accurate estimation of species trees can be quite complicated. Species tree inference can potentially result in accurate evolutionary history using data from multiple loci. Therefore, due to the advent of modern sequencing technologies, it is increasingly common to infer trees by analyzing sequences from multiple loci. However, combining multi-locus data is difficult, especially in the presence of gene tree discordance [1]. A traditional approach to species tree estimation is called concatenation (also known as combined analysis), which concatenates gene sequence alignments into a supergene matrix, and then estimates the species tree using a sequence based tree estimation technique (e.g., maximum parsimony, maximum likelihood etc.). Although it has been used in many biological analyses, concatenation can be problematic as it is agnostic to the topological differences among the gene trees, can be statistically inconsistent [2], and can return incorrect trees with high confidence [36].

Recent modeling and computational advances have produced methods that explicitly take the gene tree discordance into account while combining multi-locus data to estimate species trees. Incomplete lineage sorting (ILS) (also known as deep coalescence) is one of the most prevalent reasons for gene tree incongruence [1], which is modelled by the MSC [7]. Rapid radiation is known to happen in many organisms (for example, birds) where ILS is likely to occur [8]. Due to this growing awareness that ILS can be present in many phylogenomic datasets, ILS-aware species tree estimation techniques have gained substantial attention from systematists. These types of method are usually known as “summary methods” as they summarize a set of gene trees to estimate the species trees [9].

Several species-tree estimation methods have been proven to be statistically consistent under the multi-species coalescent model, meaning that they will return the true species tree with high probability given a sufficiently large number of true gene trees sampled from the distribution defined by the species tree. Methods that are statistically consistent under the MSC include ASTRAL [10], MP-EST [11], *BEAST [12], NJst [13], BUCKy [14], GLASS [15], STEM [16], SNAPP [17], SVDquartets [18], STEAC [19] and ASTRID [20]. *BEAST, which is a Bayesian technique, can co-estimate both gene trees and species trees, and simulation studies suggest that *BEAST can be considered the best of the co-estimation methods with excellent accuracy, but it is computationally very intensive to run on even moderately-sized dataset [9, 21]. SuperTriplets [22] is a triplet-based supertree method which tries to find an asymmetric median supertree according to a triplet dissimilarity. SuperTriplets is very fast and accurate and was used in a number of important phylogenomic studies [2328]. ASTRAL and MP-EST are two of the most accurate and widely used summary methods that are much faster than *BEAST. ASTRAL has been shown to be faster and more accurate than MP-EST and can handle large dataset containing hundreds of species which is not possible by MP-EST [10, 29].

ASTRAL finds a species tree that maximizes the number of consistent quartets between the species tree and the gene trees. MP-EST maximizes a pseudo-likelihood estimate utilizing the underlying triplet distribution of the gene trees. Quartet and triplet based methods are robust to the “anomaly zone” [30, 31] (a condition where the most probable gene tree topology may not be identical to the species tree topology) as there are no anomalous rooted three-taxon species trees [30, 32] and no anomalous unrooted four-taxon species trees [31, 33].

We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new coalescent-based method which finds a species tree that agrees with the largest number of triplets induced by the gene trees. STELAR is statistically consistent under the MSC model, fast (having a polynomial running time), and highly accurate – enabling genome wide phylogenomic analyses. We report, on an extensive experimental study, the performance of STELAR in comparison with ASTRAL, MP-EST and SuperTriplets. These quartet- and triplet-based methods are known to be highly accurate and reliable, and are being widely used. Our experiments showed that STELAR is as good as ASTRAL, and is better than MP-EST and SuperTriplets in most of the model conditions. Thus, with the desirable property of being statistically consistent and reliable performance on challenging realistic model conditions, STELAR will be a useful tool for genome-scale analyses of species phylogeny.

Implementation

Design of ASTRAL and other quartet-based methods [10, 34] and their proofs of being statistically consistent are based upon the fact that unrooted 4-taxon species trees do not contain anomaly zone [31, 33]. We use similar design in STELAR, utilizing the fact that rooted 3-taxon species trees do not contain anomaly zone [30, 32].

STELAR is provably statistically consistent under the MSC model but its statistical guarantee can fail under model conditions in which gene tree discordance is due to gene duplication and loss or horizontal gene transfer. Furthermore, MSC model requires representing each gene by a single tree. Therefore, MSC model does not allow recombination events inside a gene [32] and consequently the statistical guarantee can fail in the presence of recombination.

Definitions and notation

Let T be a binary rooted tree, leaf-labelled by species set \(\mathcal {X}\) with n taxa. Each internal node u in a tree T divides the set of taxa present in the subtree rooted at u into two distinct sets. We call this a subtree-bipartition, denoted by \(\mathcal {SBP}_{T}(u)\), which was originally defined in [35]. We denote by Tu the subtree under node u of tree T. We denote the leaves in T by L(T), an arbitrary set of three species \(\{a,b,c\} \subset \mathcal {X}\) by r, and a rooted topology on r by tr. We use a|bc to show that b and c are siblings. Possible three topologies on {a,b,c} are: a|bc,b|ca,c|ab. The probability, under the MSC model, of observing an induced gene tree triplet that matches the species tree topology is higher, and the other two triplet topologies are less probable and equal to each other [30]. The triplet tree topology that appears more frequently than the two alternative topologies is called the dominant topology. \(\mathcal {TR}(T)\) denotes the set of \(\binom {n}{3}\) triplet topologies induced by the tree T. We denote by T|r the triplet tree topology obtained by restricting T to the three species in r.

Problem definition

A natural optimization problem for estimating a species tree from a collection of gene trees under triplet consistency would be first decomposing the input gene trees into their induced triplets, and then estimating a species tree by amalgamating the dominant triplets [36]. Building such a consensus tree from a set of triplets has been shown to be an NP-hard problem, and exact and heuristic algorithms have been proposed [37, 38]. This approach considers only the dominant triplets which could be problematic when the frequencies of the alternate triplet topologies are very close to that of the dominant one, and when all the dominant triplets are not compatible with each other meaning that all of them do not agree with a single species tree. Moreover, this approach has to explicitly enumerate and consider \(3*\binom {n}{3}\) possible triplets, making it computationally expensive for larger values of n. An alternate approach would be to consider the relative frequencies of all the triplets, and infer a species tree that agrees with the largest number of triplets induced by the gene trees. Methods that do not explicitly decompose the gene trees into induced triplets and quartets are preferred over the ones that take a set of quartets or triplets as input, since the latter types of methods demand high computational time and memory, and thus are not scalable to large dataset [10, 34]. This is possibly one of the reasons why quartet amalgamation method like QFM [34], despite having excellent accuracy, is not as popular as ASTRAL. Taking these issues into consideration, we introduce a constrained version of the triplet consistency problem, which we call the Constrained Triplet Consensus (CTC) problem, and provide efficient exact and heuristic solutions to the CTC without having to decompose the gene trees into their induced triplets. We formalize the CTC problem as follows.

  • Input: a set \(\mathcal {G}\) of rooted gene trees on \(\mathcal {X}\) and a set \(\mathcal {SBP}\) of subtree-bipartitions.

  • Output: a species tree ST on \(\mathcal {X}\) that maximizes the triplet agreement with \(\mathcal {G}\) and all its subtree-bipartitions are in \(\mathcal {SBP}\).

Note that when \(\mathcal {SBP}\) is set to all possible subtree-bipartitions on \(\mathcal {X}\), the triplet consistency problem will find the globally optimal solution (exact solution). Otherwise, when \(\mathcal {SBP}\) contains only the subtree-bipartitions in the input gene trees, the search space will include only those species trees where every subtree-bipartition is present in at least one of the gene trees in \(\mathcal {G}\), and we call this the constrained version or heuristic version.

Theorem 1

Given a set \(\mathcal {G}\) of true gene trees, solution to the CTC problem (both exact and constrained version) is a statistically consistent estimator of the species tree topology under the MSC model.

Proof

Let \(\mathcal {G}\) be a set of sufficiently large number of true gene trees that has evolved within the model species tree T under the multi-species coalescent model. We know that a rooted 3-taxon species tree does not have any anomaly zone [32]. So, as we increase the number of gene trees, each triplet topology induced by the species tree will have more frequency in \(\mathcal {G}\) than its alternatives, with probability that approaches 1. Let \(w_{\mathcal {G}}(t_{r})\) be the number of trees in \(\mathcal {G}\) that agree with tr. Hence, for every triplet r and every possible tree \(T\prime, w_{\mathcal {G}}(T|r) \geq w_{\mathcal {G}}(T\prime |r)\) with probability that approaches 1 for a large enough number of gene trees. Therefore, if \(\mathcal {TR}\) is the set of all possible triplets on \(\mathcal {X}\), then

$$\sum_{r \epsilon \mathcal{TR}}w_{\mathcal{G}}(T|r) \geq \sum_{r \epsilon \mathcal{TR}}w_{\mathcal{G}}(T\prime|r).$$

By definition, the exact solution to the CTC problem will find a tree that will maximize the triplet agreement with \(\mathcal {G}\). Therefore, for a sufficiently large number of true gene trees, the optimal tree under CTC will be identical to the true species tree T with probability approaching one.

We now show that the solution to the constrained version is also statistically consistent. If we increase the number of gene trees, with probability approaching 1, at least one of the gene tree topologies in \(\mathcal {G}\) will be identical to the true species tree topology. Therefore, with probability that tends to 1, the set \(\mathcal {SBP}\) will contain all the subtree bipartitions from the true species tree, and therefore the solution to the CTC problem will be identical to the true species tree T. □

Algorithmic design of STELAR

We propose a dynamic programming (DP) solution to the CTC problem. The algorithmic design (and corresponding theoretical results) in STELAR is structurally similar to ASTRAL. This sort of DP-based approach, which implicitly finds a maximum or minimum clique in a graph modelled from the input gene trees, was first used by [39] and later was used in Phylonet [40, 41], DynaDup [35, 42] and ASTRAL. The key idea in STELAR is to find an equation for computing the number of triplets in the gene trees that agree with a given subtree in the species tree, which ultimately enables us to design a dynamic programming algorithm for estimating a species tree by maximizing the triplet agreement.

We call a triplet tr=a|bc to be mapped to a subtree-bipartition at an internal node x in T (\(\mathcal {SBP}(T_{x})=X_{1}|X_{2}\)) when {a,b,c}L(Tx), and the topology of tr is compatible (consistent) with Tx (see Fig. 1). Each triplet mapped to a subtree-bipartition (X1|X2) will have either two leaves from X1 and one leaf from X2 or two leaves from X2 and one leaf from X1. Therefore, we have the following results.

Fig. 1
figure1

Mapping of a triplet to a subtree-bipartition. Each internal node x in a rooted tree defines a subtree-bipartition (X1|X2). Each induced triplet in T maps to a subtree-bipartition in T. This figure shows the mapping of a triplet a|bc to a subtree-bipartition x=X1|X2

Lemma 1

Total number of triplets mapped to a subtree-bipartition x=(X1|X2) of a rooted binary tree T, where |X1|=n1 and |X2|=n2 is:

$$ {\begin{aligned} \mathcal{NT}(n_{1}, n_{2}) &= {n_{1}\choose 2}{n_{2}\choose 1} + {n_{2}\choose 2}{n_{1}\choose 1} \\ &= \frac{n_{1}n_{2}(n_{1} + n_{2} - 2)}{2}. \end{aligned}} $$
(1)

Lemma 2

For two subtree-bipartitions x=(X1|X2) and y=(Y1|Y2) where \((X_{1} \cup X_{2}) \subseteq \mathcal {X}\) and \((Y_{1} \cup Y_{2}) \subseteq \mathcal {X}\), the number of triplets mapped to both x and y is given by:

$$ {\begin{aligned} \mathcal{M}(x,y) &= \mathcal{NT}(|X_{1} \cap Y_{1}|, |X_{2} \cap Y_{2}|)\\ &\quad + \mathcal{NT}(|X_{1} \cap Y_{2}|, |X_{2} \cap Y_{1}|). \end{aligned}} $$
(2)

Proof

We can obtain two possible subtree bipartitions by taking pairwise intersection: z1=(X1Y1|X2Y2) and z2=(X1Y2|X2Y1). Note that each leaf in zi (i1,2) is also a leaf in both x and y, and so each triplet mapped to zi is also mapped to x and y and the number of mapped triplets can be computed by Eq. 1.

We now prove that Eq. 2 counts a mapped triplet only once. Without loss of generality, let tr=ab|c be mapped to both x and y where a,bX1,cX2,a,bY1 and cY2. Equation 2 will count tr for z1=(X1Y1|X2Y2) only as z2 will have no element from {a,b,c}. Thus the lemma follows. □

We now see how to compute the triplet consistency (TC) score of a species tree ST with respect to a set \(\mathcal {G}\) of gene trees, which we denote by \(TC_{\mathcal {G}}(ST)\). If \(\mathcal {SBP}(T)\) is the set of subtree-bipartitions in T, for a subtree-bipartition x and a set of rooted binary trees \(\mathcal {G}\), the total number of triplets in \(\mathcal {G}\) that are mapped to x is given by:

$$ TC_{\mathcal{G}}(x) = \sum_{gt \in \mathcal{G}}\sum_{y \in \mathcal{SBP}(gt)}\mathcal{M}(x, y). $$
(3)

Thus, the TC score of an estimated species tree ST is obtained by,

$$ TC_{\mathcal{G}}(ST) = \sum_{x \in \mathcal{SBP}(ST)}TC_{\mathcal{G}}(x) $$
(4)

Therefore, we can design a dynamic programming approach which starts with the whole species set \(\mathcal {X}\) and recursively divides it into smaller subsets and picks the partition that maximizes the triplet consistency. At each level, we compute V(A) which gives the score of an optimal subtree on leaf set \(A \subseteq \mathcal {X}\). At the end level, we have computed \(V(\mathcal {X})\) and sufficient information so that backtracking yields the corresponding subtree-bipartitions in the optimal species tree.

Dynamic programming

Base case: |A|=1;V(A)=0Recursive relation: \(V(A) = \max _{(A^{\prime }\vert A - A^{\prime }) \in \mathcal {X}}\{V(A^{\prime }) + V(A - A^{\prime }) + TC_{\mathcal {G}}(A^{\prime }\vert A - A^{\prime })\}\), where \( A^{\prime }|A- A^{\prime } \in \mathcal {SBP}\).

Running time analysis

Lemma 3

Given a set \(\mathcal {G}\) of k gene trees on n taxa and a subtree-bipartition x, computing the triplet consistency score \(TC_{\mathcal {G}}(x)\) takes O(n2k) time.

Proof

A gene tree with n taxa contains O(n) subtree-bipartitions. Thus, for a set of k input gene trees, there will be O(nk) subtree-bipartitions. To find the triplet consistency score \(TC_{\mathcal {G}}(x)\) of a subtree-bipartition x using Eq. 3, we need to score \(\mathcal {M}(x,y)\) for all subtree-bipartions y present in the input gene trees. Since we represent each subtree-bipartion by a bitset, calculating \(\mathcal {M}(x,y)\) (using Eq. 2) takes O(n) time. Since there are O(nk) subtree-bipartions in \(\mathcal {G}\), calculating \(TC_{\mathcal {G}}(x)\) takes O(n2k) time. □

Theorem 2

For a given set \(\mathcal {G}\) of k gene trees on n taxa and a set \(\mathcal {SBP}\) of subtree-bipartitions, STELAR runs in \(O\left (n^{2}k\vert \mathcal {SBP}\vert ^{2}\right)\) time.

Proof

For a specific cluster A, V(A) can be computed in \(O\left (|\mathcal {SBP}|*n^{2}k\right)\) time since at worst we need to look at every subtree-bipartition in \(\mathcal {SBP}\), and computing \(TC_{\mathcal {G}}(x)\) takes O(n2k) time. Note that this is a conservative analysis since the number of subtree-bipartitions that the DP algorithm needs to consider for computing V(A) for a specific cluster A is, in reality, much less than \(|\mathcal {SBP}|\) (especially for A where \(|A| < |\mathcal {X}|\)). There will be \(O(|\mathcal {SBP}|)\) possible clusters for which we need to compute V(A). Thus, the running time of STELAR is \(O\left (n^{2}k|\mathcal {SBP}|^{2}\right)\). □

Software implementation

STELAR is a java based cross-platform application. STELAR code uses code from the PhyloNet package by Luay Nakhleh (bioinfo.cs.rice.edu/phylonet). The Phylonet code base was previously used by DynaDup [35, 42] and ASTRAL (with permission from the authors), and STELAR is mostly based on the DynaDup code base.

Experimental studies

We compared STELAR with SuperTriplets, which is a very fast and highly accurate triplet based supertree technique and has been used in several important phylogenomic studies [2227]. We also evaluated STELAR in comparison with ASTRAL-III [43] and MP-EST, two of the most widely used and accurate coalescent-based methods. ASTRAL has already been compared and shown to outperform many of the existing species tree estimation methods (e.g., BUCKy-pop, MRP, greedy consensus, CA-ML, etc.), especially in the presence of moderate to high levels of ILS. So we did not include those methods in this study. Most of the datasets analyzed in this study are prohibitively large for *BEAST to analyze. We ran MP-EST with 10 random starting points and selected the species tree with the highest pseudo-likelihood value.

We used previously studied simulated and biological datasets to evaluate the performance of these methods. We compared the estimated species trees to the model species tree (for the simulated datasets) or to the scientific literature (for the biological datasets), to evaluate accuracy. We used normalized Robinson-Foulds distance (RF rate) [44] to measure the tree error. The RF distance between two trees is the sum of the bipartitions (splits) induced by one tree but not by the other, and vice versa. All the trees estimated by ASTRAL-III, STELAR and MP-EST in this study are binary and so RF rate, False Positive (FP) rate and False Negative (FN) rate are identical. However, the trees estimated by SuperTriplets may produce multifurcations (unresolved trees). We performed Wilcoxon signed-rank test (with α=0.05) to measure the statistical significance of the differences between two methods.

Datasets

We studied four collections of simulated datasets: one based on a biological dataset (37-taxon mammalian dataset) that was generated and subsequently analyzed in prior studies [10, 29, 45], and three other simulated datasets from [45], [46], and [47]. Table 1 presents a summary of the datasets analyzed in this paper, showing various model conditions with varying numbers of taxa (11 to 37), ILS levels (reflected in the average topological distance between true gene trees and true species tree) and gene sequence lengths. All datasets consist of gene sequence alignments generated under a multi-stage simulation process that begins with a species tree, simulates gene trees down the species tree under the multi-species coalescent model (and so can differ topologically from the species tree), and then simulates gene sequence alignments down the gene trees.

Table 1 Properties of the simulated datasets. Level of ILS is presented in terms of the average topological distance between true gene trees and true species tree

Mammalian dataset was simulated by taking the species tree estimated by MP-EST on the biological dataset studied in [48]. This species tree had branch lengths in coalescent units, which we used to produce a set of gene trees under the coalescent model. Thus, the mammalian simulation model tree has an ILS level based on a coalescent analysis of the biological mammalian dataset, and other properties of the simulation that are set to reflect the biological sequences [48] studied. We explored the impact of varying amounts of ILS, varying numbers of genes (25 800), the impact of phylogenetic signal by varying the sequence length (250, 500, and 1000 base pair (bp)) for the markers. Three levels of ILS were simulated. The basic model species tree has branch lengths in coalescent units, and we produced other model species trees by multiplying or dividing all internal branch lengths in the model species tree by two. This rescaling varies the amount of ILS (shorter branches have more ILS), and also impacts the amount of gene tree estimation error in the estimated gene trees. The basic model condition with moderate amount of ILS is referred to as 1X and the model conditions with higher and lower amounts of ILS are denoted by 0.5X and 2X, respectively.

We used three other simulated datasets: 11-taxon dataset (generated by [46] and subsequently studied by [9, 49]), 15-taxon dataset (generated and studied by [45]), and 500-taxon dataset (generated and studied by ASTRAL-II [47]). 11-taxon dataset was simulated under a complex process to ensure substantial heterogeneity between genes and to deviate from the molecular clock [46]. We analyzed the model condition with high amount of ILS as the model condition with lower amount of ILS is very easy to analyze and even methods without any statistical guarantee produced highly accurate tree with just 25 50 gene trees [9]. 15-taxon datasets vary in sequence lengths and has high amount of ILS. The 500-taxon dataset was generated under the MSC model by introducing rate heterogeneity between genes and deviations from molecular clock (see [47] for more details).

We used two biological datasets: the 37-taxon mammalian datasets studied by [48] with 424 genes, and the Amniota dataset from [50] containing 16 species and 248 genes.

Results

Results on mammalian simulated dataset

We analyzed the performance of SuperTriplets, ASTRAL-III, MP-EST and STELAR on various model conditions with varying amounts of ILS, numbers of genes and lengths of the sequences. Figure 2a shows the average RF rates of three methods on three model conditions with varying amounts of ILS. For all methods, RF rates increase as we increase the amount of ILS. SuperTriplets produced significantly less accurate trees than the other methods, and the trees are not fully resolved. For the dataset with varying amounts of ILS, SuperTriplets produced trees with RF rates 10% 18%, whereas the error rates of STELAR, MP-EST and ASTRAL-III range from 4% 6%. STELAR and ASTRAL-III are better than MP-EST on all amounts of ILS and, in some cases, the differences are statistically significant (P<0.05). STELAR and ASTRAL are comparable in terms of tree accuracy and no statistically significant difference was noticed.

Fig. 2
figure2

Comparison of ASTRAL-III, MP-EST, STELAR and SuperTriplets on 37-taxon simulated mammalian dataset. We show the average RF rates with standard error bars over 20 replicates. a We fixed the sequence length to 500 bp and number of genes to 200, and varied the amounts of ILS. 2X model condition contains the lowest amount of ILS while 0.5X refers to the model condition with the highest amount of ILS. b We varied the amount of gene tree estimation error by varying the sequence lengths from 250 to 1000 bp, while keeping the ILS level (moderate) and the number of genes (200) fixed. c We fixed the sequence length to 500 bp and amount of ILS to 1X, and varied the numbers of genes (25, 50, 100, 200, 400, 800)

Figure 2c shows the impact of the numbers of gene trees on the accuracy of the estimated species trees. STELAR, MP-EST and ASTRAL-III improved with increasing numbers of genes (this is expected as these methods are statistically consistent and thus increasing the numbers of genes has resulted into improved topological accuracy). However, SuperTriplets failed to leverage the increasing amount of data since it did not improve as we increased the numbers of genes. Moreover, even with 800 genes, SuperTriplets produced unresolved trees. STELAR and ASTRAL-III achieved similar accuracy and performed significantly better than MP-EST and SuperTriplets across all model conditions.

Results on 11-taxon dataset

We analyzed both the estimated and true gene trees. In both cases, we varied the number of genes from (5 100). Figure 3 shows the results on 11-taxon dataset for both estimated and true gene trees. Similar to the mammalian dataset, SuperTriplets produced the least accurate trees. As expected, all methods improved with increased numbers of genes (except for a few cases for SuperTriplets), and had their best accuracies on true gene trees. The accuracies of ASTRAL-III, MP-EST and STELAR are almost similar with no statistically significant difference. On true gene trees, ASTRAL, MP-EST and STELAR recovered the true species trees across all the replicates with 50 or higher numbers of genes. However, the accuracy of SuperTriplets did not achieve any notable improvement as we increased the number of genes beyond 25. Unlike mammalian dataset, MP-EST was slightly better than ASTRAL-III and STELAR on a couple of model conditions (albeit the differences are not statistically significant with P-values greater than 0.05).

Fig. 3
figure3

Average RF rates of ASTRAL-III, MP-EST, STELAR and SuperTriplets on the 11-taxon dataset with high amount of ILS. We varied the numbers of genes from 5 to 100. We analyzed both the estimated and true gene trees. We show the average RF rates with standard error bars over 20 replicates

Results on 15-taxon dataset

15-taxon dataset evolved under a caterpillar model species tree with very high level of ILS [45]. We explored the performance on varying sequence lengths (100 bp and 1000 bp), and numbers of genes (100 and 1000). Unlike other two dataset, SuperTriplets returned well resolved trees without multi-furcations and matched the accuracy of other methods on this particular dataset. All the methods improved as the the number of genes and sequence length increased, and had their best accuracy (RF rate =0) on true gene trees (Fig. 4). No statistically significant difference was observed between STELAR and ASTRAL-III. MP-EST consistently had the highest error rates across all the model conditions on estimated gene trees. On true gene trees, all the methods returned the true species tree on almost all the replicates.

Fig. 4
figure4

Comparison of ASTRAL-III, MP-EST, STELAR and SuperTriplets on the 15-taxon dataset. We varied the numbers of estimated gene trees (100 and 1000) and sequence lengths (100 bp and 1000 bp). We also analyzed the true gene trees. We show the average RF rates with standard errors over 10 replicates

Results on 500-taxon dataset

In order to assess the performance of STELAR on relatively larger datasets, we analyzed the 500-taxon dataset [47] with 1000 genes. We analyzed 10 replicates of data. Some of the replicates had a few non-binary gene trees which we discarded in order to run STELAR. After removing the non-binary gene trees, the numbers of genes in the replicates range from 9931000. MP-EST is computationally too demanding to run on this large dataset. Figure 5 shows the error rates of STELAR, ASTRAL-III and SuperTriplets. Both STELAR and ASTRAL produced competitive results with no statistically significant difference. However, similar to the other dataset analyzed in this study, SuperTriplets was much worse than ASTRAL-III and STELAR.

Fig. 5
figure5

Comparison of ASTRAL-III, MP-EST, STELAR and SuperTriplets on 500-taxon dataset. We show the average RF rates with standard errors over 10 replicates

Results on biological datasets

Amniota dataset: We analyzed the Amniota dataset from [50] containing 248 genes across 16 amniota taxa in order to resolve the position of turtles within the amniotes, especially relative to birds and crocodiles. We re-analyzed both the amino acid (AA) and DNA gene trees, obtained from the authors, using ASTRAL, MP-EST, STELAR and SuperTriplets. The sister relationship of crocodiles and birds (forming Archosaurs), and placement of the turtles as the sister clade to Archosaurs are considered reliable and were supported by previous studies [5153].

For amino acid dataset, all the methods placed birds and crocodiles as sister groups (forming Archosaurs) and turtles as sister to Archosaurs clade (see Fig. 6). The unrooted version of STELAR estimated tree is identical to ASTRAL. MP-EST also returned a highly similar tree except for different resolutions within the turtles. SuperTriplets produced an unresolved tree with several multifurcations, leaving bird/turtle/crocodile relationship uncertain. It also did not resolve the Squamates clade.

Fig. 6
figure6

Analyses of the Amniota dataset (both DNA and AA gene trees) using STELAR, ASTRAL, MP-EST and SuperTriplets. We show the rooted version of the ASTRAL estimated trees using the outgroup (Protopterus annectens)

STELAR and MP-EST on the DNA gene trees produced the same tree which resolved bird/turtle/crocodile relationship as (birds,(turtles,crocodiles)) and thus did not form Archosaurs. However, ASTRAL on the DNA gene trees resolved the relationships as (turtles,(birds,crocodiles)). Therefore, maximizing triplet agreement and quartet agreement both recovered the (birds,(turtles,crocodiles)) relationships on AA gene trees, but differed on the DNA gene trees. Similar to the AA dataset, SuperTriplets did not resolve the bird/turtle/crocodile relationship. We estimated the quartet scores and the triplet scores of these different tree topologies with respect to the corresponding gene trees (see Tables 2 and 3). As ASTRAL estimated trees need to be interpreted as unrooted trees, we root the ASTRAL trees using the correct outgroup (Protopterus annectens) in order to compute the triplet scores. As expected, ASTRAL achieved the highest quartet agreement scores and STELAR achieved the highest triplet agreement scores.

Table 2 Triplet and quartet scores of the species trees estimated by ASTRAL, MP-EST, SuperTriplets and STELAR on the amino acid (AA) gene trees in Amniota dataset
Table 3 Triplet and quartet scores of the species trees estimated by ASTRAL, MP-EST, SuperTriplets and STELAR on the DNA gene trees in Amniota dataset

Mammalian dataset: We re-analyzed the mammalian dataset from [48] containing 447 genes across 37 mammals after removing 21 mislabeled genes (confirmed by the authors), and two other outlier genes. The placement of tree shrews (Tupaia belangeri) is of great importance here as previous studies found that the phylogenetic placement of tree shrews is unstable as they can be classified as a sister group to Primates or to Glires [48, 5456]. Our analyses using ASTRAL, MP-EST and STELAR recovered the same topology which placed tree shrews as sister to Glires (see Fig. 7). This is consistent to the tree estimated by CA-ML (reported in [48]). However, previous studies with MP-EST, using multi-locus bootstrapping [57], on the same dataset recovered a tree which placed tree shrews as the sister to Primates [10, 56]. SuperTriplets produced a highly similar tree except that it resulted into an unresolved tree with several multi-furcations. It recovered the major clades such as Primates, Glires, Cetartiodactyla, Eulipotyphla, Chiroptera. However, SuperTriplets left the relationships within Primates/Glires/Scandentia unresolved. This is in line with the tree estimated by SuperTriplets on the OrthoMaM database [58], which was reported in [22]. We investigated the quartet and triplet scores of the two alternative hypotheses (regarding the placement of tree shrew) and found that the tree that suggests (tree shrew, Glires) clade has higher quartet and triplet scores (25526915 and 3042547, respectively) compared to the one that suggests (tree shrew, Primates) relationship which yields 25511503 and 3041747 quartet and triplet scores respectively. Therefore, our analyses strongly support the placement of tree shrews as sister to Glires.

Fig. 7
figure7

Analyses of the mammalian dataset using STELAR, ASTRAL, MP-EST and SuperTriplets

Running time

We performed the experiments on small- to moderate-scale dataset (11 37 taxa) on a Linux machine with 7.5 GiB RAM and an i5 1.6 GHz processor. For the 500-taxon dataset, we used a Linux machine with 15.6 GiB RAM and an i7 3.60 GHz processor. Our experiments show that SuperTriplets is usually significantly faster than the other methods, and MP-EST is computationally more demanding than the others. For 11-, 15-, and 37-taxon datasets, SuperTriplets took a fraction of a second to run on a single replicate. STELAR and ASTRAL-III are substantially faster than MP-EST. The running time of MP-EST is more sensitive to the number of species than the number of genes, and grows rapidly as we increase the number of taxa [29]. On 37-taxon simulated dataset with 200 genes (moderate ILS), SuperTriplets took around half a second, both ASTRAL-III and STELAR took around 4 s whereas MP-EST (with 10 random starting points) took 3300 s (55 min). Running time of all these methods (except for SuperTriplets) increased with increasing amounts of ILS. On 37-taxon, 0.5X dataset (high ILS) with 200 genes, ASTRAL-III took around 10 s, STELAR took 15 s and MP-EST took 4800 s (80 min). This is because the numbers of distinct quartets and triplets in the gene trees increase as we increase the amount of discordance among the gene trees. When we increased the numbers of genes from 200 to 800 in 1X model condition, the running times of ASTRAL-III and STELAR increased from 4 s to 10 and 12 s respectively, whereas SuperTriplets took only 0.8 s. MP-EST took about 55 min and thus was not affected by the increased number of genes. On 11-taxon dataset, we ran the exact versions of ASTRAL-III and STELAR, and yet the running time is much less than MP-EST. Exact versions of ASTRAL-III and STELAR finished in less than a second on 550 genes and less than two seconds on 50100 genes, whereas the running times of MP-EST range from 13 to 20 s. SuperTriplets took only 0.1 s irrespective of the numbers of genes. For 15-taxon dataset, MP-EST took 90100 s. ASTRAL-III and STELAR finished in less than 2 s on 100 gene dataset, and for 1000 genes they took about 50 and 80 s, respectively. The running time of SuperTriplets did not vary much with the change in the number of genes and finished in only around 0.4 s.

There is no significant difference in running time between ASTRAL-III and STELAR on relatively smaller datasets. However, on the larger dataset with 500 taxa, ASTRAL-III is much faster than STELAR. ASTRAL-III took around 18 min whereas STELAR took about 2 h. Interestingly, SuperTriplets – which is much faster than ASTRAL-III and STELAR on relatively smaller datasets – is not the fastest on 500-taxon dataset, taking around 26 min. ASTRAL has evolved over the years (ASTRAL [10], ASTRAL-II [47], ASTRAL-III [43], etc.) with efficient heuristics to explore the tree space and highly optimized implementation. Although STELAR is not as fast as ASTRAL at its current form, it is sufficiently fast to analyze large dataset with hundreds of taxa. We will continue to improve STELAR to make it fast enough to run on ultra large dataset with thousands of taxa.

Discussion

Extensive experimental studies on simulated dataset with varying model conditions suggest that STELAR, which is an efficient DP-based algorithm for solving the triplet consistency problem, can produce substantially better trees than SuperTriplets (which is a triplet-based supertree method and tries to find a median supertree with respect to a triplet dissimilarity measure), and MP-EST (which tries to maximize a pseudo-likelihood measure based on the triplet distribution of the gene trees). Thus, within the scope of the experiments conducted in this study, maximum triplet agreement has been shown to be a better optimization criterion than the triplet frequency based maximum pseudo-likelihood used in MP-EST, and the triplet dissimilarity based maximum parsimony criterion used in SuperTriplets, and as good as the maximum quartet agreement criterion used in ASTRAL. Experiments on Amniota and mammalian biological dataset suggest that STELAR can produce reliable trees on real biological dataset.

Most of the existing methods have assumed that the gene tree discordance is due to only a particular biological reason, for example, ILS. Therefore, both orthologs and paralogs are not being included in many studies such as 1KP project [59], where many genes were discarded that evolved with duplications and losses. Furthermore, existing methods try to optimize a single optimization criterion (e.g., quartet consistency [10], and pseudo-likelihood [11]). Although these criteria are statistically consistent, the performance of various methods may vary on practical model conditions with limited numbers of genes and in the presence of gene tree estimation error. These factors underscore the need for modeling species tree estimation as a multi-objective optimization problem, where multiple explanations for observed data can be conjectured, and multiple optimization criteria are simultaneously optimized. Therefore, developing new scalable and statistically consistent estimators of species trees is of great importance. We believe this study advances the state-of-the-art in summary methods, and STELAR would be considered as a useful tool in phylogenomics.

STELAR can consistently estimate the species tree topology and is fast enough to handle large numbers of genes and species, making it suitable for large scale phylogenomic analyses. However, STELAR can be expanded in several directions. Future work will need to investigate how the estimation of branch lengths (in coalescent unit) can be incorporated in STELAR. STELAR, at this stage, cannot handle unrooted and non-binary gene trees. The CTC problem for unrooted gene trees can be formulated as follows. The input is a set \(\mathcal {G} = \{gt_{1}, gt_{2}, gt_{3}, \ldots, gt_{k}\}\) of k unrooted gene trees, and the output is a species tree ST that maximizes the triplet consistency score with respect to a set \(\mathcal {G'} = \{gt'_{1}, gt'_{2}, gt'_{3}, \ldots, gt'_{k}\}\) where \(gt^{\prime }_{i}\) is a rooted version of gti. That means, the idea is to find the optimal rooted versions of the unrooted gene trees so that the triplet agreement is maximized. Similar approach was previously applied [41] for estimating species trees from a set of unrooted gene trees by minimizing “extra lineages” resulting from deep coalescence. Similarly, the problem can be extended for non-binary gene trees as well. Thus, future works on formalizing the CTC problem for unrooted and non-binary gene trees and developing appropriate algorithms will be of great importance. The current version of STELAR is reasonably fast and can handle hundreds of taxa. However, more efficient algorithms and implementation are required to make STELAR scalable to much larger dataset with thousands of taxa. Furthermore, while we have studied various practical and challenging model conditions with varying numbers of genes, and varying amounts of ILS, and gene tree estimation error, future work should investigate the relative performance of STELAR and the widely used existing methods on various other challenging model conditions, for example, model conditions with missing taxa (i.e., incomplete gene trees). This study investigated relatively long sequences (250 2000 bp); subsequent studies should investigate the relative performance of STELAR and other methods on very short sequences, since recombination-free loci can be very short [60].

Conclusions

We developed STELAR – a method for estimating species trees from a set of estimated gene trees which is statistically consistent under the multi-species coalescent model. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimator of the species tree under the MSC model. STELAR is a dynamic programming solution to the CTC problem. STELAR has an exact version (which provides the exact solution and takes exponential amount of time) as well as an efficient polynomial time heuristic version. Considering the high accuracy and scalability, we believe STELAR will be considered as a potential tool for constructing species phylogenies from genes sampled throughout the whole genome.

Availability and requirements

Project name: STELARProject home page: https://islamazhar.github.io/STELAR/Operating system(s): Platform independentProgramming language: JavaOther requirements: Java 9.0 or higherLicense: Apache License 2.0Any restrictions to use by non-academics: Not applicable.

Availability of data and materials

STELAR is freely available as an open source software at https://islamazhar.github.io/STELAR/. All the dataset analyzed in this paper are from previously published studies and are publicly available.

Abbreviations

CTC:

Constrained triplet consensus

DP:

Dynamic programming

ILS:

Incomplete lineage sorting

MSC:

Multi-species coalescent

SBP:

Subtree-bipartition

References

  1. 1

    Maddison WP. Gene trees in species trees. Syst Biol. 1997; 46:523–36.

    Article  Google Scholar 

  2. 2

    Roch S, Steel M. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 2015; 100:56–62.

    Article  Google Scholar 

  3. 3

    Kubatko LS, Degnan JH. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007; 56:17.

    CAS  PubMed  Article  Google Scholar 

  4. 4

    Edwards SV, Liu L, Pearl DK. High-resolution species trees without concatenation. Proc Nat Acad Sci. 2007; 104(14):5936–41.

    CAS  PubMed  Article  Google Scholar 

  5. 5

    Leaché AD, Rannala B. The accuracy of species tree estimation under simulation: a comparisonof methods. Syst Biol. 2011; 60(2):126–37.

    PubMed  Article  Google Scholar 

  6. 6

    DeGiorgio M, Degnan JH. Fast and consistent estimation of species trees using supermatrix rooted triples. Mol Biol Evol. 2009; 27(3):552–69.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  7. 7

    Kingman JFC. The coalescent. Stoch Proc Appl. 1982; 13:235–48.

    Article  Google Scholar 

  8. 8

    Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SY, Faircloth BC, Nabholz B, Howard JT, et al.Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014; 346(6215):1320–31.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  9. 9

    Bayzid MS, Warnow T. Naive binning improves phylogenomic analyses. Bioinformatics. 2013; 29(18):2277–84.

    CAS  PubMed  Article  Google Scholar 

  10. 10

    Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014; 30(17):541–8.

    Article  CAS  Google Scholar 

  11. 11

    Liu L, Yu L, Edwards SV. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 2010; 10(1):302. https://doi.org/10.1186/1471-2148-10-302.

    PubMed  PubMed Central  Article  Google Scholar 

  12. 12

    Heled J., Drummond A. J.Bayesian inference of species trees from multilocus data. Mol Biol Evol. 2010; 27:570–80.

    CAS  PubMed  Article  Google Scholar 

  13. 13

    Liu L, Yu L. Estimating species trees from unrooted gene trees. Syst Biol. 2011; 60(5):661–7. https://doi.org/10.1093/sysbio/syr027.

    PubMed  Article  Google Scholar 

  14. 14

    Larget B, Kotha SK, Dewey CN, Ané C. BUCKy: Gene tree/species tree reconciliation with the Bayesian concordance analysis. Bioinf. 2010; 26(22):2910–1.

    CAS  Article  Google Scholar 

  15. 15

    Mossel E, Roch S. Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE Comp Biol Bioinform. 2011; 7(1):166–71.

    Article  CAS  Google Scholar 

  16. 16

    Kubatko LS, Carstens BC, Knowles LL. Stem: Species tree estimation using maximum likelihood for gene trees under coalescence. Bioinf. 2009; 25:971–3.

    CAS  Article  Google Scholar 

  17. 17

    Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 2012; 29(8):1917–32.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  18. 18

    Chifman J, Kubatko L. Quartet inference from snp data under the coalescent model. Bioinformatics. 2014; 30(23):3317–24.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  19. 19

    Liu L, Yu L, Pearl DK, Edwards SV. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009; 58(5):468–77.

    CAS  PubMed  Article  Google Scholar 

  20. 20

    Vachaspati P, Warnow T. Astrid: accurate species trees from internode distances. BMC Genomics. 2015; 16(10):3.

    Article  Google Scholar 

  21. 21

    Smith BT, Harvey MG, Faircloth BC, Glenn TC, Brumfield RT. Target capture and massively parallel sequencing of ultraconserved elements for comparative studies at shallow evolutionary time scales. Syst Biol. 2013; 63(1):83–95.

    PubMed  Article  Google Scholar 

  22. 22

    Ranwez V, Criscuolo A, Douzery EJ. Supertriplets: a triplet-based supertree approach to phylogenomics. Bioinformatics. 2010; 26(12):115–23.

    Article  CAS  Google Scholar 

  23. 23

    Romiguier J, Ranwez V, Delsuc F, Galtier N, Douzery EJ. Less is more in mammalian phylogenomics: At-rich genes minimize tree conflicts and unravel the root of placental mammals. Mol Biol Evol. 2013; 30(9):2134–44.

    CAS  PubMed  Article  Google Scholar 

  24. 24

    Cornillot E, Hadj-Kaddour K, Dassouli A, Noel B, Ranwez V, Vacherie B, Augagneur Y, Bres V, Duclos A, Randazzo S, et al.Sequencing of the smallest apicomplexan genome from the human pathogen babesia microti. Nucleic Acids Res. 2012; 40(18):9102–14.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  25. 25

    Romiguier J, Cameron SA, Woodard SH, Fischman BJ, Keller L, Praz CJ. Phylogenomics controlling for base compositional bias reveals a single origin of eusociality in corbiculate bees. Mol Biol Evol. 2015; 33(3):670–8.

    PubMed  Article  CAS  Google Scholar 

  26. 26

    Scornavacca C, Galtier N. Incomplete lineage sorting in mammalian phylogenomics. Syst Biol. 2017; 66(1):112–20.

    CAS  PubMed  Google Scholar 

  27. 27

    Mavrodiev EV, Martínez-Azorín M, Dranishnikov P, Crespo MB. At least 23 genera instead of one: The case of iris l. sl (iridaceae). PLoS One. 2014; 9(8):106459.

    Article  Google Scholar 

  28. 28

    Douzery EJ, Scornavacca C, Romiguier J, Belkhir K, Galtier N, Delsuc F, Ranwez V. Orthomam v8: a database of orthologous exons and coding sequences for comparative genomics in mammals. Mol Biol Evol. 2014; 31(7):1923–8.

    CAS  PubMed  Article  Google Scholar 

  29. 29

    Bayzid MS, Hunt T, Warnow T. Disk covering methods improve phylogenomic analyses. BMC genomics. 2014; 15(6):7.

    Article  Google Scholar 

  30. 30

    Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006; 2:762–8.

    CAS  Article  Google Scholar 

  31. 31

    Degnan JH. Anomalous unrooted gene trees. Syst Biol. 2013; 62(4):574–90.

    CAS  PubMed  Article  Google Scholar 

  32. 32

    Degnan JH, Rosenberg NA. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol. 2009; 24(6):332–340. https://doi.org/10.1016/j.tree.2009.01.009.

    PubMed  Article  Google Scholar 

  33. 33

    Allman ES, Degnan JH, Rhodes JA. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol. 2011; 62(6):833–62.

    PubMed  Article  PubMed Central  Google Scholar 

  34. 34

    Reaz R, Bayzid MS, Rahman MS. Accurate phylogenetic tree reconstruction from quartets: A heuristic approach. PLoS One. 2014; 9(8):104008.

    Article  CAS  Google Scholar 

  35. 35

    Bayzid MS, Mirarab S, Warnow T. Inferring optimal species trees under gene duplication and loss. In: Proc. of Pacific Symposium on Biocomputing (PSB). Singapore: World Scientific: 2013. p. 250–61.

    Google Scholar 

  36. 36

    Ewing G. B., Ebersberger I., Schmidt H. A., Von Haeseler A.Rooted triple consensus and anomalous gene trees. BMC Evol Biol. 2008; 8(1):118.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  37. 37

    Wu BY. Constructing the maximum consensus tree from rooted triples. J Comb Optim. 2004; 8(1):29–39.

    Article  Google Scholar 

  38. 38

    Byrka J, Guillemot S, Jansson J. New results on optimizing rooted triplets consistency. Discret Appl Math. 2010; 158(11):1136–47.

    Article  Google Scholar 

  39. 39

    Hallett MT, Lagergren J. New algorithms for the duplication-loss model. In: Proc ACM Symp Comput Biol RECOMB2000. New York: ACM Press: 2000. p. 138–146.

    Google Scholar 

  40. 40

    Than CV, Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Comp Biol. 2009; 5(9):e1000501. https://doi.org/10.1371/journal.pcbi.1000501.

    Article  CAS  Google Scholar 

  41. 41

    Yu Y., Warnow T., Nakhleh L.Algorithms for mdc-based multi-locus phylogeny inference: beyond rooted binary gene trees on single alleles. Journal of Computational Biology. 2011; 18(11):1543–1559.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  42. 42

    Bayzid MS, Warnow T. Gene tree parsimony for incomplete gene trees: addressing true biological loss. Algoritm Mol Biol. 2018; 13:1.

    Article  Google Scholar 

  43. 43

    Zhang C, Rabiee M, Sayyari E, Mirarab S. Astral-iii: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinforma. 2018; 19(6):153.

    Article  Google Scholar 

  44. 44

    Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981; 53(1-2):131–47.

    Article  Google Scholar 

  45. 45

    Mirarab S, Bayzid MS, Boussau B, Warnow T. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science. 2014; 346(6215):1250463.

    PubMed  Article  CAS  Google Scholar 

  46. 46

    Chung Y, Ané C. Comparing two Bayesian methods for gene tree/species tree reconstruction: A simulation with incomplete lineage sorting and horizontal gene transfer. Syst Biol. 2011; 60(3):261–75.

    CAS  PubMed  Article  Google Scholar 

  47. 47

    Mirarab S, Warnow T. Astral-ii: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015; 31(12):44–52.

    Article  CAS  Google Scholar 

  48. 48

    Song S, Liu L, Edwards SV, Wu S. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Nat Acad Sci. 2012; 109(37):14942–7.

    CAS  PubMed  Article  Google Scholar 

  49. 49

    Bayzid MS, Warnow T. Estimating optimal species trees from incomplete gene trees under deep coalescence. J Comput Biol. 2012; 19(6):591–605.

    CAS  PubMed  Article  Google Scholar 

  50. 50

    Chiari Y, Cahais V, Galtier N, Delsuc F. Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (archosauria). Bmc Biol. 2012; 10(1):65.

    PubMed  PubMed Central  Article  Google Scholar 

  51. 51

    Zardoya R, Meyer A. Complete mitochondrial genome suggests diapsid affinities of turtles. Proc Nat Acad Sci. 1998; 95(24):14226–31.

    CAS  PubMed  Article  Google Scholar 

  52. 52

    Hugall AF, Foster R, Lee MS. Calibration choice, rate smoothing, and the pattern of tetrapod diversification according to the long nuclear gene rag-1. Syst Biol. 2007; 56(4):543–63.

    CAS  PubMed  Article  Google Scholar 

  53. 53

    Iwabe N, Hara Y, Kumazawa Y, Shibamoto K, Saito Y, Miyata T, Katoh K. Sister group relationship of turtles to the bird-crocodilian clade revealed by nuclear dna–coded proteins. Mol Biol Evol. 2004; 22(4):810–3.

    PubMed  Article  CAS  Google Scholar 

  54. 54

    Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013; 23(2):323–30.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  55. 55

    Kumar V, Hallström BM, Janke A. Coalescent-based genome analyses resolve the early branches of the euarchontoglires. PLoS One. 2013; 8(4):60019.

    Article  CAS  Google Scholar 

  56. 56

    Mirarab S, Bayzid MS, Warnow T. Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Syst Biol. 2014; 65(3):366–80.

    PubMed  Article  Google Scholar 

  57. 57

    Seo T-K. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol Biol Evol. 2008; 25(5):960–71.

    CAS  PubMed  Article  Google Scholar 

  58. 58

    Ranwez V, Delsuc F, Ranwez S, Belkhir K, Tilak M-K, Douzery EJ. Orthomam: a database of orthologous genomic markers for placental mammal phylogenetics. BMC Evol Biol. 2007; 7(1):241.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  59. 59

    Wickett NJ, Mirarab S, Nguyen N, Warnow T, Carpenter E, Matasci N, Ayyampalayam S, Barker MS, Burleigh JG, Gitzendanner MA, et al. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc Nat Acad Sci. 2014; 111(45):4859–68.

    Article  CAS  Google Scholar 

  60. 60

    Gatesy J, Springer MS. Phylogenetic analysis at deep timescales: unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Mol Phylogenet Evol. 2014; 80:231–66.

    PubMed  Article  Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

The authors received no financial support for this research.

Author information

Affiliations

Authors

Contributions

MSB and RR conceived and designed the study; MI, KS and TD implemented STELAR; KS, TD, MI, MSB and RR proved the theoretical results; MI, KS and TD performed the experiments; MSB, KS, TD and MI wrote the first draft and all the authors took part in finalizing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Md. Shamsuzzoha Bayzid.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Islam, M., Sarker, K., Das, T. et al. STELAR: a statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency. BMC Genomics 21, 136 (2020). https://doi.org/10.1186/s12864-020-6519-y

Download citation

Keywords

  • Phylogenomics
  • Multi-species coalescent process
  • Gene tree incongruence
  • Incomplete lineage sorting