Correction to: The performance of coalescent-based species tree estimation methods under models of missing data
BMC Genomics volume 21, Article number: 133 (2020)
Correction to: BMC Genomics
After publication of [1], the authors were informed by John A. Rhodes of a counterexample to Theorem 11 of [1]. The counterexample and its consequences with respect to the theoretical properties of NJst [2] and ASTRID [3] are provided in [4] and summarized here. The authors of [1] apologize for the mistake in the proof.
The question of interest in [1] is whether several species tree estimation methods that operate by combining gene trees (e.g., ASTRAL [5], ASTRID [3], and NJst [2]) remain statistically consistent when data are missing due to random taxon deletion, under the assumption that the gene trees are generated by the multi-species coalescent (MSC) model [6] and so can differ from the true species tree due to incomplete lineage sorting (ILS). Theorem 11 addresses this issue for NJst and ASTRID with the Miid model of taxon deletion, which assumes that taxa are deleted independently and identically from the gene trees. NJst and ASTRID estimate the species tree in two steps. In the first step, each calculates the internode distance matrix (of average pairwise distances between species, computed from the gene trees), and in the second step each computes a tree from the distance matrix using either neighbor joining [7] or balanced minimum evolution (BME) with FastME [8], respectively.
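The first step can be illustrated with a minimal Python sketch. It assumes that each gene tree is given as an unrooted edge list already restricted to the taxa present for that gene, that the internode distance between two leaves is the number of internal nodes on the path between them, and that each entry is averaged over the gene trees containing both taxa. The function names and the input format are illustrative assumptions, not the NJst or ASTRID implementations.

```python
from collections import deque

def internode_distances(edges, leaves):
    """Internal nodes on the path between each ordered pair of leaves."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {}
    for src in leaves:
        # Breadth-first search gives the number of edges from src to every node.
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        for dst in leaves:
            if dst != src:
                # A path with k edges passes through k - 1 internal nodes.
                dist[(src, dst)] = seen[dst] - 1
    return dist

def average_internode_matrix(gene_trees):
    """Average each entry over the gene trees that contain both taxa."""
    totals, counts = {}, {}
    for edges, leaves in gene_trees:
        for pair, value in internode_distances(edges, leaves).items():
            totals[pair] = totals.get(pair, 0) + value
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

# Hypothetical input: one complete gene tree ab|cd and one gene missing taxon d.
tree_full = ([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")],
             ["a", "b", "c", "d"])
tree_missing_d = ([("a", "z"), ("b", "z"), ("c", "z")], ["a", "b", "c"])
avg = average_internode_matrix([tree_full, tree_missing_d])
print(avg[("a", "c")])  # averaged over both genes
print(avg[("a", "d")])  # only the first gene contains d
```

In this sketch a pair of taxa that never co-occurs in any gene tree has no entry at all; how such pairs are handled is left open here.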
Furthermore, neighbor joining and FastME are both guaranteed to return a tree T when given a matrix that is sufficiently close to an additive matrix for T (where a matrix A is additive for T if the edges of T can be assigned non-negative lengths so that for all i, j, Aij is the sum of the edge lengths in the path from i to j in T) [9]. While it is established that the internode distance matrix converges to an additive matrix for the species tree if there is no taxon deletion [10], it was not known whether it converges to an additive matrix in the presence of taxon deletion. In the attempted proof of Theorem 11, Nute et al. argued that the internode distance matrix computed for gene trees that evolve under the MSC and then have taxa deleted under the Miid model converges to an additive matrix for the species tree.
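Additivity can be tested with the standard four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal, and the smallest sum identifies the quartet tree. The sketch below is a minimal illustration, assuming a symmetric matrix with zero diagonal stored as a dict of dicts; the function names, the tolerance, and the example values are assumptions made for this illustration.

```python
from itertools import combinations

def four_point_sums(D, a, b, c, d):
    """The three pairwise sums, keyed by the quartet split each one supports."""
    return {
        ((a, b), (c, d)): D[a][b] + D[c][d],
        ((a, c), (b, d)): D[a][c] + D[b][d],
        ((a, d), (b, c)): D[a][d] + D[b][c],
    }

def supported_quartet(D, a, b, c, d):
    """Split with the smallest sum (ties are broken arbitrarily)."""
    sums = four_point_sums(D, a, b, c, d)
    return min(sums, key=sums.get)

def satisfies_four_point(D, taxa, tol=1e-9):
    """True if the two largest sums agree for every set of four taxa."""
    for quartet in combinations(taxa, 4):
        s = sorted(four_point_sums(D, *quartet).values())
        if abs(s[2] - s[1]) > tol:
            return False
    return True

def symmetric(pairs, taxa):
    """Build a symmetric dict-of-dicts matrix with zero diagonal."""
    D = {x: {y: 0.0 for y in taxa} for x in taxa}
    for (x, y), value in pairs.items():
        D[x][y] = D[y][x] = value
    return D

taxa = ["a", "b", "c", "d"]
# Path lengths in the quartet tree ab|cd with all five edge lengths equal to 1.
tree_like = symmetric({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
                       ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3}, taxa)
# The same matrix with one entry perturbed, which breaks additivity.
perturbed = symmetric({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3.5,
                       ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3}, taxa)
print(satisfies_four_point(tree_like, taxa), supported_quartet(tree_like, *taxa))
print(satisfies_four_point(perturbed, taxa))
```

The perturbed matrix still has its smallest sum on the ab|cd split, which is the sense in which a matrix can fail to be additive while remaining close to an additive matrix for the same tree.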
Were their argument correct, then both NJst and ASTRID would be statistically consistent under the combination of the MSC and Miid models, which is what Theorem 11 of [1] claims. However, Rhodes et al. [4] presented an example of a model species tree and taxon deletion probability for which the internode distance matrix does not converge, as the number of genes increases, to a matrix that is additive for the model species tree topology. Furthermore, they proved that as the number of gene trees increases, NJst and ASTRID will converge to a tree other than the true species tree. Therefore, neither NJst nor ASTRID is statistically consistent under the combination of MSC and Miid taxon deletion, and both are in fact positively misleading. Here we describe the counterexample from [4] and sketch the proof that shows that Theorem 11 is incorrect; the details of the proof that ASTRID and NJst are not statistically consistent under the MSC + Miid model are available in [4].
Consider the balanced ultrametric species tree on six taxa a, b, c, d, e, f
σ = ((a: L + 1, (b: 1, e: 1): L): E, (c: L + 1, (d: 1, f: 1): L): E),
where E and L are measured in coalescent units. Rhodes et al. [4] showed that when L = ∞, E = 0, and p ∈ (0, 1) (where p gives the probability of taxon presence under Miid), the expected internode distance matrix under the combined MSC + Miid model is additive for a tree with a topology different from σ; in particular, it will display quartet tree (ac, bd) (which is the tree with the leaves for a, c separated from the leaves for b, d by one or more edges) whereas σ displays (ab, cd).
Therefore, by continuity of the expected distances, when E > 0 is sufficiently small and L is finite but sufficiently large, the expected distance matrix will be sufficiently close to the additive matrix inducing quartet tree (ac, bd) that both neighbor joining and BME within FastME will return a tree that displays (ac, bd).
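As a small sanity check on the model tree itself, the following Python sketch encodes σ for finite E and L and confirms that it is ultrametric and displays (ab, cd) rather than (ac, bd). It does not compute expected internode distances under MSC + Miid, so it does not reproduce the counterexample. The nested-tuple encoding, the function names, and the values L = 5 and E = 0.125 are arbitrary choices made for this illustration.

```python
def sigma(L, E):
    """Species tree as nested tuples of (child, branch length); leaves are strings."""
    left = (("a", L + 1), ((("b", 1), ("e", 1)), L))
    right = (("c", L + 1), ((("d", 1), ("f", 1)), L))
    return ((left, E), (right, E))

def leaf_depths(node, depth=0):
    """Sum of branch lengths from the root down to each leaf."""
    if isinstance(node, str):
        return {node: depth}
    depths = {}
    for child, length in node:
        depths.update(leaf_depths(child, depth + length))
    return depths

def clades(node):
    """Leaf sets of all subtrees at or below node; the last entry is node's own."""
    if isinstance(node, str):
        return [frozenset([node])]
    found, here = [], frozenset()
    for child, _ in node:
        sub = clades(child)
        found.extend(sub)
        here |= sub[-1]
    found.append(here)
    return found

def displays(tree, pair1, pair2):
    """True if some edge separates the leaves in pair1 from those in pair2."""
    p1, p2 = set(pair1), set(pair2)
    return any((p1 <= c and not (p2 & c)) or (p2 <= c and not (p1 & c))
               for c in clades(tree))

tree = sigma(L=5, E=0.125)
depths = leaf_depths(tree)
print(depths)                                  # every leaf at the same depth
print(displays(tree, ("a", "b"), ("c", "d")))  # quartet displayed by sigma
print(displays(tree, ("a", "c"), ("b", "d")))  # quartet of the limiting matrix
```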
In summary, [4] provides a construction of binary model species trees with finite edge lengths (in coalescent units) on which the expected internode distance matrix will be close to an additive matrix for a tree other than the model species tree, and NJst and ASTRID will converge to a tree other than the model species tree, thus establishing that Theorem 11 in [1] is incorrect. We note that [4] did not provide counterexamples for any theorem regarding statistical consistency for ASTRAL under models of missing data, so the counterexample in [4] is applicable only to NJst and ASTRID.
References
1. Nute MG, Molloy EK, Chou J, Warnow T. The performance of coalescent-based species tree estimation methods under models of missing data. BMC Genomics. 2018;19(Suppl 5):286. https://doi.org/10.1186/s12864-018-4619-8 Special issue for selected papers from RECOMB-CG.
2. Liu L, Yu L. Estimating species trees from unrooted gene trees. Syst Biol. 2011;60(5):661–7.
3. Vachaspati P, Warnow T. ASTRID: accurate species TRees from internode distances. BMC Genomics. 2015;16(Suppl 10):S3.
4. Rhodes JA, Nute MG, Warnow T. NJst and ASTRID are not statistically consistent under a random model of missing data. arXiv:2001.07844; 2020.
5. Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014;30(17):i541–8.
6. Kingman JFC. The coalescent. Stoch Process Appl. 1982;13(3):235–48.
7. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–25.
8. Lefort V, Desper R, Gascuel O. FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Mol Biol Evol. 2015;32(10):2798–800.
9. Warnow T. Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge: Cambridge University Press; 2017.
10. Allman ES, Degnan JH, Rhodes JA. Species tree inference from gene splits by unrooted STAR methods. IEEE/ACM Trans Comput Biol Bioinform. 2018;15(1):337–42.
Cite this article
Nute, M., Chou, J., Molloy, E.K. et al. Correction to: The performance of coalescent-based species tree estimation methods under models of missing data. BMC Genomics 21, 133 (2020). https://doi.org/10.1186/s12864-020-6540-1