Species tree estimation methods
Maximum Quartet Support Species Tree Problem
ASTRAL, ASTRAL-2, and wQMC all address the same optimization problem, which we now explain. Given an input set of gene trees on a species set S and a quartet tree q on four species from S, we let denote the number of gene trees in that induce the quartet tree q. Then, the quartet support of T given G, denoted , is , where Q(T ) denotes the set of all quartet trees in T . Hence, we can define the Maximum Quartet Support Species Tree Problem (MQSST), as follows.
-
Input: a set of gene trees on a species set S.
-
Output: a tree T on the species set S maximizing , the quartet support of T given .
MQSST is NP -hard when the input set of gene trees induce only one tree for each set of four taxa in S [52], and is of unknown computational complexity when all the gene trees are complete (i.e., have all the species in S).
Weighted Quartets MaxCut
The quartet amalgamation method wQMC [40] is a greedy heuristic for a weighted version of the MQSST problem, in which the input can have weights on each quartet tree. The wQMC heuristic uses a greedy strategy to find good solutions to its optimization problems, but is not guaranteed to solve its optimization problem (weighted MQSST) exactly. To use wQMC as a summary method, we define the weight of a quartet tree q to be the quartet support of q in the input set of gene trees .
We wrote scripts (available in our supporting online material) that use a previously published code [53] to compute the weights of each quartet tree. After we calculate these weights (saving them in a file called <quartetscores>), we run wQMC version 3.1 using the following command:
./max-cut-tree qrtt=<quartetscores> weights=on otre=<speciestree>
ASTRAL and ASTRAL-2
ASTRAL [18] and its improved version, ASTRAL-2 [19], also attempt to solve the MQSST problem. Both have exact versions that provably solve the MQSST problem but run in exponential time, and faster versions that constrain the search space (using the input set of gene trees), and then provably solve the constrained problem exactly. ASTRAL and ASTRAL-2 differ in how they constrain the search space (ASTRAL-2 searches a larger part of tree space than ASTRAL) and how they are implemented (ASTRAL-2 is faster). Here we focus on ASTRAL-2, since it is faster and more accurate than ASTRAL.
Given the input set of gene trees, ASTRAL-2 defines a set X of bipartitions on the taxon set S; when all the gene trees are complete (i.e., have no missing taxa), then X will contain all the bipartitions from the input gene trees as well as potentially other bipartitions. ASTRAL-2 runs in O(nk|X|2) time, where n is the number of species and k is the number of genes, and thus can be fast whenever |X| is not too large. While |X| is not theoretically bounded by a polynomial in n and k, for many datasets |X| is not very large, so that ASTRAL-2 is able to complete analyses within 24 hours on 1000 species and 1000 genes [19].
ASTRAL-2 finds a globally optimal solution to the constrained optimization problem where we restrict the output species tree to draw its bipartitions from X. ASTRAL and ASTRAL-2, run in their default versions (which use the constrained search), are both statistically consistent under the multispecies coalescent model when all the gene trees are complete (i.e., this restriction to the set X of bipartitions does not change their statistical guarantees) [19].
We now provide a proof for Theorem 3, establishing that ASTRAL and ASTRAL-2, run in default mode, are statistically consistent under the MSC model and also under the bounded HGT models.
Proof for Theorem 3. As proved in [18, 19], ASTRAL and ASTRAL-2 are guaranteed to find globally optimal solutions to the constrained MQSST problem. The default settings for the constraint set X of bipartitions allowed in the output species tree always includes all bipartitions from the input gene trees; hence, as the number of genes increases, with probability converging to 1, every bipartition from the species tree will be in the set X. Therefore, with probability converging to 1, the true species tree will be a feasible solution (i.e., within the constrained search space) as the number of loci and number of sites per locus both increase (as established in [18, 19]). Recall that the quartet support score of a tree T is the total, over all quartet trees in T, of the number of gene trees that contain that quartet tree. As shown in [36], under the bounded HGT models in [36], the most probable quartet tree on any four taxon set A is topologically identical to the quartet tree on X induced by the true species tree. Hence, with probability converging to 1, under these bounded HGT models, the most frequent quartet tree on any set A of four leaves will be the true species tree on A. Given any set of gene trees in which for all four-leaf sets A the most frequent quartet tree on A is the true species tree on A, the quartet support score of the true species tree T* will be the maximum possible quartet support score (since any other species tree T cannot have larger quartet support for any quartet tree). Furthermore, given any set of gene trees in which the most frequent quartet tree is unique for all four taxa and equal to the species tree on the four taxa, the true species tree T* will have the unique maximum quartet support score. Hence, as the number of loci and number of sites per locus both increase, the tree returned by an exact solution to the constrained MQSST problem, using default settings for X, will converge in probability to the true species tree T*. Therefore, ASTRAL and ASTRAL-2 are statistically consistent under the bounded HGT models of [36].
We ran ASTRAL-2 version 4.7.6 on the simulated data using the following command:
java -jar astral.4.7.6.jar -i <genetrees> -o <speciestree>
where <genetrees> is a file containing the gene trees in newick format, and
<speciestree> is the output.
For the biological data, we used ASTRAL-2 with multi-locus bootstrapping (MLBS), using the following commands:
java -jar astral.4.7.6.jar -i < bootstrap replicates >
-o <species replicate>
where <bootstrap replicates> is the collection of 1128 gene trees generated by taking the nth line of the gene tree file n = {1, ... , 100}, and <species replicate> is the nth bootstrap replicate species tree T
n
. To calculate the final species tree T with bootstrap support values, we computed the majority consensus tree using Dendropy version 3.12.2 [54].
NJst
NJst is a summary method that has two steps. In the first step, it computes a distance matrix on the species set, where D[x, y] is the average leaf-to-leaf topological distance between x and y among all the gene trees. In the second step, it runs neighbor joining [55], a popular distance-based phylogeny estimation method. NJst is statistically consistent under the MSC model because the distance matrix it computes converges in probability to an additive matrix defining the true species tree, and neighbor joining will return the true species tree once the computed distance matrix is sufficiently close to the additive matrix for the species tree; see [17] for this proof.
To run NJst, we used phybase version 1.4 [56] and custom scripts, available in our supplementary material.
Gene tree estimation
To compute gene trees, we ran FastTree-2 version 2.1.4, using the following command:
fasttree -nt -gtr -quiet -nopr -gamma -n 1000 [input] > [output]
where [input] is a file that includes all the alignments of all 1000 genes and [output] will be one file with all 1000 estimated gene trees.
CA-ML
To perform the concatenated analyses under maximum likelihood, we ran FastTree-2 version 2.1.4, with the following command:
fasttree -nt -gtr -nopr [input] > [output]
Computing Error Rates
The coalescent-based methods ASTRAL-2, wQMC, and NJst used in this study all return binary species trees. We also verified that all trees returned in our CA-ML analysis were binary, and all simulated data used in this study contained only binary model species trees. The Robinson-Foulds (RF) distance [44] between two trees T1 and T2 on the same set of n taxa measures the number of bipartitions that appear in only one of T1 or T2. Therefore, if T1 and T2 are identical, the RF distance is 0, and the maximum RF distance between T1 and T2 is 2n−6. The RF distance can be converted to an error rate by dividing by 2n − 6. When comparing only binary trees, false negative rates, false positive rates, and normalized Robinson-Foulds distances are all equivalent. Therefore, we computed missing branch rates to establish error rates, but we report RF rates. Error rates were computed by finding the missing branch rate using custom scripts available in our supporting online materials.
Measuring Quartet Support Scores of ASTRAL-2 and wQMC
The command used to measure the quartet support score was
java -jar astral.4.7.6.jar -q <speciestreefile> -I <genetreesfile>
Data
HGT+ILS Simulated Data The simulated dataset was simulated using SimPhy [57] version 1.0 (downloaded January 20, 2015). There are 6 data sets containing 50 replicates apiece: each replicate has its own 51-taxon species tree. For every model species tree, one taxon is an outgroup, and so is actually a 50-taxon rooted species tree. These model trees were simulated under a Yule process, with birth rates set to 0.000001 (per generation) and the maximum tree length set to 2 million generations.
Then, on each species tree, 1000 locus trees are simulated, where each can differ from the species tree due to HGT events, and we used HGT rates (1)-(6) given by 0, 2 × 10−9, 5 × 10−9, 2 × 10−8, 2 × 10−7, and 5 × 10−7. These values correspond to expected numbers of HGT events per gene of 0, 0.08, 0.2, 0.8, 8, and 20. Thus, HGT rate (1) is no HGT events, HGT rate (2) is 0.08 HGT events per gene, up to HGT rate (6) of 20 HGT events per gene. Note that in our simulations, for each HGT event, the probability of a branch being chosen as the receptor of the transfer is proportional to its distance from the donor.
Once locus trees are simulated, a gene tree is simulated for each locus tree according to the MSC model, with population size parameter set to 200,000. Thus, at the end, we have 1000 true genes that differ from the species tree due to both ILS and also potentially HGT (when the HGT rate is positive).
The SimPhy command used to generate a model replicate in the data sets is simphy -rs 50 -rl U:1000,1000 -rg 1 -st U:2000000,2000000 -si U:1,1 -sl U:50,50 -sb U:0.000001,0.000001 -cp U:200000,2000000 -hs L:1.5,1 -hl L:1.2,1 -hg l:1.4,1 -cu E:10000000 -so U:1,1 -od 1 -or 0 -v 3 -cs 293745 -o model.50.2000000.0.000001.<transferrate> -lt U:<transferrate>,<transferrate> -lk 1
On each simulated true gene tree, we used INDELible [58] v. 1.03 to simulate sequence alignments according to the GTR+Gamma model, with model parameters estimated from three different real datasets (these parameters are identical to those used in [19]). This simulation produces GTR parameters that vary from one gene to another, where the parameters are drawn for each gene from a distribution at random. See [19] for details about the simulation process. The alignment length is set to 1000 bp for all genes. After simulating gene alignments, we used FastTree-2 [41] to estimate gene trees under the GTR model. Thus for each replicate, we have both true and estimated gene trees.
For HGT rate (1) (where all the discordance is due to ILS), the average RF [44] distance between true gene trees and the species tree is 30.4%. Therefore, the amount of ILS in these data sets is moderately high.
Cyanobacterial Data The cyanobacterial data set has 1128 genes on 11 taxa, and was first analyzed in [42], which suggested that the 11 genome sequences may have acquired between 9.5% and 16.6% of their genes through HGT. We obtained 100 bootstrap replicate gene trees for each of the 1128 genes from the first author of [43], and computed an ASTRAL-2 tree on these data using multi-locus bootstrapping.