Disk covering methods improve phylogenomic analyses

Motivation: With the rapid growth in the number of newly sequenced genomes, species tree inference from multiple genes has become a basic bioinformatics task in comparative and evolutionary biology. However, accurate species tree estimation is difficult in the presence of gene tree discordance, which is often due to incomplete lineage sorting (ILS), modelled by the multi-species coalescent. Several highly accurate coalescent-based species tree estimation methods have been developed over the last decade, including MP-EST. However, the running time for MP-EST increases rapidly as the number of species grows.

Results: We present divide-and-conquer techniques that improve the scalability of MP-EST so that it can run efficiently on large datasets. Surprisingly, these techniques also improve the accuracy of species trees estimated by MP-EST, as our study shows on a collection of simulated and biological datasets.


Additional figures and tables
Additional figures and tables omitted from the main paper due to space constraints are presented here.

Figure S3: Impact of different starting trees on DACTAL-based boosting with MP-EST. We show the average FN rates of the best trees, with respect to quartet support, after 5 iterations of DACTAL-based boosting using MP-EST, with starting trees estimated by MRP and by MP-EST, on the simulated mammalian datasets with varying amounts of ILS (200 genes and 500bp). We ran MP-EST on the subsets produced by DACTAL-based decomposition with maximum subset size 15 using different starting trees. MP-EST(MRP,dactal,15,q) refers to the results obtained by using the MRP-estimated starting tree, while MP-EST(MP-EST,dactal,15,q) refers to the results obtained by using the starting tree estimated by MP-EST. We also show the FN rates of concatenation and of the starting trees estimated by MP-EST and MRP.

Figure S4: Impact of different starting trees on DACTAL-based boosting with MP-EST. We show the average FN rates of the best trees, with respect to quartet support, after 5 iterations of DACTAL-based boosting using MP-EST, with starting trees estimated by MRP and by MP-EST, on the simulated mammalian datasets with varying numbers of genes (500bp, moderate amount of ILS). We ran MP-EST on the subsets produced by DACTAL-based decomposition with maximum subset size 15 using different starting trees. MP-EST(MRP,dactal,15,q) refers to the results obtained by using the MRP-estimated starting tree, while MP-EST(MP-EST,dactal,15,q) refers to the results obtained by using the starting tree estimated by MP-EST. We also show the FN rates of concatenation and of the starting trees estimated by MP-EST and MRP.

Figure S5: Impact of how the final tree is selected (using quartet support or pseudo-likelihood) in boosted versions of MP-EST. We show average FN rates of MP-EST (with and without boosting) on the simulated mammalian datasets with varying numbers of gene trees, using two different ways of selecting the final tree: quartet support (q) or pseudo-likelihood (l). We fixed the amount of ILS to the moderate level (1X) and the sequence length to 500bp, and varied the number of genes from 100 to 800. We show the results for SSG- and DACTAL-based decompositions with maximum subset size 15.

Figure S6: Impact of how the final tree is selected (using quartet support or pseudo-likelihood) in boosted versions of MP-EST. We show average FN rates of MP-EST (with and without boosting) on the simulated mammalian datasets with varying sequence lengths, using two different ways of selecting the final tree: quartet support (q) or pseudo-likelihood (l). We fixed the amount of ILS to the moderate level (1X) and the number of genes to 200, and varied the sequence length from 250bp to 1000bp. We show the results for SSG- and DACTAL-based decompositions with maximum subset size 15.

Protocol for DACTAL-boosting
Here we describe the protocol for DACTAL-based boosting for MP-EST. The necessary scripts and software for this protocol are available at: http://www.cs.utexas.edu/users/phylo/software/dcm-protocol/

The input to DACTAL-boosting is the set of rooted gene trees T = {t_1, t_2, ..., t_k} on species set S. The user must provide values for the following parameters:
• I, the number of iterations (default is I = 5)
• p, the padding size (default is p = 4)
• ms, the maximum subset size (default is ms = 15)

Step 1: Compute starting tree.
The first step is to compute a starting tree. The user can select any starting tree they prefer, including one that is based on a previous estimate of the species tree for the dataset. In the paper we used two different starting trees: MRP (matrix representation with parsimony) and MP-EST.

Step 2a: [...] p, run the prd_decomp.py script by setting padding size = p/4. We have provided an example output file ("dactal_subsets") of the DACTAL decomposition in the "scripts" directory.
The output of this command (dactal_subsets) contains x subsets of taxa (one subset per line). Create x files (subset_1, subset_2, subset_3, ..., subset_x) containing these x subsets using extract_subsets.pl. This script also creates the species lists for each of the subsets, which are required to run MP-EST on the restricted gene trees. The command is as follows.
    perl extract_subsets.pl -i dactal_subsets

Step 2b, part 1: Next we compute T_i, the set of gene trees T restricted to the set of leaves in subset i, for all i = 1, 2, ..., x.
Let "inFile" be a file containing the set T of gene trees. To compute T_i, use the script induced_subtree_from_taxa.py with the following command:

    python induced_subtree_from_taxa.py inFile subset_i

Run for every subset, this script creates the following files: inFile.subset_1, inFile.subset_2, ..., inFile.subset_x

Step 2b, part 2: For each i = 1, 2, ..., x, we estimate a species tree speciestree_i on subset i by running MP-EST on the set T_i (inFile.subset_i) of rooted gene trees.
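To make the restriction in Step 2b, part 1 concrete, the following minimal Python sketch restricts one gene tree to a subset of taxa. It is an illustration only and is not the induced subtree script from the protocol: all function names are hypothetical, and it assumes topology-only Newick strings (no branch lengths, support values, or quoted labels).

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ";"
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a leaf name

    return node()


def restrict(tree, keep):
    """Return the subtree induced by the leaves in `keep`, or None if none remain."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [c for c in (restrict(child, keep) for child in tree) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # suppress a node left with a single child
    return tuple(kept)


def to_newick(tree):
    """Write nested tuples back out as a Newick topology (without the ";")."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def restrict_newick(line, keep):
    """Restrict one Newick line to the taxa in `keep`; None if no taxon survives."""
    induced = restrict(parse_newick(line), set(keep))
    return None if induced is None else to_newick(induced) + ";"


# Hypothetical gene tree and subset, for illustration only.
print(restrict_newick("(((A,B),C),(D,E));", ["A", "C", "D"]))  # ((A,C),D);
```

Applying such a function to every line of inFile, once per subset, would yield one restricted gene-tree file per subset, which is the input MP-EST needs in Step 2b, part 2.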
Step 2c: Combine all the trees (speciestree_1, speciestree_2, ..., speciestree_x) returned in Step 2b, part 2 into a single file called "all_sp_trees". We use SuperFine+MRL to compute the supertree on the full set of taxa from the set of species trees on the subsets of taxa. Instructions for installing SuperFine can be found at: http://www.cs.utexas.edu/~phylo/software/superfine/submission

We use the following command:

    python runReup.py -r rml -i all_sp_trees -o new_sptree

Save the output of this command as new_sptree. Repeat Step 2 for the chosen number of iterations (3 to 5 iterations should be enough). The new_sptree computed in iteration i is used as the guide tree (starting tree) in Step 2a of iteration i + 1.
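The control flow of Steps 2a to 2c can be summarized as the loop below. This is a sketch under stated assumptions rather than the protocol's own driver: the three callables are hypothetical placeholders for the external tools (the decomposition script, MP-EST on each restricted gene-tree set, and SuperFine+MRL), and trees are passed around as opaque values.

```python
def dactal_boost(starting_tree, decompose, estimate_subset_tree, merge, iterations=5):
    """Run the Step 2 loop and return the tree produced in each iteration.

    decompose, estimate_subset_tree and merge are hypothetical callables that
    stand in for Steps 2a, 2b and 2c respectively.
    """
    guide_tree = starting_tree
    candidates = []
    for _ in range(iterations):
        subsets = decompose(guide_tree)                            # Step 2a
        subset_trees = [estimate_subset_tree(s) for s in subsets]  # Step 2b
        guide_tree = merge(subset_trees)                           # Step 2c
        candidates.append(guide_tree)  # keep every iteration's tree for Step 3
    return candidates


# Toy stand-ins so the sketch runs without any external tools.
trees = dactal_boost(
    "t",
    decompose=lambda guide: [guide],
    estimate_subset_tree=lambda subset: subset,
    merge=lambda subset_trees: subset_trees[0] + "'",
    iterations=3,
)
print(trees)  # ["t'", "t''", "t'''"]
```

The returned list holds one candidate tree per iteration, which is the collection that the final selection step scores.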
Step 3: Selecting one tree

Take the list of trees produced in Step 2c across the iterations. Score each tree with respect to quartet support, using the script score_tree_quartet_support.pl as follows. (This script requires a 64-bit machine.)

    perl score_tree_quartet_support.pl -g inFile -s candidate_species_tree -o score

Here, inFile contains the set T of input gene trees in Newick format (one tree per line), and candidate_species_tree is the species tree you want to score. The score (total number of satisfied quartets) will be saved in a file named "score". Select the tree with the highest score.
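As an illustration of the scoring idea in Step 3, the sketch below counts, for each candidate species tree, how many resolved four-taxon subtrees (quartets) of the gene trees it also displays, and then picks the candidate with the largest count. This is one plausible reading of "total number of satisfied quartets" and is not the provided Perl script, which may treat missing taxa or unresolved nodes differently. The function names, the candidate labels, and the toy trees are hypothetical, and the code assumes topology-only Newick strings.

```python
from itertools import combinations


def clusters(newick):
    """Return the leaf set below every internal node (the root's set comes last)."""
    text = "".join(newick.split()).rstrip(";")
    open_sets, found, name = [], [], ""
    for ch in text:
        if ch in "(),":
            if name:  # a leaf name just ended
                open_sets[-1].add(name)
                name = ""
            if ch == "(":
                open_sets.append(set())
            elif ch == ")":
                done = open_sets.pop()
                found.append(frozenset(done))
                if open_sets:
                    open_sets[-1] |= done  # leaves also belong to the parent
        else:
            name += ch
    return found


def split_of(tree_clusters, four):
    """Return the quartet split of `four` taxa as two pairs, or None if unresolved."""
    if not tree_clusters or not four <= tree_clusters[-1]:
        return None  # at least one of the four taxa is missing from the tree
    for cluster in tree_clusters:
        pair = cluster & four
        if len(pair) == 2:  # an edge separates this pair from the other two taxa
            return frozenset({pair, four - pair})
    return None


def quartet_support(species_newick, gene_newicks):
    """Count gene-tree quartets that the species tree resolves the same way."""
    species = clusters(species_newick)
    score = 0
    for gene_newick in gene_newicks:
        gene = clusters(gene_newick)
        if not gene:
            continue
        for four in combinations(sorted(gene[-1]), 4):
            taxa = frozenset(four)
            gene_split = split_of(gene, taxa)
            if gene_split is not None and gene_split == split_of(species, taxa):
                score += 1
    return score


# Toy gene trees and candidate species trees, for illustration only.
genes = ["((A,B),(C,D));", "((A,C),(B,D));", "(((A,B),C),D);"]
candidates = {"iteration_1": "(((A,B),C),D);", "iteration_2": "((A,C),(B,D));"}
scores = {label: quartet_support(tree, genes) for label, tree in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)  # {'iteration_1': 2, 'iteration_2': 1} iteration_1
```

Enumerating every four-taxon subset is only practical for small trees; the sketch is meant to clarify the criterion, not to replace the provided script on real datasets.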