Scaling statistical multiple sequence alignment to large datasets

Background
Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today.

Results
We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences.

Conclusions
Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.

Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-3101-8) contains supplementary material, which is available to authorized users.

Additional File 1

--temporaries=<temps_folder> --max-subproblem-size=100 -k --keepalignmenttemps --iter-limit=1

For PASTA+BAli-Phy, there are two commands. The first starts PASTA, performs the decomposition, and then exits after outputting a list of files that need to be aligned and saving its state. The second resumes from the previously saved state, on the assumption that the alignments have been completed and are in their target locations on disk, and proceeds with the remainder of the algorithm. The commands are:

BAli-Phy
BAli-Phy was run on Blue Waters by setting up a job with a wall-clock time limit of 24 hours and submitting it with the following shell script, which starts a background BAli-Phy process once for every processor found by the nproc Linux command (on Blue Waters, it is 32). In the following, the variable ${id} is assumed to be a name that identifies the particular alignment being run.
for iteration in $(seq 1 $(nproc))
do
    bali-phy <subset-fasta> --

After the job has completed, the following is run, which removes the first 10 samples from the output for each processor and then concatenates the rest into a single file. The posterior-decoding alignment is then computed in the final line with the alignment-max command.
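The burn-in and pooling step described above can be illustrated with a minimal sketch. It is written in Python rather than the shell used on Blue Waters, and it assumes the samples from each processor have already been parsed into a list; how BAli-Phy delimits samples in its output files is not covered here. The function name `pool_after_burn_in` and the toy sample labels are hypothetical; only the burn-in of 10 samples comes from the text.

```python
from typing import List, Sequence

BURN_IN = 10  # samples discarded from each run, as stated in the text


def pool_after_burn_in(
    chains: Sequence[Sequence[str]], burn_in: int = BURN_IN
) -> List[str]:
    """Drop the first `burn_in` samples of every chain and concatenate the rest.

    Each chain is the ordered list of samples produced by one background
    process; each sample is treated as an opaque string (hypothetical format).
    """
    pooled: List[str] = []
    for chain in chains:
        pooled.extend(chain[burn_in:])
    return pooled


if __name__ == "__main__":
    # Two toy chains of 12 samples each; the strings stand in for sampled alignments.
    chains = [[f"chain{c}_sample{i}" for i in range(12)] for c in range(2)]
    pooled = pool_after_burn_in(chains)
    print(len(pooled))  # 2 chains x (12 - 10) retained samples each
```

The pooled list plays the role of the single concatenated file from which the posterior-decoding alignment is then computed. A chain with fewer than 10 samples contributes nothing to the pool.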

UPP
Other than specifying the backbone alignment and tree, UPP was run with default settings. The command is: In this command, <raw_query_seqs> is a FASTA file with the unaligned sequences, excluding the sequences included in the backbone alignment. <tree> refers to a specified phylogenetic tree on the backbone alignment, and <backbone_alignment> is the corresponding alignment itself. Both were the default outputs of PASTA for that particular backbone. By default, PASTA uses FastTree-2 to find its final tree estimate using the parameters -nt -gtr -gamma -fastest, and this was the tree used in the command above.

Additional Scatter Plots for 1000-sequence Data
In this section, we present all three pairwise comparisons of all metrics on the 1000-sequence data. The three methods considered are, briefly:
a. PASTA(Default): PASTA under default settings, that is, 3 iterations, all with MAFFT L-INS-i as the subset aligner and decomposition to a maximum size of 200 sequences.
b. PASTA+BAli-Phy: PASTA for 1 iteration, with the BAli-Phy posterior decoding alignment on subsets and decomposition to a maximum size of 100 sequences, using the PASTA(Default) output phylogeny as the initial tree.
c. PASTA+MAFFT-L: PASTA for 1 iteration, with MAFFT L-INS-i as the subset aligner and decomposition to a maximum size of 100 sequences, using the PASTA(Default) output phylogeny as the initial tree.
For methods b and c, PASTA is run for 1 iteration with the initial tree taken from the output of PASTA(Default), which is equivalent to running PASTA for 4 iterations with slightly different settings on the final cycle.
Figure 1 shows full results (for all five criteria) corresponding to Figure 1 from the main paper, which is partially duplicated here. In all subfigures in this section, each point represents one replicate, and a position above or below the 45-degree line favors the same method in every panel on the page. This requires inverting the axes for the bottom panels relative to the upper three, but it keeps the interpretation consistent.
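The reason for inverting the axes can be sketched in a few lines of Python. The sketch assumes that the upper panels plot accuracy scores, where higher is better, and that the bottom panels plot error-type metrics, where lower is better. The function name `count_favored` and the toy numbers are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def count_favored(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    higher_is_better: bool,
) -> Dict[str, int]:
    """Count replicates on which method a or method b is favored.

    Each index is one replicate, i.e. one point in a scatter plot of
    method a against method b. For an error-type metric the comparison
    is flipped, which is the numeric counterpart of inverting the axes.
    """
    counts = {"a": 0, "b": 0, "tie": 0}
    for a, b in zip(scores_a, scores_b):
        if a == b:
            counts["tie"] += 1  # the point lies on the 45-degree line
        elif (a > b) == higher_is_better:
            counts["a"] += 1
        else:
            counts["b"] += 1
    return counts


if __name__ == "__main__":
    # Toy accuracy scores (higher is better) for three replicates.
    print(count_favored([0.9, 0.8, 0.5], [0.7, 0.8, 0.6], higher_is_better=True))
    # Toy error rates (lower is better) for three replicates.
    print(count_favored([0.1, 0.2, 0.3], [0.2, 0.3, 0.3], higher_is_better=False))
```

Flipping the comparison for error metrics means that the same side of the diagonal always corresponds to the same method, whichever kind of metric a panel shows.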
Note that the figures for SP-score and Modeller score are nearly identical. Delta-FN results using FastTree-2 and RAxML differ somewhat in magnitude, but not in relative performance. The remaining figures are the equivalents of Figure 1 in the discussion section of the paper, except that each compares a different pair of methods. The second compares PASTA+MAFFT-L to default PASTA. The fourth iteration with MAFFT L-INS-i improves the TC score but does not improve either precision or recall. The comparison with respect to precision and recall shows that PASTA+MAFFT-L is better than default PASTA on the Indelible datasets, about the same on the RoseDNA data, and less accurate than default PASTA on the RNAsim data.
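For readers unfamiliar with the alignment criteria named above, the following Python sketch shows one common way to compute them from a reference and an estimated alignment. The definitions used are assumptions, since this section does not restate them: SP-score is taken as the fraction of reference homologies recovered (recall), Modeller score as the fraction of estimated homologies that are correct (precision), and TC score as the fraction of reference columns with at least two residues that are reproduced exactly. The function names and the toy alignments are hypothetical, and this is not the scoring tool used in the study.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Residue = Tuple[str, int]  # (sequence name, index in the ungapped sequence)


def columns(alignment: Dict[str, str], gap: str = "-") -> List[FrozenSet[Residue]]:
    """Turn a gapped alignment into a list of columns of residue identifiers."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    cols: List[FrozenSet[Residue]] = []
    for pos in range(length):
        col = set()
        for name in names:
            if alignment[name][pos] != gap:
                col.add((name, next_index[name]))
                next_index[name] += 1
        cols.append(frozenset(col))
    return cols


def homologies(cols: List[FrozenSet[Residue]]) -> Set[Tuple[Residue, Residue]]:
    """All pairs of residues that share a column."""
    pairs: Set[Tuple[Residue, Residue]] = set()
    for col in cols:
        pairs.update(combinations(sorted(col), 2))
    return pairs


def alignment_scores(reference: Dict[str, str], estimated: Dict[str, str]) -> Dict[str, float]:
    """SP-score, Modeller score, and TC score under the assumed definitions."""
    ref_cols, est_cols = columns(reference), columns(estimated)
    ref_pairs, est_pairs = homologies(ref_cols), homologies(est_cols)
    shared = len(ref_pairs & est_pairs)
    ref_multi = [c for c in ref_cols if len(c) >= 2]
    est_set = set(est_cols)
    recovered = sum(1 for c in ref_multi if c in est_set)
    return {
        "sp_score": shared / len(ref_pairs) if ref_pairs else 0.0,
        "modeller_score": shared / len(est_pairs) if est_pairs else 0.0,
        "tc_score": recovered / len(ref_multi) if ref_multi else 0.0,
    }


if __name__ == "__main__":
    reference = {"s1": "AC-G", "s2": "A-TG"}
    estimated = {"s1": "ACG", "s2": "ATG"}
    print(alignment_scores(reference, estimated))
```

In the toy example the estimated alignment recovers both reference homologies but adds one that is absent from the reference, so recall is perfect while precision is two thirds.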
The final figure compares PASTA+BAli-Phy to PASTA+MAFFT-L directly. Consistent with the previous figure, PASTA+BAli-Phy appears to be much better than PASTA+MAFFT-L, in much the same way that it was better than PASTA run under default settings.
In the figures, PASTA+BAli-Phy refers to a single iteration of PASTA run with BAli-Phy as the subset aligner on subset size 100, using the tree from PASTA as the starting tree. PASTA+MAFFT-L is analogous, with MAFFT in place of BAli-Phy.