The fastest program was lastal
The highest sensitivity options offered by diamond, very-sensitive and ultra-sensitive, were introduced with version 2.0.0 of the program, which was released while this report was under review. Thus, this might be the first article showing results using them. As mentioned in the “Results” section, the fast mode, which is the default, was the closest in runtime to lastal. The other options took increasingly longer to run. However, with bacterial proteomes the very-sensitive mode ran in a time between that taken by the fast and sensitive modes (Fig. 1, left). The ultra-sensitive mode was the slowest to run, breaking the “staircase-step” length by a large gap (Fig. 1, overall).
With mmseqs, the 5.7 option saved the least time in most cases, except in S. cerevisiae, where diamond with the ultra-sensitive setting was slower (Fig. 1, top-right).
Note that we ran mmseqs with the “easy-search” function. This function produces any output format desired without much user intervention. The easy-search function accepts the target either in plain fasta format, or as a formatted database, but the query has to remain in plain fasta format. Another way to produce pairwise alignments with mmseqs would use the “search” function instead of easy-search. The search function requires databases built for both, query and target. The results of the search function is also in database format. This database can then be used to extract results into other output formats as necessary.
Also note that the mmseqs software can also precompute indexes for its databases. We decided not to build indexes because they take very long to be built and use too much space. For example, the database for the largest bacterial proteome (12,103 annotated proteins) used 5.7M of space, which increased to 898M when building the index. Runtimes might vary if the user preferred to build a database index and use the search function instead of the easy-search one.
Finally, note that mmseqs has a “rbh” function, with a future version offering an “easy-rbh” function, which should take care of producing a table without much user intervention (Martin Steinegger, personal communication). However, we decided not to use the “rbh” function because we preferred to keep control of the parameters used to produce RBH.
The best compromise for obtaining reciprocal best hits was diamond with the very-sensitive option
As mentioned under results, both diamond and mmseqs were run with different sensitivity options, which, as expected, resulted in different proportions of RBH found (Fig. 2, Supplementary Figures S1–S3). At the lower end of genomic similarity, the differences in results among the tested sensitivity options became more apparent. The results discussed below refer to those obtained with the E. coli reference genome (Fig. 2). However, the results with other reference genomes showed similar tendencies (Supplementary Figures S1–S3).
In the case of diamond, the lower proportions found using the fast setting, the program’s default, was apparent even when comparing very similar proteomes (Fig. 2, top-left). The rest of the options behaved noticeably better, which suggests that diamond with the sensitive mode would already be a good substitute for blastp. The increase in RBH between the sensitive and the more-sensitive options was small, with a somewhat larger gap between the more-sensitive and the very-sensitive modes and, finally, another slight increase from the very-sensitive to the ultra-sensitive mode. These tendencies are more obvious at the lowest GSS, where the proportion of RBH found by diamond with the very-sensitive and ultra-sensitive modes were above 0.96 (Fig. 2, top-left).
The UpSet plots showed that the sensitive to ultra-sensitive settings had the highest RBH in common with blastp for a total of 87.6% (66.9% + 20.7%) (Fig. 2, bottom-left), with 20.7% representing the difference in results compared to the fast option. Thus, diamond with the fast option would perform very poorly compared to blastp. It also appears that the ultra-sensitive mode had very little advantage over the very-sensitive option, considering the much longer time it took to run (Fig. 1). This setting took a median of 7.4% of the blastp time to run with bacterial proteomes (Fig. 1, left) and 5.2% with eukaryotic ones (Fig. 1, right). These results are the reason why we selected this option to represent diamond in the overall software comparison.
The top options tested for mmseqs, 4 and 5.7, shared the most results with blastp (Fig. 2), with The 5.7 option producing the best results, with 9.5% more RBH shared with blastp than the other options. We thus chose the 5.7 setting, which is the program’s default, for comparisons against results obtained with the other fast programs.
At the sensitivity levels selected above, both diamond and mmseqs detected a higher proportion of RBH than lastal (Figs. 3 and 4). This was true even at the lowest GSS values, meaning that even in the worst cases, neither diamond, nor mmseqs, would miss more than 10% of the RBH produced by blastp. The diamond results were the best in this regard.
With bacterial proteomes, close to 70% of all RBH were detected by all programs (Fig. 3, bottom). The second most important intersections, for both E. coli and B. subtilis, shows that diamond and mmseqs shared the majority of RBH with blastp (see “Results” section). Unlike our previous analysis [17], which showed evidently higher percentages of RBH detected solely by blastp, the proportion of RBHs exclusive to each program were somewhat similar. These exclusive RBH seem to represent differences in sensitivity, which might correspond to a mixture of differentially detected true and false positives.
In contrast to what we observed in bacteria, both diamond and mmseqs produced a higher proportion of RBH than blastp in eukaryotes (Fig. 4, top). The proportion of RBH found by mmseqs was the highest. Since these proportions are above the RBH found by blastp, it is difficult to decide if these results are an improvement or a problem. The error rate estimates did not help deciding (see section below and Fig. 5, right).
The results of all programs shared a lower proportion of RBH in the eukaryotic results than in the prokaryotic ones, with the intersection of all programs covering close to 62% of the RBH detected (Fig. 4, bottom).
Error rate estimates were very similar among all programs
Despite genomic rearrangements and horizontal gene transfer result in divergence of gene order, a few regions are preserved even between the genomes of evolutionarily distant organisms [23–25]. Thus, despite conservation of adjacency is a very limited source for correction of misidentified orthologs, pairs of adjacently conserved genes can still be used to estimate error rates [12].
As expected, error rates increased with proteome divergence (Fig. 5). The error rates were very similar among all programs. The mmseqs results consistently showed the lowest error rate estimates. These results suggest that the quality of orthologs remains as high, if not better, when using software that produces results faster than blastp. The reason why mmseqs showed the best quality could be that this program uses a very efficient implementation of the Smith-Waterman algorithm to produce its final alignments.
The contrast between bacterial and eukaryotic RBH results might due to the complex dynamics of eukaryotic chromosomes, resulting in complex homology relationships (e.g. [26, 27]). Such complexity might result in differences in paralog/ortholog resolutions. Problems resolving ortholog/paralog relationships might also result in differences in error rates. Accordingly, the error rate estimates were higher for eukaryotes (Fig. 5, right). Besides difficulties for such a simple method as RBH for solving ortholog/paralog relationships, the simple concepts of orthology and paralogy might not be easily applicable to complex situations, where gene conversions, duplications, and loses, complicate the picture [26, 27]. Thus, though we expected to find higher error estimates in eukaryotes, these estimates might reflect both, authentic mistakes, as well as the complexity of eukaryotic genome dynamics.