A simple and economical method for improving whole genome alignment

Background The recent advancement of whole genome alignment software has made it possible to align two genomes very efficiently and with only a small sacrifice in sensitivity. Yet it becomes very slow if the extra sensitivity is needed. This paper proposes a simple but effective method to improve the sensitivity of existing whole-genome alignment software without paying much extra running time. Results and conclusions We have applied our method to a popular whole genome alignment tool LAST, and we called the resulting tool LASTM. Experimental results showed that LASTM could find more high quality alignments with a little extra running time. For example, when comparing human and mouse genomes, to produce the similar number of alignments with similar average length and similarity, LASTM was about three times faster than LAST. We conclude that our method can be used to improve the sensitivity, and the extra time it takes is small, and thus it is worthwhile to be implemented in existing tools.


Background
Because of the recent advances in sequencing technology and genome assembly software, it is now feasible to sequence the whole genome of many species in nature. To elicit useful information from these multi-gigabase long sequences of As, Cs, Gs and Ts, one common approach is to make connections by comparing them to each other. Whole genome alignment is the computational problem of finding similar regions between two different genomes.
Current genome alignment software tools are able to align two genomes very efficiently and with only a small sacrifice in sensitivity [1][2][3][4][5][6][7][8][9]. Yet, it becomes very slow if the extra sensitivity is needed. This paper proposes a simple but effective method to improve the sensitivity of existing whole-genome alignment software without paying much extra running time. . ], we insert spaces into the two subsequences to make their length equal, and put one subsequence on top of the other. We assign a score to each column of the alignment depending on whether the characters in that column are match, mismatch, or any of them is space, and the score of an alignment is the sum of the scores assigned to its columns. The similarity of A[ i..j] and B[ k.. ] is measured by the highest score given by their best alignment. Our task is to find all subsequences of A and B whose alignment scores are no less than some score threshold.

Seed-and-extend
Most existing software tools for whole genome alignment use the "seed-and-extend" heuristic [5]. To explain this heuristic, we first note that a simple approach to solve © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
the problem for the whole genomes A[ 1..n] and B [ 1..m] is to use dynamic programming to find, for each positions 1 ≤ i ≤ n and 1 ≤ j ≤ m, whether there is an alignment around A[ i] and B[ j] with score no less than the score threshold. Obviously, this approach is too slow. The idea of the "seed-and-extend" heuristic is to use an efficient method to find a set of promising position pairs such that for those not in this set, it is very likely that there are no similar regions around them. Then, it uses dynamic programming to find, for each promising position pair, whether there is an alignment around them with high enough score. Below, we briefly describe the basic steps of "seed-and-extend".
(i) Index building: We distinguish the two input sequences such that one is the reference sequence and the other is the query sequence. We first build an indexing structure such as suffix array, hash table or BWT [10] for the reference sequence, and the index will enable us to find efficiently, for every short subsequence of the query, all the matching subsequences in the reference. (ii) Seeds finding: Use the index to find all the short matching subsequence pairs of A and B. These matching pairs, which we called seeds, suggest a much smaller set of positions from which we find the promising positions. (Note that there are some other seeding methods such as spaced-seeds [8], adaptive seeds [5], and subset seeds [11], but in essence, they all use the index to find efficiently short matching subsequences with good scores.) (iii) Gapless extension: The set of seeds found in the previous step suggests a much smaller set of positions for us to find similar regions. However, they are still too large for us to handle each of them using dynamic programming. This step further reduces its size as follows: for each seed, we extend it at both ends by repeatedly adding the next characters in both sequences to the alignment. Note that we will not introduce any gap in the extension, and thus this step is very fast. During the extension process, we will remember the maximum score seen so far, and if the current score is much lower than this maximum score, it means that we have gone too far, and we stop and return the alignment with the maximum score seen so far.
To reduce the number of promising positions, we examine the set of gapless alignments returned by this step, and discard those whose score are smaller than some pre-defined threshold D. (iv) Gap extension: For each of the remaining gapless alignment, we use dynamic programming to find whether there is an alignment around its starting positions with high enough scores.

The problem of choosing the threshold D
It is obvious that the threshold D has critical effect on the efficiency and sensitivity: a low threshold allows more promising position pairs to be checked, and thus it is likely to find more similar regions, but we will also waste more time in checking more gapless alignments that do not lead to any similar regions. To demonstrate this effect, we have used the popular whole-genome alignment tool LAST [5] to align the human and mouse genome with three different thresholds, namely the default threshold D, and two smaller thresholds 0.9D and 0.85D. For example, as shown in the first sub -table of Table 1, by reducing the threshold from D to 0.85D, LAST can find 144,000 more alignments, but the time required is more than tripled.

Filtering
This paper proposes a simple method for increasing the sensitivity of whole genome alignment without paying much extra processing time. To justify our method, we have applied it to improve LAST, and our results show that it can increase the number of LAST's reported alignments to the one that LAST gets with threshold 0.85D, but using time similar to that LAST uses with threshold D.
The idea of our method is to use a lower threshold, say 0.7D and thus we will have a larger number of gapless alignments. However, we will not pass all of them to the gap extension step. Those with score no less than D will be passed as usual. But for those with score between 0.7D and D, we do not have enough confidence in them, and they need to go through another test first.
We use Fig. 1 to explain the motivation of our test. The figure shows a pair of similar regions. Note that if we remove the insertion GCCG in the middle of the input sequence A, the two subsequences will be identical and will give a gapless extension with high score. However, the insertion breaks the upper sequence and the gapless extension step would stop at the first eight characters ATGCCGTA. Thus, unless we set a low enough threshold, this gapless alignment will be discarded without further examination, which is not correct because it is easy for dynamic programming to remove the insertion GCCG to find the similar regions. However, our example has some property that other useless gapless alignments miss; there is another gapless alignment (the one with the nine characters AATGCCGTA) follows closely. Hence, two close gapless alignments suggests the possibility that they may come from the same similar regions, and they should be passed to the gap extension step for further checking.
We now explain how to make use of this idea to increase the sensitivity without much extra processing time. Let A be the query sequence and B be the reference Then, those gapless alignments with threshold no less than D, together with those still in S will be passed to the gap extension step. Note that the set of gapless alignments found by our method is not smaller than that found by LAST with threshold D, and our method will not introduce any error because all gapless alignment will be further checked by the gap extension step. As shown below, the filtering step of our method is efficient and easy to implement.
Step 1: Select the set S o of gapless alignments with score greater than or equal to D, and select the set S of gapless alignments with score between D and D.

Results and discussion
To justify our method, we have modified LAST as described in the previous section, and used it to align the human genome (≈ 3G) to the genome of mouse (≈ 2.8G), of dog (≈ 2.4G), of cat (≈ 2.5G), of cow (≈ 2.7G), and of rat (≈ 2.8G). We use the human genome as the reference, and LAST will build an index for it. These genomes were all downloaded from [12].
Our experiments are run on the Intel Core i7-3930K 3.2 GHz processor, and we use 6 cores with 12 threads. For verification of our results, our program can To evaluate the sensitivity of LAST (version 588) and our modification LASTM, we compared the number of alignments produced by each software and measured their quality based on the length and similarity. We suppose that the more alignments with similar quality produced, the higher the sensitivity. Table 1 compares LAST's performance (with different threshold on the minimum score of gapless alignments) with that of our method with threshold is set to 0.7D, and the distance d for the filtering step is set to 2000. The table shows that our method runs in time similar to that of LAST with threshold D, but the number of alignments reported is significantly larger, which is close to that reported by LAST with threshold 0.85D. Also, the table shows that the quality of the alignments returned by our method is similar to that returned by LAST; they have similar average length, and similar average similarity (i.e., the percentage of identical columns over the length of the alignments).
Furthermore, Table 2 showed the number of reported alignments that fall in some gene regions downloaded from [12] (i.e., at least 50% of each of the subsequences in the alignment overlaps with some gene regions of the corresponding input sequences). We find that our method has similar increase in output alignments when we focus on gene regions.

Conclusions
This paper proposes a method to increase the sensitivity of whole genome alignment tools that use the "seedand-extend" heuristic. Our method is simple and easy to implement, and the extra time it takes is small, and thus it is worthwhile to be implemented in existing tools.