The one-to-one mapping of a DNA molecule to a sequence of letters suggests that sequence comparison is a prerequisite to virtually all comparative genomic analyses. Due to this, sequence comparison has been used to identify regions of similarity which may be a byproduct of evolutionary, structural, or functional relationships between the sequences under study [1]. Sequence comparison is also useful in fields outside of biology, for example, in pattern recognition [2] or music analysis [3]. Several techniques exist for sequence comparison; alignment techniques consist of either global alignment [4, 5] or local alignment [6] techniques. Alignment-free techniques also exist; they are based on measures referring to the composition of sequences in terms of their constituent patterns [7]. Pairwise sequence alignment algorithms analyse a pair of sequences, commonly carried out using dynamic-programming techniques [5]; whereas multiple sequence alignment (MSA) involves the simultaneous comparison of three or more sequences (see [8] for a comprehensive review).
Analysing multiple sequences simultaneously is fundamental in biological research and MSA has been found to be a popular method for this task. One main application of MSA is to find conserved patterns within protein sequences [9] and also to infer homology between specific groups of sequences [10]. MSA may also be used in phylogenetic tree reconstruction [11] as well as in protein structure prediction [12].
Using a generalisation of the dynamic-programming technique for pairwise sequence alignments works efficiently for MSA for only up to a few short sequences. Specifically, MSA with the sum-of-pairs score (SP-score) criterion is known to be NP-hard [13]; and, therefore, heuristic techniques are commonly used [14–16], which may not always lead to optimal alignments. As a result, suboptimal alignments may lead to unreliable tree estimation during phylogenetic inference. To this end, several methods aimed to have shown that removing unreliable sites (columns) of an alignment may lead to better results [17].
Several discussions of existing filtering methods provide evidence that the removal of blocks in alignments of sufficient length leads to better phylogenetic trees. These filtering methods take a variety of mathematical and heuristic approaches. Most of the methods are fully automated and they remove entire columns of the alignment. A few of these programs, found in [18, 19], are based on site-wise summary statistics. Several filtering programs, found in [20–24], are based on mathematical models. However, experimental results found in [17] oppose these findings, suggesting that generally, not only do the current alignment filtering methods not lead to better trees, but there also exist many cases where filtering worsened the trees significantly.
Circular molecular structures are present, in abundance, in all domains of life: bacteria, archaea, and eukaryotes; and in viruses. They can be composed of both amino and nucleic acids. Exhaustive reviews can be found in [25] (proteins) and [26] (DNA). The most common examples of such structures in eukaryotes are mitochondrial DNA (mtDNA). mtDNA is generally conserved from parent to offspring and replication of mtDNA occurs frequently in animal cells [27]. This is key in phylogenetic analysis and the study of evolutionary relationships among species [11]. Several other example applications exist including MSA of viroid or viral genomes [28] and MSA of naturally-occurring circular proteins [29].
A fundamental assumption of all widely-used MSA techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts (left-most) or ends (right-most) can be totally arbitrary due to a number of reasons: arbitrariness in the linearisation (sequencing) of a circular molecular structure; or inconsistencies introduced into sequence databases due to different linearisation standards. In these cases, existing MSA programs, such as Clustal
Ω [30], MUSCLE [31], or T-Coffee [16], may produce an MSA with a higher average pairwise distance than the expected one for closely-related sequences. A rather surprising such instance is the published human (NC_001807) and chimpanzee (NC_001643) mtDNA sequences, which do not start in the same genetic region [32]. It may be more relevant to align mtDNA based on gene order [33], however, the tool we present in this paper may be used to align sequences of a broader type. Hence, for a set of input sequences, a solution for these inconsistencies would be to identify a suitable rotation (cyclic shift) for each sequence; the sequences output would in turn produce an MSA with a lower average pairwise distance.
Due to the abundance of circular molecular structures in nature as well as the potential presence of inconsistencies in sequence databases, it becomes evident that multiple circular sequence alignment (MCSA) techniques for analysing such sequences are desirable. Since MCSA is a generalisation of MSA it is easily understood that MCSA with the SP-score criterion is also NP-hard. To this end, a few programs exist which aim to improve MCSA for a set of input sequences. These programs can be used to first obtain the best-aligned rotations, and then realign these rotations by using conventional alignment programs, such as Clustal
Ω, MUSCLE, or T-Coffee. Note that unlike other filtering programs, these programs do not remove any information from the sequences or from their alignment: they merely refine the sequences by means of rotation.
The problem of finding the optimal (linear) alignment of two circular sequences of length n and m≤n under the edit distance model can be solved in time O(n
m logm) [34]. The same problem can trivially be solved in time O(n
m
2) with substitution matrices and affine gap penalty scores [5]. To this end, alignment-free methods have been considered to speed-up the computation [35, 36]. The more general problem of searching for a circular pattern in a text under the edit distance model has also been studied extensively [37], and an average-case optimal algorithm is known [38].
Progressive multiple sequence alignments can be constructed by generalising the pairwise sequence alignment algorithms to profiles, similar to Clustal
Ω [30]. This generalisation is implemented in Cyclope [39], a program for improving multiple circular sequence alignment. The cubic runtime of the pairwise alignment stage becomes a bottleneck in practical terms. Other fast heuristic methods were also implemented in Cyclope, but they are only based on some (e.g. the first two) sequences from the input dataset.
Another approach to improve MCSA was implemented in CSA [32]; a program that is based on the generalised circular suffix tree construction [40]. The best-aligned rotations are found based on the largest chain of non-repeated blocks that belong to all sequences. Unfortunately, CSA is no longer maintained; it also has the restriction that there can be only up to 32 sequences in the input dataset, and that there must exist a block that occurs in every sequence only once.
BEAR [41] is another program aimed to improve MCSA computation in terms of the inferred maximum-likelihood-based phylogenies. The authors presented two methods; the first extends an approximate circular string matching algorithm for conducting approximate circular dictionary matching. A matrix M is outputted from this computation. For a set of d input sequences s
0,…,s
d−1, M holds values e and r between circular sequences s
i
and s
j
, where M[i,j].e holds the edit distance between the two sequences and M[i,j].r holds the rotation of sequence s
i
which will result in the best alignment of s
i
with s
j
. Agglomerative hierarchical clustering is then used on all values M[i,j].e, to find sufficiently good rotations for each sequence cluster. The second method presented is suitable for more divergent sequences. An algorithm for fixed-length approximate string matching is applied to every pair of sequences to find most similar factors of fixed length. These factors can then determine suitable rotations for all input sequences via the same method of agglomerative hierarchical clustering.
Our contributions. We design and implement MARS, a new heuristic method for improving Multiple circular sequence Alignment using Refined Sequences. MARS is based on a non-trivial coupling of a state-of-the-art pairwise circular sequence comparison algorithm [35] with the classic progressive alignment paradigm [42]. Experimental results presented here, using real and synthetic data, show that MARS improves the alignments and outperforms state-of-the-art methods both in terms of accuracy and efficiency. Specifically, to support our claims, we analyse these results with respect to standard genetic measures as well as with respect to the inferred maximum-likelihood-based phylogenies. For instance, we show here that the average pairwise distance in the MSA of a dataset of widely-studied mtDNA sequences is reduced by around 5% when MARS is applied before MSA is performed.
Definitions and notation
We begin with a few definitions, following [43], to allow further understanding. We think of a string (or sequence) x of length
m as an array x[0.. m−1] where every x[i], 0≤i<m, is a letter drawn from some fixed alphabet
Σ of size |Σ|=O(1). String ε denotes the empty string which has length 0. Given string y, a string x is considered a factor of y if there exist two strings u and v, such that y=u
x
v. Consider the strings x,y,u, and v, such that y=u
x
v. We call x a prefix of y if u=ε; we call x a suffix of y if v=ε. When x is a factor of y, we say that x
occurs in y. Each occurrence of x can be denoted by a position in y. We say that x occurs at the starting position
i in y when y[ i.. i+m−1]=x; alternatively we may refer to the ending position
i+m−1 of x in y.
A circular string of length m may be informally defined as a standard linear string where the first- and last-occurring letters are wrapped around and positioned next to each other. Considering this definition, the same circular string can be seen as m different linear strings, which would all be considered equivalent. Given a string x of length m, we denote by x
i=x[i.. m−1]x[0.. i−1], 0<i<m, the ith rotation of x and x
0=x. By looking at the string x=x
0=baababac; this string has the following rotations: x
1=aababacb, x
2=ababacba, x
3=babacbaa, etc.
Given a string x of length m and a string y of length n, the edit distance [44], denoted by δ
E
(x,y), is defined as the minimum total cost of operations required to transform string x into string y. In general, the allowed edit operations are as follows:
-
Insertion: insert a letter in y, not present in x; (ε,b), b≠ε
-
Deletion: delete a letter in y, present in x; (a,ε), a≠ε
-
Substitution: replace a letter in y with a letter in x; (a,b), a≠b,and a,b≠ε.
A q-gram is defined as any string of length q over alphabet Σ. The set of all such q-grams is denoted by Σ
q. The q-gram profile of a string x of length m is the vector G
q
(x), where q>0, and G
q
(x)[v] denotes the total number of occurrences of q-gram v∈Σ
q in x.
Given strings x of length m and y of length n≥m and an integer q>0, the q-gram distance
D
q
(x,y) is defined as:
$$ \sum\limits_{v \in \Sigma^{q}} \left\vert G_{q}(x)[v] - G_{q}(y)[v] \right\vert. $$
(1)
For a given integer parameter β≥1, a generalisation of the q-gram distance can be defined by partitioning x and y in β
blocks as evenly as possible, and computing the q-gram distance between each pair of blocks, one from x and one from y. The rationale is to enforce locality in the resulting overall distance [35]. Given strings x of length m and y of length n≥m and integers β≥1 and q>0, the β
-blockwise q-gram distance
D
β,q
(x,y) is defined as:
$${} \sum_{j=0}^{\beta-1}D_{q}\left(\!x\left[\!\frac{jm}{\beta} \ldots \frac{(j+1)m}{\beta}-1\!\right], y\left[\!\frac{jn}{\beta} \ldots \frac{(j+1)n}{\beta}-1\!\right]\right). $$
(2)
We assume that the lengths m of x and n of y are both multiples of β, so that x and y are partitioned into β blocks, each of size \(\frac {m}{\beta }\) and \(\frac {n}{\beta }\), respectively.