- Research
- Open Access
- Published:

# Identifying gene clusters by discovering common intervals in indeterminate strings

*BMC Genomics*
**volume 15**, Article number: S2 (2014)

## Abstract

### Background

Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of co-localized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction.

### Results

In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene family-free model against a gene family-based model on a dataset of 93 bacterial genomes.

### Conclusions

Our model is able to detect gene clusters that would be also detected with well-established gene family-based approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene family-based methods due to wrong or deficient gene family assignments.

## Background

Gene clusters are sets of functionally associated genes in prokaryotic and fungal genomes that are located close to each other over a longer period of evolutionary time, despite the genome undergoing significant rearrangements. Consequently, gene clusters may be found in several related species by means of comparative gene order analysis. Over the past years several such methods have been proposed and subsequently improved in their sensitivity. Initial gene cluster models considered only completely conserved genomic segments that retain gene order and orientation [1, 2]. Later models still required gene clusters to be contiguous and complete but dropped the requirement for co-linearity [3–5]. The most powerful class of approaches can handle imperfect conservation of gene content by allowing to some extent either inserted [6–8] or both inserted and deleted genes [9–11].

All above methods require prior knowledge of homology relations between genes, using either a one-to-one mapping between the gene sets of different genomes [3, 6, 5], or a general partitioning into gene families [4, 7–11]. In the latter, a genome is modeled as a set of sequences over the alphabet of gene families, where each sequence corresponds to a particular chromosome of the organism.

Most commonly, gene families are predicted computationally on the basis of sequence similarity. Various databases exist that offer information of precomputed gene families [12–14]. Furthermore, several software tools are freely available that allow for direct computation of gene family assignments in a dataset of interest [15–17]. Typically, these approaches assume that gene families naturally cluster into densely connected subgraphs in the gene similarity network. However, multi-domain proteins can have strong ties not only to their own family but also to other families they share a domain with. Some of these proteins may not be at all traceable back to a single gene family. While some recent approaches can deal with the ambiguities caused by multi-domain proteins [18, 19], it is still a major challenge to define cut-offs in the clustering process that at the same time eliminate spurious edges and keep gene families at a reasonable granularity[20, 21].

In this paper, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure between the genes of different genomes. We do this by introducing the concept of *common intervals* to *indeterminate strings*, which are a class of strings that can have more than one character at every position. We then extend this concept to allow for a limited number of insertions and deletions. Furthermore, we present algorithms that solve related discovery problems of finding all *weak common intervals* and *approximate weak common intervals* in indeterminate strings. Finally, we propose a new method for gene family-free discovery of gene clusters based on (approximate) weak common intervals in indeterminate strings.

## Methods

### Definitions

Indeterminate strings, also known as *degenerate strings* are formally defined as [22]:

**Definition 1 (indeterminate string)** *For a given finite alphabet* \sum, *let* \mathcal{P}\left(\sum \right)*be the power set of* \sum*. An* indeterminate string *is a sequence of* character sets*, which are elements of* \mathcal{P}\left(\sum \right)\backslash \left(\varnothing \right).

In other words, for an indeterminate string *S* with *n* index positions must hold that for every *i*, 1\le i\le n, S\left[i\right]\subseteq \sum and S\left[i\right]\ne \varnothing, where S\left[i\right] denotes the character set associated with the *i*-th position in *S*. In the special case where every position of indeterminate string *S* holds a singleton set, *S* is equivalent to an ordinary string. We denote the *length* of an indeterminate string *S* with *n* index positions by \left|S\right|\equiv n and its *cardinality*, i.e. the number of *all* elements in *S*, by \parallel S\parallel \equiv {\sum}_{i=1}^{n}\left|S\left[i\right]\right|. Two positions *a* and *b*, 1\le a\le b\le \left|S\right|, induce the indeterminate *substring* S\left[a,b\right]\equiv S\left[a\right]\phantom{\rule{2.77695pt}{0ex}}S\left[a+1\right]\dots S\left[b\right]. To distinguish intervals in different indeterminate strings, we indicate the affiliation of an interval \left[i,j\right] to indeterminate string *S* by the subscript notation {\left[i,j\right]}_{S}.

**Example 1** S=\left\{a,d,g\right\}\left\{c\right\}\left\{a,d\right\}\left\{e,f\right\}\left\{b\right\}\left\{c,g\right\}*is an indeterminate string of length* \left|S\right|=6*and cardinality* \left|\right|S\left|\right|=11*over alphabet* \sum =\left\{a,b,c,d,e,f,g\right\}. *The third element of S is given by character set* S\left[3\right]=\left\{a,b\right\}. *Interval* \left[2,4\right]*induces the substring* S\left[2,4\right]=\left\{c\right\}\left\{a,d\right\}\left\{e,f\right\}.

In this work, we generalize the concept of common intervals, which were initially introduced on permutations [23] and subsequently extended to strings [24]. The idea behind common intervals is to compare strings, or rather substrings, based on their character sets. The character set of an ordinary string *S* is defined as \mathcal{C}(S)\equiv \left\{S\left[i\right]|1\le i\le |S|\right\}. The equivalent concept on indeterminate strings is the following:

**Definition 2 (character set)** *The* character set of an indeterminate string Sis defined by \mathcal{C}\left(S\right)\equiv {\bigcup}_{i=1}^{n}S\left[i\right].

In two ordinary strings *S* and *T* over a finite alphabet Σ, two intervals, [*i, j*] in *S* and [*k, l*] in *T*, are called *common intervals* if \mathcal{C}\left(S\left[i,j\right]\right)=\mathcal{C}\left(T\left[k,l\right]\right)). The analogon for indeterminate strings is:

**Definition 3 (strict common intervals)** *Given two indeterminate strings* S*and* T, *two intervals*, \left[i,j\right]*in* S*and* \left[k,l\right]*in* T, *are said to be* strict common intervals *if and only if their character sets* \mathcal{C}\left(S\left[i,j\right]\right)*and* \mathcal{C}\left(T\left[k,l\right]\right)*are equal*.

A weaker definition based on the intersection relation between character sets is:

**Definition 4 (weak common intervals)** *Given two indeterminate strings* S*and* T, *two intervals*, \left[i,j\right]*in* S*and* \left[k,l\right]*in* T, *are* weak common intervals *with* common character set C=\mathcal{C}\left(S\left[i,j\right]\right)\cap \mathcal{C}\left(T\left[k,l\right]\right)*if for each* x, i\le x\le j, *it holds that* C\cap S\left[x\right]\ne \varnothing*and for each* y, k\le y\le l, *it holds that* C\cap T\left[y\right]\ne \varnothing.

In all our use cases, in particular when dealing with conflicting gene family assignments as well as gene family-free gene cluster detection, the concept of weak common intervals appears to be more appropriate. Thus, in the following, we restrict ourselves to the study of weak common intervals.

Furthermore, continuing a previous line of research initially proposed by Schmidt and Stoye in [4], we further extend weak common intervals by allowing a limited number of insertions and deletions:

**Definition 5 (approximate weak common intervals)** *Given two indeterminate strings* S*and* T*and a threshold* \delta \in {\mathbb{N}}_{0}, *two intervals*, \left[i,j\right]*in* S*and* \left[k,l\right]*in* T, *are* approximate weak common intervals *with* common character set C=\mathcal{C}\left(S\left[i,j\right]\right)\cap \mathcal{C}\left(T\left[k,l\right]\right)*if the number of* positions with no intersection with C is limited by δ, i.e. \left|\left\{x|i\le x\le j:S\left[x\right]\cap C=\varnothing \right\}\right|+\left|\left\{y|k\le y\le l:T\left[y\right]\cap C=\varnothing \right\}\right|\le \delta. These positions are called indels.

Generally, algorithms for discovering common intervals of ordinary strings only report pairs of intervals that both are *maximal*, whereby *maximality* is defined as follows: An interval \left[i,j\right] in string *X* is called *maximal* if its immediate left and right neighboring characters, X\left[i-1\right] and X\left[j+1\right] (if such exist), are not contained in \mathcal{C}\left(X\left[i,j\right]\right). In other words, interval \left[i,j\right] cannot be extended to its left or right without expanding the character set of the interval.

In terms of weak common intervals, we introduce the following property derived from [11]:

Definition 6 (C-closed) *Given an indeterminate string* S, *an interval* \left[i,j\right], *and a character set* C\subseteq \sum, *interval* \left[i,j\right]*is* C-closed *if* S\left[i\right], S\left[i\right]\subseteq C, *and if* i=1*or* S\left[i-1\right]\cap C=\varnothing, *and if* j=n*or* S\left[j+1\right]\cap C=\varnothing.

A reasonable balance between omitting expedient and including apparently redundant weak common intervals is found by the subset of those that are *mutually-closed*, as defined as follows:

**Definition 7 (mutually-closed)** *Given a pair of intervals* \left([i,j{]}_{S},[k,l{]}_{T}\right)*of indeterminate strings* S and T, \left([i,j{]}_{S},[k,l{]}_{T}\right)*and* [i,j{]}_{S}*are* mutually-closed *if* \left([i,j{]}_{S},[k,l{]}_{T}\right)*is* \mathcal{C}\left(T\left[k,l\right]\right)-*closed and* [i,j{]}_{S}\mathcal{C}\left(S\left[i,j\right]\right)-*closed*.

We consequently restrict the enumeration of weak common intervals and approximate weak common intervals to those that are mutually-closed.

*Combinatorial complexity*. The maximal number of mutually-closed weak common intervals of two indeterminate strings *S* and *T* of length *n* and *m*, respectively, is bounded by *nm*. This result follows from the fact that the number of intervals [*k, l*] in *T* that are mutually-closed weak common intervals with any interval with fixed left bound *i* in *S* is bounded by *m*. Likewise, the maximal number of mutually-closed approximate weak common intervals of indeterminate strings *S* and *T* is bounded by (*δ* + 1)^{2}*nm*.

*Gene family-free analysis*. In absence of gene family assignments, each gene in the dataset is represented by a unique *character*, linearly ordered along a *chromosomal string*. Therefore, the *n* characters of a chromosomal string can be identified by their integer *index set* {1, 2*, . . . , n*}. Relating characters of one chromosomal string to characters of another, we presume that we are given a symmetric *similarity measure σ*_{
AB
} : *A* × *B →* ℝ_{≥0} for any two index sets *A* and *B*.

In gene family-free gene cluster analysis we aim at finding pairs of intervals in two chromosomal strings, whose characters are similar. We can reduce this problem to finding (approximate) weak common intervals between two indeterminate strings. To this end, we construct an *index mapping B*_{
A
}:

Thus, *B*_{
A
} is an indeterminate string over alphabet \left\{1,2,\dots ,\left|A\right|,\infty \right\}. Let {I}_{A}=\left\{1\right\}\left\{2\right\}\cdots \left\{\left|A\right|\right\} represent the indeterminate string of *A*, a position in *I*_{
A
} shares a character with a position in *B*_{
A
} if and only if the similarity of the two corresponding characters is non-zero. Thus, finding intervals in chromosomal strings *A* and *B*, whose characters are similar, is equivalent to finding (approximate) weak common intervals of indeterminate strings *I*_{
A
} and *B*_{
A
}. Note that the set of (approximate) weak common intervals of *I*_{
A
} and *B*_{
A
} is identical to the one of *I*_{
B
} and *A*_{
B
}. The (approximate) weak common intervals differ in size and, most substantially, in the similarities between characters within the interval pairs. Therefore, we introduce a simple scoring scheme by which the solution space can be arranged into a landscape of highs and lows of (approximate) weak common intervals, ranked by taking into account the number and the similarities of the contained characters. We define a score function *µ*_{xy} over an index *x* in index set *X* and an interval [*a, b*]_{
Y
} in index set *Y* as

so that *µ*_{xy} takes values between 0 and 1, being 1 if the gene with highest similarity to *x* lies within interval [*a, b*]_{
Y
}. The overall score of two interval pairs ([*i, j*]_{
A
}, [*k, l*]_{
B
}) is then

We now describe three algorithms to compute all mutually-closed weak common intervals and all mutually-closed approximate weak common intervals with at most *δ* indels in two indeterminate strings. Note that mutually-closed weak common intervals are a special subclass of mutually-closed approximate weak common intervals for *δ* = 0.

In the following, we consider two indeterminate strings *S* of length *n* and *T* of length *m*.

### Discovering weak common intervals

We now describe the algorithm *W**eak* *C**ommon* *I**ntervals on* *I**ndeterminate* *S**trings* (WCII) as presented in Figure 1. It solves the following problem:

**Problem 1** *Given two indeterminate strings* S*and* T, *discover all mutually-closed weak common intervals of* S*and* T.

To tackle this problem we make use of the following constructs:

**Definition 8 (index string)** *Given an indeterminate string* S*of length* n, {I}_{S}\equiv \left\{1\right\}\left\{2\right\}\cdots \left\{n\right\}*denotes the* index string *of* S.

**Definition 9 (index mapping)** *Given two indeterminate strings* S*and* T*of lengths* n*and* m*respectively, the* index mapping *of* S*onto* T*is given by* {\left({T}_{S}\left[y\right]\right)}_{y=1,\dots ,m}, *where*

Note that index strings and index mappings are again indeterminate strings. The key idea of WCII arises from the following observation:

**Observation 1** *Given two indeterminate strings* S*and* T*with index string* {I}_{S}*and index mapping* {T}_{S}, *two intervals* \left[i,j\right]*in* S*and* \left[k,l\right]*in* T*are weak common intervals if and only if* {\left[i,j\right]}_{{I}_{S}}*and* {\left[k,l\right]}_{{T}_{S}}*are weak common intervals*.

This equivalence holds because any two positions, *x* in *S* and *y* in *T* intersect if and only if *I*_{
S
}[*x*] and *T*_{
S
}[*y*] intersect. Since it holds that *I*_{
S
}[*x*] = {*x*} for all *x* = 1*, . . . , n*, we simplify the notation of single character set *I*_{
S
}[*x*] to just *x* and character set \mathcal{C}\left({I}_{S}\left[i,j\right]\right) to just \left[i,j\right]. Note that character c\in \mathcal{C}\left({I}_{S}\left[i,j\right]\right) serves subsequently both as character *c* ϵ [*i, j*] as well as index in *I*_{
S
}.

WCII is an adaptation of Didier's Algorithm [24] of enumerating maximal common intervals in ordinary strings. Didier's strategy can be described as follows: The algorithm iterates over all positions *i* in *S* as possible left interval bounds. In each iteration all mutually-closed weak common interval pairs are reported that share the same left bound *i* in *I*_{
S
}. For each possible right bound *j* ≥ *i*, the algorithm iterates through the set of positions in *T*_{
S
} that contain *j* in their character set. To this end, we make use of an array POS, where POS[*j*], 1 ≤ *j* ≤ *n*, is a sorted list of positions in *T*_{
S
} containing character *j*. Each position *y* ϵ POS[*j*] is associated with an interval {\left[k,l\right]}_{{T}_{S}}, *k* ≤ *y* ≤ *l*, called the *min-rank interval* of character *j* for position *y*. It is the largest interval around *y* for which every position in {\left[k,l\right]}_{{T}_{S}} contains at least one character in [*i, j*]. Obviously, {\left[k,l\right]}_{{T}_{S}} is [*i, j*]-closed. It remains to be checked if {\left[i,j\right]}_{{I}_{S}} is closed w.r.t. \mathcal{C}\left({T}_{S}\left[k,l\right]\right) and that every position in {\left[i,j\right]}_{{I}_{S}} and {\left[k,l\right]}_{{T}_{S}} contains a character from C=\mathcal{C}\left({I}_{S}\left[i,j\right]\right)\cap \mathcal{C}\left({T}_{S}\left[k,l\right]\right). To show the latter, it is sufficient to show that \left[i,j\right]\subseteq \mathcal{C}\left({T}_{S}\left[k,l\right]\right), because the character set of each position in *I*_{
S
} corresponds to the single element set of its index. The details of both tests are explained below, after relevant data structures are introduced. If both conditions are satisfied, a mutually-closed weak common interval pair is found and subsequently reported.

Like in Didier's Algorithm, we employ two tricks that improve the performance: precomputing *min-rank intervals* and following paths of *rank-nearest successors*.

*Precomputing min-rank intervals*. In order to identify min-rank intervals, it is sufficient to observe the smallest character *c* ≥ *i* in each position. To this end, we make use of the following construct:

**Definition 10 (** *i***-reduced string)** *Given index mapping* {T}_{S}, {\left({T}_{S}^{i}\left[y\right]\right)}_{y\phantom{\rule{2.77695pt}{0ex}}=\phantom{\rule{2.77695pt}{0ex}}1,...,m}*is the* i-reduced string of {T}_{S} of the *i*th iteration, *where* {T}_{S}^{i}\left[y\right]=\mathsf{\text{min}}\left(\left\{c\phantom{\rule{2.77695pt}{0ex}}|\phantom{\rule{2.77695pt}{0ex}}c\in {T}_{S}\left[y\right]\cup \left\{\infty \right\}:c\ge i\right\}\right).

Min-rank intervals in {T}_{S}^{i} are identical to *rank intervals* as initially defined by Dider *et al*. [24]. Interestingly, rank intervals in {T}_{S}^{i} correspond directly to min-rank intervals in *T*_{
S
}:

**Lemma 1** *The set of min-rank intervals in* {T}_{S}*is identical to the set of rank intervals in* {T}_{S}^{i}.

*Proof:* Didier *et al*.[24] show that rank intervals in a string are nested and that their number is bounded by the length of the string.

Observe that for any position *y* in {T}_{S}^{i} the rank interval of character j={T}_{S}^{i}\left[y\right] is identical to the min-rank interval of *j* at position *y* in *T*_{
S
}. Let *y* be a position in *T*_{
S
} and *j* ϵ *T*_{
S
}[*y*] such that j>{T}_{S}^{i}\left[y\right]. Further, let {\left[k,l\right]}_{{T}_{S}} be the min-rank interval of *j* at *T*_{
S
}[*y*], j\prime =\mathsf{\text{max}}\left(\left\{c\phantom{\rule{2.77695pt}{0ex}}|\phantom{\rule{2.77695pt}{0ex}}c\in \mathcal{C}\left({T}_{S}^{i}\left[k,l\right]\right):c\le j\right\}\right), and {\left[k\prime ,l\prime \right]}_{{T}_{S}} be the min-rank interval of *j*' at its corresponding position in *T*_{
S
}. Because *j*' ≤ *j* it consequently holds that {\left[k\prime ,l\prime \right]}_{{T}_{S}}\subseteq {\left[k,l\right]}_{{T}_{S}}. Now, according to the definition of min-rank intervals, {T}_{S}^{i}\left[k\prime -1\right]>j\prime, if such position exists. Since *j*', is the largest character in {T}_{S}^{i}\left[k,\phantom{\rule{2.77695pt}{0ex}}l\right] that is smaller than or equal to *j*, it must also hold that {T}_{S}^{i}\left[k\prime -1\right]>j. The same argument holds for {T}_{S}^{i}\left[l\prime +1\right] if such position exists, therefore {\left[k,l\right]}_{{T}_{S}}={\left[k\prime ,l\prime \right]}_{{T}_{S}} is the min-rank interval of both characters *j*' and *j*. We conclude that all min-rank intervals for any character in *T*_{
S
} at iteration *i* are contained in the set of rank intervals of {T}_{S}^{i}. □

Consequently, all min-rank intervals in *T*_{
S
} in the *i*th iteration (i.e. for a fixed left bound *i* in *I*_{
S
} ) can be precomputed in \mathcal{O}\left(m\right) time using the algorithm given by Didier *et al*. [24]. They are stored in table INT. For a currently processed character *j* at position *y* in *T*_{
S
}, INT[*y*] contains its corresponding min-rank interval. Unlike Didier's Algorithm, INT must be updated after each iteration such that all positions in INT accessed in the following (*j* + 1)th iteration contain the corresponding min-rank intervals of character *j* + 1. Details of the update step can be found in Additional file 1 Section 1.1.

*Following paths of rank-nearest successors*. The second trick in the algorithm consists in increasing the right bound *j* in *I*_{
S
} while walking through positions and characters of *T*_{
S
}. Thereby the algorithm jumps from a current position *y* that contains character *j* to its *rank-nearest successor*, which is the position *y*' containing character *j* + 1 with the smallest *min-rank distance* to *y* as defined as follows:

**Definition 11 (min-rank distance)** *The* min-rank distance of any two positions k and l in indeterminate string {T}_{S} for the ith iteration is given by:

If several co-optimal positions are available, the tie is broken by choosing the leftmost one as rank-nearest successor. In case no position with character *j* + 1 exists, or the smallest min-rank distance is '*∞*', *j* has no successor. For the *i*th iteration, all rank-nearest successors are precomputed and stored in table SUCC which is explained in more detail in Additional file 1 Section 1.2.

Connecting characters larger than or equal to *i* at their corresponding positions in *T*_{
S
} with their rank-nearest successors through directed edges results in a forest of rooted trees. Nodes (across all trees) sharing the same character are said to reside on the same *level*. In lines 8-28 of Figure 1, the algorithm traverses along paths through this forest in a bottom-up procedure, from one level to the next, starting at those leaves with character *i*. Besides the currently visited nodes of the level, the algorithm keeps track of the *path bounds*, which are the outermost positions in *T*_{
S
} a path has visited thus far. The currently visited nodes of the paths and their corresponding path bounds are stored in a list labeled LIST. Only after all nodes of the same level *j* are processed, the algorithm follows all current paths to nodes of the next level *j* + 1, thereby ensuring that each character in *T*_{
S
} is processed at most once. To this end, for all positions containing character *j* that have the same rank-nearest successor *y*', the algorithm discontinues the paths of all but the leftmost one with shortest min-rank distance to *y*' (line 19). Traversing along paths of rank-nearest successors in WCII differs from Didier's Algorithm by the fact that a position in *T*_{
S
} may be visited by the same path several times on different levels.

For any given min-rank interval {\left[k,l\right]}_{{T}_{S}} there cannot be more than one weak common interval partner in *I*_{
S
} starting at position *i*. Therefore it is sufficient to track at least one path in each min-rank interval to find all mutually-maximal intervals of *I*_{
S
} and *T*_{
S
}. Positions in POS are sorted, thus paths leading to the same weak common interval pair appear adjacent to each other in LIST and the common interval pair is reported only for the first (lines 15-17).

For each node in LIST, associated with character *j* and position *y*, the algorithm checks if the min-rank interval {\left[k,l\right]}_{{T}_{S}} of *j* encloses the path bounds up to position *y* (see condition in line 15). If validated, a weak common interval pair has been found, given by {([i,j]}_{{I}_{S}},\phantom{\rule{2.77695pt}{0ex}}{\left[k,l\right]}_{{T}_{S}}). To ensure mutual closedness, the interval pair is only reported if *i −* 1 is not contained in the character set \mathcal{C}\left({T}_{S}|k,l|\right) and the successor of *y* is not within the current bounds of its path (see conditions in lines 13 and 15). Checking for the former can be achieved in \mathcal{O}\left(1\right) time after \mathcal{O}\left(m\right) time preprocessing by performing a range minimum query on an array of size \mathcal{O}\left(m\right) where each position containing character *i −* 1 is assigned 0 and 1 otherwise.

The overall complexity of the algorithm can be summarized as follows: Each position in *I*_{
S
} is regarded exactly once as left bound *i* for all weak common intervals that are reported in one iteration. Once {T}_{S}^{i} is computed for *i* = 1 it can be up-dated using array POS, taking overall \mathcal{O}\left(\left|\right|{T}_{S}\left|\right|\right) time for all left bounds *i* = 1*, . . . , n*. Further, for each left bound the algorithm performs \mathcal{O}\left(m\right) steps to precompute all min-rank intervals and \mathcal{O}\left(\left|\right|{T}_{S}\left|\right|\right) steps to precompute all rank-nearest successors. The subsequent bottom-up procedure and the reporting of weak common intervals requires again \mathcal{O}\left(\left|\right|{T}_{S}\left|\right|\right) time. Therefore we have:

**Theorem 1** *Given two indeterminate strings* S*and* T, *Algorithm WCII finds all pairs of mutually-closed weak common intervals of* S*and* T*in* \mathcal{O}\left(n\left|\right|{T}_{S}\left|\right|\right) time.

### Discovering approximate weak common intervals

We now present the algorithm *A**pproximate* *W**eak* *C**ommon* *I**ntervals on* *I**ndeterminate* *S**trings* (AWCII) as presented in Figure 2, thus line numbers mentioned in this subsection refer to Figure 2. AWCII solves the following problem:

**Problem 2** *Given two indeterminate strings* S*and* T*and indel threshold* \delta \in {\mathbb{N}}_{0}, *discover all mutually-closed approximate weak common intervals of* S*and* T*with no more than* \delta indels.

Following a strategy similar to WCII, AWCII solves Problem 2 for index mappings *I*_{
S
} and *T*_{
S
}, instead of *S* and *T*. As before, in each iteration the algorithm maintains a fixed left bound *i* in *I*_{
S
}. For each character *j* ϵ [*i, n*] all positions *y* in *T*_{
S
} are processed that contain character *j* (lines 5-25). Thereby character *j* at position *y* in *T*_{
S
} can be associated with several different intervals around *y* that are candidates of mutually-closed approximate weak common interval partners for interval {\left[i,j\right]}_{{I}_{S}}. Only intervals surrounding one (or several) positions *y* can be mutually-closed. However, for an interval {\left[k,l\right]}_{{T}_{S}} containing indels, it no longer holds that the minrank distance of any two positions within the interval is always smaller than the min-rank distance from any position inside to any position outside the interval. As a result, neither precomputed min-rank intervals nor following paths of ranknearest successors can be used for improving the algorithm's performance. Instead we pursue a different approach, thereby making AWCII an adaptation of the RGC algorithm of Jahn [11].

For each *d*_{
k
} = 1*,..., δ* (lines 7-23) AWCII identifies the leftmost position *k* in *T*_{
S
} such that at most *d*_{
k
} indels are contained in interval {\left[k,y\right]}_{{T}_{S}} and *T*_{
S
}[*k*] ∩ [*i, j*] ≠ ∅. Let {d\prime}_{k}\le {d}_{k} be the observed number of indels in {\left[k,l\right]}_{{T}_{S}} (see line 9), the algorithm then finds for each *d*_{
l
} = 1*,..., δ − d*'_{
k
} (lines 14-21) the rightmost position *l* such that again *T*_{
S
}[*l*] ∩ [*i, j*] ≠ ∅ and the number of indels in {\left[y,l\right]}_{{T}_{S}} does not exceed *d*_{
l
}. Each (*k, l*) of the at most (*δ* + 1)^{2} combinations of leftmost and rightmost positions gives rise to a candidate pair of mutually-closed approximate weak common intervals {(\left[i,j\right]}_{{I}_{S}},\phantom{\rule{2.77695pt}{0ex}}{\left[k,l\right]}_{{T}_{S}}). For each candidate pair it is checked that the number of characters in [*i, j*] not contained in \mathcal{C}\left({T}_{S}\left[k,l\right]\right) plus the already consumed number of indels in {\left[k,l\right]}_{{T}_{S}} does not exceed *δ*. Finally, it is tested if {\left[i,j\right]}_{{I}_{S}} is \mathcal{C}\left({T}_{S}\left[k,l\right]\right)-closed. If both conditions are satisfied, a mutually-closed approximate weak common interval pair is found and is subsequently reported (line 18).

Runtime improvements are achieved by precomputing right and left bounds of candidate intervals {\left[k,l\right]}_{{T}_{S}} for each character *j* in *T*_{
S
}. These bounds are computed within \mathcal{O}\left(\left(\delta +1\right)\left|\right|{T}_{S}\left|\right|\right) time for a fixed left bound *i* in *I*_{
S
} and stored in tables L and R respectively. Further, for each such candidate interval {\left[k,l\right]}_{{T}_{S}} the number of characters that are within [*i, j*] can also be precomputed. This number is used to determine *δ*_{
S
} in line 16. The construction of the corresponding table, called RANGECONTENT, is achieved within \mathcal{O}\left({\left(\delta +1\right)}^{2}n\left|\right|{T}_{S}\left|\right|\right) time for a fixed left bound *i*. The details of constructing tables L, R, and RANGECONTENT can be found in Additional file 1 Section 2. Note that RANGECONTENT differs significantly from the data structure NUM used in RGC to count characters in intervals.

In terms of overall runtime, for each fixed bound *i* in *I*_{
S
} the algorithm spends \mathcal{O}\left({\left(\delta +1\right)}^{2}n\left|\right|{T}_{S}\left|\right|\right) time on computation of the above mentioned auxiliary tables. Thereafter, AWCII requires \mathcal{O}\left({\left(\delta +1\right)}^{2}\left|\right|{T}_{S}\left|\right|\right) time to iterate through all combinations of candidate intervals in L and R and to identify approximate weak common intervals.

Testing for \mathcal{C}\left({T}_{S}\left[k,l\right]\right)-closedness of interval {\left[i,j\right]}_{{I}_{S}} can be achieved in \mathcal{O}\left(1\right) time by precomputing a table for all candidate intervals in *T*_{
S
} of the *i*th iteration, where each entry indicates if a character *i −* 1 or *j* + 1 is contained in the corresponding candidate interval. Such a table can be constructed within \mathcal{O}\left(\left(\delta +1\right)\cdot \left|\right|{T}_{S}\left|\right|\right) time using again a simple sweep algorithm. We conclude with the following theorem:

**Theorem 2** *Given two indeterminate strings* S*and* T*and indel threshold* \delta \in {\mathbb{N}}_{0}, *algorithm AWCII computes all pairs of mutually-closed approximate weak common intervals of* S*and* T*in* \mathcal{O}\left({\left(\delta +1\right)}^{2}\cdot {n}^{2}\left|\right|{T}_{S}\left|\right|\right)*time*.

### A runtime heuristic for discovering approximate weak common intervals

Our third algorithm, ACSI (see Figure 3) represents a runtime heuristic that solves Problem 2 exactly and in practice outperforms both WCII and AWCII in gene family-free analysis by orders of magnitude.

Just as the two algorithms before, ACSI operates on index string *I*_{
S
} and index mapping *T*_{
S
} instead of indeterminate strings *S* and *T* directly. For every fixed interval [*i, j*] in *I*_{
S
}, ACSI identifies mutually-closed approximate weak common interval partners [*k, l*] in *T*_{
S
}. To this end, it iterates through elements of POS[*i*], i.e. positions in *T*_{
S
} that contain character *i* (lines 3-7 of Figure 3). For each such position *y* ϵ POS[*i*] the algorithm calls a recursive procedure, denoted EXTEND (line 5). This recursive procedure requires 5 parameters, corresponding to fixed bounds {\left[i,j\right]}_{{I}_{S}}, the currently processed interval [*k, l*] in *T*_{
S
}, and the current number of allowed indels, *d*. In the initial call, interval {\left[k,l\right]}_{{T}_{S}} is set to {\left[y,y\right]}_{{T}_{S}} and *d* = *δ*. EXTEND then increases interval {\left[k,l\right]}_{{T}_{S}} to both sides until [*i, j*] ∩ *T*_{
S
}[*k −* 1] = ∅ and [*i, j*] ∩ *T*_{
S
}[*l* + 1] = ∅ (line 10). If in this process the algorithm observes characters *i −* 1 or *j* + 1 in \mathcal{C}\left({T}_{S}\left[k,l\right]\right), EXTEND returns to the previous call (lines 11-13). Otherwise, it verifies if {(\left[i,j\right]}_{{I}_{S}},\phantom{\rule{2.77695pt}{0ex}}{\left[k,l\right]}_{{T}_{S}}) is a mutually-closed approximate weak common interval pair by testing if the number of characters in [*i, j*] that are missing in \mathcal{C}\left({T}_{S}\left[k,l\right]\right) is less than or equal to the current *d* and if i\in \mathcal{C}\left({T}_{S}\left[k,l\right]\right) (line 14). The interval pair is reported if both conditions are validated. EXTEND then increases {\left[k,l\right]}_{{T}_{S}} to the left, thereby consuming indel positions as long as their overall number remains less than or equal to the current *d* (line 17). If a position *k*' *< k −* 1 has been found such that [*i, j*] ∩ *T*[*k*'] ≠ ∅, EXTEND is called recursively with parameter values {\left[i,j\right]}_{{I}_{S}}, {\left[k\prime ,l\right]}_{{T}_{S}}, and the remaining number of allowed indels, given by *d* + *k*' + 1 *− k* (lines 18-20). This step is also symmetrically executed for the right side of {\left[k,l\right]}_{{T}_{S}} (lines 21-24).

The actual heuristic speed-up of the algorithm is achieved by calling procedure EXTEND in line 5 not for all intervals [*i, j*] in *I*_{
S
} but only for those that have chances of success for being a weak common intervals pair with an interval [*k, l*] around a position *y* ϵ POS[*i*]. Thus, the neighborhood around position *y* is scanned for suitable values of *j* prior to the execution of EXTEND. The details are described in Additional file 1 Section 3.

## Results and discussion

In the following, we highlight the benefit of our dynamic model in comparison with present approaches. Although conflicting gene family assignments are extremely common in practice, this scenario is difficult to evaluate. Assuming the existence of an ultimately true gene family assignment, conflicts arise by incorrect gene family assignments. Therefore an evaluation would inevitably result in benchmarking gene family prediction tools, rather than scrutinizing our model.

Instead, we decided to evaluate our gene family-free model against the traditional gene family-based approach. To this end, we chose a genomic dataset of bacterial genomes that has been used in a prior gene cluster study [8] and was originally obtained from [25]. The dataset features 133 chromosomal sequences, of which we removed all sequences originating from plasmids.

In practice ACSI outperforms both WCII and AWCII as shown by Figure 4. Thus, in all subsequent results, we used ACSI to compute mutually-closed (approximate) weak common intervals.

*Gene family-based dataset*. Genes in this dataset are annotated with COG (Clusters of Orthologous Groups) identifiers [12] which are used to establish homology relationships between genes. The set of genes in the dataset was revised by the latest available gene information under the accession numbers of the respective genomes at NCBI. To this end, genes that are meanwhile marked as pseudo genes were removed from the dataset. No genes were added, since COG annotations of new genes are not available. We further omitted all genomes from subsequent analyses of which more than 10 pseudo genes were removed in this process. 93 genomes remained, comprising on average 2726 genes (minimum/ maximum number of genes: 784/8317).

*Gene family-free dataset*. Pairwise similarities between genes in the dataset were obtained using the relative reciprocal BLAST score (RRBS) [26]. Genes were compared on the basis of their encoding protein sequence using BLASTP+ [27] with an e-value threshold of 0.1 and disabled composition-based score adjustments.

For evaluation purposes, we produced different degrees of pruned gene similarity sets by filtering spurious gene similarities. For this, we employed an undirected variant of the stringency criterion (parameterized by *f* ϵ [0, 1]) for gene similarities proposed in [28], which is described in more detail in Additional file 1 Section 4.1.

To evaluate the gene family-free model, we ran an implementation of ACSI for *δ* = 0 on the unpruned gene similarity graph of our dataset and compared the 4015841 interval pairs with respect to the contained COG identifiers. We discarded all pairs for which at least one interval contained less than two genes with a COG identifier. In the remaining 1194036 interval pairs, we observed that the similarity in the set of COG identifiers depends strongly on the intervals' score (Table 1). Among the clusters with a score greater or equal 10, 95% have the same set of identifiers in both intervals. While this number decreases for smaller scores, still a quarter of the interval pairs with a score lower than 1 do not differ in their COG identifiers. This shows that our approach is able to detect gene clusters that would also be detected with well-established gene family based approaches.

This is not a surprise, as weak common intervals are in fact a generalization of the classic common intervals model: A run of ACSI on a dataset where similarity scores are only set between members with the same COG identifiers finds the exact same set of clusters as the common intervals based approach.

To evaluate the predictive power of our approach, we compare the output of our program to gene clusters predicted by the *reference gene cluster algorithm* (RGC) [11]. While this algorithm is capable of multiple genome comparison and the detection of faint conservation patterns, we use it in this context for pairwise genome comparison to detect interval pairs (*I*_{1},*I*_{2}) whose gene sets have a symmetric set distance of at most 2. It has been previously observed that the generalization to approximate conservation underlying the reference gene cluster approach is not only a way to find imperfectly conserved clusters, but also a means to add robustness against errors in gene family assignment. For example, an interval pair may appear to have a set distance of two because besides the shared genes, there is one gene unique to *I*_{1} and one gene unique to *I*_{2}. However a closer inspection of the genes reveals that these genes are in fact homologs that were not recognized in the preceding partitioning of genes into homology families. We ran RGC on all pairs of the 93 genomes setting parameters *δ* = 2 (maximal tolerated symmetric set distance) and *s* = 3 (the minimum cluster size). The program returned among others 192900 "single-mismatch clusters", i.e. clusters that have exactly one extra gene in each interval. In 47453 (24.60%) of the single-mismatch clusters, we observe a similarity score between the two extra genes in our BLAST dataset. ACSI found 89.84% of the single-mismatch clusters and for 75.24% the extra genes turned out to be pairwise best hits. Moreover we observe that in 18143 among the single-mismatch gene clusters predicted by RGC the two extra genes have exactly the same annotation string. (Annotations containing the word "hypothetical" were ignored.) ACSI finds 90.19% of these clusters. Surprisingly, 4.59% of the single-mismatch clusters in which the two extra genes had best hits to each other were not found by ACSI. This is because for one or more of the other genes in the cluster our BLAST results did not return any similarity score to a gene in the other interval. Apparently the elements of a cluster of orthologous groups can be very faintly related in terms of sequence similarity.

*Comparison with RegulonDB data*. Among other information about transcriptional regulation, RegulonDB [29] provides a list of operon locations in *Escherichia coli* K12. While the majority of operons in RegulonDB are computationally predicted, some are also experimentally confirmed. From 2649 operons reported in RegulonDB, 846 span two or more genes. We mapped these operons to the annotation of the *E. coli* K12 genome in our data set. However, 104 operons contain genes that are not annotated in our dataset and thus were omitted from subsequent analysis. The remaining 742 operons span between 2 and 16 genes, 71.83% of which span 2 or 3 genes. The number of detected gene clusters depends strongly on the degree of evolutionary relatedness between the *E. coli* K12 genome and other genomes in the dataset. While ACSI and RGC predicted many occurrences in other close related *γ*-proteobacteria in our dataset, for the majority of genomes only few occurrences of operons were reported. Additional file 1 Section 4.2, gives an overview of the number of found gene clusters in the dataset. The sets of reported operons found by ACSI and RGC are not entirely overlapping. Instead, ACSI finds operons which RGC does not find and vice versa. A complete overview of unique findings for algorithms and parameter settings is shown in Table 2.

## Conclusions

In this work we introduced a new model to detect gene clusters based on the study of (approximate) weak common intervals in indeterminate strings. In context of gene family-free analysis, we presented a scoring scheme for (approximate) weak common intervals which rates both interval size and the degree of similarity between the contained genes of an (approximate) weak common interval pair. We use our gene family-free model to predict gene clusters between pairs of genomes. This approach is evaluated in comparison with the common intervals-based reference gene cluster model.

In addition to the use case of detecting gene clusters, our algorithms can also be helpful to identify synteneous blocks in a gene family-free analysis. The hierarchical nature of common intervals is maintained in our weak common intervals model, which makes it ideal for studying potential synteneous blocks of arbitrary resolution. The basic concept of common intervals in strings has seen many generalizations in the past years which have greatly increased its utility for biological studies, in particular the simultaneous consideration of more than two strings, requiring common intervals to occur in all or at least a certain number of them. This generalization of (approximate) weak common intervals in indeterminate strings is undoubtedly an interesting direction for future work.

## References

Tamames J, et al: Evolution of gene order conservation in prokaryotes. Genome Biol. 2001, 2 (6): 1-0020.

Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997, 387: 708-713. 10.1038/42711.

Heber S, Stoye J: Algorithms for finding gene clusters. Proceedings of WABI 2001 LNCS. 2001, 2149: 252-263.

Schmidt T, Stoye J: Quadratic time algorithms for finding common intervals in two and more sequences. Proc of CPM 2004 LNCS. 2004, 3109: 347-358.

Heber S, Mayr R, Stoye J: Common intervals of multiple permutations. Algorithmica. 2011, 60 (2): 175-206. 10.1007/s00453-009-9332-1.

Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams. Proceedings of WABI 2002 LNCS. 2002, 2452: 464-476.

He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families. J Comp Biol. 2005, 12 (6): 638-656. 10.1089/cmb.2005.12.638.

Ling X, He X, Xin D: Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics. 2009, 25 (5): 571-10.1093/bioinformatics/btp027.

Rahmann S, Klau GW: Integer linear programs for discovering approximate gene clusters. Proceedings of WABI 2006 LNBI. 2006, 4175: 298-309.

Böcker S, Jahn K, Mixtacki J, Stoye J: Computation of median gene clusters. J Comput Biol. 2009, 16 (8): 1085-1099. 10.1089/cmb.2009.0098.

Jahn K: Efficient computation of approximate gene clusters based on reference occurrences. J Comput Biol. 2011, 18 (9): 1255-1274. 10.1089/cmb.2011.0132.

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41-10.1186/1471-2105-4-41.

Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 2012, 40 (Database): 284-9.

Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV: OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Res. 2011, 39 (Database): 283-8. 10.1093/nar/gkq930.

Shi G, Peng MC, Jiang T: MultiMSOAR 2.0: an accurate tool to identify ortholog groups among multiple genomes. PLoS one. 2011, 6 (6): 20892-10.1371/journal.pone.0020892.

Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13 (9): 2178-2189. 10.1101/gr.1224503.

Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010, 38 (Database): 196-203. 10.1093/nar/gkp931.

Song N, Sedgewick RD, Durand D: Domain architecture comparison for multidomain homology identification. J Comput Biol. 2007, 14 (4): 496-516. 10.1089/cmb.2007.A009.

Joseph JM, Durand D: Family classification without domain chaining. Bioinformatics. 2009, 25 (12): 45-53. 10.1093/bioinformatics/btp207.

Frech C, Chen N: Genome-wide comparative gene family classification. PLoS one. 2010, 5 (10): 13409-10.1371/journal.pone.0013409.

Liu J, Rost B: Domains, motifs and clusters in the protein universe. Current Opinion in Chemical Biology. 2003, 7 (1): 5-11. 10.1016/S1367-5931(02)00003-0.

Holub J, Smyth WF: Algorithms on indeterminate strings. Proc of AWOCA 2003. 2003, 36-45.

Uno T, Yagiura M: Fast algorithms to enumerate all common intervals of two permutations. Algorithmica. 2000, 26 (2): 290-309. 10.1007/s004539910014.

Didier G, Schmidt T, Stoye J, Tsur D: Character sets of strings. J Discr Alg. 2007, 5 (2): 330-340. 10.1016/j.jda.2006.03.021.

Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life. Science. 2006, 311 (5765): 1283-1287. 10.1126/science.1123061.

Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM: Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics. 2008, 9 (Suppl 5): 4-10.1186/1471-2105-9-S5-S4.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.

Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ: Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics. 2011, 12: 124-10.1186/1471-2105-12-124.

Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, Weiss V, Solano-Lira H, Martínez-Flores I, Medina-Rivera A, Salgado-Osorio G, Alquicira-Hernández S, Alquicira-Hernández K, López-Fuentes A, Porrón-Sotelo L, Huerta AM, Bonavides-Martínez C, Balderas-Martínez YI, Pannier L, Olvera M, Labastida A, Jiménez-Jacinto V, Vega-Alvarado L, Del Moral-Chávez V, Hernández-Alvarez A, Morett E, Collado-Vides J: RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res. 2013, 41 (Database): 203-13.

## Acknowledgements

DD receives a scholarship from the CLIB Graduate Cluster Industrial Biotechnology. KJ is funded by DFG grant ST 431/5-1.

**Declarations**

We acknowledge support for the Article Processing Charge by the German Research Foundation and the Open Access Publication Fund of Bielefeld University Library.

This article has been published as part of *BMC Genomics* Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

All authors were involved in the early conception of the project. DD, KJ and JS developed the methods and designed the analysis. DD and KJ performed the evaluation and wrote the manuscript; all authors discussed the results, commented on the manuscript, and read and approved its final version.

## Electronic supplementary material

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

### Cite this article

Doerr, D., Stoye, J., Böcker, S. *et al.* Identifying gene clusters by discovering common intervals in indeterminate strings.
*BMC Genomics* **15, **S2 (2014). https://doi.org/10.1186/1471-2164-15-S6-S2

Published:

DOI: https://doi.org/10.1186/1471-2164-15-S6-S2

### Keywords

- common intervals
- indeterminate strings
- gene cluster detection