Identifying gene clusters by discovering common intervals in indeterminate strings

Background Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of co-localized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction. Results In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene family-free model against a gene family-based model on a dataset of 93 bacterial genomes. Conclusions Our model is able to detect gene clusters that would be also detected with well-established gene family-based approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene family-based methods due to wrong or deficient gene family assignments.


.1.1 Computing min-rank intervals
Didier et al. [2] give a linear time algorithm to identify all rank intervals in a string, that we use to precompute all (min-) rank intervals for characters in T i S . Their bounds are stored in table Int s.t. entry Int[y] stores the bounds of the min-rank interval for character T i S [y]. Array Int must be updated when proceeding from one level to the next, so that Int[y], always points to the min-rank interval of the currently processed character in T S [y].

Updating array Int
Algorithm S1 updates array Int for the next level. In the first part of the algorithm the min-rank interval of each position y containing the currently processed character j is copied to the position of its rank-nearest successor y if such exists and if Int[y ] ⊂ Int [y]. Now, in the proof of Lemma 1 we showed that for any character j at any position y in T S for which T i S [y] = j, there exists some j ≤ j at some position y such that T i S [y ] = j and the min-rank intervals of j and j are identical. Let us for now assume that j < j. Then j is visited prior to j, therefore the min-rank interval of j is passed from j to a position containing j + 1, to j + 2, etc., eventually reaching character j on position y. Note that such a path must exist if ([i, j] I S , [k, l] T S ) are weak common intervals.
The case where j = j is treated in lines 7 onwards. To this end we iterate once through all positions Pos[j + 1] of the character j + 1 that will be processed next. Whenever the algorithm finds a position y such that T i S [y] = j + 1, previous and subsequent positions in Pos[j + 1] that reside in the corresponding min-rank interval are updated to Int[y].

Computing Succ
In order to identify weak common interval pair ([i, j] I S , [k, l] T S ), the algorithm follows a path of rank-nearest successors from i to j within the bounds of min-rank interval [k, l] T S of character j at a position y, k ≤ y ≤ l. By doing so, the algorithm avoids the costly computation of determining the character set in an interval, which would be otherwise necessary in verifying that all characters in i, . . . , j are contained in a min-rank interval [k, l] T S .
We can adopt Remark 3 from Didier et al. [2]: Algorithm S1 Update algorithm for Int Input: Arrays Int, Pos, and Succ, min-rank string T i S , and current level j Output: Updated Int array for level j + 1 // push rank intervals of j to successor positions 1 end if 25: end for Observation 1 For each min-rank interval [k, l] T S holds that the min-rank distance of any two positions inside the interval is strictly smaller than any min-rank distance of a position inside [k, l] T S to a position outside the interval.
Observation 1 ensures us that a path of rank-nearest successors from i to j along positions in T S will always stay within bounds of min-rank interval [k, l] T S of j for position y as long as Let D T S be the table that stores indeterminate string T S . In each iteration, the rank-nearest successors of all characters in T S are precomputed and stored in a table Succ, which is an analogous table to D T S . That is, the rank-nearest successor of character c at position y in T S is stored at the same index position in Succ as character c in D T S . Thus, the size of Succ is O( T S ).
Since Succ is traversed parallel to D T S , the look-up of the rank-nearest successor of a character  2 Precomputed data structures in the AWCII algorithm Computing RangeContent can be achieved by sweeping again for each j ∈ [i, n] once from left to right and once from right to left through all characters in T S . At first, table RangeContent is initialized with zeros. Every entry in RangeContent serves as counter for the number of characters that are within [i, j] of a candidate interval in T S . In each sweep we keep track of the most recent position p j in which character j has been observed. We process every character c of each position in T S for which i ≤ c ≤ j. Going through all combinations (k, l) of interval bounds in L and R for character c, we increase the counter of [k, l] T S as long as p j ∈ [k, l] T S and the counter has not yet been increased for character j. The latter condition prevents that a character is counted twice, once from a sweep from left to right and then again when sweeping from right to left.

Performance improvements of ACSI
We now describe the runtime heuristic improvements that were made in ACSI (see Algorithm ACSI in main article). For a broad description of the algorithm, see Section A runtime heuristic for discovering approximate weak common intervals in the main article. Instead of iterating through all j ∈ [i, n] in lines 4-6, a heuristic improvement is achieved by choosing only those values J ⊆ [i, n] that are reachable within at most δ indels around position y in T S . To this end, the algorithm finds the leftmost position k <= y in T S such that |{p | k ≤ p ≤ y :    Given chromosomal strings A and B, the stringency filter takes into account gene similarities incident to genes x ∈ A and y ∈ B, when computing a local threshold value for gene similarity σ AB (x, y).
The maximal similarity of a gene x in A is given by σ max AB (x) = max y ∈B (σ AB (x, y )). Likewise for a gene y in B, σ max AB (y) = max x ∈A (σ AB (x , y)).
The stringency parameter f ∈ [0, 1] allows to restrict the set of gene similarities to those with a certain degree of locally minimal weight: The effects in the choice of parameter values for f captured on 3 different properties of our dataset in pairwise comparisons are shown in Table 4.1. Figure S1 gives an overview of the number of found gene clusters in the dataset, in which genomes are ordered according to the cumulative sum of operon findings by ACSI. We differentiated between findings, where the gene set of a predicted gene cluster was a superset of genes of an operon (labeled "100% overlap"), more than 50%, or at most 50% of genes overlapped. In cases where multiple gene clusters in different locations of the other genomes contained operon genes, we chose the prediction with the highest overlap. The graph visualizing the sum over all 100%, > 50%, and ≤ 50% overlaps is labeled as "cumulative". Out of the 742 operons, 633 are entirely overlapped by gene clusters reported by ACSI in all genome comparisons against E. coli K12 in the unpruned graph and 612 operons for a pruned graph with f = 0.9. In comparison, RGC with δ = 0 and s = 2 finds 504 operons, whereas 622 operons are found with parameter settings δ = 2 and s = 3. Figure S1: Number of operons in genomes reported by RGC with parameter settings δ = 0, s = 2 (top left) and δ = 2, s = 3 (bottom left) and ACSI in unpruned graphs (middle) and pruned graphs with f = 0.9 (right) for δ = 0 (top) and δ = 2 (bottom). Genomes are ordered according to the cumulative sum of reported operons.