Notation and outline
Let ∑ = {A, C, G, T} be the alphabet of nucleotides (BAYES HAMMER discards kmers with uncertain bases denoted N). A kmer is an element of ∑^{k}, i.e., a string of k nucleotides. We denote the i^{th} letter (nucleotide) of a kmer x by x[i], indexing them from zero: 0 ≤ i ≤ k  1. A subsequence of x corresponding to a set of indices I is denoted by x[I]. We use interval notation [i, j] for intervals of integers {i, i + 1,..., j} and further abbreviate x[i, j] = x [{i, i + 1,..., j}]; thus, x = x[0, k  1]. Input reads are represented as a set of strings R ⊂ Σ* along with their quality values {\left({q}_{r}\left[i\right]\right)}_{i=0}^{\leftr\right1} for each r ∈ R. We assume that q_{
r
}[i] estimates the probability that there has been an error in position i of read r. Notice that in practice, the fastq file format [11] contains characters that encode probabilities on a logarithmic scale (in particular, products of probabilities used below correspond to sums of actual quality values).
Below we give an overview of BAYES HAMMER workflow (Figure 2) and refer to subsequent sections for further details. On Step (1), kmers in the reads are counted, producing a triple statistics(x) = (count_{
x
}, quality_{
x
}, error_{
x
}) for each kmer x. Here, count_{
x
} is the number of times x appears as a substring in the reads, quality_{
x
} is its total quality expressed as a probability of sequencing error in x, and error_{
x
} is a kdimensional vector that contains products of error probabilities (sums of quality values) for individual nucleotides of x across all its occurrences in the reads. On Step (2), we find connected components of the Hamming graph constructed from this set of kmers. On Step (3), the connected components become subject to Bayesian subclustering; as a result, for each kmer we know the center of its subcluster. On Step (4), we filter subcluster centers according to their total quality and form a set of solid kmers which is then iteratively expanded on Step (5) by mapping them back to the reads. Step (6) deals with reads correction by counting the majority vote of solid kmers in each read. In the iterative version, if there has been a substantial amount of changes in the reads, we run the next iteration of error correction; otherwise, output the corrected reads. Below we describe specific algorithms employed in the BAYES HAMMER pipeline.
Algorithms
Step (1): computing kmer statistics
To collect kmer statistics, we use a straightforward hash map approach [12] that does not require storing instances of all kmers in memory (as excessive amount of RAM might be needed otherwise). For a certain positive integer N (the number of auxiliary files), we use a hash function h: ∑^{k} →ℤ_{
N
} that maps kmers over the alphabet Σ to integers from 0 to N  1.
Algorithm 1 Count kmers
for each kmer x from the reads R: do
compute h(x) and write x to File_{h(x)}.
for i ∈ [0, N  1]: do
sort File_{
i
} with respect to the lexicographic order;
reading File_{
i
} sequentially, compute statistics(s) for each kmer s from File_{
i
}.
Step (2): constructing connected components of Hamming graph
Step (2) is the essence of the HAMMER approach [8]. The Hamming distance between kmers x, y ∈ ∑^{k} is the number of nucleotides in which they differ:
\mathsf{\text{d}}\left(x,y\right)=\left\left\{i\in \left[0,k1\right]:x\left[i\right]\ne y\left[i\right]\right\}\right.
For a set of kmers X, the Hamming graph HG_{
τ
}(X) is an undirected graph with the set of vertices X and edges corresponding to pairs of kmers from X with Hamming distance at most τ, i.e., x, y ∈ X are connected by an edge in HG_{
τ
}(X) iff d(x, y) ≤ τ (Figure 3). To construct HG_{
τ
}(X) efficiently, we notice that if two kmers are at Hamming distance at most τ, and we partition the set of indices [0,k  1] into τ + 1 parts, then at least one part corresponds to the same subsequence in both kmers. Below we assume with little loss of generality that τ + 1 divides k, i.e., k = σ (τ + 1) for some integer σ.
For a subset of indices I ⊆ [0, k  1], we define a partial lexicographic ordering ≺_{
I
} as follows: x ≺_{
I
} y iff x[I] ≺ y[I], where ≺ is the lexicographic ordering on Σ*. Similarly, we define a partial equality =_{
I
} such that x =_{
I
} y iff x[I] = y[I]. We partition the set of indices [0, k  1] into τ + 1 parts of size σ and for each part I, sort a separate copy of X with respect to ≺_{
I
}. As noticed above, for every two kmers x, y ∈ X with d(x, y) ≤ τ, there exists a part I such that x =_{
I
} y. It therefore suffices to separately consider blocks of equivalent kmers with respect to =_{
I
} for each part I. If a block is small (i.e., of size smaller than a certain threshold), we go over the pairs of kmers in this block to find those with Hamming distance at most τ. If a block is large, we recursively apply to it the same procedure with a different partition of the indices. In practice, we use two different partitions of [0, k  1]: the first corresponds to contigious subsets of indices (recall that \sigma =\frac{k}{\tau +1}):
Algorithm 2 Hamming graph processing
procedure HGPROCESS(X, max_quadratic)
Init components with singletons \mathcal{X}=\left\{\left\{x\right\}:x\in X\right\}.
for all ϒ ∈ FindBlocks\left(X,{\left\{{I}_{s}^{\mathsf{\text{cnt}}}\right\}}_{s=0}^{\tau}\right)do
if ϒ > max_quadratic then
for all Z ∈ FindBlocks \left(\Upsilon ,{\left\{{I}_{s}^{\mathsf{\text{str}}}\right\}}_{s=0}^{\tau}\right)do
ProcessExhaustively\left(Z,\mathcal{X}\right)
else
ProcessExhaustively\left(\Upsilon ,\mathcal{X}\right).
function FindBlocks \left(X,{\left\{{I}_{s}\right\}}_{s=0}^{\tau}\right)
for s = 0,...,τ do
sort a copy of X with respect to {\prec}_{{I}_{s}}, getting X_{
s
}.
for s = 0,...,τ do
output the set of equiv. blocks \left\{\Upsilon \right\}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{w}}.\mathsf{\text{r}}.\mathsf{\text{t}}.{=}_{{I}_{s}}.
procedure PROCESS EXHAUSTIVELY\left(\Upsilon ,\mathcal{X}\right)
for each pair x, y ∈ ϒ do
if d(x, y) ≤ τ then join their sets in \mathcal{X}:
for all x\in {Z}_{x}\in \mathcal{X},y\in {Z}_{y}\in \mathcal{X} do
\mathcal{X}:=\mathcal{X}\cup \left\{{Z}_{x}\cup {Z}_{y}\right\}\backslash \left\{{Z}_{x},{Z}_{y}\right\}.
{I}_{s}^{\mathsf{\text{cnt}}}=\left\{s\sigma ,s\sigma +1,\dots ,s\sigma +\sigma 1\right\},\phantom{\rule{1em}{0ex}}s=0,\dots ,\tau ,
while the second corresponds to strided subsets of indices:
{I}_{s}^{\mathsf{\text{str}}}=\left\{s,s+\tau +1,s+2\left(\tau +1\right),\dots ,s+\left(\sigma 1\right)\left(\tau +1\right)\right\},\phantom{\rule{1em}{0ex}}s=0,\dots ,\tau .
BAYES HAMMER uses a twostep procedure, first splitting with respect to {\left\{{I}_{s}^{\mathsf{\text{cnt}}}\right\}}_{s=0}^{\tau} (Figure 4) and then, if an equivalence block is large, with respect to {\left\{{I}_{s}^{\mathsf{\text{str}}}\right\}}_{s=0}^{\tau}. On the block processing step, we use the disjoint set data structure [12] to maintain the set of connected components. Step (2) is summarized in Algorithm 2.
Step (3): Bayesian subclustering
In HAMMER's generative model [8], it is assumed that errors in each position of a kmer are independent and occur with the same probability ε, which is a fixed global parameter (HAMMER used ε = 0.01). Thus, the likelihood that a kmer x was generated from a kmer y under HAMMER's model equals
{L}_{{\mathsf{\text{H}}}_{\mathsf{\text{AMMER}}}}\left(xy\right)={\left(1\epsilon \right)}^{kd\left(x,y\right)}{\epsilon}^{d\left(x,y\right)}.
Under this model, the maximum likelihood center of a cluster is simply its consensus string [8].
In BAYES HAMMER, we further elaborate upon HAMMER's model. Instead of a fixed ε, we use reads quality values that approximate probabilities q_{
x
}[i] of a nucleotide at position i in the kmer x being erroneous. We combine quality values from identical kmers in the reads: for a multiset of kmers X that agree on the j^{th} nucleotide, it is erroneous with probability Π_{x∈X}q_{
x
}[j].
The likelihood that a kmer x has been generated from another kmer c (under the independent errors assumption) is given by
L\left(xc\right)={\displaystyle \prod _{j:x\left[j\right]\ne c\left[j\right]}}{q}_{x}\left[j\right]{\displaystyle \prod _{j:x\left[j\right]=c\left[j\right]}}\left(1{q}_{x}\left[j\right]\right),
and the likelihood of a specific subclustering C = C_{1} ∪... ∪ C_{
m
} is
{L}_{m}\left({C}_{1},\dots ,{C}_{m}\right)=\prod _{i=1}^{m}\prod _{x\in {C}_{i}}L\left(x{c}_{i}\right)
where c_{
i
} is the center (consensus string) of the subcluster C_{
i
}.
In the subclustering procedure (see Algorithm 3), we sequentially subcluster each connected component of the Hamming graph into more and more clusters with the classical kmeans clustering algorithm (denoted mmeans since k has different meaning). For the objective function, we use the likelihood as above penalized for overfitting with the Bayesian information criterion (BIC) [13]. In this case, there are C observations in the dataset, and the total number of parameters is 3 km + m  1:

m  1 for probabilities of subclusters,

km for cluster centers, and

2 km for error probabilities in each letter: there are 3 possible errors for each letter, and the probabilities should sum up to one. Here error probabilities are conditioned on the fact that an error has occurred (alternatively, we could consider the entire distribution, including the correct letter, and get 3 km parameters for probabilities but then there would be no need to specify cluster centers, so the total number is the same).
Algorithm 3 Bayesian subclustering
for all connected components C of the Hamming graph do
m := 1
ℓ_{1} := 2 log L_{1}(C) (likelihood of the cluster generated by the consensus)
repeat
m := m + 1
do mmeans clustering of C = C_{1} ∪...∪ C_{
m
} w.r.t. the Hamming distance; the initial approximation to the centers is given by kmers that have the least error probability
ℓ_{
m
} := 2 · log L_{
m
}(C_{1},...,C_{
m
}) (3 km + m  1) · log C
until
ℓ
_{
m
}
≤
ℓ
_{m1}
output the best found clustering C = C_{1} ∪...∪ C_{m1}
Therefore, the resulting objective function is
{\ell}_{m}:=2\cdot \mathsf{\text{log}}{L}_{m}\left({C}_{1},\dots ,{C}_{m}\right)\left(3km+m1\right)\cdot \mathsf{\text{log}}\leftC\right
for subclustering into m clusters; we stop as soon as ℓ_{
m
} ceases to increase.
Steps (4) and (5): selecting solid kmers and expanding the set of solid kmers
We define the quality of a kmer x as the probability that it is errorfree: {p}_{x}={\prod}_{j=0}^{k1}\left(1{q}_{x}\left[j\right]\right). The kmer qualities are computed on Step (1) along with computing kmer statistics. Next, we (generously) define the quality of a cluster C as the probability that at least one kmer in C is correct:
{p}_{C}=1\prod _{x\in C}\left(1{p}_{x}\right).
In contrast to HAMMER, we do not distinguish whether the cluster is a singleton (i.e., C = 1); there may be plenty of superfluous clusters with several kmers obtained by chance (actually, it is more likely to obtain a cluster of several kmers by chance than a singleton of the same total multiplicity).
Initially we mark as solid the centers of the clusters whose total quality exceeds a predefined threshold (a global parameter for BAYES HAMMER, set to be rather strict). Then we expand the set of solid kmers iteratively: if a read is completely covered by solid kmers we conclude that it actually comes from the genome and mark all other kmers in this read as solid, too (Algorithm 4).
Step (6): reads correction
After Steps (1)(5), we have constructed the set of solid kmers that are presumably errorfree. To construct corrected reads from the set of solid kmers, for each base of every read, we compute the consensus of all solid kmers and solid centers of clusters of all nonsolid kmers covering this base (Figure 5). This step is formally described as Algorithm 5.
Algorithm 4 Solid kmers expansion
procedure ITERATIVE EXPANSION(R, X)
while ExpansionStep(R, X) do
function EXPANSION STEP(R, X)
for all reads r ∈ R do
if r is completely covered by solid kmers then
mark all kmers in r as solid
Return TRUE if X has increased and FALSE otherwise.
Algorithm 5 Reads correction
Input: reads R, solid kmers X, clusters \mathcal{C}.
for all reads r ∈ R do
init consensus array υ: [0, r  1] × {A, C, G, T} → ℕ with zeros: υ(j, x[i]):= 0 for all i = 0,...,r  1 and j = 0,...,k  1
for i = 0,...,r  k do
if r[i, i + k  1] ∈ X (it is solid) then
for j ∈ [i, i + k  1] do
υ(j, r[i]):= υ(j, r[i]) + 1
if r[i, i + k  1] ∈ C for some C ∈ \mathcal{C}then
let x be the center of C
if x ∈ X (r belongs to a cluster with solid center) then
for j ∈ [i, i + k  1] do
υ(j, x[i]):= υ(j, x[i]) + 1
for i ∈ [0, r  1] do
r[i]:= arg max_{a∈Σ}υ(i, a).