Algorithms and implementation
Notation
Throughout the paper, we denote the number of elements of any set A by A, and the direct product set \underset{n}{\underset{\u23df}{A\times \cdots \times A}} by A^{⊗n}. Let X = {1, 2, . . . , M} be the set of SNP loci, and \mathcal{H}=\left\{0,1\right\} be the two haplotypes. It is convenient to introduce a phase vector Φ = φ_{1} ... φ_{
M
}. The pair φ_{
j
} = (φ_{
j0
}, φ_{
j
}_{1}) is referred to as phase, and represents the two alleles of haplotype 0 and 1 at site j, respectively. Because the haplotype assembly problem is trivial for homozygous sites, and because it is usually much easier to determine the genotype than to determine the haplotypes, it is often convenient to restrict the SNP loci X to heterozygous sites. Furthermore, if sequencespecific sequencing errors are not considered, it is convenient to use a simple binary representation of alleles; we randomly assign 0 to one of the two alleles at each heterozygous site j, and 1 to the other allele. In this case, the set of alleles is denoted by Σ = {0, 1}, and the set of possible phases is denoted by Δ = {(0, 1), (1, 0)}. We assume this binary representation throughout the paper.
Let F = {f_{
i
}i = 1, . . . , N} be the set of input fragments which are supposed to be aligned to the reference genome, and each fragment f_{
i
} takes value f_{
ij
} ∈ Σ at locus j ∈ X if a nucleotide is aligned and equal to one of two alleles, and f_{
ij
} = ∅ if fragment f_{
i
} is unaligned, gapped, ambiguous, or a base different from the two alleles, at site j. For any subset X' ⊆ X, we say fragment f_{
i
} spans the sites X' if f_{
ij
} ≠ ∅ for all j ∈ X'. We refer to the subset of X spanned by fragment f as X(f). We say fragment f_{
i
} covers site j if there exists a pair of spanning two different (possible non consecutive) SNP sites j_{1}, j_{2} ∈ X(f_{
i
}) such that j_{1} < j ≤ j_{2}. The set of fragments that cover site j is denoted by F^{c}(j). Further, we refer to the set of all the possible haplotypes for sites X(f_{
i
}) as \mathrm{\Delta}\left({f}_{i}\right)={\mathrm{\Delta}}^{\otimes \leftX\left({f}_{i}\right)\right}.
The SIH problem takes a set of aligned SNP fragments F as input and outputs a hidden phase vector Φ (Figure 1). Because the SIH problem does not associate the inferred haplotypes \mathcal{H}with the real paternal and maternal chromosomes, the switched configuration \stackrel{\u0304}{\mathrm{\Phi}}={\stackrel{\u0304}{\phi}}_{1}\cdots {\stackrel{\u0304}{\phi}}_{M},\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\stackrel{\u0304}{\phi}}_{j}=\left({\stackrel{\u0304}{\phi}}_{j\stackrel{\u0304}{0}},{\stackrel{\u0304}{\phi}}_{j\stackrel{\u0304}{1}}\right) with \stackrel{\u0304}{0}=1 and \stackrel{\u0304}{1}=0, must be regarded as a completely equivalent prediction. Therefore, SIH has no meaning if there is only one heterozygous site, and it is only meaningful if one considers cooccurrences of alleles on the same haplotype for two or more heterozygous sites.
Mixture model
We model the probabilistic distribution of the observed fragments F by
P\left(F\mathrm{\Theta}\right)=\sum _{H\in {\mathcal{H}}^{\otimes N}}\prod _{i=1}^{N}\sum _{{\mathrm{\Phi}}^{\left(i\right)}\in \mathrm{\Delta}\left({f}_{i}\right)}P\left({f}_{i}{h}_{i},\phantom{\rule{2.77695pt}{0ex}}{\mathrm{\Phi}}^{\left(i\right)}\right){p}^{m}\left({h}_{i}\right)P\left({\mathrm{\Phi}}^{\left(i\right)}\right),
P\left({\mathrm{\Phi}}^{\left(i\right)}\right)=\prod _{j\in X\left({f}_{i}\right)}{p}_{j}^{\mathrm{\Phi}}\left({\phi}_{j}^{\left(i\right)}\right),
where Θ represents a set of parameters defined later, Φ^{(i)}∈ Δ(f_{
i
}) represents a partial haplotype reconstruction over the sites X(f_{
i
}) spanned by fragment f_{
i
}, H = h_{1} . . . h_{
N
} where {h}_{i}\in \mathcal{H} represents the haplotype origin of fragment f_{
i
}, p^{m}(h) is the mixture probability of haplotype {h}_{i}\in \mathcal{H}, and {p}_{j}^{\mathrm{\Phi}}\left(\nu \right) is the probability that phase ν ∈ Δ is instantiated at site j. We define the probability of emitting fragment f_{
i
} from haplotype h_{
i
} given a fixed phase vector Φ^{(i)}as follows.
P\left({f}_{i}{h}_{i},\phantom{\rule{2.77695pt}{0ex}}{\mathrm{\Phi}}^{\left(i\right)}\right)=\prod _{j\in X\left({f}_{i}\right)}{p}^{e}\left({f}_{ij}{\phi}_{j{h}_{i}}^{\left(i\right)}\right)
where,
{p}^{e}\left(\sigma {\sigma}^{\prime}\right)=\left\{\begin{array}{cc}\hfill \left(1\alpha \right)\hfill & \hfill \mathsf{\text{for}}\phantom{\rule{2.77695pt}{0ex}}\sigma ={\sigma}^{\prime}\hfill \\ \hfill \alpha \hfill & \hfill \mathsf{\text{for}}\phantom{\rule{2.77695pt}{0ex}}\sigma \ne {\sigma}^{\prime}\hfill \end{array}\right.
is the probability that we observe σ ∈ Σ when the true allele is σ' ∈ Σ and α represents the sequence error rate which we assume is independent of fragments and positions.
We take α as a fixed constant because it is better estimated from other resources rather than from only the bases at the SNP sites. For example, we may estimate α by using the all the read sequences or by using information from other dedicated studies about sequencing and mapping errors. In the following, we use α = 0.1 unless otherwise mentioned and the dependency of the α is described in Additional file 1. We further assume the mixture probabilities are equal, p^{m}(0) = p^{m}(1) = 0.5, as they often converge to around 0.5. Therefore, the parameter set Θ that needs to be optimized consists only of the set of phase probabilities: \mathrm{\Theta}=\left\{{\theta}_{j\nu}\right\}=\left\{{p}_{j}^{\mathrm{\Phi}}\left(\nu \right)\right\}.
Let {\mathcal{I}}_{ihj\nu} be the indicator function that is one if fragment f_{
i
} is derived from haplotype h, X(f_{
i
}) includes j, and the haplotypes have phase ν at site j, and that is zero otherwise. {\mathcal{I}}_{ihj\nu} is uniquely determined if the haplotype origins H = {h_{
i
}i = 1, . . . , N} and phase vectors Ψ = {Φ^{(i)}i = 1, . . . , N} of fragments F are specified. Then the marginalized likelihood P (F Θ) is given by
\begin{array}{c}P\left(F\mathrm{\Theta}\right)=\sum _{H,\mathrm{\Psi}}P\left(F,\phantom{\rule{2.77695pt}{0ex}}H,\phantom{\rule{2.77695pt}{0ex}}\mathrm{\Psi}\mathrm{\Theta}\right),\\ \text{log}\left(P\left(F,\phantom{\rule{2.77695pt}{0ex}}H,\phantom{\rule{2.77695pt}{0ex}}\mathrm{\Psi}\mathrm{\Theta}\right)\right)=N\text{log}\left(0.5\right)+\\ \sum _{i=1}^{N}\sum _{h\in \mathcal{H}}\sum _{j\in X\left({f}_{i}\right)}\sum _{\nu \in \mathrm{\Delta}}{\mathcal{I}}_{ihj\nu}\left[{\mu}_{ihj\nu}+\text{log}{\theta}_{j\nu}\right],\\ {\mu}_{ihj\nu}=\text{log}\left({p}^{e}\left({f}_{ij}{\nu}_{h}\right)\right).\end{array}
We explain the difference between our model and the models of Kim [14] and Li [16] in Additional file 1.
The minimum connectivity score
As described above, the two haplotypes \mathcal{H} in the SIH problem have no particular identity and it is not possible to predict which of them converges to the actual paternal or maternal chromosome. In relation to this, the likelihood function P (F, H, Ψ Θ) has a symmetry between the switched configurations: P\left(F,\stackrel{\u0304}{H},\stackrel{\u0304}{\mathrm{\Psi}}\stackrel{\u0304}{\mathrm{\Theta}}\right)=P\left(F,H,\mathrm{\Psi}\mathrm{\Theta}\right), where \stackrel{\u0304}{H}=\left\{{\stackrel{\u0304}{h}}_{i}i=1,\dots ,N\right) and \stackrel{\u0304}{\mathrm{\Psi}}=\left\{{\stackrel{\u0304}{\mathrm{\Phi}}}^{\left(i\right)}i=1,\dots ,N\right\} represent the configuration that all the haplotype origins of the fragments are exchanged, and \stackrel{\u0304}{\mathrm{\Theta}}=\left\{{\stackrel{\u0304}{\theta}}_{jv}\right\},{\stackrel{\u0304}{\theta}}_{jv}={\theta}_{j\stackrel{\u0304}{v}} are the switched phase probabilities. Therefore, the marginal likelihood P\left(F\mathrm{\Theta}\right)={\sum}_{H,\mathrm{\Psi}}P\left(F,H,\mathrm{\Psi}\mathrm{\Theta}\right) is symmetric for the two parameter sets: P\left(F\stackrel{\u0304}{\mathrm{\Theta}}\right)=P\left(F\mathrm{\Theta}\right).
Suppose that the probabilistic model is optimized for two segments of SNP sites between which there are no connecting fragments, then the association of the haplotypes {0, 1} to the true paternal and maternal chromosomes are selected at random for each segment. Even if there are several connecting fragments, the associations in each segment are determined almost randomly if the number of connecting fragments is not sufficient or there are many conflicting fragments. Such sites often cause switch errors. We define the connectivity at site j_{0} as a log ratio of the marginal log likelihoods:
\mathrm{connectivity}\left({j}_{0}\right)=\text{log}\left(\frac{P\left(F\mathrm{\Theta}\right)}{P\left(F{\mathrm{\Theta}}^{\prime}\right)}\right)=\text{log}\left(\frac{P\left({F}^{c}\left({j}_{0}\right)\mathrm{\Theta}\right)}{P\left({F}^{c}\left({j}_{0}\right){\mathrm{\Theta}}^{\prime}\right)}\right)
Where {\mathrm{\Theta}}^{\prime}=\left\{{\theta}_{jv}^{\prime}\right\} with {\theta}_{jv}^{\prime}={\theta}_{jv} for j <j_{0} and {\theta}_{jv}^{\prime}={\stackrel{\u0304}{\theta}}_{jv} for j ≥ j_{0}. The second equality follows from the symmetry of P (F Θ) described above, and shows that only the fragments covering site j_{0} are necessary to compute the connectivity of site j_{0}. The connectivity measures the resilience of the assembly result against swapping the two haplotypes 0 and 1 in the right part j = j_{0}, . . . , M of the sites. We refer to this change of parameters Θ → Θ' as twisting the parameters at site j_{0}.
For each pair of sites (j_{1}, j_{2}) (j_{1} < j_{2}), we define the minimum connectivity (MC) score as
\mathsf{\text{MC}}\left({j}_{1},\phantom{\rule{2.77695pt}{0ex}}{j}_{2}\right)=\underset{{j}_{1}<j\le {j}_{2}}{\text{min}}\mathsf{\text{connectivity}}\left(j\right).
We extract confidently assembled regions by selecting the pairs (j_{1}, j_{2}) with high MC values. From the above definition, it is obvious that if the MC value is higher than a given threshold for some pair (j_{1}, j_{2}), then all the pairs inside range [j_{1}, j_{2}] have MC values higher than the threshold. In this sense, MC(j_{1}, j_{2}) can be considered as defined on the range [j_{1}, j_{2}].
Variational bayesian inference
We use the VBEM algorithm to optimize the parameters Θ [29]. We approximate the Bayesian posterior distribution P (H, Ψ, ΘF) with factorized variational functions Q(H, Ψ, Θ) = Q^{H}^{Ψ}(H, Ψ) ·Q^{Θ}(Θ) such that the KullbackLeibler divergence KL_{
H
}_{ΨΘ}(Q(H, Ψ, Θ)P (H, Ψ, ΘF)) between the two distributions is minimized. The solution to this optimization problem has the form
\begin{array}{c}{Q}^{H\mathrm{\Psi}}\left(H,\phantom{\rule{2.77695pt}{0ex}}\mathrm{\Psi}\right)=\frac{1}{{Z}^{H\mathrm{\Psi}}}\text{exp}\left(\sum _{i=1}^{N}\sum _{h\in \mathcal{H}}\sum _{j\in X\left({f}_{i}\right)}\sum _{v\in \mathrm{\Delta}}{\mathcal{I}}_{ihjv}\text{log}\left({\beta}_{ihj\nu}\right)\right),\\ {Q}^{\mathrm{\Theta}}\left(\mathrm{\Theta}\right)=\prod _{j=1}^{M}\mathsf{\text{Dir}}\left({\theta}_{j}{\lambda}_{j}\right),\end{array}
where Z^{H}^{Ψ} is a normalization constant, β_{
ihjν
} and λ_{
jν
} represent the hyperparameters that specify the posterior distributions, and Dir(θ_{
j
}λ_{
j
}) is the Dirichlet probability distribution of  Δ parameters. Because Q^{H}^{Ψ}(H, Ψ) and Q^{Θ}(Θ) are connected through the dependencies among the hyperparameters, they cannot be found simultaneously. Therefore, we optimize β_{
ihjν
} and λ_{
jν
} by an iterative method.
In our model, the parameters often converge to suboptimal solutions, because switch errors existing in the suboptimal configurations are not removed by gradual parameter changes. Therefore, we apply a heuristic procedure that reruns the VBEM several times with twisted parameter configurations after every convergence:

1.
Do VBEM and calculate the connectivities for all the sites.

2.
Do another VBEM with a parameter set Λ that is twisted at a site with low connectivity.

3.
Repeat until convergence.
Here, the twist of hyperparameters Λ = {λ_{
jν
}} is defined similarly to that of parameters Θ = {θ_{
jν
}}. We describe the details of this procedure in Additional file 1.
Inferring haplotypes
We set {p}_{j}^{\mathrm{\Phi}}\left(v\right) to the posterior mean estimate of θ_{
jv
} with respect to the converged posterior distribution:
{p}_{j}^{\mathrm{\Phi}}\left(v\right)=\int d\mathrm{\Theta}{\theta}_{jv}{Q}_{\mathrm{\Theta}}\left(\mathrm{\Theta}\right)=\frac{{\lambda}_{jv}}{{\sum}_{{v}^{\prime}}{\lambda}_{j{v}^{\prime}}}.
We select the phase ν at site j for which this {p}_{j}^{\mathrm{\Phi}}\left(v\right) is the highest. We limit the predicted haplotype segments to the regions with high MC values.
Possible extensions of the model
In this paper, we consider only the binary representation of heterozygous sites. We also constrain the error rate to be constant throughout the sequence. However, some of these constraints are easily removed. We can include homozygous sites and four nucleotide alleles by expanding the phase set Δ. For example, the phase set of a multiallelic variant is represented like Δ = {(A,C),(A,G),(C,A),(C,G),(G,A),(G,C)}. We can even include small structural variations if they can be represented by additional allele symbols and the phase set of a structural variant is represented such as Δ_{1} = {(A,),(,A)} for indel and Δ_{2} = {("AC","ACAC"),("ACAC","AC")} for short tandem repeats. With these extensions, the accuracy of genotype calling of multiallelic variants from sequencing data might be improved by considering haplotypes simultaneously [30] and the accuracy and the recall of the haplotype region might be improved because all variant sites add information to infer the derivation of the fragments. Furthermore, we can make the error probability matrix p^{e}(σσ') dependent on the alleles of each fragment, which may be useful for incorporating the quality scores of sequenced reads.