Algorithms and implementation
Notation
Throughout the paper, we denote the number of elements of any set A by |A|, and the direct product set by A⊗n. Let X = {1, 2, . . . , M} be the set of SNP loci, and be the two haplotypes. It is convenient to introduce a phase vector Φ = φ1 ... φ
M
. The pair φ
j
= (φ
j0
, φ
j
1) is referred to as phase, and represents the two alleles of haplotype 0 and 1 at site j, respectively. Because the haplotype assembly problem is trivial for homozygous sites, and because it is usually much easier to determine the genotype than to determine the haplotypes, it is often convenient to restrict the SNP loci X to heterozygous sites. Furthermore, if sequence-specific sequencing errors are not considered, it is convenient to use a simple binary representation of alleles; we randomly assign 0 to one of the two alleles at each heterozygous site j, and 1 to the other allele. In this case, the set of alleles is denoted by Σ = {0, 1}, and the set of possible phases is denoted by Δ = {(0, 1), (1, 0)}. We assume this binary representation throughout the paper.
Let F = {f
i
|i = 1, . . . , N} be the set of input fragments which are supposed to be aligned to the reference genome, and each fragment f
i
takes value f
ij
∈ Σ at locus j ∈ X if a nucleotide is aligned and equal to one of two alleles, and f
ij
= ∅ if fragment f
i
is unaligned, gapped, ambiguous, or a base different from the two alleles, at site j. For any subset X' ⊆ X, we say fragment f
i
spans the sites X' if f
ij
≠ ∅ for all j ∈ X'. We refer to the subset of X spanned by fragment f as X(f). We say fragment f
i
covers site j if there exists a pair of spanning two different (possible non consecutive) SNP sites j1, j2 ∈ X(f
i
) such that j1 < j ≤ j2. The set of fragments that cover site j is denoted by Fc(j). Further, we refer to the set of all the possible haplotypes for sites X(f
i
) as .
The SIH problem takes a set of aligned SNP fragments F as input and outputs a hidden phase vector Φ (Figure 1). Because the SIH problem does not associate the inferred haplotypes with the real paternal and maternal chromosomes, the switched configuration with and , must be regarded as a completely equivalent prediction. Therefore, SIH has no meaning if there is only one heterozygous site, and it is only meaningful if one considers co-occurrences of alleles on the same haplotype for two or more heterozygous sites.
Mixture model
We model the probabilistic distribution of the observed fragments F by
where Θ represents a set of parameters defined later, Φ(i)∈ Δ(f
i
) represents a partial haplotype reconstruction over the sites X(f
i
) spanned by fragment f
i
, H = h1 . . . h
N
where represents the haplotype origin of fragment f
i
, pm(h) is the mixture probability of haplotype , and is the probability that phase ν ∈ Δ is instantiated at site j. We define the probability of emitting fragment f
i
from haplotype h
i
given a fixed phase vector Φ(i)as follows.
where,
is the probability that we observe σ ∈ Σ when the true allele is σ' ∈ Σ and α represents the sequence error rate which we assume is independent of fragments and positions.
We take α as a fixed constant because it is better estimated from other resources rather than from only the bases at the SNP sites. For example, we may estimate α by using the all the read sequences or by using information from other dedicated studies about sequencing and mapping errors. In the following, we use α = 0.1 unless otherwise mentioned and the dependency of the α is described in Additional file 1. We further assume the mixture probabilities are equal, pm(0) = pm(1) = 0.5, as they often converge to around 0.5. Therefore, the parameter set Θ that needs to be optimized consists only of the set of phase probabilities: .
Let be the indicator function that is one if fragment f
i
is derived from haplotype h, X(f
i
) includes j, and the haplotypes have phase ν at site j, and that is zero otherwise. is uniquely determined if the haplotype origins H = {h
i
|i = 1, . . . , N} and phase vectors Ψ = {Φ(i)|i = 1, . . . , N} of fragments F are specified. Then the marginalized likelihood P (F| Θ) is given by
We explain the difference between our model and the models of Kim [14] and Li [16] in Additional file 1.
The minimum connectivity score
As described above, the two haplotypes in the SIH problem have no particular identity and it is not possible to predict which of them converges to the actual paternal or maternal chromosome. In relation to this, the likelihood function P (F, H, Ψ| Θ) has a symmetry between the switched configurations: , where and represent the configuration that all the haplotype origins of the fragments are exchanged, and are the switched phase probabilities. Therefore, the marginal likelihood is symmetric for the two parameter sets: .
Suppose that the probabilistic model is optimized for two segments of SNP sites between which there are no connecting fragments, then the association of the haplotypes {0, 1} to the true paternal and maternal chromosomes are selected at random for each segment. Even if there are several connecting fragments, the associations in each segment are determined almost randomly if the number of connecting fragments is not sufficient or there are many conflicting fragments. Such sites often cause switch errors. We define the connectivity at site j0 as a log ratio of the marginal log likelihoods:
Where with for j <j0 and for j ≥ j0. The second equality follows from the symmetry of P (F| Θ) described above, and shows that only the fragments covering site j0 are necessary to compute the connectivity of site j0. The connectivity measures the resilience of the assembly result against swapping the two haplotypes 0 and 1 in the right part j = j0, . . . , M of the sites. We refer to this change of parameters Θ → Θ' as twisting the parameters at site j0.
For each pair of sites (j1, j2) (j1 < j2), we define the minimum connectivity (MC) score as
We extract confidently assembled regions by selecting the pairs (j1, j2) with high MC values. From the above definition, it is obvious that if the MC value is higher than a given threshold for some pair (j1, j2), then all the pairs inside range [j1, j2] have MC values higher than the threshold. In this sense, MC(j1, j2) can be considered as defined on the range [j1, j2].
Variational bayesian inference
We use the VBEM algorithm to optimize the parameters Θ [29]. We approximate the Bayesian posterior distribution P (H, Ψ, Θ|F) with factorized variational functions Q(H, Ψ, Θ) = QHΨ(H, Ψ) ·QΘ(Θ) such that the Kullback-Leibler divergence KL
H
ΨΘ(Q(H, Ψ, Θ)||P (H, Ψ, Θ|F)) between the two distributions is minimized. The solution to this optimization problem has the form
where ZHΨ is a normalization constant, β
ihjν
and λ
jν
represent the hyperparameters that specify the posterior distributions, and Dir(θ
j
|λ
j
) is the Dirichlet probability distribution of | Δ| parameters. Because QHΨ(H, Ψ) and QΘ(Θ) are connected through the dependencies among the hyperparameters, they cannot be found simultaneously. Therefore, we optimize β
ihjν
and λ
jν
by an iterative method.
In our model, the parameters often converge to sub-optimal solutions, because switch errors existing in the sub-optimal configurations are not removed by gradual parameter changes. Therefore, we apply a heuristic procedure that re-runs the VBEM several times with twisted parameter configurations after every convergence:
-
1.
Do VBEM and calculate the connectivities for all the sites.
-
2.
Do another VBEM with a parameter set Λ that is twisted at a site with low connectivity.
-
3.
Repeat until convergence.
Here, the twist of hyperparameters Λ = {λ
jν
} is defined similarly to that of parameters Θ = {θ
jν
}. We describe the details of this procedure in Additional file 1.
Inferring haplotypes
We set to the posterior mean estimate of θ
jv
with respect to the converged posterior distribution:
We select the phase ν at site j for which this is the highest. We limit the predicted haplotype segments to the regions with high MC values.
Possible extensions of the model
In this paper, we consider only the binary representation of heterozygous sites. We also constrain the error rate to be constant throughout the sequence. However, some of these constraints are easily removed. We can include homozygous sites and four nucleotide alleles by expanding the phase set Δ. For example, the phase set of a multi-allelic variant is represented like Δ = {(A,C),(A,G),(C,A),(C,G),(G,A),(G,C)}. We can even include small structural variations if they can be represented by additional allele symbols and the phase set of a structural variant is represented such as Δ1 = {(A,-),(-,A)} for indel and Δ2 = {("AC","ACAC"),("ACAC","AC")} for short tandem repeats. With these extensions, the accuracy of genotype calling of multi-allelic variants from sequencing data might be improved by considering haplotypes simultaneously [30] and the accuracy and the recall of the haplotype region might be improved because all variant sites add information to infer the derivation of the fragments. Furthermore, we can make the error probability matrix pe(σ|σ') dependent on the alleles of each fragment, which may be useful for incorporating the quality scores of sequenced reads.