Volume 14 Supplement 2
Selected articles from ISCB-Asia 2012
MixSIH: a mixture model for single individual haplotyping
Hirotaka Matsumoto^{1} and
Hisanori Kiryu^{1}
DOI: 10.1186/1471-2164-14-S2-S5
© Matsumoto and Kiryu; licensee BioMed Central Ltd. 2013
Published: 15 February 2013
Abstract
Background
Haplotype information is useful for various genetic analyses, including genome-wide association studies. Determining haplotypes experimentally is difficult, and there are several computational approaches that infer haplotypes from genomic data. Among such approaches, single individual haplotyping or haplotype assembly, which infers the two haplotypes of an individual from aligned sequence fragments, has been attracting considerable attention. To avoid incorrect results in downstream analyses, it is important not only to assemble haplotypes as long as possible but also to provide means to extract highly reliable haplotype regions. Although there are several efficient algorithms for solving haplotype assembly, there is no efficient method that allows the regions assembled with high confidence to be extracted.
Results
We develop a probabilistic model, called MixSIH, for solving the haplotype assembly problem. The model has two mixture components representing two haplotypes. Based on the optimized model, a quality score is defined, which we call the 'minimum connectivity' (MC) score, for each segment in the haplotype assembly. Because existing accuracy measures for haplotype assembly are designed to compare the efficiency between the algorithms and are not suitable for evaluating the quality of the set of partially assembled haplotype segments, we develop an accuracy measure based on the pairwise consistency and evaluate the accuracy on the simulation and real data. By using the MC scores, our algorithm can extract highly accurate haplotype segments. We also show evidence that an existing experimental dataset contains chimeric read fragments derived from different haplotypes, which significantly degrade the quality of assembled haplotypes.
Conclusions
We develop a novel method for solving the haplotype assembly problem. We also define a quality score, based on our model, that indicates the accuracy of the haplotype segments. In our evaluation, MixSIH successfully extracted reliable haplotype segments. The C++ source code of MixSIH is available at https://sites.google.com/site/hmatsu1226/software/mixsih.
Introduction
Human somatic cells are diploid and contain two homologous copies of each chromosome, one derived from the paternal and one from the maternal chromosome. The two chromosomes differ at a number of loci and the most abundant type of variation is single nucleotide polymorphism (SNP). Most current research does not determine the chromosomal origin of the variations and uses only genotype information for the analyses. However, haplotype information is valuable for genome-wide association studies (GWAS) [1] and for analyzing genetic structures such as linkage disequilibrium, recombination patterns [2], and correlations between variations and diseases [3].
Let us consider a simple example to demonstrate the importance of haplotype information. Suppose that in a gene coding region, there are two SNP loci, each of which has an independent deleterious mutation in either one of the two homologous chromosomes. If both of the two deleterious mutations are located on the same chromosome, the other chromosome can produce normal proteins. On the other hand, if each chromosome contains either one of the two deleterious mutations, the cells cannot produce normal proteins. It is not possible to distinguish these two cases with only genotype information.
There is a group of algorithms for haplotype inference that statistically construct a set of haplotypes from population genotypes [4–8] (for a review, see [9]). These algorithms have been developed in response to technological advances such as SNP arrays that efficiently measure personal genotypes at a genomic scale. The algorithms infer haplotype blocks based on the assumption that the variety of combinations of alleles is very limited. Therefore, these algorithms fail to identify correct haplotypes in regions with low linkage disequilibrium (LD) where there are frequent recombination events. These algorithms also cannot identify spontaneous mutations. These difficulties are partially resolved by using genotypes of pedigrees. However, family data are not always available, and furthermore, they cannot determine the haplotypes of the loci at which all the family members have the same genotype.
Single individual haplotyping (SIH) algorithms did not attract much attention until recently, since the read fragments of next-generation sequencing experiments are not long enough to span multiple heterozygous loci, which occur only about once per kilobase on average [18], and Sanger sequencing, which produces long read fragments, is too expensive to be conducted at a genomic scale. However, this situation is changing rapidly with the advent of real-time single-molecule sequencing technologies, which are able to sequence DNA fragments as long as 50 kilobases [19], and with the development of a novel experimental technique called 'fosmid pool-based next-generation sequencing' [13, 20, 21], which randomly assigns a barcode to each read cluster that is derived from the same region in the same chromosome. Because of these advances in experimental techniques, SIH has emerged as one of the most promising approaches for analyzing the haplotype structures of diploid organisms.
Haplotype information that contains errors is likely to lead to wrong results in downstream analyses. For example, in detecting recombination events from parent-offspring haplotypes [22], haplotyping errors are mistakenly regarded as recombination events. Another example is that haplotyping errors considerably decrease the detection power of amplified haplotypes in cancer [23] and fetus haplotypes [24]. To use haplotype information in downstream analyses while avoiding such harmful influence of haplotyping errors, it is important not only to assemble haplotypes as long as possible but also to provide means to extract highly reliable haplotype regions. In statistical haplotype phasing, reliable haplotype regions are determined by selecting the blocks of limited haplotype diversity and level of LD [25–27]. Although there are many algorithms for SIH, none of these algorithms can provide confidence scores to extract reliable haplotype regions.
The algorithms for SIH are classified into two strategies: most of the previous algorithms use deterministic strategies [10–13, 15, 17], but a few take a probabilistic modeling approach [14, 16]. The deterministic algorithms usually include solving the MAX-CUT problem of graph theory [28] in their computational procedures in order to partition the set of the input fragments into two groups representing the two haplotypes. Because these algorithms are designed to optimize only a certain global score function that measures the number of inconsistent fragments and do not model the fragments and haplotypes themselves, it is difficult to produce confidence scores for each region of the assembled haplotypes.
On the other hand, the probabilistic approaches of Kim [14] and Li [16] assume that each observed fragment is sampled from one of the two unobserved haplotypes. Unlike the deterministic approaches, probabilistic models allow the computation of various expected values and confidence values from the Bayesian posterior distributions. For example, Kim [14] and Li [16] defined a confidence value for the haplotype reconstruction of each segment of SNP loci. Unfortunately, those researchers chose a model structure for which the exact computation of the likelihood is extremely computationally intensive. Because the complexity of the required summation over haplotypes is exponential in the number of SNP sites, only the posterior probabilities of the haplotypes for neighboring loci are considered. The complete haplotypes are reconstructed by connecting plausible haplotypes of neighboring pairs according to their posterior probabilities. Hence, their approach cannot take into account the full information of fragments that span three or more SNP loci. Their confidence scores for haplotype segments also include a summation over all the possible haplotypes, and it is not possible to compute their confidence scores for all the possible segments in the assembled haplotypes.
In this paper, we develop a novel probabilistic SIH model that is very different from the probabilistic models of Kim [14] and Li [16]. Our model takes a 'mixture model' approach: each fragment is emitted completely independently of the other fragments. In contrast, Kim [14] and Li [16] took a 'hidden variables' approach: all the fragments are correlated through hidden haplotype variables (see Additional file 1 for further explanation). This difference allows us to compute the likelihood with a computational time proportional to the total length of the input fragments. We use the variational Bayes expectation maximization (VBEM) algorithm [29] to compute the approximate posterior distribution of the haplotypes. By using the optimized distribution, we compute the 'minimum connectivity' (MC) score for each segment in the reconstructed haplotypes; this measures whether the segment is free from switch errors. We show that we can extract accurately assembled regions by selecting regions with high MC scores. We also analyze a recent dataset from fosmid pool-based next-generation sequencing and find evidence that the processed dataset contains chimeric fragments derived from the erroneous merging of read clusters in different haplotypes, which degrades the quality of assembled haplotypes significantly.
Methods
Algorithms and implementation
Notation
Throughout the paper, we denote the number of elements of any set A by |A|, and the direct product set $\underbrace{A\times \cdots \times A}_{n}$ by $A^{\otimes n}$. Let X = {1, 2, . . . , M} be the set of SNP loci, and $\mathcal{H}=\{0,1\}$ be the two haplotypes. It is convenient to introduce a phase vector $\Phi = \phi_{1} \cdots \phi_{M}$. The pair $\phi_{j} = (\phi_{j0}, \phi_{j1})$ is referred to as the phase, and represents the two alleles of haplotypes 0 and 1 at site j, respectively. Because the haplotype assembly problem is trivial for homozygous sites, and because it is usually much easier to determine the genotype than to determine the haplotypes, it is often convenient to restrict the SNP loci X to heterozygous sites. Furthermore, if sequence-specific sequencing errors are not considered, it is convenient to use a simple binary representation of alleles; we randomly assign 0 to one of the two alleles at each heterozygous site j, and 1 to the other allele. In this case, the set of alleles is denoted by Σ = {0, 1}, and the set of possible phases is denoted by Δ = {(0, 1), (1, 0)}. We assume this binary representation throughout the paper.
Let $F = \{f_{i} \mid i = 1, \ldots, N\}$ be the set of input fragments, which are assumed to be aligned to the reference genome. Each fragment $f_{i}$ takes value $f_{ij} \in \Sigma$ at locus j ∈ X if a nucleotide is aligned and equal to one of the two alleles, and $f_{ij} = \emptyset$ if fragment $f_{i}$ is unaligned, gapped, ambiguous, or a base different from the two alleles at site j. For any subset X' ⊆ X, we say fragment $f_{i}$ spans the sites X' if $f_{ij} \neq \emptyset$ for all j ∈ X'. We refer to the subset of X spanned by fragment f as X(f). We say fragment $f_{i}$ covers site j if there exists a pair of two different (possibly non-consecutive) spanned SNP sites $j_{1}, j_{2} \in X(f_{i})$ such that $j_{1} < j \le j_{2}$. The set of fragments that cover site j is denoted by $F^{c}(j)$. Further, we refer to the set of all the possible haplotypes for sites $X(f_{i})$ as $\Delta(f_{i}) = \Delta^{\otimes |X(f_{i})|}$.
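The definitions of spanning and covering can be made concrete with a small sketch. The dictionary representation of a fragment and the function names below are illustrative assumptions, not part of MixSIH; a site with $f_{ij} = \emptyset$ is simply absent from the dictionary.

```python
from typing import Dict, List

# Hypothetical representation: a fragment maps a site index j to the observed
# binary allele; sites without a usable call (the empty value) are omitted.
Fragment = Dict[int, int]


def spanned_sites(f: Fragment) -> List[int]:
    """X(f): the sorted sites at which the fragment carries an allele."""
    return sorted(f)


def covers(f: Fragment, j: int) -> bool:
    """True if spanned sites j1 < j <= j2 exist, i.e. min X(f) < j <= max X(f)."""
    if len(f) < 2:
        return False
    return min(f) < j <= max(f)


def covering_fragments(fragments: List[Fragment], j: int) -> List[int]:
    """F^c(j): indices of the fragments that cover site j."""
    return [i for i, f in enumerate(fragments) if covers(f, j)]


fragments = [{1: 0, 2: 1, 4: 1}, {3: 0, 4: 0}, {5: 1}]
print(covering_fragments(fragments, 3))  # [0]
print(covering_fragments(fragments, 4))  # [0, 1]
```

Note that the first fragment covers site 3 without spanning it, which is exactly the case the definition of covering is meant to include.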
The SIH problem takes a set of aligned SNP fragments F as input and outputs a hidden phase vector Φ (Figure 1). Because the SIH problem does not associate the inferred haplotypes $\mathcal{H}$ with the real paternal and maternal chromosomes, the switched configuration $\bar{\Phi} = \bar{\phi}_{1} \cdots \bar{\phi}_{M}$, $\bar{\phi}_{j} = (\phi_{j\bar{0}}, \phi_{j\bar{1}})$ with $\bar{0}=1$ and $\bar{1}=0$, must be regarded as a completely equivalent prediction. Therefore, SIH has no meaning if there is only one heterozygous site, and it is only meaningful if one considers co-occurrences of alleles on the same haplotype for two or more heterozygous sites.
Mixture model
Here, $p^{e}(\sigma \mid \sigma')$ is the probability that we observe σ ∈ Σ when the true allele is σ' ∈ Σ, and α represents the sequence error rate, which we assume is independent of fragments and positions.
We take α as a fixed constant because it is better estimated from other resources than from only the bases at the SNP sites. For example, we may estimate α by using all the read sequences or by using information from other dedicated studies about sequencing and mapping errors. In the following, we use α = 0.1 unless otherwise mentioned; the dependency of the results on α is described in Additional file 1. We further assume the mixture probabilities are equal, $p^{m}(0) = p^{m}(1) = 0.5$, as they often converge to around 0.5. Therefore, the parameter set Θ that needs to be optimized consists only of the set of phase probabilities: $\Theta = \{\theta_{j\nu}\} = \{p_{j}^{\Phi}(\nu)\}$.
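As an illustration only, the following sketch evaluates a two-component mixture likelihood for a single fragment. The model equations are not reproduced in this text, so the factorised form used here is an assumption that is merely consistent with the description above: equal mixture weights, a fixed error rate α, independent phase probabilities $\theta_{j\nu}$ at each site, and an error model that returns 1 − α on a match and α on a mismatch. The names `emission` and `fragment_likelihood` and the data layout are hypothetical and are not taken from the MixSIH source code.

```python
# Phase set Delta in the binary representation; PHASES[v][h] is the allele
# carried by haplotype h under phase v.
PHASES = [(0, 1), (1, 0)]


def emission(observed, true_allele, alpha):
    """Assumed error model p^e: 1 - alpha on a match, alpha on a mismatch."""
    return 1.0 - alpha if observed == true_allele else alpha


def fragment_likelihood(fragment, theta, alpha=0.1):
    """Assumed mixture likelihood of one fragment.

    fragment: dict site -> observed allele (0 or 1)
    theta:    dict site -> [P(phase (0,1)), P(phase (1,0))]
    The cost is linear in the number of sites the fragment spans.
    """
    total = 0.0
    for h in (0, 1):              # haplotype the fragment was drawn from
        prob = 0.5                # equal mixture weights p^m(h)
        for j, observed in fragment.items():
            prob *= sum(theta[j][v] * emission(observed, PHASES[v][h], alpha)
                        for v in range(2))
        total += prob
    return total


theta = {1: [1.0, 0.0], 2: [1.0, 0.0]}   # haplotype 0 = 00, haplotype 1 = 11
print(fragment_likelihood({1: 0, 2: 0}, theta))  # about 0.41
print(fragment_likelihood({1: 0, 2: 1}, theta))  # about 0.09
```

Under this assumed form, a fragment that agrees with one of the two haplotypes is much more likely than one that mixes them, and the likelihoods of all four possible two-site fragments sum to one.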
We explain the difference between our model and the models of Kim [14] and Li [16] in Additional file 1.
The minimum connectivity score
As described above, the two haplotypes $\mathcal{H}$ in the SIH problem have no particular identity and it is not possible to predict which of them converges to the actual paternal or maternal chromosome. In relation to this, the likelihood function $P(F, H, \Psi \mid \Theta)$ has a symmetry between the switched configurations: $P(F, \bar{H}, \bar{\Psi} \mid \bar{\Theta}) = P(F, H, \Psi \mid \Theta)$, where $\bar{H} = \{\bar{h}_{i} \mid i = 1, \ldots, N\}$ and $\bar{\Psi} = \{\bar{\Phi}^{(i)} \mid i = 1, \ldots, N\}$ represent the configuration in which all the haplotype origins of the fragments are exchanged, and $\bar{\Theta} = \{\bar{\theta}_{j\nu}\}$, $\bar{\theta}_{j\nu} = \theta_{j\bar{\nu}}$ are the switched phase probabilities. Therefore, the marginal likelihood $P(F \mid \Theta) = \sum_{H,\Psi} P(F, H, \Psi \mid \Theta)$ is symmetric for the two parameter sets: $P(F \mid \bar{\Theta}) = P(F \mid \Theta)$.
where $\Theta' = \{\theta'_{j\nu}\}$ with $\theta'_{j\nu} = \theta_{j\nu}$ for $j < j_{0}$ and $\theta'_{j\nu} = \bar{\theta}_{j\nu}$ for $j \ge j_{0}$. The second equality follows from the symmetry of $P(F \mid \Theta)$ described above, and shows that only the fragments covering site $j_{0}$ are necessary to compute the connectivity of site $j_{0}$. The connectivity measures the resilience of the assembly result against swapping the two haplotypes 0 and 1 in the right part $j = j_{0}, \ldots, M$ of the sites. We refer to this change of parameters Θ → Θ' as twisting the parameters at site $j_{0}$.
We extract confidently assembled regions by selecting the pairs $(j_{1}, j_{2})$ with high MC values. From the above definition, it is obvious that if the MC value is higher than a given threshold for some pair $(j_{1}, j_{2})$, then all the pairs inside range $[j_{1}, j_{2}]$ have MC values higher than the threshold. In this sense, $MC(j_{1}, j_{2})$ can be considered as defined on the range $[j_{1}, j_{2}]$.
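The twist operation and the resulting scores can be sketched as follows. The defining equations are not reproduced in this text, so two assumptions are made and should be read as such: the connectivity of site $j_{0}$ is taken to be the log-ratio of the likelihood under Θ to the likelihood under the twisted parameters Θ', summed over the fragments covering $j_{0}$, and $MC(j_{1}, j_{2})$ is taken to be the minimum connectivity over the sites $j_{1} < j \le j_{2}$. The fragment likelihood is the same assumed mixture form (equal weights, fixed α) used for illustration elsewhere; all function names are hypothetical.

```python
import math

PHASES = [(0, 1), (1, 0)]  # PHASES[v][h]: allele of haplotype h under phase v


def fragment_likelihood(fragment, theta, alpha=0.1):
    """Assumed two-component mixture likelihood of one fragment."""
    total = 0.0
    for h in (0, 1):
        prob = 0.5
        for j, obs in fragment.items():
            prob *= sum(theta[j][v] * ((1.0 - alpha) if obs == PHASES[v][h] else alpha)
                        for v in range(2))
        total += prob
    return total


def twist(theta, j0):
    """Theta -> Theta': swap the two phase probabilities at every site j >= j0."""
    return {j: (p[::-1] if j >= j0 else list(p)) for j, p in theta.items()}


def connectivity(fragments, theta, j0, alpha=0.1):
    """Assumed form: sum over covering fragments of log P(f|Theta) / P(f|Theta')."""
    twisted = twist(theta, j0)
    score = 0.0
    for f in fragments:
        if len(f) >= 2 and min(f) < j0 <= max(f):   # f covers j0
            score += math.log(fragment_likelihood(f, theta, alpha)
                              / fragment_likelihood(f, twisted, alpha))
    return score


def minimum_connectivity(fragments, theta, j1, j2, alpha=0.1):
    """Assumed form: the smallest connectivity over sites j1 < j <= j2."""
    return min(connectivity(fragments, theta, j, alpha) for j in range(j1 + 1, j2 + 1))


theta = {1: [1.0, 0.0], 2: [1.0, 0.0], 3: [1.0, 0.0]}
fragments = [{1: 0, 2: 0}, {2: 1, 3: 1}, {2: 0, 3: 0}]
print(connectivity(fragments, theta, 2))             # one supporting fragment
print(connectivity(fragments, theta, 3))             # two supporting fragments
print(minimum_connectivity(fragments, theta, 1, 3))  # limited by the weaker link
```

In this toy input the link between sites 1 and 2 is supported by a single fragment and the link between sites 2 and 3 by two, so the minimum over the range [1, 3] equals the weaker of the two. Because a minimum can only grow when the range shrinks, any sub-range has an MC value at least as high, which is the monotonicity property stated above.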
Variational Bayesian inference
where $Z^{H\Psi}$ is a normalization constant, $\beta_{ihj\nu}$ and $\lambda_{j\nu}$ represent the hyperparameters that specify the posterior distributions, and $\mathrm{Dir}(\theta_{j} \mid \lambda_{j})$ is the Dirichlet probability distribution of |Δ| parameters. Because $Q^{H\Psi}(H, \Psi)$ and $Q^{\Theta}(\Theta)$ are connected through the dependencies among the hyperparameters, they cannot be found simultaneously. Therefore, we optimize $\beta_{ihj\nu}$ and $\lambda_{j\nu}$ by an iterative method.
1. Do VBEM and calculate the connectivities for all the sites.
2. Do another VBEM with a parameter set Λ that is twisted at a site with low connectivity.
3. Repeat until convergence.
Here, the twist of hyperparameters Λ = {λ_{ jν }} is defined similarly to that of parameters Θ = {θ_{ jν }}. We describe the details of this procedure in Additional file 1.
Inferring haplotypes
We select the phase ν at site j for which this $p_{j}^{\Phi}(\nu)$ is the highest. We limit the predicted haplotype segments to the regions with high MC values.
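The two steps in this paragraph, choosing the most probable phase per site and restricting the output to confidently connected regions, can be sketched as below. The sketch assumes that per-site phase probabilities and per-site connectivities have already been computed and that the sites are given in genomic order; the threshold value and the function names are hypothetical. Cutting the site list at every low-connectivity site yields maximal segments whose minimum connectivity is at least the threshold.

```python
def infer_phases(theta):
    """Pick the index of the most probable phase at every site."""
    return {j: max(range(len(p)), key=lambda v: p[v]) for j, p in theta.items()}


def confident_segments(sites, connectivity, threshold):
    """Split sorted sites into maximal segments (first, last) of two or more
    sites in which every internal link has connectivity >= threshold.

    connectivity[j] scores the link between site j and the preceding site.
    """
    segments = []
    start = prev = sites[0]
    for j in sites[1:]:
        if connectivity[j] < threshold:   # weak link: close the current segment
            if prev > start:
                segments.append((start, prev))
            start = j
        prev = j
    if prev > start:
        segments.append((start, prev))
    return segments


theta = {1: [0.9, 0.1], 2: [0.2, 0.8], 3: [0.7, 0.3]}
print(infer_phases(theta))  # {1: 0, 2: 1, 3: 0}

conn = {2: 8.0, 3: 1.2, 4: 7.5, 5: 9.1, 6: 0.3}
print(confident_segments([1, 2, 3, 4, 5, 6], conn, threshold=5.0))  # [(1, 2), (3, 5)]
```

Raising the threshold produces shorter but more reliable segments, which corresponds to trading recall for precision.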
Possible extensions of the model
In this paper, we consider only the binary representation of heterozygous sites. We also constrain the error rate to be constant throughout the sequence. However, some of these constraints are easily removed. We can include homozygous sites and four nucleotide alleles by expanding the phase set Δ. For example, the phase set of a multiallelic variant can be represented as Δ = {(A,C),(A,G),(C,A),(C,G),(G,A),(G,C)}. We can even include small structural variations if they can be represented by additional allele symbols; the phase set of a structural variant can be represented as, for example, Δ_{1} = {(A,-),(-,A)} for an indel and Δ_{2} = {("AC","ACAC"),("ACAC","AC")} for short tandem repeats. With these extensions, the accuracy of genotype calling of multiallelic variants from sequencing data might be improved by considering haplotypes simultaneously [30], and the accuracy and the recall of the haplotype regions might be improved because all variant sites add information for inferring the derivation of the fragments. Furthermore, we can make the error probability matrix $p^{e}(\sigma \mid \sigma')$ dependent on the alleles of each fragment, which may be useful for incorporating the quality scores of sequenced reads.
Datasets and data processing
Dataset generation
Simulation data were created through a strategy similar to the one reported by Geraci [31]. We first generated M binary heterozygous phase vectors and then we generated SNP fragments by replicating each haplotype c times and randomly dividing them into subsequences of length between l_{1} and l_{2}. We then randomly flipped the binary values of the fragments from 0(1) to 1(0) with probability e. In the following, we use M = 1000, c = 5, l_{1} = 3, l_{2} = 7 and e = 0.1 unless otherwise mentioned.
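A minimal sketch of this generation procedure is shown below, assuming one plausible reading of the description: a single random binary haplotype and its complement are each copied c times, each copy is cut from left to right into consecutive pieces of random length between l_{1} and l_{2}, and every allele is flipped independently with probability e. The function name, the seed argument, and the handling of the final piece (which may be shorter than l_{1}) are assumptions, not details taken from the original procedure.

```python
import random


def simulate(M=1000, c=5, l1=3, l2=7, e=0.1, seed=0):
    """Return (haplotype 0 as a list, fragments as dicts site -> allele)."""
    rng = random.Random(seed)
    hap0 = [rng.randint(0, 1) for _ in range(M)]
    hap1 = [1 - a for a in hap0]          # heterozygous sites: complement
    fragments = []
    for hap in (hap0, hap1):
        for _ in range(c):                # c copies of each haplotype
            start = 0
            while start < M:
                end = min(start + rng.randint(l1, l2), M)
                fragment = {}
                for j in range(start, end):
                    allele = hap[j]
                    if rng.random() < e:  # sequencing error: flip the allele
                        allele = 1 - allele
                    fragment[j + 1] = allele   # sites are numbered from 1
                fragments.append(fragment)
                start = end
    return hap0, fragments


hap0, fragments = simulate()
print(len(fragments), sum(len(f) for f in fragments))
```

With the default values every site is observed 2c = 10 times in total, so the sum of fragment lengths is 2cM = 10,000 allele calls.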
For the real data, we used the dataset of Duitama [13], who conducted fosmid pool-based next-generation sequencing for HapMap trio child NA12878 from the CEU population. NA12878 had about 1.65 × 10^{6} heterozygous sites on autosomal chromosomes, and the haplotypes of about 1.36 × 10^{6} sites were determined by a trio-based statistical phasing method [18]. In the fosmid pool-based next-generation sequencing, the diploid genomic DNA was fragmented into pieces of length about 40 kilobases, and partitioned into 32 pools with low concentration, so that the fragments were long enough to span several heterozygous sites and each pool rarely contained homologous chromosomal regions of different haplotypes. Each pool was sequenced separately using a next-generation sequencer and the read data were mapped onto the reference genome. Since the reads in a cluster that were close to each other and had the same pool origin were assumed to originate from the same DNA fragment, the alleles observed in the same cluster were merged into a SNP fragment. Duitama [13] converted the fragment data to a binary representation by collecting only the alleles of the heterozygous sites determined by the 1000 Genomes Project. The coverage of the data was about 3.03. We used the trio-based data and the sequencing data in binary format for our experiment.
The normalized linkage disequilibrium D' for the CEU population was downloaded from the HapMap Project [2].
We compared our MixSIH software with ReFHap [13], FastHare [17], DGS [15], which were implemented by Duitama [13], and HapCUT [11]. We selected these algorithms because they have been shown to be superior to other algorithms [13].
For the comparison of the runtimes, we generated simulation data with M = 100, 200, 500, 1000. We repeated the measurement 10 times for each M and the average runtimes are reported here. The computations were performed on a cluster of Linux machines equipped with dual Xeon X5550 processors and 24 GB RAM.
Accuracy measures
As described in the introduction, our algorithm focuses on extracting reliable haplotype regions. To examine whether we have succeeded in extracting reliable haplotype regions, an accuracy measure that evaluates the quality of the piecewise haplotype regions is needed. However, existing accuracy measures are designed to compare the efficiency between the algorithms and are not suitable for evaluating the quality of the piecewise haplotype regions.
Let Φ^{(t)}be the true haplotypes, and Φ be inferred haplotypes. Because the inferred haplotypes Φ are sets of partially assembled haplotype segments Φ = (Φ_{1}, Φ_{2}, . . . , Φ_{ B }) where each of Φ_{ b } is independently predicted, the accuracy measures have to be applicable for such predictions.
However, this definition is inconvenient because the minimization is applied for each segment and this accuracy measure can always be improved just by breaking a segment into smaller pieces at random positions.
Here, we propose another simple accuracy measure based on the pairwise consistency of the prediction with the true haplotypes. This pairwise consistency score is inspired by the D' measure of linkage disequilibrium, where the statistical correlations among population genomes are measured for pairs of sites. Similarly to the switch error, a pair of heterozygous sites j and j' (j < j') is defined as consistent if $(\phi_{j}, \phi_{j'}) = (\phi_{j}^{(t)}, \phi_{j'}^{(t)})$ or $(\bar{\phi}_{j}^{(t)}, \bar{\phi}_{j'}^{(t)})$, and inconsistent otherwise. A pair (j, j') in a haplotype segment is consistent if there is no switch error in range [j, j'] and inconsistent if there is one switch error in that range. If there is an uncontrolled number of switch errors in range [j, j'], the probabilities that pair (j, j') is consistent or inconsistent are both 0.5, which is equivalent to selecting a random phase at each site (Figure 2(A)). For each haplotype segment, we count the consistent and inconsistent pairs. The total numbers of consistent and inconsistent pairs over all the haplotype segments are denoted by CP and IP, respectively. We define precision by CP/(CP + IP). This is used as the measure of accuracy in the later sections. Unlike the switch error rate, this precision accounts for the global influence of switch errors because a switch error in the middle of a haplotype segment leads to a much smaller CP than a switch error at an end of the segment.
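The counting of consistent and inconsistent pairs can be illustrated with a short sketch. In the binary representation a phase is fully determined by the allele on haplotype 0, so each site either agrees with the truth or is flipped; a pair is consistent exactly when both sites agree or both are flipped. The function names and the input layout (dictionaries from site to the haplotype-0 allele) are illustrative assumptions.

```python
from math import comb


def pair_counts(segments, predicted, truth):
    """Return (CP, IP) summed over independently predicted segments.

    segments:  list of lists of site indices
    predicted: dict site -> predicted allele on haplotype 0
    truth:     dict site -> true allele on haplotype 0
    """
    cp = ip = 0
    for segment in segments:
        agree = sum(1 for j in segment if predicted[j] == truth[j])
        flipped = len(segment) - agree
        cp += comb(agree, 2) + comb(flipped, 2)   # both agree or both flipped
        ip += agree * flipped                      # one agrees, one is flipped
    return cp, ip


def precision(cp, ip):
    return cp / (cp + ip)


truth = {1: 0, 2: 1, 3: 0, 4: 1, 5: 1, 6: 0}
switch_in_middle = {1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1}   # switch after site 3
switch_at_end = {1: 0, 2: 1, 3: 0, 4: 1, 5: 1, 6: 1}      # switch after site 5
segment = [[1, 2, 3, 4, 5, 6]]

print(pair_counts(segment, switch_in_middle, truth))  # (6, 9)
print(pair_counts(segment, switch_at_end, truth))     # (10, 5)
```

Both predictions contain exactly one switch error, yet the precision is 6/15 = 0.4 for the switch in the middle and 10/15 ≈ 0.67 for the switch at the end, which illustrates why this measure reflects the global influence of a switch error.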
We define the total prediction space as follows. We consider a graph whose nodes are the set of all the heterozygous sites. We connect two nodes by an edge if there is a fragment spanning both the sites. We collect all the connected components with at least two nodes and consider each of the corresponding clusters of heterozygous sites as an independent segment. The total number of pairs is the sum of the numbers of all the pair sites over the segments. Although it is rare, there are cases in which some segments consist of noncontiguous heterozygous sites. For example, segment sets such as {(1, 4, 5), (2, 3)} and {(1, 3), (2, 4, 5)} may occur for the consecutive heterozygous sites (1, 2, 3, 4, 5). We define recall as the ratio of the predicted pairs divided by the total number of pairs. Because the previous algorithms provide no score to limit the prediction to highly confident regions, recall is always nearly equal to one for these algorithms. On the other hand, our algorithm is able to make predictions with high precision at the expense of reduced recall.
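The total prediction space can be computed with a union-find pass over the fragments, as in the following sketch. The function names and the dictionary-based fragment layout are illustrative assumptions; sites that appear in no fragment form singleton components and contribute no pairs, so they are simply ignored.

```python
from collections import Counter


def find(parent, x):
    """Root of x with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def total_pairs(fragments):
    """Number of site pairs inside the connected components of the graph in
    which two sites are joined when some fragment spans both of them."""
    parent = {}
    for f in fragments:
        sites = sorted(f)
        for j in sites:
            parent.setdefault(j, j)
        for j in sites[1:]:
            root_a, root_b = find(parent, sites[0]), find(parent, j)
            if root_a != root_b:
                parent[root_b] = root_a
    sizes = Counter(find(parent, j) for j in list(parent))
    return sum(n * (n - 1) // 2 for n in sizes.values())


def recall(predicted_segments, fragments):
    """Predicted pairs divided by the total number of pairs."""
    predicted = sum(len(s) * (len(s) - 1) // 2 for s in predicted_segments)
    return predicted / total_pairs(fragments)


# Non-contiguous components {1, 4, 5} and {2, 3}, as in the example above.
fragments = [{1: 0, 4: 1}, {4: 0, 5: 0}, {2: 1, 3: 1}]
print(total_pairs(fragments))            # 3 + 1 = 4
print(recall([[1, 4, 5]], fragments))    # 3 / 4 = 0.75
```

An algorithm that reports every connected component in full attains a recall of one, whereas restricting the output to high-confidence segments lowers the number of predicted pairs and hence the recall.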
A more detailed discussion of other accuracy measures is given in Additional file 1.
Potential chimeric fragments
where n(f, h) is the number of sites at which the fragment f matches the true haplotype h, $f_{\le j}$ and $f_{>j}$ represent the left and right parts of fragment f divided at site j, and $\alpha_{0} = 0.028$ is the empirical sequence error rate computed by comparing the true haplotypes and all the SNP fragments. We removed potential chimeric fragments with chimerity higher than a given threshold. We recomputed the accuracies for the filtered dataset and compared them with those for the original dataset.
Results and discussion
Comparison of pairwise accuracies
Effects of potential chimeric fragments
These results suggest that more careful data processing to avoid spurious chimeric fragments is necessary to obtain high-quality haplotype assembly.
Incorporation of the trio-based data
Although the trio-based statistical phasing method can determine most of the phases of the sites, there still exist SNP sites whose phases cannot be determined by this method. SIH is capable of determining the phases that are not determined by the trio-based data, and we can obtain more complete haplotype data by combining the SIH-based data and the trio-based data. To examine how many phases can be newly determined by combining the two, we devise a method that uses both sources of information to determine the phases (see Additional file 1). By using this method, about 82% (237,950/291,466) of the phases of the sites that are undetermined by the trio-based data could be newly determined, and in total about 97% (1,601,381/1,654,897) of the phases could be determined by the two methods together. This result suggests that almost all of the phases of the sites can be determined by using both the SIH-based data and the trio-based data.
Spatial distribution of MC values
Dependency of MC values on the fragment parameters
Optimality of inferred parameters
Comparison of running times
Conclusions
With advances in sequencing technologies and experimental techniques, single individual haplotyping (SIH) has become increasingly appealing for haplotype determination in recent years. In this paper, we have developed a probabilistic model for SIH (MixSIH) and defined the minimum connectivity (MC) score that can be used for extracting accurately assembled haplotype segments. We have introduced a new accuracy measure, based on the pairwise consistency of the inferred haplotypes, which is intuitive and easy to calculate but nevertheless avoids some of the problems of existing accuracy measures. By using the MC scores, our algorithm can extract highly accurate haplotype segments. We have also found evidence that there are a small number of chimeric fragments in an existing dataset from fosmid pool-based next-generation sequencing, and these fragments considerably reduce the quality of the assembled haplotypes. Therefore, a better data processing method is necessary to avoid creating chimeric fragments.
Our program uses only read fragment data derived from an individual. However, it is expected that more powerful analyses could be made by combining SIH algorithms with statistical haplotype phasing methods that use population genotype data. An interesting possibility would be to construct a unified probabilistic model that infers the haplotypes on the basis of both kinds of data.
Abbreviations
SIH: Single individual haplotyping
MC: Minimum connectivity
Declarations
Acknowledgements
The authors thank their research group colleagues for assistance in this study. This study was supported by a Grant-in-Aid for Young Scientists (21700330), and a Grant-in-Aid for Scientific Research (A) (22240031). Computations were performed using the supercomputing facilities at the Human Genome Center, University of Tokyo (http://sc.hgc.jp/shirokane.html).
The publication costs for this article were funded by a Grant-in-Aid for Young Scientists (21700330), and a Grant-in-Aid for Scientific Research (A) (22240031).
This article has been published as part of BMC Genomics Volume 14 Supplement 2, 2013: Selected articles from ISCB-Asia 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S2.
Authors’ Affiliations
References
1. Schaid DJ: Evaluating associations of haplotypes with traits. Genet Epidemiol. 2004, 27: 348-364. 10.1002/gepi.20037.
2. The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449: 851-861. 10.1038/nature06258.
3. Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ: The importance of phase information for human genomics. Nat Rev Genet. 2011, 12: 215-223. 10.1038/nrg2950.
4. Clark AG: Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol. 1990, 7: 111-122.
5. Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995, 12: 921-927.
6. Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001, 68: 978-989. 10.1086/319501.
7. Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet. 2003, 73: 1162-1169. 10.1086/379378.
8. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010, 34 (8): 816-834. 10.1002/gepi.20533.
9. Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. Nat Rev Genet. 2011, 12: 703-714.
10. Bansal V, Halpern AL, Axelrod N, Bafna V: An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 2008, 18: 1336-1346. 10.1101/gr.077065.108.
11. Bansal V, Bafna V: HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008, 24: i153-159. 10.1093/bioinformatics/btn298.
12. Chen Z, Fu B, Schweller R, Yang B, Zhao Z, Zhu B: Linear time probabilistic algorithms for the singular haplotype reconstruction problem from SNP fragments. J Comput Biol. 2008, 15: 535-546. 10.1089/cmb.2008.0003.
13. Duitama J, McEwen GK, Huebsch T, Palczewski S, Schulz S, Verstrepen K, Suk EK, Hoehe MR: Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques. Nucleic Acids Res. 2012, 40: 2041-2053. 10.1093/nar/gkr1042.
14. Kim JH, Waterman MS, Li LM: Diploid genome reconstruction of Ciona intestinalis and comparative analysis with Ciona savignyi. Genome Res. 2007, 17: 1101-1110. 10.1101/gr.5894107.
15. Levy S et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254. 10.1371/journal.pbio.0050254.
16. Li LM, Kim JH, Waterman MS: Haplotype reconstruction from SNP alignment. J Comput Biol. 2004, 11: 505-516. 10.1089/1066527041410454.
17. Panconesi A, Sozio M: Fast Hare: a fast heuristic for single individual SNP haplotype reconstruction. WABI'04. 2004, 266-277.
18. The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.
19. Eid J et al: Real-time DNA sequencing from single polymerase molecules. Science. 2009, 323: 133-138. 10.1126/science.1162986.
20. Kitzman JO, Mackenzie AP, Adey A, Hiatt JB, Patwardhan RP, Sudmant PH, Ng SB, Alkan C, Qiu R, Eichler EE, Shendure J: Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat Biotechnol. 2011, 29: 59-63. 10.1038/nbt.1740.
21. Suk EK, McEwen GK, Duitama J, Nowick K, Schulz S, Palczewski S, Schreiber S, Holloway DT, McLaughlin S, Peckham H, Lee C, Huebsch T, Hoehe MR: A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 2011, 21: 1672-1685. 10.1101/gr.125047.111.
22. Coop G, Wen X, Ober C, Pritchard JK, Przeworski M: High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans. Science. 2008, 319 (5868): 1395-1398. 10.1126/science.1151851.
23. Dewal N, Hu Y, Freedman ML, Laframboise T, Pe'er I: Calling amplified haplotypes in next generation tumor sequence data. Genome Res. 2012, 22 (2): 362-374. 10.1101/gr.122564.111.
24. Kitzman JO, Snyder MW, Ventura M, Lewis AP, Qiu R, Simmons LE, Gammill HS, Rubens CE, Santillan DA, Murray JC, Tabor HK, Bamshad MJ, Eichler EE, Shendure J: Noninvasive whole-genome sequencing of a human fetus. Sci Transl Med. 2012, 4 (137): 137ra76. 10.1126/scitranslmed.3004323.
25. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296 (5576): 2225-2229. 10.1126/science.1069424.
26. Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA. 2002, 99 (11): 7335-7339. 10.1073/pnas.102186799.
27. Anderson EC, Novembre J: Finding haplotype block boundaries by using the minimum-description-length principle. Am J Hum Genet. 2003, 73 (2): 336-354. 10.1086/377106.
28. Karp RM: Reducibility among combinatorial problems. Complexity of Computer Computations. 1972, Plenum Press, 85-103.
29. Attias H: Inferring parameters and structure of latent variable models by variational Bayes. UAI'99. 1999, 21-30.
30. Zhi D, Wu J, Liu N, Zhang K: Genotype calling from next-generation sequencing data using haplotype information of reads. Bioinformatics. 2012, 28 (7): 938-946. 10.1093/bioinformatics/bts047.
31. Geraci F: A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics. 2010, 26: 2217-2225. 10.1093/bioinformatics/btq411.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.