Maximum parsimony xor haplotyping by sparse dictionary selection
 Abdulkadir Elmas^{1},
 Guido H Jajamovich^{2} and
 Xiaodong Wang^{1}Email author
https://doi.org/10.1186/1471216414645
© Elmas et al.; licensee BioMed Central Ltd. 2013
Received: 7 March 2013
Accepted: 2 September 2013
Published: 23 September 2013
Abstract
Background
Xorgenotype is a costeffective alternative to the genotype sequence of an individual. Recent methods developed for haplotype inference have aimed at finding the solution based on xorgenotype data. Given the xorgenotypes of a group of unrelated individuals, it is possible to infer the haplotype pairs for each individual with the aid of a small number of regular genotypes.
Results
We propose a framework of maximum parsimony inference of haplotypes based on the search of a sparse dictionary, and we present a greedy method that can effectively infer the haplotype pairs given a set of xorgenotypes augmented by a small number of regular genotypes. We test the performance of the proposed approach on synthetic data sets with different number of individuals and SNPs, and compare the performances with the stateoftheart xorhaplotyping methods PPXH and XORHAPLOGEN.
Conclusions
Experimental results show good inference qualities for the proposed method under all circumstances, especially on large data sets. Results on a real database, CFTR, also demonstrate significantly better performance. The proposed algorithm is also capable of finding accurate solutions with missing data and/or typing errors.
Background
A human genome is a sequence of nucleotides that can differ from one individual to another (approximately 0.1% difference between any two individual) due to various reasons, such as insertions/deletions of fractions of the sequence on the genome or mostly the substitution/mutation of single nucleotides on commonly observed sites called single nucleotide polymorphism (SNP) [1]. In most SNPs only two different nucleotides are observed out of 4 nucleotides. The information of nucleotide variations extracted from these SNP sites (loci) is encoded as a sequence called “haplotype”. That is, for a particular SNP site a notation is used for one of the observed nucleotides (e.g., the most commonly observed nucleotide variant  dominant/major allele) and another notation is used for the other (e.g., the least observed nucleotide variant  recessive/minor allele). Because of its informative and heredity nature identifying the haplotypes of individuals has been an important subject in various medical and scientific studies, such as gene related disease discovery and drug design [2, 3], population history research [4], etc. Nonetheless, current experimental techniques are not lowcost and efficient enough for directly sequencing haplotypes of an individual; thereby identifying them is mostly based on indirect approaches, e.g., using computational methods to infer haplotypes from an alternative costeffective data called “genotype”.
The entire human genome consists of 23 distinct chromosomes each appearing in two copies (autosomes) except for the chromosome23 (allosome) which consists of two copies of chromosomeX in females or one chromosomeX and one chromosomeY in males. Each chromosome is a pair of two distinct sequences haplotypes inherited from the parents, i.e., one is from the maternal genome and the other is from the paternal genome. The genotype is sequenced by identifying the types of alleles nucleotide variants across the SNP locations (locus) in chromosomes. In a particular locus of a chromosome if both haplotypes have the same allele we call this site in the genotype homozygous and denote it with the type of alleles in both haplotypes as either commontype or wildtype; otherwise, if both haplotypes have different alleles –one commontype and one wildtype– we call this site heterozygous. When identifying haplotypes for a given genotype, the ambiguity occurs for the heterozygous sites since there is no information about which haplotype has the commontype allele and which haplotype has the wildtype allele. Clearly, genotypes are less informative than haplotypes, as they present an ambiguity on heterozygous sites due to possible permutations and computational methods can be employed to identify which allele come from which haplotype. Recently, more costeffective alternative methods have been used for genotype sequencing [5], e.g., widely used denaturing highperformance liquid chromatography (DHPLC) [6]. By certain applications of such methods one can only determine whether an individual has homozygous or heterozygous allele in a given SNP site, but cannot distinguish the type of allele in homozygous sites. The sequenced data is thereby less informative than the regular genotypes as it only represents the differing sites (XOR operation) between the haplotypes. This less informative form of genotype is named xorgenotype. One can solve the haplotype inference problem based on the xorgenotypes, i.e., xorhaplotyping, with a reasonable extra computational effort.
Methods for solving the haplotype inference problem given the regular genotypes can be summarized in two categories: combinatorial methods that usually state an explicit objective function and propose methods for optimizing it, and statistical methods that relies on the statistical modeling of the problem. Various methods have been published for the haplotype inference problem [7–13], however the xorhaplotyping problem mostly remained underinvestigated. Two particular methods are suitable for xorhaplotyping problems: parsimony haplotyping that is based on the maximum parsimony principle, and perfect phylogeny haplotyping that relies on a population genetics assumption called the infinite sites/alleles model[14], i.e., it assumes that allele sequences are long enough so that a particular allele will have a mutation only once in the phylogenetic tree. The perfect phylogeny (PP) model utilizes the infinite sites assumption by building a tree of individuals haplotypes where all individuals evolve, with no recurrent mutation, from one common ancestor. An approximate solution to xorhaplotyping problem in the case of PP model was introduced in [15] where the xorhaplotype inference was cast as a graph realization problem [16, 17]. However, the proposed method (GREAL) in [15] is not wellsuited for the xorgenotypes with large number of SNPs, i.e., usually limited by 30 SNPs [18], and is not extended to missing data cases.
On the other hand, it is known that in a population of individuals certain haplotypes are frequently found in certain genomic regions [19]. This fact leads to the parsimony principle that states that the genotypes of a population of individuals are generated by the least number of distinct haplotypes. Identifying such smallest set of haplotypes is called Pure (Maximum) Parsimony Problem, which is NPhard [20]. An integer linear programing method was introduced in [21] that finds a pure parsimony solution to this problem, and in [22] a branchandbound method was used to solve pure parsimony problem. In [23] a method called XORHAPLOGEN was proposed for solving haplotype inference problem in the case of xorgenotype data. This method can find accurate solutions for xorgenotypes with large number of SNPs. Another parsimony method was introduced in [24] for xorhaplotype inference by representing it as a graph realization problem called pure parsimony xor haplotyping (PPXH).
In [25] a novel framework for (regular) haplotyping was proposed by interpreting the parsimony principle as a sparse representation of the genotypes. Two approaches are presented: maximizing a sparseness condition on the haplotype frequency vector determined by the inferred haplotypes, and casting the sparsity of this frequency vector as a sparse dictionary selection problem. The latter approach is based on an efficient greedy method SHSD where haplotypes explaining the given genotypes are determined according to a sparse selection from the set of compatible haplotypes. The method constructively determines the solution of each individual while selecting the haplotypes from this set, and it has the convergence guarantee.
For the xorhaplotyping problem, there is an increased ambiguity due to the XOR operation between haplotypes, i.e., the process of xorgenotyping that determines whether the type of alleles in both haplotypes differ in a particular site (heterozygous) or they are the same (homozygous). However this ambiguity can be resolved with the assistance of regular genotypes. Regular genotypes can either be used as postprocessing inputs for eliminating setequivalent solutions of a particular inference, or they can be used to refine inference while constructing the solution.
Tractability of the maximum parsimony haplotyping problem in the xorgenotype case is still open [24]. In this paper, we propose a modified version of SHSD —XHSD—, that can efficiently find a solution for maximum parsimony xorhaplotyping problem and resolve the ambiguity with the help of a small number of regular genotypes. For a given set of xorgenotypes the haplotype pairs for each individual are selected from the set of compatible haplotypes by a sparse dictionary selection method. The selection of dictionary columns from the set of compatible haplotypes and the sparse representation of xorgenotypes is formulated as a joint combinatorial optimization problem. The objective function of this problem maximizes a variance reduction metric over all individuals. Our algorithm is a lowcomplexity greedy method that terminates once the solution is fully determined. To resolve the ambiguity and to improve the inference accuracy, we employ a small number of regular genotypes as constraints for the set of compatible haplotypes to help resolve the type of homozygous alleles.
The remainder of the paper is organized as follows. In Preliminaries, we introduce the xorhaplotype inference problem. In Methods, we formulate the xorhaplotype inference as a sparse dictionary selection problem and present an efficient greedy method for solving this problem. We also discuss the use of regular genotypes to resolve ambiguity. In Extensions section we discuss how the algorithm deals with long sequences and data with missing sites. In Results and discussion, we present the experimental results on synthetic and real data sets under various conditions. Finally, the Conclusions section is given in the end.
Preliminaries
In an SNP locus only 2 nucleotides are observed, and a single bit is sufficient for the representation of nucleotide variants such that 0 encodes the major allele and 1 encodes the minor allele. The haplotype of an individual can thereby be represented with a binary vector that shows the SNP variants across the individual’s chromosome. The genotype can then be thought of as a ternary vector where a 0 (2) indicates that the site is homozygous and both haplotypes have major 0/0 (minor 1/1) alleles, and 1 indicates that the site is heterozygous and the haplotypes have different alleles 0/1 or 1/0. Notice that when encoding homozygous and heterozygous sites we used a different notation from the literature in order to express a genotype vector as the sum of two haplotypes: a minorhomozygous SNP is encoded with 2 and a heterozygous SNP is encoded with 1, so that a 2 in the genotype is given by (the sum of) two minor alleles, and a 1 in the genotype is given by (the sum of) one major and one minor allele.
In general, given a lengthL genotype vector, k ≤ L of the loci are heterozygous and thereby ambiguous, in each of the k sites one haplotype can take two values –0 or 1– and the other haplotype takes the complement value. Considering all k heterozygous sites, one haplotype can then be one of the 2^{ k } binary sequences, and the other haplotype will be the complement (inverted values) of that sequence. Therefore, for solving a genotype with k heterozygous sites, the pair of haplotypes is drawn from a set of 2^{ k } distinct binary vectors of lengthL.
On the other hand, in xorhaplotyping problem the conflated data — xorgenotype — is less informative than the regular genotype with respect to the information loss about the type of allele in homozygous sites. The xorgenotype is itself a binary vector, where for a given site, 1 indicates heterozygous SNP where both haplotypes have different alleles for this given site. The xorgenotype can be represented by the XOR sum of two haplotypes, likewise, for a given site 0 indicates a homozygous SNP where the haplotypes have the same allele but without any distinction whether the type of the allele is major or minor. That is, the xorgenotype contains the information whether a particular SNP site has homozygous alleles, but the type of alleles for those homozygous sites is not identified. Every site of an xorgenotype is ambiguous, and each site of the corresponding haplotype can take two values. Therefore, a lengthL xorgenotype can be explained by a pair of haplotypes that are drawn from a set of 2^{ L } distinct binary vectors of lengthL. Hence, because of the additional ambiguity on homozygous sites, the number of possible solutions for an xorgenotype is significantly (in fact, exponentially) larger than that of a regular genotype of the same size.
Besides the xorhaplotyping problem is NPhard, there is also no unique solution to this problem. The nature of the XOR operation results in a phenomenon called bit flip degree of freedom[15], i.e., for a particular solution set H consisting of lengthL haplotypes, one can produce equivalent solution sets by inverting a certain SNP i ≤ L (or a set of SNPs $S\subseteq \{1,\dots ,L\}$) in all haplotypes of H. Notice that inverting (complementing) an SNP across all haplotypes has no effect on the xorgenotypes they generated, because even the alleles explaining homozygous sites of xorgenotypes are not distinguished (hidden). More specifically, assume that ${h}_{i}^{1}\left(\ell \right)\in \{0,1\}$ and ${h}_{i}^{2}\left(\ell \right)\in \{0,1\}$ represent the haplotypes of ith individual in the ℓth SNP and they generate that individual’s xorgenotype x_{ i }(ℓ) such that ${x}_{i}\left(\ell \right)={h}_{i}^{1}\left(\ell \right)\oplus {h}_{i}^{2}\left(\ell \right)$. Then the complemented SNPs of haplotypes also explain that SNP of the same xorgenotype, i.e., ${x}_{i}\left(\ell \right)=\overline{{h}_{i}^{1}\left(\ell \right)}\oplus \overline{{h}_{i}^{2}\left(\ell \right)}$. It then follows that for a particular set H of lengthL haplotypes that solves a given set of xorgenotypes, there are at most $\sum _{i=1}^{L}\left(\genfrac{}{}{0ex}{}{L}{i}\right)={2}^{L}1$ equivalent sets ${\mathit{H}}_{i}^{\prime},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots {2}^{L}1$ to H where each equivalent set H i′ also solves that given set of xorgenotypes.
Problem definition
where ${\mathit{h}}_{i}^{j}=\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\left[\phantom{\rule{0.3em}{0ex}}{h}_{i}^{j}\right(1)\dots {h}_{i}^{j}(L\left)\right]}^{T}$ is the jth (j = 1, 2) haplotype of the ith individual consisting of L SNPs. In this representation, we say that the xorgenotype of the ith individual x_{ i } is phased by the haplotype pair $\{{\mathit{h}}_{i}^{1},{\mathit{h}}_{i}^{2}\}$.
In regular haplotyping, a putative haplotype z ∈ {0, 1}^{ L } is called compatible with a genotype g ∈ {0, 1, 2}^{ L } if (g − z) ∈ {0, 1}^{ L }, and such a haplotype is a possible solution that can explain that genotype. That is, the haplotype pair {z,(g − z)} is one of the possible solutions to the genotype g. Therefore, for every given genotype g_{ i } it is essential to determine a set of compatible haplotypes ${\mathcal{H}}_{i}$ when searching for possible solutions. The union of the sets ${\mathcal{H}}_{1},\dots ,{\mathcal{H}}_{N}$ for N individuals forms the matrix Z ∈ {0, 1}^{L × M} where M is the total number of distinct compatible haplotypes.
In xorhaplotyping, on the other hand, it is trivial to see that any haplotype z ∈ {0, 1}^{ L } is compatible (consistent) with any xorgenotype x, i.e., x = z ⊕ z^{′} since there always exists a haplotype z^{′} ∈ {0, 1}^{ L } such that z^{′} = x ⊕ z. Therefore, the set of compatible haplotypes ${\mathcal{H}}_{i}$ for a given lengthL xorgenotype x_{ i } consists of all binary vectors of lengthL, i.e., ${\mathcal{H}}_{1}={\mathcal{H}}_{2}=\cdots ={\mathcal{H}}_{N}={\{0,1\}}^{L\times {2}^{L}}\triangleq \mathit{Z}$.
Because of this compatibility between the xorgenotypes and candidate haplotypes an SNP site can always be explained by either of the two alleles, and thus unambiguous SNPs do not exist anymore. Notice that, in particular, an xorgenotype with allhomozygous SNPs is still ambiguous and requires to be solved up to bit flipping. However, we know that such an xorgenotype is always explained by a pair of identical haplotypes which correspond to the same column of Z. On the other hand, if there is at least one heterozygous SNP in the xorgenotype then its phasing haplotypes are not identical and correspond to the different columns in Z.
where (.)_{2} represents the componentwise modulo2 operation, and v_{ i } ∈ {0, 1, 2}^{ M }, 1^{ T }v_{ i } = 2, is the sparse vector indicating the haplotype locations as the indices of the matrix Z of consistent haplotypes. Notice that the modulo2 operation in (3) is equivalent to the XOR operation between the two haplotypes selected by v_{ i }.
Given Z, finding the indicator vector v_{ i } for an individual is equivalent to inferring its haplotype pair $\{{\mathit{h}}_{i}^{1},{\mathit{h}}_{i}^{2}\}$. The maximum parsimony principle suggests that a given set of xorgenotypes should be explained by the smallest number of distinct haplotypes. Therefore, given the set of xorgenotypes for N individuals $\{{\mathit{x}}_{i},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots ,N\}$, one needs to infer the haplotype pairs for each individual $\{{\mathit{h}}_{i}^{1},{\mathit{h}}_{i}^{2},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots ,N\}$, so that the union of all inferred haplotypes forms the smallest set as possible. In other words, the xorhaplotyping problem is to infer v_{ i }, i = 1, …, N, given Z while selecting as few columns of Z as possible.
Methods
Xor Haplotyping by Sparse Dictionary Selection (XHSD)
where ${\mathcal{A}}_{i}$ is a set of indices corresponding to the nonzero elements of v_{ i }, ${\mathit{Z}}_{{\mathcal{A}}_{i}}$ is the submatrix of Z consisting of the columns indexed by ${\mathcal{A}}_{i}$, and $\stackrel{~}{{\mathit{v}}_{i}}$ is the nonzero elements of v_{ i }.
For each observed xorgenotype x_{ i }, the phasing haplotypes are located in columns of Z indexed by ${\mathcal{A}}_{i}$. The union of these column indices, i.e., $\mathcal{D}={\cup}_{i}{\mathcal{A}}_{i}$, forms the dictionary of the haplotypes that suffices to construct all given xorgenotypes. The maximum parsimony principle then dictates that the dictionary $\mathcal{D}$ should contain the least possible number of elements that can reconstruct all observed xorgenotypes. The set of haplotypes indicated by such a sparse dictionary $\mathcal{D}$ is given by $\mathit{H}={\mathit{Z}}_{\mathcal{D}}$, where ${\mathit{Z}}_{\mathcal{D}}$ is the submatrix of Z consisting of the columns indexed by $\mathcal{D}$. Then H is a solution set to the maximum parsimony haplotyping problem for the given set of xorgenotypes {${\mathit{x}}_{i}\phantom{\rule{2.77626pt}{0ex}},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots ,N$}.
To solve the xorhaplotyping problem, we choose the sparse dictionary $\mathcal{D}$ to minimize the average distance between the observed xorgenotypes and the closest approximations constructed by the haplotypes in ${\mathit{Z}}_{\mathcal{D}}$. Since there is no prior information about the dictionary $\mathcal{D}$ and the indices ${\mathcal{A}}_{i}$ for proper reconstruction of each xorgenotype, determining $\mathcal{D}$ and ${\mathcal{A}}_{i}$ leads to a combinatorial problem. This jointoptimization problem can be efficiently solved by a greedy method that we will explain next.
Notice that determining both $\mathcal{D}$ for a given n in (8) and $\mathcal{A}$ for a given $\mathcal{D}$ in (7) is a combinatorial problem. In [26], it is shown that such combinatorial problems can be approximately solved efficiently by a simple greedy method if the objective function satisfies a fundamental property called submodularity. In [25], it is shown that the dictionary selection problem for (regular) haplotype inference has a cost function that is approximately submodular, and when a greedy method is used to optimize this cost function it can efficiently find an approximate solution with a theoretical guarantee [27].
For xorhaplotype inference, on the other hand, the problem is fundamentally different. That is, the submodularity property may not hold for the cost function in (5) due to the XOR operation, and thereby the theoretical guarantee does not hold either for the greedy method. Nonetheless, we still use the similar greedy heuristic as SHSD in [25] in order to maximize the variance reduction metric in (5) over the set of observations.
To compute (10) requires solving (5) and (6) for each k. In (6) for each individual i, ${\mathcal{A}}_{i}$ is found by computing the Euclidean distance (5) between x_{ i } and the possible reconstructions given by the pairwise xorsum of all columns in ${\mathit{Z}}_{\mathcal{D}}$, and picking the columns that minimize (6). Whenever indices ${\mathcal{A}}_{i}$ yield zero in (5) we can explain that individual with the corresponding haplotypes in Z, i.e., ${\mathit{x}}_{i}={\left({\mathit{Z}}_{{\mathcal{A}}_{i}}\phantom{\rule{2.77626pt}{0ex}}{\stackrel{~}{\mathit{v}}}_{i}\right)}_{2}$. The dictionary ${\mathcal{D}}_{\ell}$ keeps growing until all xorgenotypes are explained, i.e., $F\left({\mathcal{D}}_{\ell}\right)=\frac{1}{N}\sum _{i=1}^{N}{F}_{i}\left({\mathcal{D}}_{\ell}\right)=\frac{1}{N}\sum _{i=1}^{N}\parallel {\mathit{x}}_{i}{\parallel}^{2}$.
Notice that in XHSD algorithm the number of compatible haplotypes Z exponentially increase in comparison to regular haplotyping problem with SHSD. However, –when available– we can reduce Z with respect to regular genotype information via utilizing them in the cost function (5). The necessary modifications are discussed in the next section XHSD with regular genotypes. Another fundamental difference in xorhaplotyping is that the xorgenotypes do not provide unambiguous genotype information which one can initialize the dictionary with corresponding haplotypes and improve the reconstruction accuracy. Nonetheless, with a bias weight, the modified cost function can exploit the available regular genotypes even when they are not unambiguous.
Summary of XHSD algorithm:

Initialization.

$\mathit{Z}={\{0,1\}}^{L\times {2}^{L}}$.

n ← 1.

${\mathcal{D}}_{n1}^{\ast}=\varphi $.


Iterate until all xorgenotypes are explained, i.e., $F\left({\mathcal{D}}_{n}^{\ast}\right)=\frac{1}{N}\sum _{i=1}^{N}\parallel {\mathit{x}}_{i}{\parallel}^{2}$.

Perform the greedy search.

∗ For $\forall j\in \left\{1,\dots ,{2}^{L}\right\}\setminus {\mathcal{D}}_{n1}^{\ast}$, compute $F\left({\mathcal{D}}_{n1}^{\ast}\cup \left\{j\right\}\right)$.

∗ Let ${j}^{\ast}=arg\phantom{\rule{.5em}{0ex}}{\text{max}}_{j\in \{1\dots {2}^{L}\}\setminus \underset{n1}{\overset{\ast}{\mathcal{D}}}}F\left(\underset{n1}{\overset{\ast}{\mathcal{D}}}\cup \left\{j\right\}\right)$.
Set ${\mathcal{D}}_{n}^{\ast}={\mathcal{D}}_{n1}^{\ast}\cup \left\{{j}^{\ast}\right\}$.

∗ Check if any xorgenotype is explained by the addition of the new element ${\mathit{h}}^{{j}^{\ast}}$, i.e., if (5) is zero. If so, the inferred haplotype pair for the individual with such an xorgenotype is $[\phantom{\rule{0.3em}{0ex}}{\mathit{h}}^{{j}^{\ast}},\phantom{\rule{2.77626pt}{0ex}}{\mathit{x}}_{i}\oplus {\mathit{h}}^{{j}^{\ast}}]$.


n ← n + 1.

Given the xorgenotypes of a set of individuals, this algorithm finds the haplotypes of each individual based on the maximum parsimony principle.
After initializing Z, and starting with empty dictionary ${\mathcal{D}}_{0}$, the algorithm performs the greedy search by adding one haplotype from Z (with the maximal marginal gain) at a time. At iteration n = 1, (10) calculates m = arg max(0, 0, 0, 0, 0, 0, 0, 0) and m is randomly picked as 5 among the equal maximum values, then the corresponding haplotype Z_{5} = [001]^{ T } is added to the dictionary, i.e., ${\mathcal{D}}_{1}=\phantom{\rule{2.77626pt}{0ex}}\left[\phantom{\rule{0.3em}{0ex}}5\right]$, and ${\mathcal{Z}}_{{\mathcal{D}}_{1}}=\left[\begin{array}{c}0\\ 0\\ 1\end{array}\right]$. Similarly, at n = 2, m = arg max(0.33, 0, 0.66, 0, 0, 0.33, 0) = 3 is calculated and the haplotype Z_{3} = [010]^{ T } is added to the dictionary, i.e., ${\mathcal{D}}_{2}=\phantom{\rule{2.77626pt}{0ex}}[\phantom{\rule{0.3em}{0ex}}5,3]$, and ${\mathcal{Z}}_{{\mathcal{D}}_{2}}=\left[\begin{array}{c}00\\ 01\\ 10\end{array}\right]$. x_{3} is explained by the addition of new haplotype, i.e., x_{3} = [001]^{ T } ⊕ [010]^{ T }, yet the other xorgenotypes are not explained. At n = 3, m = arg max(1.33, 0.66, 0.66, 0.66, 0.66, 1.33, 0.66) = 1 is calculated and the haplotype Z_{1} = [000]^{ T } is added to the dictionary, i.e., ${\mathcal{D}}_{3}=\phantom{\rule{2.77626pt}{0ex}}[\phantom{\rule{0.3em}{0ex}}5,3,1]$, and ${\mathcal{Z}}_{{\mathcal{D}}_{3}}=\left[\begin{array}{c}000\\ 010\\ 100\end{array}\right]$. Other two xorgenotypes are explained by the new addition, i.e., x_{1} = [001]^{ T } ⊕ [000]^{ T }, x_{2} = [010]^{ T } ⊕ [000]^{ T }, and the algorithm converges at n = 3 via calculating $F\left({\mathcal{D}}_{3}\right)=\frac{1}{3}\sum _{i=1}^{3}{F}_{i}\left({\mathcal{D}}_{3}\right)=\frac{4}{3}=\frac{1}{3}\sum _{i=1}^{3}\parallel {\mathit{x}}_{i}{\parallel}^{2}$.
This simple example demonstrates how the proposed greedy approach can efficiently construct sparse solutions, where three xorgenotypes are explained by only three haplotypes within three iterations. Nonetheless, the solution set has the ambiguity of being one of the equivalent sets of the true solution due to the bit flip degree of freedom which should be resolved.
Resolving bit flip degree of freedom
In [15] it is shown that the xor perfect phylogeny problem can be solved up to bit flipping based on the characteristics of the given xorgenotypes. Let X ∈ {0, 1}^{L × N} be the xorgenotypes matrix of N individuals such that $\mathit{X}=\phantom{\rule{0.3em}{0ex}}[\phantom{\rule{0.3em}{0ex}}{\mathit{x}}_{1}\phantom{\rule{2.77626pt}{0ex}}{\mathit{x}}_{2}\dots {\mathit{x}}_{N}]$. Denote χ_{ i } as the set of heterozygous loci for the ith individual, i.e., χ_{ i } = {ℓ:x_{ i }(ℓ) = 1}, where x_{ i }(ℓ) is the ℓth SNP in x_{ i }. If there exists a set of individuals $\mathcal{I}\subseteq \{1,\dots ,N\}$ whose xorgenotypes have empty intersection, i.e., ${\cap}_{\mathcal{I}}{\chi}_{\mathcal{I}}=\varphi $, then with the knowledge of regular genotypes ${\mathit{G}}_{\mathcal{I}}\in {\{0,1,2\}}^{L\times \left\mathcal{I}\right}$ of those individuals one can remove all bit flip degrees of freedom. The empty intersection indicates that an SNP will have homozygous allele in at least one of those individuals and therefore that SNP can be resolved by revealing the type of allele at the corresponding regular genotype. Following this, a postprocessing method is suggested in [15] that can remove the bit flip degree of freedom across the loci where a set of xorgenotypes have empty intersection.
However, this postprocessing method have certain limitations. Notice that, for large L the setequivalent solutions are highly specific to the infererred H, e.g., for a given set of xorgenotypes it is very likely that any two different inferences H_{1} and H_{2} —which are not setequivalent— can have very different setequivalent solutions. Bit flipping on different inferences likely leads to different results, and thereby the bit flipping accuracy largely depends on the initial inference H which is made by avoiding the prior knowledge on homozygous SNPs, i.e., regular genotypes. Besides, –when available– utilizing more regular genotypes in postprocessing does not necessarily improve the bit flipping accuracy. Basically, to decide among the appropriate bit flippings for a particular locus requires the knowledge of that homozygous SNP from a regular genotype. Intuitively, to reveal a set of homozygous SNPs by employing the least number of regular genotypes, e.g., provided by the MTI method, will be necessary and sufficient for removing the bit flip degree of freedom across those SNPs. On the other hand, a larger number of regular genotypes will not be any more informative due to possible inconsistencies on the type of homozygous allele for an SNP site across the given regular genotypes.
Furthermore, notice that flipping the bits on some loci across all the haplotypes in H does not affect the parsimony of the solution. The final solution H^{′} will have the same parsimony with H regardless of the set of loci that are flipped. From the maximum parsimony point of view, refining an xorhaplotyping solution via bit flipping method does not necessarily lead to global optimum unless the initial inference is a setequivalent of the global optimal solution.
In most cases the xorgenotypes in X has empty intersection and for each run MTI outputs 2 or 3 individuals, i.e., $\left\mathcal{I}\right\le 3$; then ${\mathit{G}}_{\mathcal{I}}$ has at most 3 regular genotypes. One can obtain a larger ${\mathit{G}}_{\mathcal{I}}$ by performing multiple runs of MTI with X and collecting the distinct regular genotypes given by MTI.
Next we explain the necessary modifications to the XHSD algorithm for utilizing the regular genotypes.
XHSD with regular genotypes
where Z is the set of haplotypes that are compatible with the ith individual’s genotype g_{ i }, and $\mathcal{A}$ contains the indices of the haplotypes in Z that are used to approximate g_{ i }. In this representation the approximation accuracy is potentially higher when compared to the xorgenotypes, since the homozygous SNPs in g_{ i } are unambiguous. The haplotypes that are used to approximate those SNPs will be more reliable candidates when building the dictionary $\mathcal{D}$.
The weight parameter b_{ i } could be set as proportional to the average rate of homozygous SNPs per genotype, assuming that the more homozygous sites the regular genotype contains the more informative it will be. We experimentally set b_{ i }=4 as it yielded good performance with both synthetic and real databases.
Extensions
Long xorgenotypes
Note that the size of Z grows exponentially with the lengthL due to the compatibility between haplotypes and xorgenotypes. That is, finding the solution of a lengthL xorgenotype requires to perform the greedy search over Z that consists of 2^{ L } haplotypes. To mitigate the computational complexity we employ the partitionligation method [28] as in [25] where the block partitioning is based on identifying the recombination hot spots [29] existing between the haplotype blocks [30]. After partitioning the SNP sequences will be divided into blocks where within each block the haplotype diversity is as low as possible.
To determine the initial and ending loci of each block $[\phantom{\rule{0.3em}{0ex}}{l}^{q}:{m}^{q}],\phantom{\rule{2.77626pt}{0ex}}q=1,\dots Q$ that minimizes $\mathcal{E}$ we use the recursive method explained in [25], i.e., for each ending locus 1 ≤ m ≤ L we determine the block $[\phantom{\rule{0.3em}{0ex}}{l}_{m}^{\ast}:m]$, with $m{l}_{m}^{\ast}+1\le W$, that contributes with the lowest entropy and then backtrack the best initial points ${l}_{m}^{\ast}$ for each consecutive block by starting with the block $[\phantom{\rule{0.3em}{0ex}}{l}_{L}^{\ast},L]$.
Missing data
Genotyping errors often occur when the observed genotype of an individual differs from the original sequence for various reasons [31, 32]. A particular type of genotyping error is the case when some loci are not observed/missed during sequencing or other application processes. Although methods dealing with some type of errors were proposed, often erroneous genotypes are produced with significant missing/error rates [33]. Therefore, it is of high importance for an xorhaplotyping technique to be adaptive for resolving such databases with missing sites. We next present a modification to XHSD in order to perform xorhaplotyping for the individuals exposed to missing data conditions.
where $dim\left({\stackrel{~}{\mathit{x}}}_{i}\right)$ gives the dimension of ${\stackrel{~}{\mathit{x}}}_{i}$.
Different weight functions could be employed to exploit the distribution of missing sites. Since, in our experiments, the missing sites are uniformly distributed across the SNPs and individuals the function in (14) gave a good performance.
The proposed method does not account for the direct inference of the missing sites, i.e., imputing missing genotypes [34]. However, the missing values in each xorgenotype can be recovered from the solution by simply looking at the haplotype pairs which are specifically inferred for each individual. Since the proposed method has robust performance against missing data, as presented in the next section, the inferred solution will be sufficient to type missing genotype sites. An implementation of the proposed method –with aforementioned extensions– is provided in “Additional file 1”.
Results and discussion
We performed xorhaplotyping on various data sets, with and without missing information on loci: synthetic data sets with different recombination rates simulated by a coalescence based program of [36], a database consisting of the SNPs in the CFTR gene that is associated with cystic fibrosis (CF) disorder [37], and another database (ANRIL) containing the SNPs that have relatively lower linkage disequilibrium (high polymorphism). We tested different xorhaplotyping methods that are based on different assumptions including the parsimony graph realization model PPXH [24], the parsimony genetic search model XORHAPLOGEN (XHAP) [23], the graph representation model GREAL [15], and an integer programming approach PolyIP [38]. Among the four methods the last two were ineffective for practical reasons. GREAL failed at finding solutions for data sets with reasonably long sequences (SNPs >30), and PolyIP method is often computationally inefficient when solving even a simple problem (e.g., it takes more than 24 hours to solve a set of 50 individuals with 30 SNPs).
Synthetic data
Based on different recombination rates three different scenarios are considered in synthetic data sets: no recombination (r = 0), and recombination with rates r=4 and r=40, respectively. The recombination rate is the rate that the haplotypes of an individual exchange the sequence fragments due to several reasons such as crossingover events. This fact is simulated by a model given in Hudson’s software [36]. For each scenario we generated 100 different data sets by random pairing of a set of simulated haplotypes of different lengths (5 ≤ L ≤ 46) for a given population size. This is repeated for different population sizes as well, N ∈ {10, 20, 30, 40, 50}.
It is worthy of noting that the algorithms based on segmentation may deteriorate when processing long xorgenotype sequences, especially with increasing recombination rates where the detection of haplotype blocks is complicated [39]. We used block partitioning (segmentation) in XHSD to reduce complexity when processing long xorgenotype sequences. In Figure 4 the segmentation effect is noticeable particularly in very high recombination rates, i.e., r = 40. However, in general scenario, i.e., r ≤ 4, we can say that the segmentation effect is not significant for the proposed method’s performance, and it outperforms XORHAPLOGEN in most data sets containing typical recombination rates.
For more practical results we added regular genotypes in each method with different percentages of the population and allowed the methods to remove ambiguity by their own, except for PPXH. Since PPXH cannot make use of regular genotypes directly, we applied bit flipping using the MTI solver to remove ambiguity for this method. To regularly genotype a given percentage of the population, the regular genotypes are determined by running the MTI method several times until the number of distinct regular genotypes obtained achieves the given percentage of the total number of individuals.
Missing data
We investigated capability for dealing with missing data under different circumstances by various methods. Since the methods performed similarly under zero recombination rate we used the same data sets with no recombination to generate the database with missing entries. An SNP site of an individual is defined as “missing” with a probability of P_{ miss } and the data sets for different percentages of missing SNPs are generated accordingly. PPXH method is excluded since it cannot handle missing data. In XHSD the block partitioning is applied as before with a maximum block size of W = 8 SNPs.
CFTR gene database
Cystic fibrosis (CF) is an autosomal recessive disorder caused by mutations in the gene that encodes the cystic fibrosis transmembrane conductance regulator protein (CFTR). In [37], various mutations on 23 polymorphic locations from the chromosome 7 are detected as the disease loci for CF. We used this database corresponding to 29 distinct haplotypes to generate random xorgenotypes. By combining the haplotype pairs at random we generated the xorgenotypes for a given number of individuals N, and repeated the process for different population sizes, i.e., N ∈ {100, 200, 300, 400}. In this database, the data sets with small number of individuals present high haplotype diversities, i.e., many of the distinct haplotypes are only used once in the generation of individuals. Therefore, the larger data sets that have low haplotype diversities are expected to be solved with higher accuracy by biologicallyoriented methods, such as XORHAPLOGEN which obtains its inference according to a multilocus linkage disequilibrium (LD)based block identification model.
Typing errors
ANRIL database
Notice that the algorithms cannot mitigate the error rates with increasing number of individuals. This can be explained by the occurrence of very high haplotype diversity in corresponding lowLD SNP regions. The number of distinct haplotypes explaining the given number of individuals presumably remains at high diversity as the number of individuals grows, whereas the methods based on maximum parsimony principle fail to incorporate this fact. They are tend to find parsimonious (lowdiversity) solutions in all population sizes, with a decreasing ratio (ρ) of “total number of distinct haplotypes explaining the given set of individuals” to “total number of given individuals” as the population size grows. It is worthy of noticing that, in XHSD results in Figure 12 (P_{ miss } = 0), we observed that such ratio decreases as ρ = [1.3, 0.95, 0.83, 0.72, 0.66] in respect to the populations with 10,20,30,40,50 individuals; whereas the same ratio for the true phasing (ground truth data) is in fact much higher, i.e., ρ = [1.7, 1.48, 1.34, 1.27, 1.24], respectively, thereby causing the parsimonybased haplotyping methods to deteriorate on this database. On the other hand, in highLD CFTR database, the same ratio for the true phasing is very low due to low haplotype diversity, i.e., ρ = [0.29, 0.14, 0.1, 0.07], in respect to the populations with 100,200,300,400 individuals, and the XHSD method is good at achieving very similar rates,i.e., ρ = [0.43, 0.15, 0.1, 0.07], respectively.
Conclusions
In this paper, we have presented a new xorhaplotyping method XHSD based on the maximum parsimony principle that infers the haplotype pairs for each member of a group of unrelated individuals by observing their xorgenotypes. A dictionary selection method is utilized to find the smallest set of haplotypes selected from a candidate set that can explain the given set of xorgenotypes. The proposed approach requires regular genotypes from only a small percentage of individuals for the removal of ambiguity across all SNPs of the inferred haplotypes. The smallest subgroup of individuals having the most informative regular genotypes are efficiently determined by the minimum tree intersection algorithm. Although the inference accuracy was proportional to the percentage of the individuals given by regular genotypes, XHSD shows less dependency on regular genotypes compared to other methods. Experimental results have demonstrated that XHSD is a reliable method for xorhaplotyping under all circumstances including missing data and typing error cases. Low rates of missing values (≤ 10%) on the xorgenotypes has often insignificant contribution to the error rates, and the proposed method can deal with ∼ 2% typing errors. Particularly for large databases, XHSD produces the most accurate solution with significantly low error rates compared to other lowcomplexity xorhaplotyping methods. Experiments with CFTR gene database also proved that our approach can perform effectively on real data sets with/without missing sites. Another database with particularly lower LD rates indicates that the proposed algorithm can achieve the best performance with the stateoftheart algorithms. We expect that XHSD can serve as a practical tool for xorhaplotyping on realworld large instances, as the large data collections become more available in the era of nextgeneration DNA sequencing.
Declarations
Authors’ Affiliations
References
 Brookes A: The essence of SNPs. Gene. 1999, 234: 177186. 10.1016/S03781119(99)00219X.View ArticlePubMedGoogle Scholar
 Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 15161517. 10.1126/science.273.5281.1516.View ArticlePubMedGoogle Scholar
 Kwok PY, Gu Z: Single nucleotide polymorphism libraries: why and how are we building them?. Mol Med Today. 1999, 5: 538543. 10.1016/S13574310(99)016019.View ArticlePubMedGoogle Scholar
 Gray IC, Campbell DA, Spurr NK: Single nucleotide polymorphisms as tools in human genetics. Human Mol Genet. 2000, 9: 24032408. 10.1093/hmg/9.16.2403.View ArticleGoogle Scholar
 Peters T, Sedlmeier R: Current methods for highthroughput detection of novel DNA polymorphisms. Drug Discov Today: Technol. 2006, 3 (2): 123129. 10.1016/j.ddtec.2006.05.002.View ArticleGoogle Scholar
 Xiao W, Oefner PJ: Denaturing highperformance liquid chromatography: a review. Human Mutat. 2001, 17: 439474. 10.1002/humu.1130.View ArticleGoogle Scholar
 Bansal V, Bafna V: HapCUT: An efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008, 24 (16): i153i159. 10.1093/bioinformatics/btn298.View ArticlePubMedGoogle Scholar
 He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E: Optimal algorithms for haplotype assembly from wholegenome sequence data. Bioinformatics. 2010, 26 (12): i183i190. 10.1093/bioinformatics/btq215.PubMed CentralView ArticlePubMedGoogle Scholar
 Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010, 34 (8): 816834. 10.1002/gepi.20533.PubMed CentralView ArticlePubMedGoogle Scholar
 Gupta A, Manuch J, Stacho L, Zhao X: Algorithm for haplotype inference via galledtree networks with simple galls. J Comput Biol. 2012, 19 (4): 439454. 10.1089/cmb.2010.0145.View ArticlePubMedGoogle Scholar
 Iliadis A, Anastassiou D, Wang X: A unified framework for haplotype inference in nuclear families. Ann Hum Genet. 2012, 76 (4): 312325. 10.1111/j.14691809.2012.00715.x.View ArticlePubMedGoogle Scholar
 Lai E, Wang W, Jiang T, Wu K: A lineartime algorithm for reconstructing zerorecombinant haplotype configuration on a pedigree. BMC Bioinformatics. 2012, 13 (Suppl 17): S1910.1186/1471210513S17S19.PubMed CentralView ArticlePubMedGoogle Scholar
 He D, Han B, Eskin E: Hapseq: An optimal algorithm for haplotype phasing with imputation using sequencing data. J Comput Biol. 2013, 20 (2): 8092. 10.1089/cmb.2012.0091.PubMed CentralView ArticlePubMedGoogle Scholar
 Kimura M, Crow JF: The number of alleles that can be maintained in a finite population. Genet. 1964, 49: 725738.Google Scholar
 Barzuza T, Beckmann JS, Shamir R, Pe’er I: Computational problems in perfect phylogeny haplotyping: Typing without calling the allele. IEEE/ACM Trans Comput Biol Bioinformatics. 2008, 5: 101109.View ArticleGoogle Scholar
 Liberatore V: Matroid decomposition methods for the set maxima problem. Proceedings of the Ninth Annual ACMSIAM Symposium on Discrete Algorithms. 1998, Philadelphia: Society for Industrial and Applied Mathematics, 400409.Google Scholar
 Gusfield D: Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. Proc. 6th Annual Conference on Research in Computational Molecular Biology (RECOMB 2002). 2002, New York: ACM, 166175.Google Scholar
 Barzuza T, Beckmann J, Shamir R, Pe’er I: Computational problems in perfect phylogeny haplotyping: XorGenotypes and tag SNPs. Combinatorial Pattern Matching, Volume 3109. Edited by: Dogrusoz U, Muthukrishnan S, Sahinalp S. 2004, Berlin, Heidelberg, New York: SpringerVerlag, 1431.View ArticleGoogle Scholar
 Patil N: Blocks of limited haplotype diversity revealed by highresolution scanning of human chromosome 21. Science. 2001, 294: 17191723. 10.1126/science.1065573.View ArticlePubMedGoogle Scholar
 Lancia G, Pinotti MC, Rizzi R: Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. INFORMS J on Comput. 2004, 16: 348359. 10.1287/ijoc.1040.0085.View ArticleGoogle Scholar
 Gusfield D: Haplotype inference by pure parsimony. Comb Pattern Matching. 2003, Berlin, Heidelberg, New York: SpringerVerlag, 144155.View ArticleGoogle Scholar
 Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics. 2003, 19: 17731780. 10.1093/bioinformatics/btg239.View ArticlePubMedGoogle Scholar
 Sazonova N, Sazonov E, Harner EJ: Algorithm for haplotype resolution and block partitioning for partial XORgenotype data. J of Biomed Inform. 2010, 43: 5159. 10.1016/j.jbi.2009.08.009.View ArticleGoogle Scholar
 Bonizzoni P, Della Vedova G, Dondi R, Pirola Y, Rizzi R: Pure parsimony xor haplotyping. IEEE/ACM Trans Comput Biol Bioinformatics. 2010, 7: 598610.View ArticleGoogle Scholar
 Jajamovich GH, Wang X: Maximumparsimony haplotype inference based on sparse representations of genotypes. IEEE Trans Signal Process. 2012, 60: 20132023.View ArticleGoogle Scholar
 Nemhauser GL, Wolsey LA, Fisher ML: An analysis of approximations for maximizing submodular set functions—I. Math Program. 1978, 14: 265294. 10.1007/BF01588971.View ArticleGoogle Scholar
 Krause A, Cevher V: Submodular dictionary selection for sparse representation. Proceedings of 27th International Conference on Machine Learning (ICML 2010). 2010, Madison: Omnipress, 567574.Google Scholar
 Niu T, Qin Z, Xu X, Liu J: Bayesian haplotype inference for multiple linked singlenucleotide polymorphisms. Am J Hum Genet. 2002, 70: 157169. 10.1086/338446.PubMed CentralView ArticlePubMedGoogle Scholar
 Liang K, Wang X: A deterministic sequential monte carlo method for haplotype inference. IEEE J Selected Topics Signal Process. 2008, 2: 322331.View ArticleGoogle Scholar
 Hey J: What’s so hot about recombination hotspots?. PLoS Biol. 2004, 2: e190+PubMed CentralView ArticlePubMedGoogle Scholar
 Bonin A, Bellemain E, Bronken Eidesen P, Pompanon F, Brochmann C, Taberlet P: How to track and assess genotyping errors in population genetics studies. Mol Ecol. 2004, 13: 32613273. 10.1111/j.1365294X.2004.02346.x.View ArticlePubMedGoogle Scholar
 Pompanon F, Bonin A, Bellemain E, Taberlet P: Genotyping errors: causes, consequences and solutions. Nat Rev Genet. 2005, 6: 847846. 10.1038/nrg1707.View ArticlePubMedGoogle Scholar
 Hackett CA, Broadfoot LB: Effects of genotyping errors, missing values and segregation distortion in molecular marker data on the construction of linkage maps. Heredity. 2003, 90: 3338. 10.1038/sj.hdy.6800173.View ArticlePubMedGoogle Scholar
 Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genomewide association studies by imputation of genotypes. Nature Genet. 2007, 39 (7): 906913. 10.1038/ng2088.View ArticlePubMedGoogle Scholar
 Tininini L, Bertolazzi P, Godi A, Lancia G: CollHaps: a heuristic approach to haplotype inference by parsimony. IEEE/ACM Trans Comput Biol Bioinformatics. 2010, 7: 511523.View ArticleGoogle Scholar
 Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics. 2002, 18: 337338. 10.1093/bioinformatics/18.2.337.View ArticlePubMedGoogle Scholar
 Kerem B, Rommens J, Buchanan J, Markiewicz D, Cox T, Chakravarti A, Buchwald M, Tsui L: Identification of the cystic fibrosis gene: genetic analysis. Science. 1989, 245: 10731080. 10.1126/science.2570460.View ArticlePubMedGoogle Scholar
 Halldorsson BV, Bafna V, Edwards N, Yooseph S, Istrail S: A survey of computational methods for determining haplotypes. Lecture Notes in Computer Science (2983): Computational Methods for SNPs and Haplotype Inference. 2004, Berlin, Heidelberg, New York: SpringerVerlag, 2647.View ArticleGoogle Scholar
 Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missingdata imputation. Am J Human Genet. 2005, 76 (3): 449462. 10.1086/428594.View ArticleGoogle Scholar
 Excoffier L, Laval G, Balding D: Gametic phase estimation over large genomic regions using an adaptive window approach. Human Genomics. 2003, 1: 719. 10.1186/14797364117.PubMed CentralView ArticlePubMedGoogle Scholar
 Iacobucci I, Sazzini M, Garagnani P, Ferrari A, Boattini A, Lonetti A, Papayannidis C, Mantovani V, Marasco E, Ottaviani E, Soverini S, Girelli D, Luiselli D, Vignetti M, Baccarani M, Martinelli G: A polymorphism in the chromosome 9p21 ANRIL locus is associated to Philadelphia positive acute lymphoblastic leukemia. Leukemia Res. 2011, 35 (8): 10521059. 10.1016/j.leukres.2011.02.020.View ArticleGoogle Scholar
 Pasmant E, Laurendeau I, Heron D, Vidaud M, Vidaud D, Bieche I: Characterization of a germline deletion, including the entire INK4/ARF locus, in a melanomaneural system tumor family: Identification of ANRIL, an Antisense Noncoding RNA whose expression coclusters with ARF. Cancer Res. 2007, 67 (8): 39633969. 10.1158/00085472.CAN062004.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.