 Research article
 Open Access
 Published:
Maximum parsimony xor haplotyping by sparse dictionary selection
BMC Genomicsvolume 14, Article number: 645 (2013)
Abstract
Background
Xorgenotype is a costeffective alternative to the genotype sequence of an individual. Recent methods developed for haplotype inference have aimed at finding the solution based on xorgenotype data. Given the xorgenotypes of a group of unrelated individuals, it is possible to infer the haplotype pairs for each individual with the aid of a small number of regular genotypes.
Results
We propose a framework of maximum parsimony inference of haplotypes based on the search of a sparse dictionary, and we present a greedy method that can effectively infer the haplotype pairs given a set of xorgenotypes augmented by a small number of regular genotypes. We test the performance of the proposed approach on synthetic data sets with different number of individuals and SNPs, and compare the performances with the stateoftheart xorhaplotyping methods PPXH and XORHAPLOGEN.
Conclusions
Experimental results show good inference qualities for the proposed method under all circumstances, especially on large data sets. Results on a real database, CFTR, also demonstrate significantly better performance. The proposed algorithm is also capable of finding accurate solutions with missing data and/or typing errors.
Background
A human genome is a sequence of nucleotides that can differ from one individual to another (approximately 0.1% difference between any two individual) due to various reasons, such as insertions/deletions of fractions of the sequence on the genome or mostly the substitution/mutation of single nucleotides on commonly observed sites called single nucleotide polymorphism (SNP) [1]. In most SNPs only two different nucleotides are observed out of 4 nucleotides. The information of nucleotide variations extracted from these SNP sites (loci) is encoded as a sequence called “haplotype”. That is, for a particular SNP site a notation is used for one of the observed nucleotides (e.g., the most commonly observed nucleotide variant  dominant/major allele) and another notation is used for the other (e.g., the least observed nucleotide variant  recessive/minor allele). Because of its informative and heredity nature identifying the haplotypes of individuals has been an important subject in various medical and scientific studies, such as gene related disease discovery and drug design [2, 3], population history research [4], etc. Nonetheless, current experimental techniques are not lowcost and efficient enough for directly sequencing haplotypes of an individual; thereby identifying them is mostly based on indirect approaches, e.g., using computational methods to infer haplotypes from an alternative costeffective data called “genotype”.
The entire human genome consists of 23 distinct chromosomes each appearing in two copies (autosomes) except for the chromosome23 (allosome) which consists of two copies of chromosomeX in females or one chromosomeX and one chromosomeY in males. Each chromosome is a pair of two distinct sequences haplotypes inherited from the parents, i.e., one is from the maternal genome and the other is from the paternal genome. The genotype is sequenced by identifying the types of alleles nucleotide variants across the SNP locations (locus) in chromosomes. In a particular locus of a chromosome if both haplotypes have the same allele we call this site in the genotype homozygous and denote it with the type of alleles in both haplotypes as either commontype or wildtype; otherwise, if both haplotypes have different alleles –one commontype and one wildtype– we call this site heterozygous. When identifying haplotypes for a given genotype, the ambiguity occurs for the heterozygous sites since there is no information about which haplotype has the commontype allele and which haplotype has the wildtype allele. Clearly, genotypes are less informative than haplotypes, as they present an ambiguity on heterozygous sites due to possible permutations and computational methods can be employed to identify which allele come from which haplotype. Recently, more costeffective alternative methods have been used for genotype sequencing [5], e.g., widely used denaturing highperformance liquid chromatography (DHPLC) [6]. By certain applications of such methods one can only determine whether an individual has homozygous or heterozygous allele in a given SNP site, but cannot distinguish the type of allele in homozygous sites. The sequenced data is thereby less informative than the regular genotypes as it only represents the differing sites (XOR operation) between the haplotypes. This less informative form of genotype is named xorgenotype. One can solve the haplotype inference problem based on the xorgenotypes, i.e., xorhaplotyping, with a reasonable extra computational effort.
Methods for solving the haplotype inference problem given the regular genotypes can be summarized in two categories: combinatorial methods that usually state an explicit objective function and propose methods for optimizing it, and statistical methods that relies on the statistical modeling of the problem. Various methods have been published for the haplotype inference problem [7–13], however the xorhaplotyping problem mostly remained underinvestigated. Two particular methods are suitable for xorhaplotyping problems: parsimony haplotyping that is based on the maximum parsimony principle, and perfect phylogeny haplotyping that relies on a population genetics assumption called the infinite sites/alleles model[14], i.e., it assumes that allele sequences are long enough so that a particular allele will have a mutation only once in the phylogenetic tree. The perfect phylogeny (PP) model utilizes the infinite sites assumption by building a tree of individuals haplotypes where all individuals evolve, with no recurrent mutation, from one common ancestor. An approximate solution to xorhaplotyping problem in the case of PP model was introduced in [15] where the xorhaplotype inference was cast as a graph realization problem [16, 17]. However, the proposed method (GREAL) in [15] is not wellsuited for the xorgenotypes with large number of SNPs, i.e., usually limited by 30 SNPs [18], and is not extended to missing data cases.
On the other hand, it is known that in a population of individuals certain haplotypes are frequently found in certain genomic regions [19]. This fact leads to the parsimony principle that states that the genotypes of a population of individuals are generated by the least number of distinct haplotypes. Identifying such smallest set of haplotypes is called Pure (Maximum) Parsimony Problem, which is NPhard [20]. An integer linear programing method was introduced in [21] that finds a pure parsimony solution to this problem, and in [22] a branchandbound method was used to solve pure parsimony problem. In [23] a method called XORHAPLOGEN was proposed for solving haplotype inference problem in the case of xorgenotype data. This method can find accurate solutions for xorgenotypes with large number of SNPs. Another parsimony method was introduced in [24] for xorhaplotype inference by representing it as a graph realization problem called pure parsimony xor haplotyping (PPXH).
In [25] a novel framework for (regular) haplotyping was proposed by interpreting the parsimony principle as a sparse representation of the genotypes. Two approaches are presented: maximizing a sparseness condition on the haplotype frequency vector determined by the inferred haplotypes, and casting the sparsity of this frequency vector as a sparse dictionary selection problem. The latter approach is based on an efficient greedy method SHSD where haplotypes explaining the given genotypes are determined according to a sparse selection from the set of compatible haplotypes. The method constructively determines the solution of each individual while selecting the haplotypes from this set, and it has the convergence guarantee.
For the xorhaplotyping problem, there is an increased ambiguity due to the XOR operation between haplotypes, i.e., the process of xorgenotyping that determines whether the type of alleles in both haplotypes differ in a particular site (heterozygous) or they are the same (homozygous). However this ambiguity can be resolved with the assistance of regular genotypes. Regular genotypes can either be used as postprocessing inputs for eliminating setequivalent solutions of a particular inference, or they can be used to refine inference while constructing the solution.
Tractability of the maximum parsimony haplotyping problem in the xorgenotype case is still open [24]. In this paper, we propose a modified version of SHSD —XHSD—, that can efficiently find a solution for maximum parsimony xorhaplotyping problem and resolve the ambiguity with the help of a small number of regular genotypes. For a given set of xorgenotypes the haplotype pairs for each individual are selected from the set of compatible haplotypes by a sparse dictionary selection method. The selection of dictionary columns from the set of compatible haplotypes and the sparse representation of xorgenotypes is formulated as a joint combinatorial optimization problem. The objective function of this problem maximizes a variance reduction metric over all individuals. Our algorithm is a lowcomplexity greedy method that terminates once the solution is fully determined. To resolve the ambiguity and to improve the inference accuracy, we employ a small number of regular genotypes as constraints for the set of compatible haplotypes to help resolve the type of homozygous alleles.
The remainder of the paper is organized as follows. In Preliminaries, we introduce the xorhaplotype inference problem. In Methods, we formulate the xorhaplotype inference as a sparse dictionary selection problem and present an efficient greedy method for solving this problem. We also discuss the use of regular genotypes to resolve ambiguity. In Extensions section we discuss how the algorithm deals with long sequences and data with missing sites. In Results and discussion, we present the experimental results on synthetic and real data sets under various conditions. Finally, the Conclusions section is given in the end.
Preliminaries
In an SNP locus only 2 nucleotides are observed, and a single bit is sufficient for the representation of nucleotide variants such that 0 encodes the major allele and 1 encodes the minor allele. The haplotype of an individual can thereby be represented with a binary vector that shows the SNP variants across the individual’s chromosome. The genotype can then be thought of as a ternary vector where a 0 (2) indicates that the site is homozygous and both haplotypes have major 0/0 (minor 1/1) alleles, and 1 indicates that the site is heterozygous and the haplotypes have different alleles 0/1 or 1/0. Notice that when encoding homozygous and heterozygous sites we used a different notation from the literature in order to express a genotype vector as the sum of two haplotypes: a minorhomozygous SNP is encoded with 2 and a heterozygous SNP is encoded with 1, so that a 2 in the genotype is given by (the sum of) two minor alleles, and a 1 in the genotype is given by (the sum of) one major and one minor allele.
In general, given a lengthL genotype vector, k ≤ L of the loci are heterozygous and thereby ambiguous, in each of the k sites one haplotype can take two values –0 or 1– and the other haplotype takes the complement value. Considering all k heterozygous sites, one haplotype can then be one of the 2^{k} binary sequences, and the other haplotype will be the complement (inverted values) of that sequence. Therefore, for solving a genotype with k heterozygous sites, the pair of haplotypes is drawn from a set of 2^{k} distinct binary vectors of lengthL.
On the other hand, in xorhaplotyping problem the conflated data — xorgenotype — is less informative than the regular genotype with respect to the information loss about the type of allele in homozygous sites. The xorgenotype is itself a binary vector, where for a given site, 1 indicates heterozygous SNP where both haplotypes have different alleles for this given site. The xorgenotype can be represented by the XOR sum of two haplotypes, likewise, for a given site 0 indicates a homozygous SNP where the haplotypes have the same allele but without any distinction whether the type of the allele is major or minor. That is, the xorgenotype contains the information whether a particular SNP site has homozygous alleles, but the type of alleles for those homozygous sites is not identified. Every site of an xorgenotype is ambiguous, and each site of the corresponding haplotype can take two values. Therefore, a lengthL xorgenotype can be explained by a pair of haplotypes that are drawn from a set of 2^{L} distinct binary vectors of lengthL. Hence, because of the additional ambiguity on homozygous sites, the number of possible solutions for an xorgenotype is significantly (in fact, exponentially) larger than that of a regular genotype of the same size.
Besides the xorhaplotyping problem is NPhard, there is also no unique solution to this problem. The nature of the XOR operation results in a phenomenon called bit flip degree of freedom[15], i.e., for a particular solution set H consisting of lengthL haplotypes, one can produce equivalent solution sets by inverting a certain SNP i ≤ L (or a set of SNPs $S\subseteq \{1,\dots ,L\}$) in all haplotypes of H. Notice that inverting (complementing) an SNP across all haplotypes has no effect on the xorgenotypes they generated, because even the alleles explaining homozygous sites of xorgenotypes are not distinguished (hidden). More specifically, assume that ${h}_{i}^{1}\left(\ell \right)\in \{0,1\}$ and ${h}_{i}^{2}\left(\ell \right)\in \{0,1\}$ represent the haplotypes of ith individual in the ℓth SNP and they generate that individual’s xorgenotype x_{ i }(ℓ) such that ${x}_{i}\left(\ell \right)={h}_{i}^{1}\left(\ell \right)\oplus {h}_{i}^{2}\left(\ell \right)$. Then the complemented SNPs of haplotypes also explain that SNP of the same xorgenotype, i.e., ${x}_{i}\left(\ell \right)=\overline{{h}_{i}^{1}\left(\ell \right)}\oplus \overline{{h}_{i}^{2}\left(\ell \right)}$. It then follows that for a particular set H of lengthL haplotypes that solves a given set of xorgenotypes, there are at most $\sum _{i=1}^{L}\left(\genfrac{}{}{0ex}{}{L}{i}\right)={2}^{L}1$ equivalent sets ${\mathit{H}}_{i}^{\prime},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots {2}^{L}1$ to H where each equivalent set H i′ also solves that given set of xorgenotypes.
Problem definition
For each SNP $\ell =1,\dots ,L$, the xorgenotype is given by the XORsum of two haplotypes such that,
where x_{ i }(ℓ) ∈ {0, 1} is the xorgenotype of the ith individual in SNP ℓ, and ${h}_{i}^{j}\left(\ell \right)\in \{0,1\}$ is the jth haplotype of the ith individual in SNP ℓ. Let ${\mathit{x}}_{i}=\phantom{\rule{0.3em}{0ex}}{\left[\phantom{\rule{0.3em}{0ex}}{x}_{i}\right(1)\dots {x}_{i}(L\left)\right]}^{T}$ be the xorgenotype of the ith individual, then (1) can be written as
where ${\mathit{h}}_{i}^{j}=\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\left[\phantom{\rule{0.3em}{0ex}}{h}_{i}^{j}\right(1)\dots {h}_{i}^{j}(L\left)\right]}^{T}$ is the jth (j = 1, 2) haplotype of the ith individual consisting of L SNPs. In this representation, we say that the xorgenotype of the ith individual x_{ i } is phased by the haplotype pair $\{{\mathit{h}}_{i}^{1},{\mathit{h}}_{i}^{2}\}$.
In regular haplotyping, a putative haplotype z ∈ {0, 1}^{L} is called compatible with a genotype g ∈ {0, 1, 2}^{L} if (g − z) ∈ {0, 1}^{L}, and such a haplotype is a possible solution that can explain that genotype. That is, the haplotype pair {z,(g − z)} is one of the possible solutions to the genotype g. Therefore, for every given genotype g_{ i } it is essential to determine a set of compatible haplotypes ${\mathcal{H}}_{i}$ when searching for possible solutions. The union of the sets ${\mathcal{H}}_{1},\dots ,{\mathcal{H}}_{N}$ for N individuals forms the matrix Z ∈ {0, 1}^{L × M} where M is the total number of distinct compatible haplotypes.
In xorhaplotyping, on the other hand, it is trivial to see that any haplotype z ∈ {0, 1}^{L} is compatible (consistent) with any xorgenotype x, i.e., x = z ⊕ z^{′} since there always exists a haplotype z^{′} ∈ {0, 1}^{L} such that z^{′} = x ⊕ z. Therefore, the set of compatible haplotypes ${\mathcal{H}}_{i}$ for a given lengthL xorgenotype x_{ i } consists of all binary vectors of lengthL, i.e., ${\mathcal{H}}_{1}={\mathcal{H}}_{2}=\cdots ={\mathcal{H}}_{N}={\{0,1\}}^{L\times {2}^{L}}\triangleq \mathit{Z}$.
Because of this compatibility between the xorgenotypes and candidate haplotypes an SNP site can always be explained by either of the two alleles, and thus unambiguous SNPs do not exist anymore. Notice that, in particular, an xorgenotype with allhomozygous SNPs is still ambiguous and requires to be solved up to bit flipping. However, we know that such an xorgenotype is always explained by a pair of identical haplotypes which correspond to the same column of Z. On the other hand, if there is at least one heterozygous SNP in the xorgenotype then its phasing haplotypes are not identical and correspond to the different columns in Z.
The xorgenotype of ith individual is expressed as
where (.)_{2} represents the componentwise modulo2 operation, and v_{ i } ∈ {0, 1, 2}^{M}, 1^{T}v_{ i } = 2, is the sparse vector indicating the haplotype locations as the indices of the matrix Z of consistent haplotypes. Notice that the modulo2 operation in (3) is equivalent to the XOR operation between the two haplotypes selected by v_{ i }.
Given Z, finding the indicator vector v_{ i } for an individual is equivalent to inferring its haplotype pair $\{{\mathit{h}}_{i}^{1},{\mathit{h}}_{i}^{2}\}$. The maximum parsimony principle suggests that a given set of xorgenotypes should be explained by the smallest number of distinct haplotypes. Therefore, given the set of xorgenotypes for N individuals $\{{\mathit{x}}_{i},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots ,N\}$, one needs to infer the haplotype pairs for each individual $\{{\mathit{h}}_{i}^{1},{\mathit{h}}_{i}^{2},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots ,N\}$, so that the union of all inferred haplotypes forms the smallest set as possible. In other words, the xorhaplotyping problem is to infer v_{ i }, i = 1, …, N, given Z while selecting as few columns of Z as possible.
Methods
Xor Haplotyping by Sparse Dictionary Selection (XHSD)
If an (allhomozygous) xorgenotype is explained by only one haplotype, i.e., x_{ i } = h^{s} ⊕ h^{s}, where the haplotype h^{s} is the sth column of Z, then the indicator vector multiplies that haplotype by 2, i.e., v(s) = 2 and v(j) = 0 for $j=\{1\dots {2}^{L}\}\setminus \left\{s\right\}$. Otherwise, if the xorgenotype is explained by two different haplotypes ${\mathit{x}}_{i}={\mathit{h}}_{i}^{m}\oplus {\mathit{h}}_{i}^{n}$, m ≠ n, then they are indicated by the vector v such that v(m) = v(n) = 1 and v(j) = 0 for $j=\{1\dots {2}^{L}\}\setminus \{m,n\}$. Hence, we can rewrite (3) in the following more compact form
where ${\mathcal{A}}_{i}$ is a set of indices corresponding to the nonzero elements of v_{ i }, ${\mathit{Z}}_{{\mathcal{A}}_{i}}$ is the submatrix of Z consisting of the columns indexed by ${\mathcal{A}}_{i}$, and $\stackrel{~}{{\mathit{v}}_{i}}$ is the nonzero elements of v_{ i }.
For each observed xorgenotype x_{ i }, the phasing haplotypes are located in columns of Z indexed by ${\mathcal{A}}_{i}$. The union of these column indices, i.e., $\mathcal{D}={\cup}_{i}{\mathcal{A}}_{i}$, forms the dictionary of the haplotypes that suffices to construct all given xorgenotypes. The maximum parsimony principle then dictates that the dictionary $\mathcal{D}$ should contain the least possible number of elements that can reconstruct all observed xorgenotypes. The set of haplotypes indicated by such a sparse dictionary $\mathcal{D}$ is given by $\mathit{H}={\mathit{Z}}_{\mathcal{D}}$, where ${\mathit{Z}}_{\mathcal{D}}$ is the submatrix of Z consisting of the columns indexed by $\mathcal{D}$. Then H is a solution set to the maximum parsimony haplotyping problem for the given set of xorgenotypes {${\mathit{x}}_{i}\phantom{\rule{2.77626pt}{0ex}},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots ,N$}.
To solve the xorhaplotyping problem, we choose the sparse dictionary $\mathcal{D}$ to minimize the average distance between the observed xorgenotypes and the closest approximations constructed by the haplotypes in ${\mathit{Z}}_{\mathcal{D}}$. Since there is no prior information about the dictionary $\mathcal{D}$ and the indices ${\mathcal{A}}_{i}$ for proper reconstruction of each xorgenotype, determining $\mathcal{D}$ and ${\mathcal{A}}_{i}$ leads to a combinatorial problem. This jointoptimization problem can be efficiently solved by a greedy method that we will explain next.
For an observed xorgenotype the reconstruction accuracy can be interpreted as the Euclidean distance between the observation and its closest approximation, i.e.,
where $\mathcal{A}$ represents the indices of haplotypes in Z used to approximate x_{ i }. Notice that an exact solution will satisfy ${L}_{i}\left(\mathcal{A}\right)=0$. For a given dictionary $\mathcal{D}$, the indices ${\mathcal{A}}_{i}$ for reconstructing each xorgenotype will be determined by restricting ${\mathcal{A}}_{i}$ to be a subset of $\mathcal{D}$ such that
The individual cost function in (5) is then translated into a fitness function associated with a given dictionary $\mathcal{D}$, i.e.,
Finally, the fitness value of $\mathcal{D}$ is averaged over all individuals to measure the overall reconstruction accuracy
For a given cardinality (sparsity) of n, the best dictionary is therefore given by
and the sparsest dictionary that is sufficient to reconstruct all observed xorgenotypes is determined by
Notice that determining both $\mathcal{D}$ for a given n in (8) and $\mathcal{A}$ for a given $\mathcal{D}$ in (7) is a combinatorial problem. In [26], it is shown that such combinatorial problems can be approximately solved efficiently by a simple greedy method if the objective function satisfies a fundamental property called submodularity. In [25], it is shown that the dictionary selection problem for (regular) haplotype inference has a cost function that is approximately submodular, and when a greedy method is used to optimize this cost function it can efficiently find an approximate solution with a theoretical guarantee [27].
For xorhaplotype inference, on the other hand, the problem is fundamentally different. That is, the submodularity property may not hold for the cost function in (5) due to the XOR operation, and thereby the theoretical guarantee does not hold either for the greedy method. Nonetheless, we still use the similar greedy heuristic as SHSD in [25] in order to maximize the variance reduction metric in (5) over the set of observations.
In our algorithm Xor Haplotyping by Sparse Dictionary Selection (XHSD), we start with an empty dictionary set ${\mathcal{D}}_{1}=\varphi $. Then at each iteration ℓ, among the consistent haplotypes that are not already in dictionary ${\mathcal{D}}_{\ell 1}$, i.e., in $\mathit{Z}\setminus {\mathcal{D}}_{\ell 1}$, we iteratively add the haplotype that contributes to the dictionary ${\mathcal{D}}_{\ell 1}$ with the maximal marginal gain. That is, at iteration ℓ the haplotype ${\mathit{h}}^{m}\in \mathit{Z}\setminus {\mathcal{D}}_{\ell 1}$ is added to ${\mathcal{D}}_{\ell 1}$ if it satisfies
To compute (10) requires solving (5) and (6) for each k. In (6) for each individual i, ${\mathcal{A}}_{i}$ is found by computing the Euclidean distance (5) between x_{ i } and the possible reconstructions given by the pairwise xorsum of all columns in ${\mathit{Z}}_{\mathcal{D}}$, and picking the columns that minimize (6). Whenever indices ${\mathcal{A}}_{i}$ yield zero in (5) we can explain that individual with the corresponding haplotypes in Z, i.e., ${\mathit{x}}_{i}={\left({\mathit{Z}}_{{\mathcal{A}}_{i}}\phantom{\rule{2.77626pt}{0ex}}{\stackrel{~}{\mathit{v}}}_{i}\right)}_{2}$. The dictionary ${\mathcal{D}}_{\ell}$ keeps growing until all xorgenotypes are explained, i.e., $F\left({\mathcal{D}}_{\ell}\right)=\frac{1}{N}\sum _{i=1}^{N}{F}_{i}\left({\mathcal{D}}_{\ell}\right)=\frac{1}{N}\sum _{i=1}^{N}\parallel {\mathit{x}}_{i}{\parallel}^{2}$.
Notice that in XHSD algorithm the number of compatible haplotypes Z exponentially increase in comparison to regular haplotyping problem with SHSD. However, –when available– we can reduce Z with respect to regular genotype information via utilizing them in the cost function (5). The necessary modifications are discussed in the next section XHSD with regular genotypes. Another fundamental difference in xorhaplotyping is that the xorgenotypes do not provide unambiguous genotype information which one can initialize the dictionary with corresponding haplotypes and improve the reconstruction accuracy. Nonetheless, with a bias weight, the modified cost function can exploit the available regular genotypes even when they are not unambiguous.
Summary of XHSD algorithm:

Initialization.

$\mathit{Z}={\{0,1\}}^{L\times {2}^{L}}$.

n ← 1.

${\mathcal{D}}_{n1}^{\ast}=\varphi $.


Iterate until all xorgenotypes are explained, i.e., $F\left({\mathcal{D}}_{n}^{\ast}\right)=\frac{1}{N}\sum _{i=1}^{N}\parallel {\mathit{x}}_{i}{\parallel}^{2}$.

Perform the greedy search.

∗ For $\forall j\in \left\{1,\dots ,{2}^{L}\right\}\setminus {\mathcal{D}}_{n1}^{\ast}$, compute $F\left({\mathcal{D}}_{n1}^{\ast}\cup \left\{j\right\}\right)$.

∗ Let ${j}^{\ast}=arg\phantom{\rule{.5em}{0ex}}{\text{max}}_{j\in \{1\dots {2}^{L}\}\setminus \underset{n1}{\overset{\ast}{\mathcal{D}}}}F\left(\underset{n1}{\overset{\ast}{\mathcal{D}}}\cup \left\{j\right\}\right)$.
Set ${\mathcal{D}}_{n}^{\ast}={\mathcal{D}}_{n1}^{\ast}\cup \left\{{j}^{\ast}\right\}$.

∗ Check if any xorgenotype is explained by the addition of the new element ${\mathit{h}}^{{j}^{\ast}}$, i.e., if (5) is zero. If so, the inferred haplotype pair for the individual with such an xorgenotype is $[\phantom{\rule{0.3em}{0ex}}{\mathit{h}}^{{j}^{\ast}},\phantom{\rule{2.77626pt}{0ex}}{\mathit{x}}_{i}\oplus {\mathit{h}}^{{j}^{\ast}}]$.


n ← n + 1.

Given the xorgenotypes of a set of individuals, this algorithm finds the haplotypes of each individual based on the maximum parsimony principle.
As an example, consider the following demonstration. Let x_{1},x_{2} and x_{3} be the xorgenotypes of three individuals each corresponding to three SNPs, i.e.,
The set of compatible haplotypes for these individuals will consist of all length3 binary vectors, i.e.,
After initializing Z, and starting with empty dictionary ${\mathcal{D}}_{0}$, the algorithm performs the greedy search by adding one haplotype from Z (with the maximal marginal gain) at a time. At iteration n = 1, (10) calculates m = arg max(0, 0, 0, 0, 0, 0, 0, 0) and m is randomly picked as 5 among the equal maximum values, then the corresponding haplotype Z_{5} = [001]^{T} is added to the dictionary, i.e., ${\mathcal{D}}_{1}=\phantom{\rule{2.77626pt}{0ex}}\left[\phantom{\rule{0.3em}{0ex}}5\right]$, and ${\mathcal{Z}}_{{\mathcal{D}}_{1}}=\left[\begin{array}{c}0\\ 0\\ 1\end{array}\right]$. Similarly, at n = 2, m = arg max(0.33, 0, 0.66, 0, 0, 0.33, 0) = 3 is calculated and the haplotype Z_{3} = [010]^{T} is added to the dictionary, i.e., ${\mathcal{D}}_{2}=\phantom{\rule{2.77626pt}{0ex}}[\phantom{\rule{0.3em}{0ex}}5,3]$, and ${\mathcal{Z}}_{{\mathcal{D}}_{2}}=\left[\begin{array}{c}00\\ 01\\ 10\end{array}\right]$. x_{3} is explained by the addition of new haplotype, i.e., x_{3} = [001]^{T} ⊕ [010]^{T}, yet the other xorgenotypes are not explained. At n = 3, m = arg max(1.33, 0.66, 0.66, 0.66, 0.66, 1.33, 0.66) = 1 is calculated and the haplotype Z_{1} = [000]^{T} is added to the dictionary, i.e., ${\mathcal{D}}_{3}=\phantom{\rule{2.77626pt}{0ex}}[\phantom{\rule{0.3em}{0ex}}5,3,1]$, and ${\mathcal{Z}}_{{\mathcal{D}}_{3}}=\left[\begin{array}{c}000\\ 010\\ 100\end{array}\right]$. Other two xorgenotypes are explained by the new addition, i.e., x_{1} = [001]^{T} ⊕ [000]^{T}, x_{2} = [010]^{T} ⊕ [000]^{T}, and the algorithm converges at n = 3 via calculating $F\left({\mathcal{D}}_{3}\right)=\frac{1}{3}\sum _{i=1}^{3}{F}_{i}\left({\mathcal{D}}_{3}\right)=\frac{4}{3}=\frac{1}{3}\sum _{i=1}^{3}\parallel {\mathit{x}}_{i}{\parallel}^{2}$.
This simple example demonstrates how the proposed greedy approach can efficiently construct sparse solutions, where three xorgenotypes are explained by only three haplotypes within three iterations. Nonetheless, the solution set has the ambiguity of being one of the equivalent sets of the true solution due to the bit flip degree of freedom which should be resolved.
Resolving bit flip degree of freedom
In [15] it is shown that the xor perfect phylogeny problem can be solved up to bit flipping based on the characteristics of the given xorgenotypes. Let X ∈ {0, 1}^{L × N} be the xorgenotypes matrix of N individuals such that $\mathit{X}=\phantom{\rule{0.3em}{0ex}}[\phantom{\rule{0.3em}{0ex}}{\mathit{x}}_{1}\phantom{\rule{2.77626pt}{0ex}}{\mathit{x}}_{2}\dots {\mathit{x}}_{N}]$. Denote χ_{ i } as the set of heterozygous loci for the ith individual, i.e., χ_{ i } = {ℓ:x_{ i }(ℓ) = 1}, where x_{ i }(ℓ) is the ℓth SNP in x_{ i }. If there exists a set of individuals $\mathcal{I}\subseteq \{1,\dots ,N\}$ whose xorgenotypes have empty intersection, i.e., ${\cap}_{\mathcal{I}}{\chi}_{\mathcal{I}}=\varphi $, then with the knowledge of regular genotypes ${\mathit{G}}_{\mathcal{I}}\in {\{0,1,2\}}^{L\times \left\mathcal{I}\right}$ of those individuals one can remove all bit flip degrees of freedom. The empty intersection indicates that an SNP will have homozygous allele in at least one of those individuals and therefore that SNP can be resolved by revealing the type of allele at the corresponding regular genotype. Following this, a postprocessing method is suggested in [15] that can remove the bit flip degree of freedom across the loci where a set of xorgenotypes have empty intersection.
By bit flipping on a given solution H, one attempts at choosing among the setequivalent solutions ${\mathit{H}}_{i}^{\prime},\phantom{\rule{2.77626pt}{0ex}}i=\{1,\dots ,{2}^{L}1\}$ and this choice is decided by the given regular genotypes (Figure 1).
However, this postprocessing method have certain limitations. Notice that, for large L the setequivalent solutions are highly specific to the infererred H, e.g., for a given set of xorgenotypes it is very likely that any two different inferences H_{1} and H_{2} —which are not setequivalent— can have very different setequivalent solutions. Bit flipping on different inferences likely leads to different results, and thereby the bit flipping accuracy largely depends on the initial inference H which is made by avoiding the prior knowledge on homozygous SNPs, i.e., regular genotypes. Besides, –when available– utilizing more regular genotypes in postprocessing does not necessarily improve the bit flipping accuracy. Basically, to decide among the appropriate bit flippings for a particular locus requires the knowledge of that homozygous SNP from a regular genotype. Intuitively, to reveal a set of homozygous SNPs by employing the least number of regular genotypes, e.g., provided by the MTI method, will be necessary and sufficient for removing the bit flip degree of freedom across those SNPs. On the other hand, a larger number of regular genotypes will not be any more informative due to possible inconsistencies on the type of homozygous allele for an SNP site across the given regular genotypes.
Furthermore, notice that flipping the bits on some loci across all the haplotypes in H does not affect the parsimony of the solution. The final solution H^{′} will have the same parsimony with H regardless of the set of loci that are flipped. From the maximum parsimony point of view, refining an xorhaplotyping solution via bit flipping method does not necessarily lead to global optimum unless the initial inference is a setequivalent of the global optimal solution.
Therefore, instead of using regular genotypes to postprocess a solution, a more intuitive way could be to aim at resolving the bitflip degree of freedom while constructing the solution. In particular, regular genotypes can be used as constraints when solving the homozygous sites of an xorgenotype. In this sense, given a set of individuals’ xorgenotypes we determine the individuals that have the most informative regular genotypes and preprocess the data set by replacing with the regular genotypes for those individuals. The MTI algorithm [15] is useful for finding the least number of such individuals that will be adequate to reveal the homozygous alleles for each of the L SNPs. In the proposed XHSD framework, we employ the MTI method to find which individuals should be replaced with regular genotypes and after replacing them the new data set is presented to the XHSD algorithm (Figure 2).
In most cases the xorgenotypes in X has empty intersection and for each run MTI outputs 2 or 3 individuals, i.e., $\left\mathcal{I}\right\le 3$; then ${\mathit{G}}_{\mathcal{I}}$ has at most 3 regular genotypes. One can obtain a larger ${\mathit{G}}_{\mathcal{I}}$ by performing multiple runs of MTI with X and collecting the distinct regular genotypes given by MTI.
Next we explain the necessary modifications to the XHSD algorithm for utilizing the regular genotypes.
XHSD with regular genotypes
The information provided by regular genotypes is used to reveal the type of allele in homozygous sites of an individual so that we can improve the reconstruction accuracy in (5), and build the dictionary $\mathcal{D}$ with more reliable haplotypes. That is, when a regular genotype g_{ i } is observed in the ith individual we employ the variance reduction metric that is given for regular genotypes such that
where Z is the set of haplotypes that are compatible with the ith individual’s genotype g_{ i }, and $\mathcal{A}$ contains the indices of the haplotypes in Z that are used to approximate g_{ i }. In this representation the approximation accuracy is potentially higher when compared to the xorgenotypes, since the homozygous SNPs in g_{ i } are unambiguous. The haplotypes that are used to approximate those SNPs will be more reliable candidates when building the dictionary $\mathcal{D}$.
To exploit this fact, we can introduce a weight b_{ i } in the cost function ${L}_{i}\left(\mathcal{A}\right)$ so that the algorithm will give a higher priority on the variance reduction of those individuals that are given by regular genotypes, and the dictionary will more likely grow with the haplotypes that are compatible with the given regular genotypes. The biased variance reduction metric for each individual is then given by
The weight parameter b_{ i } could be set as proportional to the average rate of homozygous SNPs per genotype, assuming that the more homozygous sites the regular genotype contains the more informative it will be. We experimentally set b_{ i }=4 as it yielded good performance with both synthetic and real databases.
Extensions
Long xorgenotypes
Note that the size of Z grows exponentially with the lengthL due to the compatibility between haplotypes and xorgenotypes. That is, finding the solution of a lengthL xorgenotype requires to perform the greedy search over Z that consists of 2^{L} haplotypes. To mitigate the computational complexity we employ the partitionligation method [28] as in [25] where the block partitioning is based on identifying the recombination hot spots [29] existing between the haplotype blocks [30]. After partitioning the SNP sequences will be divided into blocks where within each block the haplotype diversity is as low as possible.
The haplotype diversity of a given block is measured by its Shannon entropy. The block partitioning by minimizing the total Shannon entropy proceeds as follows. Let $\left\{{\stackrel{~}{\mathit{h}}}_{1}^{\mathit{\text{lm}}}\dots {\stackrel{~}{\mathit{h}}}_{{\stackrel{~}{K}}_{\mathit{\text{lm}}}}^{\mathit{\text{lm}}}\right\}$ be the ${\stackrel{~}{K}}_{\mathit{\text{lm}}}$ haplotypes that explains all the xorgenotypes ${\mathit{x}}_{i}^{\mathit{\text{lm}}},\phantom{\rule{2.77626pt}{0ex}}i=1,\dots ,N$ in the block that starts at locus l and ends at locus m, i.e., 1 ≤ l ≤ m ≤ L, and let ${\stackrel{~}{\mathit{f}}}^{\mathit{\text{lm}}}=\left[{\stackrel{~}{f}}_{1}^{\mathit{\text{lm}}},\dots ,{\stackrel{~}{f}}_{{\stackrel{~}{K}}_{\mathit{\text{lm}}}}^{\mathit{\text{lm}}}\right]$ be the haplotype frequency vector for this block. Each ${\stackrel{~}{f}}_{k}^{\mathit{\text{lm}}},\phantom{\rule{2.77626pt}{0ex}}k=1,\dots {\stackrel{~}{K}}_{\mathit{\text{lm}}}$ is represented by the density of the nonzero values of the indicator vectors $\{{\mathit{v}}_{1}^{\mathit{\text{lm}}},\dots ,{\mathit{v}}_{N}^{\mathit{\text{lm}}}\}$ for the given block ${\mathit{x}}_{i}^{\mathit{\text{lm}}}$, i.e.,
The entropy of the haplotype block ${\stackrel{~}{\mathit{h}}}_{k}^{\mathit{\text{lm}}}$ is then given by
and the total entropy of Q blocks, where each block $[\phantom{\rule{0.3em}{0ex}}{l}^{q}:{m}^{q}],\phantom{\rule{2.77626pt}{0ex}}q=1,\dots Q$ has an upper bound of length W, i.e., m^{q} − l^{q} + 1 ≤ W, is given by
To determine the initial and ending loci of each block $[\phantom{\rule{0.3em}{0ex}}{l}^{q}:{m}^{q}],\phantom{\rule{2.77626pt}{0ex}}q=1,\dots Q$ that minimizes $\mathcal{E}$ we use the recursive method explained in [25], i.e., for each ending locus 1 ≤ m ≤ L we determine the block $[\phantom{\rule{0.3em}{0ex}}{l}_{m}^{\ast}:m]$, with $m{l}_{m}^{\ast}+1\le W$, that contributes with the lowest entropy and then backtrack the best initial points ${l}_{m}^{\ast}$ for each consecutive block by starting with the block $[\phantom{\rule{0.3em}{0ex}}{l}_{L}^{\ast},L]$.
Missing data
Genotyping errors often occur when the observed genotype of an individual differs from the original sequence for various reasons [31, 32]. A particular type of genotyping error is the case when some loci are not observed/missed during sequencing or other application processes. Although methods dealing with some type of errors were proposed, often erroneous genotypes are produced with significant missing/error rates [33]. Therefore, it is of high importance for an xorhaplotyping technique to be adaptive for resolving such databases with missing sites. We next present a modification to XHSD in order to perform xorhaplotyping for the individuals exposed to missing data conditions.
Let ${\stackrel{~}{\mathit{g}}}_{i}$ be the incomplete genotype of the ith individual where the loci with missing information in g_{ i } are removed. Similarly, let ${\stackrel{~}{\mathit{x}}}_{i}$ represent the xorgenotype of the ith individual where the missing loci are removed. As the rate of missing loci increases the sequences become less informative. Following the suggestion in [25], we introduce another weight w(.) to give less weight to the less informative individuals when evaluating (12) in order to improve the reliability of haplotype inference, i.e.,
where ${\stackrel{~}{\mathit{Z}}}_{\mathcal{A}}^{i}$ is the matrix Z with the rows corresponding to the missing loci of the ith individual removed. The weight is selected as a nondecreasing function of the total information content in the sequence such that
where $dim\left({\stackrel{~}{\mathit{x}}}_{i}\right)$ gives the dimension of ${\stackrel{~}{\mathit{x}}}_{i}$.
Different weight functions could be employed to exploit the distribution of missing sites. Since, in our experiments, the missing sites are uniformly distributed across the SNPs and individuals the function in (14) gave a good performance.
The proposed method does not account for the direct inference of the missing sites, i.e., imputing missing genotypes [34]. However, the missing values in each xorgenotype can be recovered from the solution by simply looking at the haplotype pairs which are specifically inferred for each individual. Since the proposed method has robust performance against missing data, as presented in the next section, the inferred solution will be sufficient to type missing genotype sites. An implementation of the proposed method –with aforementioned extensions– is provided in “Additional file 1”.
Results and discussion
We tested the performance of several xorhaplotyping methods with a number of metrics. First we measured the probability of error (P_{ e }), i.e., the percentage of individuals whose inferred pair of haplotypes are different from the original pair. This measure is sensible for assessing the inference quality in regular haplotyping problem since the alleles corresponding to homozygous loci are known and only the heterozygous loci are ambiguous thereby performance depends on the inference accuracy on heterozygous loci. Nonetheless, in xorhaplotyping there are a large number of equivalent solutions to original one up to bit flipping and thereby it is very likely that a solution set differs from the original phasing on at least one SNP. In particular, for a given xorgenotype even if there is a single SNP difference (namely bit flip) between the pair of inferred haplotypes and the pair of haplotypes that originally gave rise to that xorgenotype, it is counted as misinference. A more sensible metric, therefore, would take into account the percentage of such SNPs where the inference differs from the true phasing. In that sense, the switch error rate (swr) [35] is a proper metric that counts the minimum amount of required switches for heterozygous loci to change to the correct alleles of the original haplotypes. It gives a sense of how closely the inference was made, i.e., as a ratio of total misinferred heterozygous loci $\mathit{\text{mis}}{s}_{i}^{\mathit{\text{het}}}$ in all individuals $i=\{1\dots N\}$ to the worstcase number of switches (half of the number of heterozygous loci in each individual χ_{ i } / 2), i.e.,
Moreover, to assess the accuracy on homozygous sites, we employ prediction error rate (e r r_{ p }) [23] computed as the fraction of incorrectly predicted hiddenhomozygous sites out of all hiddenhomozygous sites, i.e.,
We performed xorhaplotyping on various data sets, with and without missing information on loci: synthetic data sets with different recombination rates simulated by a coalescence based program of [36], a database consisting of the SNPs in the CFTR gene that is associated with cystic fibrosis (CF) disorder [37], and another database (ANRIL) containing the SNPs that have relatively lower linkage disequilibrium (high polymorphism). We tested different xorhaplotyping methods that are based on different assumptions including the parsimony graph realization model PPXH [24], the parsimony genetic search model XORHAPLOGEN (XHAP) [23], the graph representation model GREAL [15], and an integer programming approach PolyIP [38]. Among the four methods the last two were ineffective for practical reasons. GREAL failed at finding solutions for data sets with reasonably long sequences (SNPs >30), and PolyIP method is often computationally inefficient when solving even a simple problem (e.g., it takes more than 24 hours to solve a set of 50 individuals with 30 SNPs).
Synthetic data
Based on different recombination rates three different scenarios are considered in synthetic data sets: no recombination (r = 0), and recombination with rates r=4 and r=40, respectively. The recombination rate is the rate that the haplotypes of an individual exchange the sequence fragments due to several reasons such as crossingover events. This fact is simulated by a model given in Hudson’s software [36]. For each scenario we generated 100 different data sets by random pairing of a set of simulated haplotypes of different lengths (5 ≤ L ≤ 46) for a given population size. This is repeated for different population sizes as well, N ∈ {10, 20, 30, 40, 50}.
In Figure 3, the performances of different methods on short data sets (L<14) are displayed which is based only on xorgenotypes. The quality of inference is exhaustively determined after removing all bit flip degrees of freedom by looking for the best equivalent set of a particular inference, i.e., performing an exhaustive search to find the best bit flipping that gives a result closest to the true phasing of xorgenotypes. Such evaluation shows the best inference performance of different methods without the help of regular genotypes. Compared to other methods, XHSD can potentially resolve a set of xorgenotypes with comparably low error rates. Moreover, XHSD achieves the lowest switch error rates, especially for large datasets, indicating a better accuracy (i.e., similarity with the true haplotypes) for the initial inference given only the xorgenotypes.
To evaluate the inference quality when regular genotype data are available, we first determined only a limited number of regular genotypes by the MTI method, i.e., the smallest set of regular genotypes that have empty intersection on the heterozygous SNPs, then resolved the ambiguity by bit flipping on the initial inference according to these regular genotypes (Figure 4). This test evaluates how methods can deal with bitflip degree of freedom under very limited regular genotype data that –in theory– suffice to resolve all SNPs. Given the long xorgenotype data sets (5 ≤ L ≤ 46), block partitioning is applied in XHSD by limiting the maximum block size to W=8 SNPs. From Figure 4, we can say that XHSD has the best potential to make an inference with high accuracy when the regular genotypes are introduced. We also applied the proposed XHSD framework represented in Figure 2 to the same dataset where 2 xorgenotypes are replaced with the regular genotypes. Note that the Proposed XHSD achieves a significant decrease in P_{ e } rates despite the small augmentation of data by only 2 regular genotypes, compared to using them in the postprocessing, i.e., XHSD (bit flipping).
It is worthy of noting that the algorithms based on segmentation may deteriorate when processing long xorgenotype sequences, especially with increasing recombination rates where the detection of haplotype blocks is complicated [39]. We used block partitioning (segmentation) in XHSD to reduce complexity when processing long xorgenotype sequences. In Figure 4 the segmentation effect is noticeable particularly in very high recombination rates, i.e., r = 40. However, in general scenario, i.e., r ≤ 4, we can say that the segmentation effect is not significant for the proposed method’s performance, and it outperforms XORHAPLOGEN in most data sets containing typical recombination rates.
For more practical results we added regular genotypes in each method with different percentages of the population and allowed the methods to remove ambiguity by their own, except for PPXH. Since PPXH cannot make use of regular genotypes directly, we applied bit flipping using the MTI solver to remove ambiguity for this method. To regularly genotype a given percentage of the population, the regular genotypes are determined by running the MTI method several times until the number of distinct regular genotypes obtained achieves the given percentage of the total number of individuals.
Figure 5 shows performances on the synthetic data of a large population of 50 individuals with zero recombination rate, where cases are considered from 10% (5 individuals) to 100% (50 individuals) of the population are given by regular genotypes. XHSD overperforms other methods in almost all cases. Particularly after 20% of the population is given by regular genotypes, XHSD can immediately utilize regular genotypes and significantly improve the accuracy on both homozygous (err_{ p }) and heterozygous sites (swr). We can conclude that the parsimony principle of XHSD method is wellsuited for inferring the heterozygous sites, and for predicting the homozygous sites it usually suffices to have a small percentage of regular genotypes.
Missing data
We investigated capability for dealing with missing data under different circumstances by various methods. Since the methods performed similarly under zero recombination rate we used the same data sets with no recombination to generate the database with missing entries. An SNP site of an individual is defined as “missing” with a probability of P_{ miss } and the data sets for different percentages of missing SNPs are generated accordingly. PPXH method is excluded since it cannot handle missing data. In XHSD the block partitioning is applied as before with a maximum block size of W = 8 SNPs.
Figures 6 and 7 show the performances in different scenarios of partial regular genotyping under different rates of missing data. As in the previous plots, each point represents the average value of the corresponding metric over 100 realizations–100 different sets of varying SNP sizes between 5 and 46. In most cases, XORHAPLOGEN and XHSD are insensitive to the increased number of missing sites. XORHAPLOGEN is more accurate for small group of individuals. Nonetheless, when more individuals are available in the database (N>30) XHSD displays a better performance in all circumstances.
We examined the dependency of methods on percentage of the missing data rate for a population with large number of individuals. That is, we used the xorgenotypes from 50 individuals and replaced 30% and 50% of the population with regular genotypes, and performed xorhaplotype inference under different missing data rates ranging from 0.5% to 5%. As seen in Figure 8 both methods are robust against missing data. On the other hand, XHSD is less dependent on regular genotypes and it can achieve better error rates than XORHAPLOGEN by employing even less number of regular genotypes. XORHAPLOGEN needs approximately 20% more regular genotypes to reach the same P_{ e } level with XHSD, e.g., regular genotyping by 30% in XHSD is comparable to that of 50% in XORHAPLOGEN.
CFTR gene database
Cystic fibrosis (CF) is an autosomal recessive disorder caused by mutations in the gene that encodes the cystic fibrosis transmembrane conductance regulator protein (CFTR). In [37], various mutations on 23 polymorphic locations from the chromosome 7 are detected as the disease loci for CF. We used this database corresponding to 29 distinct haplotypes to generate random xorgenotypes. By combining the haplotype pairs at random we generated the xorgenotypes for a given number of individuals N, and repeated the process for different population sizes, i.e., N ∈ {100, 200, 300, 400}. In this database, the data sets with small number of individuals present high haplotype diversities, i.e., many of the distinct haplotypes are only used once in the generation of individuals. Therefore, the larger data sets that have low haplotype diversities are expected to be solved with higher accuracy by biologicallyoriented methods, such as XORHAPLOGEN which obtains its inference according to a multilocus linkage disequilibrium (LD)based block identification model.
We tested the performance of each method on this database with/without missing sites {0,5%}. PPXH method was excluded from the missing data analysis since it cannot deal with missing data. XHSD is applied with block partitioning and the maximum block length of W=8 SNPs as before. It is seen in Figure 9 that XHSD outperforms for various population sizes with significantly low error rates. As the xorgenotypes are taken from more individuals, the inference accuracy is immediately improved in XHSD and XORHAPLOGEN, whereas PPXH do not have this ability to benefit from the additional data.
Figure 10 shows the average running times of each method performing on this database. It is observed that XHSD has similar computational complexity as the size of data set grows, and it shows comparable running times with XORHAPLOGEN. Although PPXH performs significantly faster, it cannot mitigate the high error rates and is not able to provide accurate inferences.
Typing errors
Combinatorial optimization techniques are known with their sensitivity to genotyping errors [40]. Thereby, we tested the effect of typing errors on the proposed algorithm using CFTR gene database. We defined a SNP site of an individual as erroneous with a probability of P_{ err }, and typed the site as either homozygous or heterozygous with equal probabilities. We then run the algorithms without providing the knowledge of erroneous site positions. We excluded PPXH method due to its low performance on the CFTR database. Figure 11 illustrates the algorithms’ performance on typing errors with P_{ err } = 2%. It is seen that XORHAPLOGEN is a more robust method against typing errors because of its statical nature. Nonetheless, the proposed XHSD algorithm can deal with erroneous data containing ∼2% typing errors, with a small increase in the error rates compared to the results without typing errors.
ANRIL database
The performance of haplotyping methods can deteriorate on databases with decreasing linkage disequilibrium (LD) rates. A SNP database with low pairwiseLD scores are investigated in an association study given in [41] for their susceptibility to certain types of leukemia. This database includes 16 SNPs from the chromosome 9p21 associated with several diseases and a SNP locus encoding for antisense noncoding RNA in the INK4 locus (ANRIL)[42]. We used the corresponding haplotype data from HAPMAP database (http://hapmap.ncbi.nlm.nih.gov/) collected from 90 European individuals. We generated the xorgenotypes for the individuals by using their haplotype pairs and tested the algorithms on this database. It is seen from the Figure 12 that the algorithms deteriorate when inferring the haplotypes with lowLD SNPs. XHSD shows very similar performance with XORHAPLOGEN, and both methods overperform PPXH on this database.
Notice that the algorithms cannot mitigate the error rates with increasing number of individuals. This can be explained by the occurrence of very high haplotype diversity in corresponding lowLD SNP regions. The number of distinct haplotypes explaining the given number of individuals presumably remains at high diversity as the number of individuals grows, whereas the methods based on maximum parsimony principle fail to incorporate this fact. They are tend to find parsimonious (lowdiversity) solutions in all population sizes, with a decreasing ratio (ρ) of “total number of distinct haplotypes explaining the given set of individuals” to “total number of given individuals” as the population size grows. It is worthy of noticing that, in XHSD results in Figure 12 (P_{ miss } = 0), we observed that such ratio decreases as ρ = [1.3, 0.95, 0.83, 0.72, 0.66] in respect to the populations with 10,20,30,40,50 individuals; whereas the same ratio for the true phasing (ground truth data) is in fact much higher, i.e., ρ = [1.7, 1.48, 1.34, 1.27, 1.24], respectively, thereby causing the parsimonybased haplotyping methods to deteriorate on this database. On the other hand, in highLD CFTR database, the same ratio for the true phasing is very low due to low haplotype diversity, i.e., ρ = [0.29, 0.14, 0.1, 0.07], in respect to the populations with 100,200,300,400 individuals, and the XHSD method is good at achieving very similar rates,i.e., ρ = [0.43, 0.15, 0.1, 0.07], respectively.
Conclusions
In this paper, we have presented a new xorhaplotyping method XHSD based on the maximum parsimony principle that infers the haplotype pairs for each member of a group of unrelated individuals by observing their xorgenotypes. A dictionary selection method is utilized to find the smallest set of haplotypes selected from a candidate set that can explain the given set of xorgenotypes. The proposed approach requires regular genotypes from only a small percentage of individuals for the removal of ambiguity across all SNPs of the inferred haplotypes. The smallest subgroup of individuals having the most informative regular genotypes are efficiently determined by the minimum tree intersection algorithm. Although the inference accuracy was proportional to the percentage of the individuals given by regular genotypes, XHSD shows less dependency on regular genotypes compared to other methods. Experimental results have demonstrated that XHSD is a reliable method for xorhaplotyping under all circumstances including missing data and typing error cases. Low rates of missing values (≤ 10%) on the xorgenotypes has often insignificant contribution to the error rates, and the proposed method can deal with ∼ 2% typing errors. Particularly for large databases, XHSD produces the most accurate solution with significantly low error rates compared to other lowcomplexity xorhaplotyping methods. Experiments with CFTR gene database also proved that our approach can perform effectively on real data sets with/without missing sites. Another database with particularly lower LD rates indicates that the proposed algorithm can achieve the best performance with the stateoftheart algorithms. We expect that XHSD can serve as a practical tool for xorhaplotyping on realworld large instances, as the large data collections become more available in the era of nextgeneration DNA sequencing.
References
 1.
Brookes A: The essence of SNPs. Gene. 1999, 234: 177186. 10.1016/S03781119(99)00219X.
 2.
Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 15161517. 10.1126/science.273.5281.1516.
 3.
Kwok PY, Gu Z: Single nucleotide polymorphism libraries: why and how are we building them?. Mol Med Today. 1999, 5: 538543. 10.1016/S13574310(99)016019.
 4.
Gray IC, Campbell DA, Spurr NK: Single nucleotide polymorphisms as tools in human genetics. Human Mol Genet. 2000, 9: 24032408. 10.1093/hmg/9.16.2403.
 5.
Peters T, Sedlmeier R: Current methods for highthroughput detection of novel DNA polymorphisms. Drug Discov Today: Technol. 2006, 3 (2): 123129. 10.1016/j.ddtec.2006.05.002.
 6.
Xiao W, Oefner PJ: Denaturing highperformance liquid chromatography: a review. Human Mutat. 2001, 17: 439474. 10.1002/humu.1130.
 7.
Bansal V, Bafna V: HapCUT: An efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008, 24 (16): i153i159. 10.1093/bioinformatics/btn298.
 8.
He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E: Optimal algorithms for haplotype assembly from wholegenome sequence data. Bioinformatics. 2010, 26 (12): i183i190. 10.1093/bioinformatics/btq215.
 9.
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010, 34 (8): 816834. 10.1002/gepi.20533.
 10.
Gupta A, Manuch J, Stacho L, Zhao X: Algorithm for haplotype inference via galledtree networks with simple galls. J Comput Biol. 2012, 19 (4): 439454. 10.1089/cmb.2010.0145.
 11.
Iliadis A, Anastassiou D, Wang X: A unified framework for haplotype inference in nuclear families. Ann Hum Genet. 2012, 76 (4): 312325. 10.1111/j.14691809.2012.00715.x.
 12.
Lai E, Wang W, Jiang T, Wu K: A lineartime algorithm for reconstructing zerorecombinant haplotype configuration on a pedigree. BMC Bioinformatics. 2012, 13 (Suppl 17): S1910.1186/1471210513S17S19.
 13.
He D, Han B, Eskin E: Hapseq: An optimal algorithm for haplotype phasing with imputation using sequencing data. J Comput Biol. 2013, 20 (2): 8092. 10.1089/cmb.2012.0091.
 14.
Kimura M, Crow JF: The number of alleles that can be maintained in a finite population. Genet. 1964, 49: 725738.
 15.
Barzuza T, Beckmann JS, Shamir R, Pe’er I: Computational problems in perfect phylogeny haplotyping: Typing without calling the allele. IEEE/ACM Trans Comput Biol Bioinformatics. 2008, 5: 101109.
 16.
Liberatore V: Matroid decomposition methods for the set maxima problem. Proceedings of the Ninth Annual ACMSIAM Symposium on Discrete Algorithms. 1998, Philadelphia: Society for Industrial and Applied Mathematics, 400409.
 17.
Gusfield D: Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. Proc. 6th Annual Conference on Research in Computational Molecular Biology (RECOMB 2002). 2002, New York: ACM, 166175.
 18.
Barzuza T, Beckmann J, Shamir R, Pe’er I: Computational problems in perfect phylogeny haplotyping: XorGenotypes and tag SNPs. Combinatorial Pattern Matching, Volume 3109. Edited by: Dogrusoz U, Muthukrishnan S, Sahinalp S. 2004, Berlin, Heidelberg, New York: SpringerVerlag, 1431.
 19.
Patil N: Blocks of limited haplotype diversity revealed by highresolution scanning of human chromosome 21. Science. 2001, 294: 17191723. 10.1126/science.1065573.
 20.
Lancia G, Pinotti MC, Rizzi R: Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. INFORMS J on Comput. 2004, 16: 348359. 10.1287/ijoc.1040.0085.
 21.
Gusfield D: Haplotype inference by pure parsimony. Comb Pattern Matching. 2003, Berlin, Heidelberg, New York: SpringerVerlag, 144155.
 22.
Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics. 2003, 19: 17731780. 10.1093/bioinformatics/btg239.
 23.
Sazonova N, Sazonov E, Harner EJ: Algorithm for haplotype resolution and block partitioning for partial XORgenotype data. J of Biomed Inform. 2010, 43: 5159. 10.1016/j.jbi.2009.08.009.
 24.
Bonizzoni P, Della Vedova G, Dondi R, Pirola Y, Rizzi R: Pure parsimony xor haplotyping. IEEE/ACM Trans Comput Biol Bioinformatics. 2010, 7: 598610.
 25.
Jajamovich GH, Wang X: Maximumparsimony haplotype inference based on sparse representations of genotypes. IEEE Trans Signal Process. 2012, 60: 20132023.
 26.
Nemhauser GL, Wolsey LA, Fisher ML: An analysis of approximations for maximizing submodular set functions—I. Math Program. 1978, 14: 265294. 10.1007/BF01588971.
 27.
Krause A, Cevher V: Submodular dictionary selection for sparse representation. Proceedings of 27th International Conference on Machine Learning (ICML 2010). 2010, Madison: Omnipress, 567574.
 28.
Niu T, Qin Z, Xu X, Liu J: Bayesian haplotype inference for multiple linked singlenucleotide polymorphisms. Am J Hum Genet. 2002, 70: 157169. 10.1086/338446.
 29.
Liang K, Wang X: A deterministic sequential monte carlo method for haplotype inference. IEEE J Selected Topics Signal Process. 2008, 2: 322331.
 30.
Hey J: What’s so hot about recombination hotspots?. PLoS Biol. 2004, 2: e190+
 31.
Bonin A, Bellemain E, Bronken Eidesen P, Pompanon F, Brochmann C, Taberlet P: How to track and assess genotyping errors in population genetics studies. Mol Ecol. 2004, 13: 32613273. 10.1111/j.1365294X.2004.02346.x.
 32.
Pompanon F, Bonin A, Bellemain E, Taberlet P: Genotyping errors: causes, consequences and solutions. Nat Rev Genet. 2005, 6: 847846. 10.1038/nrg1707.
 33.
Hackett CA, Broadfoot LB: Effects of genotyping errors, missing values and segregation distortion in molecular marker data on the construction of linkage maps. Heredity. 2003, 90: 3338. 10.1038/sj.hdy.6800173.
 34.
Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genomewide association studies by imputation of genotypes. Nature Genet. 2007, 39 (7): 906913. 10.1038/ng2088.
 35.
Tininini L, Bertolazzi P, Godi A, Lancia G: CollHaps: a heuristic approach to haplotype inference by parsimony. IEEE/ACM Trans Comput Biol Bioinformatics. 2010, 7: 511523.
 36.
Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics. 2002, 18: 337338. 10.1093/bioinformatics/18.2.337.
 37.
Kerem B, Rommens J, Buchanan J, Markiewicz D, Cox T, Chakravarti A, Buchwald M, Tsui L: Identification of the cystic fibrosis gene: genetic analysis. Science. 1989, 245: 10731080. 10.1126/science.2570460.
 38.
Halldorsson BV, Bafna V, Edwards N, Yooseph S, Istrail S: A survey of computational methods for determining haplotypes. Lecture Notes in Computer Science (2983): Computational Methods for SNPs and Haplotype Inference. 2004, Berlin, Heidelberg, New York: SpringerVerlag, 2647.
 39.
Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missingdata imputation. Am J Human Genet. 2005, 76 (3): 449462. 10.1086/428594.
 40.
Excoffier L, Laval G, Balding D: Gametic phase estimation over large genomic regions using an adaptive window approach. Human Genomics. 2003, 1: 719. 10.1186/14797364117.
 41.
Iacobucci I, Sazzini M, Garagnani P, Ferrari A, Boattini A, Lonetti A, Papayannidis C, Mantovani V, Marasco E, Ottaviani E, Soverini S, Girelli D, Luiselli D, Vignetti M, Baccarani M, Martinelli G: A polymorphism in the chromosome 9p21 ANRIL locus is associated to Philadelphia positive acute lymphoblastic leukemia. Leukemia Res. 2011, 35 (8): 10521059. 10.1016/j.leukres.2011.02.020.
 42.
Pasmant E, Laurendeau I, Heron D, Vidaud M, Vidaud D, Bieche I: Characterization of a germline deletion, including the entire INK4/ARF locus, in a melanomaneural system tumor family: Identification of ANRIL, an Antisense Noncoding RNA whose expression coclusters with ARF. Cancer Res. 2007, 67 (8): 39633969. 10.1158/00085472.CAN062004.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
XW and GJ conceived of the project. AE, GJ and XW participated in the design of the method. AE performed the computer experiments and contributed in the writing of the draft. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Cystic Fibrosis Transmembrane Conductance Regulator
 Greedy Method
 Block Partitioning
 Heterozygous Site
 Single Nucleotide Polymorphism Site