ComHapDet: a spatial community detection algorithm for haplotype assembly

Background Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual’s susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. Results We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet – a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants. Conclusions Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.


Background
Technological advancements in DNA sequencing have enabled unprecedented studies of genetic blueprints and variations between individual genomes. An individual genome of a eukaryotic organism is organized in K-tuples of homologous chromosomes; diploids (K = 2), including range of applications. For instance, in humans, the haplotype information contains indicators to the individual's susceptibility to diseases and expected responses to certain classes of drugs [1]. Haplotype sequences are also used to infer recombination patterns and identify genes under positive section [2]. In the case of agricultural crops such as the tuber family, the haplotypes provide insight into developing disease and pest resistant variety of crops [3]. Thus, fast and accurate algorithms for both polyploid and diploid haplotype reconstruction (or also known as phasing) from high throughput sequencing reads are highly desirable.
Recent advancements in DNA sequencing technologies have dramatically improved affordability and speed of sequencing; the most widely used are high-throughput sequencing devices (e.g., the Illumina's platforms). Information provided by such platforms typically comes in form of paired-end reads, each consisting of two short contiguous samples of the chromosome (separated by a few hundred bases). Typical reference-guided data processing pipeline starts by determining the relative ordering of the reads via mapping them to a reference genome; this step is followed by the detection of variant positions and SNP calling. To perform haplotype assembly, existing methods typically retain only the reads that cover variant positions; parts of the retained reads that cover homozygous sites are also discarded. Then the remaining information needs to be used to identify K distinct haplotype sequences S 1 , · · · , S K , of same length. From the previous discussion, it is clear that there is no position in these K strings where all the alleles are identical (such a position would be homozygous and therefore discarded in the pre-processing step). While the relative positions of reads are revealed by mapping them onto a reference, their origin remains unknown, i.e., it is not known which among the K chromosomes a given read samples. In the absence of sequencing errors, grouping reads according to their origin is rather straightforward and based on the disagreement regarding the allele information that the reads provide for each variant site. Unfortunately, however, sequencing is erroneous with state-of-the-art sequencing platforms achieving sequencing errors in the range of 10 −3 − 10 −2 . In the presence of errors, it is no longer obvious how to decide whether a read originates from a particular haplotype; to this end, one needs to rely on a computational framework for haplotype assemblythe central focus of the present paper.
Existing work formulates haplotype assembly as a combinatorial optimization problem where one seeks to reconstruct the haplotype sequence by judiciously making as few modifications of the data as possible in order to remove read membership ambiguities arising due to sequencing errors. This has led to optimization criteria such as minimal fragment removal, minimum SNP removal [4], maximum fragments cut [5] and minimum error correction (MEC) [6]. Motivated by the observation that the MEC score optimization is NP-hard [4,7], a considerable amount of recent work studied relaxations of the underlying combinatorial optimization problem [7][8][9][10]. In a pioneering work, [11] proposed a greedy algorithm aimed at assembling most-likely haplotype sequences given the observations. This line of thought of using Bayesian methods to reconstruct the most likely haplotype sequence was carried out further in [12] and [13] using MCMC and Gibbs sampling, respectively. However, these methods are usually slow as the associated Markov chains have large mixing times, thereby making their practical applicability limited. HapCUT [14] makes a connection between haplotype assembly and graph-clustering, and solves it by identifying a maximum cut in an appropriately constructed graph. This method was shown to be superior to [11] and is widely used in benchmarking.
HapCUT algorithm was then significantly outperformed by HapCompass [15]. However, apart from HapCompass, all of these methods are restricted to the diploid case. To address both the diploid and polyploid scenarios, SDhaP [16] relaxes the MEC minimization problem to a convex optimization program and solves it efficiently by exploiting the underlying data structure. More recently, AltHap [17] casts haplotype assembly as a low-rank tensor factorization problem and solves it via fast alternating optimization heuristics. The connection to matrix factorization was previously exploited in [18] and [19], but those methods were incapable of handling polyploids or polyallelic assembly problem. Prior to the current paper, AltHap [17] is the only algorithm capable of solving the assembly problem in the polyploids/polyallelic scenario.

Main contributions
In this paper, we propose a novel formulation of haplotype assembly as a spatial graph clustering problem. This formulation is based on spatial point process representation of paired-end reads; in particular, we argue that assigning to each paired-end read spatial coordinates corresponding to the starting indices of the reads in the pair represents valuable augmentation of the information. Equipped with such a representation of the data, we construct a graph whose nodes represent the reads, and place a weighted edge between two reads if they overlap in at least one position. The edge weights are formed using the scoring function adopted from [16]; this weighting mechanism ensures that if two reads belong to the same haplotype, then the edge connecting them will likely be assigned a large positive weight, while if they belong to different haplotypes, then the edge between them will likely be assigned a large negative weight. We then cast the haplotype assembly problem as an Euclidean community detection problem [20], where the community label of a node (i.e., a read) indicates the haplotype it comes from. We find in our experiments that such a 'spatial' embedding of the problem greatly improves both the accuracy and the run time complexity of polyploid phasing. The improvement in accuracy stems from the fact that our algorithm naturally enforces a regularizing constraint that the underlying sampling process is in a sense uniformly distributed in space. In other words, we know that in every 'location' of space, the total number reads covering the given SNP location belonging to the various haplotypes are identical. Having a spatial representation of the data allows one to incorporate this prior knowledge about the sampling process and leads to dramatically higher accuracy, especially in the polyploid case. The spatial representation is also crucial in reducing run times, as this exposes the problem's inherent 'locality' condition in space. More precisely, observe that if two reads are 'far away' in this embedding (i.e, they do not overlap), there will be no edge connecting them in the corresponding graph. This allows us to naturally 'decompose' the haplotype assembly problem using a divide-and-conquer paradigm, where we can perform assembly on smaller spatially localized sets of readswhich is computationally very efficient -and then perform a synchronization step to combine the local haplotype assemblies into a single global solution. This algorithmic framework is robust to inaccuracies in the individual local instances of phasing since the synchronization or the combining step has a natural error-correction mechanism due to a single location being phased in multiple local instances. Thus, we can employ faster but less accurate local phasing methods while still achieving high global accuracy and good run-time performance. Further technical aspects of these ideas are elaborated in "The haplotype assembly algorithm" section.

Problem formulation
We formally state the problem before describing our methods.

Setup and notation
Let m and n denote the length of the haplotype sequences and the total number of paired-end read measurements, respectively. Let k denote the ploidy, i.e., the number of haplotype sequences and a be the cardinality of the alphabet set. If the variant sites are polyallelic, then a = 4 (i.e., all 4 alleles A, C, G, T are possible); in diploid and polyploid bi-allelic case, a = 2. The haplotype sequences are denoted by s l [ i], where for each i ∈ {1, · · · , m} and l ∈ {1, · · · , k}, we have s l [ i] ∈ {1, · · · , a}. As an example, in the poly-allelic case, we have s l [ i] ∈ {A, C, G, T}. In the rest of the paper, we refer to the haplotype positions {1, · · · , m} as sites.
Each read r u , u ∈ {1, · · · , n}, is formed by first sampling a haplotype indexed by v u ∈ {1, · · · , k} and then sampling a sequence of alleles s (u) [ i] i∈I u at sites I u ⊂ {1, · · · , m}, where for each i ∈ I u ,s (u) [ i] ∈ {A, C, G, T} is a "noisy" (due to sequencing errors) version of the underlying ground truth s v u [ i]. For each measurement u ∈ {1, · · · , n}, we observe the set of positions I u and the noisy values s (u) [ i] i∈I u but not the index v u of the haplotype from which the read originates. Thus, using the above notation, the total dataset is denoted as The goal of assembly is to infer origins of the reads and recover haplotype sequences.
In this paper, we consider the case where the data is in form of paired-ended reads. Formally, this implies that each measurement u ∈ {1, · · · , n} is such that the set of sites covered by read u has two contiguous blocks. More precisely, we assume that each read u ∈ {1, · · · , n} is such that there exists i (u) , · · · , m}, such that the set of sites covered by u, which was denoted by I u satisfies, Fig. 1 for an illustration. We further assume that the set of m sites and n reads forms a single connected component. In particular, if we draw a graph Q on the vertex set {1, · · · , m}, where for any two a, b ∈ {1, · · · , n}, there is an edge in Q if and only if there is a read u ∈ {1, · · · , n}, that covers both sites a and b. We say that our data set consisting of the m sites and n reads is connected, if the graph Q has a single connected component. Refer to Fig. 2 for an illustration. Notice that since the problem definition is agnostic to how we label the strings from {1, · · · , k}, any haplotype phasing algorithm can only hope to recover the k strings up to a permutation of the string label (which takes values in {1, · · · , k}). For two disjoint blocks of a haplotype that are not bridged by a read, there is no way to ascertain how to phase fragments of the k haplotypes locally recovered inside the blocks. For this reason (and without the loss of generality of the proposed assembly framework), we assume that the set of haplotypes form a contiguous block of reads. If this were not the case, we could pre-process the reads and split the problem into many smaller instances, where each instance consists of a single contiguous haplotype block that needs to be assembled independently of other blocks.

Recovery goals and performance metrics
In this subsection, we formalize the main performance metrics used to benchmark haplotype assembly algorithms, namely the Correct Phasing Rate (CPR) and the Minimum Error Correction (MEC) score (see eg. ( [21][22][23])). The CPR measures the discrepancy between the reconstructed haplotypesŝ 1 , · · · ,ŝ k and the ground truth s 1 , · · · , s k , where S k is the set of all permutations of {1, · · · , k}. Note that this is a more demanding notion of recovery compared to that used in [17] and [24]; the metric used there, so-called Reconstruction Rate, we for convenience denote by M-CPR (abbreviating "Modified CPR"). This metric is defined as In the case of assembly of diploid haplotypes (i.e., k = 2), CPR and M-CPR are identical. However, in the polyploid case where the size of the alphabet is generally 4, it holds that CPR ≤ M-CPR since for all sets X 1 , · · · , X k , k j=1 1 X j ≤ k j=1 1 X j . We adopt CPR in Eq. (1) since it reflects the fraction of sites where the haplotype phasing algorithm correctly recovers all the haplotypes. Unfortunately, the design of algorithms for direct minimization of this objective is infeasible since the ground truth is unknown. A commonly used proxy metric in practice is the MEC which can be computed directly from the observed data and the reconstructed output. The MEC score is defined as A number of existing haplotype assembly schemes such as [14,17], attempt to directly minimize the MEC score by solving relaxations of the underlying combinatorial optimization problem that is known to be NP-hard [4,7]. Contrary to this common approach, in this paper we do not attempt to directly minimize the MEC but rather leverage inherent structural properties of the data and the fact that the noise in measurements is introduced randomly, rather than by an adversarial source, to design a randomized assembly algorithm. Specifically, we rely on the above observations to provide a solution to haplotype assembly by posing it as a clustering problem on a random graph.
The key parameters that impact the performance of assembly are coverage, error rate and effective read length. Formally, effective read-length R is defined to be the average of j (l) 1 and j (l) 2 , i.e., We define the coverage C as the average number of reads that cover a single base in a haplotype, i.e., C = 2nR km . Since there are n reads, each covering on average 2R haplotype alleles, the total average number of bases being read is 2nr. The error-rate p is the average error rate of the data acquisition process, i.e., the fraction of alleles incorrectly represented by the reads; this rate is aggregate of the sequencing and SNP calling errors. We adopt the standard practice of characterizing and benchmarking the performance of haplotype assembly algorithms using either the achieved MEC in practical settings where the ground truth is unknown, and the CPR in simulation studies where the ground truth is known. We will characterize the performance of our algorithm in settings with varied ploidy, alphabet size, coverage, read-lengths and error-rates.

The haplotype assembly algorithm
The algorithm we propose is based on identifying a simple connection between the aforementioned haplotype reconstruction problem and Euclidean community detection. Although such a connection was previously noted in the special case of single-ended reads and the diploid haplotype phasing problem [25], no prior work explored this connection in the case of paired-end reads and phasing polyploids. For the first time, we provide a unified framework based on Euclidean community detection (e.g., [20,26]) for both diploid and polyploid haplotype phasing problems.

Pre-processing the data
In order to invoke a connection to spatial community detection, we pre-process the n paired-end reads into a graph G with n nodes, where each node represents a paired-end read. This pre-processing has two steps -(i) Constructing weights between pairs of nodes (reads), (ii) Placing labels on the nodes.
1. Weights between nodes -For any two reads u, v ∈ [ n] with u = v, denote the intersection of sites at which the two measurements occur by I u ∩ I v := l 1 , · · · , l q , where q = 0 implies empty set. More precisely, each l i , for i ∈ {1, · · · , q}, is a position along the haplotype covered by both u and v. If q = 0, reads u and v cover disjoint set of sites; in this case, there is no edge between u and v in the graph G. If on the other hand q > 0, then we place an edge between them and assign it weight w uv given by In words, the weight of an edge between any two overlapping reads u and v is the difference between the number of positions (sites) where u and v agree and the number where they disagree, divided by the total number of sites in common. Observe in the definition that the weights w uv ∈[ −1, 1] for all u, v ∈[ n]. Such a weighting scheme ensures that if w uv is positive and large, then it is likely that the reads u and v are generated from the same string, while if w uv is negative and large in magnitude, then it is likely that the reads u and v are generated from different strings. This bias in the weights w uv can be understood by examining the typical structure of the polyploid phasing problem. Note that if the SNP positions were called accurately, i.e., all of the m haplotypes to be phased were 'true' , then it would hold that in any location i ∈ {1, · · · , m}, not all strings s 1 , · · · , s k have identical bases, i.e., the set of locations {i ∈ {1, · · · , m} : s 1 [ i] = · · · = s k [ i] } = ∅. Since sequencing errors are 'typically' small, it is thus the case that if two reads covering the same site have different values, then it is likely that they come from different haplotypes.
2. Node Labels -To each node u ∈[ N n ] of the graph we assign two labels: (i) -a community label Z u ∈[ k] and (ii)a spatial label X u ∈[ m] 2 . The community label of a node indicates origin of the corresponding read (i.e., which haplotype the read samples), while the spatial label of a node i.e., the locations along the haplotype where the two contiguous substrings of the paired-end read start. See Fig. 3 for an illustration of the data preprocessing. This definition of spatial labels assumes that the reads consist of precisely two contiguous substrings; however, experimental data show that this may be violated due to various practicalities in base-calling, read mapping and variant calling steps (e.g., due to omitting bases with confidence scores below a threshold); consequently, in practice a read (more precisely, the parts of a read covering heterogeneous variant positions that are being used in haplotype assembly) may consist of either a single or more than 2 contiguous fragments. If a read is only a single contiguous string of alleles, the spatial location of that read is placed on the diagonal in [ n] 2 , i.e., the spatial label assigned to the read is a pair of identical numbers (each being the starting position of the single read). If a read happens to consist of more than 2 contiguous fragments, Fig. 3 An illustration of our pre-processing the set of reads into a spatial random graph. The starting indices of the two contiguous blocks of a read forms its spatial coordinate and the weight is computed using Eq. 3. In this example, the shaded sites contribute to the weight w uv there are several possibilities when it comes to assigning it a spatial label. For instance, we may split the read into multiple reads such that each one has at most two contiguous fragments; alternatively, we may choose two of the multiple starting contiguous points in a suitable fashion to form the spatial label. Further details regarding assignment of spatial labels are discussed in "Performance on real data" section.

Summary of the algorithmic pipeline
Our algorithm takes the spatial graph G as input, and produces the reconstructed haplotypes. After pre-processing the sequencing data, we may form the graph G and assign spatial labels (X u ) u∈[n] to its nodes. However, the community labels (Z u ) u∈ [n] are unknown at this point. We defer a detailed discussion of the computational complexity involved in the pre-processing needed to construct G to the end of this section, where we show that one can exploit the structure in the data to reduce complexity of this pre-processing step from the naive O n 2 to roughly O(n). Given the representation of the data by the graph G, and having assigned spatial labeling to its nodes, our algorithm has two main components -(i) -a community detection on G to discover origin of each read and thus effectively group the reads into k different clusters and (ii) by a simple majority rule as described in the sequel.
1. Euclidean Community Detection -This part of the algorithm is where we take as input the graph G along with the spatial labels (X u ) u∈ [n] and estimate for each u ∈[ n], a community label Z u ∈[ k], denoting which of the k haplotype, a read is likely originating from. We summarize the key steps in this task. The formal pseudo code is given in Algorithm 1. synchronized to obtain a global clustering estimate from spatially-local clustering estimates. Since each grid point is present in multiple boxes, a read gets many estimates for its community, each of which adds an 'evidence' to the label of the node. This scheme has a natural 'error-correcting' mechanism, since it is less likely for a node to be misclassified in the majority of the boxes it lies in, as opposed to any one particular box.

Reduce by Majority
-After estimating for each node (read), the likely haplotype from which it originates Z u u∈ [n] , we reconstruct the haplotypes by a simple majority vote. For all j ∈[ k] and i ∈[ m], we estimateŝ j [ i] to be the majority among the 4 letters in the alphabet as indicated by the reads that cover site i and are estimated to belong to string j in the above clustering step.

Intuition behind the algorithm
Observe that for any two reads u and v that overlap, if the weight w uv is positive and large, then they are likely from the same haplotype while if negative and large, they are likely from different haplotypes. Consider a subset of reads such that the absolute value of the weights on the edges that connect them in graph G are 'large' . This subset of reads can readily be grouped into up to k different clusters using standard techniques such as the spectral method. Such simple schemes will be successful in this scenario since the 'signal' for clustering in the subgraph is high due to strong interaction between the corresponding reads, i.e., the adjacency matrix of the subgraph is dense.
However, the entire set of reads does not posses the aforementioned property -in particular, it has a large diameter (of order n). Thus, standard spectral methods applied to graph G are both computationally impractical as well as statistically sub-optimal for recovery. The computational impracticality is easy to observe; indeed, any clustering scheme will be super-linear in the number of nodes N n , which renders them extremely slow on large problem instances that are typical in practice. Furthermore, even the pre-processing of reads to construct graph G is of order N 2 n , which makes it computationally challenging in practical settings. The statistical sub-optimality is more subtle and stems from the fluctuations in the density of reads in space; in many problem instances, the density of reads varies across space due to randomness in the read generation process. For instance, in Fig. 4, we see an example where the density of reads captured by the spectral algorithm is highly imbalanced due to the fluctuations of the nodes in space. However, in the ground truth set, the statistical distribution of reads across space is the same for all communities. Therefore, to improve performance, one would need an additional 'constraint' in the reconstruction algorithm to enforce the fact that the spatial distribution of reads across all communities is the same.
Our strategy in this paper is to first partition the set of reads into subsets wherein the reads interact strongly. Since the reads have spatial locality, we partition the set of reads into boxes as illustrated in Fig. 5. In each box, we consider the subgraph of G with nodes having spatial labels lying in that box and then cluster this subgraph independently of other subgraphs in other boxes. The partition of space into boxes is fixed a priori and is data independent. The box size and how much they overlap are hyper-parameters that can be tuned. In each box, we run a clustering algorithm and then combine the estimates from different boxes to form a final global partitioning of the nodes of G. The partitioning based on spatial locality automatically ensures that the spatial density of the estimated communities are roughly identical. The intuition for this stems from the fact that the reads will be roughly uniformly distributed within a box since the box is 'small' in size. More importantly, by requiring that the boxes overlap, a single read will be present in multiple boxes. This further boosts statistical accuracy of clustering by embedding natural error-correction scheme; since a single read is processed multiple times, there are multiple instances of 'evidence' for the community label of a node. From a computational complexity viewpoint, partitioning the set of nodes and clustering smaller instances dramatically reduces run-time as a majority of clustering algorithms are super-linear in the number of data points and hence reducing the sizes of the graphs to be clustered has a significant beneficial impact. Therefore, our algorithm is both computationally feasible on large instances and is statistically superior compared to standard graph clustering algorithms directly applied on G.

Pseudo code
Let us start by introducing the notation needed to formalize the algorithm. The algorithm has hyper-parameters A, B, iter, M ∈ N and α ∈[ 0, 1]. For x, y ∈ ñ A , we define B x,y ⊂[ n] 2 as B x,y := Ax, min(Ax + B,ñ) × Ay, min(Ay + B,ñ) , the box indexed by (x, y). Thus, the parameters A and B dictate how large a box is and how many boxes cover a read. In the course of the algorithm, we maintain a dictionary of lists C, where for each node u ∈[ N n ], C[ u] denotes the list of community estimates for node u. Each node has more than one estimate as it belongs to more than one box. The estimates from clustering in each box are added as 'evidence' of the community estimate for the nodes in the box. Having multiple estimates for  a node helps in combating clustering errors in certain boxes. Main Routine -We now describe the algorithm in detail. The first step consists of partitioning the space [ n] 2 into multiple overlapping boxes as shown in Fig. 5. Recall that the hyper-parameters A and B allow one to tune both the size of a box and the number of boxes that will cover a given location of [ n] 2 . In each box indexed by (x, y) for x, y ∈ ñ A , we identify the nodes of G having their spatial label in that box; let us denote the collection of such nodes by H x,y . If the number of nodes in H x,y is small (e.g., smaller than a certain hyper-parameter M), then we do not attempt to cluster these nodes. We need to set such a minimum size of H x,y or the output of clustering may turn out to be noisy and non-informative. In addition, if more than an α < 1 fraction of nodes in H x,y have at least one community estimate, then again we do not cluster H x,y . The reason for doing so is to decrease the running time by ensuring we only perform the clustering step when there are sufficiently many new unexplored nodes. In each remaining box (x, y) (i.e., each box with at least M nodes where at most an α fraction of them have prior estimates) we apply a fast and simple local clustering algorithm. In particular, we generate an approximate clustering of the nodes in H x,y by directly running a standard k-means algorithm [27] on the adjacency matrix of H x,y . We then iteratively improve upon this initial guess by reassigning each node to the most likely cluster while keeping the assignment of other nodes fixed. This iterative update rule is reminiscent of the class of Expectation Maximization algorithms, although our method is fast and non-parametric. We run the iterative clean-up procedure for iter number of iterations. Once the nodes of H x,y are clustered, we append the result to the dictionary of lists C after appropriately synchronizing the community estimates of each node. Once we have iterated over all boxes, we assign a single community estimate to each node based on the majority in the list of estimates in C. The algorithm is formalized as the following pseudo-code.

Local Clustering
Step -This step is described in Algorithm 2. We follow a two step-procedure. In the first step, we get an approximate clustering of the graph H by applying the standard k-means algorithm on the adjacency matrix H. We then one-hot encode this clustering result. One hot encoding is one where for each read we associate a k × 1 vector in which all entries re 0 except for a single entry corresponding to the estimated community label of that read to be 1. More formally, if r denotes the number of nodes of H, then the one-hot encoding result for i ∈ {1, · · · , n} do For every node(read) 3:  10: if |H x,y | ≥ M AND f ≥ 1 − α then 11: e ← LOCAL-CLUSTER(H x,y , k, iter) 12: C ← SYNCHRONIZE(C, e) 13: end if 14: end for 15: end for 16: return REDUCE-BY-MAJORITY(C) 17: end procedure Algorithm 2 Small Graph Clustering 1: procedure LOCAL-CLUSTER(H, k, iter) 2:ẽ := k-means-cluster(H, k) k-means Algorithm of [27] 3: e 0 = One-Hot-Encoding(ẽ). 4: for 1 ≤ i ≤ iter do 5: e i = arg max He i−1 row wise argmax 6: end for 7: return e iter . 8: end procedure is a matrix e 0 of size r × k. Each entry of e 0 is either 0 or 1; the entry in the ith row and jth column being 1 implies that the ith node is classified as belonging to community j. Thus, each row of e 0 contains exactly one 1 while the rest of the entries are all 0. We then run a 'clean-up' procedure by iteratively updating the estimate as follows.
The function T is applied row-wise; for matrix A, it sets the ith row and jth column of , else the ith row and jth column of T (A) is set to 0. If a row has more than one column where the maximum is attained, the first column where the maximum occurs is assigned value 1 while the other columns are assigned value 0. Hence the dimensions of A and T (A) are the same. Furthermore, for any matrix A, the matrix T (A) is such that its entries are either 0 or 1, with each row having exactly one entry valued 1.
The iterative update is based on the following intuition. Let the clustering be encoded by matrix e t for some t ∈ N, and consider a tagged node u ∈[ n]. The new updated value of the community label of node u is then the 'mostlikely' label given the estimates of the community labels of the rest of the nodes. More precisely, the 'weight' that a node u is in a community l ∈[ k] is the sum of the weights along the edges connecting u to v ∈[ n] in G such that the estimate of node v is l. The new community label of node u is the one having the largest weight. By performing this operation simultaneously for all nodes, one obtains the representation in Eq. (4). The intuition for reassigning the node to the community with the maximum weight stems from the observation that if a weight along an edge is positive and large, then it is likely that the end nodes of the edge share the origin, i.e., the corresponding reads sample the same haplotype. Likewise, if the weight along an edge is negative and large, then it is likely that the end nodes represent reads that sample different haplotypes. Therefore, for the iterative update to perform well, the initial estimate e 0 must be 'good enough'; we achieve this by applying the k-means clustering algorithm on the adjacency matrix H. In principle, one can obtain somewhat better initial guess for e 0 by applying the k-means algorithm to the eigenvectors of H, but the marginal gains in statistical accuracy does not warrant the enormous increase in computation needed to perform such a spectral clustering.
The clean-up method, at first glance, seems to bear similarities to other dynamical algorithms such as expectation maximization, Belief Propagation (BP) and tensor factorization based methods of [17]. Unlike BP, however, we do not iterate the beliefs or probabilities of a node belonging to various communities; instead, we make a hard assignment at each update step. While for standard BP algorithms it is desirable that a graph is tree-structured,

Algorithm 3 Synchronization
Step 1: procedure SYNCHRONIZE(C, e) 2: The identity permutation 4: for All permutations π of [ k] do 5: local-weight ← 0 6: for All nodes in e do 7: end for 9: if W π > W then 10: π ← π. 12: end if 13: end for 14: for nodes in e do 15: Append π(e[ nodes] ) to C[ nodes]. 16: end for 17: return C. 18: end procedure our graph contains a lot of triangles and loops due to the spatial embeddings. Therefore, it would be insufficient to keep track of the node marginals -instead, BP would need the entire joint distribution which is not tractable. Despite the undesirable properties of G, benchmarking results demonstrate that our algorithm performs community detection on the graph very well. Synchronization Step -The main routine in Algorithm 1 considers the boxes sequentially and performs local clustering steps. Once the local clustering is performed, a key component is to synchronize the estimates of the current box with the estimates of the boxes that are already clustered. The synchronization is essential since the problem is a permutation invariant to the labels.  Estimate haplotype by majority 6: for l ∈ {1, · · · , k} do 7: for i ∈ {1. · · · , m} do 8: s l [ i] ← majority of alphabets among reads j with C[ j] = l and covering site i. 9: end for 10: end for 11: return ( s l [ i] ) l∈{1,··· ,k},i∈{1,··· ,m} 12: end procedure are permuted. Hence, the best hope for any recovery algorithm is to reconstruct the k strings upto a permutation of labels. Thus, if any clustering algorithm is run on two different subsets of nodes, the corresponding haplotype estimates need to be synchronized to produce a clustering of the nodes in the union of the sets. We perform this clustering in line 8 of the main routine Algorithm 1 by invoking the sub-routine 3.
In sub-routine 3, we decide on how to permute the community label output of the local clustering estimate of H x,y that best 'synchronizes' with the label estimates of the other nodes of G at that instance. Observe that at the instant of synchronizing the output of H x,y , other nodes of G have either none or multiple label estimates. There is a possibility that more than one label estimate per node is present in multiple boxes, each adding an 'evidence' for a node's cluster. We select a permutation of the labels by sequentially going over all permutations of [ k] and selecting the one that has the highest 'synchronization-weight' . More formally, let N x,y ⊂[ n] denote the indices of the nodes in H x,y ; for all u ∈ N x,y , denote by σ [ u] ∈[ k] the label estimates formed by the local clustering on H x,y . The synchronization weight for a permutation π of [ k] is defined as In words, we go over all nodes in H x,y that have at least one prior estimate and sum the fraction of the previous estimate equaling the label assigned by the local clustering H x,y after applying the permutation π to the local clustering's output. Among all permutations, we select the π * having the highest synchronization weight (ties are broken arbitrarily). After doing so, for each node u of H x,y we append the label π * (σ (u)) to the list C [ u]. The key feature of the above procedure is consideration of the fraction, which is a proxy for the 'belief ' of the community label of a node, rather than just a count; this is meaningful because the counts across different nodes can be significantly skewed by the order in which the boxes are clustered and synchronized.

Computational complexity
In this section, we discuss the computational complexity of implementing our algorithm and the effect of various hyper-parameters on runtimes. A naive implementation of the algorithm would incur a cost of the order n 2 just to construct the graph G from the reads. This step itself may be infeasible in practical scenarios where the number of reads will be on the order of millions. However, our algorithm only needs the subgraphs H x,y instead of the full graph G. Therefore, we pre-process the reads and create a hash-map where for each location in [ n] 2 , we store the list of reads that have spatial label in that location. This requires one pass through the list of reads, i.e., has computational complexity of order n and storage complexity of order n. Now, creating the adjacency matrix H x,y is quadratic in only the number of nodes in H x,y . The synchronization step requires time complexity of the order of the number of nodes in H x,y times the number of distinct permutations of [ k].

Choosing the parameters of our algorithm
Our algorithm has an inherent trade-off between computational complexity and statistical accuracy that can be achieved by varying the hyper-parameters. For instance, if we decrease A while keeping B fixed, the number of boxes and therefore the computational time are increased. However, the statistical accuracy would improve since each node would now be present in many boxes and hence the error-correction scheme performs more accurately. Similarly, increasing the parameter M can reduce the run-time

Results
We evaluate the performance of our proposed algorithm on both simulated and experimental data. We implemented our algorithm in Python. The simulations, as well as the experimental evaluations were conducted on a single core Intel I5 Processor with 2.3Ghz processor and 8 GB 2133 MHz LPDDR3 RAM.

Performance on simulated data
We first test the performance of our algorithm in simulations for both the diploid biallelic case as well as the more challenging polyploid polyallelic case. Since the ground truth in simulations is known, we use CPR, MEC and M-CPR as the primary performance benchmarks. The CPR and M-CPR are reported as percentages, for ease of presentation. We compare the performance over a range of problem parameters, namely the ploidy and alphabet size, as well as the measurement parameters, in particular the coverage, average read length and error rates. In each case, the hyper-parameters were set to A = 15 and B = 4. The parameter α = 0.95, for all polyploid cases and α = 0.85 in the diploid case. Recall that the parameter α allows one to control the trade-off between the run time and statistical accuracy; specifically, a lower value of α results in faster run times at the cost of reduced CPR. The column σ displays the standard deviation of CPR (after being multiplied by 100, for consistency). In each table and metric, the boldfaced entry represents the algorithm with the best performance for that entry.

Simulated data -diploid biallelic case
In the diploid case, we rely on the synthetic pairedend read data used in [17]. The average length of the effective (i.e., haplotype-informative) paired-end read 4 with an insert gap in a paired end reads being uniformly sampled between 50 to 150. We use a haplotype length (m in our notation) of 700 in all case to be consistent with prior literature. As an example [23], which is often used to benchmark haplotype assembly methods. The results are reported in Table 1. We simulate 15 instances for each configuration of coverage and error  probability and report the average in Table 1. We compare our methods against AltHap [17], a sparse tensor factorization method, and HPop [8], a state-of-the-art dynamic programming approach to haplotype assembly. Since for diplod case CPR and M-CPR are identical, we only report the CPR. We restrict our attention to these methods since it is already established in [8,17,28], that they are superior, both in terms of accuracy and run times, as compared to various other approaches including SDHaP [16], an approach inspired by semidefinite programming relaxations of the max-cut problem, BP [29], a communications system design inspired belief propagation algorithm, HapTree [30], an algorithm inspired by a Bayesian reformulation of the problem, and HapCompass [15], an algorithm focused on finding cycle basis in a graphical representation of the haplotype assembly problem.

Simulated data -polyploid polyallelic case
We report the results in Tables 2, 3 and 4 for the cases of triploid, tetraploid and hexaploid, respectively. In all cases, we considered the tetra-allelic case, i.e., the case of alphabets of size 4, and the average length of the effective (i.e., haplotype-informative) paired-end read 4. The average insert size between the paired end reads was chosen to be 200 with a minimum gap of 50. The benchmarking algorithm we consider is that of [17], the state-of-theart algorithm capable of polyploid polyallelic phasing; all other methods are restricted to biallelic variants. In each case, we test our algorithm on 10 different problem instances, where in each instance, a haplotype sequence of length 1000 was phased. We use the same method and the publicly available code from [17] to generate the synthetic data for the various instances. For the hexaploid case, we do not report performances on coverage smaller than 10, since the performance of both algorithms are poor.

Performance on real data Tetraploid potato data set
We test our algorithm on a tetraploid real data set of Chromosome 5 of Potato species Solanum Tuberosum, whose reference genome is available publicly 1 . We considered a set of paired-end sequence reads reported in experiment SRX3284127 available in the NCBI database 2 . We then mapped the reads to the reference genome using the BWA software of [31]. Subsequently, we use the FreeBayes software [32] for SNP calling and create the vcf file. We then extracted out connected components of reads and considered the haplotype assembly on instances that were at-least 20 haplotypes long. The data set of reads after SNP calling is available in the Github link provided in Abstract. We compare the performance of our algorithm with AltHap, HapCompass [33], HPoP [8] and HapTree [30], and report the results in Table 5. We use the recommended hyper parameters of A = 20, B = 5, α = 0.85 and the minimum problem size as 20. We chose these set of parameters to minimize run time, while at the same time ensuring that all reads in a data set are covered by at-least one community estimate. Table 5 compares MEC scores achieved by our method with those achieved by competing techniques. Note that the true accuracy is captured by the correct phasing rate but the ground truth data in this set is not available and thus CPR cannot be computed. While being a convenient surrogate metric, the MEC score may be misleading since e.g. a very low MEC score does not necessarily imply high CPR (see Table 1 of [17] for an illustrative example). Note that the task of tetraploid phasing presents a challenge to our scheme because the weight of an edge in the spatial graph (see Eq. (3) is biased towards being positive, even if the reads on the end points of an edge originate from different strings; this limits accuracy of assembly schemes that rely on read clustering. Finally, as illustrated by simulations, our methodology is suitable to settings where sequencing coverage exceeds ∼ 10× (both in diploid and polyploid setting). At the time of writing this paper we do not have access to data sets with coverage beyond 10× and thus do not perform further experiments on real data.

Discussions
The results indicate that our method is comparable to the state of the art, both in the diploid as well as the more challenging polyploid polyallelic scenario. In the diploid biallelic case, we see in Table 1 that our method performs comparable to both AltHap and HPoP in terms of reconstruction accuracy as measured by CPR and MEC scores.
In the polyploid polyallelic scenario however, Tables 2, 3 and 4 indicate that our algorithm is superior in terms of both CPR and MEC compared to the state of art, namely AltHap, which is the only methodology prior to our work that is capable of handling polyploid polyallelic data. In terms of experiment on a real data ployploid experiment, we see in Table 5, that our method performs comparably both in terms of reconstruction accuracy as measured by MEC score and runtime complexity as compared to other state of art methods. The results demonstrate that our methodology has significantly higher reconstruction at higher coverages as compared to lower coverages. This is unsurprising, as at higher coverages, there are more reads and hence more data to recover the haplotypes. Moreover, the runtime complexity of our method scales very gracefully with increasing coverage, making it attractive for many practical high coverage scenarios.

Conclusions
In this paper, we propose a novel methodology to assemble both diploid and ployploid haplotypes. The main observation we make is that, by a spatial representation of the paired-end reads, we can effectively convert the problem about haplotype assembly into a community detection task on a spatial graph. Our algorithm assigns to each paired end read, a spatial label corresponding to the starting indices of the two read fragments. We then divide the problem into overlapping instances, each of which considers the set of reads located nearby in this embedding and performs a community clustering, where the community label of a read (node) is the haplotype from which it originates from. Finally, for each read, we take the majority of the estimated communities from the various instances as the final community estimate of that read. We then use this estimated community labels for the reads to output the reconstructed haplotype.