Problem formulation
We formally state the problem before describing our methods.
Setup and notation
Let m and n denote the length of the haplotype sequences and the total number of paired-end read measurements, respectively. Let k denote the ploidy, i.e., the number of haplotype sequences and a be the cardinality of the alphabet set. If the variant sites are polyallelic, then a=4 (i.e., all 4 alleles A,C,G,T are possible); in diploid and polyploid bi-allelic case, a=2. The haplotype sequences are denoted by sl[i], where for each i∈{1,⋯,m} and l∈{1,⋯,k}, we have sl[i]∈{1,⋯,a}. As an example, in the poly-allelic case, we have sl[i]∈{A,C,G,T}. In the rest of the paper, we refer to the haplotype positions {1,⋯,m} as sites.
Each read ru, u∈{1,⋯,n}, is formed by first sampling a haplotype indexed by vu∈{1,⋯,k} and then sampling a sequence of alleles \(\left \{\tilde {s}^{(u)}[i]\right \}_{i \in \mathcal {I}_{u}}\) at sites \(\mathcal {I}_{u} \subset \left \{1,\cdots,m\right \}\), where for each \(i \in \mathcal {I}_{u}\), \(\tilde {s}^{(u)}[i] \in \{A,C,G,T\}\) is a “noisy” (due to sequencing errors) version of the underlying ground truth \(s_{v_{u}}[i]\). For each measurement u∈{1,⋯,n}, we observe the set of positions \(\mathcal {I}_{u}\) and the noisy values \(\left \{\tilde {s}^{(u)}[i]\right \}_{i\in \mathcal {I}_{u}}\) but not the index vu of the haplotype from which the read originates. Thus, using the above notation, the total dataset is denoted as \(\left (\left \{\tilde {s}^{(u)}[i]\right \}_{i\in \mathcal {I}_{u}}\right)_{u \in \left \{1,\cdots,n\right \}}\). The goal of assembly is to infer origins of the reads and recover haplotype sequences.
In this paper, we consider the case where the data is in form of paired-ended reads. Formally, this implies that each measurement u∈{1,⋯,n} is such that the set of sites covered by read u has two contiguous blocks. More precisely, we assume that each read u∈{1,⋯,n} is such that there exists \(i_{1}^{(u)}, j_{1}^{(u)}, i_{2}^{(u)},j_{2}^{(u)} \in \left \{1,\cdots,m\right \}\), such that the set of sites covered by u, which was denoted by \(\mathcal {I}_{u}\) satisfies, \(\mathcal {I}_{u} = \{i_{1}^{(u)},i_{1}^{(u)}+1,\cdots,i_{1}^{(u)}+j_{1}^{(u)}\} \cup \left \{i_{2}^{(u)},i_{2}^{(u)}+1,\cdots,i_{2}^{(u)}+j_{2}^{(u)}\right \}\). See Fig. 1 for an illustration.
We further assume that the set of m sites and n reads forms a single connected component. In particular, if we draw a graph Q on the vertex set {1,⋯,m}, where for any two a,b∈{1,⋯,n}, there is an edge in Q if and only if there is a read u∈{1,⋯,n}, that covers both sites a and b. We say that our data set consisting of the m sites and n reads is connected, if the graph Q has a single connected component. Refer to Fig. 2 for an illustration. Notice that since the problem definition is agnostic to how we label the strings from {1,⋯,k}, any haplotype phasing algorithm can only hope to recover the k strings up to a permutation of the string label (which takes values in {1,⋯,k}). For two disjoint blocks of a haplotype that are not bridged by a read, there is no way to ascertain how to phase fragments of the k haplotypes locally recovered inside the blocks. For this reason (and without the loss of generality of the proposed assembly framework), we assume that the set of haplotypes form a contiguous block of reads. If this were not the case, we could pre-process the reads and split the problem into many smaller instances, where each instance consists of a single contiguous haplotype block that needs to be assembled independently of other blocks.
Recovery goals and performance metrics
In this subsection, we formalize the main performance metrics used to benchmark haplotype assembly algorithms, namely the Correct Phasing Rate (CPR) and the Minimum Error Correction (MEC) score (see eg. ([21–23])). The CPR measures the discrepancy between the reconstructed haplotypes \(\hat {s}_{1},\cdots,\hat {s}_{k}\) and the ground truth s1,⋯,sk,
$$\begin{array}{*{20}l} \text{CPR} = \frac{1}{m} \sum\limits_{i=1}^{m} \max_{\pi \in \mathcal{S}_{k}} \prod_{l=1}^{k} \mathbf{1}_{\hat{s}_{l}[i] = s_{\pi(l)}[i]}, \end{array} $$
(1)
where \(\mathcal {S}_{k}\) is the set of all permutations of {1,⋯,k}. Note that this is a more demanding notion of recovery compared to that used in [17] and [24]; the metric used there, so-called Reconstruction Rate, we for convenience denote by M-CPR (abbreviating “Modified CPR”). This metric is defined as
$$\begin{array}{*{20}l} {\text{M-CPR}} = \max_{\pi \in \mathcal{S}_{k}}\frac{1}{mk} \sum\limits_{i=1}^{m} \sum\limits_{l=1}^{k} \mathbf{1}_{\hat{s}_{l}[i] = s_{\pi(l)}[i]}. \end{array} $$
(2)
In the case of assembly of diploid haplotypes (i.e., k=2), CPR and M-CPR are identical. However, in the polyploid case where the size of the alphabet is generally 4, it holds that CPR≤M-CPR since for all sets X1,⋯,Xk, \(\prod _{j=1}^{k} \mathbf {1}_{X_{j}} \leq \sum _{j=1}^{k} \mathbf {1}_{X_{j}}\). We adopt CPR in Eq. (1) since it reflects the fraction of sites where the haplotype phasing algorithm correctly recovers all the haplotypes. Unfortunately, the design of algorithms for direct minimization of this objective is infeasible since the ground truth is unknown. A commonly used proxy metric in practice is the MEC which can be computed directly from the observed data and the reconstructed output. The MEC score is defined as
$$\begin{array}{*{20}l} \text{MEC} = \sum\limits_{u=1}^{n} \min_{l \in \{1,\cdots,k\}}\sum\limits_{i=1}^{m} \mathbf{1}_{\text{Read }u \text{ covers site }i} \mathbf{1}_{\hat{s}^{(u)}[i] \neq s_{l}[j]}. \end{array} $$
A number of existing haplotype assembly schemes such as [14, 17], attempt to directly minimize the MEC score by solving relaxations of the underlying combinatorial optimization problem that is known to be NP-hard [4, 7]. Contrary to this common approach, in this paper we do not attempt to directly minimize the MEC but rather leverage inherent structural properties of the data and the fact that the noise in measurements is introduced randomly, rather than by an adversarial source, to design a randomized assembly algorithm. Specifically, we rely on the above observations to provide a solution to haplotype assembly by posing it as a clustering problem on a random graph.
The key parameters that impact the performance of assembly are coverage, error rate and effective read length. Formally, effective read-length R is defined to be the average of \(j_{1}^{(l)}\) and \(j_{2}^{(l)}\), i.e.,
$$\begin{array}{*{20}l} R = \frac{1}{2n} \sum\limits_{u=1}^{n} \left(j_{1}^{(u)}+j_{2}^{(u)}\right). \end{array} $$
We define the coverage \(\mathcal {C}\) as the average number of reads that cover a single base in a haplotype, i.e., \(\mathcal {C} = \frac {2nR}{km}\). Since there are n reads, each covering on average 2R haplotype alleles, the total average number of bases being read is 2nr. The error-rate p is the average error rate of the data acquisition process, i.e., the fraction of alleles incorrectly represented by the reads; this rate is aggregate of the sequencing and SNP calling errors. We adopt the standard practice of characterizing and benchmarking the performance of haplotype assembly algorithms using either the achieved MEC in practical settings where the ground truth is unknown, and the CPR in simulation studies where the ground truth is known. We will characterize the performance of our algorithm in settings with varied ploidy, alphabet size, coverage, read-lengths and error-rates.
The haplotype assembly algorithm
The algorithm we propose is based on identifying a simple connection between the aforementioned haplotype reconstruction problem and Euclidean community detection. Although such a connection was previously noted in the special case of single-ended reads and the diploid haplotype phasing problem [25], no prior work explored this connection in the case of paired-end reads and phasing polyploids. For the first time, we provide a unified framework based on Euclidean community detection (e.g., [20, 26]) for both diploid and polyploid haplotype phasing problems.
Pre-processing the data
In order to invoke a connection to spatial community detection, we pre-process the n paired-end reads into a graph G with n nodes, where each node represents a paired-end read. This pre-processing has two steps - (i) Constructing weights between pairs of nodes (reads), (ii) Placing labels on the nodes.
1. Weights between nodes - For any two reads u,v∈[n] with u≠v, denote the intersection of sites at which the two measurements occur by \(\mathcal {I}_{u} \cap \mathcal {I}_{v}:=\left \{l_{1},\cdots,l_{q}\right \}\), where q=0 implies empty set. More precisely, each li, for i∈{1,⋯,q}, is a position along the haplotype covered by both u and v. If q=0, reads u and v cover disjoint set of sites; in this case, there is no edge between u and v in the graph G. If on the other hand q>0, then we place an edge between them and assign it weight wuv given by
$$ w_{uv} := \frac{ 1 }{q} \sum\limits_{h=1}^{q} \left(\mathbf{1}_{\tilde{s}^{(u)}[l_{h}] = \tilde{s}^{(v)}[l_{h}]} - \mathbf{1}_{\tilde{s}^{(u)}[l_{h}] \neq \tilde{s}^{(v)}[l_{h}]} \right). $$
(3)
In words, the weight of an edge between any two overlapping reads u and v is the difference between the number of positions (sites) where u and v agree and the number where they disagree, divided by the total number of sites in common. Observe in the definition that the weights wuv∈[−1,1] for all u,v∈[n]. Such a weighting scheme ensures that if wuv is positive and large, then it is likely that the reads u and v are generated from the same string, while if wuv is negative and large in magnitude, then it is likely that the reads u and v are generated from different strings. This bias in the weights wuv can be understood by examining the typical structure of the polyploid phasing problem. Note that if the SNP positions were called accurately, i.e., all of the m haplotypes to be phased were ‘true’, then it would hold that in any location i∈{1,⋯,m}, not all strings s1,⋯,sk have identical bases, i.e., the set of locations {i∈{1,⋯,m}:s1[i]=⋯=sk[i]}=∅. Since sequencing errors are ‘typically’ small, it is thus the case that if two reads covering the same site have different values, then it is likely that they come from different haplotypes.
2. Node Labels - To each node u∈[Nn] of the graph we assign two labels: (i) - a community label Zu∈[k] and (ii) - a spatial label Xu∈[m]2. The community label of a node indicates origin of the corresponding read (i.e., which haplotype the read samples), while the spatial label of a node u∈[n] is \(\left (i^{(u)}_{1},i^{(u)}_{2}\right)\), i.e., the locations along the haplotype where the two contiguous substrings of the paired-end read start. See Fig. 3 for an illustration of the data pre-processing. This definition of spatial labels assumes that the reads consist of precisely two contiguous substrings; however, experimental data show that this may be violated due to various practicalities in base-calling, read mapping and variant calling steps (e.g., due to omitting bases with confidence scores below a threshold); consequently, in practice a read (more precisely, the parts of a read covering heterogeneous variant positions that are being used in haplotype assembly) may consist of either a single or more than 2 contiguous fragments. If a read is only a single contiguous string of alleles, the spatial location of that read is placed on the diagonal in [n]2, i.e., the spatial label assigned to the read is a pair of identical numbers (each being the starting position of the single read). If a read happens to consist of more than 2 contiguous fragments, there are several possibilities when it comes to assigning it a spatial label. For instance, we may split the read into multiple reads such that each one has at most two contiguous fragments; alternatively, we may choose two of the multiple starting contiguous points in a suitable fashion to form the spatial label. Further details regarding assignment of spatial labels are discussed in “Performance on real data” section.
Summary of the algorithmic pipeline
Our algorithm takes the spatial graph G as input, and produces the reconstructed haplotypes. After pre-processing the sequencing data, we may form the graph G and assign spatial labels (Xu)u∈[n] to its nodes. However, the community labels (Zu)u∈[n] are unknown at this point. We defer a detailed discussion of the computational complexity involved in the pre-processing needed to construct G to the end of this section, where we show that one can exploit the structure in the data to reduce complexity of this pre-processing step from the naive O(n2) to roughly O(n). Given the representation of the data by the graph G, and having assigned spatial labeling to its nodes, our algorithm has two main components - (i) - a community detection on G to discover origin of each read and thus effectively group the reads into k different clusters and (ii) - Subsequently, for all i∈[m] and l∈[k], we estimate \(\hat {s}_{l}[i]\) by a simple majority rule as described in the sequel.
1. Euclidean Community Detection - This part of the algorithm is where we take as input the graph G along with the spatial labels (Xu)u∈[n] and estimate for each u∈[n], a community label \(\widehat {Z}_{u} \in [k]\), denoting which of the k haplotype, a read is likely originating from. We summarize the key steps in this task. The formal pseudo code is given in Algorithm 1.
-
1
We first tessellate the grid [n]2 into smaller overlapping boxes, denoted by \(\left (B_{x,y}\right)_{1 \leq x \leq \tilde {n}, 1 \leq y \leq \tilde {n}}\). Here \(\tilde {n} < n\) is a parameter which we choose and each Bx,y⊂[n]2. The tessellation is such that each grid point u∈[n]2 belongs to multiple boxes Bx,y since the boxes overlap.
-
2
For each box Bx,y, let Hx,y denote the subgraph of G containing nodes whose spatial locations lie in Bx,y. The nodes of Hx,y are all clustered independently into k communities.
-
3
The community estimates in different boxes are synchronized to obtain a global clustering estimate from spatially-local clustering estimates. Since each grid point is present in multiple boxes, a read gets many estimates for its community, each of which adds an ‘evidence’ to the label of the node. This scheme has a natural ‘error-correcting’ mechanism, since it is less likely for a node to be misclassified in the majority of the boxes it lies in, as opposed to any one particular box.
2. Reduce by Majority - After estimating for each node (read), the likely haplotype from which it originates \(\left (\widehat {Z}_{u}\right)_{u \in [n]}\), we reconstruct the haplotypes by a simple majority vote. For all j∈[k] and i∈[m], we estimate \(\hat {s}_{j}[i]\) to be the majority among the 4 letters in the alphabet as indicated by the reads that cover site i and are estimated to belong to string j in the above clustering step.
Intuition behind the algorithm
Observe that for any two reads u and v that overlap, if the weight wuv is positive and large, then they are likely from the same haplotype while if negative and large, they are likely from different haplotypes. Consider a subset of reads such that the absolute value of the weights on the edges that connect them in graph G are ‘large’. This subset of reads can readily be grouped into up to k different clusters using standard techniques such as the spectral method. Such simple schemes will be successful in this scenario since the ‘signal’ for clustering in the subgraph is high due to strong interaction between the corresponding reads, i.e., the adjacency matrix of the subgraph is dense.
However, the entire set of reads does not posses the aforementioned property – in particular, it has a large diameter (of order n). Thus, standard spectral methods applied to graph G are both computationally impractical as well as statistically sub-optimal for recovery. The computational impracticality is easy to observe; indeed, any clustering scheme will be super-linear in the number of nodes Nn, which renders them extremely slow on large problem instances that are typical in practice. Furthermore, even the pre-processing of reads to construct graph G is of order \(N_{n}^{2}\), which makes it computationally challenging in practical settings. The statistical sub-optimality is more subtle and stems from the fluctuations in the density of reads in space; in many problem instances, the density of reads varies across space due to randomness in the read generation process. For instance, in Fig. 4, we see an example where the density of reads captured by the spectral algorithm is highly imbalanced due to the fluctuations of the nodes in space. However, in the ground truth set, the statistical distribution of reads across space is the same for all communities. Therefore, to improve performance, one would need an additional ‘constraint’ in the reconstruction algorithm to enforce the fact that the spatial distribution of reads across all communities is the same.
Our strategy in this paper is to first partition the set of reads into subsets wherein the reads interact strongly. Since the reads have spatial locality, we partition the set of reads into boxes as illustrated in Fig. 5. In each box, we consider the subgraph of G with nodes having spatial labels lying in that box and then cluster this sub-graph independently of other subgraphs in other boxes. The partition of space into boxes is fixed a priori and is data independent. The box size and how much they overlap are hyper-parameters that can be tuned. In each box, we run a clustering algorithm and then combine the estimates from different boxes to form a final global partitioning of the nodes of G. The partitioning based on spatial locality automatically ensures that the spatial density of the estimated communities are roughly identical. The intuition for this stems from the fact that the reads will be roughly uniformly distributed within a box since the box is ‘small’ in size. More importantly, by requiring that the boxes overlap, a single read will be present in multiple boxes. This further boosts statistical accuracy of clustering by embedding natural error-correction scheme; since a single read is processed multiple times, there are multiple instances of ‘evidence’ for the community label of a node. From a computational complexity viewpoint, partitioning the set of nodes and clustering smaller instances dramatically reduces run-time as a majority of clustering algorithms are super-linear in the number of data points and hence reducing the sizes of the graphs to be clustered has a significant beneficial impact. Therefore, our algorithm is both computationally feasible on large instances and is statistically superior compared to standard graph clustering algorithms directly applied on G.
Pseudo code
Let us start by introducing the notation needed to formalize the algorithm. The algorithm has hyper-parameters \(A,B, {iter},M \in \mathbb {N}\) and α∈[0,1]. For \(x,y \in \left [ \lceil \frac {\tilde {n}}{A} \rceil \right ]\), we define Bx,y⊂[n]2 as \(B_{x,y} := \left [Ax, \min (Ax+B,\tilde {n})\right ] \times \left [Ay, \min (Ay+B,\tilde {n})\right ]\), the box indexed by (x,y). Thus, the parameters A and B dictate how large a box is and how many boxes cover a read. In the course of the algorithm, we maintain a dictionary of lists \(\mathcal {C}\), where for each node u∈[Nn], \(\mathcal {C}[u]\) denotes the list of community estimates for node u. Each node has more than one estimate as it belongs to more than one box. The estimates from clustering in each box are added as ‘evidence’ of the community estimate for the nodes in the box. Having multiple estimates for a node helps in combating clustering errors in certain boxes.
Main Routine - We now describe the algorithm in detail. The first step consists of partitioning the space [n]2 into multiple overlapping boxes as shown in Fig. 5. Recall that the hyper-parameters A and B allow one to tune both the size of a box and the number of boxes that will cover a given location of [n]2. In each box indexed by (x,y) for \(x,y \in \left [ \lceil \frac {\tilde {n}}{A} \rceil \right ]\), we identify the nodes of G having their spatial label in that box; let us denote the collection of such nodes by Hx,y. If the number of nodes in Hx,y is small (e.g., smaller than a certain hyper-parameter M), then we do not attempt to cluster these nodes. We need to set such a minimum size of Hx,y or the output of clustering may turn out to be noisy and non-informative. In addition, if more than an α<1 fraction of nodes in Hx,y have at least one community estimate, then again we do not cluster Hx,y. The reason for doing so is to decrease the running time by ensuring we only perform the clustering step when there are sufficiently many new unexplored nodes. In each remaining box (x,y) (i.e., each box with at least M nodes where at most an α fraction of them have prior estimates) we apply a fast and simple local clustering algorithm. In particular, we generate an approximate clustering of the nodes in Hx,y by directly running a standard k-means algorithm [27] on the adjacency matrix of Hx,y. We then iteratively improve upon this initial guess by reassigning each node to the most likely cluster while keeping the assignment of other nodes fixed. This iterative update rule is reminiscent of the class of Expectation Maximization algorithms, although our method is fast and non-parametric. We run the iterative clean-up procedure for iter number of iterations. Once the nodes of Hx,y are clustered, we append the result to the dictionary of lists \(\mathcal {C}\) after appropriately synchronizing the community estimates of each node. Once we have iterated over all boxes, we assign a single community estimate to each node based on the majority in the list of estimates in \(\mathcal {C}\). The algorithm is formalized as the following pseudo-code.
Local Clustering Step - This step is described in Algorithm 2. We follow a two step-procedure. In the first step, we get an approximate clustering of the graph H by applying the standard k-means algorithm on the adjacency matrix H. We then one-hot encode this clustering result. One hot encoding is one where for each read we associate a k×1 vector in which all entries re 0 except for a single entry corresponding to the estimated community label of that read to be 1. More formally, if r denotes the number of nodes of H, then the one-hot encoding result is a matrix e0 of size r×k. Each entry of e0 is either 0 or 1; the entry in the ith row and jth column being 1 implies that the ith node is classified as belonging to community j. Thus, each row of e0 contains exactly one 1 while the rest of the entries are all 0. We then run a ‘clean-up’ procedure by iteratively updating the estimate as follows.
$$\begin{array}{*{20}l} \mathbf{e}_{t+1} = \mathcal{T}(H \mathbf{e}_{t}). \end{array} $$
(4)
The function \(\mathcal {T}\) is applied row-wise; for matrix A, it sets the ith row and jth column of \(\mathcal {T}(A)\) to 1 if j= arg maxA[i], else the ith row and jth column of \(\mathcal {T}(A)\) is set to 0. If a row has more than one column where the maximum is attained, the first column where the maximum occurs is assigned value 1 while the other columns are assigned value 0. Hence the dimensions of A and \(\mathcal {T}(A)\) are the same. Furthermore, for any matrix A, the matrix \(\mathcal {T}(A)\) is such that its entries are either 0 or 1, with each row having exactly one entry valued 1.
The iterative update is based on the following intuition. Let the clustering be encoded by matrix et for some \(t \in \mathbb {N}\), and consider a tagged node u∈[n]. The new updated value of the community label of node u is then the ‘most-likely’ label given the estimates of the community labels of the rest of the nodes. More precisely, the ‘weight’ that a node u is in a community l∈[k] is the sum of the weights along the edges connecting u to v∈[n] in G such that the estimate of node v is l. The new community label of node u is the one having the largest weight. By performing this operation simultaneously for all nodes, one obtains the representation in Eq. (4). The intuition for reassigning the node to the community with the maximum weight stems from the observation that if a weight along an edge is positive and large, then it is likely that the end nodes of the edge share the origin, i.e., the corresponding reads sample the same haplotype. Likewise, if the weight along an edge is negative and large, then it is likely that the end nodes represent reads that sample different haplotypes. Therefore, for the iterative update to perform well, the initial estimate e0 must be ‘good enough’; we achieve this by applying the k-means clustering algorithm on the adjacency matrix H. In principle, one can obtain somewhat better initial guess for e0 by applying the k-means algorithm to the eigenvectors of H, but the marginal gains in statistical accuracy does not warrant the enormous increase in computation needed to perform such a spectral clustering.
The clean-up method, at first glance, seems to bear similarities to other dynamical algorithms such as expectation maximization, Belief Propagation (BP) and tensor factorization based methods of [17]. Unlike BP, however, we do not iterate the beliefs or probabilities of a node belonging to various communities; instead, we make a hard assignment at each update step. While for standard BP algorithms it is desirable that a graph is tree-structured, our graph contains a lot of triangles and loops due to the spatial embeddings. Therefore, it would be insufficient to keep track of the node marginals – instead, BP would need the entire joint distribution which is not tractable. Despite the undesirable properties of G, benchmarking results demonstrate that our algorithm performs community detection on the graph very well.
Synchronization Step - The main routine in Algorithm 1 considers the boxes sequentially and performs local clustering steps. Once the local clustering is performed, a key component is to synchronize the estimates of the current box with the estimates of the boxes that are already clustered. The synchronization is essential since the problem is a permutation invariant to the labels. Formally, the statistical distribution of the data remains unchanged if the true underlying labels of the [k] strings are permuted. Hence, the best hope for any recovery algorithm is to reconstruct the k strings upto a permutation of labels. Thus, if any clustering algorithm is run on two different subsets of nodes, the corresponding haplotype estimates need to be synchronized to produce a clustering of the nodes in the union of the sets. We perform this clustering in line 8 of the main routine Algorithm 1 by invoking the sub-routine 3.
In sub-routine 3, we decide on how to permute the community label output of the local clustering estimate of Hx,y that best ‘synchronizes’ with the label estimates of the other nodes of G at that instance. Observe that at the instant of synchronizing the output of Hx,y, other nodes of G have either none or multiple label estimates. There is a possibility that more than one label estimate per node is present in multiple boxes, each adding an ‘evidence’ for a node’s cluster. We select a permutation of the labels by sequentially going over all permutations of [k] and selecting the one that has the highest ‘synchronization-weight’. More formally, let \(\mathcal {N}_{x,y} \subset [n]\) denote the indices of the nodes in Hx,y; for all \(u \in \mathcal {N}_{x,y}\), denote by σ[u]∈[k] the label estimates formed by the local clustering on Hx,y. The synchronization weight for a permutation π of [k] is defined as
$$\begin{array}{*{20}l} W_{\pi}:=\sum\limits_{i \in \mathcal{N}_{x,y}} \mathbf{1}_{\mathcal{C}[i] \neq \text{empty}} \frac{\sum_{j \in \mathcal{C}[i]} \mathbf{1}_{\pi(\sigma[i]) = \mathcal{C}[i][j]} }{\sum_{j \in \mathcal{C}[i]}1}. \end{array} $$
In words, we go over all nodes in Hx,y that have at least one prior estimate and sum the fraction of the previous estimate equaling the label assigned by the local clustering Hx,y after applying the permutation π to the local clustering’s output. Among all permutations, we select the π∗ having the highest synchronization weight (ties are broken arbitrarily). After doing so, for each node u of Hx,y we append the label π∗(σ(u)) to the list \(\mathcal {C}[u]\). The key feature of the above procedure is consideration of the fraction, which is a proxy for the ‘belief’ of the community label of a node, rather than just a count; this is meaningful because the counts across different nodes can be significantly skewed by the order in which the boxes are clustered and synchronized.
Computational complexity
In this section, we discuss the computational complexity of implementing our algorithm and the effect of various hyper-parameters on runtimes. A naive implementation of the algorithm would incur a cost of the order n2 just to construct the graph G from the reads. This step itself may be infeasible in practical scenarios where the number of reads will be on the order of millions. However, our algorithm only needs the subgraphs Hx,y instead of the full graph G. Therefore, we pre-process the reads and create a hash-map where for each location in [n]2, we store the list of reads that have spatial label in that location. This requires one pass through the list of reads, i.e., has computational complexity of order n and storage complexity of order n. Now, creating the adjacency matrix Hx,y is quadratic in only the number of nodes in Hx,y. The synchronization step requires time complexity of the order of the number of nodes in Hx,y times the number of distinct permutations of [k].
Choosing the parameters of our algorithm
Our algorithm has an inherent trade-off between computational complexity and statistical accuracy that can be achieved by varying the hyper-parameters. For instance, if we decrease A while keeping B fixed, the number of boxes and therefore the computational time are increased. However, the statistical accuracy would improve since each node would now be present in many boxes and hence the error-correction scheme performs more accurately. Similarly, increasing the parameter M can reduce the run-time by considering fewer boxes to perform local clustering, while potentially decreasing statistical accuracy because there is less evidence for determining community label of each node.