Our algorithms are based on finding the distance between populations using Earth Movers’ Distance (EMD) between distributions of k-mers in NGS data. The general pipeline of the algorithm (see Fig. 1) includes obtaining k-mer distributions from NGS reads for corresponding hosts and computing EMD between them. As a result, we obtain mean of hosts A and B Mean(A, B) and EMD EMD(A, B) between them. We first describe how we find distances between k-mers and then describe how we find distance between samples.
Finding distances between k-mers in the De Bruijn graph
k-mer refers to a substring of length k. In our work, we use De Bruijn graph to calculate distance between k-mers. De Bruijn graph is the graph, that is constructed so that vertices represent every string over a finite alphabet of length l, and edges are added between vertices that have overlap of l − 1.
Once De Bruijn graph is constructed, distance between k-mers can be calculated as a length of shortest path between corresponding vertices using breadth-first search algorithm. In our algorithms, obtained graph is converted to undirected be- fore shortest path computation.
Finding EMD between viral samples
Viral populations can be compared by comparing the corresponding k-mer distri- butions using EMD. First, k-mer distributions are obtained for each sample, so that they contain all k-mers and normalized frequencies.
EMD is a method, that allows to evaluate dissimilarity between two multi- dimensional distributions in some feature space where a distance measure between single features (ground distance) is given [13]. Distributions can be represented as signatures - sets of clusters, so that each cluster is represented by its mean and by the fraction of distribution that belongs to that cluster. Computation of EMD is based on solving the transportation problem, which can be formulated as following: for several suppliers, each with a given amount of goods, several consumers, each with limited capacity, and a cost of transporting a single unit of goods between each supplier-consumer pair, find a least-expensive flow of goods from the suppliers to the consumers that satisfies the consumers’ demand. EMD is calculated as the following: \( EMD\left(P,Q\right)={\varSigma}_{i=1}^m{\varSigma}_{j=1}^n{f}_{ij}{d}_{ij} \) where fij is the minimum-cost flow between supplier i and consumer j, and dij is the distance between i and j.
It should also be noted that EMD is usually normalized by the total flow, but we perform but we perform normalization of frequencies in k-mer distributions before EMD computation, which results in total flow always being equal to 1.
Example of EMD computation
Constructing of the De Bruijn graph between two sequences CGATTCTAAGT and CGATTGTAAGT is shown on Fig. 2. Once original graph is obtained, directions are removed and pairwise distances are computed for all k-mers. Figure 3 describes an example of k-EMD distance computation. After k-mer distributions are generated for input sequences, EMD is computed as the work, where fij is the flow between histogram(k-mer distribution) elements i and j and dij is the corresponding distance between k-mers, which is obtained from De Bruijn graph (Fig. 2). This way, EMD = 0.88.
Mean k-mer distribution
Representing samples as k-mer distributions allows to estimate the center from a group of samples by introducing a mean host. We use the maximum mean k-mer distribution, which is obtained by finding the maximum observed frequency for each.
$$ \mathrm{k}-\mathrm{mer}\ {k}_i{f}_i^{max}={\mathit{\max}}_{1\le i\le n}{f}_i\ \mathrm{and}\ \mathrm{normalization}\ {f}_j^{\prime }=\frac{f_i^{max}}{\sum \limits_{1\le i\le n}{f}_i^{max}}. $$
Identification of relatedness
We train out algorithm on all given outbreaks and obtain minimal EMD between 2 unrelated hosts, which we use as a threshold t. To identify whether 2 hosts A and B are related, we compute EMD between them EMD(A, B) and predict that they are related if EMD(A, B) < t, and unrelated otherwise.
.Identification of transmission direction between hosts
To infer transmission direction between a pair of samples X and Y, we first compute a mean host Mean(A, B).
Once Mean(A, B) is obtained, we calculate EMD between mean host and hosts A and B EMD(Mean(A, B), A) and EMD(Mean(A, B), B). Host, that is closer to the maximum mean is assumed to be the transmission source, so that if EMD(Mean(A, B), A) < EMD(Mean(A, B), B), we predict that the transmis- sion happened from A to B (Fig. 4).
Identification of transmission clusters
To test hierarchical clustering, single-linkage algorithm was used. This method eval- uates the similarity of two clusters based on their most similar members [14] and groups clusters in bottom-up order until certain termination condition is satisfied. In our algorithm, we use a distance criteria, so clusters are merged until distance between them exceeds a pre-defined distance threshold, which represents EMD be- tween two closest unrelated samples in the dataset. This way, we obtain a partition, where some of the related hosts remain in different clusters. At this point, we pro- ceed to the second stage of the algorithm, that allows to improve the clustering quality by merging the clusters, that contain related hosts by performing the fol- lowing steps:
-
1.
For each cluster, obtained from hierarchical clustering, compute center as the mean of all hosts within the cluster;
-
2.
For each center, obtained at the previous step:
-
Find distances to the furthest in-cluster host and closest host, that be- longs to the different cluster;
-
If for cluster A there exists an ‘overlap’ (there is a host from cluster B, that is closer to the center than the furthest host, belonging to the same cluster (A)), merge A and B
Example of the algorithm is demonstrated in Fig. 5. a) shows output of threshold- based hierarchical clustering, where circles represent hosts, that are connected with an edge if distance between them doesn’t exceed a threshold. There are 2 clusters that belong to the same outbreak. b) shows how clusters are merged based on circle overlap. For each cluster, mean host of all hosts within the cluster is calculated (shown in the center). Circles with dashed borders have centers in respective mean hosts; their radiuses are calculated as distances between mean hosts and furthest in-cluster hosts. In the example, Mean 1 is closer to host A that to the furthest host from the same (left) cluster. This way, according to our algorithm, intersecting clusters collapse.
Deciding whether source is present in a set of hosts
To decide whether source is present in a set of sequenced hosts S, the following algorithm is applied:
-
1.
Calculate mean Mean(S) for all hosts within an outbreak;
-
2.
For every host H, calculate EMD between H and mean Mean EMD(Mean(S), H);
-
3.
If there exists a host, for which EMD(Mean(S), H) < t, source is present. To obtain threshold t, we train the algorithm on all outbreaks with known sources. For every such outbreak, we first calculate the mean host Mean(S) and distances between mean and every host H in the outbreak EMD(Mean(S), H), find the smallest distance and normalize it by the median distance from mean to host in an outbreak. After this, we repeat the procedure for the same outbreak, but discard the source. We define t as the minimal EMD(Mean(S), H) for an outbreak without source, which maximizes accuracy, so that outbreaks, where source is present, have EMD(Mean(S), H) < t.
Source identification
To identify sources, we find a maximum mean host for an outbreak Mean and cal- culate EMD between every host and Mean. Host with minimum EMD(H, Mean) is assumed to be the source.
Runtime complexity
The algorithm uses Pele and Werman’s [15, 16] algorithm for fast EMD computation, which has a runtime complexity of O(N 2UlogN), where N is the number of nodes (k-mers), and U is an upper bound on the largest supply (flow) of any node (since frequencies are normalized, this is equal to 1). This way, k – mer EMD has a worst time complexity of O(N 2logN).