Relatedness depth (ReD) algorithm
ReD is a deterministic algorithm based on deterministic hierarchical clustering. The key concept of this method is a kclustered intersection of viral populations (we used similar idea previously for combinatorial pooling [17]). For two sets of viral sequences P
_{1} and P
_{2}, their kclustered intersection \(P_{1} \overline {\cap } P_{2}\) is calculated as follows:

1)
Partition the union P
_{1}∪P
_{2} into k clusters C
_{1},...,C
_{
k
};

2)
\(P_{1} \overline {\cap } P_{2} = \bigcup \limits _{i\in B} C_{i}\), where B={i∈{1,...,k}:C
_{
i
}∩P
_{1}≠∅,C
_{
i
}∩P
_{2}≠∅}, i.e. \(P_{1} \overline {\cap } P_{2}\) is the union of clusters, which contain sequences from both P
_{1} and P
_{2} (see Fig. 1);
The parameter k is a scale of clustering. In particular, populations P
_{1} and P
_{2} are separable, if \(P_{1} \overline {\cap } P_{2} = \emptyset \), while the fact that \(P_{1} \overline {\cap } P_{2} \ne \emptyset \) indicates that they may be genetically related. In the most extreme case \(P_{1} \overline {\cap } P_{2} = P_{1}\cup P_{2}\), i.e. populations are completely inseparable under the scale k.
The degree of confidence that the samples are genetically close is represented by the relatedness depth
d(P
_{1},P
_{2}), which is calculated by Algorithm 1. Simply speaking, Algorithm 1 tries to recursively separate populations P
_{1} and P
_{2}. At each iteration, kclustered intersection is calculated. If two populations are separable, then the algorithm stops. Otherwise, it continues the separation of sequences from P
_{1} and P
_{2} within their kclustered intersection. The separation depth is a depth of this recursion. It is possible that at some iterations of Algorithm 1 two populations are completely inseparable under a current clustering scale. In this case, the scale k is increased and kclustered intersection is recalculated. The initial value of k used by Algorithm 1 is k=2.
kclustered intersections depend on a clustering method. Our implementation uses a hierarchical clustering based on neighborjoining tree (as implemented in Matlab (MathWorks, Natick, MA)). The algorithm utilizes a standard JukesCantor distance which is based on the simplest substitutionbased evolutionary model.
Clustered intersections also allow for estimating the direction of transmissions. It is reasonable to assume that if two hosts share a population, then a host with more heterogeneous population is more likely to be the transmission source [18]. Formally, if \(I = P_{1} \overline {\cap } P_{2}\), P
_{1}⊆I and P
_{2} ∖ I≠∅, then we assume that probable transmission direction is from P
_{2} to P
_{1} (see Fig. 1). The direction is defined according to the first occurrence of such situation during execution of Algorithm 1. Note that in some cases direction may not be identified.
Given the collection of viral populations \(\mathcal {P} = \{P_{1},...,P_{n}\}\), ReD produces the weighted directed genetic relatedness graph G=(V,A,d) with \(V=\mathcal {P}\). An arc (P
_{
i
},P
_{
j
}) is in A whenever populations P
_{
i
} and P
_{
j
} are genetically related, i.e., have sufficiently high relatedness depth; the direction of an arc corresponds to the estimated direction of transmission and its weight to the relatedness depth. Transmission clusters are calculated as weakly connected components of the digraph G. To determine transmission clusters, the simplest depth cutoff T=1 can be used. In addition, only components containing at least one arc a of weight d(a)≥2 were considered as reliable. For each reliable component, a source s of the corresponding outbreak is identified as a vertex with highest eigenvector centrality.
Viral outbreak inference (VOICE) simulation method
VOICE is another approach to predict epidemiological characteristics. Unlike ReD, it is not deterministic. Instead, it simulates the process of evolution from one viral population (source) into another (recipient) as a Markov process on a union of both populations. VOICE starts evolution from a subset of source sequences called the border set and estimates the number of generations required to acquire a genetic heterogeneity observed in the recipient.
Formally, given two sets of viral sequences P
_{1} and P
_{2}, VOICE simulates viral evolution to estimate times t
_{12} and t
_{21} needed to cover all sequences from the recipient population under the assumptions that first and second host were sources of infection. Based on the value min{t
_{12},t
_{21}}, the algorithm decides whether the populations are related. The direction of possible transmission between the related pair is assumed to follow the direction which requires less time.
The simulation starts from the δborder set
B
_{1}, which contains viral variants that are likely the closest to variants transmitted between P
_{1} and P
_{2}. It is defined as the set of vertices of P
_{1} minimizing pairwise Hamming distance D between vertices from P
_{1} and P
_{2} up to a constant δ:
$$B_{1} = \left\{u\in P_{1} : \exists v\in P_{2} ~~ D(u,v) = \min_{x\in P_{1},y\in P_{2}}D(x,y) + \delta \right\} $$
(see Fig. 2). The constant δ is a parameter, with the default value 1.
The simulated evolutionary process is carried out in the evolutionary space represented by the variant graph
G(B
_{1},P
_{2}), which is constructed as follows. First, construct a union of all minimal spanning trees of the complete graph on a vertex set B
_{1}∪P
_{2} with the edge weights equal to Hamming distances between variants (sometimes referred to as a pathfinder network P
F
N
e
t(n−1,∞) [19, 20]). Then substitute every edge in graph with two directed edges of the same weight. Next, subdivide each edge (u
_{1},u
_{2}) of weight w≥2 with w−1 vertices v
_{1},...,v
_{
w−1} and add multiple directed edges as follows: add w−1 edges between vertices u
_{1} and v
_{1}; w−2 edges between v
_{1} and v
_{2}; and so forth as shown on Fig. 3. This model can be explained as follows: to mutate from vertex u
_{1} to u
_{2} during simulation, there should occur mutations at w positions that are different between u
_{1} and u
_{2}. During the first step, simulation can mutate any of w positions, then any of w−1 positions on the second step and so forth.
The simulation starts from all border vertices B
_{1} and runs until all the vertices of the population P
_{2} are reached. At the beginning of the simulation, border vertices get count equal to 1, and the rest of the vertices get count 0. Each tact simulates variants replication by updating vertex counts according to one of the three following scenarios happening with the specified probabilities (see Fig. 4). First, if during replication there are no mutations, then the vertex v replicates itself and its count label is incremented. This happens with the probability p
_{1} (1). Second, the vertex can mutate into one of its neighboring vertices with probability p
_{2} (see Eq. (2)), in which case the count of the neighbor is incremented. Finally, with probability p
_{3}, vertex does not produce any viable offspring, in which case vertex counts are not changed. If the count of a vertex reaches the maximum allowed variant population size C
_{
max
}, then it is not increased. The probabilities of these scenarios are calculated as follows:
$$\begin{array}{@{}rcl@{}} p_{1} & = & (13\epsilon)^{L} \end{array} $$
(1)
$$\begin{array}{@{}rcl@{}} p_{2} & = & p_{1}\frac{\epsilon}{13\epsilon} \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} p_{3} & = & 1  p_{1}  p_{2}\deg^{}(v) \end{array} $$
(3)
where ε is the mutation rate, L is the genome length and deg−(v) is an outdegree of a vertex v.
Algorithm 2 represents the flow of the method. The time t
_{12} is computed as the average over s simulations. The same procedure is repeated for the opposite direction of the transmission with its border set B
_{2} and the time t
_{21} is computed. The value min{t
_{12},t
_{21}} determines which direction of transmission is more likely.
Data normalization
The sizes of observed intrahost viral populations may significantly vary due to sampling and sequencing biases. Since the larger population will require more time to cover, the estimation of t
_{12} and t
_{21} could be biased. VOICE avoids such biases by normalizing the intrahost population sizes. The deterministic normalization partitions each viral population into q clusters using hierarchical clustering and each cluster is replaced with the consensus of its members. The subsampling normalization randomly chooses q sequences from each population. The procedure is repeated r times, and the final result is an average over all subsamplings.
Identification of genetic relatedness, transmission directions, clusters and sources of outbreaks
Analogously to ReD, VOICE produces a weighted directed genetic relatedness graph G=(V,A,w) with \(V=\mathcal {P}\). An arc P
_{
i
}
P
_{
j
} is in A whenever populations P
_{
i
} and P
_{
j
} are genetically related, i.e., value min{t
_{
ij
},t
_{
ji
}} is less than a threshold. Weakly connected components of G represent transmission clusters or outbreaks. To determine the source of each outbreak, we build a Shortest Paths Tree (SPT) for every vertex in the corresponding component. The source is estimated as the vertex with an SPT of minimal weight.
MinDistB method
The method extends the MinDist approach proposed in [12], which defines the distance between viral populations as the minimum Hamming distance between their representatives. The new approach also takes into account sizes of border sets, on which the minimum distance is achieved.
Formally, given an integer δ (by default δ=1), the δcrossing between populations P
_{1} and P
_{2} is the set of pairs of variants (u,v) from different populations, the Hamming distance D(u,v) between which is within δ from the minimum Hamming distance:
$${{\begin{aligned} {}B_{\delta}(P_{1},P_{2})= \left\{(u,v): u \in P_{1}, v \in P_{2}, D(u,v) \le \min_{x\in P_{1},y\in P_{2}}D(x,y) + \delta\right\} \end{aligned}}} $$
(see Fig. 2). Our empirical study shows that in case when the crossing is large (see Fig. 2
a), then the populations are less likely to be related than in case when the borders are small (see Fig. 2
b).
This effect can be intuitively explained. Two related populations likely diverge away from the common ancestor and from each other, and their borders are formed by few old survived variants closest to the common ancestor. Two unrelated populations diverging from two different ancestors may in time reduce minimum distance from each other randomly and closest variants are relatively young and abundant (see Fig. 5).
We define a δdistance between populations P
_{1} and P
_{2} as follows:
$$ D_{\delta}(P_{1}, P_{2}) = D(P_{1},P_{2}) + c \ln(B_{\delta} (P_{1},P_{2})) $$
(4)
where c=3 is an empirically chosen constant.
Identification of genetic relatedness, transmission clusters and sources of outbreaks
For MinDistB methods, genetic relatedness graph G=(V,E,w) is a weighted undirected graph with the vertex set \(V=\mathcal {P}\) and an edge of weight w
_{
i,j
} connecting populations P
_{
i
},P
_{
j
} whenever w
_{
i,j
}=D
_{
δ
}(P
_{1},P
_{2}) does not exceed a threshold. Transmission clusters are estimated as connected components of the graph G. For each transmission cluster its source could be inferred either as a vertex with maximum eigenvector centrality or as a vertex with the shortest paths tree of minimal weight.