Sequence data and phylogenetic analysis
Positive SARS-CoV-2 samples collected in British Columbia (BC), Canada, between 18th March 2020 and 13th August 2021 underwent whole-genome sequencing at the BCCDC Public Health Laboratory. Sequencing sampling strategy changed over the course of the pandemic and increases with sequencing capacity at the lab. Sampling strategies included random sampling (ranging from 5–100% of cases at different periods) and targeted sampling (outbreaks and targeted populations such as travellers) [27]. Sequence data used in this study have been deposited in the GISAID database [8].
Nucleic acids were extracted using the MagMAX instrument from Thermofisher (AM1836) and amplified using the Freed primer scheme (1200 base pair amplicons) detailed here [28]. Consensus sequences were generated using the Connor Laboratory pipeline (https://github.com/connor-lab/ncov2019-artic-nf) with consensus bases called at a frequency of 0.75 with a subsampling read count strategy. Consensus sequences were aligned and trimmed to Wuhan-Hu-1 reference sequence (Accession MN908947, Version MN908947.3) using MAFFT (v7.471) [29] prior to phylogenetic tree production. A specific fork of the ARTIC pipeline for processing SARS-CoV-2 sequences was created to support the SARS-CoV-2 sequencing efforts at the BCCDC Public Health Laboratory, located here: https://github.com/BCCDC-PHL/ncov2019-artic-nf. Sequences with no collection date or excess or ambiguous sites (> 15% missing calls) were removed from the analysis.
Phylogenetic analyses
A multiple sequence alignment of the full SARS-CoV-2 genome was used to construct maximum-likelihood (M-L) phylogenetic trees with IQ-TREE (v.2.1.3) [30]. One sequence per individual was included for analysis, with the earliest sequence chosen where longitudinal samples were taken from the same disease episode. Optimal nucleotide substitution models for the data were chosen using ModelFinder in IQ-TREE [31] and applied to each tree construction pipeline. For comparison to the proposed clustering approach, phylogenetic clustering was performed using TreeCluster (v.1.0.3) [21] using two thresholds, 1) a maximum divergence threshold within clusters of 4 × 10–4 substitutions/genome (used previously for generating SARS-CoV-2 phylogenetic clusters 14), and 2) a maximum pairwise divergence threshold (among pairs in a cluster) of 5 × 10–5 substitutions/genome, equivalent to a SNP distance of between 1–2 SNPs.
Genomic clustering methodology
Genomic clusters were defined as networks of connected sequences (nodes) where the pairwise probability of clustering was above a given threshold. The probability of clustering between two sequences was calculated using the logit model:
$${\varvec{P}}=\frac{1}{1+{{\varvec{e}}}^{-{\varvec{z}}}}$$
$${\varvec{z}}={{\varvec{\beta}}}_{0}+{{\varvec{\beta}}}_{{\varvec{d}}}{{\varvec{x}}}_{{\varvec{d}}}+{{\varvec{\beta}}}_{{\varvec{t}}}{{\varvec{x}}}_{{\varvec{t}}}\dots {{\varvec{\beta}}}_{{\varvec{d}}}{{\varvec{x}}}_{{\varvec{d}}}$$
Coefficients (β) can be either manually chosen or estimated using the logistic regression on data with known clusters. d is the pairwise genetic divergence, calculated between all pairs of isolates by extracting patristic distances (the sum of branch lengths connecting two tips) on the phylogenetic tree, in units of substitutions/genome. t is a measure of difference in time between sequences, either the date of collection or symptom onset, and can be extracted from the associated metadata or inferred from a timed phylogeny. Additional covariates (n), such as contact data between hosts or shared exposure events, can be included to further resolve clusters. Pairwise transmission probabilities calculated in previous clustering runs can be included in new analysis to allow for greater continuity in cluster designations, as well as permitting subsequent clustering runs to be run on subsampled datasets to increase speed and efficiency when clustering large numbers of sequences. The full R code is available at github.com/bensobkowiak/cov2clusters.
We compared the results of our genomic clustering method at three pairwise probability thresholds of 0.7, 0.8 and 0.9 to link sequences to the clusters obtained using TreeCluster ‘max_clade’ (where the maximum pairwise patristic distance threshold between any two sequences in a cluster was 4 × 104 substitutions/genome) and ‘single_linkage’ (where any two sequences up to a maximum patristic distance threshold of 5 × 105 substitutions/genome must be in the same cluster). Beta coefficients for the genomic clustering algorithm of β0=3, β1=-1.9736 × 10–4, and β2= 7.5 × 10–2 were chosen to only link sequences at the 0.7 probability threshold with a maximum genomic divergence equivalent to two SNPs when the time between collection dates is low (less than 10 days). These values correspond to a pairwise probability of 0.95 between sequences with the same genomic sequence and collected date, with a decrease in pairwise probability as the patristic distance and/or collection date difference increases. Supplementary figure S3 shows the pairwise probabilities from logistic regression with these beta coefficients when varying the patristic distance (converted to SNP distance by multiplying by the genome length) and difference in collection dates (in days) between sequences.
Clustering accuracy and stability measures
The clustering accuracy of each tested method was assessed using three measures for evaluating clustering, the precision, recall and F1 score. These measures were calculated from sequences that were collected in seven epidemiologically well-supported clusters from the SARS-CoV-2 BC dataset (Supplementary Table S1).
Precision is defined as:
$${\varvec{P}}{\varvec{r}}{\varvec{e}}{\varvec{c}}{\varvec{i}}{\varvec{s}}{\varvec{i}}{\varvec{o}}{\varvec{n}}=\boldsymbol{ }\frac{{{\varvec{\Sigma}}}_{{\varvec{k}}}{{\varvec{m}}{\varvec{a}}{\varvec{x}}}_{{\varvec{s}}}\{{{\varvec{a}}}_{{\varvec{k}}{\varvec{s}}}\}}{{{\varvec{\Sigma}}}_{{\varvec{k}}}{{\varvec{\Sigma}}}_{{\varvec{s}}}{{\varvec{a}}}_{{\varvec{k}}{\varvec{s}}}}$$
Recall is defined as:
$${\varvec{R}}{\varvec{e}}{\varvec{c}}{\varvec{a}}{\varvec{l}}{\varvec{l}}=\boldsymbol{ }\frac{{{\varvec{\Sigma}}}_{{\varvec{k}}}{{\varvec{m}}{\varvec{a}}{\varvec{x}}}_{{\varvec{k}}}\{{{\varvec{a}}}_{{\varvec{k}}{\varvec{s}}}\}}{{({\varvec{\Sigma}}}_{{\varvec{k}}}{{\varvec{\Sigma}}}_{{\varvec{s}}}{{\varvec{a}}}_{{\varvec{k}}{\varvec{s}}}+{\varvec{U}})}$$
And the F1 score is calculated as:
$${\varvec{F}}1=\boldsymbol{ }\frac{{\varvec{P}}{\varvec{r}}{\varvec{e}}{\varvec{c}}{\varvec{i}}{\varvec{s}}{\varvec{i}}{\varvec{o}}{\varvec{n}}\boldsymbol{*}{\varvec{R}}{\varvec{e}}{\varvec{c}}{\varvec{a}}{\varvec{l}}{\varvec{l}}}{{\varvec{P}}{\varvec{r}}{\varvec{e}}{\varvec{c}}{\varvec{i}}{\varvec{s}}{\varvec{i}}{\varvec{o}}{\varvec{n}}+{\varvec{R}}{\varvec{e}}{\varvec{c}}{\varvec{a}}{\varvec{l}}{\varvec{l}}}$$
Here, \({\varvec{k}}\) is the number of clusters predicted by the clustering method, \({\varvec{s}}\) is the number of epidemiologically supported clusters, and \({{\varvec{a}}}_{{\varvec{k}}{\varvec{s}}}\) is the total number of sequences belonging to the \({{\varvec{k}}}^{{\varvec{t}}{\varvec{h}}}\) and \({{\varvec{s}}}^{{\varvec{t}}{\varvec{h}}}\) clusters.
We assessed the stability of clusters through time by calculating three measures on the clusters predicted on the SARS-CoV-2 sequences collected in BC between 11th June 2021 and 13th August 2021. Phylogenetic tree construction and clustering was repeated each week, adding sequences that were collected in that week to the dataset. We calculated the proportion of sequences that moved from to non-clustered, the number of cluster splits, and the overall entropy score of the clusters, in each week after the first clustering analysis. Cluster entropy was defined using Shannon’s entropy as:
$$\mathcal{H}\left({\varvec{X}}\right)=\boldsymbol{ }-\sum {\varvec{P}}\left({{\varvec{x}}}_{{\varvec{i}}}\right)\boldsymbol{*}\boldsymbol{ }{\mathbf{log}}_{2}{\varvec{P}}\boldsymbol{ }({{\varvec{x}}}_{{\varvec{i}}})$$
Here, \(\mathcal{H}\left({\varvec{X}}\right)\) is the overall clustering entropy and \({\varvec{P}}\) is the probability of belonging in cluster \({{\varvec{x}}}_{{\varvec{i}}}\). The lowest score of 0 occurs when all sequences are in a single cluster and non-clustered sequences are not included in the calculation.