 Research
 Open Access
Inference of genetic relatedness between viral quasispecies from sequencing data
 Olga Glebova†^{1}Email author,
 Sergey Knyazev^{1},
 Andrew Melnyk^{1},
 Alexander Artyomenko^{1},
 Yury Khudyakov^{2},
 Alex Zelikovsky^{1} and
 Pavel Skums^{1, 2}
 Published: 6 December 2017
Abstract
Background
RNA viruses such as HCV and HIV mutate at extremely high rates, and as a result, they exist in infected hosts as populations of genetically related variants. Recent advances in sequencing technologies make possible to identify such populations at great depth. In particular, these technologies provide new opportunities for inference of relatedness between viral samples, identification of transmission clusters and sources of infection, which are crucial tasks for viral outbreaks investigations.
Results
We present (i) an evolutionary simulation algorithm Viral Outbreak InferenCE (VOICE) inferring genetic relatedness, (ii) an algorithm MinDistB detecting possible transmission using minimal distances between intrahost viral populations and sizes of their relative borders, and (iii) a nonparametric recursive clustering algorithm Relatedness Depth (ReD) analyzing clusters’ structure to infer possible transmissions and their directions. All proposed algorithms were validated using real sequencing data from HCV outbreaks.
Conclusions
All algorithms are applicable to the analysis of outbreaks of highly heterogeneous RNA viruses. Our experimental validation shows that they can successfully identify genetic relatedness between viral populations, as well as infer transmission clusters and outbreak sources.
Keywords
 Genetic relatedness
 Transmission networks
 Outbreaks investigations
 Simulation
 Clustering
Background
Inferring transmission clusters, transmission directions, and sources of outbreaks from viral sequencing data are crucial for viral outbreaks investigation. Outbreaks of RNA viruses, such as Human Immunodeficiency Virus (HIV) and Hepatitis C virus (HCV), are particularly dangerous and pose a significant problem for public health. It is well known that genomes of RNA viruses mutate at extremely high rates [1]. As a result, RNA viruses exist in infected hosts as populations of closely related variants called quasispecies [2, 3]. However, only recently with the progress. Consequently, a contribution of sequencing technologies to molecular surveillance of viral disease epidemic spread becomes more and more substantial [10, 11].
Computational methods can be used to infer transmission characteristics from sequencing data. The first question usually is whether two viral populations belong to the same outbreak. The methods typically utilize the simple observation that all samples from the same outbreak are genetically related, so they use some measure of genetic relatedness as a predictor for epidemiological relatedness [10–12].
The second question is which samples constitute isolated outbreaks. For this purposes, we define a transmission cluster as a connected set of genetically related viral populations. The third questions we address in this article is “Who is the source of infection?”. This questions is the most difficult to answer, and there were only a few attempts to do it computationally using solely genomic data [13] without invoking additional epidemiological information [14]. To the best of our knowledge, there is still no freely available computational tool for this problem.
Computational methods for detection of viral transmissions and inference of transmission clusters are often consensusbased, i.e. they analyze only a single representative sequence per intrahost population (for example, consensus sequence). Such methods assign two hosts into one transmission cluster, if the distances between corresponding sequences do not exceed a predefined threshold [10, 11]. Although consensusbased methods proved to be useful, they do not take into account intrahost viral diversity. Inclusion of whole intrahost populations into analysis is important, because minor viral variants are frequently responsible for transmission of RNA viruses [15, 16].
Recently published computational approach (further referred to as MinDist) [12] uses the minimal genetic distance between sequences of two viral populations as a measure of genetic relatedness of intrahost viral populations. Since minimal genetic distances between different pairs of populations can be achieved on various pairs of sequences, this approach takes into account intrahost diversity.
However, both consensusbased and MinDist approaches have further limitations. First of all, they do not allow to detect directions of transmissions, which is crucial for detection of outbreak sources and transmission histories. Secondly, distance thresholds utilized by both approaches could be derived from analysis of limited or incomplete experimental data and highly data and situationspecific, with different viruses or even different genomic regions of the same virus requiring specifically established thresholds.

Relatedness Depth (ReD) method uses clusteringbased analysis of intrahost viral populations. It is a nonparametric algorithm, so it does not rely on any virusspecific threshold values to predict epidemiological characteristics.

Viral Outbreak InferenCE (VOICE) is a simulationbased method which imitates viral evolution as a Markov process in the space of observed viral haplotypes

MinDistB method is a modification of MinDist [12], which takes into account the sizes of relative borders of each pair of viral populations.
The proposed algorithms were validated on the experimental data obtained from HCV outbreaks. Comparative results suggest that our methods are efficient in epidemiological characteristics inference.
Methods
Relatedness depth (ReD) algorithm
 1)
Partition the union P _{1}∪P _{2} into k clusters C _{1},...,C _{ k };
 2)\(P_{1} \overline {\cap } P_{2} = \bigcup \limits _{i\in B} C_{i}\), where B={i∈{1,...,k}:C _{ i }∩P _{1}≠∅,C _{ i }∩P _{2}≠∅}, i.e. \(P_{1} \overline {\cap } P_{2}\) is the union of clusters, which contain sequences from both P _{1} and P _{2} (see Fig. 1);
The parameter k is a scale of clustering. In particular, populations P _{1} and P _{2} are separable, if \(P_{1} \overline {\cap } P_{2} = \emptyset \), while the fact that \(P_{1} \overline {\cap } P_{2} \ne \emptyset \) indicates that they may be genetically related. In the most extreme case \(P_{1} \overline {\cap } P_{2} = P_{1}\cup P_{2}\), i.e. populations are completely inseparable under the scale k.
The degree of confidence that the samples are genetically close is represented by the relatedness depth d(P _{1},P _{2}), which is calculated by Algorithm 1. Simply speaking, Algorithm 1 tries to recursively separate populations P _{1} and P _{2}. At each iteration, kclustered intersection is calculated. If two populations are separable, then the algorithm stops. Otherwise, it continues the separation of sequences from P _{1} and P _{2} within their kclustered intersection. The separation depth is a depth of this recursion. It is possible that at some iterations of Algorithm 1 two populations are completely inseparable under a current clustering scale. In this case, the scale k is increased and kclustered intersection is recalculated. The initial value of k used by Algorithm 1 is k=2.
kclustered intersections depend on a clustering method. Our implementation uses a hierarchical clustering based on neighborjoining tree (as implemented in Matlab (MathWorks, Natick, MA)). The algorithm utilizes a standard JukesCantor distance which is based on the simplest substitutionbased evolutionary model.
Clustered intersections also allow for estimating the direction of transmissions. It is reasonable to assume that if two hosts share a population, then a host with more heterogeneous population is more likely to be the transmission source [18]. Formally, if \(I = P_{1} \overline {\cap } P_{2}\), P _{1}⊆I and P _{2} ∖ I≠∅, then we assume that probable transmission direction is from P _{2} to P _{1} (see Fig. 1). The direction is defined according to the first occurrence of such situation during execution of Algorithm 1. Note that in some cases direction may not be identified.
Given the collection of viral populations \(\mathcal {P} = \{P_{1},...,P_{n}\}\), ReD produces the weighted directed genetic relatedness graph G=(V,A,d) with \(V=\mathcal {P}\). An arc (P _{ i },P _{ j }) is in A whenever populations P _{ i } and P _{ j } are genetically related, i.e., have sufficiently high relatedness depth; the direction of an arc corresponds to the estimated direction of transmission and its weight to the relatedness depth. Transmission clusters are calculated as weakly connected components of the digraph G. To determine transmission clusters, the simplest depth cutoff T=1 can be used. In addition, only components containing at least one arc a of weight d(a)≥2 were considered as reliable. For each reliable component, a source s of the corresponding outbreak is identified as a vertex with highest eigenvector centrality.
Viral outbreak inference (VOICE) simulation method
VOICE is another approach to predict epidemiological characteristics. Unlike ReD, it is not deterministic. Instead, it simulates the process of evolution from one viral population (source) into another (recipient) as a Markov process on a union of both populations. VOICE starts evolution from a subset of source sequences called the border set and estimates the number of generations required to acquire a genetic heterogeneity observed in the recipient.
Formally, given two sets of viral sequences P _{1} and P _{2}, VOICE simulates viral evolution to estimate times t _{12} and t _{21} needed to cover all sequences from the recipient population under the assumptions that first and second host were sources of infection. Based on the value min{t _{12},t _{21}}, the algorithm decides whether the populations are related. The direction of possible transmission between the related pair is assumed to follow the direction which requires less time.
where ε is the mutation rate, L is the genome length and deg−(v) is an outdegree of a vertex v.
Algorithm 2 represents the flow of the method. The time t _{12} is computed as the average over s simulations. The same procedure is repeated for the opposite direction of the transmission with its border set B _{2} and the time t _{21} is computed. The value min{t _{12},t _{21}} determines which direction of transmission is more likely.
Data normalization
The sizes of observed intrahost viral populations may significantly vary due to sampling and sequencing biases. Since the larger population will require more time to cover, the estimation of t _{12} and t _{21} could be biased. VOICE avoids such biases by normalizing the intrahost population sizes. The deterministic normalization partitions each viral population into q clusters using hierarchical clustering and each cluster is replaced with the consensus of its members. The subsampling normalization randomly chooses q sequences from each population. The procedure is repeated r times, and the final result is an average over all subsamplings.
Identification of genetic relatedness, transmission directions, clusters and sources of outbreaks
Analogously to ReD, VOICE produces a weighted directed genetic relatedness graph G=(V,A,w) with \(V=\mathcal {P}\). An arc P _{ i } P _{ j } is in A whenever populations P _{ i } and P _{ j } are genetically related, i.e., value min{t _{ ij },t _{ ji }} is less than a threshold. Weakly connected components of G represent transmission clusters or outbreaks. To determine the source of each outbreak, we build a Shortest Paths Tree (SPT) for every vertex in the corresponding component. The source is estimated as the vertex with an SPT of minimal weight.
MinDistB method
The method extends the MinDist approach proposed in [12], which defines the distance between viral populations as the minimum Hamming distance between their representatives. The new approach also takes into account sizes of border sets, on which the minimum distance is achieved.
We define a δdistance between populations P _{1} and P _{2} as follows:
where c=3 is an empirically chosen constant.
Identification of genetic relatedness, transmission clusters and sources of outbreaks
For MinDistB methods, genetic relatedness graph G=(V,E,w) is a weighted undirected graph with the vertex set \(V=\mathcal {P}\) and an edge of weight w _{ i,j } connecting populations P _{ i },P _{ j } whenever w _{ i,j }=D _{ δ }(P _{1},P _{2}) does not exceed a threshold. Transmission clusters are estimated as connected components of the graph G. For each transmission cluster its source could be inferred either as a vertex with maximum eigenvector centrality or as a vertex with the shortest paths tree of minimal weight.
Results and discussions
ReD, VOICE and MinDistB were validated using experimental outbreak sequencing data, and their predictions were compared with the previously published MinDist method [12].
Data sets

Outbreak collection contains 142 HCV samples from 33 epidemiologically curated outbreaks reported to Centers for Disease Control and Prevention in 2008–2013. Outbreaks contain from 2 to 19 samples. Epidemiological histories, including sources of infection, are known for 10 outbreaks.

Collection of 193 epidemiologically unrelated HCV samples.
All viral sequences represent a fragment of E1/E2 genomic region of length 264 bp.
Prediction of epidemiological characteristics

genetic relatedness between populations;

transmission clusters representing outbreaks and isolated samples;

sources of outbreaks;

transmission directions between pairs of samples.
Validation results
Methods  MinDist  MinDistB  ReD  VOICED  VOICES 

Relatedness  
Sensitivity, %  90%  92.9%  55.3%  85.2%  86.8% 
AUROC  0.992  0.996  N/A  0.993  0.990 
Clustering  
Sensitivity, %  100%  100%  96.3%  98.2%  98.2% 
Source  
Accuracy, %  50%  40%  90%  80%  90% 
Directions  
Accuracy, %  N/A  N/A  87.1%  83.9%  87.1% 
Genetic relatedness between populations
Detection of transmission clusters
Table 1 shows that MinDistB and MinDist demonstrate the highest sensitivity.
Source identification
The accuracy of the source identification is defined as the percentage of correctly predicted sources for outbreaks, where the correct sources are known. The Source section of Table 1 shows that the best results are achieved by ReD and V O I C E−S which were able to detect sources in 90% of cases. At the same time, MinDist and MinDistB, which are not able to identify transmission directions, were significantly less accurate.
Transmission direction
Among tested algorithms, only ReD and VOICE allows for detection of transmission directions. For that algorithms, percentages of correctly predicted pairs sourcerecipient were calculated (Table 1). Here the highest accuracy of 87.1% was achieved by ReD and V O I C E−S.
Running time
All tests were performed on PC with DDR31333MHz 4 GBx12 RAM and 2 Intel XeonX5550 2.67 GHz processors. The fastest algorithms were MinDist and MinDistB, with running times 9 ms for a pair of samples in our dataset. ReD requires ∼0.1s per pair of samples, While the running time of VOICE is ∼35 s per pair.
Conclusions
Currently, a molecular viral analysis is one of the major approaches used for investigations of outbreaks and inference of transmission networks. Although modern sequencing technologies significantly facilitated molecular analysis, providing unprecedented access to intrahost viral populations, they generated novel bioinformatics challenges.
This work proposed three novel algorithms for the investigation of viral transmissions based on analysis of the intrahost viral populations, which allow clustering genetically related samples, infer transmission directions and predict sources of outbreaks. Evaluation of the algorithms on experimental data from HCV outbreaks demonstrated their ability to accurately reconstruct various transmission characteristics. It should be noted, that although ReD was proved to be accurate in estimation of transmission clusters, directions and sources, its accuracy of relatedness detection is lower than for other evaluated methods. However, the advantage of this method over other methods is its nonparametricity (i.e. independence from virusspecific and genomic regionspecific thresholds), which makes it more universally applicable and extremely useful in situations, when the lack of training data does not allow to establish reliable relatedness thresholds.
The clusteringbased ReD approach may be further improved using a more scalable clustering similar to the algorithm proposed in [17]. The simulationbased approach VOICE presented here may be further improved by incorporating more complex viral evolution models taking into account cell proliferation rate and immune responses against viral variants.
All algorithms are planned to be integrated into the pipeline of cloudbased websystem “Global Hepatitis Outbreak and Surveillance Technology” (GHOST), which is currently being developed by US Centers for Disease Control and Prevention (https://webappx.cdc.gov/GHOST/).
Declarations
Funding
AZ was partially supported by NSF Grant CCF16119110 and NIH Grant 1R01EB02502201; PS was partially supported by NIH Grant 1R01EB02502201; OG, SK, AM, and AA were partially supported by GSU Molecular Basis of Disease Fellowship. The publication costs were funded by NSF Grant CCF1611911.
Availability of data and materials
ReD and VOICE are freely available at https://bitbucket.org/osaofgsu/red and https://bitbucket.org/osaofgsu/voicerep, respectively.
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 10, 2017: Selected articles from the 6th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume18supplement10.
Authors’ contributions
OG and SK designed, implemented and tested the algorithms; AM and AA implemented and tested the algorithms; YK designed the algorithms and analyzed the algorithms’ results; AZ and PS designed and implemented the algorithms, analyzed the results and supervised the research. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Drake JW, Holland JJ. Mutation rates among rna viruses. Proc Natl Acad Sci. 1999; 96(24):13910–3.View ArticlePubMedPubMed CentralGoogle Scholar
 Domingo E, Holland J. Rna virus mutations and fitness for survival. Annu Rev Microbiol. 1997; 51(1):151–78.View ArticlePubMedGoogle Scholar
 Domingo E, Sheldon J, Perales C. Viral quasispecies evolution. Microbiol Mol Biol Rev. 2012; 76(2):159–216.View ArticlePubMedPubMed CentralGoogle Scholar
 Eriksson N, Pachter L, Mitsuya Y, Rhee SY, Wang C, Gharizadeh B, Ronaghi M, Shafer RW, Beerenwinkel N. Viral population estimation using pyrosequencing. PLoS Comput Biol. 2008; 4(5):1000074.View ArticleGoogle Scholar
 Archer J, Braverman MS, Taillon BE, Desany B, James I, Harrigan PR, Lewis M, Robertson DL. Detection of lowfrequency pretherapy chemokine (cxc motif) receptor 4using hiv1 with ultradeep pyrosequencing. AIDS (London, England). 2009; 23(10):1209.View ArticleGoogle Scholar
 Hoffmann C, Minkah N, Leipzig J, Wang G, Arens MQ, Tebas P, Bushman FD. Dna bar coding and pyrosequencing to identify rare hiv drug resistance mutations. Nucleic Acids Res. 2007; 35(13):91.View ArticleGoogle Scholar
 Wang W, Zhang X, Xu Y, Weinstock GM, Di Bisceglie AM, Fan X. Highresolution quantification of hepatitis c virus genomewide mutation load and its correlation with the outcome of peginterferonalpha2a and ribavirin combination therapy. PLoS ONE. 2014; 9(6):100131.View ArticleGoogle Scholar
 Skums P, Campo DS, Dimitrova Z, Vaughan G, Lau DT, Khudyakov Y. Numerical detection, measuring and analysis of differential interferon resistance for individual hcv intrahost variants and its influence on the therapy response. Silico Biol. 2011; 11(5):263–9.Google Scholar
 Campo DS, Skums P, Dimitrova Z, Vaughan G, Forbi JC, Teo CG, Khudyakov Y, Lau DT. Drug resistance of a viral population and its individual intrahost variants during the first 48 h of therapy. Clin Pharmacol Ther. 2014; 95(6):627–35.View ArticlePubMedPubMed CentralGoogle Scholar
 Wertheim JO, Brown AJL, Hepler NL, Pond SLK. The global transmission network of hiv1. J Infect Dis. 2014; 209(2):304–13.View ArticlePubMedGoogle Scholar
 Wertheim JO, Pond SLK, Forgione LA, Mehta SR, Murrell B, Shah S, Smith DM, Scheffler K, Torian LV. Social and genetic networks of hiv1 transmission in new york city. PLoS Pathog. 2017; 13(1):1006000.View ArticleGoogle Scholar
 Campo DS, Xia GL, Dimitrova Z, Lin Y, Forbi JC, GanovaRaeva L, Punkova L, Ramachandran S, Thai H, Skums P, et al. Accurate genetic detection of hepatitis c virus transmissions in outbreak settings. J Infect Dis. 2016; 213(6):957–65.View ArticlePubMedGoogle Scholar
 RomeroSeverson EO, Bulla I, Leitner T. Phylogenetically resolving epidemiologic linkage. Proc Natl Acad Sci. 2016; 113(10):2690–5. doi:10.1073/pnas.1522930113. http://arxiv.org/abs/http://www.pnas.org/content/113/10/2690.full.pdf.View ArticlePubMedPubMed CentralGoogle Scholar
 De Maio N, Wu CH, Wilson DJ. Scotti: Efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput Biol. 2016; 12(9):1005130.View ArticleGoogle Scholar
 Fischer GE, Schaefer MK, Labus BJ, Sands L, Rowley P, Azzam IA, Armour P, Khudyakov YE, Lin Y, Xia G. Hepatitis c virus infections from unsafe injection practices at an endoscopy clinic in las vegas, nevada, 2007–2008. Clin Infect Dis. 2010; 51(3):267–73.View ArticlePubMedGoogle Scholar
 Apostolou A, Bartholomew ML, Greeley R, Guilfoyle SM, Gordon M, Genese C, Davis JP, Montana B, Borlaug G. Transmission of hepatitis c virus associated with surgical proceduresnew jersey 2010 and wisconsin 2011. MMWR Morb Mortal Wkly Rep. 2015; 64(7):165–70.PubMedGoogle Scholar
 Skums P, Artyomenko A, Glebova O, Ramachandran S, Mandoiu I, Campo DS, Dimitrova Z, Zelikovsky A, Khudyakov Y. Computational framework for nextgeneration sequencing of heterogeneous viral populations using combinatorial pooling. Bioinformatics. 2015; 31(5):682–90. doi:10.1093/bioinformatics/btu726. http://bioinformatics.oxfordjournals.org/content/31/5/682.full.pdf+html.View ArticlePubMedGoogle Scholar
 Astrakhantseva IV, Campo DS, Araujo A, Teo CG, Khudyakov Y, Kamili S. Differences in variability of hypervariable region 1 of hepatitis c virus (hcv) between acute and chronic stages of hcv infection. Silico Biol. 2011; 11(5):163–73.Google Scholar
 Quirin A, Cordón O, GuerreroBote VP, VargasQuesada B, MoyaAnegón F. A quick mstbased algorithm to obtain pathfinder networks. J Am Soc Inf Sci Technol. 2008; 59(12):1912–24.View ArticleGoogle Scholar
 Campo DS, Dimitrova Z, Yamasaki L, Skums P, Lau DT, Vaughan G, Forbi JC, Teo CG, Khudyakov Y. Nextgeneration sequencing reveals large connected networks of intrahost hcv variants. BMC Genomics. 2014; 15(Suppl 5):4.View ArticleGoogle Scholar
 Deza MM, Deza E. Encyclopedia of Distances.SpringerVerlag Berlin Heidelberg; 2009.Google Scholar