Single-virion sequencing of lamivudine-treated HBV populations reveal population evolution dynamics and demographic history

Background Viral populations are complex, dynamic, and fast evolving. The evolution of groups of closely related viruses in a competitive environment is termed quasispecies. To fully understand the role that quasispecies play in viral evolution, characterizing the trajectories of viral genotypes in an evolving population is the key. In particular, long-range haplotype information for thousands of individual viruses is critical; yet generating this information is non-trivial. Popular deep sequencing methods generate relatively short reads that do not preserve linkage information, while third generation sequencing methods have higher error rates that make detection of low frequency mutations a bioinformatics challenge. Here we applied BAsE-Seq, an Illumina-based single-virion sequencing technology, to eight samples from four chronic hepatitis B (CHB) patients – once before antiviral treatment and once after viral rebound due to resistance. Results With single-virion sequencing, we obtained 248–8796 single-virion sequences per sample, which allowed us to find evidence for both hard and soft selective sweeps. We were able to reconstruct population demographic history that was independently verified by clinically collected data. We further verified four of the samples independently through PacBio SMRT and Illumina Pooled deep sequencing. Conclusions Overall, we showed that single-virion sequencing yields insight into viral evolution and population dynamics in an efficient and high throughput manner. We believe that single-virion sequencing is widely applicable to the study of viral evolution in the context of drug resistance and host adaptation, allows differentiation between soft or hard selective sweeps, and may be useful in the reconstruction of intra-host viral population demographic history. Electronic supplementary material The online version of this article (10.1186/s12864-017-4217-1) contains supplementary material, which is available to authorized users.


16
Picking Reference Genomes 17 There are at least 8 well-characterized HBV genotypes with up to 10% pairwise 18 sequence difference (Sunbul 2014). Rather than mapping all patient samples to a single 19 sequence and identify viral genotype post-hoc, and to include some flexibility for the 20 identification of mixed-genotype infections, one complete sequence from each of 21 genotypes A through H was compiled into a species 'pan-genome' [SI Fig 1]. Longest 22 published sequences that were the most recently published were selected. The final 23 sequences used came from HB-JI444AF (genotype A, Genbank: AP007263.1), P2-24 121214 (genotype B, Genbank: AB981583.1), P1-090725 (genotype C, Genbank: 25 AB981580.1), MK096_EP-CHB (genotype D, Genbank: KM524346.1),  (genotype E, Genbank: HE974384.1), HBV-BL592 (genotype F, Genbank: 27 AB166850.1), CLB-DonX (genotype G, Genbank: GU565217.1), and B-MHJ9014 28 (genotype H, Genbank: AB846650.1). Because the HBV genome is circular, not all 29 published sequences start and end at the same position along the genome. Published 30 sequences were reassigned base positions such that they begin and end in positions 31 complimentary to our PCR primer products (SI Figure 1). A multi-sequence alignment of 32 the 8 genotypes revealed a 6bp insertion in genotype A, a 33bp deletion in genotype D, a 33 3bp deletion in genotype E, and a 36bp insertion in genotype G. An aligned fasta file with 1 all 8 genotypes is appended. 2 3 4 5 SI Figure 1. Phylogenetic tree of eight reference genomes used in the pan-reference. 6 7 8 The pairwise sequence identity between the genotypes ranged between 85%-92%, as 9 shown below (SI Table 2). 10 11 12 SI Table 2. Pairwise sequence identity between eight reference genomes used in the pan-reference.

14
PacBio Analysis Pipeline Development 15 PacBio raw reads were first processed with the SMRT Portal in house analysis programs. 16 Because single pass reads have base error rates averaging 1.4x10 -2 /base, circular 17 consensus sequences (CCS) from each library were called with a stringent cutoff of at 18 least 10x subreads within a polymerase read and a minimum subread length of 2500bp (to 19 retrieve full functional genomes). Theoretically, this lowers base error to 2.89x10 -19 /base. 20 While the required number of passed required to call a CCS can be adjusted, we picked 21 10x to maximize quality while maintaining a minimum level of usable data (SI Figure 2

5
Bases within the CCS reads with quality scores <75 were masked as Ns so as to filter out 6 false positive SNPs, and the resulting (nearly) full viral sequences were BWA-SW 7 mapped as extremely long reads to the concatenated HBV pan-genome consisting of all 8 8 major genotypes A-H. It is worth noting that because small indel errors were still present 9 at an estimated rate of ~3/kb (SI Figure 3), small indels were not called in our analysis. 10 Segregating sites within the viral populations were called with LoFreq. 11 12 1 SI Figure 3. A screen cap of the small indel errors seen in PacBio 10x CCS reads.

3 4
BAsE-Seq Analysis Pipeline Development 5 BAsE-Seq 2×150bp reads (sequenced on a Mi-Seq machine) were processed with a 6 custom pipeline modified from a previous publication (Hong et al., 2014). To remove 7 unique barcodes and primer contamination, fastq reads were first trimmed by adaptor 8 sequences and base quality >30. Resulting fragments were filtered for read pairs with 9 both reads passing the minimum length requirement of 20bp. A subset of 1,000 trimmed 10 read pairs were mapped to each of the 8 major genotypes A-H independently. The best 11 match genotype was identified by lowest average mismatches, and all reads were BWA-12 MEM remapped to this single genotype sequence. This was done such that mapping 13 errors could be minimized as much as possible. All concordantly mapped read pairs were 14 sorted into individual sam files by their unique barcodes, identifying them as reads from 15 the same viral genome. Each sam file was duplicate-marked, realigned, recalibrated, and 16 SNPs were called with LoFreq for incorporation into the viral sequence. (At this stage, an 17 additional step can be added for mixed genotype infections. Sam files with excessive 18 mismatches can be remapped to next best match reference if necessary.) Because each 19 sam file represents data from a single viral genome, SNPs were called if coverage depth 20 >4x and 100% of reads supported the alternative allele. All bases with coverage depth 21 <4x and/or <100% of reads supporting the same allele were marked as "N". This step 22 was aimed at minimizing false positives at error prone regions. Average genome 23 coverage and read depth across all individual sam files were tested across a range for best 24 cutoff to select an optimum number of usable sequences (<50% "N"s) for subsequent 25 analysis. Viral sequences that passed the quality filters were written into an original sam 26 files as long reads that map to position 1 of the reference. LoFreq was used to call 1 segregating sites within the population. 2 3 4 LoFreq Variant Calling 5 Variants were called with LoFreq in Pooled deep sequencing and PacBio using default 6 filters. 7 In BAsE-Seq, LoFreq was employed thrice. The first time, LoFreq was used with default 8 settings on the sam file containing all reads (before assignment into individual sam files 9 by barcode). This step was found to be necessary for marking positions with strand bias 10 error that will lead to false positives in the individual genomes. The second time, during 11 the variant calling in individual genomes, coverage was often too low to accurately detect 12 strand bias and will lead to false positives. Positions prone to strand bias errors as 13 previously noted are ignored in this step. When mapped to the reference panel, fastq reads gave an average coverage depth of 42 ~23,000x within genotype C. Two regions of the genotype C sequence were expected to 43 be missing from fastq reads -bases 0-40 were not amplified during PCR, and bases 1380-44 1445 contained a known deletion. Both these regions had coverage depths below 1000x 45 (<5% average genomic coverage) (SI Figure 4). 46 1 2 SI Figure 4. Coverage (Y-axis) plot against Genotype C (X-axis) for Clone 2 Pooled deep sequencing 3 library. The first 40bp were not covered due to primer location, and a 66bp deletion is also missing 4 coverage starting at position 1380.

6
Outside of the Genotype C region, there was also a short aberrant peak in the last 300bp 7 of genotype E (SI Figure 5 top panel).

6
This region was 3.5% divergent from genotype C, and was presumed to be prone to mis-7 mapping due to high sequence similarity. Excluding 3 nucleotide differences present in 8 bases 0-40, all other 57 known SNPs were accurately called. 7 false positives SNPs were 9 also called, 4 in low coverage regions and 3 due to strand bias. We found that a SNP 10 quality cutoff of 1000, a coverage cutoff of >5% average (within the best match 11 genotype), and an allelic frequency cutoff of 1% were required to minimize false 12 positives. 13 The Clone 2 BAsE-Seq library yielded consensus sequences for 8,796 viral 14 genomes with >1500 bases (~50% of genome) covered to >4x read depth. Insertions 15 could be theoretically called for each sequence but are usually low confidence due to the 16 low coverage; therefore indels were not part of the pipeline and were not incorporated 17 into the sequences. Known deletion regions showed no coverage across all reconstructed 1 sequences, which is a possible clue for identifying novel deletions. 2 SNPs that were 2 strand bias induced false positives were subsequently removed. All 57 SNPs expected 3 were correctly identified with a SNP quality cutoff of 1000 and an allelic frequency 4 cutoff of 1%. 5 On PacBio, Clone 2 library was run on a full cell (v6). All 5,110 CCSs were 6 BWA-SW mapped as extremely long reads to the reference panel. Four CCSs (0.08%) 7 were mis-mapped onto incorrect references -two mapped to genotype E due to large 8 numbers of low quality bases (>50% of read), and the remaining two mapped to genotype 9 D due to sequence chimerism. These reads were removed. Variant calling with the same 10 filters of 1% frequency and SNP quality >1,000 similarly removed all false positives, and 11 all 57 SNPs were called. 12 We aligned full sequences (BAsE-Seq) or multi-sequence aligned CCSs (Pacbio) 13 against the reference genotype C, and constructed maximum likelihood trees. Both 14 BAsE-Seq (SI Figure 6 gave a single cluster of sequences a similar distance away from the reference. The longer 6 branches on the Pacbio tree are due to multiple alignment artifacts from deletion prone 7 Pacbio reads (SI Figure 3). 8 9 10 Clone-1:Clone-2 mixed population. Viral populations within a patient can be diverse. 11 We sequenced a 50:50 mixture of two strains Clone-1 and Clone-2 to test the sensitivity 12 of each method to a mixed population. 13 Illumina deep sequencing showed an average coverage depth of 30,000 within the 14 reference genotype C sequence. Two dips in coverage were observed where deletions are 15 present in either clone (SI Figure 8). 16 17 1 SI Figure 8. Coverage (Y-axis) plot against Genotype C (X-axis) for Clone 1:2 mix Pooled deep 2 sequencing library. The first 40bp were not covered due to primer location, and two dips are seen due to 3 deletions present on each clone -a 56bp deletion at position 1197 for Clone 1 and a 66bp deletion at 4 position 1380 for Clone 2.

6
In Pooled deep sequencing libraries, SNPs were called for all bases covered to 7 >5% average genomic coverage within the Genotype C sequence. Of the 65 SNPs 8 expected, all were identified post coverage, quality, frequency, and strand bias filtering. 9 2,151 BAsE-Seq sequences passed filtering (>4x coverage depth at >1,500 bases). 10 Of the 65 SNPs expected, all were identified post coverage, quality, frequency, and 11 strand bias filtering. 12 6,299 10x CCSs from a single PacBio run were mapped as extremely long reads 13 to the reference panel. 6,295 reads (>99.9%) mapped to the genotype C reference. 14 Applying the same filters of 1% frequency and SNP quality >1,000, all 65 expected SNPs 15 were identified. 16 All 17 known SNP differences between the clones registered at allele frequencies 17 of 0.4-0.6 (SI Figure 9). 18 19 1 SI Figure 9. Allele frequency (Y-axis) plotted against Genotype C (X-axis) for Clone 1:2 mixed library. showing shorter branches as compared to Pacbio (SI Figure 11). 7 8 SI Figure 11. Clone1:2 mixed library. Phylogenetic tree of full-length single virion sequences constructed 9 from BAsE-Seq rooted at reference Genotype C (green branch). Each individual branch is labeled at the 10 tips; only black solid lines represent branch length.

12 13
Clinical Sample Viral Populations Viral genotype holds clinical implications. We 1 tested the ability of each platform to accurately call and identify the dominant genotype 2 in a patient, as well as mixed co-infections, if any. For samples Clone2, Clone1:Clone2 3 mix, P1.1, P1.2, P2.1, and P2.2, all three platforms identified the same dominant 4 genotype for each sample (SI Figure 12). 5 6 7 SI Figure

12
Pooled deep sequencing short reads had higher mis-mapping rates that led to substantial 13 coverage outside of the true genotype (up to 43% such as in P1.2), whereas long reads in 14 BAsE-Seq and Pacbio gave nearly no mis-mapping. The low frequencies observed above 15 (0%-0.3% for BAsE-Seq, 0.07%-0.4% ) came from spike in controls. 16 17 Summary 18 All three methods perform well when it comes to identifying SNP polymorphisms 19 and their frequencies. With timepoint data, any of these methods can identify sharp 20 increases in variants that may be under selection, and suggest linkage between SNVs that 21 change in frequency together. However, Pooled deep sequencing does not have definitive 22 proof of exact linkage patterns, which can be limiting in more complex populations with 23 multiple adaptive haplotypes. One other limitation with Pooled deep sequencing concerns 24 the identification of mixed genotypes. If multiple genotypes are present at significant 25 percentages, coverage across the pan-genome will reveal the genotypes present, and 26 LoFreq results can be filtered accordingly, such as with Patient11.1 where both genotypes 27 are present at >30%. However, the same cannot be said of low frequency mixed 28 infections, such as the controlled spike-ins, which cannot be reliably identified through 29 Illumina mapping, but require long haplotype information such as those from BAsE-Seq 1 and PacBio single virion sequences. 2 BAsE-Seq and PacBio perform similarly in terms of the number of single virion 3 sequences obtainable from each library. The number of sequences obtained from every 4 library made is listed below (SI Figure 13). There is substantial variation from library to 5 library within each method, largely determined by patient sample quality, viral count, and 6 run quality. 7 8 9 SI Figure 13. Number of haplotypes retrieved from each library, separated by sequencing platform, and 10 sorted within each by count.

12
One limitation of BAsE-Seq is the absolute requirement of a reference genome. 13 Whereas PacBio reads are more prone to small indel errors in homopolymer runs. Both 14 methods run into some lengths limits. BAsE-Seq relies on a single PCR amplicon that 15 can be barcoded and amplified. This becomes increasingly difficult to achieve beyond 16 5kb. Longer amplicons may also introduce more bias into the circularization step. PacBio 17 relies on multiple reads of the same molecule. Given that the upper limit for a PacBio 18 read was ~60kb at the time of this manuscript, and keeping in mind that the majority of 19 the reads from a run will not be quite so long, the achievable haplotype length may not be 20 significantly longer than Base-Seq. HBV genome is short and proves the value of such 21 studies, and as technologies improve over time, larger and larger viral genomes can 22 hopefully be explored. 23 Indels remain a tricky issue. Both BAsE-Seq (due to low coverage for individual 24 bam files) and PacBio (due to error prone chemistry) will require specific experiments 25 tailored to the development of indel calling pipelines. This falls outside of the scope of 26 our manuscript 27