Sample preparation and nanopore sequencing
The clinical samples from which genomes were extracted were sourced from the Hepatitis C Incidence and Transmission Studies in prisons and in the community (HITS-p and HITS-c) which recruited HCV seronegative and RNA negative people who inject drugs in New South Wales, Australia between 2005 and 2014. The details of this cohort are published elsewhere [22, 23] and de-identified and stored plasma / serum samples from these cohorts were used for RNA extraction. Full-length HCV amplicons were generated by reverse transcription and a nested PCR was conducted as described previously by the authors [14]. In brief, 280 μl of patient plasma was used to extract viral RNA with the QIAmp Viral RNA mini kit according to the manufacturers’ instructions (Qiagen, Chadstone Centre, Vic, Australia, Catalogue number: 52906). Reverse transcription of viral RNA was carried out using SuperScript III First-Strand Synthesis System (Catalogue number: 180851, Life Technologies, Australia) and a pan-genotype primer (oligo dA20). The nested PCR was completed with a combination of genotype specific and universal HCV primers using Takara LA taq DNA polymerase (Catalogue number: RR002B, Scientifix life, Australia) [14]. Prior to sequencing, the final products were size selected with Monarch DNA Gel Extraction Kit (catalogue number: T1020S, New England Biolabs, USA) to pick out the band of interest at approximately 10 kb by gel extraction on an 0.8% agarose gel, and subsequently purified with magnetic beads (Agencourt AMPure XP, Beckman Coulter, USA, Cat: A63881) according to the manufacturer’s instructions. Nanopore sequencing was carried out according to the manufacturer’s protocols at the Kinghorn Centre for Clinical Genomics (a licenced ONT service provider), Garvan Institute of Medical Research in Sydney, Australia with an ONT MinION or GridION sequencer on a FLO-MIN107 v9.5 flow cell. Barcoding of PCR products was performed using one of two methods: ligation barcoding where oligonucleotide barcodes were attached to the nested PCR product via ligation (ONT ligation sequencing kit SQK-LSK109); and PCR barcoding where adapters were incorporated into the PCR product during the final round of nested PCR, which in turn was used to attach a barcode by an additional PCR. In the case of PCR barcoding, the products were size selected and purified for a second time prior to sequencing. Following sequencing, the signal level data were de-multiplexed with Guppy (version 2.3.5 and 3.0.3, http://staff.aist.go.jp/yutaka.ueno/guppy/) and further processed with the Nanopolish algorithm (version 0.11.1, https://github.com/jts/nanopolish). The de-multiplexed, cleaned reads were then used for further analysis.
Accuracy of nanopore reads in generating HCV consensus sequences
Full genome amplicons generated from the sera of four HCV-infected patients (subtype 1a) were sequenced with Nanopore technology as described above with ligation barcoding. The de-multiplexed data were aligned using GraphMap aligner (v0.5.2) [24] or Minimap2 (2.14-r921-dirty) aligner [25]. In addition, these subjects had the full genome of the virus sequenced on the Illumina MiSeq platform, as previously described [12]. A subtype-specific reference sequence was used to align the Illumina reads with Geneious Prime version 2020.0.5 (https://www.geneious.com) using the ‘map to reference’ option, with the Geneious aligner (settings: medium sensitivity / fast, 5 iterations). Two rounds of alignment were completed. The consensus from one round was used as the reference for the next round to minimize the bias from the non-autologous reference. The consensus generated from nanopore reads were compared against the Illumina-generated consensus quantitatively by counting pairwise differences. To determine the minimum coverage required for an accurate consensus sequence, different numbers of reads in multiples of 100 were randomly picked from each de-multiplexed sequence set for each patient and were aligned using a non-autologous subtype specific reference. This was compared to the nanopore consensus generated from all sequences and the pairwise differences were plotted and averaged across the four subjects.
Sensitivity of nanopore sequencing to recover variants in a mix of sequences
Estimation of the sensitivity of nanopore sequencing to recognize low frequency variants in a relatively homogenous sequence mix was accomplished by two simulation experiments in which six HCV Envelope sequence mixtures (E1E2, length: 1800 nt. as inserts of a plasmid which was subsequently cloned and extracted) of the same subtype (1a or 1b) were combined in varied proportions to generate 15 different sample mixtures. The protocol for this step is provided in Supplementary files 1 and 2. Each HCV insert in a plasmid originated from a different patient, other than two obtained at well-separated timepoints from the same patient. The proportions of each of the 6 plasmids in a mixture varied between 0.1–93% across the 15 mixes with a uniform representation across the spectrum of prevalence. All sample mixtures were nanopore sequenced with ligation barcoding, and the de-multiplexed read outputs were aligned against each of the six reference sequences. The read count for each alignment was considered as a proxy measure of prevalence of the variant in the mixture. The consensus sequence generated per plasmid alignment was compared to the original plasmid sequences to quantify any pairwise differences. The diversity within such an alignment (attributable to technical errors of nanopore reads) was quantified and averaged across all alignments to identify the cut-off for differentiation of technical errors of nanopore sequencing from low prevalence single nucleotide polymorphisms (SNPs).
Scale up of ONT sequencing to assess cost effectiveness
Fifty-two HCV amplicon mixes, each from a different patient sample were PCR-barcoded and pooled in a single nanopore sequencing run on a GridION platform. For this step, DNA amplicons were generated from preserved RNA from a HCV sequencing study published previously [12]. The reads were then de-multiplexed as described above and consensus sequences were generated to compare with Illumina consensus sequences for the same subjects, provided the minimum required coverage per position was met (> 300 per nt position). The cost of nanopore sequencing (plus library preparation and service charges) per patient sample were compared with that of Illumina sequencing (sequencing on a MiSeq platform with Nextera XT barcodes per sample).
Designing a novel bioinformatic pipeline to differentiate within-host variants
The newly designed bioinformatic tool, referred to as nanopore quasi-species tool (Nano-Q) performed the following steps serially: firstly, the pipeline arranged reads according to length and then selected all reads above a user defined cut-off length; secondly, each read was cleaned using the consensus sequence as a guide; and finally, a hierarchical clustering algorithm was used to identify within-host variants. The Nano-Q tool has been implemented in Python (3.7.2) using packages implemented in Biopython [26], Scipy [27, 28], Numpy [29], Matplotlib [30], Pysam (https://github.com/pysam-developers/pysam) and is available for download at https://github.com/PrestonLeung/ONT-Tool. This fully automated pipeline has several user-defined variables which allows a conservative or a liberal approach to characterizing within host variants. In the latter approach more low abundance variants are identified but some of these may be spurious variants given the error rate of nanopore sequences. A brief summary of the bioinformatic workflow is given below. An example of an output from the Nano-Q tool is given in supplementary files 3, 4, 5, 6.
Pre-preparation
The de-multiplexed fastq files for each subject were first aligned to a subtype specific generic HCV reference sequence using Minimap2 [31] with parameters -ax map-ont --MD to produce a Sequence Alignment Map (sam file). Using samtools view, (version 0.1.19) this was converted to binary format (bam file) with parameter -bSF 2304, filtering out any secondary and supplementary alignments. The bam file was subsequently sorted and indexed using samtools sort and samtools index options. The final bam file was used as the input for Nano-Q.
Nano-Q tool
The Nano-Q tool adopts a conservative approach to clean nanopore reads if they meet a user-defined minimum length. Shorter reads are discarded. Using the consensus sequence as a guide, Nano-Q identified and converted base mismatches of each nanopore read to that of the consensus sequence if the quality score was below a user defined cut-off. Those above the quality score cut-off were retained as a true SNP. Similarly, as the HCV open reading frame (ORF) is translated as a single polyprotein without intervening stop codons, any indels and stop codons present in the reads were considered as technical errors and removed. In the case of stop codons, the violating codon was replaced by that of the consensus. A pairwise calculation of Hamming distances was then performed between every cleaned read [N(N-1))/2 total calculations where N is the number of reads]. These values were stored into a distance matrix and hierarchical clustering (implemented in Scipy package 1.2.1 [27, 28]) was performed to search for clusters of reads within a user defined Hamming distance threshold. This threshold was calibrated by the readouts from the HCV plasmid mixture experiments. For plasmids with an actual 5–15% pairwise difference (equivalent to between-host HCV variants), a Hamming distance cut-off of 160–180 accurately separated them into clusters where the consensus of each cluster was a near-accurate reconstruction of each plasmid. For the pair of plasmids originating from the same patient (i.e. true within-host variants with less than 5% actual pairwise difference) a Hamming distance cut off of 80–96 allowed adequate resolution to separate these clusters. A higher cut off merged these populations as a single cluster. Each plasmid has an HCV Envelope insert of approximately 1800nts; hence by linear extrapolation it is recommended to use a Hamming distance of 800–900 to differentiate between within-subject full-length variants (from a non-barcoded mix of reads originating from different subjects) and 400–480 to differentiate within-subject full-length variants (from barcoded and demultiplexed reads originating from a single host). In the next step the clusters were filtered by their size to select the largest clusters. As the final step, for each cluster, a consensus sequence was generated (a within-host variant) and these were merged if identical.
Using a lower Hamming distance threshold resulted in more read clusters, but each with a fewer number of reads. Given the high error rate in nanopore reads (up 10% compared to < 1% with Illumina technology) having many clusters with a only very few reads increased the risk of generating spurious variants. Therefore, the tool allows the user to define a minimum cluster size and those with fewer reads to be discarded. This approach allowed filtering of inappropriately corrected reads in the above steps, as these are unlikely to have a similar partner to cluster with. The number of reads within each “accepted” cluster was normalised against the total number of reads from all accepted clusters to estimate the frequency of occurrence of the viral variant within the sample. The final output of the algorithm was a sequence file in FASTA format containing all consensus sequences of variants generated from clusters, and the header of each sequence indicated the relative abundance of that variant as a fraction between 0 and 1.
Nano-Q is currently executable on Linux systems (Desktop: Ubuntu 14.04, 32 GB RAM, and an Intel Core i7–4890 3.60 Ghz, Internal Server: RedHat 6.9, 529 GB RAM and 4 × Intel Core Xeon E5-4650L 2.60) on command prompt with nine user-defined variables and three optional parameters (full parameter description and instructions available at https://github.com/PrestonLeung/ONT-Tool).
An example of the command line is shown below;
python indelRemover003H.py -b ../example.sorted.bam -l 8500 -nr 1 -q 5 -j 50 -c 1 -ht 400 -mc 30.
-l: length cut-off, selecting a shorter cut-off will increase the number of eligible reads and the computational requirements will also increase exponentially.
-nr: number of references for the alignment (usually one).
-q: threshold for base quality score for cleaning reads. Any base mismatch below this threshold will be considered an error and corrected to that of the consensus while those with a quality score above this value will be retained as a true SNP.
-c: starting codon (in the reference) for eligible reads. Any read without this position will be discarded. It is recommended to visually inspect the .bam file to see the first codon common to a majority of reads and select this position (number according to the reference) to make maximum use of all reads.
-ht: Hamming distance cut-off where all reads within this value will fall into a single cluster. We recommend using 400–480 to differentiate within host full-length HCV variants based on the in silico clonal experiments. Using a smaller threshold will increase the number of clusters (each with fewer reads) while larger values will reduce the number of clusters and hence the resolution to separate low frequency variants.
-mc: minimum acceptable number of reads per cluster. Only clusters with a read count above this threshold will be retained as a true cluster to generate a consensus sequence (a within host variant). If this number is lowered the number of clusters will increase and so would the estimated within host variants. However, as each cluster would have fewer reads the risk of generating spurious variants is higher as each nanopore read has a 10–15% error rate.
-l, −ht and -mc parameters significantly influence the number of low frequency variants detected. Having a larger number of reads (filtered by a shorter length) will increase the accuracy of detecting true low frequency variants. Lowering -mc or -ht will increase the number of low frequency variants detected but will have a minimum impact on the frequency of major variants if the total number of reads are high (e.g. in a dataset where the total number of eligible reads are 10,000 and a major variant is present as a cluster of 3000 reads, if -mc is set at 30, the lowest possible frequency of a minor variant will have an abundance of 0.3%. If this threshold is lowered to 20, one or more variants will appear each with an abundance of 0.2–0.3%. This is unlikely to have a significant impact on the abundance estimate of the major variant.