Worldwide, almost 3% of people are infected with hepatitis C virus (HCV) [1]. Approximately 80% of HCV infections develop into the chronic state [2]. Of these, 15–30% will be diagnosed with liver fibrosis or cirrhosis, and 5% will die from cirrhosis or hepatocellular carcinoma (HCC) [3]. Globally, liver cancer is the second most common form of cancer death [2], and in the United States, occurrences are increasing at a higher rate than any other form of cancer except thyroid cancer [4]. An estimated 2.7–3.9 million Americans live with HCV infection [5]. In the United States, in 2007, the number of deaths related to HCV overtook the number of human immunodeficiency virus (HIV)-related deaths [6]. In 2012, 22,972 Americans died of liver cancer [7], 28,972 were newly diagnosed [7], and HCV-related deaths surpassed all 60 other nationally notifiable diseases combined [8].
Like HIV, HCV is primarily transmitted through parenteral exposures. In the 1960s, 1970s, and 1980s, before the virus was discovered and before HCV screening of the blood supply became standard practice, HCV infection was expanding worldwide. In the general US population, HCV prevalence is particularly high among individuals born between 1945 and 1965 [9,10,11]. However, reported HCV infections in the United States increased by 151% between 2010 and 2013, predominantly in non-urban areas, corresponding to the surge in opioid addiction and injection drug use (IDU) among individuals born after 1986 [12]. HCV is the most common infection transmitted through IDU, which accounts for the largest proportion of newly acquired HCV infections [13,14,15,16,17]. During a recent HIV outbreak investigation in Indiana, 92% of the 181 initially identified HIV-infected individuals were found to be coinfected with HCV [18]. Successful HCV surveillance programs are fundamental for the implementation of public health interventions aimed at interrupting HCV transmission.
HCV exists as a population of numerous variants in each infected individual [19]. It has been observed that minority variants in the source are often those responsible for transmission [20, 21], a situation that precludes the use of a single sequence per individual because many such transmissions would be missed [22]. Computational analysis of NGS data for the detection of HCV transmission is a complex process, requiring significant expertise in the application of phylogenetic methods and in the interpretation of phylogenetic data within an epidemiological context.
We previously developed and validated a methodology for the rapid, accurate, and cost-effective identification of transmission clusters from large samples of intra-host HCV variants obtained by next-generation sequencing (NGS), based on a genetic distance threshold derived from end-point limiting-dilution (EPLD) PCR data and validated on NGS 454Jr data. When applied to the Hypervariable Region 1 (HVR1), the method discriminated clusters of related samples from unrelated samples with 100% sensitivity and 100% specificity [23]. Calculating the distances between all sequences in a set of samples is an extremely computationally demanding task, so we also evaluated a set of filters that can eliminate sample-pair comparisons and greatly reduce the computational cost [24]. The Hamming radius filter was found to perform best individually, accurately filtering up to 91% of all pairwise sequence comparisons from consideration [24]. Here, we validate this threshold on the Illumina MiSeq platform, employing a modified Hamming radius filter, a cloud-based distributed infrastructure, and other computational techniques to accommodate the scale and characteristics of MiSeq data.
We introduce Global Hepatitis Outbreak and Surveillance Technology (GHOST), a cloud-based system composed of a set of bioinformatics tools for controlling the quality of NGS data and automatically extracting information on transmission clusters. GHOST integrates bioinformatics and information technologies and enables all users, regardless of their computational expertise, to conduct independent, accurate, and reproducible HCV molecular surveillance. We have adapted GHOST to the Illumina platform. Here, we describe several of the implemented algorithms and explore the performance of the system in the analysis of known genetically related and unrelated HCV strains. GHOST automatically derives relevant public health information from HCV genetic data in a form suitable for guiding effective intervention measures. Access to GHOST is available to all authenticated users for conducting accurate outbreak investigations and molecular surveillance.
Implementation
Sequencing platform
The original transmission detection algorithms [24] were designed for the 454Jr platform and have now been adapted to the Illumina platform, particularly the MiSeq. The MiSeq's deeper sequencing capability provides a more comprehensive snapshot of the population spectrum per sample and a cost-effective way to achieve greater sample pooling per run. Whereas the 454 error correction method addressed substitutions, insertions, and deletions, including those associated with homopolymer errors, GHOST's MiSeq error correction focuses on substitutions.
GHOST hepatitis C virus (HCV) sequencing protocol
The GHOST HCV sequencing protocol uses a novel amplicon-based sequencing method that targets the HVR1 of the HCV genome. This region was chosen for its relatively high variability, allowing for fine-grained assessment of evolutionary distances arising from recent transmission events. In order to sequence several samples per run and reduce costs, it utilizes a hierarchical multiplexing scheme with an additional pair of identifiers that persist after standard Illumina demultiplexing to minimize intra-run mis-assignments. GHOST processes Illumina MiSeq 300 bp paired-end reads with full overlap of the forward and reverse reads to redundantly sequence each amplicon, and the redundant information is used to reduce sequencing errors. A detailed HCV MiSeq sequencing protocol is available on the GHOST website to authenticated users or upon request.
GHOST web interface
GHOST users are typically local public health researchers, outbreak investigators, and those involved in molecular surveillance. The first step for GHOST users is to submit a user account request. The GHOST administrators authenticate and validate the request to ensure that the user has secure access to the system. After access is granted, the user can choose either of two main web-based tasks: Quality Control or Analysis (Fig. 1). Each workflow is described below.
GHOST quality control tasks
The Quality Control (QC) task is designed for the upload of sequence data directly after Illumina sequencing and demultiplexing. It takes gzip-compressed fastq data and expects filenames that follow the conventional Illumina naming scheme for gzip-compressed fastq files. After standard demultiplexing, read pairs are filtered out if a read has more than three N's or is shorter than 185 bp. The identifiers on both the forward and reverse reads are examined, and the pair is discarded if either identifier is not an exact match to a given list of valid identifiers. Pairs containing valid identifiers are discarded if they are not constituents of the majority identifier tuple. If 25% or more of the read pairs contain valid identifiers that are not the majority tuple, the entire sample is discarded from analysis without further processing. Owing to computational limitations, a random subsample of N = 20,000 read pairs is taken by the unweighted reservoir sampling method [25, 26], and each read pair is searched for the forward and reverse primers. Primer sequences are located in each read using fuzzy matching, allowing at most two substitutions, one insertion (relative to the reference), one deletion (relative to the reference), and no more than three total errors. Read pairs in which either primer cannot be found are discarded. The primer locations are used to orient the reads into a uniform orientation. Read pairs are unified into a single error-corrected sequence using the Casper error correction method [27] with a quality threshold of 15, a k-mer length of 17, a k-mer neighborhood of 8, and a minimum match threshold of 95%. Overlap fitness is evaluated by the classical Hamming distance. The overlap with the highest ratio of correct positions to overlap length is selected, with the longest overlap preferred when more than one overlap has an equal ratio.
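The primer-search step above can be sketched as follows. This is a minimal, stdlib-only illustration that enforces only the combined error bound of three; the production pipeline, built on the regex module's fuzzy matching, additionally caps each error type (substitutions ≤2, insertions ≤1, deletions ≤1). The primer and read sequences used in the usage comment are hypothetical.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def locate_primer(read, primer, max_err=3):
    """Return the start of the best primer match within max_err edits, or None."""
    best = (max_err + 1, None)
    # Allow windows one base shorter/longer than the primer to absorb indels.
    for w in (len(primer) - 1, len(primer), len(primer) + 1):
        for start in range(len(read) - w + 1):
            d = edit_distance(read[start:start + w], primer)
            if d < best[0]:
                best = (d, start)
    return best[1]

# locate_primer("TTTTACGTACGTGGGG", "ACGTACGT") finds the primer at offset 4;
# a read with no approximate match within three edits yields None.
```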
Merged sequences are discarded if a nonsense-free reading frame cannot be found. Those retained are collapsed into unique occurrences with associated frequencies, thereby reducing subsequent computation time and associated cost. Sequences are then segregated into subtypes using the blastn program included in the BLAST+ toolkit v2.3.0 [28] with an in-house curated reference database and the following adjusted parameters: minimum E-value 30.0, word size 7, gap opening penalty 2, and minimum raw gapped score 95. The total normalized bit score of each high-scoring segment is calculated with respect to the genotype and subtype of each reference sequence. The log probability of observing a bit score larger than this is calculated using Eq. 5 in Karlin and Altschul [29], and the best match is used to classify the sequence into a subtype category. Any sequence whose best score falls below the score corresponding to a log probability of −135 is discarded as non-HCV. The sequences are aligned using a hybrid strategy: a traditional multiple sequence alignment of the 1000 most frequent sequence variants is built with MAFFT v7.215 [30], and the resultant alignment is used to create a Hidden Markov Model (HMM) seed for subsequent HMM-based alignment of the remaining sequences using HMMer v3.1b1 [31]. The consensus, nucleotide diversity, and largest Levenshtein distance from the consensus (the radius) are calculated per subtype present in the sample. The QC task then writes preprocessed files in the in-house HDF5-derived GH5 format containing the haplotypes found, their associated frequencies, and other metadata (Fig. 2).
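The reading-frame filter and haplotype collapse can be illustrated with a short sketch. For brevity it checks only the three forward frames (reads are uniformly oriented by this point) and uses collections.Counter in place of the pipeline's internal representation.

```python
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}

def has_open_frame(seq):
    """True if at least one of the three forward reading frames is stop-free."""
    return any(all(seq[i:i + 3] not in STOPS
                   for i in range(f, len(seq) - 2, 3))
               for f in range(3))

def collapse(reads):
    """Discard reads lacking a nonsense-free frame, then collapse the rest
    into unique haplotypes with observed frequencies, most frequent first."""
    return Counter(r for r in reads if has_open_frame(r)).most_common()

# collapse(["ATGAAA", "ATGAAA", "ATGCCC", "TAAATAAATAA"])
# keeps the two stop-free haplotypes with counts 2 and 1 and drops the
# sequence that has a stop codon in every forward frame.
```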
GHOST analysis tasks
The Analysis task uses as input a user-defined set of cleaned GH5 files produced by the QC workflow. Currently, the Analysis task has a single module for the detection of HCV transmissions, in which genetic distances between all sample pairs are measured to determine whether any fall below the experimentally validated distance threshold [23]. The use of ultra-deep sequencing (UDS) data immensely increases the sensitivity of transmission detection but poses a considerable computational challenge: calculating the minimum distance between all sample pairs. Several techniques were employed to minimize both runtime and memory usage, the four main ones being: (i) random subsampling of the original file, as described above, (ii) a variation of the Hamming radius filter of sample-pair candidates that we term the metric filter [24], (iii) HMM-based multiple sequence alignment (MSA), and (iv) optimized distance calculation [24].
In Rytsareva et al. [24], it was shown that two single-subtype samples S1 and S2, with consensuses C1 and C2 and Hamming radii R1 and R2, cannot contain a sequence pair with a distance below the threshold T if dist(C1, C2) − (R1 + R2) > L·T, where L is the length of the sequence alignment. We made two modifications to this filter: (i) we implemented a variation employing a modified Hamming distance, termed the "corrected Hamming distance", that does not count positions with insertions or deletions as differences, and (ii) the alignment-independent Levenshtein distance was used for radii calculation.
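The filter test follows directly from this inequality and can be sketched in a few lines; the threshold value in the usage comment is illustrative, not the validated one.

```python
def metric_filter_passes(consensus_dist, r1, r2, aln_len, threshold):
    """Return True if the sample pair can be skipped entirely: by the
    triangle inequality, even the closest possible sequence pair between
    the two samples must exceed the relative distance threshold T."""
    return consensus_dist - (r1 + r2) > aln_len * threshold

# If the consensuses are 40 differences apart, the radii are 10 and 12,
# and the alignment is 300 nt with an illustrative T = 0.035:
# 40 - (10 + 12) = 18 > 300 * 0.035 = 10.5, so the pair is filtered out.
```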
For each subtype in each sample, the consensus and radius produced in the preprocessing step are used to establish the metric filter parameters, and group pairs are removed from the candidate list accordingly. This filter significantly reduces the proportion of full distance calculations performed and greatly reduces the computational cost without any loss of information. For group pairs not removed by the metric filter, alignments for the remaining distance calculations use the same HMM method described above. Corrected Hamming distances are computed with an optimized distance calculator named HDIST, an in-house algorithm designed to minimize pipeline stalls and maximize cache usage by converting sequence pairs into groups of non-overlapping 3-mers, which are then encoded as base-5 integers and used as indices into a pre-calculated look-up table. The k-mer size was tested empirically over a range of values, and 3-mers were found to maximize cache hits. Sequence pairs whose distance is below the threshold are not considered if either sequence has a frequency of one. The Analysis task outputs an intuitive transmission network graph: nodes represent input samples, and edges connect sample pairs found to have subpopulations with a distance below the threshold.
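A pure-Python sketch of the look-up-table idea behind HDIST is shown below; the production implementation is an optimized Cython routine, and the five-letter alphabet (four bases plus a gap symbol, which gives the base-5 encoding) is an assumption made here to keep the example self-contained.

```python
from itertools import product

ALPHABET = "ACGT-"
CODE = {c: i for i, c in enumerate(ALPHABET)}

def _kmer_dist(a, b):
    """Corrected Hamming distance of two 3-mers: gap positions are skipped."""
    return sum(1 for x, y in zip(a, b)
               if x != y and x != "-" and y != "-")

# Precompute distances for all 5^3 x 5^3 pairs of encoded 3-mers.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=3)]
TABLE = [[_kmer_dist(a, b) for b in KMERS] for a in KMERS]

def encode(seq):
    """Pack an aligned sequence into base-5 integers, one per 3-mer."""
    seq += "-" * (-len(seq) % 3)  # pad to a multiple of 3
    return [CODE[seq[i]] * 25 + CODE[seq[i + 1]] * 5 + CODE[seq[i + 2]]
            for i in range(0, len(seq), 3)]

def hdist(enc_a, enc_b):
    """Corrected Hamming distance of two equal-length aligned sequences,
    computed as a sum of table lookups over their 3-mer codes."""
    return sum(TABLE[x][y] for x, y in zip(enc_a, enc_b))
```

Encoding each sequence once and reusing the integer codes across all pairwise comparisons is what lets the table lookups replace per-position branching in the inner loop.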
The computational platform
The GHOST back-end and HCV transmission analyses are implemented using a combination of Python, Cython, and command-line programs. Python libraries include numpy/scipy for general computational support, biopython for sequence manipulation, regex for fuzzy regular expression matching, h5py for data storage, and networkx for storage and processing of transmission networks. Back-end execution is performed on Amazon Web Services (AWS), using the AWS Simple Storage Service (S3) for storage and the Amazon Elastic Compute Cloud (EC2) with 26 configured nodes, two acting as management nodes and 24 as compute nodes. The front-end and middle tier are a composite of technologies including HTML, D3, Javascript, Java, JSON, and XML. A set of services was developed to standardize communication from the front-end to the AWS platform: (i) the "Zuul" service is responsible for moving data into and out of S3 and communicating with AWS components using the AWS SQS API; Zuul also reports the status of task execution back to users. (ii) The "Stantz" service acts as a control point within the EC2 platform, communicating with Zuul and performing cluster management and oversight functions using Open Grid Engine and the Distributed Resource Management Application API (DRMAA) (Fig. 3).
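The final network-construction step with networkx can be sketched as follows, assuming a hypothetical helper min_dist(a, b) that returns the minimum corrected Hamming distance between the haplotype populations of two samples, normalized by alignment length.

```python
import networkx as nx

def build_network(samples, min_dist, threshold):
    """Nodes are samples; an edge links any pair whose closest haplotype
    pair falls at or below the relative distance threshold."""
    g = nx.Graph()
    g.add_nodes_from(samples)
    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            if min_dist(a, b) <= threshold:
                g.add_edge(a, b)
    return g
```

Connected components of the resulting graph then correspond to candidate transmission clusters, while isolated nodes represent samples with no detected links.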