Viral borne diseases exert a significant impact on human health with millions of individuals infected yearly and diseases such as HIV/AIDS ranking as a leading cause of death worldwide (
http://www.who.int/mediacentre/factsheets/fs310/en/index.html). Critical to the design of effective vaccines and therapeutics to combat this burden is a comprehensive map of the genetic composition of viral populations and the characterization of the selective pressures that shape these populations. Compiling such a map is a challenge: in viral infections the low fidelity of the genome replication process and various evolutionary pressures can result in a single infected host harboring a heterogeneous population of genetic variants
. Previous studies
[2–9] have shown that many viruses including Dengue, HCV, HIV, Influenza, Polio and West Nile all maintain diverse populations within a single host. This genetic diversity means that the population may already contain variants that are advantageous in the face of challenges such as host immune responses and drug treatments. As such, understanding the extent and composition of intra-host population diversity, even at low frequencies, can be very important in evaluating disease progression, transmission, and response to changes in therapy.
However, identifying intra-host variation depends on an alignment of the sequence data. In cases where the reads cover identical sequences, multiple alignment can be used to solve this problem
, but this is currently not feasible for whole genome sequences as no platform exists to sequence whole genomes as single reads at high throughput. Alternatively, the sequencing data can be aligned to a reference, as is commonly done for variant detection in human and other organisms
[11, 12]. However, the use of a reference genome that is too genetically distant from the sample population will yield inaccurate read alignments and, as previously reported, substantial data loss; both factors decrease the ability to detect biological variants
[13, 14]. Because viral consensus can vary substantially between patients it may be difficult to find an existing reference that allows unbiased and complete alignment of reads
[2, 15]. One solution to this problem is to start the analysis by de novo assembly of each patient sample, allowing use of the patient consensus as the reference for variant detection
. Since the sample consensus will be near the centroid of the intra-host variation, it should be optimal for alignment of all sequence data to a single reference.
Further, the consensus sequence itself is of value. Notably, the majority of publicly available genomic sequences were captured using bulk Sanger sequencing strategies and as such a single consensus assembly is the only data available to compare against to derive biological insights. A consensus serves as a single datum that represents the entire underlying population or some subset of the population and thereby enables the identification of dominant genetic mutations that vary between two populations or subsets of the same population
[2, 15, 16]. Lastly, for samples infected with unknown viruses a reference guided mapping strategy will not be applicable and a de novo approach is required
[2, 17, 18].
There are two major frameworks for de novo genome assembly. Overlap-layout-consensus based methods
[19–21] first identify reads that share good suffix-prefix alignment. This operation divides the input reads into disjoint sets, termed contigs. Then, multiple sequence alignment is performed for each contig to derive the consensus sequence, an approximation to a target genome fragment. The relative positions and orientations of these contigs are estimated by paired reads that land in different contigs. The de Bruijn graph based methods
[13, 22–27] first decompose input reads into kmers, denoted as vertices, then create a directed edge between any pair of vertices if the last (k−1) bases of the source vertex is identical to the first (k−1) bases of the target vertex. The graph is then simplified by chain compaction to shorten paths that have a unique entry and exit, and edited to remove small tips and bubbles that are likely attributed to sequencing errors. Finally, contigs are generated and oriented during graph traversal, guided by paired reads and coverage information.
Assembly algorithms devised specifically for the sequencing of less diverse haploid and diploid genomes by short reads tend to fare poorly on data derived from variant populations such as viruses, both because of the difficulty of separating continuous variation from error, and because of other process-related challenges
. In addition, sample preps of RNA viral genomes currently require reverse transcription and usually require amplification. These protocols tend to result in highly variable coverage along the length of the genome and may introduce large amounts of contamination from other RNA species present in the starting material. These artifacts of the process tend to further confuse existing assembly algorithms, which rely on depth of coverage to both indicate copy number of sequences and distinguish errors from true variants.
Recent work utilizing an overlap-layout-consensus strategy has shown that it is possible to generate high quality de novo assemblies from relatively deep coverage data produced by massively parallel pyrosequencing of diverse populations
. However, this method adapts poorly to other platforms such as the Illumina and Ion Torrent, due to a much larger amount of data produced, which increases both the computational complexity, and the difficulty of merging divergent genotypes and handling process error. An alternative strategy was recently reported by Iqbal et al. These authors demonstrated de novo assembly of diploid genomes or a population consisting of a small number of eukaryotic genomes using a de Bruijn graph strategy, but the amount of diversity inherent in viral populations is beyond the target range for this algorithm.
Here we present VICUNA, a de novo assembler of very high but variable coverage short read data from a population of diverse but non-repetitive genomes. VICUNA is an overlap-layout-consensus based assembly algorithm. Unlike assemblers optimized for large repetitive genomes, VICUNA aggressively merges similar sequences, and has the capacity to retain low frequency single nucleotide and length polymorphisms. We validated VICUNA on 12 viral population samples. These 12 samples were obtained from patients infected with Dengue Virus (DENV), Human Immunodeficiency Virus (HIV) and West Nile Virus (WNV), which represent a spectrum of intra-host population variation. VICUNA recovers the full target regions with high fidelity in all samples. The algorithm captures low frequency non-dominant length variants, specified by a tunable threshold. In a handful of these samples, we recovered alternate consensus containing large deletions (≥ 500bp). VICUNA runs on workstations or blades, and hence is readily accessible to research labs with limited compute resources. Although our immediate target application is the sequencing of RNA viruses from infected hosts, we anticipate that our method may also have utility for a range of applications which pose similar challenges, including rRNA sequencing and whole genome metagenomic analyses.