De novo assembly of highly diverse viral populations

Background: Extensive genetic diversity in viral populations within infected hosts, and the divergence of variants from existing reference genomes, impede the analysis of deep viral sequencing data. A de novo population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants is crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes, due to the presence of extensive genetic variability, contaminants, and variable sequence coverage.

Results: We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency and West Nile viral populations, representing a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus sequences and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain a high base-calling accuracy. The VICUNA program is publicly available at: http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/viral-genomics-analysis-software.

Conclusions: We developed VICUNA, a publicly available software tool that enables consensus assembly of ultra-deep sequence data derived from diverse viral populations. While VICUNA was developed for the analysis of viral populations, its application to other heterogeneous sequence data sets, such as metagenomic or tumor cell population samples, may prove beneficial in these fields of research.

Filter creation for target-alike reads. We illustrate our method by creating a filter for HIV1B; the same method can be applied to other viral genomes. 1057 available full-length HIV1B genomes were obtained from the LANL database (http://www.hiv.lanl.gov/), representing a wide range of genome diversity. These sequences were directly aligned using MUSCLE [1], resulting in a multiple sequence alignment (MSA) of length 21679. This is over twice the length of the standard HIV genome (∼10 kbp). 17 sequences that created single large insertions in the alignment were removed, since they were likely misassembled. The remaining sequences were re-aligned, resulting in an alignment with a more reasonable length of 14232. We can further remove spurious genomes and re-align the remaining ones until the alignment is satisfactory. The final alignment serves as the filter. Note that creating an MSA for a large number of sequences is compute-intensive, and accuracy decreases as the number of sequences increases. However, we do not require the filter to be free of misassembled sequences, nor do we require it to include all previously assembled genomes.
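As a rough illustration of the pruning step above, the sketch below (not VICUNA's implementation; the function names and the gap-support heuristic are our own) drops sequences whose longest private insertion in the alignment exceeds a threshold. In practice the survivors would then be re-aligned, e.g. with MUSCLE, and the procedure repeated until the alignment length stabilizes.

```python
# Hypothetical sketch: drop MSA sequences that introduce large private
# insertions, i.e. long runs of columns where only a few sequences have
# bases while everyone else has a gap '-'.

def private_insertion_len(seqs, idx, max_support=2):
    """Longest run of columns where sequence `idx` has a base but at most
    `max_support` sequences in total do (all others carry gaps)."""
    ncols = len(seqs[0])
    best = run = 0
    for c in range(ncols):
        support = sum(1 for s in seqs if s[c] != '-')
        if seqs[idx][c] != '-' and support <= max_support:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def drop_spurious(seqs, max_ins=50, max_support=2):
    """Keep sequences whose longest private insertion is below max_ins."""
    return [s for i, s in enumerate(seqs)
            if private_insertion_len(seqs, i, max_support) < max_ins]

msa = ["ACGT----ACGT",
       "ACGTTTTTACGT",   # carries a 4-column private insertion
       "ACGT----ACGT"]
filtered = drop_spurious(msa, max_ins=3)
```

The `max_ins` and `max_support` thresholds are illustrative; the paper removes sequences creating "single large insertions" without specifying a cutoff.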
Profiling. The MSA filter renders each genome equivalent in length by introducing gaps in the alignment. We divide the MSA into bins, each specified by a 2-tuple b_i = (s_i, e_i), where s_i (resp. e_i) denotes the start (resp. end) position of the i-th bin on the MSA (0 ≤ i < n_b, where n_b is a user-specified parameter). To calculate b_i, we first identify the longest genome G and assign |G|/n_b bases to each bin. If G contains no gaps, then b_i = (|G|·i/n_b, |G|·(i+1)/n_b) for 0 ≤ i ≤ n_b − 1; otherwise, b_i is adjusted to include gaps. b_i determines the subsequence of each genome that belongs to the i-th bin, from which the k-spectrum is then calculated. To account for the case that a read may overlap two adjacent bins, we include in the k-spectrum of b_i any kmer that overlaps position s_i. In addition, low-frequency kmers are removed from consideration.
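The binning step above can be sketched as follows. The helper names are hypothetical, and the gap handling is simplified relative to the description: bin boundaries follow the longest genome's ungapped coordinates, and kmers overlapping a bin's start position are also included in that bin.

```python
# Sketch (assumed interfaces, not VICUNA's): compute MSA bin boundaries
# from the longest genome, then collect a per-bin k-spectrum.

def bin_boundaries(longest_aligned, n_b):
    """Return (start, end) MSA coordinates so that each bin holds roughly
    |G|/n_b bases of the longest genome G, skipping its gap columns."""
    bases = [c for c, ch in enumerate(longest_aligned) if ch != '-']
    G = len(bases)
    bins = []
    for i in range(n_b):
        s = bases[G * i // n_b]
        e = bases[G * (i + 1) // n_b - 1] + 1
        bins.append((s, e))
    return bins

def bin_kspectra(msa, bins, k, min_count=1):
    """k-spectrum per bin; kmers overlapping a bin's start position s_i
    (starting up to k-1 columns earlier) are credited to bin i, and
    low-frequency kmers (count < min_count) are discarded."""
    spectra = [dict() for _ in bins]
    for seq in msa:
        for i, (s, e) in enumerate(bins):
            sub = seq[max(0, s - k + 1):e].replace('-', '')
            for j in range(len(sub) - k + 1):
                kmer = sub[j:j + k]
                spectra[i][kmer] = spectra[i].get(kmer, 0) + 1
    return [{km for km, c in sp.items() if c >= min_count} for sp in spectra]

bins = bin_boundaries("ACGTACGTACGT", 2)  # two bins of 6 bases each
```

Removing gaps inside each bin's slice is a simplification; the real profiling works on alignment coordinates, but the bin arithmetic is the same.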
Procedure for assigning read r to bins. 1) For every kmer x in r_k, assign it to the i-th bin if that bin's k-spectrum contains a d-neighbor of x. Note that x can be assigned to multiple bins. 2) Consider all kmers in r_k that were assigned to the i-th bin; if the total number of positions covered by these kmers in r, divided by |r|, is above a given threshold t, the i-th bin is held as a candidate to which r may be assigned. To account for the case when r spans two bins, we also consider kmers in r_k that were assigned to adjacent bins. 3) Consider the paired reads (r, r′): assign both concurrently to bins that obey the maximum distance constraint of the paired-read library size. Otherwise, we use a more stringent threshold t′ (> t) to assign them individually.

Supplementary Figure 2: An example of contig construction. The k-spectra (k = 4) of reads r_1, r_2, and r_3 are computed and hashed to an integral space. r_1 and r_2 share a common min hash value of 7, and hence can be clustered and aligned. When we further use a gapped seed, 101011, where a '0' denotes an ignored position, to generate gapped 4-mers of r_2 and r_3, a common 4-mer "TGTA" can be identified, leading to the clustering of r_2 and r_3. In both cases, the 4-mers that lead to the clustering are underlined. Both techniques tolerate base differences (colored red) that may be due to sequencing error or true variation. Note that the min hash technique illustrated in this example is a simplified version of the one used in our algorithm.

Supplementary Figure 4: Alignment of two contigs 1 and 2, represented as long lines. Common kmers between them are denoted as rectangles (top panel); these kmers are extended to form maximal common substrings (middle panel), which are forced to be aligned to each other. Inter-alignment regions may result from length polymorphisms and sequencing errors (insertions, deletions and substitutions); these regions are further aligned using the Needleman-Wunsch algorithm (bottom panel).
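The two clustering techniques illustrated in Supplementary Figure 2 can be sketched as below. The reads and helper names are our own (not those of the figure), Python's built-in hash stands in for the real hash function, and the spaced seed 101011 is taken from the caption.

```python
# Sketch of min-hash clustering and spaced-seed (gapped kmer) matching.
# Reads sharing a min hash value over their k-spectra are candidate
# cluster-mates; a spaced seed ('1' = keep, '0' = ignore) extracts gapped
# kmers that tolerate mismatches at the ignored positions.

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def min_hash(read, k=4):
    """Minimum hash value over the read's k-spectrum; reads sharing this
    value can be clustered into the same initial contig."""
    return min(hash(km) for km in kmers(read, k))

def gapped_kmers(read, seed="101011"):
    """Apply the spaced seed at every offset; bases under '0' positions
    are dropped, so a difference there does not break the match."""
    out = set()
    for i in range(len(read) - len(seed) + 1):
        window = read[i:i + len(seed)]
        out.add(''.join(b for b, s in zip(window, seed) if s == '1'))
    return out

# Two reads that disagree at seed-ignored positions still share gapped 4-mers:
r2, r3 = "CATGTTAGCA", "ACATGATAGC"
shared = gapped_kmers(r2) & gapped_kmers(r3)
```

Note that Python's `hash` is randomized per interpreter run; a production implementation would use a fixed hash family, as the min-hash guarantee depends on all reads being hashed identically.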

Supplementary Algorithm 1 Contig construction via min hash and spaced-seed.
Require: R = {r_1, r_2, . . . , r_n}, spaced-seed s
 1: For each read r_i ∈ R, generate two min hash values, one each for its forward and reverse complementary strands
 2: Cluster reads that share common min hash values to form initial contigs
 3: D ← ∅
 4: for each r_i ∈ R do
 5:   for j = 0 → |r_i| − |s| + 1 do
 7:     x ← apply s to r_i[j, j + |s| − 1]
 8:   end for
10:   if ∃ S ∈ S_i such that S.key = S′.key, where S′ ∈ D then
11:     Let r′ denote the read to which S′.key belongs
12:     Identify the two contigs C_i (may not exist) and C′, where r_i ∈ C_i and r′ ∈ C′
13:     if neither C_i nor C′ contains both r_i and r′ then
14:       Add r_i to C′

    Identify the first unprocessed contig in the list; let it be C
 7: Identify neighbors of C using M_rc
 8: Merge C with its neighbors and flag all contigs involved as processed

Require: an input contig C, parameters max_d, max_rt, min_ol
 1: Initialize contig list C† to be empty
 2: repeat
 3:   C_cur ← C
 4:   Generate consensus for C_cur
 5:   Initialize contig C_rem ← ∅
 6:   repeat
 7:     for each read r in C do
 8:       Measure the distance d between r and C
 9:       if d > max_d then
10:         C_cur = C_cur \ {r} and update consensus
11:   until no change was applied to C
15:   Add C_cur to C†
16:   C_cur ← C_rem
17: until C_cur = ∅
18: for each contig C ∈ C† do
19:   Generate the layout (r_1, r_2, . . . , r_|C|) of C
20:   Calculate n_b/n_a for each read in the layout
21:   Split C at r_i when either the overlap between r_i and r_{i−1} is < min_ol or n_b/n_a > max_rt for r_{i−1}
22:   Replace C with the resulting contigs if split occurred
23: end for
† C† stores the resulting list of contigs.
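The read-pruning inner loop (steps 7–10) can be sketched as follows, under the simplifying assumptions that all reads in a contig are pre-aligned to equal length and that distance is a plain mismatch count. This is an illustration, not VICUNA's implementation.

```python
# Hypothetical sketch of consensus-based read pruning: drop reads whose
# mismatch distance to the current column-majority consensus exceeds
# max_d, recompute the consensus, and repeat until the read set is stable.
from collections import Counter

def consensus(reads):
    """Column-wise majority base over equal-length aligned reads."""
    return ''.join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def prune(reads, max_d):
    """Iterate pruning until no read is farther than max_d mismatches
    from the consensus; return the surviving reads and the consensus."""
    while True:
        cons = consensus(reads)
        kept = [r for r in reads
                if sum(a != b for a, b in zip(r, cons)) <= max_d]
        if len(kept) == len(reads):
            return kept, cons
        reads = kept

kept, cons = prune(["ACGT", "ACGT", "ACGA", "TTTT"], max_d=1)
```

A real contig layout would track per-read offsets and gaps; the fixed-length assumption here only keeps the loop structure visible.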

Supplementary Algorithm 4 Contig extension.
Require: an input vector of contigs C.
 1: Sort C in order of decreasing length
 2: Generate a 2-tuple (id(r), id(C)) for each read r contained in contig C; store the results in M_rc
 3: while more contigs remain to be processed do
 4:   Select target contig C_l ← the first element of C
 5:   Get neighbors N of C_l via M_rc
 6:   Sort N in increasing order of the number of paired reads shared with C_l
 7:   Compute delegates for C_l and each contig in N
 8:   while N is not empty do
 9:     C_r ← the last element of N
10:     Compare delegate dg_l of C_l with dg_r (Algorithm 5)
11:     if a significant prefix-suffix alignment is identified then
12:       Merge contig C_l into C_r and update dg_r
13:       Update N to include neighbors of C_r and calculate delegates for newly included contigs
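The prefix-suffix test that drives the merge in steps 11–12 can be sketched as below. The overlap criterion used here (a minimum length plus a mismatch budget) is a simplified stand-in for the significance test of Algorithm 5, and the function names are our own.

```python
# Hypothetical sketch of prefix-suffix contig merging: two contigs merge
# when a sufficiently long, high-identity suffix of one matches a prefix
# of the other.

def best_overlap(left, right, min_len=5, max_mismatch=1):
    """Longest suffix of `left` matching a prefix of `right` with at most
    max_mismatch substitutions; returns the overlap length, or 0."""
    for olen in range(min(len(left), len(right)), min_len - 1, -1):
        mism = sum(a != b for a, b in zip(left[-olen:], right[:olen]))
        if mism <= max_mismatch:
            return olen
    return 0

def merge(left, right, min_len=5, max_mismatch=1):
    """Merge the contigs if a significant overlap exists, else None."""
    olen = best_overlap(left, right, min_len, max_mismatch)
    return left + right[olen:] if olen else None

merged = merge("AAACGTACGT", "CGTACGTTTT")
```

A delegate in the algorithm above summarizes a whole contig so that this comparison stays cheap; the sketch compares raw sequences for clarity.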