The effect of variant interference on de novo assembly for viral deep sequencing

Castro, Christina J.; Marine, Rachel L.; Ramos, Edward; Ng, Terry Fei Fan

doi:10.1186/s12864-020-06801-w

Research article
Open access
Published: 22 June 2020

The effect of variant interference on de novo assembly for viral deep sequencing

Christina J. Castro^1,2,
Rachel L. Marine¹,
Edward Ramos³ &
…
Terry Fei Fan Ng ORCID: orcid.org/0000-0002-4815-8697¹

BMC Genomics volume 21, Article number: 421 (2020) Cite this article

2578 Accesses
4 Citations
5 Altmetric
Metrics details

Abstract

Background

Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. Next-generation sequencing (NGS) approaches have surpassed Sanger for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored.

Results

Our results from > 15,000 simulated experiments showed that presence of variants can turn an assembly of one genome into tens to thousands of contigs. This “variant interference” (VI) is highly consistent and reproducible by ten commonly-used de novo assemblers, and occurs over a range of genome length, read length, and GC content. The main driver of VI is pairwise identities between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allow the “rescue” of full viral genomes from fragmented contigs.

Conclusions

These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.

Background

For many years, Sanger sequencing has been used to complement classical epidemiological and laboratory methods for investigating viral infections [1]. As technologies have evolved, the emergence of next-generation sequencing (NGS), which drastically reduced the cost per base to generate sequence data for complete viral genomes, has allowed scientists to apply viral sequencing on a grander scale [2,3,4]. Genomic sequencing is ideal for elucidating viral transmission pathways, characterizing emerging viruses, and locating genomic regions which are functionally important for evading the host immune system or antivirals [2, 5].

Genomic surveillance of viruses is particularly important in light of their rapid rate of evolution. Viruses have higher mutation rates than cellular-based taxa, with RNA viruses having mutation rates as high as 1.5 × 10^− 3 mutations per nucleotide, per genomic replication cycle [6]. Due to this high mutation rate, it is well established that most RNA viruses exist as a swarm of quasispecies, [7] with each quasispecies containing unique single nucleotide polymorphisms (SNPs). The presence of these variants plays a key role in viral adaptation.

Due to viruses’ rapid evolution, a single clinical sample often contains a mixture of many closely related viruses. Viral quasispecies are mainly derived from intra-host evolution, with RNA viruses such as poliovirus, human immunodeficiency virus (HIV), hepatitis C (HCV), influenza, dengue, and West Nile viruses maintaining diverse quasispecies populations within a host [8,9,10,11,12,13,14,15]. Conversely, the term “viral strains” often refers to different lineages of viruses found in separate hosts, or a co-infection of viruses in the same host due to multiple infection events. As a result, sequence divergence is usually higher when comparing viral strains compared to quasispecies. In this study, we use the term “variant” to encompass both quasispecies and strains regardless of how the variants originated in the biological samples.

Since many sequencing technologies produce reads that are significantly shorter than the target genome size, a process to construct contigs, scaffolds, and full-length genomes is needed. Reference-mapping and de novo assembly are the two primary bioinformatic strategies for genome assembly. Reference-mapping requires a closely-related genome as input to align reads, while de novo assembly generates contigs without the use of a reference genome. Therefore, de novo assembly is the most suitable strategy for analyzing underexplored taxa [16] or for viruses with high mutation and/or recombination rates.

The two most common graph algorithms employed by de novo assembly programs are: overlap graphs for overlap-layout-consensus (OLC) methods, and k-mer based graphs for de Bruijn graph (DBG) methods. OLC methods involve determining overlaps by performing a series of pair-wise sequence alignments. Such assemblies may be computationally expensive (especially for large datasets), and generally work better with longer reads [17, 18]. Conversely, DBG assemblers split reads into smaller k-mers, with k-mers connected when they share a common prefix and suffix of length k – 1. DBG methods are usually faster to run than OLC methods, but this strategy is known to be sensitive to repeats, sequencing error, and the presence of variants, which increase the k-mer complexity and ambiguity during sequence reconstruction [19, 20]. These challenges could lead to fragmented contigs when analyzing viral assemblies from clinical or environmental samples [21].

In this study, we first examined how often NGS and de novo assembly were applied for viral sequences deposited in the GenBank nucleotide database (www.ncbi.nlm.nih.gov/nucleotide/). Then, we investigated how the presence of variants affected assembly results - simulated and clinical NGS datasets were analyzed using multiple assembly programs to explore the effects of genome variant relatedness, read length, genome GC and genome length on the resulting contig distribution. As viruses in different taxa vary in length and GC content, these experiments demonstrate how assembly of viral variants is impacted by basic genome structure characteristics, as well as by the nucleotide similarity between variants and sequencing read length.

Results

The rise of NGS and de novo assembler use in GenBank viral sequences

GenBank viral entries from 1982 to 2019 were collected and analyzed, with extensive analyses performed to evaluate technologies and bioinformatics programs cited in records deposited between 2011 and 2019. Through 2019, there were over 2.7 million viral entries in GenBank; however, over 70% (1.9 million) do not specify a sequencing technology (Supplement Table S1) due to the looser data requirement in earlier years. When looking at recently deposited records (2014–2019), the Illumina sequencing platform was the most common NGS platform used for viral sequencing, with over a 2-fold increase over the next most popular NGS platform (Fig. 1d & e). When long sequences (≥2000 nt) are considered, NGS technologies surpassed Sanger in 2017 as the dominant strategy for sequencing, comprising 53.8% (14,653/27,217) of entries compared to 46.2% of entries (12,564/27,217) for Sanger. This trend held true in 2018 and 2019 as well (Fig. 1f and Supplement Table S2).

Hybrid sequencing approaches, where researchers use more than one sequencing technology to generate complete viral sequences, have also become more common over the past several years. The most common combination observed was 454 and Sanger (18,124 entries), likely due to the early emergence of the 454 technology compared to other NGS platforms (Fig. 1c and Supplement Table S3). However, combining Illumina with various other sequencing platforms is quite commonplace (> 19,000 entries).

De novo assembly programs (ABySS, BWA, Canu, Cap3, IDBA, MIRA, Newbler, SOAPdenovo, SPAdes, Trinity, and Velvet) have increased from less than 1% of viral sequence entries ≥2000 nt in 2012, to 20% of all viral sequence entries in 2019 (Fig. 1h & i). A similar increase was observed for reference-mapping programs (i.e., Bowtie and Bowtie2), from 0.03% in 2012 to 12.5% in 2019. Multifunctional programs that offer both assembly options were the most common programs cited for the years 2013–2019, but since the exact sequence assembly strategy used for these records is unknown (Tables S1-S5), the contributions of de novo assembly are likely underestimated. An expanded summary of the sequencing technologies and assembly approaches used for viral GenBank records is available in the Supplement text and Supplement Tables S1-S6.

Effect of variant assembly using popular de novo assemblers

After establishing the growing use of NGS technologies for viral sequencing, we next focused on understanding how the presence of viral variants may influence de novo assembly output. We generated 247 simulated viral NGS datasets representing a continuum of pairwise identity (PID) between two viral variants, from 75% PID (one nucleotide difference every 4 nucleotides), to 99.6% PID (one nucleotide difference every 250 nucleotides) (Fig. 2). For Experiment 1, these datasets were assembled using 10 of the most used de novo assembly programs (Fig. 2 and Supplement Figure S1a) to evaluate their ability to assemble the two variants into their own respective contigs as the PID between the variants increases.

One key observation is that the assembly result can change from two (correct) contigs to many (unresolvable) contigs simply by having variant reads; the presence of viral variants affected the contig assembly output of all 10 assemblers tested. The output of the SPAdes, MetaSPAdes, ABySS, Cap3, and IDBA assemblers shared a few commonalities, demonstrated by a conceptual model in Fig. 3a. First, below a certain PID, when viral variants have enough distinct nucleotides to resolve the two variant contigs, the de novo assemblers produced two contigs correctly (Fig. 3). We refer to this as “variant distinction” (VD), with the highest pairwise identity where this occurs as the VD threshold. Above this threshold, the assemblers produced tens to thousands of contigs (Fig. 3), a phenomenon we define as “variant interference” (VI). As PID between the variants continue to increase, the de novo assemblers can no longer distinguish between the variants and assembled all the reads into a single contig, a phenomenon we define as “variant singularity” (VS). (Fig. 3). The lowest pairwise identity where a single contig is assembled is the VS threshold.

Slight differences in the variant interference patterns (relative to the canonical variant interference model) were observed for the 10 assemblers investigated. VD was observed for SPAdes, MetaSPAdes, and ABySS assemblers. While it was not observed with Cap3 and IDBA with the current simulated data parameters, we speculate that VD may occur at a lower PID level for these assemblers than tested in this study. The PID range where VI was observed was distinct for each de novo assembler (Fig. 3). During VI, SPAdes produced as many as 134 contigs and ABySS produced 3076 contigs, while MetaSPAdes, Cap3, and IDBA produced up to 10.

A different pattern was observed for Mira, Trinity, and SOAPdenovo2 assemblers. The average number of contigs generated by Mira, Trinity, and SOAPdenovo2 was 5, 36, and 283, respectively across all variant PIDs from 75 to 99.96%. Specifically, Mira and Trinity generated fewer contigs at low PID, but produced many contigs when the two variants reach 97.1% PID and 96.0% PID, respectively. For SOAPdenovo2, a larger number of contigs were produced regardless of the PID. This indicates that these assemblers generally have major challenges producing a single genome; this has been observed in previous studies comparing assembly performance [22].

Finally, Geneious and CLC were the least affected by VI in the simulated datasets tested, returning only 1–5 contigs for all pairwise identities. CLC’s assembly algorithm primarily returned a single contig over the range of PIDs tested (218/247 simulations; 88.3%), thus favoring VS. In comparison, Geneious predominantly distinguished the two variants (234/247 simulations; 94.7%), favoring VD.

Effect of GC content and genome length on variant assembly

For Experiment 2, we focused our study on evaluating whether VI observed in SPAdes de novo assembly is influenced by the GC content or genome length of the pathogen. SPAdes was chosen because it produced a well-defined variant interference that closely resembled the conceptual model (Fig. 2). It is also one of the leading assemblers for viral assembly (Fig. 1), possibly due to its ability to assemble viral variants without variant interference in most PID. Two datasets were used for the evaluation: reads generated from four artificial genomes ranging in length from 2 Kb to 1 Mb, as well as from genome sequences of poliovirus (NC_002058; 7440 nt in length) and coronavirus (NC_002645; 27,317 nt in length). No discernable correlation was observed between the GC content of variant genomes and the degree of VI for any of the simulated datasets (Supplemental Figure S1, p < 0.0001). Therefore, for subsequent analyses examining the effects of genome length on VI, the number of contigs at each PID level was obtained by averaging the 13 GC simulations.

Notably, no matter the genome length, SPAdes produced vastly more contigs (i.e., VI) in a constant, narrow range of PID (99–99.21%; Fig. 4a & b). The effect of variants on assembly was characterized by the three distinct intervals described previously: VD at lower PIDs, VI (Fig. 4b), and VS at higher PIDs for all genome lengths. For example, during VS, a single contig was generated when the two variants shared ≥99.22% PID, but tens to thousands of contigs were generated at a slightly lower PID of 99.21%. This PID threshold, 99.21%, marked the drastic transition from VI to VS, whereas the transition from VD to VI (i.e., the VD threshold) occurred at 98.99% PID (Fig. 4b). A correlation was observed between genome length and the number of contigs produced during VI, where longer genomes returned proportionally more contigs as expected as total VI occurrence should increase with length (r² = 0.967; p < 0.0001 Fig. 4b and c).

Effect of read length on variant assembly

The read length of a given NGS dataset will vary depending on the sequencing platform and kits utilized to generate the data. Since read length is an important factor for de novo assembly success, [23] we hypothesized that it may also influence the ability to distinguish viral variants. For Experiment 3, using SPAdes we investigated assemblies with four typical read lengths: 50, 100, 150, and 250 nt. At longer read lengths, the VD threshold occurred at higher PIDs (Fig. 4d & e). Also, with increasing read length, the width of the PID window where VI occurs gradually decreased from a 1.52% spread to a 0.21% spread (Fig. 4e). This indicates that longer reads are better for distinguishing viral variants with high PIDs.

In silico experiments examining variant assembly with NGS data derived from clinical samples

For clinical samples, assembly of viral genomes is affected by multiple factors other than the presence of variants, including sequencing error rate, host background reads, depth of genome coverage, and the distribution (i.e., pattern) of genome coverage. We next utilized viral NGS data generated from four picornavirus-positive clinical samples (one coxsackievirus B5, one enterovirus A71, and two parechovirus A3) to explore VI in datasets representative of data that may be encountered during routine NGS. The NGS data for each sample was partitioned into four bins of read data: (1) total reads after quality control (T); (2) major variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background non-viral reads only (MB) (Fig. 5). These binned datasets were then assembled separately using three assembly programs: SPAdes, Cap3, and Geneious. These programs were chosen as representatives of different assembly algorithms: SPAdes is a leading de Bruijn graph (DBG) assembler, Cap3 is a leading overlap-layout-consensus (OLC) assembler, and Geneious is a proprietary software. By comparing these manipulations, we aimed to test the hypothesis that minor variants directly affect the performance of assembly through VI in real clinical NGS data.

Even with an adequate depth of coverage for genome reconstruction, assembly of total reads (T) in 11/12 experiments resulted in unresolved genome construction – resulting in numerous fragmented viral contigs (Fig. 6). The only exception was one experiment where one single PeV-A3 (S1) genome was assembled using Cap3. When only reads from the major variant were assembled (M), full genomes were obtained for all datasets using SPAdes and Cap3, and for the CV-B5 sample using Geneious. Conversely, assembly of the read bins containing major and minor variants (Mm) resulted in an increased number of contigs for 9 of the 12 sample and assembly software combinations tested (Fig. 6), indicating that VI due to the addition of the minor variant reads likely adversely affected the assembly. The presence of background reads with major variant reads (MB) did not appear to affect viral genome assembly, as the UG₅₀% value, a performance metric which only considers unique, non-overlapping contigs for target viruses [24], was similar between M and MB datasets.

Discussion

Our analysis of the GenBank entries quantified the decade-long expansion of NGS technologies and de novo assembly for viral sequencing (Fig. 1). As the number of viral sequences in public databases continues to grow, an important question that naturally arises is how well current de novo assembly programs perform for datasets with viral variants. Viral variants are expected in biological samples, with the number of variants and the extent of the sequence divergence between variants related to the mutation rate of the virus and the types of specimens that are being investigated. For example, samples containing rapidly evolving RNA viruses, such as poliovirus, HIV, and HCV [9, 11, 25], environmental samples, [26] and clinical samples from immunosuppressed individuals [27, 28] usually harbor many variants. The ability to accurately distinguish variants is imperative to inform treatments (in the case of HIV and HCV), or determine whether a subpopulation of a more virulent variant is present.

Several experiments using simulated and clinical sample NGS data were performed to evaluate the ability of genome assembly programs to distinguish genome variants. All assemblers investigated generated fragmented assemblies when the data contained reads from two closely related variants due to “variant interference” (VI). Changes in pairwise identity (PID) as small as 0.01% between the two variants triggered an assembler to change from producing one or two contigs to producing hundreds of contigs. A quintessential example of this phenomenon was the SPAdes assembly of EV-A71 sequences during the in silico experiments with clinical NGS data. Assembly of major variant reads resulted in one full length contig (Fig. 5), whereas assembly of datasets containing the major and minor variant reads (Mm and T) were characterized by a number of contigs, resulting in “cobwebs” of contig fragments when visualized using Bandage (Supplement Figure S2) [29]. Even though the de novo assembly graph linked the different contig fragments, the assembly could not differentiate the multiple routes of possible contig construction. We speculate this is the main reason why VI occurs in the context of de Bruijn graph assemblers.

The simulated experiments suggested that genome length and read length influence VI; A longer genome length will produce proportionally more contigs during VI, whereas a longer read length decreases the PID range where VI occurs (Fig. 4). While longer read length improves assembly, unfortunately, platforms that produce long reads such as Oxford Nanopore and PacBio have higher error rates [30]. Until long reads can be produced at high fidelity, researchers must continue to rely on combining long- and short-read NGS datasets, and genome polishing techniques [30].

The large number of contigs generated due to VI may be overwhelming for most researchers, and for viral ecology studies, could lead to over-estimation of species richness for methods that use contig spectra to infer richness, such as PHACCS or CatchAll [31,32,33]. This phenomenon may also impact studies differently depending on the overall goal for generating viral sequence data. For example, some researchers may only be concerned with generating a single major consensus genome, even when variants are detected in the data. This is common during outbreak responses for pathogens such as Ebola virus or Middle East respiratory syndrome coronavirus, where detection of SNPs (indicative of minor variants) is not immediately important. On the other hand, some investigations could favor distinguishing variants, such as for investigating the presence of vaccine-derived poliovirus, where a small number of SNPs may distinguish a vaccine-derived strain from a normal vaccine strain genome [28].

The effects of VI could potentially be mitigated by running multiple assembly programs. A previous study testing bioinformatics strategies for assembling viral NGS data found that employing sequential use of de Bruijn graph and overlap-layout-consensus assemblers produced better assemblies [22]. We speculate that this “ensemble strategy” [22] may perform better because the multiple assemblers complement one another by having different VI PID thresholds. Future assembly approaches could also consider resolving the VI problem by possibly discriminating the major and minor variant reads first (perhaps by coverage or SNP analysis), and then assembling major and minor variant reads separately.

Since we observed VI occurring in simulated data from 2 Kb to 1 Mb genome lengths, we speculate that it may not only affect viral data but also larger draft contigs of bacteria and other microorganisms. Even though bacterial mutation rates are much lower than those of most viruses, bacterial variants are common. For environmental studies, bacterial metagenomes are known to contain many related taxa and variants [34,35,36,37], and in clinical investigations, minor bacterial variants can harbor SNPs that provide resistance against antimicrobials. This warrants future investigation into how the presence of variants may impact the assembly of other microbial datasets.

Conclusion

This study aimed to understand how variants affect assembly. As an initial investigation, many confounding factors were simplified for experimentation. Simulated variants studied here only depicted periodic mutations, set at regular intervals. However, in real viral data, SNPs are never evenly distributed across the genome, with zones of divergence and similarity [38, 39]. Other important factors which influence genome assembly include sequencing error rates, presence of repetitive regions, and coverage depth. We limited our experiments to keep these factors constant in order to investigate the sole effect of VI. A pilot experiment analyzing three variants demonstrated the VI theory generally holds true even with multiple variants with varying PID (Supplement Figure S3). Introducing a third variant within the VI threshold (Set B) destabilized assembly and caused the contig number to inflate. Through this study, we demonstrated that reads from related genome variants adversely affect de novo assembly. As NGS and de novo assembly have become essential for generating full-length viral genomes, future studies should investigate the combined effects of the number and relative proportion of minor variants, as well as additional assembly factors (e.g., error rates) to supplement this work.

Methods

Analyzing NGS and assembler usage in the virus nucleotide collection in GenBank

Viral sequence entries from the GenBank non-redundant nucleotide collection were obtained by downloading all sequences under the virus taxonomy through the end of 2019. A total of 2,793,810 GenBank entries were investigated.

The total number of viral sequences submitted annually in GenBank through December 2019 was calculated by filtering GenBank submissions by “virus,” followed by application of the following additional filtering steps: “genomic DNA/RNA” was selected and a “release date: Jan 1 through Dec 31” was applied to find the total number of viruses for a given year. A custom script was used to filter and count all documented sequencing technologies and assembly methods used for each GenBank entry.

Creation of simulated variant genomes and reads

Simulated genomes were generated using custom scripts that randomly assign each nucleotide over a designated genome length with a weighted distribution dependent on the GC content (Supplement Figure S1). The random genomes were then screened using NCBI BLAST to ensure no similarity/identity existed to any classified organism (i.e., no BLAST hits). These simulated genomes served as the initial variant genome (variant 1). To generate the mutated variant genomes (variant 2), a custom script was used to systematically introduce evenly distributed random mutations at rates from 1 mutation in every 4 nucleotides (75% PID) to 1 mutation in every 250 nucleotides (99.6% PID), incrementing by 1 nucleotide.

Following the generation of initial and mutated variant genomes, high-quality fastq reads were generated using ART, [40] simulating Illumina MiSeq paired-end runs at 50X coverage with 250 nt reads, DNA/RNA mean fragments size of 500, and quality score of 93. Fastq reads were combined in equal numbers for the initial and mutated variants and used as input for subsequent de novo assembly experiments (Supplement Figure S1). The same process was utilized to generate the artificial genomes, initial and mutated variant genomes, and reads for each of the experiments.

Experiment 1: analyzing simulated reads from variants using different de novo assembly programs

The simulated datasets containing reads from two variant genomes with nucleotide pairwise identity ranging from 75 to 99.6% were analyzed using 10 different genome assembly programs using default parameters. The de novo assembly algorithms used were either overlap-layout-consensus (OLC) [Cap [41] and Mira [42, 43]], de Bruijn graph (DBG) [ABySS [44], IDBA [45], MetaSPAdes [46], SOAPdenovo2 [47], SPAdes [48], and Trinity [49]], or commercial software packages [CLC (https://www.qiagenbioinformatics.com/) and Geneious [50]] whose assembly algorithms are proprietary (Supplement Table S6). The simulation settings for the reads were paired-end reads, 250 nt read length, and 50X coverage. A total of 2470 assemblies (247 datasets per genome X 10 assemblers) were analyzed (Supplement Figure S1a).

Experiment 2: simulated data by varying genome length and GC content

Artificial genomes were constructed for four genome lengths: 2 Kb, 10 Kb, 100 Kb, and 1 Mb, with varying GC content from 20 to 80%, in 5% increments (Supplement Figure S1b). Datasets derived using one poliovirus genome (NC_002058) and one coronavirus genome (NC_002645) were also included in this analysis, representing the lower and upper genome length range typical of RNA viruses. The original GC content was kept constant for the poliovirus and coronavirus genomes. For all of these genomes, simulated reads for initial and mutated variants were generated as above.

A total of 13,338 SPAdes assemblies were generated, which included 12,844 assemblies for the four artificial genomes (247 datasets per genome X 4 artificial genome lengths X 13 GC content proportions X 1 assembler) and 494 assemblies for the poliovirus and coronavirus datasets (247 datasets per genome X 2 genomes X 1 assembler) (Supplement Figure S1b). JMP v13.0.0 (www.sas.com) was used to calculate Pearson’s correlation and Spearman’s ρ values to compare the association between percent GC levels and the number of contigs produced at each PID level. Since there was little statistical difference when comparing the contig numbers generated at varying percent GC for each of the four genome length datasets (Spearman’s ρ = 0.8299 to 0.9801, p < 0.001) (Supplement Excel file), the final contig number was averaged across the 13 GC percentages at a given PID. The average contig number was used for plotting the contig assembly results vs percent PID for each simulated genome length (Fig. 4a-b).

Experiment 3: simulated data by varying read length

Genome variants were generated as described above (“Creation of simulated variant genomes and reads”) for a genome of size 100 Kb with 50% GC; this was the starting initial variant genome. In this simulation, initial and mutation variant reads at four sequencing read lengths (50, 100, 150, and 250 nt) were created using ART. A total of 538 SPAdes assemblies were generated (47, 97, 147, and 247 datasets for the 50, 100, 150 and 250 nt read lengths, respectively) (Supplement Figure S1c).

Evaluation of NGS datasets from clinical samples

Four datasets derived from clinical samples containing picornaviruses (one enterovirus A71 [EV-A71], one coxsackievirus B5 [CV-B5] and two parechovirus A3 [PeV-A3]) were analyzed for this experiment, as previous sequencing analysis using Geneious indicated the presence of genome variants. The datasets were analyzed using an in-house pipeline (VPipe), [25] which performs various quality control (QC) steps and de novo assembly using SPAdes. The post-QC reads were considered total reads (T) and mapped to their respective reference genome in order to determine the major and minor variants present in each sample. Total reads which mapped with high similarity (≥99%) to the major variant were categorized as reads representing the major variant (M). Unbinned reads from the major variant reference recruitment were used to construct the minor variant consensus using a second round of reference recruitment, and these reads were categorized as the minor variant (m). Remaining reads from the previous two steps were considered background (B) reads.

De novo assembly for each of the four clinical samples was performed for the following binned NGS datasets: (1) total reads only (T); (2) major variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background reads only (MB). This was repeated with three assembly programs: SPAdes, Cap3, and Geneious. The length of the longest contig produced from each assembly and the performance metric UG₅₀% [24] were calculated to compare the results for these 48 assemblies (4 experiments X 4 viruses X 3 assemblers).

Availability of data and materials

Sequencing reads for the experiments conducted using clinical specimens are available through the NCBI Sequence Read Archive (SRA) – BioProject accession number PRJNA577924. Reads from simulated datasets (Experiments 1–3) are available upon request.

Abbreviations

CV-B5:: Coxsackievirus B5
DBG:: de Bruijn graph
EV-A71:: Enterovirus A71
HCV:: Hepatitis C
HIV:: Human immunodeficiency virus
M:: Major variant reads
MB:: Major variant and background non-viral reads
Mm:: Major and minor variants reads
NGS:: Next-generation sequencing
nt:: Nucleotide
OLC:: Overlap-layout-consensus
PeV-A3:: Parechovirus A3
PID:: Pairwise identity
SNP:: Single nucleotide polymorphism
T:: Total reads after quality control
VD:: Variant distinction
VI:: Variant interference
VS:: Variant singularity

References

Rasmussen AL, Katze MG. Genomic signatures of emerging viruses: a new era of systems epidemiology. Cell Host Microbe. 2016;19(5):611–8.
CAS PubMed PubMed Central Google Scholar
Leung P, Eltahla AA, Lloyd AR, Bull RA, Luciani F. Understanding the complex evolution of rapidly mutating viruses with deep sequencing: beyond the analysis of viral diversity. Virus Res. 2017;239:43–54.
CAS PubMed Google Scholar
Huang SW, Hung SJ, Wang JR. Application of deep sequencing methods for inferring viral population diversity. J Virol Methods. 2019;266:95–102.
CAS PubMed Google Scholar
Perez-Losada M, Arenas M, Galan JC, Bracho MA, Hillung J, Garcia-Gonzalez N, Gonzalez-Candelas F. High-throughput sequencing (HTS) for the analysis of viral populations. Infect Genet Evol. 2020;80:104208.
PubMed Google Scholar
Pierce BG, Keck ZY, Foung SK. Viral evasion and challenges of hepatitis C virus vaccine development. Curr Opin Virol. 2016;20:55–63.
CAS PubMed PubMed Central Google Scholar
Duffy S, Shackelton LA, Holmes EC. Rates of evolutionary change in viruses: patterns and determinants. Nat Rev Genet. 2008;9(4):267–76.
CAS PubMed Google Scholar
Andino R, Domingo E. Viral quasispecies. Virology. 2015;479-480:46–51.
CAS PubMed Google Scholar
Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, Berlin AM, Malboeuf CM, Ryan EM, Gnerre S, et al. Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog. 2012;8(3):e1002529.
CAS PubMed PubMed Central Google Scholar
Herbeck JT, Rolland M, Liu Y, McLaughlin S, McNevin J, Zhao H, Wong K, Stoddard JN, Raugi D, Sorensen S, et al. Demographic processes affect HIV-1 evolution in primary infection before the onset of selective processes. J Virol. 2011;85(15):7523–34.
CAS PubMed PubMed Central Google Scholar
Jerzak G, Bernard KA, Kramer LD, Ebel GD. Genetic variation in West Nile virus from naturally infected mosquitoes and birds suggests quasispecies structure and strong purifying selection. The Journal of general virology. 2005;86(Pt 8):2175–83.
CAS PubMed Google Scholar
Lauck M, Alvarado-Mora MV, Becker EA, Bhattacharya D, Striker R, Hughes AL, Carrilho FJ, O'Connor DH, Pinho JRR. Analysis of hepatitis C virus intrahost diversity across the coding region by ultradeep pyrosequencing. J Virol. 2012;86(7):3952–60.
CAS PubMed PubMed Central Google Scholar
Lin S-R, Hsieh S-C, Yueh Y-Y, Lin T-H, Chao D-Y, Chen W-J, King C-C, Wang W-K. Study of sequence variation of dengue type 3 virus in naturally infected mosquitoes and human hosts: implications for transmission and evolution. J Virol. 2004;78(22):12717–21.
CAS PubMed PubMed Central Google Scholar
Murcia PR, Baillie GJ, Daly J, Elton D, Jervis C, Mumford JA, Newton R, Parrish CR, Hoelzer K, Dougan G, et al. Intra- and interhost evolutionary dynamics of equine influenza virus. J Virol. 2010;84(14):6943–54.
CAS PubMed PubMed Central Google Scholar
Vignuzzi M, Stone JK, Arnold JJ, Cameron CE, Andino R. Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population. Nature. 2006;439(7074):344–8.
CAS PubMed Google Scholar
Thai KTD, Henn MR, Zody MC, Tricou V, Nguyet NM, Charlebois P, Lennon NJ, Green L, de Vries PJ, Hien TT, et al. High-resolution analysis of intrahost genetic diversity in dengue virus serotype 1 infection identifies mixed infections. J Virol. 2012;86(2):835–43.
CAS PubMed PubMed Central Google Scholar
Yang X, Charlebois P, Gnerre S, Coole MG, Lennon NJ, Levin JZ, Qu J, Ryan EM, Zody MC, Henn MR. De novo assembly of highly diverse viral populations. BMC Genomics. 2012;13:475.
CAS PubMed PubMed Central Google Scholar
Khan AR, Pervez MT, Babar ME, Naveed N, Shoaib M. A comprehensive study of De novo genome assemblers: current challenges and future prospective. Evol Bioinformatics Online. 2018;14:1176934318758650.
Google Scholar
Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95(6):315–27.
CAS PubMed Google Scholar
Olson ND, Treangen TJ, Hill CM, Cepeda-Espinoza V, Ghurye J, Koren S, Pop M. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform. 2019;20(4):1140–50.
PubMed Google Scholar
Rizzi R, Beretta S, Patterson M, Pirola Y, Previtali M, Della Vedova G, Bonizzoni P. Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. Quantitative Biology. 2019;7(4):278–92.
Google Scholar
Sutton TDS, Clooney AG, Ryan FJ, Ross RP, Hill C. Choice of assembly software has a critical impact on virome characterisation. Microbiome. 2019;7(1):12.
PubMed PubMed Central Google Scholar
Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, Delwart EL. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data. Nucleic Acids Res. 2015;43(7):e46.
PubMed PubMed Central Google Scholar
Wommack KE, Bhavsar J, Ravel J. Metagenomics: read length matters. Appl Environ Microbiol. 2008;74(5):1453–63.
CAS PubMed PubMed Central Google Scholar
Castro CJ, Ng TFF. U50: a new metric for measuring assembly output based on non-overlapping, Target-Specific Contigs. J Comput Biol. 2017;24(11):1071–80.
CAS PubMed Google Scholar
Montmayeur AM, Ng TF, Schmidt A, Zhao K, Magana L, Iber J, Castro CJ, Chen Q, Henderson E, Ramos E, et al. High-throughput next-generation sequencing of polioviruses. J Clin Microbiol. 2017;55(2):606–15.
CAS PubMed PubMed Central Google Scholar
Ng TFF, Marine R, Wang C, Simmonds P, Kapusinszky B, Bodhidatta L, Oderinde BS, Wommack KE, Delwart E. High variety of known and new RNA and DNA viruses of diverse origins in untreated sewage. J Virol. 2012;86(22):12161.
CAS PubMed PubMed Central Google Scholar
Ma S, Du Z, Feng M, Che Y, Li Q. A severe case of co-infection with Enterovirus 71 and vaccine-derived poliovirus type II. Journal of clinical virology : the official publication of the Pan American Society for Clinical Virology. 2015;72:25–9.
Google Scholar
Jorba J, Diop OM, Iber J, Henderson E, Zhao K, Sutter RW, Wassilak SGF, Burns CC. Update on vaccine-derived polioviruses - worldwide, January 2017-June 2018. MMWR Morb Mortal Wkly Rep. 2018;67(42):1189–94.
PubMed PubMed Central Google Scholar
Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics (Oxford, England). 2015;31(20):3350–2.
CAS Google Scholar
Lu H, Giordano F, Ning Z. Oxford Nanopore MinION sequencing and genome assembly. Genomics, Proteomics & Bioinformatics. 2016;14(5):265–79.
Google Scholar
Herath D, Jayasundara D, Ackland D, Saeed I, Tang SL, Halgamuge S. Assessing species diversity using Metavirome data: methods and challenges. Comput Struct Biotechnol J. 2017;15:447–55.
PubMed PubMed Central Google Scholar
Bunge J, Woodard L, Bohning D, Foster JA, Connolly S, Allen HK. Estimating population diversity with CatchAll. Bioinformatics. 2012;28(7):1045–7.
CAS PubMed PubMed Central Google Scholar
Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, Salamon P, Felts B, Nulton J, Mahaffy J, Rohwer F. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics. 2005;6:41.
PubMed PubMed Central Google Scholar
Wang NF, Zhang T, Yang X, Wang S, Yu Y, Dong LL, Guo YD, Ma YX, Zang JY. Diversity and composition of bacterial Community in Soils and Lake Sediments from an Arctic Lake area. Front Microbiol. 2016;7:1170.
PubMed PubMed Central Google Scholar
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, et al. The sorcerer II Global Ocean sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5(3):e77.
PubMed PubMed Central Google Scholar
The Human Microbiome Project C, Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, Creasy HH, Earl AM, FitzGerald MG, et al. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207.
Google Scholar
Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI. The human microbiome project. Nature. 2007;449:804.
CAS PubMed PubMed Central Google Scholar
Schneider WL, Roossinck MJ. Genetic diversity in RNA virus Quasispecies is controlled by host-virus interactions. J Virol. 2001;75(14):6566.
CAS PubMed PubMed Central Google Scholar
Gregori J, Perales C, Rodriguez-Frias F, Esteban JI, Quer J, Domingo E. Viral quasispecies complexity measures. Virology. 2016;493:227–37.
CAS PubMed Google Scholar
Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics (Oxford, England). 2012;28(4):593–4.
Google Scholar
Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res. 1999;9(9):868–77.
CAS PubMed PubMed Central Google Scholar
Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Muller WE, Wetter T, Suhai S. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004;14(6):1147–59.
CAS PubMed PubMed Central Google Scholar
Chevreux B, Wetter T, Suhai S. Genome sequence assembly using trace signals and additional sequence information. German conference on bioinformatics. 1999;99(1):45–56.
Google Scholar
Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, et al. ABySS 2.0: resource-efficient assembly of large genomes using a bloom filter. Genome Res. 2017;27(5):768–77.
CAS PubMed PubMed Central Google Scholar
Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. In: Research in Computational Molecular Biology: 2010// 2010; Berlin, Heidelberg. Berlin Heidelberg: Springer; 2010. p. 426–40.
Google Scholar
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27(5):824–34.
CAS PubMed PubMed Central Google Scholar
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.
PubMed PubMed Central Google Scholar
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.
CAS PubMed PubMed Central Google Scholar
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.
CAS PubMed PubMed Central Google Scholar
Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper A, Markowitz S, Duran C, et al. Geneious basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28(12):1647–9.
PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors thank Dr. Steve Oberste for thoughtful suggestions on this work. We also thank the CDC Office of Advanced Molecular Detection Scientific Computing Team for their computing support.

Funding

C.J.C, R.L.M., and T.F.F.N. were supported by Federal appropriations to the Centers for Disease Control and Prevention.

Author information

Authors and Affiliations

Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, 30329, USA
Christina J. Castro, Rachel L. Marine & Terry Fei Fan Ng
Oak Ridge Institute for Science and Education, Oak Ridge, TN, USA
Christina J. Castro
General Dynamics Information Technology, Inc., contracting agency to the Office of Informatics, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Falls Church, VA, USA
Edward Ramos

Authors

Christina J. Castro
View author publications
You can also search for this author in PubMed Google Scholar
Rachel L. Marine
View author publications
You can also search for this author in PubMed Google Scholar
Edward Ramos
View author publications
You can also search for this author in PubMed Google Scholar
Terry Fei Fan Ng
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors contributed to the conceptualization, data analysis, preparation, and review of this manuscript. C.J.C, R.L.M., and T.F.F.N. wrote this manuscript. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Terry Fei Fan Ng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Supplemental Text: Analysis of viral GenBank records demonstrated the advent of NGS in viral sequencing in the last two decades. Supplement Figure S1. Workflow diagrams of simulated data from data creation through de novo assembly. Supplement Figure S2. Analysis of the final contig assembly graphs for a clinical sample containing enterovirus A71 (EV-A71) variants using Bandage. Supplement Figure S3. Assembly with three simulated variants. Supplement Table S1. Total counts from NCBI’s GenBank non-redundant nucleotide database. Supplement Table S2. Total count of sequencing technologies for sequences >2000 nt in the NCBI GenBank non-redundant nucleotide database for years 2012–2019. Supplement Table S3. Total counts from NCBI’s GenBank non-redundant nucleotide database with multiple sequencing technologies listed per entry. Supplement Table S4. Total counts from NCBI’s GenBank non-redundant nucleotide database of all entries with three and four sequencing technologies listed. Supplement Table S5. Total count of assembly programs used to generate sequences >2000 nt in the NCBI GenBank non-redundant nucleotide database. Supplement Table S6. The 10 de novo assemblers used for analysis of the simulated data, as categorized by their underlying assembly algorithms.

Additional file 2.

Number of contigs generated by SPades using the simulated data of two variants differed from 75% PID to 99.6% PID, across 20%-80% GC content.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Castro, C.J., Marine, R.L., Ramos, E. et al. The effect of variant interference on de novo assembly for viral deep sequencing. BMC Genomics 21, 421 (2020). https://doi.org/10.1186/s12864-020-06801-w

Download citation

Received: 02 March 2020
Accepted: 02 June 2020
Published: 22 June 2020
DOI: https://doi.org/10.1186/s12864-020-06801-w

The effect of variant interference on de novo assembly for viral deep sequencing

Abstract

Background

Results

Conclusions

Background

Results

The rise of NGS and de novo assembler use in GenBank viral sequences

Effect of variant assembly using popular de novo assemblers

Effect of GC content and genome length on variant assembly

Effect of read length on variant assembly

In silico experiments examining variant assembly with NGS data derived from clinical samples

Discussion

Conclusion

Methods

Analyzing NGS and assembler usage in the virus nucleotide collection in GenBank

Creation of simulated variant genomes and reads

Experiment 1: analyzing simulated reads from variants using different de novo assembly programs

Experiment 2: simulated data by varying genome length and GC content

Experiment 3: simulated data by varying read length

Evaluation of NGS datasets from clinical samples

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary information

Additional file 1.

Additional file 2.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us