The effect of variant interference on de novo assembly for viral deep sequencing

Background Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. Next-generation sequencing (NGS) approaches have surpassed Sanger for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored. Results Our results from > 15,000 simulated experiments showed that presence of variants can turn an assembly of one genome into tens to thousands of contigs. This “variant interference” (VI) is highly consistent and reproducible by ten commonly-used de novo assemblers, and occurs over a range of genome length, read length, and GC content. The main driver of VI is pairwise identities between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allow the “rescue” of full viral genomes from fragmented contigs. Conclusions These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.


Background
For many years, Sanger sequencing has been used to complement classical epidemiological and laboratory methods for investigating viral infections [1]. As technologies have evolved, the emergence of next-generation sequencing (NGS), which drastically reduced the cost per base to generate sequence data for complete viral genomes, has allowed scientists to apply viral sequencing on a grander scale [2][3][4]. Genomic sequencing is ideal for elucidating viral transmission pathways, characterizing emerging viruses, and locating genomic regions which are functionally important for evading the host immune system or antivirals [2,5].
Genomic surveillance of viruses is particularly important in light of their rapid rate of evolution. Viruses have higher mutation rates than cellular-based taxa, with RNA viruses having mutation rates as high as 1.5 × 10 − 3 mutations per nucleotide, per genomic replication cycle [6]. Due to this high mutation rate, it is well established that most RNA viruses exist as a swarm of quasispecies, [7] with each quasispecies containing unique single nucleotide polymorphisms (SNPs). The presence of these variants plays a key role in viral adaptation.
Due to viruses' rapid evolution, a single clinical sample often contains a mixture of many closely related viruses. Viral quasispecies are mainly derived from intra-host evolution, with RNA viruses such as poliovirus, human immunodeficiency virus (HIV), hepatitis C (HCV), influenza, dengue, and West Nile viruses maintaining diverse quasispecies populations within a host [8][9][10][11][12][13][14][15]. Conversely, the term "viral strains" often refers to different lineages of viruses found in separate hosts, or a coinfection of viruses in the same host due to multiple infection events. As a result, sequence divergence is usually higher when comparing viral strains compared to quasispecies. In this study, we use the term "variant" to encompass both quasispecies and strains regardless of how the variants originated in the biological samples.
Since many sequencing technologies produce reads that are significantly shorter than the target genome size, a process to construct contigs, scaffolds, and full-length genomes is needed. Reference-mapping and de novo assembly are the two primary bioinformatic strategies for genome assembly. Reference-mapping requires a closelyrelated genome as input to align reads, while de novo assembly generates contigs without the use of a reference genome. Therefore, de novo assembly is the most suitable strategy for analyzing underexplored taxa [16] or for viruses with high mutation and/or recombination rates.
The two most common graph algorithms employed by de novo assembly programs are: overlap graphs for overlap-layout-consensus (OLC) methods, and k-mer based graphs for de Bruijn graph (DBG) methods. OLC methods involve determining overlaps by performing a series of pair-wise sequence alignments. Such assemblies may be computationally expensive (especially for large datasets), and generally work better with longer reads [17,18]. Conversely, DBG assemblers split reads into smaller k-mers, with k-mers connected when they share a common prefix and suffix of length k -1. DBG methods are usually faster to run than OLC methods, but this strategy is known to be sensitive to repeats, sequencing error, and the presence of variants, which increase the k-mer complexity and ambiguity during sequence reconstruction [19,20]. These challenges could lead to fragmented contigs when analyzing viral assemblies from clinical or environmental samples [21].
In this study, we first examined how often NGS and de novo assembly were applied for viral sequences deposited in the GenBank nucleotide database (www.ncbi. nlm.nih.gov/nucleotide/). Then, we investigated how the presence of variants affected assembly results -simulated and clinical NGS datasets were analyzed using multiple assembly programs to explore the effects of genome variant relatedness, read length, genome GC and genome length on the resulting contig distribution. As viruses in different taxa vary in length and GC content, these experiments demonstrate how assembly of viral variants is impacted by basic genome structure characteristics, as well as by the nucleotide similarity between variants and sequencing read length.

Results
The rise of NGS and de novo assembler use in GenBank viral sequences GenBank viral entries from 1982 to 2019 were collected and analyzed, with extensive analyses performed to evaluate technologies and bioinformatics programs cited in records deposited between 2011 and 2019. Through 2019, there were over 2.7 million viral entries in GenBank; however, over 70% (1.9 million) do not specify a sequencing technology (Supplement Table S1) due to the looser data requirement in earlier years. When looking at recently deposited records (2014-2019), the Illumina sequencing platform was the most common NGS platform used for viral sequencing, with over a 2fold increase over the next most popular NGS platform ( Fig. 1d & e). When long sequences (≥2000 nt) are considered, NGS technologies surpassed Sanger in 2017 as the dominant strategy for sequencing, comprising 53.8% (14,653/27,217) of entries compared to 46.2% of entries (12,564/27,217) for Sanger. This trend held true in 2018 and 2019 as well ( Fig. 1f and Supplement Table S2).
Hybrid sequencing approaches, where researchers use more than one sequencing technology to generate complete viral sequences, have also become more common over the past several years. The most common combination observed was 454 and Sanger (18,124 entries), likely due to the early emergence of the 454 technology compared to other NGS platforms ( Fig. 1c and Supplement Table S3). However, combining Illumina with various other sequencing platforms is quite commonplace (> 19,000 entries).
De novo assembly programs (ABySS, BWA, Canu, Cap3, IDBA, MIRA, Newbler, SOAPdenovo, SPAdes, Trinity, and Velvet) have increased from less than 1% of viral sequence entries ≥2000 nt in 2012, to 20% of all viral sequence entries in 2019 ( Fig. 1h & i). A similar increase was observed for reference-mapping programs (i.e., Bowtie and Bowtie2), from 0.03% in 2012 to 12.5% in 2019. Multifunctional programs that offer both assembly options were the most common programs cited for the years 2013-2019, but since the exact sequence assembly strategy used for these records is unknown (Tables S1-S5), the contributions of de novo assembly are likely underestimated. An expanded summary of the sequencing technologies and assembly approaches used for viral GenBank records is available in the Supplement text and Supplement Tables S1-S6.

Effect of variant assembly using popular de novo assemblers
After establishing the growing use of NGS technologies for viral sequencing, we next focused on understanding how the presence of viral variants may influence de novo assembly output. We generated 247 simulated viral NGS datasets representing a continuum of pairwise identity (PID) between two viral variants, from 75% PID (one nucleotide difference every 4 nucleotides), to 99.6% PID (one nucleotide difference every 250 nucleotides) (Fig. 2). For Experiment 1, these datasets were assembled using  Figure S1a) to evaluate their ability to assemble the two variants into their own respective contigs as the PID between the variants increases.
One key observation is that the assembly result can change from two (correct) contigs to many (unresolvable) contigs simply by having variant reads; the presence of viral variants affected the contig assembly output of all 10 assemblers tested. The output of the SPAdes, MetaSPAdes, ABySS, Cap3, and IDBA assemblers shared a few commonalities, demonstrated by a conceptual model in Fig. 3a. First, below a certain PID, when viral variants have enough distinct nucleotides to resolve the two variant contigs, the de novo assemblers produced two contigs correctly (Fig. 3). We refer to this as "variant distinction" (VD), with the highest pairwise identity where this occurs as the VD threshold. Above this threshold, the assemblers produced tens to thousands of contigs (Fig. 3), a phenomenon we define as "variant interference" (VI). As PID between the variants continue to increase, the de novo assemblers can no longer distinguish between the variants and assembled all the reads into a single contig, a phenomenon we define as "variant singularity" (VS). (Fig. 3). The lowest pairwise identity where a single contig is assembled is the VS threshold.
Slight differences in the variant interference patterns (relative to the canonical variant interference model) were observed for the 10 assemblers investigated. VD was observed for SPAdes, MetaSPAdes, and ABySS assemblers. While it was not observed with Cap3 and IDBA with the current simulated data parameters, we speculate that VD may occur at a lower PID level for these assemblers than tested in this study. The PID range where VI was observed was distinct for each de novo assembler (Fig. 3). During VI, SPAdes produced as many as 134 contigs and ABySS produced 3076 contigs, while MetaSPAdes, Cap3, and IDBA produced up to 10.
A different pattern was observed for Mira, Trinity, and SOAPdenovo2 assemblers. The average number of contigs generated by Mira, Trinity, and SOAPdenovo2 was 5, 36, and 283, respectively across all variant PIDs from 75 to 99.96%. Specifically, Mira and Trinity generated fewer contigs at low PID, but produced many contigs when the two variants reach 97.1% PID and 96.0% PID, respectively. For SOAPdenovo2, a larger number of contigs were produced regardless of the PID. This indicates that these assemblers generally have major challenges producing a single genome; this has been observed in previous studies comparing assembly performance [22].
Finally, Geneious and CLC were the least affected by VI in the simulated datasets tested, returning only 1-5 contigs for all pairwise identities. CLC's assembly algorithm primarily returned a single contig over the range of PIDs tested (218/247 simulations; 88.3%), thus favoring VS. In comparison, Geneious predominantly distinguished the two variants (234/247 simulations; 94.7%), favoring VD.

Effect of GC content and genome length on variant assembly
For Experiment 2, we focused our study on evaluating whether VI observed in SPAdes de novo assembly is influenced by the GC content or genome length of the pathogen. SPAdes was chosen because it produced a Fig. 2 Workflow diagram of the investigation of variant simulated NGS reads through de novo assembly. First, in step 1, an artificial reference genome and corresponding initial variant reads were created with varying constraints such as genome length, GC content, read length, and assemblers, according to the experiment types as detailed in Supplement Figure S1. In the second step, an artificial mutated variant genome was created. The process is repeated to generate 247 different mutated variants with controlled mutation parameters-starting with 1 mutation every 4 nucleotides (75% PID) and ending with 1 mutation in every 250 nucleotides (99.6% PID). Mutated variant reads are also generated for each of the mutation parameters. In the third and fourth steps, the initial and mutated variants were then combined and used as input for de novo assembly for the three experiments, as detailed in Supplement Figure S1 well-defined variant interference that closely resembled the conceptual model (Fig. 2). It is also one of the leading assemblers for viral assembly (Fig. 1), possibly due to its ability to assemble viral variants without variant interference in most PID. Two datasets were used for the evaluation: reads generated from four artificial genomes ranging in length from 2 Kb to 1 Mb, as well as from genome sequences of poliovirus (NC_002058; 7440 nt in length) and coronavirus (NC_002645; 27,317 nt in length). No discernable correlation was observed between the GC content of variant genomes and the degree of VI for any of the simulated datasets (Supplemental Figure S1, p < 0.0001). Therefore, for subsequent analyses examining the effects of genome length on VI, the number of contigs at each PID level was obtained by averaging the 13 GC simulations.
Notably, no matter the genome length, SPAdes produced vastly more contigs (i.e., VI) in a constant, narrow range of PID (99-99.21%; Fig. 4a & b). The effect of variants on assembly was characterized by the three distinct intervals described previously: VD at lower PIDs, VI (Fig. 4b), and VS at higher PIDs for all genome lengths. For example, during VS, a single contig was generated when the two variants shared ≥99.22% PID, but tens to thousands of contigs were generated at a slightly lower PID of 99.21%. This PID threshold, 99.21%, marked the drastic transition from VI to VS, whereas the transition from VD to VI (i.e., the VD threshold) occurred at 98.99% PID (Fig. 4b). A correlation was observed between genome length and the number of contigs produced during VI, where longer genomes returned proportionally more contigs as expected as total VI occurrence should increase with length (r 2 = 0.967; p < 0.0001 Fig. 4b and c).

Effect of read length on variant assembly
The read length of a given NGS dataset will vary depending on the sequencing platform and kits utilized to generate the data. Since read length is an important factor for de novo assembly success, [23] we hypothesized that it may also influence the ability to distinguish viral variants. For Experiment 3, using SPAdes we investigated assemblies with four typical read lengths: 50, 100, 150, and 250 nt. At longer read lengths, the VD threshold occurred at higher PIDs (Fig. 4d & e). Also, with increasing read length, the width of the PID window where VI occurs gradually decreased from a 1.52% spread to a 0.21% spread (Fig. 4e). This indicates that longer reads are better for distinguishing viral variants with high PIDs. In silico experiments examining variant assembly with NGS data derived from clinical samples For clinical samples, assembly of viral genomes is affected by multiple factors other than the presence of variants, including sequencing error rate, host background reads, depth of genome coverage, and the distribution (i.e., pattern) of genome coverage. We next utilized viral NGS data generated from four picornavirus-positive clinical samples (one coxsackievirus B5, one enterovirus A71, and two parechovirus A3) to explore VI in datasets representative of data that may be encountered during routine NGS. The NGS data for each sample was partitioned into four bins of read data: (1) total reads after quality control (T); (2) major variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background non-viral reads only (MB) (Fig. 5). These binned datasets were then assembled separately using three assembly programs: SPAdes, Cap3, and Geneious. These programs were chosen as representatives of different assembly algorithms: SPAdes is a leading de Bruijn graph (DBG) assembler, Cap3 is a leading overlap-layout-consensus (OLC) assembler, and Geneious is a proprietary software. By comparing these manipulations, we aimed to test the hypothesis that minor variants directly affect the performance of assembly through VI in real clinical NGS data.
Even with an adequate depth of coverage for genome reconstruction, assembly of total reads (T) in 11/12 experiments resulted in unresolved genome constructionresulting in numerous fragmented viral contigs (Fig. 6). The only exception was one experiment where one single PeV-A3 (S1) genome was assembled using Cap3. When only reads from the major variant were assembled (M), full genomes were obtained for all datasets using SPAdes and Cap3, and for the CV-B5 sample using Geneious. Conversely, assembly of the read bins containing major and minor variants (Mm) resulted in an increased number of contigs for 9 of the 12 sample and assembly software combinations tested (Fig. 6), indicating that VI due to the addition of the minor variant reads likely Fig. 4 The effect of genome length and read length on de novo assembly of simulated variants across a range of percentage identities (PID). a & b Comparison of genome lengths. Six different genome lengths were assembled and the final contig counts were tallied across varying PID thresholds (75-99.6%). For the simulated genome lengths of 2Kb, 10 kb, 100 Kb, and 1 Mb, the average of contig number at each PID was plotted. Panel (b) shows the close-up view where interference was the most prominent. For all six genome lengths and each of the 13 iterations, VI consistently occurred in the same range of PID (99.00-99.24%). The assembly makes a transition from VD to VI at the threshold of 99.00%, and it makes a transition from VI to VS at the threshold of 99.24%. Also, the longer the genome length, the more contigs produced during VI. c The relationship between genome length and the total number of contigs produced. Data from panel (a) were plotted on a logarithmic scale. The total number of contigs produced is significantly dependent on the genome size (r 2 = 0.967; p-value< 0.0001). d and e The effect of read length in variant assembly with a genome size of 100 K. Simulated data with four different read lengths were created and assembled, and the final contig counts were tallied across varying PID thresholds (75-99.6%). Panel (e) shows the close-up view where interference was the most apparent. When longer read lengths were used, the variant interference PID range was much narrower than when shorter read lengths were used to build contigs adversely affected the assembly. The presence of background reads with major variant reads (MB) did not appear to affect viral genome assembly, as the UG 50 % value, a performance metric which only considers unique, non-overlapping contigs for target viruses [24], was similar between M and MB datasets.

Discussion
Our analysis of the GenBank entries quantified the decade-long expansion of NGS technologies and de novo assembly for viral sequencing (Fig. 1). As the number of viral sequences in public databases continues to grow, an important question that naturally arises is how well showing the abundance and length of the generated contigs reveals the impact of variant interference on de novo assembly. The bar graphs show the UG 50 % metric and the length of the longest contig. UG 50 % is a percentage-based metric that estimates length of the unique, nonoverlapping contigs as proportional to the length of the reference genome [24]. Unlike N 50 , UG 50 % is suitable for comparisons across different platforms or samples/viruses. More clinical samples and viruses are analyzed similarly in Fig. 6 Fig . 6 The effect of variant interference on the assembly of four clinical datasets using three assembly programs. Fastq reads were partitioned into four categories: total reads (T), major variant (M), minor variant (m), and background (B). These reads were then combined into four different categories: T, M, major and minor variants (Mm), and major variant and background (MB). Datasets were assembled using SPAdes, Cap3, and Geneious. The bar graphs show the UG 50 % metric and the length of the longest contig. Coxsackievirus B5, CV-B5; Enterovirus A71, EV-A71; Parechovirus A3 (Sample 1), PeV-A3 (S1); Parechovirus A3 (Sample 2), PeV-A3 (S2) current de novo assembly programs perform for datasets with viral variants. Viral variants are expected in biological samples, with the number of variants and the extent of the sequence divergence between variants related to the mutation rate of the virus and the types of specimens that are being investigated. For example, samples containing rapidly evolving RNA viruses, such as poliovirus, HIV, and HCV [9,11,25], environmental samples, [26] and clinical samples from immunosuppressed individuals [27,28] usually harbor many variants. The ability to accurately distinguish variants is imperative to inform treatments (in the case of HIV and HCV), or determine whether a subpopulation of a more virulent variant is present.
Several experiments using simulated and clinical sample NGS data were performed to evaluate the ability of genome assembly programs to distinguish genome variants. All assemblers investigated generated fragmented assemblies when the data contained reads from two closely related variants due to "variant interference" (VI). Changes in pairwise identity (PID) as small as 0.01% between the two variants triggered an assembler to change from producing one or two contigs to producing hundreds of contigs. A quintessential example of this phenomenon was the SPAdes assembly of EV-A71 sequences during the in silico experiments with clinical NGS data. Assembly of major variant reads resulted in one full length contig (Fig. 5), whereas assembly of datasets containing the major and minor variant reads (Mm and T) were characterized by a number of contigs, resulting in "cobwebs" of contig fragments when visualized using Bandage (Supplement Figure S2) [29]. Even though the de novo assembly graph linked the different contig fragments, the assembly could not differentiate the multiple routes of possible contig construction. We speculate this is the main reason why VI occurs in the context of de Bruijn graph assemblers.
The simulated experiments suggested that genome length and read length influence VI; A longer genome length will produce proportionally more contigs during VI, whereas a longer read length decreases the PID range where VI occurs (Fig. 4). While longer read length improves assembly, unfortunately, platforms that produce long reads such as Oxford Nanopore and PacBio have higher error rates [30]. Until long reads can be produced at high fidelity, researchers must continue to rely on combining long-and short-read NGS datasets, and genome polishing techniques [30].
The large number of contigs generated due to VI may be overwhelming for most researchers, and for viral ecology studies, could lead to over-estimation of species richness for methods that use contig spectra to infer richness, such as PHACCS or CatchAll [31][32][33]. This phenomenon may also impact studies differently depending on the overall goal for generating viral sequence data. For example, some researchers may only be concerned with generating a single major consensus genome, even when variants are detected in the data. This is common during outbreak responses for pathogens such as Ebola virus or Middle East respiratory syndrome coronavirus, where detection of SNPs (indicative of minor variants) is not immediately important. On the other hand, some investigations could favor distinguishing variants, such as for investigating the presence of vaccine-derived poliovirus, where a small number of SNPs may distinguish a vaccine-derived strain from a normal vaccine strain genome [28].
The effects of VI could potentially be mitigated by running multiple assembly programs. A previous study testing bioinformatics strategies for assembling viral NGS data found that employing sequential use of de Bruijn graph and overlap-layout-consensus assemblers produced better assemblies [22]. We speculate that this "ensemble strategy" [22] may perform better because the multiple assemblers complement one another by having different VI PID thresholds. Future assembly approaches could also consider resolving the VI problem by possibly discriminating the major and minor variant reads first (perhaps by coverage or SNP analysis), and then assembling major and minor variant reads separately.
Since we observed VI occurring in simulated data from 2 Kb to 1 Mb genome lengths, we speculate that it may not only affect viral data but also larger draft contigs of bacteria and other microorganisms. Even though bacterial mutation rates are much lower than those of most viruses, bacterial variants are common. For environmental studies, bacterial metagenomes are known to contain many related taxa and variants [34][35][36][37], and in clinical investigations, minor bacterial variants can harbor SNPs that provide resistance against antimicrobials. This warrants future investigation into how the presence of variants may impact the assembly of other microbial datasets.

Conclusion
This study aimed to understand how variants affect assembly. As an initial investigation, many confounding factors were simplified for experimentation. Simulated variants studied here only depicted periodic mutations, set at regular intervals. However, in real viral data, SNPs are never evenly distributed across the genome, with zones of divergence and similarity [38,39]. Other important factors which influence genome assembly include sequencing error rates, presence of repetitive regions, and coverage depth. We limited our experiments to keep these factors constant in order to investigate the sole effect of VI. A pilot experiment analyzing three variants demonstrated the VI theory generally holds true even with multiple variants with varying PID (Supplement Figure S3). Introducing a third variant within the VI threshold (Set B) destabilized assembly and caused the contig number to inflate. Through this study, we demonstrated that reads from related genome variants adversely affect de novo assembly. As NGS and de novo assembly have become essential for generating full-length viral genomes, future studies should investigate the combined effects of the number and relative proportion of minor variants, as well as additional assembly factors (e.g., error rates) to supplement this work.

Methods
Analyzing NGS and assembler usage in the virus nucleotide collection in GenBank Viral sequence entries from the GenBank non-redundant nucleotide collection were obtained by downloading all sequences under the virus taxonomy through the end of 2019. A total of 2,793,810 GenBank entries were investigated.
The total number of viral sequences submitted annually in GenBank through December 2019 was calculated by filtering GenBank submissions by "virus," followed by application of the following additional filtering steps: "genomic DNA/RNA" was selected and a "release date: Jan 1 through Dec 31" was applied to find the total number of viruses for a given year. A custom script was used to filter and count all documented sequencing technologies and assembly methods used for each GenBank entry.

Creation of simulated variant genomes and reads
Simulated genomes were generated using custom scripts that randomly assign each nucleotide over a designated genome length with a weighted distribution dependent on the GC content (Supplement Figure S1). The random genomes were then screened using NCBI BLAST to ensure no similarity/identity existed to any classified organism (i.e., no BLAST hits). These simulated genomes served as the initial variant genome (variant 1). To generate the mutated variant genomes (variant 2), a custom script was used to systematically introduce evenly distributed random mutations at rates from 1 mutation in every 4 nucleotides (75% PID) to 1 mutation in every 250 nucleotides (99.6% PID), incrementing by 1 nucleotide.
Following the generation of initial and mutated variant genomes, high-quality fastq reads were generated using ART, [40] simulating Illumina MiSeq paired-end runs at 50X coverage with 250 nt reads, DNA/RNA mean fragments size of 500, and quality score of 93. Fastq reads were combined in equal numbers for the initial and mutated variants and used as input for subsequent de novo assembly experiments (Supplement Figure S1). The same process was utilized to generate the artificial genomes, initial and mutated variant genomes, and reads for each of the experiments. Experiment 1: analyzing simulated reads from variants using different de novo assembly programs The simulated datasets containing reads from two variant genomes with nucleotide pairwise identity ranging from 75 to 99.6% were analyzed using 10 different genome assembly programs using default parameters. The de novo assembly algorithms used were either overlaplayout-consensus (OLC) [Cap [41] and Mira [42,43]], de Bruijn graph (DBG) [ABySS [44], IDBA [45], MetaS-PAdes [46], SOAPdenovo2 [47], SPAdes [48], and Trinity [49]], or commercial software packages [CLC (https://www.qiagenbioinformatics.com/) and Geneious [50]] whose assembly algorithms are proprietary (Supplement Table S6). The simulation settings for the reads were paired-end reads, 250 nt read length, and 50X coverage. A total of 2470 assemblies (247 datasets per genome X 10 assemblers) were analyzed (Supplement Figure S1a).

Experiment 2: simulated data by varying genome length and GC content
Artificial genomes were constructed for four genome lengths: 2 Kb, 10 Kb, 100 Kb, and 1 Mb, with varying GC content from 20 to 80%, in 5% increments (Supplement Figure S1b). Datasets derived using one poliovirus genome (NC_002058) and one coronavirus genome (NC_002645) were also included in this analysis, representing the lower and upper genome length range typical of RNA viruses. The original GC content was kept constant for the poliovirus and coronavirus genomes. For all of these genomes, simulated reads for initial and mutated variants were generated as above.
A total of 13,338 SPAdes assemblies were generated, which included 12,844 assemblies for the four artificial genomes (247 datasets per genome X 4 artificial genome lengths X 13 GC content proportions X 1 assembler) and 494 assemblies for the poliovirus and coronavirus datasets (247 datasets per genome X 2 genomes X 1 assembler) (Supplement Figure S1b). JMP v13.0.0 (www. sas.com) was used to calculate Pearson's correlation and Spearman's ρ values to compare the association between percent GC levels and the number of contigs produced at each PID level. Since there was little statistical difference when comparing the contig numbers generated at varying percent GC for each of the four genome length datasets (Spearman's ρ = 0.8299 to 0.9801, p < 0.001) (Supplement Excel file), the final contig number was averaged across the 13 GC percentages at a given PID. The average contig number was used for plotting the contig assembly results vs percent PID for each simulated genome length (Fig. 4a-b).

Experiment 3: simulated data by varying read length
Genome variants were generated as described above ("Creation of simulated variant genomes and reads") for a genome of size 100 Kb with 50% GC; this was the starting initial variant genome. In this simulation, initial and mutation variant reads at four sequencing read lengths (50, 100, 150, and 250 nt) were created using ART. A total of 538 SPAdes assemblies were generated (47,97,147, and 247 datasets for the 50, 100, 150 and 250 nt read lengths, respectively) (Supplement Figure S1c).

Evaluation of NGS datasets from clinical samples
Four datasets derived from clinical samples containing picornaviruses (one enterovirus A71 [EV-A71], one coxsackievirus B5 [CV-B5] and two parechovirus A3 [PeV-A3]) were analyzed for this experiment, as previous sequencing analysis using Geneious indicated the presence of genome variants. The datasets were analyzed using an in-house pipeline (VPipe), [25] which performs various quality control (QC) steps and de novo assembly using SPAdes. The post-QC reads were considered total reads (T) and mapped to their respective reference genome in order to determine the major and minor variants present in each sample. Total reads which mapped with high similarity (≥99%) to the major variant were categorized as reads representing the major variant (M). Unbinned reads from the major variant reference recruitment were used to construct the minor variant consensus using a second round of reference recruitment, and these reads were categorized as the minor variant (m). Remaining reads from the previous two steps were considered background (B) reads.
De novo assembly for each of the four clinical samples was performed for the following binned NGS datasets: (1) total reads only (T); (2) major variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background reads only (MB). This was repeated with three assembly programs: SPAdes, Cap3, and Geneious. The length of the longest contig produced from each assembly and the performance metric UG 50 % [24] were calculated to compare the results for these 48 assemblies (4 experiments X 4 viruses X 3 assemblers).
Additional file 1. Supplemental Text: Analysis of viral GenBank records demonstrated the advent of NGS in viral sequencing in the last two decades. Supplement Figure S1. Workflow diagrams of simulated data from data creation through de novo assembly. Supplement Figure S2. Analysis of the final contig assembly graphs for a clinical sample containing enterovirus A71 (EV-A71) variants using Bandage. Supplement Figure S3. Assembly with three simulated variants. Supplement Table S1. Total counts from NCBI's GenBank nonredundant nucleotide database. Supplement Table S2. Total count of sequencing technologies for sequences >2000 nt in the NCBI GenBank non-redundant nucleotide database for years 2012-2019. Supplement Table S3. Total counts from NCBI's GenBank non-redundant nucleotide database with multiple sequencing technologies listed per entry. Supplement Table S4. Total counts from NCBI's GenBank non-redundant nucleotide database of all entries with three and four sequencing technologies listed. Supplement Table S5. Total count of assembly programs used to generate sequences >2000 nt in the NCBI GenBank nonredundant nucleotide database. Supplement Table S6. The 10 de novo assemblers used for analysis of the simulated data, as categorized by their underlying assembly algorithms.
Additional file 2. Number of contigs generated by SPades using the simulated data of two variants differed from 75% PID to 99.6% PID, across 20%-80% GC content.