Benchmarking different approaches for Norovirus genome assembly in metagenome samples

Background Genome assembly of viruses with high mutation rates, such as Norovirus and other RNA viruses, or from metagenome samples, poses a challenge for the scientific community due to the coexistence of several viral quasispecies and strains. Furthermore, there is no standard method for obtaining whole-genome sequences in non-related patients. After polyA RNA isolation and sequencing in eight patients with acute gastroenteritis, we evaluated two de Bruijn graph assemblers (SPAdes and MEGAHIT), combined with four different and common pre-assembly strategies, and compared those yielding whole genome Norovirus contigs. Results Reference-genome guided strategies with both host and target virus did not present any advantages compared to the assembly of non-filtered data in the case of SPAdes, and in the case of MEGAHIT, only host genome filtering presented improvements. MEGAHIT performed better than SPAdes in most samples, reaching complete genome sequences in most of them for all the strategies employed. Read binning with CD-HIT improved assembly when paired with different analysis strategies, and more notably in the case of SPAdes. Conclusions Not all metagenome assemblies are equal and the choice in the workflow depends on the species studied and the prior steps to analysis. We may need different approaches even for samples treated equally due to the presence of high intra host variability. We tested and compared different workflows for the accurate assembly of Norovirus genomes and established their assembly capacities for this purpose. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-08067-2.


Background
Many viruses have high mutation and recombination rates, producing heterogeneous mixtures of viral strains. This rapid evolution favors the development of functional advantages such as evasion of the host immune response [1][2][3][4] and vaccine/drug resistance [5,6]. Characteristics of infection and pathogenicity in these viruses are also influenced by selective pressure [7][8][9].
The heterogeneity underlying viral genomes complicates their characterization using sequencing experiments. Over recent decades, high-throughput sequencing (HTS) has emerged as an important resource for viral studies [10][11][12].
Among viruses with high mutation rates and presence of large numbers of genetic variations or quasispecies are Noroviruses, a group of the Caliciviridae family. They are positive and single-stranded RNA viruses without a lipid envelope, with genome lengths varying from 7.5 to 7.7 Kb [13]. These viruses are known to be highly pathogenic and infectious, and are responsible for most cases of acute gastroenteritis (50 % of all outbreaks worldwide), although symptoms are generally non-lethal [14]. Their genome is composed of three open reading frames (ORF), the first one encoding a polyprotein cleaved into six non-structural proteins including an RNA-dependent RNA polymerase (RdRp) [15]. ORF1 overlaps in a short region with ORF2, which encodes for the capsid protein VP1 (major capsid protein) while ORF3 encodes for VP2 (minor capsid protein) [16]. Of the ten currently identified genogroups, in humans the represented genogroups causing infections are GI, GII, GIV, and recently, GVIII and GIX [17]. These genogroups are established based on a minimum 43 % difference between VP1-coding sequences [17,18]. However, 60 % of norovirus infections are attributable to genotype GII.4 [16,19,20]. As with other RNA viruses, the intra-genus variability of Noroviruses provides them with fast evolving capacity and makes their characterization and sequencing challenging [21][22][23].
Nowadays, HTS-based short read assemblers are widely used for reconstructing bacterial and viral genomes. Although HTS platforms and bioinformatics methods have evolved over the past few years, "de novo" assembly is still arduous and computationally expensive. Many of these assemblers are based on de Bruijn graph methods [24]. The strategy is to generate substrings of length k (k-mers) and form a path with overlapping sequences, constituting a graph and thus generating large contigs to reconstruct genomic regions [24,25]. These computational algorithms were designed with the increasing use of short-read sequencing approaches to obtain contigs for assembly into scaffolds. Gene-centric and genome-centric assemblies require different approaches, as there are different ways to tackle short-read sequencing in metagenomics. SPAdes and MEGAHIT are frequently used de Bruijn graph-based assemblers in genome-centric studies due to their large contig yield [26].
One notable difference between SPAdes and MEGA-HIT is that the former is more computationally expensive, working with the whole set of sequences in all assembly iterations, whereas the latter saves computational resources by considering only k-mers occurring over a determined cutoff length. metaSPAdes (SPAdes with --meta flag) is an alternative since it constructs a consensus sequence from different strain variants [27].
Compared with other organisms, virus genomes are generally difficult to assemble, not only because of the interspecies variability present in metagenome samples, but also due to the high genetic variability presented in viral particles [28][29][30][31][32]. Our goal was to reconstruct the genomes of Norovirus strains from stool samples from patients with gastroenteritis and diarrhea, testing different workflows and evaluating the use of read binning along with metagenome assembly. Our aim was to obtain large contigs spanning the whole genome of Norovirus for all the non-related samples, for which we used different analysis strategies along with MEGAHIT [33] and SPAdes [34].

Assembly
Raw data obtained from eight human Norovirus samples passed FASTQC (v0.11.5, Babraham Bioinformatics) quality filters regarding the parameters per base sequence quality, per sequence average quality, N content and adapter sequences after the trimming steps described in the methods section. Mean read length was 100 bp as expected from library preparation. As shown in Table 1, sequencing experiments yielded a mean of 40 million total paired reads.
Different workflows with varying filtering steps before assembly were tested (Fig. 1). Identity percentages between the assembled genomes and the most related references from the viral RefSeq genomes database are represented in Table 1.
After the characterization with Norovirus genotyping tool of the longest NoV contig obtained by means of BLAST against the selected GenBank references (Additional File 1; filtered BLAST results for contigs >7 kb), completeness was assessed using checkV [35]. Four different strategies were tested: pipelines A, B, C or D (pA, pB, pC or pD, respectively). As complete NoV genomes were obtained from different strategies for the same samples (Table 2), final complete contigs were chosen and made publically available [36,37]. The final contig chosen for sample A was from raw SPAdes, and CD-HIT SPAdes for sample C. For the rest, MEGAHIT was chosen: raw MEGAHIT in the case of sample B, and CD-HIT MEGAHIT in the case of samples D, E, F, G and H. (Table 1). Even though these were the final contig genomes published, all complete contigs are detailed in Table 2, and a summary with the average completeness per strategy can be found in Table 3. Identity percentages shared between final contigs and the rest of contigs yielded by different strategies are detailed in Additional File 2; alignment images between the main contig per strategy in a sample and the closest GenBank reference genome are presented in Additional files 8, 9, 10, 11, 12, 13, 14 and 15. In the case of SPAdes, only contigs obtained with custom kmer-lengths are indicated as autoadjusted kmer-lengths (21,33,55) used by default did not result in complete NoV genome contigs for most samples (Additional File 3).  * not complete. SPADes complete (or nearly complete) NoV genome contigs are shown with the kmer length that generated the contig.^contig lost in following kmer-length steps. pA, pB, pC and pD: pipelines A, B, C or D respectively. Genotypes are identical between pipelines, as indicated in Table 1 Overall, MEGAHIT performed better than spades in most approaches, with at most 2 NoV contigs with length < 7.5 kb (1 in pA; 2 in pB; 1 in pC and 1 in pD). pA showed the best results for MEGAHIT with an average of NoV genome completeness of 99.82 %. 7/8 final NoV contigs were 100 % complete excepting sample G (98.59 %). In the case of SPADES, the strategies pA, pB and pC left 3 samples with genome completeness below 97 %. With the use of read binning (pD) along SPAdes 8/8 samples were completely assembled (samples B and F nearly complete in 99.26 and 99.34 % respectively).
Regarding pA, an average of 72.75 % of total reads were selected after removing host-mapped reads (detailed by sample in Table 1). These reads were selected for assembly using SPAdes and MEGAHIT.
With MEGAHIT, complete NoV contigs were obtained for 8/8 samples (sample G was nearly complete 98.59 %). In the case of SPAdes with custom kmerlengths, genomes surpassing 7. Percentages of unique reads mapping NoV reference genomes used in pB strategy were not even for the different samples as shown in Table 1. The average of these percentages was highly similar to that of the unique number of reads covering the complete genomes assembled (47 % vs. 45 %), indicating that NoV reads mapped successfully against the references used and that is resembled in the completeness achieved with this strategy. Adjusted kmer-lengths for SPAdes accomplished contigs over 7. 5

Assembly statistics
We compared assembly qualities between MEGAHIT and SPAdes data from the approaches tested. Tables 4  and 5 show the results obtained. Tables with assembly statistics per sample are in Supplementary Data (Additional File 4 for all obtained contigs and Additional File 5 for Norovirus contigs).
Regarding MEGAHIT, the total number of contigs obtained in pD was reduced by 22 % compared to pC. From the total number of assembled contigs, 0.22 % belonged to NoV in pC, whereas using CD-HIT (pD) this value was 2.41 times higher (0.53 %), being the number of NoV contigs 1.88-fold higher in pD. Mean N50 was also increased with CD-HIT to 5010 in NoV contigs (pC mean N50 4554), improving the assembly from 98.83 to 99.14 % average completeness.
MEGAHIT assembled Norovirus whole genomes in the whole set of samples with pA (sample G 98.59 % completed). Average completeness for this approach was the highest for MEGAHIT (99.82 %). Average NoV contigs N50 value was 4567, highly similar to that in pC, and average NoV contigs proportion (0.39) was lower to that in pD. The number of contigs respecting to pC was reduced by 16 %. pB MEGAHIT had an average genome completeness of 99.30 %. The number of total contigs was 9-fold lower than pC and from these, 1.8 % belonged to NoV.
MEGAHIT performed successfully assembling complete NoV genomes along all the approaches and even though pC had the lowest average completeness, the results are almost equivalent in the 4 strategies.
Regarding SPAdes, the percentage of NoV contigs was 7.7-fold higher in pC (1 %) than pD (0.13 %). The total number of NoV contigs was reduced by 11 % using pD. However, the proportion of useful NoV contigs (>5000pb) was 7.75 times higher in pD. N50 was similar in the two approaches considering all contigs and only In this case strategies pA and pB did not present any advantage compared to pC which is the simplest approach. The level of completeness in pB was 95.3 % and even though in pA this value was improved (97.13 %), 4/8 samples do not reach NoV contig lengths over 7.5 kb (so as in pC and pB). The proportions of NoV contigs over 5 kb were 0.6 and 0.8 % in pA and pB for SPAdes. Again, this proportion was 11 and 8-fold higher in pD (pipelines A, B and C produced more NoV contigs but only a few reached completeness, whereas in pD the number of total NoV contigs was lower).

SPAdes kmer-lengths
Spades autoadjusted kmer-lengths used for assembly regarding read length (100 bp) to 21, 33 and 55 were used for the assembly at first but SPAdes did not yield contigs near 7Kb in the majority of samples (Additional file 3).
After obtaining notable differences in the number of assembled NoV genomes using SPAdes versus MEGAHIT we sought to test whether these differences were due to the kmer-lengths used by each assembler. Kmer-lengths used by MEGAHIT were 21, 29, 39, 59, 79, 99 and 119.
For that reason, we tested all the strategies with SPAdes using a set of kmer-lengths of 21, 33, 55, 77, 99 and 119 (intermediate kmer-lengths chosen are the recommended in the SPAdes manual for 250 × 2 bp read lengths). The last kmer-length was not used as it surpassed read length. The rest of the parameters used were the same and the option -meta was also included. As shown in Table 2, contigs over 7k were accomplished but excepting pD strategy, only half of the samples NoV genomes were assembled. In other words, complete assemblies were only reached with pD strategy and custom kmer-lengths. Besides, some assemblies were reached at different kmerlengths and in certain cases, these were lost in the following kmer-length steps ( Table 2 shows kmer-lengths at which the contig was assembled in the case of SPAdes complete or nearly complete contigs; contigs marked witĥ are lost in the following kmer steps), not reporting the longest contig in the final assembly FASTA file. In the   Table 2 is the only sample appearing complete at step K55, which was the last used by default SPAdes.

Variants
The eight studied samples had an average number of 95.6 mutations against the final assembled genomes (including variants present in over 1 % of reads). The number of variants found against the final contigs for each sample were 230, 180, 187, 15,55,43,23,32 in the order A-H. A, B and C exhibited the highest number of variants (230, 180 and 187) over 1 %. However, when considering higher variant frequencies (>=10 %), only sample A maintains high variability, being the number of nucleotide changes identified 70, 5, 13, 3, 1, 7, 4, 3 variants in order from A to H. In general, variation frequencies are higher towards the 3' end of the virus genome (Fig. 2).
Sample A could exhibit a co-infection of various NoV strains due to the presence of a high number of nucleotide changes with respect to the final contig assembled (variant frequencies ranging from 10 to 57 %; Fig. 2). Interestingly, all the strategies shown in Table 2 result in the assembly of the same contig genotype although they represent a small part of the reads corresponding to NoV strains. The percentage of total reads covering the final contig with genotype GII.17 was 0.026 %. In Additional File 6 the 20 GenBank NoV references with most reads mapped for all samples are reported, showing different genotypes in the case of sample A (GII.12, GII.4, GII.17, GII.3, GII.2), whereas in the rest the majority were consistent with the final contig genotype.

Computation resources
We compared the computational resources used by both MEGAHIT and SPADES with the raw assembly and after read-binning. MEGAHIT in pC used maximum 50Gb memory for the four samples tested, taking 7 h to complete assembly (1.75 h/sample). Use of the 16 CPUs was 12 %. With the pD strategy, it required 20 % CPU and a peak memory of 60Gb, assembling all tested samples in just 1 h.
We studied the time and resources used by SPAdes with four test samples using intra-sample threading. It took 8 days to assemble these samples (mean, 44.25 h per sample) without error correction, 186 Gb of memory and 16 threads with metaSPAdes. Use of CPU was 100 % with peak memory of 160Gb. With the pD binning strategy, the same test required 30 min for all samples with 45Gb peak memory and maximum CPU usage under 10 %.

Discussion
Different strategies are needed for de novo genome assembly, especially for RNA virus genomes, which present a higher substitution rate than any other microorganism (10 −3 -10 −5 /site/year) [22,23,[38][39][40]. Although there are plenty of assemblers and even specific viral assembly pipelines [41,42], they do not always ensure genome completion. Due to the variable nature of metagenomic data, there is no strict workflow to obtain the best assembly and results show high variability. Even with samples treated equally, different assembly strategies may be required to obtain optimal results. Data exploration prior to data analysis is crucial and rigorous analysis is needed to achieve genome completion. Accordingly, in this study we explored several different strategies used to assemble Norovirus genomes in non-related patients.
Interspecies and intraspecies variability was a major limitation in the analysis, and the diversity present in our data is further confirmed by the low proportion of contigs belonging to NoV in Table 1. Assembly quality did not present any advantage with respect to the raw assembly pipeline when a specific NoV referenceoriented analysis was performed. Only the host-reads filter presented improvements when filtering or gathering reads that could belong to Norovirus in the case of MEGAHIT. When attempting to select reads mapping against NoV genomes, a great number of reads were filtered out, with a highly variable percentage of mapped reads between samples (Table 1). Neither were advantages found when removing human reads (pA) in the case of SPAdes, even though some studies have used mapping to host genomes to remove nonviral reads. Several studies address the use of reference-guided assemblies to reconstruct viral genomes. However, the presence of intra-host variability can cause biased alignments and references have to be chosen carefully [43,44].
The mRNA isolation strategy enabled Norovirus viral representation with read fractions ranging from 0.02 to 98.5 % due to a large variability in virus load of each patient (Table 1, total Norovirus reads covering contig and Ct value qRT-PCR). The variability in Norovirus representation suggested a need for metagenomic assembly, for which purpose we preferred to use the "meta" varieties in the case of SPADEs, expecting that contigs and scaffolds for other organisms would also be present.
Our strategy to counteract this variability was to use a clustering algorithm to reduce raw data complexity, selecting for the purpose CD-HIT, a tool widely used in metagenomics specifically for reducing redundancy and sequencing replicates in metagenomic samples. Among assemblers, MEGAHIT performed better than SPAdes, as samples were successfully assembled independently of CD-HIT use (all strategies were successful). CD-HIT was most advantageous when applied before SPAdes assembly, as it is computationally more expensive and could not yield Norovirus contigs reaching 7.5 kb in 4 samples without read binning in any of the approaches ( Table 2). By default, SPAdes worked better with samples with lower coverage (Table 1; the only sample (A) fully assembled with raw SPAdes with the default autoadjusted kmerlengths up to 55 had 455x mean coverage, 11-fold lower than the average coverage obtained in all samples). Samples with higher coverage obtained better assembly results after kmer-length 77 and read binning. The stringency used along CD-HIT may vary depending on the raw data and tuning steps are advisable to avoid the loss of coinfecting strains.
As reported in previous studies [14,32,45], SPAdes is a widely-used short-read assembler for general use in viruses, including Norovirus studies. Nevertheless, in our specific scenario we obtained more optimal results with MEGAHIT. All approaches assembled completely 6/8 samples with MEGAHIT. Moreover, pC MEGAHIT assemblies improved slightly when combined with CD-HIT (pD) in completeness, and also regarding the previously described efficient performance and low computational resource requirements. Our data thus support use of MEGAHIT for in-depth Norovirus assembly and by extension for other RNA viruses with high sequencing depths, whereas SPAdes will perform optimally with lower-coverage sequencing experiments. Combined with a step to reduce sequence redundancy, SPAdes will improve assembly quality while reducing data complexity [26,46,47].

Conclusions
We tested different workflows for the accurate and complete assembly of Norovirus genomes. These included different filtering steps and their subsequent assembly with both SPAdes and MEGAHIT. Even though there is no universal workflow for viral RNA assembly, NoV genomeoriented strategies did not present advantages compared to assembly without filters for any of the strategies, with the exception of host-filtering for MEGAHIT. We describe the performances of MEGAHIT and SPAdes and the use of read-binning to improve assembly statistics.  Table 1).

Norovirus genome assembly
All assembly steps were performed on a local server (16 Intel ® Xeon ® CPU E5-2650 0 @ 2.00 GHz processors, 190 GB of RAM and 41 TB disk space) using 16 CPU threads. Use of computational resources was coordinated using GNU Parallel [48]. Tests with a fraction of the samples were performed to study time and RAM memory required to complete assemblies using NMON v14g [49] with both assembly from raw reads (pC) and read binning (pD).
Before assembling the Norovirus genomes, we performed read quality control using FASTQC (v0.11.5, Babraham Bioinformatics), and quality filtering using seqtk 1.2-r101-dirty with the default parameters (trimming up to 30 bp from each side following a 0.05 error rate threshold) [50].
Four variations of the assembly strategy were implemented to obtain Norovirus genomes (Fig. 1). Since our biological data is poly-A RNA, we expected to find human mRNA representation. Therefore, in the first derivation (pipeline pA), human reads were removed ( Fig. 1; pA). Trimmed FASTQ files were mapped on the hs37d5 reference assembly (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/.../hs37d5.fa.gz) using BWA mem v0.7.25-r1140 [52]. Reads not mapped against this reference were selected for assembly using both MEGAHIT and SPAdes with the previously described parameters, as well as --meta flag in SPAdes.
Pipeline pB directly selected Norovirus reads from the original FASTQ files by mapping against GenBank NoV genome sequences (483 genomes: updated in July 2021; Additional file 7 includes accession numbers) ( Fig. 1;  pB). Trimmed FASTQ files were mapped against the former FASTA reference file as in pA, selecting mapped reads against the NoV genomes reference for assembly in this case. Pipeline pC consisted of assembling trimmed FASTQs without applying any filtering steps ( Fig. 1; pC).
Finally, pipeline pD consisted of performing sequence binning via CD-HIT [53] on the raw quality-trimmed FASTQ files, clustering sequencing reads at 80 % identity to reduce sequence redundancy. SPAdes and MEGAHIT were run with the same parameters as previously used for a final round of metagenome assembly, using CD-HIT preprocessed FASTQ files as input ( Fig. 1; pD).
Norovirus contigs were identified using a local BLAST database built from reference Norovirus genomes obtained from GenBank (483 genomes: updated in July 2021; disclosed in Additional File 7). All assembled contigs were subjected to BLASTN [54] v2.2.31+ search, and Norovirus contigs were retrieved.
Quality statistics from all generated assemblies and filtered Norovirus contigs were assessed using QUAST v4.6.3 [55]. Raw quality-trimmed sequencing reads were mapped to the assembled Norovirus contigs using BWA mem to assess the volume of reads corresponding to Norovirus. Mean depth of coverage for each complete Norovirus genome per sample was calculated using Samtools v1.7 [56], on BAM files generated by mapping all sample reads to their corresponding final de novo assembled genomes using BWA mem. Open reading frames were predicted with GeneMark v3.25 [57] and inspected using Norovirus Genotyping tool v2.0 [58]. Final assemblies were chosen according to comparisons based on N50, contig length and completeness assessed with checkV [35].
For the sake of comparing the accuracy of the two assemblers used, the different workflows tested were compared (Fig. 1).

Norovirus variability among samples
After raw read mapping to the final assembled Norovirus contigs, variant calling was performed using Freebayes v1.2.0 [59]. Only variants at 1 % VAF and at least six alternate reads were considered.
Abbreviations HTS: High-Throughput Sequences; ORF: Open Reading Frame; pA, pB, pC or pD: pipelines A, B, C or D, respectively; SNVs: Single Nucleotide Variants; VAF: Variant Allele Frequency