A comprehensive evaluation of long read error correction methods

Background: Third-generation single-molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle the problem by exploiting sampling redundancy and/or leveraging accurate short reads from the same biological samples. Existing studies assessing these tools use simulated data sets and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used.

Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment that covers correction quality as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of post-correction length distribution and genome coverage, and how error correction performance impacts an important application of long reads, genome assembly. We provide guidelines for practitioners on choosing among the available error correction tools and identify directions for future research.

Conclusions: Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in both correction quality and computing resource usage. When choosing tools, practitioners are advised to be cautious with the few correction tools that discard reads, and to check the effect of error correction on downstream analysis. Our evaluation code is available as open source at https://github.com/haowenz/LRECE.
Supplementary Information The online version contains supplementary material available at (10.1186/s12864-020-07227-0).

• HG-CoLoR: commit cc18a95 was used. The maximum k-mer size (-maxK) of the variable-order de Bruijn graph was set to 100, as suggested in its paper. The number of top alignments reported by BLASR [2] (-bestn) was set to 30, 40 and 50 for the E. coli, yeast and fruit fly data sets respectively, scaled with the short read sequencing depth as recommended in its manual.
• FMLRC: version 0.1.2 was used. Before running FMLRC, a BWT of the short reads was constructed with msBWT [3] version 0.3.0 using default parameters, according to its manual. FMLRC was run with the default short and long k-mer lengths (21 and 59, respectively).
• HALC: version 1.1 was used. Prior to running HALC, ABySS [4] version 2.0.2 was used to generate contigs from short reads with a k-mer length of 64, as suggested in its paper. HALC was run with -o (ordinary mode) so that LoRDEC is used to refine the repeat-corrected regions.
• CoLoRMap: OEA mode was not tested in the experiments since it did not significantly improve correction quality but roughly doubled the runtime. No other parameters are available for tuning.
• Jabba: version 1.0.0 was used. Prior to running Jabba, karect [5] and brownie (https://github.com/biointec/brownie) were run to correct the short reads and construct the de Bruijn graph from them. The de Bruijn graph k-mer size was set to 75 (-k 75), according to the user manuals of brownie and Jabba.
• Nanocorr: the long reads were partitioned using the provided script 'partition.py'. The number of reads per file was set to 100, and the number of partitioned files per directory was also set to 100. Read files in the same directory were processed in parallel. To reduce disk usage, temporary files were removed immediately after the corrected reads were extracted.
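The partitioning layout described above (100 reads per file, 100 files per directory) can be sketched as follows. This is a minimal illustration, not Nanocorr's actual partition.py; the function name and the file/directory naming scheme are hypothetical.

```python
import os

def partition_fasta(records, out_dir, reads_per_file=100, files_per_dir=100):
    """Write (name, sequence) records into FASTA files of `reads_per_file`
    reads each, grouped into subdirectories of `files_per_dir` files.
    Returns the list of file paths written."""
    paths = []
    for i in range(0, len(records), reads_per_file):
        file_idx = i // reads_per_file
        dir_idx = file_idx // files_per_dir
        subdir = os.path.join(out_dir, f"part_{dir_idx:04d}")
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, f"reads_{file_idx:06d}.fa")
        with open(path, "w") as fh:
            for name, seq in records[i:i + reads_per_file]:
                fh.write(f">{name}\n{seq}\n")
        paths.append(path)
    return paths
```

Files within one subdirectory can then be corrected in parallel, with temporary outputs deleted as soon as the corrected reads are extracted.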
• proovread: version 2.14.1 was used. The short read coverage parameter was set according to the data sets used.
• ECTools: the long reads were partitioned and processed in the same way as for Nanocorr. ABySS version 2.0.2 was used to generate contigs from short reads with the same parameters as for HALC. Nucmer version 3.1 was used for alignment.
• LSC: version 2.0.1 was used. The default parameters were used.
• FLAS: commit 053c19b was used. The default parameters were used.
• LoRMA: version 0.5 was built with GATB version 1.4.0. LoRDEC version 0.8 was used. The default number of friends (-friends 7) and k-mer size (-k 19) were used.
• Canu: version 1.7 was used. -correct was used to generate corrected reads rather than running the whole assembly pipeline. The genome size was set to the size of the reference genome. Parameter 'useGrid' was set to false to run Canu on a single node. Parameter 'stopOnReadQuality' was also set to false so that Canu kept running even when read quality was low. Parameters '-pacbio-raw' and '-nanopore-raw' were used for PacBio and ONT reads respectively.
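The Canu settings above can be assembled into a command line as sketched below. This is an illustrative helper under stated assumptions: the file names and genome size are placeholders, and only the options named in the text (plus Canu's standard -p/-d output options) are included.

```python
def canu_correct_cmd(prefix, out_dir, genome_size, reads, technology="pacbio"):
    """Build the Canu 1.7 read-correction command described above.
    `technology` selects -pacbio-raw or -nanopore-raw."""
    raw_flag = {"pacbio": "-pacbio-raw", "nanopore": "-nanopore-raw"}[technology]
    return [
        "canu", "-correct",           # correction stage only, not the full pipeline
        "-p", prefix, "-d", out_dir,  # output name prefix and directory
        f"genomeSize={genome_size}",  # set to the reference genome size
        "useGrid=false",              # run on a single node
        "stopOnReadQuality=false",    # keep running even if read quality is low
        raw_flag, reads,
    ]

# Hypothetical usage; run the returned list with subprocess.run, for example:
# canu_correct_cmd("ecoli", "canu_out", "4.6m", "pacbio_reads.fastq")
```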