Detecting pathogens such as bacteria or viruses that cause infections such as pneumonia and meningitis is an important step in clinical diagnosis. One problem in detecting pathogens is that traditional methods of pathogen detection are time-consuming, as an infectious disease may be caused by a large range of pathogens, which have to be checked one by one. Another problem is that up to 60% of the pathogens in some infectious diseases cannot be detected [1]. This can cause a delay in treatment or even mistreatment of patients.
Unbiased next-generation sequencing (NGS) can detect DNA fragments (reads) of all species in a metagenomic sample with a mixture of different species. Those NGS reads can be classified into different taxa by comparing them with a collection of reference sequences, and pathogens can be detected if some reads match them. In clinical diagnoses, it is essential that a classifier can detect a significant number of reads supporting the potential pathogens and report as few false classifications as possible to give a high abundance rank to the pathogen. Otherwise, the pathogen cannot be distinguished from background noise, and it will take doctors a long time to go through a long list of candidates to dig out its existence.
Existing metagenomic classifiers are not effective for detecting low-similarity pathogens, i.e., pathogen with a genome that is not similar to the reference. This is because most classifiers detect pathogens by constructing a characteristic profile (e.g., k-mers) for each reference and assigning reads to species by comparing them with the reference profiles [2, 3]. When the characteristic profile does not match the genome of low-similarity pathogens, this approach fails and results in many incorrect or non-specific classifications.
Some tools assign reads to reference sequences by local or semi-global alignment. Using an alignment-based method, more reads can be assigned to the causal pathogen, but at the cost of much longer analysis time (over 4 h for a typical 1 Gb dataset). However, the alignment score of reads from a low-similarity pathogen is conceivably low, and these reads often cannot be assigned to the causal pathogen specifically, so the number of reads supporting the causal pathogen is still too low.
To detect low-similarity pathogens, we developed MegaPath for NGS-based pathogen detection. It has two significant contributions. First, instead of assigning each read to a reference sequence one by one, MegaPath analyzes all aligned reads globally to sort out a subset of reads with confident alignments. Then, MegaPath reassigns the non-specifically aligned reads to the species with confident alignments and discards spurious alignments to avoid potential false classifications. The reassignment increases the number of reads supporting the causal pathogens and reduces the number of false-positive assignments. Second, MegaPath implements a fast alignment-based approach, utilizing an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm.
Let us take a metagenomic NGS sample of cerebrospinal fluid [4] as an example. The similarity of the pathogen to reference is 18.9%. Centrifuge [2], CLARK [5] and Kraken [3], based on characteristic profile, detected 31, 1 and 6 reads from the pathogen, respectively. The abundance rank of the pathogen was 710, 1488 and 384, respectively. With that, a medical doctor needs to go through a list of hundreds of species to reach the causal pathogen. Kraken2 [3] is the successor of Kraken that applies more sophisticated characteristic profiles. It detected 74 reads from the pathogen and the abundance rank of the pathogen went up to 176. SURPI [6] spent four hours on read alignment and detected 76 reads from the pathogen, abundance rank at 264. In contrast, with better alignment tools and global analysis of reads, MegaPath took less than one hour and detected 608 reads for the pathogen, abundance rank at 33. In our experiment, MegaPath performed better than the existing tools, ran faster than the alignment-based tools and has comparable running time with the less sensitivity profile-based tools.
In addition to detecting pathogens with known reference sequences, MegaPath can detect novel pathogens without any similar DNA-level sequences in the reference database. MegaPath uses MegaHit [7] to assemble the reads from the novel pathogens to longer DNA fragments. Since protein sequences are better-conserved than DNA sequences [8, 9], these DNA contigs from novel pathogens are then annotated by DNA-protein alignment [10] to detect related species, genera or families.
Implementation
Figure 1 shows the workflow of MegaPath. There are three major steps in MegaPath for detecting pathogens. First, it applies MegaGuide, an ultrafast NGS aligner specifically designed for pathogen detection to align reads to reference sequences of bacteria and viruses. Then it applies spike polishing to filter spurious alignments at highly repeated regions. Lastly, it applies a two-phase taxonomy assignment of reads.
Aligning reads with MegaGuide
MegaGuide is an ultra-fast aligner that follows a seed-and-extend procedure. In the seeding stage, MegaGuide searches for maximum-mappable seeds in the read using a BWT built from the reference sequences. The search will stop at (or a few bases after) a sequencing error or a genomic mutation. A new search will start at the end of the previous seed. The maximum-mappable seeding strategy reduces a vast number of short seeds that cannot be used to find the true alignments. The overlap between the seeds boosts sensitivity, especially for distant species. In the extension stage, MegaGuide implements an improved Smith-Waterman algorithm with each entry in the dynamic programming tables using 8 bits or less (instead of the normal implementation using 64 bits). Thus, by applying the 256-bit SIMD instructions, values of 32 entries can be calculated in parallel.
The reference database is large (the size of the latest RefSeq is over 30 Gb). It is neither efficient nor necessary to align all reads to all reference sequences. Thus, the following two types of reads, which are not informative for pathogen detection, will be filtered out from the downstream analysis. First, clinical metagenomic samples are usually dominated with human DNA (which could be as high as 95%) that carries no pathogen information. Second, short NGS reads sampled from the repetitive or homologous regions across species, e.g., ribosomal DNA, are not useful for pathogen detection. MegaPath filters out the two types of reads by aligning reads to the human reference genome and a database of homologous regions. Reads confidently aligned to the human reference genome or the homologous regions will be removed. The remaining reads will then be aligned to the pathogen genome database for pathogen detection.
Spike polishing
Aligning reads to human genome or homologous regions can detect most of the reads that are not informative for pathogen detection. However, not all of them can be detected due to the missing annotation or the incomplete classification of all possible homologous regions. To filter out reads sampled from unknown homologous regions, MegaPath makes use of the read depth information of each genome. Since MegaGuide aligns a read to all its possible genome positions, the read depth of the homologous regions is expected to be much higher than the other regions. MegaPath calculates the mean (u) and the standard deviation (sd) of the read depth of each genome. Continuous regions with a read depth higher than u + α·sd are defined as spike regions. All alignments in the spike regions will be removed from the downstream analysis. We have tried the value of α from 1 to 100. When α < 10, there are many regions misidentified as spike regions and are removed. When α > 50, only a few spike regions were detected and many homologous regions remain. The filtering has the best performance for α being from 10 to 50. Thus, we select α = 30 as default.
The two-stage taxonomy assignment algorithm
It is common that a read is aligned to different genomes with the same (or similar) confidence. These genomes may be from different species, genus, or even families. A straightforward way to assign a read to a taxon is assigning it to the lowest common ancestor (LCA) of all taxa the read aligned to. However, this approach leads to a large number of less specific assignments. MegaPath implements a two-stage assignment algorithm to increase the specificity. In the first stage, MegaPath assigns each read to the species they aligned to. We allow a read to be assigned to multiple species because the sequenced pathogen might have enriched mutations in some regions, and these regions can look very different from the correct reference genome. After assigning all the reads, in the second stage, MegaPath will try to reassign each of the shared reads to a species using the following rules.
First, all reads are assigned to one or more species according to their alignments. A read is tagged by ‘U’ if it is assigned to only one species, or ‘M’ if assigned to more than one species. Then, for each species, the numbers of ‘U’-tagged reads and the number of ‘M’-tagged reads are calculated. For a species S, we define UCount(S) as the number of ‘U’-tagged reads assigned to S, and AllCount(S) as the number of both ‘U’-tagged and ‘M’-tagged reads assigned to S. For two species S and T, we define MCount(S, T) as the number of reads assigned to both S and T. We say a species S weakly explains another species T, if 1) AllCount(S) - MCount(S, T) ≥ r * AllCount(S), and 2) UCount(T) < e * UCount(S). The default values of r and e are both 0.05. Descriptively, the criteria are interpreted as, 1) reads aligned to S are likely to be correct alignments (instead of misaligning read sampled from T) if quite a number of reads (r = 5%) are not similar to T; 2) the unique reads that support T might be a coincidence (e = 5%) due sequencing error or misalignment. We say S explains T if S weakly explains T and no species weakly explains S.
In the second stage, a read assigned to both S and T will be reassigned to S only if S explains T. After reassigning all the shared reads, MegaPath will apply the LCA algorithm to determine the taxon of each read.