Historically, identification of causal agents of disease has relied heavily on the ability to culture the organism in the laboratory and/or the use of organism-specific antibodies or sequence-based probes. However, these methods can be very limiting. For instance, some microorganisms are refractory to laboratory culture. In some cases, even microbiological assays for diseases caused by cultivable organisms, such as endocarditis caused by staphylococci and streptococci, have a high false-negative rate. Serological assays are typically limited to identifying known or closely related organisms, and antigenic drift and shift can result in false negatives. Even highly sensitive PCR-based assays must be continually updated due to signature degradation. Additionally, for divergent viruses such as HIV-1, many PCR assays are unable to discriminate between closely related strains. This necessitates the design of multiple probe sets, which is often a laborious task. Moreover, the sensitivity of these assays is often low due to high numbers of false negatives. Thus, there is a need for assays that are more robust and less pathogen-specific.
Prior to the widespread adoption of high-throughput sequencing (HTS), high-density oligonucleotide microarrays were used to determine the presence of microorganisms. Syndrome-specific panels showed success in the diagnosis of infectious disease. However, sequence features sufficiently different from the array probe will not hybridize, resulting in false negatives. By comparison, HTS represents a relatively unbiased approach to detection of causal agents of infectious disease. However, before metagenomic sequencing can be used in a routine clinical context, some basic questions regarding sensitivity and reproducibility must be answered. In this study, we compared three HTS platforms for their ability to detect pathogens in human blood. Compared with the traditional Roche-454 sequencer, the benchtop sequencers Ion Torrent PGM and MiSeq were better able to detect a pathogen in human blood, in part by virtue of their increased throughput.
Our reported LoD for viral samples on the Roche-454 of 1 × 10^2.5 pfu/mL is similar to, but slightly lower than, a previously reported value of 1 × 10^3 pfu/mL. This may reflect an increased number of reads in our study, differences in sequence library preparation or improvements in sequencing chemistry. Notably, this value is near the LoD for a validated qPCR assay. Our reported LoD of 1 × 10^2.5 pfu/mL corresponds to an upper limit of 31,600 to 53,720 genome copies/mL. Indeed, at that titer, the sequence reads were of sufficient number and length to unequivocally discriminate between Dengue virus subtypes. However, a strain-level designation was not possible. The inability to make a strain-level call could have clinical consequences. For instance, a recent phylogenetic study of circulating strains of Dengue Virus 2 indicated that a single substitution at prM39 was responsible for fatal cases of Dengue Hemorrhagic Fever.
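As a quick arithmetic check on the figures above, the following sketch converts the reported LoD to a plain number and back-calculates the genome-copy-to-pfu ratios implied by the quoted upper limits. The ratios are derived here purely from the two stated values and are not reported in the text; the function name is hypothetical.

```python
# LoD reported in the text, in pfu/mL (10^2.5).
LOD_PFU_PER_ML = 10 ** 2.5

# Upper limits quoted in the text, in genome copies/mL.
COPIES_LOW = 31_600
COPIES_HIGH = 53_720


def implied_copies_per_pfu(genome_copies_per_ml: float, pfu_per_ml: float) -> float:
    """Back-calculate genome copies per pfu (an implied value, not a measured one)."""
    if pfu_per_ml <= 0:
        raise ValueError("pfu_per_ml must be positive")
    return genome_copies_per_ml / pfu_per_ml


ratio_low = implied_copies_per_pfu(COPIES_LOW, LOD_PFU_PER_ML)
ratio_high = implied_copies_per_pfu(COPIES_HIGH, LOD_PFU_PER_ML)

print(f"LoD = {LOD_PFU_PER_ML:.0f} pfu/mL")
print(f"Implied genome copies per pfu: {ratio_low:.0f} to {ratio_high:.0f}")
```

Under these figures, 10^2.5 pfu/mL is roughly 316 pfu/mL, and the quoted range corresponds to roughly 100 to 170 genome copies per pfu.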
Using a different virus at a comparable titer, the Roche-454 platform was able to definitively identify the pathogen in only one of two replicates. Both the PGM and MiSeq surpassed the Roche-454 in terms of number of mapped reads as well as reproducibility. However, although the Roche-454 sequence data produced only a single hit, the length and quality of the hit were adequate to make a correct strain call (data not shown).
The apparent difference in the ability of each platform to detect a given pathogen is a function of the total output per platform. When the values are expressed in a normalized manner (pathogen reads per 100,000), it becomes apparent that the analytical sensitivity of the two benchtop platforms is roughly the same and surpasses that of Roche-454 (Table 2). Neither technology seems to have a significant advantage regarding detection or characterization of a pathogen if the number of reads is held constant. When cost per sequencing run or per megabase is factored in, the benchtop sequencers are a more economical option.
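The normalization described above can be sketched as follows. The read counts in the example are invented purely for illustration and are not values from Table 2; the function name is likewise hypothetical.

```python
def reads_per_100k(pathogen_reads: int, total_reads: int) -> float:
    """Express pathogen-derived reads per 100,000 total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return pathogen_reads * 100_000 / total_reads


# Hypothetical runs: the second run yields ten times more pathogen reads
# only because it produced ten times more reads overall.
low_output_run = reads_per_100k(pathogen_reads=50, total_reads=100_000)
high_output_run = reads_per_100k(pathogen_reads=500, total_reads=1_000_000)

print(low_output_run, high_output_run)
```

In this made-up case both runs give the same normalized value, which is the situation described in the text: raw hit counts differ with total output, whereas the normalized sensitivity does not.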
In this study, we also examined metagenomic sequencing for its ability to detect a genetically modified pathogen in a clinical sample. A BSL-2 strain of B. anthracis containing an inserted RFP gene was used as a surrogate for a genetically modified organism (GMO), and spiked into human blood at relatively high concentration. In this case, although reads were identified as likely originating from B. anthracis, in none of the samples was evidence of the inserted gene detected. This likely means that, for the time being, even with their substantial output, the benchtop sequencers are not suitable for detection of GMO from complex samples or characterization of threat agents from complex samples. Our results indicate, however, that although characterization of a given pathogen from a clinical sample by metagenome sequencing on a benchtop sequencer may not be possible without some pathogen-specific enrichment, identification of species and even strains is possible.
The results of this study suggest that the benchtop sequencers perform well at the task of identifying a putative pathogen present at low titers. Each of the three platforms tested provided a number of reads that was sufficient to unambiguously identify the pathogen to the species level in the case of virus and to the genus level in the case of bacteria. The data from each platform were very reproducible for technical replicates within library preparations. Indeed, the data from each platform were remarkably consistent in terms of quantity and quality (Table 2; Additional file 2: Table S1). Additionally, there was little variation in the number of reads mapped or in taxonomic classification of contigs. These results suggest that library protocols and sequencing chemistries are robust and uniform enough to make a dependable identification of a given pathogen. Our results are in agreement with a recently published study in which gDNA from Bacillus anthracis was serially titrated into background DNA collected from air filters and soil. The results of that study demonstrated that, even with whole genome amplification prior to sequencing, it is difficult to assign a proper species classification to sequence reads from B. anthracis.
We did note several platform-specific variations in our data. For instance, in the case of the Influenza data from the PGM platform, segment 5 routinely exhibited coverage bias as compared to larger segments. There are several possible explanations for this observation. The G/C-content of the Influenza A genome displays wide intra-segment and inter-segment variations (Additional file 4: Figure S2). A closer examination of the mapped reads seems to show a slight correlation with areas of low G/C-content, but follow-on experiments would be required to conclusively elucidate the impact of G/C-content in this context. A number of studies have noted that NGS data have distinct biases in areas of high G/C-content [17, 26, 27]. Additionally, template amplification via emulsion PCR is a potential source of reduced library diversity. Moreover, inefficiencies during reverse transcription due to RNA secondary structure may be responsible for the observed coverage bias. Given that the initial step in PGM library construction involves random fragmentation with RNase III at 37°C, it is possible that some RNA strands did not completely unfold. This may be especially true for those segments with higher ΔG, such as segments 1, 2 and 5 (Additional file 4: Figure S2). Although the MiSeq replicates also showed a slight skew towards segment 5, the preference was not as extreme as that for the PGM. This may be due in part to the method for RNA library construction on the MiSeq platform, in which the initial step is a chemical fragmentation at high temperature. Thus, the diversity of the libraries may be different from the start. On the other hand, the PGM replicates demonstrated some variation in mapped reads (Figure 3), suggesting that increased throughput might produce greater diversity.
Another platform-specific characteristic observed in this study was the proportion of mismatches within a non-perfectly mapped read. This statistic was slightly higher for the MiSeq platform than for the PGM platform. Ideally, given the correct reference sequence, for a given platform this statistic would approach zero as sequencing error decreased. However, as in this case, this statistic may also be affected by nucleotide differences between the sequenced strain and the closest reference sequence available from NCBI, precluding us from making any conclusions as to sequencing error rates from these data. It is possible that, to some extent, the longer length of MiSeq reads allows for more opportunity for mapping of non-perfect matches, and this may contribute to a decreased LoD for MiSeq when mapping to a closely related but not identical reference sequence. However, the extent to which error rate versus strain level differences and read length affect this statistic as well as the LoD cannot be ascertained in the absence of a true reference sequence for the strain in question.
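To make the statistic concrete, the sketch below computes one plausible version of it: the mean fraction of mismatched positions among reads that map with at least one mismatch. The exact definition used in the study is not given in the text, so this definition, the assumption of ungapped equal-length alignments, and the toy sequences are all illustrative assumptions.

```python
from typing import Iterable, Tuple


def mismatch_proportion(read: str, reference: str) -> float:
    """Fraction of positions that differ in an ungapped, equal-length alignment."""
    if len(read) != len(reference) or not read:
        raise ValueError("read and reference must be non-empty and of equal length")
    mismatches = sum(1 for a, b in zip(read, reference) if a != b)
    return mismatches / len(read)


def mean_nonperfect_mismatch_proportion(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average mismatch proportion over reads with at least one mismatch."""
    proportions = [mismatch_proportion(r, ref) for r, ref in pairs]
    nonperfect = [p for p in proportions if p > 0]
    return sum(nonperfect) / len(nonperfect) if nonperfect else 0.0


# Toy alignments (read, reference window); not real data.
alignments = [
    ("ACGT", "ACGT"),  # perfect match, excluded from the statistic
    ("ACGA", "ACGT"),  # 1 of 4 positions differ
    ("AAGA", "ACGT"),  # 2 of 4 positions differ
]
print(mean_nonperfect_mismatch_proportion(alignments))
```

Note that the function cannot tell a sequencing error from a genuine difference between the sequenced strain and the reference, which is the reason error rates cannot be inferred from this statistic alone.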
A number of recent reports have attempted to define the limitations of metagenomic sequencing data. One study made use of simulated data sets to compare assemblies from three sequencing technologies (Sanger, pyrosequencing and Illumina). Unsurprisingly, the study concluded that assembly quality decreased rapidly with increasing sample complexity. For low-complexity samples (10 genomes) the assemblies were comparable in quality and inclusiveness, while Illumina data produced superior assemblies in a higher complexity sample (100 genomes). These results mirror those presented here in that no one sequencing chemistry clearly surpassed another in terms of identification of a microorganism present in a low-complexity sample. It should be noted that the Illumina data in the study by Mende et al. were from the HiSeq platform and were extensively trimmed to provide high quality reads as input. A separate report estimated that genome coverage of 20X was required for proper taxonomic classification of species present in a given metagenomic community. This estimate is consistent with our inability to make a correct species-level determination from our Bacillus anthracis samples. Additionally, our results complement an important conclusion from the previous report, namely that the efficiency of gene detection is most likely overestimated.
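The 20X figure can be put in perspective with the standard mean-coverage relation C = N × L / G (N reads of length L over a genome of G bases). The sketch below is a back-of-the-envelope illustration only: the 5 Mb genome size and 150 bp read length are assumed values chosen for the example, not parameters from this study, and the function names are hypothetical.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage, C = N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def reads_for_coverage(genome_size: int, read_length: int, coverage: float) -> int:
    """Minimum number of organism-derived reads needed to reach a mean coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(coverage * genome_size / read_length)


# Assumed example: a 5 Mb bacterial genome sequenced with 150 bp reads.
needed = reads_for_coverage(genome_size=5_000_000, read_length=150, coverage=20)
print(needed)
```

With these assumed values, roughly 670,000 reads would have to originate from the bacterium itself. In a blood sample dominated by host nucleic acid, the total number of reads required would be correspondingly larger.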
There has been much effort to understand and improve metagenomic data from complex samples composed mostly of bacterial species. There are fewer published studies examining the effects of different sequencing technologies on viral metagenomics. A recent research paper attempted a comparison of Roche-454 and Illumina data for estimation of diversity in viral quasispecies, in this case HIV. The authors of that study noted that the increased throughput and lower error rate of the Illumina platform enabled improved reconstruction of viral haplotypes. However, due to its longer read lengths, the Roche-454 was superior when long-range reconstruction was necessary.
Our results indicate that library diversity and overall throughput are the two key metrics in determining how a researcher or clinician may use metagenomic data from benchtop sequencers. For instance, a recent survey of Human papillomavirus DNA present in human skin tumors demonstrated that the increased throughput of the PGM enabled identification of seven additional viral subtypes as compared to data from Roche-454 sequencing of the same samples. This result is similar to our observation that PGM data provided steady, reproducible identification of Influenza A virus in comparison to sequence data produced by the Roche-454. Overall, the MiSeq proved superior to both the Ion Torrent PGM and Roche-454 for both detection and classification of the pathogen present in our mock samples.
Although there is no published report of this, it remains formally possible to identify a previously unknown agent from a single novel microbial read present in a complex metagenomic sample. Indeed, identification of novel agents has been reported with as few as 14 reads out of over 100,000. Whereas identification of an agent may require detection of only one or more reads, characterization, the crucial next step, is absolutely dependent on complete (100%) or nearly complete representation of the agent’s entire genome at adequate depth of coverage, especially in the case of RNA viruses or other microorganisms likely to exhibit functionally relevant minority populations or quasispecies, or genetically modified organisms. In this case, follow-on experiments, such as Sanger sequencing using primers based on the novel fragment(s), are necessary to more fully characterize the genome of the microorganism. It would be optimal if some of the original sample were available for such experiments. However, in many cases, the original sample may be precious or limited in terms of volume. This challenge can be more pronounced when identifying viral agents as opposed to bacterial agents. Viral genomes (~1 × 10^4 to 1 × 10^5 bp) are orders of magnitude smaller than those of an average bacterial agent (~3-5 × 10^6 bp). Thus, the overall amount of viral nucleic acid may be in the picogram range, increasing the likelihood of two technical obstacles: 1) viral nucleic acid is outcompeted during amplification by other nucleic acids in the matrix, such as host ribosomal RNA if the matrix is tissue, or 2) if the overall amount of nucleic acids in the metagenome sample itself is low, then library preparation of the sample may fail as successive losses of genetic material occur in each step. Targeted amplification of organism-specific regions prior to sequencing has shown some promise. For instance, an assay in which multiplex PCR preceded sequencing was able to fully differentiate Bacillus anthracis, Yersinia pestis and Francisella tularensis.