- Open Access
A graphical, interactive and GPU-enabled workflow to process long-read sequencing data
BMC Genomics volume 22, Article number: 626 (2021)
Long-read sequencing has great promise in enabling portable, rapid molecular-assisted cancer diagnoses. A key challenge in democratizing long-read sequencing technology in the biomedical and clinical community is the lack of graphical bioinformatics software tools which can efficiently process the raw nanopore reads, support graphical output and interactive visualizations for interpretations of results. Another obstacle is that high performance software tools for long-read sequencing data analyses often leverage graphics processing units (GPU), which is challenging and time-consuming to configure, especially on the cloud.
We present a graphical cloud-enabled workflow for fast, interactive analysis of nanopore sequencing data using GPUs. Users customize parameters, monitor execution and visualize results through an accessible graphical interface. The workflow and its components are completely containerized to ensure reproducibility and facilitate installation of the GPU-enabled software. We also provide an Amazon Machine Image (AMI) with all software and drivers pre-installed for GPU computing on the cloud. Most importantly, we demonstrate the potential of applying our software tools to reduce the turnaround time of cancer diagnostics by generating blood cancer (NB4, K562, ME1, 238 MV4;11) cell line Nanopore data using the Flongle adapter. We observe a 29x speedup and a 93x reduction in costs for the rate-limiting basecalling step in the analysis of blood cancer cell line data.
Our interactive and efficient software tools will make analyses of Nanopore data using GPU and cloud computing accessible to biomedical and clinical scientists, thus facilitating the adoption of cost effective, fast, portable and real-time long-read sequencing.
Cancer is a leading cause of mortality worldwide with the highest burden of death affecting lower- and middle-income countries . Delays in medical care from the inability to detect cancer earlier are a key component contributing to higher morbidity, poor response to treatments and lower survival . Advances in molecular diagnosis have enabled detection of specific driver mutations that can be essential for prognosis, monitoring, and targeted therapy [3,4,5]. Examples of such “precision medicine” include the BCR-ABL fusion gene in chronic myeloid leukemia (CML), PML-RARA fusions in acute promyelocytic leukemia (APL) and FLT3 mutations in acute myeloid leukemia (AML) [6,7,8,9]. Potentially targetable mutations are also found in solid tumors, such as renal cell carcinoma [10,11,12]. Currently, detection of fusion genes by chromosomal analysis requires highly specialized laboratories. Chromosomal analyses may not have the precision necessary to identify the specific breakpoint in a patient or provide the sequence of the flanking segments around the fusion gene to allow for downstream development of patient specific monitoring assays. Routine PCR based assays can be rapid, but a priori knowledge of the fusion breakpoints is required and when fusions involve large intronic regions RNA input is generally needed. Thus, turn-around times for these methods can take three days to two weeks [13, 14]. To capitalize on the potential of precision medicine, faster analysis of sequencing data is needed to improve the potential of molecular-assisted cancer diagnoses .
For cancer management, next generation sequencing (NGS) has many limitations, such as phasing errors, mis-mapping from short reads, strand bias and amplification errors causing irregular variant allele frequency, and bioinformatically-challenging repetitive sequences [16, 17]. In contrast, long-read sequencing technology, such as Oxford Nanopore Technologies (ONT), generates continuous sequences up to a few megabases in length at this time . Nanopore sequencing provides both sequencing and phasing information because the read lengths are generally very long in comparison to NGS (NGS = 150 to 250 bp, nanopore = generally > 2 kb to 200,000 kb or longer), therefore it will not incur the same artifacts and mapping errors . Unlike NGS, which takes an average of three days to complete sample processing and library preparation plus one additional day for sequencing, nanopore sequencing can directly sequence DNA, resulting in much shorter turnaround times [20, 21]. Thus, long-read sequencing technologies hold promise in overcoming the current diagnostic gap in cancer research.
Computational methods and software tools tailored for long-read sequencing data are essential to enable use of this emerging and promising technology [22, 23]. In nanopore sequencing, electrical current alterations are recorded as different bases traverse the pore opening. Basecalling, which translates the signal (stored as fast5 files) into a sequence of base pairs is the key step determining accuracy of the sequencing experiment. The initial conversion is followed by error correction and data polishing to obtain the final sequence . Basecalling is computationally expensive and a rate-limiting step in the analysis of nanopore data. In contrast, for NGS data, aligning reads is the rate limiting step due to the greater number of reads and the absence of a separate basecalling step. Deep learning neural network models have been applied to basecalling to increase the accuracy . With standard CPU processing, these methods are prohibitively slow and require large numbers of computational cores operating in parallel to be practical. Graphics processing units (GPUs) can be used to accelerate the analysis but require specialized hardware and software. Hardware in the form of GPU instances are available on public cloud services such as Amazon Web Services (AWS). However, virtual machine instances do not come with the drivers or Compute Unified Device Architecture (CUDA) libraries installed. In addition, the versions of these drivers and libraries must be carefully matched to the software being executed.
Nanopore sequencing is fast and cost effective in terms of data collection, with a $90 USB Flongle attached to a laptop supporting real-time sequencing. Sequencing is potentially accessible to a broad range of biomedical scientists who do not have access to a traditional sequencer. A challenge in democratizing long-read sequencing technology is the difficulty of the processing of the data which requires command line tools and technically difficult installation of software, libraries and drivers. There is a lack of graphical bioinformatics software tools which can efficiently process the raw nanopore reads, and interactive visualizations for interpretations of results. An example of a command-line workflow is the MasterOfPores pipeline that performs pre-processing and analysis (prediction of RNA modifications and estimation of polyA tail lengths) of long-read data . MasterOfPores is a workflow using the NextFlow framework , a script-based engine that requires programming experience to deploy and modify. While MasterOfPores uses software containers for most of its components, it does not provide a container for the key basecalling step which requires the most setup and configuration to operate with GPU computing. In addition, MasterOfPores does not include the product-grade basecaller Guppy , which is available to ONT customers via their community site  and cannot be distributed in a container.
We present a graphical cloud-enabled containerized workflow for fast, interactive analysis of nanopore data using GPUs. Specifically, we extended the Biodepot-workflow-builder (Bwb)  to provide a modular and easy-to-use graphical interface that allows users to create, customize, execute, and monitor bioinformatics workflows. Figure 1 shows screenshots of the platform. The workflow consists of modules to download the data and genome files, basecallers Guppy  or Bonito , minimap2  for sequence alignment, and the Integrated Genome Viewer (IGV) [32, 33] for visualization of the BAM files. We provide containers for both the ONT proprietary Guppy  and the ONT open-source Bonito  basecallers. In accordance with the licensing constraints of Guppy, we provide a containerized setup module that creates the Guppy container locally when the user provides the download URL from the ONT community site. To facilitate deployment, we use Docker containers for all modules (including the Bwb platform) and provide an Amazon Machine Image (AMI) with all software and drivers pre-installed for GPU computing on the cloud.
Integrated and interactive workflow in the Biodepot-workflow-builder (Bwb)
Biodepot-workflow-builder (Bwb)  provides a modular and easy-to-use graphical interface for reproducible execution, customization and interactive visualization of the nanopore pipeline. Graphical widgets representing Docker containers that execute modular tasks are graphically linked to define bioinformatics workflows that can then be reproducibly deployed across different local and cloud platforms. New widgets (modules) can be added without writing code. In this work, we created new widgets in the Bwb for basecalling using Guppy  and Bonito , alignment using minimap2 , visualization of the resulting BAM files using the Integrated Genome Viewer (IGV) [32, 33]. Each of these widgets call a Docker container in the backend. Users can adjust input parameters of each widget using an intuitive form-based user interface, and check intermediate results using a console. A key characteristic of Bwb is integrated support for graphical output, enabling interactive tools such as Jupyter notebooks, spreadsheets, and visualization tools to be included in the workflow. We leverage this graphical output support feature of the Bwb to create an integrated, interactive workflow by integrating the IGV to visualize resulting BAM files. Our workflows can be shared using a Bwb native format or exported as shell scripts. In addition, the Bwb is distributed as a Docker container that can be easily deployed on any local or cloud platform.
GPU software versioning and compatibility ensured by using containers and providing an AMI with drivers and software
GPU-enabled executables require additional software layers to use the GPU hardware. For NVIDIA hardware, CUDA software provided by the manufacturer is necessary to perform general computations on the GPU. AMD GPUs use a different and incompatible set of software for their hardware. In addition, low level libraries (drivers) are required to communicate with the GPU card. Additional language and operating system dependent libraries and headers may also be required to integrate the CUDA software. All these layers of software interact with each other and as a result, compatibility is version dependent and even sensitive to the method of installation. Components installed using scripts may be incompatible with components installed using package managers.
An example of the complexities involved in deploying GPU software is the open-source Bonito basecaller. The current version of the Bonito caller will not install if one follows the instructions on the Github . This is due to dependency incompatibilities in the current Python PyTorch  packages and CuPy  libraries. We were able to install Bonito v0.38 by downgrading from CUDA 11.3 to CUDA 10.2, cuDNN 7, CuPy 10.2. and downgrading PyTorch from 1.8.0 to 1.7.1. By providing a Docker container we can ensure that Bonito is deployed in this compatible environment without the need for the user to install the exact versions of the software. However, this is not sufficient, the user still needs to install the compatible drivers and CUDA software on their host machine. This is true even for GPU instances on Amazon cloud. AWS does provide the basic Linux operating systems but requires that users install their own drivers and libraries. There are commercial distributions on the AWS marketplace that provide support, but we could find nothing among the free community offerings. As a resource for the nanopore research community, we therefore have provided a public freely available AMI for use with AWS GPU virtual machine instances with versions of drivers and libraries that we have tested with our workflows. The combination of AMI and containerized modules eliminates the arcane installation steps and makes the nanopore GPU software accessible to a broad audience in the biomedical and clinical community.
Data generation using cell lines
5,000 ng of DNA from four cell lines (NB4, K562, ME1, MV4;11) was dephosphorylated and cut with Cas9 enzyme complexed with RNA guides designed to target the genes involved in the translocations present in each line (BCR and ABL1 for K562, PML and RARA for NB4, CBFB and MYH11 for ME1, KMT2A and AFF1 for MV4;11). Nanopore adapters were then ligated to the newly created DNA ends and the library was loaded onto a flow cell or Flongle and sequenced using a MinION Mk1B.
Detection of fusion genes in blood cancer cell lines
Applications of nanopore sequencing coupled with our workflows in cancer diagnostics are shown in Fig. 2. We can reliably detect fusion genes from DNA sequences using cell lines (NB4, K562, ME1, and MV411) with known fusion genes. This provides an advancement to molecular diagnostics by its ability to detect specific breakpoints even if there is a large intronic region between the fusion genes of interest. DNA sequencing of NB4 on a comparatively low cost ($90) Flongle nanopore device, (Oxford, UK), confirmed fusion gene sequencing spanning the PML and RARA genes (Fig. 2a). DNA sequencing of K562 on a Flongle was able to capture the fusion gene spanning the BCR and ABL1 genes, as well as capturing the intervening large ABL1 intronic region 1 where the specific breakpoint occurs (Fig. 2b). Figure S1 in Additional file 1 presents additional results for the detection of fusion genes CBFB and MYH11 for the cell line ME1 (Figure S1A), KMT2A and AFF1 for MV4;11 (Figure S1B).
We compared the execution time of the basecallers Guppy  and Bonito  on GPU-enabled machines using the NB4 cell line data. We also measured the execution time of running Guppy using CPU and on a local host. The runtime of Guppy is reduced from over 42 min to just over 1 min using GPU computing, representing a 29x speedup and a 93x reduction in cloud computing costs. The results are summarized in Table 1.
Experimental setup for benchmarking
Guppy GPU was benchmarked on an AWS g4dn.4xlarge virtual machine instance with a NVIDIA Tesla T4 GPU with the template_r9.4.1_450bps_hac.jsn model file provided by ONT. Bonito GPU was also benchmarked on the same instance using the provided dna_r9.4.1 model file and the default settings (chunk size of 4000 and batch size of 32). Guppy CPU was benchmarked on a c5d18xlarge instance with 72 vCPUs, 72 threads/basecaller, and 1 basecaller. The Guppy GPU experiments on a local host were performed on a laptop with a GeForce RTX 2060 GPU. All benchmark experiments on AWS were based on 4 runs. We observed that Guppy GPU achieved the fastest average time at 88.9 s (1.5 min) with standard error of 1.2 s. Bonito GPU finished basecalling in 948.2 s (15.8 min) on average with standard error of 1.7 s. Guppy CPU achieved the slowest average time at 2551.8 s (42.5 min) with standard error of 22.4 s. The 29x speedup is computed by comparing the average runtime (in seconds) of Guppy CPU to Guppy GPU (2551.8/88.9 = 28.7).
Estimation of costs of basecalling step
A large selection of virtual machine instance types with different pricing structures are available on AWS . We conducted our empirical experiments using the C5 and G4 EC2 (Elastic Compute Cloud) instances, designed for compute-intensive workloads and GPU computing respectively. In the us-east-2 region, the on-demand pricing of the AWS c5d.18xlarge EC2 instance, with 72 vCPUs and 144GB memory, is $3.888 per hour. The pricing for a g4dn.4xlarge, with single GPU, 16 vCPUs and 64GB memory, is $1.204 per hour . The ratio of the costs of CPU vs. GPU is the time ratio 2551.8/88.9 multiplied by the pricing ratio 3.888/1.204 which works out to be 92.7-fold cheaper when the GPU instance is used for basecalling. These cost estimates are based on single samples.
A potential advantage of long-read sequencing is lower costs in comparison to standard NGS. We used low-cost Flongles to detect the PML-RARA and BCR-ABL1 fusions. In this work, we demonstrate the ability to detect fusions on nanopores devices that read DNA at speeds faster than 1 nt/µs. We present interactive software tools that not only make analyses of Nanopore data accessible to biomedical and clinical scientists, but also efficient and economical through the use of GPU computing. Most importantly, we have illustrated the applicability of our workflow for the analysis of cell line data as part of a rapid, cost effective assay to detect fusions. Using an intuitive graphical interface, our workflow integrates the processing of raw nanopore reads, with the visualization steps that are used to interpret the results. This provides the capability to identify and confirm pathognomonic fusion genes such as BCR-ABL1 in CML or PML-RARA in APL. This methodology can potentially both define specific breakpoints important in treating and tracking a patients’ specific disease, but also allow multiplex phasing to identify and follow multiple mutations in the same patient. Future directions include optimizing the assay for faster library preparation, sequencing times, and analysis to enable fusion detection to less than a day. Improvements to turnaround time in the laboratory combined with an accessible, efficient informatics workflow that enables most molecular technologists and pathologists to implement long-read sequencing into current clinical pathology workflows will advance the field of molecular pathology beyond what is currently possible with NGS. These small portable, low-cost devices, together with integrated bioinformatics support, will allow for rapid diagnostics to assist in point-of-care clinical decision making.
Availability and requirements
Project name: Nanopore-GPU.
Project home page: All code is publicly available and distributed under a custom academic license at https://github.com/BioDepot/nanopore-gpu. A public AMI (ami-0ecb1effab7fcfaa3) is provided with all necessary drivers and software pre-installed.
Operating system(s): Platform independent.
Programming language: Python.
Other requirements: Docker.
License: Custom non-commercial license.
Any restrictions to use by non-academics: The software is used solely for noncommercial purposes.
Availability of data and materials
Amazon Machine Image
acute myeloid leukemia
acute promyelocytic leukemia
Amazon Web Services
Compute Unified Device Architecture
Elastic Compute Cloud
Graphics Processing Unit
Integrated Genome Viewer
Oxford Nanopore Technologies
The Lancet. GLOBOCAN 2018: counting the toll of cancer. Lancet. 2018;392(10152):985.
Milner DA Jr, Holladay EB. Laboratories as the Core for Health Systems Building. Clin Lab Med. 2018;38(1):1–9.
Mehta S, Shelling A, Muthukaruppan A, Lasham A, Blenkiron C, Laking G, Print C. Predictive and prognostic molecular markers for cancer medicine. Ther Adv Med Oncol. 2010;2(2):125–48.
Yeung CCS, Radich J. Predicting Chemotherapy Resistance in AML. Curr Hematol Malig Rep. 2017;12(6):530–6.
Yaghmaie M, Yeung CC. Molecular Mechanisms of Resistance to Tyrosine Kinase Inhibitors. Curr Hematol Malig Rep. 2019;14(5):395–404.
Radich J, Yeung C, Wu D. New approaches to molecular monitoring in CML (and other diseases). Blood. 2019;134(19):1578–84.
Dohner H, Estey EH, Amadori S, Appelbaum FR, Buchner T, Burnett AK, Dombret H, Fenaux P, Grimwade D, Larson RA, et al. Diagnosis and management of acute myeloid leukemia in adults: recommendations from an international expert panel, on behalf of the European LeukemiaNet. Blood. 2010;115(3):453–74.
O’Donnell MR, Tallman MS, Abboud CN, Altman JK, Appelbaum FR, Arber DA, Attar E, Borate U, Coutre SE, Damon LE, et al. Acute myeloid leukemia, version 2.2013. J Natl Compr Canc Netw. 2013;11(9):1047–55.
Arber DA, Orazi A, Hasserjian R, Thiele J, Borowitz MJ, Le Beau MM, Bloomfield CD, Cazzola M, Vardiman JW. The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood. 2016;127(20):2391–405.
Xiao X, Garbutt CC, Hornicek F, Guo Z, Duan Z. Advances in chromosomal translocations and fusion genes in sarcomas and potential therapeutic applications. Cancer Treat Rev. 2018;63:61–70.
Tretiakova MS, Wang W, Wu Y, Tykodi SS, True L, Liu YJ. Gene fusion analysis in renal cell carcinoma by FusionPlex RNA-sequencing and correlations of molecular findings with clinicopathological features. Genes Chromosomes Cancer 2019.
Parker BC, Zhang W. Fusion genes in solid tumors: an emerging target for cancer diagnosis and treatment. Chin J Cancer. 2013;32(11):594–603.
Tsongalis GJ, Al Turkmani MR, Suriawinata M, Babcock MJ, Mitchell K, Ding Y, Scicchitano L, Tira A, Buckingham L, Atkinson S, et al. Comparison of Tissue Molecular Biomarker Testing Turnaround Times and Concordance Between Standard of Care and the Biocartis Idylla Platform in Patients With Colorectal Cancer. Am J Clin Pathol. 2020;154(2):266–76.
Dawson AJ, McGowan-Jordan J, Chernos J, Xu J, Lavoie J, Wang JC, Steinraths M, Shetty S. Canadian College of Medical Geneticists guidelines for the indications, analysis, and reporting of cancer specimens. Curr Oncol. 2011;18(5):e250–5.
VanderLaan PA, Chen Y, DiStasio M, Rangachari D, Costa DB, Heher YK. Molecular Testing Turnaround Time in Non-Small-Cell Lung Cancer: Monitoring a Moving Target. Clin Lung Cancer. 2018;19(5):e589–90.
Jennings LJ, Arcila ME, Corless C, Kamel-Reid S, Lubin IM, Pfeifer J, Temple-Smolkin RL, Voelkerding KV, Nikiforova MN. Guidelines for Validation of Next-Generation Sequencing-Based Oncology Panels: A Joint Consensus Recommendation of the Association for Molecular Pathology and College of American Pathologists. J Mol Diagn. 2017;19(3):341–65.
Strom SP, Lee H, Das K, Vilain E, Nelson SF, Grody WW, Deignan JL. Assessing the necessity of confirmatory testing for exome-sequencing results in a clinical molecular diagnostic laboratory. Genet Med. 2014;16(7):510–5.
Laver T, Harrison J, O’Neill PA, Moore K, Farbos A, Paszkiewicz K, Studholme DJ. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol Detect Quantif. 2015;3:1–8.
Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36(4):338–45.
Helmersen K, Aamot HV. DNA extraction of microbial DNA directly from infected tissue: an optimized protocol for use in nanopore sequencing. Sci Rep. 2020;10(1):2985.
Cumbo C, Minervini CF, Orsini P, Anelli L, Zagaria A, Minervini A, Coccaro N, Impera L, Tota G, Parciante E, et al: Nanopore Targeted Sequencing for Rapid Gene Mutations Detection in Acute Myeloid Leukemia. Genes (Basel) 2019, 10(12).
Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet. 2018;19(6):329–46.
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020;21(1):30.
Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 2019;20(1):129.
Cozzuto L, Liu H, Pryszcz LP, Pulido TH, Delgado-Tejedor A, Ponomarenko J, Novoa EM. MasterOfPores: A Workflow for the Analysis of Oxford Nanopore Direct RNA Sequencing Datasets. Front Genet. 2020;11:211.
Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35(4):316–9.
Oxford Nanopore Technologies GitHub: Guppy [https://github.com/nanoporetech].
The Nanopore Community [https://nanoporetech.com/community].
Hung LH, Hu J, Meiss T, Ingersoll A, Lloyd W, Kristiyanto D, Xiong Y, Sobie E, Yeung KY. Building Containerized Workflows Using the BioDepot-Workflow-Builder. Cell Syst. 2019;9(5):508–14 e503.
Bonito. A PyTorch Basecaller for Oxford Nanopore Reads [https://github.com/nanoporetech/bonito].
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.
Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6.
Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92.
CuPy. A NumPy-compatible array library accelerated by CUDA [https://cupy.dev/].
Amazon. EC2 pricing https://aws.amazon.com/ec2/pricing/.
We would like to acknowledge Katherine Melville from Oxford Nanopore Technologies and cloud credits from Amazon Web Services.
LHH and KYY are supported by NIH grant R01GM126019. JR is supported by NIH grants R01 CA175008-06 and UG1 CA233338-02. CY is supported by NCCN Young Investigator Award and Hyundai Hope on Wheel Scholars Hope Grant. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
The content is solely the responsibility of the authors, and does not necessarily represent the official views of the National Institutes of Health.
Ethics approval and consent to participate
Consent for publication
LHH and KYY also have equity interest in and employed by Biodepot LLC, which receives compensation from NCI SBIR contract number 75N91020C00009. The terms of this arrangement have been reviewed and approved by the University of Washington in accordance with its policies governing outside work and financial conflicts of interest in research.
About this article
Cite this article
Reddy, S., Hung, LH., Sala-Torra, O. et al. A graphical, interactive and GPU-enabled workflow to process long-read sequencing data. BMC Genomics 22, 626 (2021). https://doi.org/10.1186/s12864-021-07927-1
- Cancer diagnostics
- Cloud computing
- Long-read sequencing