To overcome the limitations of existing mutation signature analysis tools, we have developed Helmsman, a new mutation signature analysis program optimized for performing mutation signature analysis on arbitrarily large datasets. Helmsman is implemented in Python, and is primarily designed to accept VCF files as input (though Helmsman can also accept data in other formats, such as MAF). Helmsman uses the powerful cyvcf2 Python library [10] for back-end processing of VCF files. Cyvcf2 is essentially a Python wrapper for the same htslib C libraries that serve as the back-end for standalone compiled VCF-processing toolkits like bcftools [11], and offers comparable speed and memory efficiency [10].
Generation of mutation spectra matrices
For each SNV in a VCF file, Helmsman defines the mutation type based on the reference and alternative alleles, then queries the corresponding reference genome for the trinucleotide context of the SNV, determining subtype j. The functions for querying the reference genome were derived from the pyfaidx Python library, which provides fast and memory-efficient random access to reference genome files, without requiring the entire file to be loaded into memory [12]. The genotypes of the N samples for this SNV are represented as an integer array, with the number of alternative alleles per sample coded as 0, 1, or 2 according to the observed genotype [10]. Based on the genotype for sample i, Helmsman increments the Mi, j entry of the mutation spectra matrix accordingly (i.e., if individual i is heterozygous, Mi, j is incremented by 1, but if individual i is homozygous for the reference allele, Mi, j remains unchanged). This procedure is fully vectorized, meaning that instead of performing N sequential operations (i.e., looping through the genotypes of the N samples for a given SNV and adding the value of each element to the corresponding Mi, j entry), Helmsman performs element-wise addition of the genotype array to the jth column of the M matrix in a single computational operation. Consequently, Helmsman’s processing time is independent of sample size and scales linearly with the number of SNVs. The only objects stored in memory at any given moment are the array of N genotypes for the SNV being processed and the Nx96 mutation spectra matrix, so memory usage is independent of the number of SNVs and scales linearly with sample size.
Mutation signature analysis
Once the mutation spectra matrix has been generated, Helmsman can apply non-negative matrix factorization (NMF) to this matrix to infer the underlying mutation signatures and their loadings within each sample, using functions from the nimfa [13] Python library. Alternatively, Helmsman can perform principal component analysis (PCA) to the mutation spectra matrix using functions from the scikit-learn [14] library. We note that because PCA does not enforce non-negativity, the resulting components do not have a useful biological interpretation like the NMF signatures do [15]; however, PCA remains useful as an orthogonal exploratory analysis to highlight patterns of similarity in the spectra of the samples (as in [16]), which can help guide understanding of how many distinct signatures may be contributing to the observed mutation spectra in a given dataset [17].
Alternatively, users can forgo the built-in mutation signature analysis and instead opt to write the mutation spectra matrix to a file and perform downstream analyses in their preferred environment. When this option is selected, users can specify one of several different R packages that they wish to use for their downstream analysis (e.g., SomaticSignatures or deconstructSigs) and Helmsman will automatically generate an R script with all code necessary to load the output matrix into R and format it for compatibility with the specified package. This feature was designed expressly to enable and encourage researchers to continue using existing mutation signature analysis tools that would otherwise be incapable of processing large datasets due to computational bottlenecks. Further, this feature better enables users to perform multiple complementary analyses. For example, after generating the mutation spectra matrix with Helmsman, users could perform supervised signature decomposition with deconstructSigs [4] to assess the presence of known signatures, then apply the de novo signature extraction methods of SomaticSignatures [3] or signeR [5] to determine if the data contain novel signatures, without ever needing to re-generate or manually reformat the mutation spectra matrix.
Additional features
In addition to being optimized for speed and low memory usage, Helmsman includes several features to accommodate various usage scenarios and minimize the amount of pre-processing necessary to analyze large mutation datasets. For example, if input data are spread across multiple files (e.g., by different sub-samples or genomic regions), Helmsman can process these files in parallel and aggregate them into a single mutation spectra matrix, providing additional performance improvements and avoiding the need to generate intermediate files. Similarly, in certain applications, it may be desirable to pool similar samples together (e.g., by tumor type) when generating the mutation spectra matrix. Helmsman can pool samples on-the-fly, without needing to pre-annotate or reshape the input file with the desired grouping variable. All features are described in detail in the online documentation.
Alternative deployment options
We have also created a Docker image to easily deploy Helmsman in an isolated and reproducible environment on virtually any personal computer, local server, or cloud server where Docker containers are supported. This Docker image contains the Helmsman source code along with a minimal Python environment and all the necessary library dependencies, as well as the bcftools suite [11] to perform any necessary pre-processing within the same container environment. When the Docker container is deployed, an interactive Jupyter notebook will be available to run Helmsman. Further, this Docker image is fully compatible with the Binder platform [18], which effectively enables Helmsman to be used as a web server for performing mutation signature analysis on small datasets. A Binder instance of Helmsman can be accessed at https://mybinder.org/v2/gh/carjed/helmsman/master.