This package was written in R using RStudio version 1.2.5042, built on R version 4.0, on macOS Mojave version 10.14. It depends on the core R packages grDevices, graphics, stats, and utils, and is maintained and released through the Bioconductor project under an Artistic License.
The kataegis package provides a four-step workflow for identifying and visualizing localized hypermutation regions: reading in the data, calculating the inter-mutational distances, identifying kataegis, and visualizing the results (Fig. 1).
The readVCF() and readMAF() functions read standard Variant Call Format (VCF) (http://github.com/samtools/hts-specs/blob/master/VCFv4.2.pdf) and Mutation Annotation Format (MAF) (http://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) files, respectively. VCF and MAF are two of the most commonly used file formats for storing variant information from high-throughput sequencing, e.g., whole genome sequencing. Both formats have standard specifications, and mature tools exist for filtering them and converting between them. The readVCF() function reads files in VCF format with the corresponding suffix and performs a crude filtering according to the VCF “FILTER” field. A MAF file can hold the mutation annotation data of several samples and is widely used by important bioinformatic databases such as The Cancer Genome Atlas (TCGA), so we provide the readMAF() function to read files in MAF format with the corresponding suffix, with the samples either merged or separated. If the user chooses to read the MAF file with the samples separated, the variants are read into a list of matrices, each named after the sample’s ID. The crude filtering also works for the readMAF() function. The package contains two simulated data sets: the VCF was generated from the mouse model of small cell lung cancer GSE149444_17686R (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149444), and the MAF was generated from human adenoid cystic carcinoma (http://www.cbioportal.org/study/summary?id=acyc_mskcc_2013) [13].
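To illustrate what a crude filtering on the VCF “FILTER” field might look like, the following base R sketch parses a few VCF records and keeps only those marked PASS. It is not the package's readVCF() implementation: the helper name read_vcf_sketch(), its keep argument, and the choice to retain only PASS records are assumptions made for this example.

```r
# Hypothetical helper (NOT the package's readVCF()): parse VCF text lines and
# keep only the records whose FILTER value is listed in `keep`.
read_vcf_sketch <- function(lines, keep = "PASS") {
  # Drop the "##" meta-information lines and the "#CHROM" header line
  body <- lines[!grepl("^#", lines)]
  vcf <- read.table(text = body, sep = "\t", header = FALSE,
                    quote = "", comment.char = "",
                    colClasses = "character", stringsAsFactors = FALSE)
  # The first seven columns are fixed by the VCF specification
  names(vcf)[1:7] <- c("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER")
  vcf$POS <- as.integer(vcf$POS)
  # Crude filtering: assumed here to mean "keep PASS records only"
  vcf[vcf$FILTER %in% keep, c("CHROM", "POS", "REF", "ALT"), drop = FALSE]
}

# Made-up records for illustration
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "chr1\t100\t.\tA\tG\t50\tPASS\t.",
  "chr1\t250\t.\tC\tT\t12\tLowQual\t.",
  "chr2\t900\t.\tG\tA\t60\tPASS\t."
)
variants <- read_vcf_sketch(vcf_lines)
```

In this toy input the LowQual record at chr1:250 is dropped and the two PASS records remain.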
Once the data have been read in without errors, the second step is to calculate the inter-mutational distances. The variants are first sorted by genomic coordinate within each chromosome, and the distances between neighboring variants are then calculated. For a list of separated samples, samples with too few variants for calculating the inter-mutational distances are discarded, and a warning reporting this together with the affected sample IDs is issued. This step produces a data frame containing the chromosomes, the variants’ locations, and the inter-mutational distances.
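The sorting and distance calculation can be sketched in a few lines of base R. This is an illustration of the idea only, not the package's code: the function name inter_mut_dist() and the column names CHROM, POS, and DIST are assumptions, and the handling of samples with too few variants is omitted.

```r
# Hypothetical helper (NOT the package's own function): compute the distance
# from each variant to the previous variant on the same chromosome.
inter_mut_dist <- function(variants) {
  # Sort by chromosome name, then by position within each chromosome
  variants <- variants[order(variants$CHROM, variants$POS), , drop = FALSE]
  # The first variant of every chromosome has no predecessor, so it gets NA
  variants$DIST <- ave(variants$POS, variants$CHROM,
                       FUN = function(p) c(NA, diff(p)))
  rownames(variants) <- NULL
  variants
}

# Made-up, unsorted variants for illustration
variants <- data.frame(
  CHROM = c("chr1", "chr2", "chr1", "chr1", "chr2"),
  POS   = c(500L, 100L, 120L, 150L, 1100L),
  stringsAsFactors = FALSE
)
imd <- inter_mut_dist(variants)
```

For this input, the chr1 variants at 120, 150, and 500 get distances NA, 30, and 350, and the chr2 variants at 100 and 1100 get NA and 1000.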
With the previous two steps complete, the data are ready for calling the localized hypermutation regions. These are mutation “hotspot” regions, defined as regions containing more than a certain number of mutations within a given range of the genome. They have been reported as more than five [9] or six mutations within a range of 1000 bp of the human genome [6]. Given this definition, it is reasonable to segment the genome according to the mutations. We used a segmentation method based on the Piecewise Constant Fitting (PCF) algorithm [14]. For each segment, the number of mutations, the average inter-mutational distance, and the genomic coordinates are reported. It is therefore simple for users to filter the segments by thresholds on the number of mutations and the average inter-mutational distance. The kata() function performs all of these tasks automatically.
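The threshold-based filtering of segments can be illustrated with a small base R sketch. It does not reproduce the PCF segmentation or the output of kata(): the segment table, its column names, the helper filter_segments(), and the default thresholds (at least six mutations, average inter-mutational distance of at most 1000 bp) are all assumptions chosen to match the definitions cited above.

```r
# Made-up segment table; column names are assumptions, not the kata() output
segments <- data.frame(
  chr       = c("chr1", "chr1", "chr2"),
  start     = c(1000, 50000, 2000),
  end       = c(4000, 900000, 2500),
  n_mut     = c(8, 12, 4),          # number of mutations in the segment
  mean_dist = c(429, 77273, 167),   # average inter-mutational distance (bp)
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep segments that are both mutation-rich and dense
filter_segments <- function(segments, min_mut = 6, max_mean_dist = 1000) {
  keep <- segments$n_mut >= min_mut & segments$mean_dist <= max_mean_dist
  segments[keep, , drop = FALSE]
}

kataegis_candidates <- filter_segments(segments)
```

Only the first segment passes both thresholds: the second has enough mutations but they are too far apart, and the third is dense but contains too few mutations.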
Once the first three steps have finished and the localized hypermutation regions have been identified, the last step is to visualize the regions. Researchers are usually interested in the global landscape of the distribution of the regions across the genome (Fig. 2), as well as the nucleotide content (Fig. 3) and spectra (Fig. 4) of the foci and flanking regions of the localized hypermutation regions. The kataplot() function takes the data produced by the first three steps and produces these plots automatically. Users only have to specify which types of plots to produce and the size, format, and names of the plots. The landscape can also be restricted to one or several chromosomes rather than the whole genome (Fig. 5).