nanotatoR: a tool for enhanced annotation of genomic structural variants

Background Whole genome sequencing is effective at identification of small variants, but because it is based on short reads, assessment of structural variants (SVs) is limited. The advent of Optical Genome Mapping (OGM), which utilizes long fluorescently labeled DNA molecules for de novo genome assembly and SV calling, has allowed for increased sensitivity and specificity in SV detection. However, compared to small variant annotation tools, OGM-based SV annotation software has seen little development, and currently available SV annotation tools do not provide sufficient information for determination of variant pathogenicity. Results We developed an R-based package, nanotatoR, which provides comprehensive annotation as a tool for SV classification. nanotatoR uses both external (DGV; DECIPHER; Bionano Genomics BNDB) and internal (user-defined) databases to estimate SV frequency. Human genome reference GRCh37/38-based BED files are used to annotate SVs with overlapping, upstream, and downstream genes. Overlap percentages and distances for nearest genes are calculated and can be used for filtration. A primary gene list is extracted from public databases based on the patient’s phenotype and used to filter genes overlapping SVs, providing the analyst with an easy way to prioritize variants. If available, expression of overlapping or nearby genes of interest is extracted (e.g. from an RNA-Seq dataset, allowing the user to assess the effects of SVs on the transcriptome). Most quality-control filtration parameters are customizable by the user. The output is given in an Excel file format, subdivided into multiple sheets based on SV type and inheritance pattern (INDELs, inversions, translocations, de novo, etc.). nanotatoR passed all quality and run time criteria of Bioconductor, where it was accepted in the April 2019 release. We evaluated nanotatoR’s annotation capabilities using publicly available reference datasets: the singleton sample NA12878, mapped with two types of enzyme labeling, and the NA24143 trio. nanotatoR was also able to accurately filter the known pathogenic variants in a cohort of patients with Duchenne Muscular Dystrophy for which we had previously demonstrated the diagnostic ability of OGM. Conclusions The extensive annotation enables users to rapidly identify potential pathogenic SVs, a critical step toward use of OGM in the clinical setting.


Introduction
With the advent of the high throughput short-read sequencing (SRS) techniques, identification of molecular underpinnings of genetic disorders has become faster, more accurate and costeffective [1]. SRS platforms used for whole exome (WES) or genome (WGS) DNA sequencing, produce billion of reads per run, typically limited in length to 100-150 base pairs (bp) [2]. When WES started being used in clinical diagnostic practice, it was reported to be effective in identifying pathogenic genetic variants in approximately 30% of cases [3][4][5]. Even with technological evolution and more widespread practice, reports of diagnostic yields between 8 and 70%, depending on the disease [6] suggest that a large fraction of cases remain undiagnosed.
WGS was shown to be more effective than WES in identifying single nucleotide variants (SNVs) or small insertions and deletions (INDELs) than WES [7,8]. However, both WES and WGS are ineffective in identification of structural variants (SVs, >50 bps) or copy number variants (CNVs) because short reads cannot span repetitive elements or provide contextual information.
Many algorithms have been designed for detection of SVs in short-read-based sequences (69 of them were compared in [9]). However, performance analyses highlight their limitations such as low concordance, poor precision, and high rate of false positive calls [9,10]. Another benchmarking study comparing 10 different SV callers against robust truth sets showed that the total number of calls made by the different algorithms varied by greater than two orders of magnitude [11]. Region of genome analyzed (repeats vs. high-complexity regions), noise of data (platform-specific sequencing or assembly errors), complexity of the SV, and library properties (e.g. insert size) all affect specificity, sensitivity and/or processing speed of the various variantcalling algorithms [10]. 4 Chromosomal microarray (CMA) is the established method for high-accuracy detection of CNVs, but it can only identify gains or losses of genetic material and is virtually blind towards identification of balanced rearrangements such as inversions or translocations. CMA clinical application is typically limited to CNVs above 25-50 kb, although higher resolution CNV maps have been built and are being used to design disease-specific paths to diagnostic detection of smaller variants (e.g. [12]). Breakpoint resolution is limited by the density of probes on the array.
Novel approaches which analyze single, long DNA molecules hold the promise of detecting the previously inaccessible SVs. Long-read sequencing (LRS) technologies such as nanoporebased sequencing (Oxford Nanopore Technologies) or single-molecule real-time sequencing (Pacific Biosciences) have the potential to both detect complex genomic rearrangements and increase SV break point resolution [13,14]. They have been critical to shed light on the "dark" regions of the genome where short reads had been insufficient for accurate assembly [15].
However, as LRS still isn't in wide use, SV detection pipelines have seen slower development than SRS-based algorithms, and both quantity and quality of identified SVs vary significantly between tools [16].
In parallel, a method not based on sequencing, optical genome mapping (OGM, Bionano Genomics), provides much higher sensitivity and specificity for identification of large SVs, including balanced events, compared to karyotype, CMA, LRS and SRS [17][18][19][20]. For example, a comparison of OGM, PacBio LRS and Illumina-based SRS on the same genome showed that about a third of deletions and three quarters of insertions above 10 kb were detected only by OGM [17] . For OGM, purified high-molecular-weight DNA is fluorescently labeled at specific sequence motifs throughout the genome (reviewed in [21]). The labeled DNA is imaged through nanochannel arrays for de novo genome assembly. Assembly and variant calling are performed 5 using algorithms provided by Bionano Genomics and/or tools developed by the community, such as OMblast [22] or OMtools [23]. OGM has been effective in identifying pathogenic variants in patients with cancer [24][25][26], Duchenne muscular dystrophy [27], and facioscapulohumeral muscular dystrophy [28,29]. Importantly OGM has allowed refinement of intractable, lowcomplexity regions of the genome and discovery of genomic content missing in the reference genome assembly [19].
Although, OGM is effective in identifying clinically relevant SVs, the currently available SV annotation tools do not provide sufficient variant information for determination of variant pathogenicity. Here, we report the development of an annotation tool in R language, nanotatoR, that provides extensive annotation for SVs identified by OGM. It determines population variant frequency using publicly available databases, as well as user-created internal databases. It offers multiple filtration options based on quality parameters thresholds. It also determines the percentage of overlap of genes with the SV, as well as distance between nearest genes and SV breakpoints, both upstream and downstream. It offers an option for incorporating RNA-Seq read counts, which has been shown to enhance variant classification [6], as well as user-specified disease-specific gene lists extracted from NCBI databases. The final output is provided in an Excel worksheet, with segregated SV types and inheritance patterns, facilitating filtration and identification of pathogenic variants.
Subsequently, samples were annotated with nanotatoR to examine the performance.
Additionally, we tested nanotatoR's ability to accurately annotate the known disease variants in a previously published cohort of 11 Duchenne Muscular Dystrophy samples [27]. For this, internal cohort frequency calculations are based on these 11 samples. For gene expression integration, NA12878 fastq files (RNA-Seq) were obtained from Sequence Read archive (SRA) (Sample GSM754335) and aligned to reference hg19, using STAR [30]. Read counts were estimated using RSEM [31], reported in Transcripts per million (TPM). nanotatoR input file formats nanotatoR was written in R language. The nanotatoR pipeline takes as input Bionanoannotated SV files, in the format of either unmodified SMAP (BNG's SVcaller output) or text (TXT) files that retain information from the SMAPs, but also append additional fields. The two main differences between the input files are:

nanotatoR output file formats
The annotations provided by nanotatoR are currently subdivided into 5 categories described below: 1) calculation of SV frequency in external and internal databases; 2) determination of gene overlaps; 3) integration of gene expression data; 4) extraction of relevant phenotypic information from public databases for 5) variant filtration. Finally, all of the sub-functions are compiled into a Main function.

Structural Variant Frequency
Variant frequency is one of the most important filtration characteristics for the identification of rare, possibly pathogenic, variants. Because OGM is not sequence based the average SV breakpoint uncertainty is 3.3 kbp [20]. As a result, compared with SNV frequency calculations, 8 frequency estimates for SVs pose greater difficulty, due to the breakpoint variability between "same" structural variants identified by different techniques.

External Databases
nanotatoR uses 3 external databases: Database of Genomic Variants (DGV) [32], Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER) [33] and Bionano Genomics control database (BNDB Output: The output is appended to the original input file in individual columns. For DECIPHER, this consists of a single column termed "DECIPHER_Freq_Perc". As DGV provides information on number of samples in addition to frequency, nanotatoR prints two columns: "DGV_Count" (with the total number of unique DGV samples containing variants matching the query SV) and "DGV_Freq_Perc" (for the percentage calculated using Formula 1).
For the BNDB, in addition to "BNG_Freq_Perc_Filtered", "BNG_Freq_Perc_UnFiltered", a third column reports "BNG_Homozygotes" (number of homozygous variants that pass the filtration criteria).

-Internal Databases
The internal cohort analysis is designed to calculate variant frequency based on aggregation of SVs for samples ran within an institution or laboratory and provides parental zygosity information for inherited variants in familial cases. The function consists of two distinct parts: 1.2a -Building the internal cohort database: Individual (solo) SMAP files for each of the samples are concatenated to build an internal database (buildSVInternalDB = TRUE), which Formula 2: BNDB database filtered SV frequency calculation. The variants that pass the similarity criterion (step 1.1.a) are filtered with size threshold and quality score (step 1.1.b). The number of variants is estimated as mentioned in step 1.1.c. Denominator is the number of alleles. This ratio is multiplied by 100 to get the frequency in percentage.

2
is stored in the form of a text file. This step creates a unique sample identifier (nanoID) based on a key provided that ensures unique sample ID and encodes family relatedness. The nanoID is written as NR<Family #>.<Relationship #>. For example, the proband in a family of three (trio) would be denoted as NR23.1, with NR23 denoting the family ID and 1 denoting the proband. For the parents of this proband, the nanoID would be NR23.2 for the mother and NR23.3 for the father. Currently, only trio analyses are supported, future updates will include larger family analyses. If multiple projects exist within the same institution and are coded with project-specific identifiers nanotatoR will append the project-specific identifier in front of the nanoID (e.g. Project1_NR23.1 and Project2_NR42.1).

1.2b Calculating internal frequency and determining parental zygosity:
For singleton analyses, the function internalFrequency_Solo (for both DLE labeling and SVmerge) calculates internal database frequency of queried SVs based on the same principles explained in section 1.1d (Formula 2) for BNDB frequency calculations. However, additional filtration criteria are implemented to increase the accuracy of frequency estimation. SVs overlapping gaps in hg19/hg38 are annotated in the output SMAPs as "nbase" calls (e.g. "deletion_nbase") and are likely to be false. nanotatoR filters out "nbase"-containing SVs when estimating internal frequency. For duplications, inversions, and translocations nanotatoR evaluates whether chimeric scores "pass" the thresholds set by the Bionano SVcaller during de-novo genome assembly [34] ensuring that SVs that "fail" this criterion are eliminated from internal frequency calculations (Fail_BSPQI_assembly_chimeric_score = "pass" or Fail_BSSSI_assembly_chimeric_score = "pass") for SVmerge datasets, or (Fail_assembly_chimeric_score = "pass") for a single-enzyme dataset. Lastly, nanotatoR checks whether the SVs were confirmed with Bionano Variant 1 3 Annotation Pipeline, which examines individual molecules for support of the identified SV [34] (Found_in_self_BSPQI_molecules = "yes" or Found_in_self_BSSSI_molecule = "yes") for SVmerge datasets, or (Found_in_self_molecules = "yes") for a single-enzyme dataset.
For family analyses (duos and trios), the internalFrequencyTrio_Duo function is used to identify parental/control sample zygosity based on the nanoID coding using criteria described in sections 1.1.a/c (note that, here, the default size similarity percentage used is ≥ 90% as inherited variants are expected to be virtually identical). Zygosity information for the identified variants is extracted and appended into two separate columns (fatherZygosity and motherZygosity). This functionality is available for both SVmerge (merged outputs from 2 enzymes) and single enzyme labeling. SVs with the same family ID as the query are not included in the overall internal frequency calculation as described in the previous paragraph. Five columns: "MotherZygosity", "FatherZygosity", "Internal_Freq_Perc_Filtered", "Internal_Freq_Perc_Unfiltered", and "Internal_Homozygotes" are appended to each of the annotated input files (nonrelevant fields contain dashes).

Gene Overlap
The

Expression Data Integration
The SVexpression_solo/_duo/_trio functions for singletons, dyads and trios respectively provide the user with tools to integrate tissue-specific gene expression values with SVs. The function takes as input a matrix of gene names and corresponding expression values for each sample (individual files can be merged by the RNAseqcombine function for dyads/trios or RNAseqcombine_solo function for singletons). To differentiate between sample types (probands and affected/unaffected parents), we recommend the user add a code (or pattern) to the file name.
For example, for the proband of family 23, the expression file name would be 1 5 Sample23_P_expression.txt, where P denotes proband. Currently, by default nanotatoR recognizes "P" as proband, "UM/AM" for unaffected/affected mother and "UF/AF" for unaffected/affected father. A function to encode more complex intra-family relatedness and identify individual samples, termed nanoID, is in development and will be available by default in the next release of nanotatoR.
Expression values for overlapping and non-overlapping SV genes are extracted from the genome-wide expression matrix and appended into separate columns in the overall SV input file.
For example, an overlap of gene X with a SV in the proband, with an expression value of 10, would be represented as gene X (10), and printed in the "OverlapProbandEXP" column. The appended number of columns in the output is dependent on which functions were run. For a trio analysis, 9 columns are added: "OverlapProbandEXP", "OverlapFatherEXP", and "OverlapMotherEXP" for overlapping genes, and a similar set for up-and down-stream nonoverlapping genes (e.g. "NonOverlapUPprobandEXP", "NonOverlapDNprobandEXP"). In case of dyads the number of columns would be 6 (with the parent column being either mother or father) and 3 for singletons.

Entrez Extract
The gene_list_generation function assembles a list of genes based on the patient's phenotype and overlaps it with gene names that span SVs. User-provided, phenotype-based keywords are used to generate a gene list from the following databases: ClinVar [35], OMIM (https://omim.org/), GTR [36], and the NCBI's Gene database (www.ncbi.nlm.nih.gov/gene) .
The input to the function is a term, which can be provided as a single term input (method = "Single"), a vector of terms (method = "Multiple"), or a text file (method = "Text"). The output can be a dataframe or text. The rentrez [37] and VarfromPDB [38] R-language packages are used 1 6 to extract data related to each of the user-provided phenotypic terms, from the individual databases. For the Gene database rentrez provides the entrez IDs associated with each gene, which are converted in nanotatoR to gene symbols using org. The user has an option to save the datasets (removeClinvar = FALSE and removeGTR = FALSE) or delete the database after the analysis is completed (removeClinvar = TRUE and removeGTR = TRUE).
The output is provided in CSV format with 3 columns: "Genes", "Terms", and "ClinicalSignificance". The "Terms" column contains the list of terms associated to each gene and corresponding database from where the association was derived. The "ClinicalSignificance" column contains genes that have clinical significance (Pathogenic/Likely Pathogenic variants) for the associated term, derived from the ClinVar database. The output of entrez extract serves as input for the subsequent variant filtration step. 1 7 The filtration function has two major functionalities: categorization of variants into groups (such as de novo, inherited from mother/father; potential compound heterozygous; inversions or translocations) and integration of the primary gene list (either provided by the user or generated by nanotatoR as described in section 4) into the input SV-containing file. In this function, genes overlapping and near SVs that are present in the primary gene list are printed in separate columns. For genes that overlap with a SV, the output is printed in columns termed "Overlap_PG" (primary genes that are common with the genes that overlap SVs) and
The final annotated SV calls are divided into multiple sheets: • All: contains all identified variant types and annotations.
• all_PG_OV: Variants overlapping with the primary gene list.
• Mismatch: Variant Type = "MisMatch" in SVmerge output (available for dual enzyme labeling output only) contains SV calls discordant between the two enzymes datasets (e.g. one called a deletion and the second an insertion).
◊ indel_dup_both if present in the proband, as well as both parents (e.g.
• Variant Type = "translocation_intrachr" or "translocation_interchr" if within a single chromosome or between two chromosomes respectively.
There is a total of 6 variant filtration functions, based on enzyme type (SVmerge for dualenzyme labeling or single enzyme) and sample type (singleton, dyad, or trio). For example, for a

Main Function
The main function sequentially runs the available nanotatoR sub-functions by merging the outputs from each step. There are in total 6 main functions depending on the type of input:

Results
The nanotatoR pipeline, written in R, integrates five individual sub-functions based on the enzyme and the sample type. A visual representation of the steps is shown in Figure 1.
nanotatoR passed all runtime and space criteria for the Bioconductor repository To demonstrate the various functionalities of nanotatoR, we present below the annotation results obtained from previously described truth sets: a control trio mapped with the singleenzyme technique, a control singleton sample mapped with both DLE and two-enzyme techniques, and a cohort of patients, for which we have previously established the efficacy of OGM to identify the SV causing Duchenne Muscular Dystrophy [27].

Example I: Annotation of a control trio single labeling dataset
We used the so-called "Ashkenazi trio" reference datasets, mapped using the DLE labeling methodology, to test the trio analysis function of nanotatoR (expression data was not available for these samples). Bionano SVcaller identified a total of 9,387 SVs in the proband (NA24385), shown in the last tab of the 9-tab report, termed "all". The complete, unfiltered nanotatoR output file for NA24385 (GM24385) is available in Supplementary Table S1. reported in the "inv" tab, and 0 (out of 84) translocations in the "trans" tab. All the translocations called by the Bionano SVcaller in this sample were in the categories "trans_interchr_common" and "trans_intrachr_common", which are classified as likely false by the Bionano annotation pipeline [34,40].
Confidence and frequency filtration: To demonstrate the importance of frequency filtration in identifying rare variants, a manual filtration was performed using a 1% threshold for external databases (DGV, DECIPHER and BNDB) and the internal cohort database. A confidence threshold (0.5 for insertions and deletions, 0.01 for inversions) was also applied. This eliminated 89% of the variants, leaving a total of 1,005 SVs, including 53 inv variants and 952 indel_dup (shown in top right pie chart, Fig. 3).
Out of the 952 insertions, deletions and duplications, only 8 were de novo ("indel_dup_denovo" tab), 794 found in both parents ("indel_dup_both"), 68 in the mother only ("indel_dup_mother"), and 82 in the father only ("indel_dup_father"). These numbers can be used to evaluate pathogenicity of the SVs. 150 SVs would be reported in the "indel_dup_cmdHet" column (found in either the mother or father, but not both), which can be manually inspected to identify potential compound heterozygous SVs.
SVs overlapping with the primary gene list: As these samples are those of healthy individuals, we could not use a disease term to generate a gene list. However, analysis of the genomes with OGM had revealed a deletion variant affecting the UGT2B17 gene in the son and the mother [41]. A 150 kb deletion on chromosome 4q13.2 spanning the whole UGT2B17 gene has been associated with osteoporosis [42]. To check whether our tool can efficiently annotate the variant, we used the term 'osteoporosis' to generate a primary gene list. This yielded a list of over 307 2 2 genes of which only 4 had pathogenic or likely pathogenic variants in ClinVar, highlighting the importance of this nanotatoR function (Supplementary Table S2). The complete extracted gene list can be used for gene discovery, while the pathogenic list is most efficient to identify variants in genes known to be associated with the proband's phenotype. Note that while UGT2B17 was accurately extracted into the primary gene list by nanotatoR, as its association with osteoporosis is reported in OMIM, it does not appear in the list of genes with pathogenic variants, as no such variant is currently reported in ClinVar.
169 SVs were found to be overlapping with the primary gene list, and were shown in the "all_PG_OV" tab.
Internal cohort frequency and zygosity calculations for the UGT2B17 variant: To investigate the frequency of the variant in the 8-sample internal cohort database, we first selected all variants with within -10 kb of start breakpoint and +10 kb of the end breakpoint (i.e. between hg19 genomic coordinates chr4:69,362,091 and chr4:69,500,860). A total of 6 variants passed the filtration criteria ("GM24385_del_totalData" tab in Table S3). Of these, 3 are from the query family: one is the proband's variant (all annotation shown in "GM24385_Variant_UGT2B17" tab in Table S3) and the other two are found in his mother. Of the maternal variants, one has exactly same start and end breakpoints as the proband's. Bionano SVcaller also called another variant in the mother with the same end breakpoint, and a similar, but not identical, start breakpoint. Both were retained as they have a size similarity > 90% with the proband's variant.

3
As described in Methods section 1.2b, nanotatoR selects the variants that pass size similarity and breakpoint criteria, and reports the zygosity for each in the parents ("GM24385_del_example_Zygosity" tab in Table S3).
In addition to the family, the variant was found in two other samples of the internal cohort.
The first was heterozygous, the second homozygous, so the total number of alleles carrying the SV was counted as 3. As the internal control cohort was composed of 8 samples, of which 3 were part of the Ashkenazi family, the total number of alleles in the internal cohort was calculated as (for both filtered and unfiltered, as all variants passed the quality filters). Note that the nanotatoR annotation process has detected the erroneous duplicate call made by SVcaller in sample NA12878, where two variants with identical characteristics were called under two different SVIndex numbers (rows 6&7, totalData tab, Table S3). Only one is taken into account for internal frequency calculation, as shown in Table S3, where tabs "GM24385_del_example_filter" and "GM24385_del_example_unfilt" show the samples used for the calculations.
Next, we calculated the filtered and unfiltered frequency (Formula 2, Methods 1.1d) of the deletion overlapping UGT2B17 in the Bionano reference database. We identified a total of 58 variants in BNDB ("GM24385_data_all" tab in Supplementary Table S4). Of these 33 ("GM24385_data_filtered" and "GM24385_data_unfiltered" tabs of Supplementary Table S4) passed the nanotatoR default filtration criteria and were used for frequency calculation. Twelve 9.18%. (Note that, here too, the number of variants was the same before and after filtration, yielding the same frequency value).
Run times: The SMAPs were annotated on an Intel core i7-6700 CPU with 16 GB RAM, Windows 10 system. It took ~15 minutes to annotate the trio sample (Supplementary Table S5 has run time for each of the functions individually). The runtime for nanotatoR is dependent on number of variants as well as network speed (for gene_list_generation function).
To generate the primary gene list for the trio sample, we downloaded the ClinVar and GTR databases, using the downloadClinvar = TRUE and downloadGTR = TRUE parameters in the gene_list_generation function. The gene_list_generation function took ~2 minutes to run for the sample. The time for this function is dependent on the number of input terms, as well as the computational/internet bandwidth available to the user. To make this process faster, the input parameters removeGTR and removeClinvar can be switched to FALSE; nanotatoR will then use the pre-downloaded database files for subsequent runs. It is recommended to download these databases periodically as they get updated frequently.
All the other databases (OMIM, DGV, DECIPHER, BNDB) must be downloaded manually from the database websites or from the nanotatoR database GitHub page (https://github.com/VilainLab/nanotatoRexternalDB). For this example, the internal frequency database was built based on a cohort of 8 samples (time ~10 minutes). The time taken for database creation depends on the user's cohort size.

Example II: Annotation of a control singleton dataset labeled with several enzymes
We also investigated the OGM datasets available for the sample NA12878 to test the annotation effectiveness of nanotatoR on multi-enzyme labeling and integration of RNA-Seq 2 5 data. OGM data is available for three labeling enzymes Nt.BspQI, Nb.BssSI and DLE1. The Nt.BspQI and Nb.BssSI SMAP outputs were merged using SVmerge. 10,087 and 6,814 SVs were reported by SVcaller for single enzyme (DLE1) and SVmerge output respectively ( Figure   4).  Fig. 4b, right panel). Proportions of insertions and deletions are similar in the two data sets, while the dual enzyme labeling called more duplications and inversions than single-enzyme labeling in this example.

6
To validate the efficiency of nanotatoR in identifying genes overlapping with SVs for NA12878, we looked for four previously published variants [43]. The 4 deletion variants identified in the study overlapped GSTM1, LCE3B, LCE3C, CR1 and SIGLEC14 genes.
nanotatoR's automated pipeline was able to identify the same type of variant (deletions) involving the same genes in both the single enzyme and SVmerge datasets. SV breakpoints reported in the original publication and in the nanotatoR-annotated data sets are shown in Supplementary Table S8; SV type and gene names are highlighted in the all_PG_OV tab in Tables S6 (DLE) and S7 (SVmerge).

Example III: Duchenne Muscular Dystrophy Cohort
We have previously published validation of the OGM technology to identify variants in the DMD gene in a cohort of patients with Duchenne muscular dystrophy [27]. We used the same cohort to test the nanotatoR annotation pipeline. The gene_list_generation function, using "Duchenne muscular dystrophy" as input for the rentrez tool, used at high stringency, i.e.
selecting only genes with pathogenic or likely pathogenic variants in ClinVar, yielded only one gene as expected for this monogenic disorder. Note that, as most callers, Bionano SVcaller currently identifies zygosity for X and Y chromosome variants in an XY individual as homozygous rather than hemizygous, which we corrected in Table 1. As a result of the erroneous input, internal frequency is currently 2 7 overestimated for variants on the X chromosome. Frequency calculations for SVs on the Y chromosome are not affected.
Using nanotatoR, we were able to automate steps that previously had to be taken manually to identify the pathogenic SVs in the DMD gene in the data sets: navigation to X chromosome location of the DMD gene, selection of the type of the SV (deletion/insertion, etc..), filtration by frequency, and curation of gene pathogenicity. In addition to the previously reported SVs, with the help of nanotatoR, we identified an additional deletion of unknown significance in sample CDMD_1159 that had previously been missed.

Conclusions
Structural variants play a major role in various genetic diseases (reviewed in [44] The field of genome-wide structural variation identification is rapidly advancing with LRS and OGM constantly evolving. However, currently both LRS and OGM technologies have limitations in identification of SNVs, large SVs >5kb (LRS) and small SVs <1kb (OGM) [17]. A combination of these various methods on the same genome will likely be necessary for optimal resolution and accuracy of SV detection [17,47], which will require the design of integrated platforms able to detect and classify variants using multiple types of data sets. Similarly, accurate determination of the pathogenicity of SVs requires integration of multiple data sets (e.g. OGM and gene expression) as well as tools capable of annotating these various types of data sets.
nanotatoR has this functionality and was able to considerably reduce analysis time compared to manual filtration of many steps. Future development of nanotatoR will be focused around adoption of variant annotation file (VCF) format as input files, support for SV calls produced by LRS/SRS technologies, additional population frequency databases such as gnomAD [48], and implementation of automated SV classification based on ACMG guidelines [49]. In addition, we plan to design a graphical interface for easy access and wider adoption.

Competing interests
EV owns a limited number of shares of Bionano Genomics Inc. HB is employed part-time by Bionano Genomics Inc.
HB employment at Bionano Genomics started after the development of nanotatoR. HB was also previously awarded stock options of Bionano Genomics Inc.

Funding
The research was funded in part by 5UL1TR001876-03 from the NIH National Center for   The nanotatoR pipeline is divided into 3 layers. Input Layer: Takes as input the OGM text or Smap file. Annotation Layer: The annotation layer comprises of five methods. The method that extracts the overlapping genes and the genes near (downstream and upstream) takes as input a BED file, and calculates the overlap percentage and the distance between nearest genes and SV using chromosomal locations. Next, the frequency calculation function calculates external and internal frequency taking DGV, DECIPHER and BNDB database as input for external frequency calculation, while input solo files, merged to form the internal frequency database, are taken as input to calculate the internal frequency. If RNA-Seq data is available, the expression count matrix is taken as input. Finally, output from all these methods as well as a primary gene list created from terms, is integrated, filtered based on quality criteria and written into an Excel file. Output Layer: The output is an Excel workbook, with each tab representing different SV types. The output files and number of tabs depend on the sample type and enzyme type: Singleton samples have 5 tabs for DLE and 6 tabs for SVmerge; 6 tabs for DLE and 7 tabs for SVmerge are created for dyad analyses; Trio analyses have 9 tabs for DLE and 10 tabs for SVmerge. Figure 2: nanotatoR annotates genes overlapping or near a SV: a) The cartoon shows three hypothetical scenarios: one deletion in the region upstream of Gene X (yellow) which may contain regulatory regions, indicated as solid purple in the reference genome (top) and striped in the patient's genome (bottom); one insertion (green) into Gene Y (blue), and a complete deletion of Gene Z (red, striped in patient genome). RNA-Seq reads are depicted as blue lines below the genes. b) nanotatoR annotation snapshot: nanotatoR annotates the three variants with the overlapping genes and percentage overlap, nearest genes upstream and downstream, distance to the breakpoints in kilobases, BNDB frequency, internal frequency, overlap gene expression value (in transcripts per million or TPM), nearest genes expression in TPM and overlapping genes term from NCBI databases.

Figure 3. Filtration and annotation of SV distribution in the NA24385 trio dataset:
Out of the total 9387 variants found by SVcaller, 8,804 passed the nanotatoR filtration of "Present in self molecules" and "Pass chimeric score" conditions. Of these, 2,787 were deletions (dark blue), 5,837 were insertions (light blue), 66 were duplications (grey), 114 were inversions (orange) and 0 were translocations (not shown), left pie chart. Note, nanotatoR outputs deletions, insertions and duplications in a single excel sheet "indel_dup". The number of variants was dramatically reduced after filtering for rare variants and imposing a confidence threshold (top right pie chart). With a threshold of less than 1% internal frequency, DGV frequency, BNDB frequency, and DECIPHER frequency and confidence thresholds of >0.5 for INDELs, >0.01 for inversions and >0.1 for translocations, 1,005 rare variants remain. These were further categorized with nanotatoR by inheritance (bottom right pie chart). All 53 inversions were inherited. Of the 952 indel_dup variants, only 8 were de novo, 794 are were identified as indel_dup_both (found in both mother and father), 68 are indel_dup_mother (found in only the mother), and 82 are indel_dup_father. This annotation can be used to evaluate relevance of the variants to the condition studied.

a) Unfiltered and filtered variants distribution in DLE and SVmerge datasets:
The total number of unfiltered variants for NA12878 DLE are 10,087, out of which 9,584 variants are filtered using nanotatoR criteria (found in self molecules, passed chimeric score threshold). For NA12878 SVmerge, out of 6,814 variants, 6,201 pass the filtration. Mismatches were not considered in this analysis but are shown in Table S7. b) SV distribution in the NA12878 DLE and SVmerge filtered datasets: Deletions (dark blue), insertions (light blue), duplications (grey) and inversions (orange) numbers are as shown in the pie charts. Bottom table shows distribution of SVs by type in percentages. For DLE, the majority of the identified SVs were insertions (64.9%), followed by deletions (33.5%), inversions (1.1%) and finally duplications (0.5%). While the total number of variants called is different between DLE and SVmerge, a similar pattern is seen in the SVmerge dataset. Many more duplications and inversions were called in the dual labeling than single DLE labeling method.