Sample data sets
Optically mapped genomes for 8 different reference human samples were used to construct the internal cohort database for evaluating nanotatoR’s performance. All sample datasets, including “Utah woman” (Genome in a Bottle Consortium sample NA12878), “Ashkenazi family” (NA24143 [or GM24143]: Mother, NA24149 [or GM24149]: Father, and NA24385 [or GM24385]: Son), GM11428 (6-year-old female with duplicated chromosome), GM09888 (8-year-old female with trichorhinophalangeal syndrome), GM08331 (4-year-old with chromosome deletion) and GM06226 (6-year-old male with chromosome 1–16 translocation and associated 16p CNV), were obtained from the Bionano Genomics public datasets (https://bionanogenomics.com/library/datasets/). OGM-based genome assembly and variant calling and annotation were performed using Solve version 3.5 (Bionano Genomics). Subsequently, samples were annotated with nanotatoR to examine the performance.
Additionally, we tested nanotatoR’s ability to accurately annotate the known disease variants in a previously published cohort of 11 Duchenne Muscular Dystrophy samples [27]. For this, internal cohort frequency calculations are based on these 11 samples. For gene expression integration, NA12878 fastq files (RNA-Seq) were obtained from Sequence Read archive (SRA) (Sample GSM754335) and aligned to reference hg19, using STAR [30]. Read counts were estimated using RSEM [31], reported in Transcripts per Million (TPM).
nanotatoR input file formats
nanotatoR was written in R language. The nanotatoR pipeline takes as input Bionano-annotated SV files, in the format of either unmodified SMAP (BNG’s SVcaller output) or text (TXT) files that retain information from the SMAPs, but also append additional fields. The two main differences between the input files are:
-
a)
Enzyme: If a combination of restriction endonucleases (Nt.BspQI and Nb.BssSI) are used for DNA labeling, genome assembly and variant calling, the resultant SV call sets from each enzyme are merged into a single TXT file in an SMAP format (SVmerge function, Bionano Genomics). If a single direct labeling enzyme such as DLE1 is used for DNA labeling, the resultant SV call set is kept in a single SMAP file format. Both file types (SVmerge TXT and SMAP) can serve as input files for nanotatoR.
-
b)
Family: Depending on the availability of family members, the SV-containing input files (TXT/SMAP) contain additional information derived from the Variant Annotation Pipeline (BN_VAP, Bionano Genomics). For trio analysis (proband, mother, father), BN_VAP performs molecule checks for SVs identified in the proband (self-molecules) and checks whether the SV is present in the parents’ molecules. For duo analysis (proband vs. control -any family member or unrelated individual- or tumor vs. normal) variants’ presence is evaluated in self-molecules as well as the control sample molecules. For proband-only analysis variants’ presence is evaluated in self-molecules only.
nanotatoR functions
The annotations provided by nanotatoR are currently subdivided into 5 categories described below: 1) calculation of SV frequency in external and internal databases; 2) determination of gene overlaps; 3) integration of gene expression data; 4) extraction of relevant phenotypic information from public databases for 5) variant filtration. Finally, all of the sub-functions are compiled into a Main function.
Function 1: Structural variant frequency
Variant frequency is one of the most important filtration characteristics for the identification of rare, possibly pathogenic, variants. Because OGM is not sequence based the average SV breakpoint uncertainty is 3.3 kbp [20]. As a result, compared with SNV frequency calculations, frequency estimates for SVs pose greater difficulty, due to the breakpoint variability between “same” structural variants identified by different techniques.
Function 1.1 - External databases
nanotatoR uses 3 external databases: Database of Genomic Variants (DGV) [32], Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER) [33] and Bionano Genomics control database (BNDB). The respective functions are named: DGVfrequency, DECIPHERfrequency and BNDBfrequency. The 3 datasets are accessible through the nanotatoR GitHub repository (https://github.com/VilainLab/nanotatoRexternalDB).
BNDB is provided by Bionano Genomics in a subdivided set of 4 files based on the type of SVs (indels, duplications, inversions, and translocations) for two different human reference genomes (GRCh37/hg19 and GRCh38/hg38). nanotatoR aggregates the variant files of the user-selected reference genome (hg19 or hg38), into a single format (e.g. TXT) used for frequency calculation. This action is performed as part of the function BNDBfrequency with the following input parameters: buildBNInternalDB = TRUE, InternalDBpattern = “hg19” or InternalDBpattern = “hg38”. The following steps are used to calculate the frequency of a query SV in external databases:
-
1.1a
Variant-to-variant similarity: Estimating the frequency of a query SV first requires determining whether the variant is the same as the ones found in a database of interest. In order for the SVs to be considered “same”, nanotatoR, by default, checks whether two independent variants of the same type (e.g. deletion) are on the same chromosome, have 50% or greater size similarity, and if the SV breakpoint start and end positions are within 10 kilobase pairs (kbp) for insertions/deletions/duplications and within 50 kbp for inversions/translocations. For example, if there is a deletion on chromosome 1 with a breakpoint start at position chr1:350,000 and end at chr1:550,000 on the reference, all deletion variants in chr1:340,000–560,000 with a size similarity of 50% would be extracted from the database. Similarly, if the variant was an inversion, nanotatoR would search for variants of the same type and on the same chromosome, with a breakpoint start between chr1:300,000 and chr1:400,000 and breakpoint end between chr1:500,000 and chr1:600,000. Currently the 50% size similarity cutoff is not implemented by default for inversions and translocations, as sizes have only started to be provided in the SVcaller output recently; however, users have an option to run the size similarity, and future releases of nanotatoR will perform the size similarity calculations by default.
The percentage similarity parameters (DECIPHER and BNDB functions: input parameter perc_similarity; DGV function: input parameter perc_similarity_DGV) and breakpoint start and end error (DECIPHER and BNG functions insertion, deletion and duplication: input parameter win_indel; DGV function insertion, deletion and duplication: win_indel_DGV; DECIPHER and BNG functions inversion and translocation: win_inv_trans; DGV function inversion and translocation: win_inv_trans_DGV) are modifiable by the user.
-
1.1b
Variant size and confidence score: Two additional criteria are implemented to select for high-quality variants in BNDB. Bionano’s SVcaller calculates a confidence score for insertions, deletions, inversions, and translocations. To calculate allele frequency, nanotatoR takes into account the BNDB variants above a threshold quality score of 0.5 for insertions and deletions (indelconf), 0.01 for inversions (invconf) and 0.1 for translocations (transconf). These thresholds can be modified by the user. In addition, nanotatoR filters out SVs below 1 kbp in size to decrease the likelihood of false positive calls [20].
-
1.1c
Zygosity: Variants in BNDB are reported as homozygous, heterozygous or “unknown” (DGV and DECIPHER do not report zygosity). This is used to refine frequency calculation for BNDB SVs: nanotatoR attributes an allele count of 2 for homozygous SVs and 1 heterozygous SVs. Currently, nanotatoR overestimates the frequency for variants that overlap with reference database (BNDB) SVs for which the zygosity is unknown by counting the number of alleles as 2. If the query SV matches with multiple variants in the BNDB from the same BNDB sample, nanotatoR counts these as a single variant/sample, with allele count of 2 for homozygous/unknown and 1 for heterozygous matches.
-
1.1d
Frequency calculations: for DECIPHER and DGV, SV frequency is calculated by dividing the number of query matched database variants (step 1.1a) by the total number of alleles in the database, i.e. 2x the number of samples, which are diploid, and multiplying with 100 to get percentage frequency (Formula 1).
Formula 1: External public (DGV and DECIPHER) database SV frequency calculation. Numerator: number of variants that pass the similarity criteria (step 1.1.a). Denominator is twice the number of samples, i.e. the number of alleles. This ratio is multiplied by 100 to express the frequency as a percentage. |
$$ E\mathrm{x} ternal\ DB\ SV\ frequency=\frac{\mathrm{Number}\ \mathrm{of}\ \mathrm{matching}\ \mathrm{variants}\ \left(\mathrm{step}\ 1.1\mathrm{a}\right)}{2\ X\ total\ number\ of\ samples\ }\ X\ 100 $$
For BNDB two types of frequency calculations are performed: filtered and unfiltered. For filtered frequency calculations the following criteria must be met: 1.1a; 1.1b; 1.1c. For unfiltered variants frequency calculation only 1.1a and 1.1c criteria are enforced. The resultant number of identified counts is divided by the number of alleles in BNDB (currently 468 for 234 diploid samples). The result is multiplied by 100 to get a percentage (Formula 2).
Formula 2: BNDB database filtered SV frequency calculation. The variants that pass the similarity criterion (step 1.1.a) are filtered with size threshold and quality score (step 1.1.b). The number of variants is estimated as mentioned in step 1.1.c. Denominator is the number of alleles. This ratio is multiplied by 100 to get the frequency in percentage. |
$$ Filtered\ SV\ frequency=\frac{\mathrm{Number}\ \mathrm{of}\ \mathrm{matching}\ \mathrm{filtered}\ \mathrm{variants}\ \left(\mathrm{steps}\ 1.1\mathrm{a},\mathrm{b},\mathrm{c}\right)}{2\ X\ Number\ of\ BNDB\ samples\ }\ X\ 100 $$
Output: The output is appended to the original input file in individual columns. For DECIPHER, this consists of a single column termed “DECIPHER_Freq_Perc”. As DGV provides information on number of samples in addition to frequency, nanotatoR prints two columns: “DGV_Count” (with the total number of unique DGV samples containing variants matching the query SV) and “DGV_Freq_Perc” (for the percentage calculated using Formula 1). For the BNDB, in addition to “BNG_Freq_Perc_Filtered”, “BNG_Freq_Perc_UnFiltered”, a third column reports “BNG_Homozygotes” (number of homozygous variants that pass the filtration criteria).
Function 1.2 - Internal databases
The internal cohort analysis is designed to calculate variant frequency based on aggregation of SVs for samples ran within an institution or laboratory and provides parental zygosity information for inherited variants in familial cases. The function consists of two distinct parts:
-
1.2a
Building the internal cohort database: Individual (solo) SMAP files for each of the samples are concatenated to build an internal database (buildSVInternalDB = TRUE), which is stored in the form of a text file. This step creates a unique sample identifier (nanoID) based on a key provided that ensures unique sample ID and encodes family relatedness. The nanoID is written as NR < Family # > . < Relationship #>. For example, the proband in a family of three (trio) would be denoted as NR23.1, with NR23 denoting the family ID and 1 denoting the proband. For the parents of this proband, the nanoID would be NR23.2 for the mother and NR23.3 for the father. Currently, only trio analyses are supported, future updates will include larger family analyses. If multiple projects exist within the same institution and are coded with project-specific identifiers nanotatoR will append the project-specific identifier in front of the nanoID (e.g. Project1_NR23.1 and Project2_NR42.1).
-
1.2b
Calculating internal frequency and determining parental zygosity: For singleton analyses, the function internalFrequency_Solo (for both DLE labeling and SVmerge) calculates internal database frequency of queried SVs based on the same principles explained in section 1.1d (Formula 2) for BNDB frequency calculations. However, additional filtration criteria are implemented to increase the accuracy of frequency estimation. SVs overlapping gaps in hg19/hg38 are annotated in the output SMAPs as “nbase” calls (e.g. “deletion_nbase”) and are likely to be false. nanotatoR filters out “nbase”-containing SVs when estimating internal frequency. For duplications, inversions, and translocations nanotatoR evaluates whether chimeric scores “pass” the thresholds set by the Bionano SVcaller during de-novo genome assembly [34] ensuring that SVs that “fail” this criterion are eliminated from internal frequency calculations (Fail_BSPQI_assembly_chimeric_score = “pass” or Fail_BSSSI_assembly_chimeric_score = “pass”) for SVmerge datasets, or (Fail_assembly_chimeric_score = “pass”) for a single-enzyme dataset. Lastly, nanotatoR checks whether the SVs were confirmed with Bionano Variant Annotation Pipeline, which examines individual molecules for support of the identified SV [34] (Found_in_self_BSPQI_molecules = “yes” or Found_in_self_BSSSI_molecule = “yes”) for SVmerge datasets, or (Found_in_self_molecules = “yes”) for a single-enzyme dataset.
For family analyses (duos and trios), the internalFrequencyTrio_Duo function is used to identify parental/control sample zygosity based on the nanoID coding using criteria described in sections 1.1.a/c (note that, here, the default size similarity percentage used is ≥90% as inherited variants are expected to be virtually identical). Zygosity information for the identified variants is extracted and appended into two separate columns (fatherZygosity and motherZygosity). This functionality is available for both SVmerge (merged outputs from 2 enzymes) and single enzyme labeling. SVs with the same family ID as the query are not included in the overall internal frequency calculation as described in the previous paragraph. Five columns: “MotherZygosity”, “FatherZygosity”, “Internal_Freq_Perc_Filtered”, “Internal_Freq_Perc_Unfiltered”, and “Internal_Homozygotes” are appended to each of the annotated input files (nonrelevant fields contain dashes).
Function 2: Gene overlap
The gene overlap function (overlapnearestgeneSearch) identifies known gene and non-coding RNA genomic locations that overlap with or are immediately upstream or downstream from the identified SVs. This function takes as input an SV-containing file (TXT or SMAP) and a modified Browser Extensible Data (BED) file where human X and Y chromosomes are numbered as 23 and 24 respectively. The user has an option of either providing a Bionano-provided modified BED file (inputfmtBed = “BNBED”) or a BED file from UCSC Genome Browser or GENCODE (inputfmtBed = “BED”). For the latter, nanotatoR supports conversion of a BED file into BNBED standard with buildrunBNBedFiles function. The BED file is used to extract location/orientation of genes and overlap this information with SVs (overlapGenes function, called from overlapnearestgeneSearch). The function calculates the percentage overlap, by calculating its distance from the breakpoint start (if the gene is partially upstream of the SV) or breakpoint end (if the gene is partially downstream of the SV), and dividing it by the length of the gene (calculated by nanotatoR from genomic coordinates information in the BED file). By default, nanotatoR applies a 3 kbp gene overlap window (breakpoint start – 3 kbp; breakpoint end + 3 kbp) to account for the typical OGM breakpoint error [20] when searching for genes overlapping insertions, deletions and duplications. For inversions and translocations overlapping genes are limited to +/− 10 kbp from the breakpoint start/end (both parameters are user-selectable). For the nonOverlapGenes function (also called from overlapnearestgeneSearch), genes located near, but not overlapping with, SVs are reported along with the corresponding distances from the SV. Genes are sorted based on their distance from the SV breakpoints. The default number of reported nonoverlap genes is 3 (also user-selectable). The output produces 3 additional columns: “OverlapGenes_strand_perc”, “Upstream_nonOverlapGenes_dist_kb” and “Downstream_nonOverlapGenes_dist_kb”.
Function 3: Expression data integration
The SVexpression_solo/_duo/_trio functions for singletons, dyads and trios respectively provide the user with tools to integrate tissue-specific gene expression values with SVs. The function takes as input a matrix of gene names (first column) and corresponding expression values for each sample, with the sample names as column headers (individual files can be merged by the RNAseqcombine function for dyads/trios or RNAseqcombine_solo function for singletons). The SVexpression_solo/_duo/_trio function can take as an input read estimation from any transcript quantification tool, provided the input format mentioned above is followed.
To differentiate between sample types (probands and affected/unaffected parents), we recommend the user add a code (or pattern) to the file name. For example, for the proband of family 23, the expression file name would be Sample23_P_expression.txt, where P denotes proband. Currently, by default nanotatoR recognizes “P” as proband, “UM/AM” for unaffected/affected mother and “UF/AF” for unaffected/affected father. A function to encode more complex intra-family relatedness and identify individual samples, termed nanoID, is in development and will be available by default in the next release of nanotatoR.
Expression values for overlapping and non-overlapping SV genes are extracted from the genome-wide expression matrix and appended into separate columns in the overall SV input file. For example, an overlap of gene X with a SV in the proband, with an expression value of 10, would be represented as gene X (10), and printed in the “OverlapProbandEXP” column. The appended number of columns in the output is dependent on which functions were run. For a trio analysis, 9 columns are added: “OverlapProbandEXP”, “OverlapFatherEXP”, and “OverlapMotherEXP” for overlapping genes, and a similar set for up- and down-stream non-overlapping genes (e.g. “NonOverlapUPprobandEXP”, “NonOverlapDNprobandEXP”). In case of dyads the number of columns would be 6 (with the parent column being either mother or father) and 3 for singletons.
Function 4: Entrez extract
The gene_list_generation function assembles a list of genes based on the patient’s phenotype and overlaps it with gene names that span SVs. User-provided, phenotype-based keywords are used to generate a gene list from the following databases: ClinVar [35], OMIM (https://omim.org/), GTR [36], and the NCBI’s Gene database (www.ncbi.nlm.nih.gov/gene). The input to the function is a term, which can be provided as a single term input (method = “Single”), a vector of terms (method = “Multiple”), or a text file (method = “Text”). The output can be a dataframe or text. The rentrez [37] and VarfromPDB [38] R-language packages are used to extract data related to each of the user-provided phenotypic terms, from the individual databases. For the Gene database rentrez provides the entrez IDs associated with each gene, which are converted in nanotatoR to gene symbols using org.Hs.eg.db [37], a Bioconductor package. For OMIM, rentrez provides the OMIM record IDs, which are used to extract the corresponding disease-associated genes from the OMIM ID-to-gene ID conversion dataset (mim2gene.txt). For GTR, rentrez extracts the GTR record IDs, which are then used to extract corresponding gene symbols from the downloaded GTR database. For ClinVar, VarfromPDB is used to extract genes corresponding to the input term. All genes to which the query keyword is attached, irrespective of their clinical significance, are extracted; genes of clinical significance (i.e. those for which Pathogenic/Likely Pathogenic variants are reported) are further reported in a separate column. The user also has the option to download the ClinVar and GTR databases by choosing downloadClinvar = TRUE and downloadGTR = TRUE, which may improve run times. The user has an option to save the datasets (removeClinvar = FALSE and removeGTR = FALSE) or delete the database after the analysis is completed (removeClinvar = TRUE and removeGTR = TRUE).
The output is provided in CSV format with 3 columns: “Genes”, “Terms”, and “ClinicalSignificance”. The “Terms” column contains the list of terms associated to each gene and corresponding database from where the association was derived. The “ClinicalSignificance” column contains genes that have clinical significance (Pathogenic/Likely Pathogenic variants) for the associated term, derived from the ClinVar database. The output of entrez extract serves as input for the subsequent variant filtration step.
Function 5: Variant filtration
The filtration function has two major functionalities: categorization of variants into groups (such as de novo, inherited from mother/father; potential compound heterozygous; inversions or translocations) and integration of the primary gene list (either provided by the user or generated by nanotatoR as described in section 4) into the input SV-containing file. In this function, genes overlapping and near SVs that are present in the primary gene list are printed in separate columns. For genes that overlap with a SV, the output is printed in columns termed “Overlap_PG” (primary genes that are common with the genes that overlap SVs) and “Overlap_PG_Terms” (terms and database the genes are being extracted from). Genes that are up/downstream of, but not overlapping with, the SV are divided into “Non_Overlap_UP_PG”,“Non_Overlap_UP_Terms”, “Non_Overlap_DN_PG”, and “Non_Overlap_DN_Terms”.
The final annotated SV calls are divided into multiple sheets:
-
All: contains all identified variant types and annotations.
-
all_PG_OV: Variants overlapping with the primary gene list.
-
Mismatch: Variant Type = “MisMatch” in SVmerge output (available for dual enzyme labeling output only) contains SV calls discordant between the two enzymes datasets (e.g. one called a deletion and the second an insertion).
By default, for the rest of the sheets, nanotatoR performs self-molecule checks for all SV types (described in section 1.2b, Found_in_self_ molecules = yes) and chimeric score quality checks for duplications, inversions and translocations (also described in section 1.2b, Fail_BSPQI_assembly_chimeric_score = “pass”) and reports on zygosity. (These two criteria are henceforth called “default nanotatoR filtration”). For indel_dup, this comprises:
-
For singleton studies (solo): Variant type = “insertion”, “deletion”, “duplication”, “duplication_split” or “duplication_inverted”.
-
For dyads: indel_dup_notShared or indel_dup_Shared respectively: Insertions, deletions and duplications not present in a control sample, mother or father (Found_in_control_molecules = “no”) or present in a control sample, mother or father (Found_in_control_molecules = “yes”).
-
For trios, insertions, deletions or duplications are reported in:
-
◊ indel_dup_denovo if not present in the parents (Found_in_parents_BSPQI_molecules/Found_in_parents_BSSSI_molecules = “none” or “-” for SVmerge or Found_in_parents_molecules = “none” for single enzyme).
-
◊ indel_dup_both if present in the proband, as well as both parents (e.g. Found_in_parents_molecules = “both”).
-
◊ indel_dup_mother and indel_dup_father if inherited from only the mother or father, respectively (e.g. Found_in_parents_molecules = “mother” or “father”).
-
◊ indel_dup_cmpdHET if present in the heterozygous state in parents. This output is a combination of Indel_dup_mother and Indel_dup_father datasets. The user can manually inspect this list to find potential compound heterozygote variants, i.e. 2 variants with genomic coordinates overlapping with the same gene, one present in the mother, the other in the father.
Finally, inversion and translocations are reported as follows, for singletons, dyads, or trios:
-
Variant Type = “inversion” or “inversion_paired” or “inversion_partial” or “inversion_repeat”.
-
Variant Type = “translocation_intrachr” or “translocation_interchr” if within a single chromosome or between two chromosomes respectively.
There is a total of 6 variant filtration functions, based on enzyme type (SVmerge for dual-enzyme labeling or single enzyme) and sample type (singleton, dyad, or trio). For example, for a single-enzyme singleton dataset the function is run_bionano_filter_SE_solo. Others are run_bionano_filter_SE_duo, run_bionano_filter_SE_trio, run_bionano_filter_SVmerge_solo, run_bionano_filter_SVmerge_duo, and run_bionano_filter_SVmerge_trio. Variant filtration output is an Excel file with each of the different output groups represented as separate tabs. The output is written in Excel file format using the openxlsx [39] package, with the following default naming convention “Sample1_solo.xlsx”.
Main function
The main function sequentially runs the available nanotatoR sub-functions by merging the outputs from each step. There are in total 6 main functions depending on the type of input: nanotatoR_main_Solo_SE; nanotatoR_main_Duo_SE; nanotatoR_main_Trio_SE; nanotatoR_main_Solo_SVmerge; nanotatoR_Duo_SVmerge and nanotatoR_SVmerge_Trio. The Main function takes the SMAP file, DGV file, BED file, internal database file and phenotype term list as inputs and provides a single output file in a form of an Excel spreadsheet. The output location and file name are user-specified.