Dintor: functional annotation of genomic and proteomic data

Background
During the last decade, a great number of extremely valuable large-scale genomics and proteomics datasets have become available to the research community. In addition, dropping costs for conducting high-throughput sequencing experiments and the option to outsource them considerably contribute to an increasing number of researchers becoming active in this field. Even though various computational approaches have been developed to analyze these data, it is still a laborious task involving prudent integration of many heterogeneous and frequently updated data sources, creating a barrier for interested scientists to accomplish their own analysis.

Results
We have implemented Dintor, a data integration framework that provides a set of over 30 tools to assist researchers in the exploration of genomics and proteomics datasets. Each of the tools solves a particular task and several tools can be combined into data processing pipelines. Dintor covers a wide range of frequently required functionalities, from gene identifier conversions and orthology mappings to functional annotation of proteins and genetic variants up to candidate gene prioritization and Gene Ontology-based gene set enrichment analysis. Since the tools operate on constantly changing datasets, we provide a mechanism to unambiguously link tools with different versions of archived datasets, which guarantees reproducible results for future tool invocations. We demonstrate a selection of Dintor’s capabilities by analyzing datasets from four representative publications. The open source software can be downloaded and installed on a local Unix machine. For reasons of data privacy it can be configured to retrieve local data only. In addition, the Dintor tools are available on our public Galaxy web service at http://dintor.eurac.edu.

Conclusions
Dintor is a computational annotation framework for the analysis of genomic and proteomic datasets, providing a rich set of tools that cover the most frequently encountered tasks. A major advantage is its capability to consistently handle multiple versions of tool-associated datasets, supporting the researcher in delivering reproducible results.

Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-2279-5) contains supplementary material, which is available to authorized users.

, [4], [5]

Gcoords2gcoords
Convenience script to convert between Dintor genomic coordinates (GC) and other commonly used GC formats.

gcoords2genes
Query for human genes in the vicinity of a position on the genome, usually the position of a variation. Output includes the Ensembl human gene ID, the distance to the gene, its strand, and a distance-based rank.

gcoords2ld
Compute linkage disequilibrium (LD) between pairs of GCs or between a GC and a gene. Outputs the D' and r² measures. For a GC/gene pair, all SNPs from the gene are taken and the maximum LD measure is reported. Calculation can be restricted to certain populations available in the 1000 Genomes/HapMap projects. [6]

gcoords2reg
Query for regulatory regions from the ENCODE project for a position on the genome. The level of output detail can be chosen and reflects the way ENCODE data are organized. [7]

gcoords2snp
Inverse function of snp2gcoords. Queries a position on the genome for dbSNP entries and outputs any rs* IDs found. [8]

gcoordsconservation
Query whether a given position on the genome is located in a region conserved across a selectable set of organisms. Depending on the choice of conservation assignment, output is either binary (conserved or not) or a GERP score. [9], [10]

gene2canonexons
Retrieve a list of canonical exons for a human gene. Useful for producing input for Illumina DesignStudio software.

liftgcoords
Lift genomic coordinates from a previous human genome release to the current one.
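
To make the two measures output by gcoords2ld concrete, the following minimal sketch computes D' and r² for a pair of biallelic loci from phased haplotype counts, using the standard textbook definitions. It is not Dintor's implementation; the function name `ld_measures` and the count-based input are assumptions made for illustration only.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D', r^2) for two biallelic loci from haplotype counts.

    Hypothetical helper: the inputs are counts of the four phased
    haplotypes AB, Ab, aB and ab observed in one population.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes observed")

    p_A = (n_AB + n_Ab) / n          # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at locus 2
    d = n_AB / n - p_A * p_B         # raw disequilibrium coefficient D

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("at least one locus is monomorphic")

    # D is scaled by its theoretical maximum, which depends on its sign.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))

    d_prime = abs(d) / d_max
    r_squared = d * d / denom
    return d_prime, r_squared
```

For a GC/gene pair, the description above implies applying such a calculation to every SNP of the gene and reporting the maximum.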

snp2gcoords
Convert dbSNP rs* IDs to genomic coordinates as used in Dintor. Output includes Ensembl quality control flags, reference and alternate alleles, strand information, and evidence backing up the SNP. Inverse functionality is given by gcoords2snp. [8]

tbl2tbl
Import filter for tabular data. Used to import arbitrarily formatted text-based tables into Dintor's tab-separated format.

tblsubmerge
Merge rows of a table based on a column's unique identifier. Rows with the same entry in a specified column will be joined into a single row.

tblsubsplit
Split single or multiple cells into separate rows, optionally adding an index column that can be used to undo this operation by calling tblsubmerge.

HSEnsgProteinMapper
Human gene ↔ protein mapping tool. Maps between any combination of Ensembl gene ID, consensus coding sequence (CCDS), Ensembl transcript ID, Ensembl protein ID, and UniProt Swiss-Prot or TrEMBL accession number or entry name. A common use case is mapping from Ensembl gene IDs (or transcript IDs) derived from Dintor tools to UniProt or CCDS.
[13], [14]

HSGeneOrthologyMapper
Derive orthology information between human genes and the most common scientific model organisms: fruit fly, mouse, and worm. Orthology mappings are based on Ensembl. [15]
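
The round trip between tblsubsplit and tblsubmerge described above can be sketched as follows. This is an illustrative approximation rather than the tools' actual code: the function names, the `;` cell separator, and the placeholder identifiers are all assumptions.

```python
def sub_split(rows, col, sep=";", add_index=False):
    """Split multi-valued cells in column `col` into one row per value.

    With add_index=True, the original row number is appended as an extra
    column so that the split can later be undone by sub_merge.
    """
    out = []
    for idx, row in enumerate(rows):
        for part in row[col].split(sep):
            new_row = row[:col] + [part] + row[col + 1:]
            if add_index:
                new_row.append(str(idx))
            out.append(new_row)
    return out


def sub_merge(rows, key_col, sep=";"):
    """Join all rows sharing the same entry in `key_col` into one row."""
    groups = {}  # dicts keep insertion order, so row order is preserved
    for row in rows:
        groups.setdefault(row[key_col], []).append(row)

    merged = []
    for key, members in groups.items():
        new_row = []
        for c in range(len(members[0])):
            if c == key_col:
                new_row.append(key)
            else:
                # Keep distinct values in first-seen order.
                values = list(dict.fromkeys(m[c] for m in members))
                new_row.append(sep.join(values))
        merged.append(new_row)
    return merged


# Placeholder identifiers, not real database entries.
table = [["geneX", "protA;protB"], ["geneY", "protC"]]
split_rows = sub_split(table, col=1, add_index=True)
restored = [row[:-1] for row in sub_merge(split_rows, key_col=2)]
```

Merging on the index column rather than on the gene column is what makes the operation reversible even if two original rows happen to share a gene identifier.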

Interval2Genes
List human genes contained in a genomic region specified by a pair of begin/end GCs. Additionally, it can be used to relate a genomic position (as originating from a variation) to the genes contained in an LD block output by the Pos2LDBlock tool.
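
As a rough illustration of the interval query performed by Interval2Genes, the sketch below filters an in-memory gene list by a begin/end pair. The `Gene` record, the function name, and the choice to report only genes lying entirely inside the interval are assumptions; the tool itself may treat partially overlapping genes differently.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (coordinates are inclusive)."""
    gene_id: str
    chrom: str
    start: int
    end: int


def genes_in_interval(genes, chrom, begin, end):
    """Return IDs of genes on `chrom` fully contained in [begin, end]."""
    lo, hi = sorted((begin, end))  # tolerate swapped begin/end coordinates
    return [
        g.gene_id
        for g in genes
        if g.chrom == chrom and lo <= g.start and g.end <= hi
    ]
```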

Pos2LDBlock
Assign LD-based haplotype blocks to a position on the genome. The output encodes the relationship between the query position and the LD-haplotype block. [16]

TableJoiner
Join two tables on a common column. Unlike the Unix join command, this tool works on arbitrary, unsorted, tab-separated tables and allows transferring subsets of columns from the joined table.
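
The behaviour described for TableJoiner (joining unsorted, tab-separated tables and transferring only selected columns) can be approximated with a hash-based join. The sketch below is a minimal illustration under stated assumptions, not the tool's code: the function name and parameters are hypothetical, and it performs an inner join, which may differ from how TableJoiner handles unmatched rows.

```python
import csv
import io


def join_tables(left_text, right_text, left_key, right_key, right_cols):
    """Inner-join two tab-separated tables on a key column.

    Neither table needs to be sorted: the right table is indexed in a
    dictionary first. Only the columns listed in `right_cols` are
    transferred from the right table.
    """
    left = list(csv.reader(io.StringIO(left_text), delimiter="\t"))
    right = list(csv.reader(io.StringIO(right_text), delimiter="\t"))

    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left:
        for match in index.get(row[left_key], []):
            joined.append(row + [match[c] for c in right_cols])
    return joined
```

Building the index once avoids the sorted-input requirement of the Unix join command, at the cost of holding the right table in memory.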

VCF2Dint
Import filter for variant call format (VCF) files into Dintor tables.

Dintor tool name Description References

ClinVarAnnotator
Output hits in the NCBI ClinVar database for a GC, a GC with reference and alternate allele, or an interval on a chromosome (e.g. derived from Interval2Genes). [17]

DrugBankAnnotator
Retrieve DrugBank information for UniProt accession numbers. Lists drugs associated with proteins identified by their respective UniProt accession numbers. [18]

GOAnnotator
Query Gene Ontology (GO) for either GO terms and their descendants or for GO terms associated with UniProt accession numbers. A variety of information can be retrieved, such as GO term names, evidence codes, and ontology name. Filters exist to limit the number of terms to a certain depth in the GO graph, to ontologies, to certain types of edges, and to high-quality Swiss-Prot entries. [19]

HGMDAnnotator [not available on public Galaxy web server]
The Human Gene Mutation Database (HGMD) contains manual annotations of human gene mutations. Due to licensing restrictions, access to this database is only available as a command line tool interfacing with a purchased HGMD MySQL database. The tool itself has an interface comparable to ClinVarAnnotator, and a Galaxy interface is ready for license holders running their own Galaxy server. [20]
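
The descendant query with a depth filter that GOAnnotator offers can be pictured as a breadth-first traversal of the ontology graph. The sketch below is an assumption-laden illustration: the function name and the toy graph with placeholder term names are invented, and depth is counted relative to the query term, whereas the tool may measure depth from the ontology root.

```python
from collections import deque


def descendants(children, root, max_depth=None):
    """Return {term: depth} for all descendants of `root`.

    `children` maps a term to its direct child terms. Because terms can
    have several parents, each term is recorded once, at the shortest
    distance from `root`. `max_depth` optionally limits the traversal.
    """
    seen = {}
    queue = deque([(root, 0)])
    while queue:
        term, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand terms at the depth limit
        for child in children.get(term, ()):
            if child not in seen:
                seen[child] = depth + 1
                queue.append((child, depth + 1))
    return seen


# Toy graph with placeholder term names (not real GO identifiers).
toy_graph = {"T0": ["T1", "T2"], "T1": ["T3"], "T2": ["T3"], "T3": ["T4"]}
```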

HSGeneAtlas
Retrieve tissue-specific gene expression for human genes using the Genomics Institute of the Novartis Research Foundation (GNF) Gene Atlas. Filters are available for gene over- and/or under-expression and tissue types. [21]

InteractionAnnotator
Find protein interaction partners using the iRefIndex database. Additional data characterizing the interactions, such as external references and experimental detection techniques, can optionally be output. Interactions may be restricted to a panel of predefined genes/proteins. [22]

PharmaADMEntor
Highlight mutations in an industry-initiated database of genetic biomarkers reliably involved in drug metabolism. [23]

ReactomeAnnotator
Retrieve information about pathways, reactions, and participating molecules from the Reactome database, taking into account the hierarchical (parent-child) structure of the data. The tool can be queried by UniProt accession numbers, Reactome identifiers, or free text; the output may be restricted to a predefined panel. [24]

Dintor tool name Description References

GOEnricher
Perform GO term-based gene set enrichment analysis. Enrichment can be performed on any of the three ontologies (biological process, molecular function, and cellular component). Correction for multiple hypothesis testing and result clustering are available. Enriched GO terms are usually reported for the gene set as a whole, but can also be broken down to individual genes. [25], [26]

GOFunSim
Compute pairwise protein functional similarity. The tool offers calculation of five different functional similarity measures based on six different semantic similarity measures. In addition, functional similarity can be computed between a list of proteins and a predefined panel of (usually related) proteins. Furthermore, semantic similarity can be derived for pairs of GO terms. Graph-based GO term information content can also be output. [27]

MendelianFilter
Remove variants that do not comply with a certain mode of Mendelian inheritance. The tool operates on a multi-sample VCF file and furthermore requires that relatedness be provided as a pedigree (PED format). Filtering is possible for autosomal dominant or recessive, X-linked dominant or recessive, and mitochondrial-linked inheritance. [28], [29]

MetaRanker
Given an object (e.g. a gene) associated with multiple scores, each one in a single column of a table, compute a single, rank-based score from these columns. This module is used in the final ranking provided by Prioritizer. Columns may contain missing values, the ordering of a column's content can be specified individually, and the final rank calculation allows weighting the contributing columns. [30]

Prioritizer
Performs candidate gene prioritization by a guilt-by-association approach. Candidate genes are compared to a user-defined panel of related genes (e.g. disease-associated) by the following Dintor tools:
• InteractionAnnotator: Does the candidate gene interact with a panel gene?
• ReactomeAnnotator: Does the candidate gene share pathways with the panel genes?
• GOFunSim: Is there high functional similarity between the candidate gene and genes from the panel?
• GOAnnotator: Is the candidate gene involved in similar GO classes as the panel genes? [31]
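
The final ranking step that Prioritizer delegates to MetaRanker combines several score columns into one rank-based score. The source does not give the formula, so the sketch below shows one plausible scheme only: each column is converted to normalized ranks (ties share their average rank), missing values are skipped, and the ranks are combined as a weighted mean. All names and the example scores are hypothetical.

```python
def column_ranks(values, higher_is_better):
    """Normalized ranks in (0, 1] for one column; None stays None."""
    present = [(v, i) for i, v in enumerate(values) if v is not None]
    present.sort(key=lambda t: t[0], reverse=higher_is_better)
    ranks = [None] * len(values)
    n = len(present)
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and present[end + 1][0] == present[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the tie group
        for _, i in present[pos:end + 1]:
            ranks[i] = avg_rank / n
        pos = end + 1
    return ranks


def meta_rank(table, higher_is_better, weights):
    """Return (name, score) pairs, best first; a lower score is better.

    `table` maps an object name to a tuple of scores, one per column.
    Objects with no scores at all get a score of None and are listed last.
    """
    names = list(table)
    columns = list(zip(*table.values()))
    ranked = [column_ranks(c, h) for c, h in zip(columns, higher_is_better)]

    scores = {}
    for row, name in enumerate(names):
        num = den = 0.0
        for col_ranks, w in zip(ranked, weights):
            if col_ranks[row] is not None:  # skip missing values
                num += w * col_ranks[row]
                den += w
        scores[name] = num / den if den else None

    return sorted(
        scores.items(),
        key=lambda kv: (kv[1] is None, kv[1] if kv[1] is not None else 0.0),
    )


# Invented scores: column 0 is "higher is better", column 1 "lower is better".
example = {"geneA": (0.9, 2.0), "geneB": (0.5, None), "geneC": (0.1, 1.0)}
ranking = meta_rank(example, higher_is_better=[True, False], weights=[2.0, 1.0])
```

Dividing each rank by the number of ranked entries in its column keeps columns with many missing values comparable to complete ones; other normalizations would be equally defensible.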