GrabBlur - a framework to facilitate the secure exchange of whole-exome and -genome SNV data using VCF files
© Stade et al.; licensee BioMed Central Ltd. 2014
Published: 20 May 2014
Next Generation Sequencing (NGS) of whole exomes or genomes is increasingly being used in human genetic research and diagnostics. Sharing NGS data with third parties can help physicians and researchers to identify causative or predisposing mutations for a specific sample of interest more efficiently. In many cases, however, the exchange of such data may conflict with data privacy regulations. GrabBlur is a newly developed tool to aggregate and share NGS-derived single nucleotide variant (SNV) data in a public database while keeping individual samples unidentifiable. In contrast to other existing SNV databases, GrabBlur includes phenotypic information and contact details of the submitter of a given database entry. With GrabBlur, human geneticists can securely and easily share SNV data from resequencing projects. GrabBlur can ease the interpretation of SNV data by offering basic annotations, genotype frequencies and, in particular, phenotypic information (provided that this information was shared) for the SNV of interest.
GrabBlur facilitates the combination of phenotypic and NGS data (VCF files) via a local interface or command line operations. Data submissions may include HPO (Human Phenotype Ontology) terms, other trait descriptions, NGS technology information and the identity of the submitter. Most of this information is optional and its provision is at the discretion of the submitter. Upon initial intake, GrabBlur merges and aggregates all sample-specific data. If a certain SNV is rare, the sample-specific information is replaced with the submitter identity. Generally, all data in GrabBlur are highly aggregated so that they can be shared with others while ensuring maximum privacy. Thus, it is impossible to reconstruct complete exomes or genomes from the database or to re-identify single individuals. After the individual information has been sufficiently "blurred", the data can be uploaded into a publicly accessible domain where aggregated genotypes are provided alongside phenotypic information. A web interface allows querying the database and extracting gene-wise SNV information. If an interesting SNV is found, the interrogator can contact the submitter to exchange further information on the carrier and clarify, for example, whether the latter's phenotype matches the phenotype of their own patient.
Since its introduction in 2005, Next Generation DNA Sequencing (NGS) has been used successfully in numerous research projects. Meanwhile, further technological advances have reduced the per base pair sequencing costs dramatically, thereby allowing more and more molecular diagnostics laboratories to screen the complete exome of individual patients with an apparently inherited disease for causative mutations. Indeed, exome sequencing has already started to revolutionize diagnostic genetic testing. However, pertinent data privacy law, the type of informed consent declarations used and limited genetic counseling resources bar sharing of high-resolution genetic data with third parties in most countries. From both a medical and a scientific point of view, this "locking" of data is hardly compatible with good professional practice. For instance, for a physician or geneticist it may be essential to know whether a particular mutation found in the genome of their patient has been found before in another patient with a similar phenotype. Related questions are also likely to arise in basic research projects on both monogenic and complex (i.e. oligogenic) diseases.
We developed GrabBlur, a tool to collect and aggregate (i.e. "grab" and "blur") 'single nucleotide variants' (SNVs) linked to a specific trait or phenotype, and to share them with others by way of a public database while keeping individual samples unidentifiable. The database will not only help human geneticists to distinguish between benign variant findings and truly disease-causing mutations, but will also benefit genetic epidemiological research (i.e. case-control association studies) based upon large-scale SNV data.
In contrast to databases like ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) or the Human Gene Mutation Database (HGMD), which only contain out-of-context information on genotype-phenotype associations, GrabBlur provides access to all SNVs detected in a given patient alongside the description of their specific phenotype. The Exome Variant Server (EVS) provides about 2 million annotated SNVs of 6,500 individuals with heart-, lung- and blood-related diseases; more details are not specified. Because the SNVs are simply aggregated there, it is not possible to find out which SNV originated from which individual and phenotype. The EVS helps researchers to exclude SNV candidates found in patients with monogenic diseases, but it is not a resource for exchanging genotypic and phenotypic data from other data sets, especially for Mendelian diseases, where the exact phenotypes are often needed. Owing to its level of comprehensiveness, GrabBlur helps users not only to recognize known mutations, but also to validate newly found ones.
The most important feature of GrabBlur is the high level of anonymity ensured by its process of data aggregation. No conclusions as to the identity of a patient can be drawn even if the entire data stored for that individual are downloaded or the whole database is mirrored. It is possible neither to reconstruct a single patient genome nor to re-identify a patient from knowing their SNVs. Data is aggregated at the site of the submitter, i.e. behind their own firewall and under their responsibility for data protection. Hence, no identifying data leaves the submitter institution, and even if the data is "tapped" by an unauthorized person during upload to the database, a high level of privacy protection is maintained.
DNA sequence data are accepted by GrabBlur in standardized VCF format. Additional information such as the phenotype or gender of a patient is stored in a separate "initialization file" (INI file format). Most of this information is optional and provision is at the discretion of the submitter. The following information may be recorded:
Trait: A description of the disease of all patients in a GrabBlur set of samples (see below). Samples must be marked at least as 'patient' or 'healthy control'. (mandatory)
Phenotype: GrabBlur uses Human Phenotype Ontology (HPO) terms to classify phenotypes. Every phenotype can be ascribed an unlimited number of HPO terms. (optional)
Gender: Gender of a single patient. (optional)
Platform: DNA sequencing technology used. (optional)
Enrichment: DNA enrichment kit used for sequencing. (optional)
PI: Identity of principal investigator. (optional)
Contact details: Identity, affiliation and e-mail address of the submitter. (mandatory for upload, but release to the public database is optional)
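To make the role of the initialization file more concrete, the following Python sketch parses a small INI text with the fields listed above. The section names, key names, trait text and HPO identifiers are hypothetical placeholders chosen for illustration; the actual GrabBlur file layout is not specified here and may well differ.

```python
import configparser

# Hypothetical initialization file. Section and key names are illustrative
# assumptions, not GrabBlur's documented schema; the HPO IDs are placeholders.
EXAMPLE_INI = """\
[set]
trait = example trait
pi = Example Investigator
contact = submitter@example.org

[sample1]
status = patient
hpo = HP:0000118, HP:0000001
gender = female
platform = example platform

[sample2]
status = healthy control
"""


def load_samples(text):
    """Parse the INI text and return (trait, samples keyed by section name)."""
    # Only '=' is used as delimiter so that colons in HPO IDs are harmless.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    if not parser.has_option("set", "trait"):
        raise ValueError("the trait description is mandatory")

    samples = {}
    for name in parser.sections():
        if name == "set":
            continue
        section = parser[name]
        status = section.get("status")
        if status not in ("patient", "healthy control"):
            raise ValueError(f"{name}: status must be patient or healthy control")
        raw_hpo = section.get("hpo", fallback="")
        samples[name] = {
            "status": status,
            "hpo": [term.strip() for term in raw_hpo.split(",") if term.strip()],
            "gender": section.get("gender"),      # optional, None if absent
            "platform": section.get("platform"),  # optional, None if absent
        }
    return parser["set"]["trait"], samples
```

Calling `load_samples(EXAMPLE_INI)` yields the trait string and one dictionary per sample; optional fields that were not supplied are simply `None`.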
On the project homepage, we also provide Perl scripts to read either a single VCF file or all VCF files contained in one directory and to directly submit filenames and sample IDs to this interface.
GrabBlur aggregates data in the following three steps:
1. Inspection of the additional information available for every patient
To prevent identification of a patient via the combination of different individual-specific informational items, these items must not be unique in the set of sample data provided to a third party. Every variant of a patient is associated with their metadata, so a unique combination of items would allow the reconstruction of that patient's genome. At least two samples must therefore have exactly the same phenotypes, the same gender information etc.
In order to generate a sufficient level of ambiguity, samples with an identical set of HPO terms are combined in classes. If any other additional information is not sufficiently ambiguous, GrabBlur blurs it by deleting e.g. the gender or the platform name.
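The inspection step can be illustrated with a small Python sketch. It is a simplified reading of the description above, not the GrabBlur implementation: the field names, the order in which fields are deleted and the function names are assumptions made for this example. Samples are compared by their complete metadata profile (HPO term set plus optional items), and optional items are deleted from samples whose profile is unique until no further deletion is possible.

```python
from collections import Counter

# Optional items that may be deleted, in an assumed order of deletion.
BLURRABLE = ("platform", "enrichment", "gender")


def profile(meta):
    """Combination of items that could single out a sample."""
    return (frozenset(meta["hpo"]),) + tuple(meta.get(f) for f in BLURRABLE)


def blur_metadata(samples):
    """Return (blurred copies of the metadata, IDs whose profile stays unique)."""
    blurred = {sid: dict(meta) for sid, meta in samples.items()}
    while True:
        counts = Counter(profile(m) for m in blurred.values())
        changed = False
        for meta in blurred.values():
            if counts[profile(meta)] > 1:
                continue  # at least one other sample looks identical
            for field in BLURRABLE:
                if meta.get(field) is not None:
                    meta[field] = None  # delete one item, then re-check
                    changed = True
                    break
        if not changed:
            break
    counts = Counter(profile(m) for m in blurred.values())
    still_unique = sorted(
        sid for sid, m in blurred.items() if counts[profile(m)] == 1
    )
    return blurred, still_unique
```

Samples reported in the second return value have an HPO term set that no other sample shares. How GrabBlur treats such samples is not described in the text, so the sketch only flags them.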
2. Fragmentation of the SNV-data
In a second step, the SNVs of a sample are divided into sub-samples of different sizes. A list linking sample IDs and sub-sample IDs is stored in an encrypted and password-protected file at the submitter site. Encryption is accomplished by means of the Blowfish algorithm of OpenSSL. Only the submitter can open this file. The list is needed, for example, to delete a sample from the database in case the patient withdraws consent.
Each SNV of a sample is randomly assigned to a sub-sample. This assignment is not uniformly distributed because otherwise any group of linked sub-samples would contain an approximately equal number of SNVs, thereby allowing reconstruction of the complete sample. Therefore SNVs are assigned to a sub-sample with a differently weighted likelihood.
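A weighted random assignment of this kind might look as follows in Python. This is a minimal sketch under stated assumptions: the number of sub-samples, the way the weights are drawn and the function name `fragment_sample` are hypothetical, and the encrypted list linking sample IDs to sub-sample IDs is not shown.

```python
import random


def fragment_sample(snvs, n_subsamples=4, rng=None):
    """Split the SNVs of one sample into sub-samples of unequal expected size."""
    if rng is None:
        rng = random.Random()
    # One random weight per sub-sample; the small offset keeps every weight
    # positive. Unequal weights lead to sub-samples of different expected size.
    weights = [rng.random() + 0.05 for _ in range(n_subsamples)]
    subsamples = [[] for _ in range(n_subsamples)]
    for snv in snvs:
        index = rng.choices(range(n_subsamples), weights=weights, k=1)[0]
        subsamples[index].append(snv)
    return subsamples
```

Every SNV ends up in exactly one sub-sample, and because the weights differ between sub-samples, the sub-sample sizes alone give little indication of which sub-samples belong together.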
3. Blurring the genotype information for rare variants
In a third step, all rare variants of a sample are aggregated by replacing the sub-sample ID of a rare SNV with the contact information of the submitting institution. Since a patient can easily be identified by singletons (i.e. SNVs that have been detected only once), these and other rare SNVs in their exome are blurred. In the aggregation step, the association between such an SNV and all sub-samples belonging to it is deleted. Only the trait and (if known) the submitting institution remain linked to the SNV. Hence, only common SNVs carry a sub-sample ID and, therefore, are associated with specific phenotype information.
Here, freq(SNV) denotes the frequency of the SNV irrespective of its genotype, and med(frq) is the median over all SNV frequencies in the sample set. We chose the median because it is robust against outliers such as, in this case, an above-average number of singletons.
The default factor of 1.5 can be modified by the submitter to obtain a lower or higher aggregation level. Usually, with the default factor of 1.5, the threshold lies between 8 and 12, so that a data set must comprise at least 8 to 12 samples in order to provide additional information other than the contact address.
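The rarity rule itself is not reproduced in the text above. One plausible reading of the definitions is that an SNV counts as rare when freq(SNV) falls below the factor times med(frq). The Python sketch below implements that reading; the strict comparison, the input layout (a mapping from SNV to the sub-sample IDs of its carriers) and the function name are assumptions made for illustration only.

```python
from statistics import median


def blur_rare_variants(carriers, contact, factor=1.5):
    """Replace the sub-sample IDs of rare SNVs by the submitter contact.

    carriers maps each SNV to the sub-sample IDs in which it was observed,
    one ID per carrying sample.
    """
    # freq(SNV): number of carrying samples, irrespective of genotype.
    freq = {snv: len(ids) for snv, ids in carriers.items()}
    # Assumed rule: threshold = factor * med(frq).
    threshold = factor * median(freq.values())
    blurred = {}
    for snv, ids in carriers.items():
        if freq[snv] < threshold:      # assumed strict comparison
            blurred[snv] = contact     # rare: only the contact remains linked
        else:
            blurred[snv] = sorted(ids)  # common: sub-sample IDs are kept
    return threshold, blurred
```

With frequencies of 1, 2 and 4 the median is 2, so the default factor gives a threshold of 3.0: the first two SNVs are linked only to the contact address, and the third keeps its sub-sample IDs.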
After registration, users can upload their data. The web front end allows the user to choose an aggregated VCF file, which must have been created beforehand using the blurring software described above. The front end sends the file to client software running on the same server, which checks the file for consistency and potential corruption and then transfers it to the database.
During the upload process, every SNV is automatically functionally annotated using our in-house software tool snpActs (http://snpacts.ikmb.uni-kiel.de). snpActs identifies whether an SNV causes a protein coding substitution and which amino acid is affected using the gene annotations from CCDS and RefSeq. The amino acid changes in all isoforms of the affected gene are classified and ranked in the following order: "nonsense" (most likely to be damaging), "readthrough", "start-lost", "splice site", "missense", "synonymous" (least likely to be damaging). To obtain more information for estimating whether an SNV is likely to be damaging, snpActs also queries the Human Gene Mutation Database (HGMD). HGMD provides a database of comprehensive, in part manually curated data on human inherited disease mutations. Since this is a commercially available database, only an identifier from the HGMD database is named in snpActs. All results of these annotations, including the highest ranked classification of the SNV, are stored in the database upon upload of the data.
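Selecting the highest ranked classification across isoforms can be sketched in a few lines of Python. The ranking order is taken from the paragraph above; the function and variable names are hypothetical and the snippet is not part of snpActs.

```python
# Ranking as described above, from most to least likely to be damaging.
RANKING = [
    "nonsense",
    "readthrough",
    "start-lost",
    "splice site",
    "missense",
    "synonymous",
]
RANK = {label: position for position, label in enumerate(RANKING)}


def highest_ranked(classifications):
    """Return the most severe classification among all isoforms of a gene.

    Raises KeyError for an unknown label and ValueError for an empty input.
    """
    return min(classifications, key=RANK.__getitem__)
```

For an SNV classified as "synonymous" in one isoform and "missense" in another, `highest_ranked` returns "missense".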
The aggregation software was written in C++ on an Ubuntu Linux system. The runtime of the aggregation increases linearly with the number of samples. The consumption of memory (RAM) increases logarithmically. On a desktop PC, a VCF file with 43,000 SNVs was aggregated in less than 3 seconds using one core (Intel Xeon 4C, 2.0 GHz). The aggregation of 50 exomes with about 40,000 - 45,000 SNVs needs approximately 128 MB RAM and 130 sec. The aggregation of 150 exomes needs about 7 minutes with approximately 350 MB RAM.
The web interface for data access has been implemented using the Django web application framework (v1.5.4) and the Python programming language (v3.2.2). It is currently running on an Ubuntu Linux Server (12.04.3 LTS). The MySQL database containing the actual GrabBlur data is located on another server (with the same configuration) and is accessed using the respective built-in modules of Django and Python.
GrabBlur is a "lightweight" tool to aggregate SNV data of thousands of samples with a specific trait or phenotype and to share the data with others via a public database. The main goal of GrabBlur, namely to keep each individual sample unidentifiable, was achieved at the cost of deleting other important information from individual exomes or genomes. For instance, all information on linked SNVs must be dropped to avoid the reconstruction of a given data set, although exactly this information is very valuable for scientific studies. For example, rare variant association analysis methods collapse rare variants into groups based upon, for example, the functional annotation of genomic regions. Whether GrabBlur can be used in such studies needs to be verified individually for each analysis method (for a review of methods, see Bansal et al.). However, GrabBlur is intended mainly to serve human geneticists who try to find more data on a variant and the phenotype of interest. The user-friendly GrabBlur web interface should inspire users to share their data and to use the tool for their own purposes. Although GrabBlur anonymizes the genetic data to a sufficient degree, a cautious user may want to use GrabBlur only behind their own firewall to handle aggregated information. While we encourage users to share their data, we also support such "internal" mirrors and provide instructions to set them up.
GrabBlur also has limitations that should not go unmentioned. For example, the system does not yet check for duplicate uploads. It is thus possible that redundant data end up in the GrabBlur database. Moreover, the quality of an uploaded SNV may not have been adequately checked. Detailed quality data, as can be generated using our previously reported tool pibase, would require that users also retrieve BAM files for their sequence data and run additional, standardized analyses. Moreover, the addition of quality scores would significantly inflate the GrabBlur database. We prefer that submitters provide their contact details so that data users can enquire about the quality of particular SNVs directly. The submitter may then go back to the raw data and use pibase, the Integrative Genomics Viewer or other tools to assess the quality of the SNV in more detail. It is also possible with GrabBlur to ask submitters for additional details on the phenotype of a patient or for a detailed re-phenotyping based on new scientific findings.
This project received infrastructure support from DFG Cluster of Excellence "Inflammation at Interfaces". We thank the TMF working group "Molecular Medicine", especially Prof. Thomas Wienker, for critical discussions on and support of the project.
The publication costs for this article were funded by the DFG Cluster of Excellence No. 306 "Inflammation at Interfaces".
This article has been published as part of BMC Genomics Volume 15 Supplement 4, 2014: SNP-SIG 2013: Identification and annotation of genetic variants in the context of structure, function, and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S4
- Majewski J, Schwartzentruber J, Lalonde E, Montpetit A, Jabado N: What can exome sequencing do for you? J Med Genet. 2011, 48: 580-9.
- DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). [http://www.genome.gov/sequencingcosts]
- Neveling K, Feenstra I, Gilissen C, Hoefsloot LH, Kamsteeg E-J, Mensenkamp AR, Rodenburg RJT, Yntema HG, Spruijt L, Vermeer S, Rinne T, van Gassen KL, Bodmer D, Lugtenberg D, de Reuver R, Buijsman W, Derks RC, Wieskamp N, van den Heuvel B, Ligtenberg MJL, Kremer H, Koolen DA, van de Warrenburg BPC, Cremers FPM, Marcelis CLM, Smeitink JAM, Wortmann SB, van Zelst-Stams WAG, Veltman JA, Brunner HG, et al: A Post-Hoc Comparison of the Utility of Sanger Sequencing and Exome Sequencing for the Diagnosis of Heterogeneous Diseases. Hum Mutat. 2013, (Denovo 281964)
- Gilissen C, Hoischen A, Brunner HG, Veltman JA: Unlocking Mendelian disease using exome sequencing. Genome Biol. 2011, 12: 228.
- Stenson PD, Mort M, Ball EV, Shaw K, Phillips AD, Cooper DN: The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet. 2013.
- Exome Variant Server. [http://evs.gs.washington.edu/EVS/]
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R: The variant call format and VCFtools. Bioinformatics. 2011, 27: 2156-8.
- Robinson PN, Mundlos S: The human phenotype ontology. Clin Genet. 2010, 77: 525-34.
- Blowfish. [https://www.openssl.org/docs/crypto/blowfish.html]
- Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner M-M, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, et al: The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009, 19: 1316-23.
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005, 33 (Database): D501-4.
- The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-73.
- Django: A Web framework for the Python programming language. Django Software Foundation, Lawrence, Kansas, USA, [http://www.djangoproject.com]
- The Python Language Reference. [http://docs.python.org/py3k/reference/index.html]
- Bansal V, Libiger O, Torkamani A, Schork NJ: Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet. 2010, 11: 773-85.
- Forster M, Forster P, Elsharawy A, Hemmrich G, Kreck B, Wittig M, Thomsen I, Stade B, Barann M, Ellinghaus D, Petersen B-S, May S, Melum E, Schilhabel MB, Keller A, Schreiber S, Rosenstiel P, Franke A: From next-generation sequencing alignments to accurate comparison and validation of single-nucleotide variants: the pibase software. Nucleic Acids Res. 2013, 41: e16.
- Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nat Biotechnol. 2011, 29: 24-6.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.