Privacy-preserving storage of sequenced genomic data

Background The current and future applications of genomic data may raise ethical and privacy concerns. Processing and storing of this data introduce a risk of abuse by potential offenders since the human genome contains sensitive personal information. For this reason, we have developed a privacy-preserving method, named Varlock providing secure storage of sequenced genomic data. We used a public set of population allele frequencies to mask the personal alleles detected in genomic reads. Each personal allele described by the public set is masked by a randomly selected population allele with respect to its frequency. Masked alleles are preserved in an encrypted confidential file that can be shared in whole or in part using public-key cryptography. Results Our method masked the personal variants and introduced new variants detected in a personal masked genome. Alternative alleles with lower population frequency were masked and introduced more often. We performed a joint PCA analysis of personal and masked VCFs, showing that the VCFs between the two groups cannot be trivially mapped. Moreover, the method is reversible and personal alleles in specific genomic regions can be unmasked on demand. Conclusion Our method masks personal alleles within genomic reads while preserving valuable non-sensitive properties of sequenced DNA fragments for further research. Personal alleles in the desired genomic regions may be restored and shared with patients, clinics, and researchers. We suggest that the method can provide an additional security layer for storing and sharing of the raw aligned reads. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07996-2.


reference index
In the case of an SNV, it is the index of the reference allele in A, T, G, C list. In case of an INDEL, it is the index of the listed allele.

allele mapping
Listed SNV alleles correspond to A, T, G, C bases respectively. Each listed INDEL allele has an index pointing to a target allele from the list.  When two different personal alleles are replaced by two identical masking alleles as part of the masking process described later (masking from heterozygous to homozygous position), information necessary to reverse this operation is lost. It is impossible to infer which particular alignment with the masked allele was the carrier of which personal allele from the original pair. This problem is resolved by keeping one of the replaced personal alleles as a part of a BDIFF record together with the list of identifiers of alignments associated with this allele. The other replaced personal allele is mapped to a masking allele as usual.
In addition, the BDIFF file needs to keep deleted base qualities associated with replaced alleles. Base qualities are deleted only when the longer allele is replaced with shorter allele, which is a case of INDEL masking. Deleted quality sequences are stored in another field of INDEL record as a list sorted by the genomic position of the corresponding alignment.

BDIFF encryption and storage
The checksum of the mapped reads and checksum of the VOF file is stored along with masked mapped reads for later verification (Figure 1). After masked mapped reads are complete, their checksum is added to the header of the encrypted BDIFF file. BDIFF file header contains the exact range that the masking covers -effective range. It is necessary because a BDIFF file does not have to contain records with genomic positions exactly at the start and the end of a specific range. By default, the effective range covers the whole genome. An owner of a BDIFF file can specify a subrange of an effective range to produce a new smaller BDIFF. This process is called dissemination and is explained later. Effective range is always greater or equal to a range defined by the first and last BDIFF record. The secret key for encryption of unmapped reads and checksum of masked mapped reads are also stored within the BDIFF header.
BDIFF contains all of the information necessary to unmask personal alleles within masked mapped reads, hence it is never stored as plain text. Content of the BDIFF file is encrypted using AES encryption with a randomly generated key. The AES key itself is encrypted by an RSA public key, provided by the user, and stored as a part of the encrypted BDIFF file.
In this way, access to the personal alleles is restricted to the owner of the private key paired with the public key used for encryption. Finally, the plain BDIFF file is signed with a provided private key. The signature is stored as a part of an encrypted BDIFF file and verified at the start of a decryption process using a public key paired with a signing private key.

Masking
If the masking process alters any of the alignments covering the position of a variant, the mapping from personal alleles to masking alleles is stored in the BDIFF format. After all positions of the variants on an alignment, described by population allele frequencies, and all preceding alignments are treated, the alignment is stored as a masked mapped read.
When all positions of variants are processed, remaining alignments, although unchanged, are stored as well. Since potential alleles in unmapped reads are not masked in the process, they are completely encrypted by random nucleotides. In addition, random masking can be further employed to increase personal privacy by obfuscation. We describe these two methods hereinafter.

SNVs
An allele of a single nucleotide variant (SNV) is one of four DNA bases; therefore, the number of possible SNV alleles is always four. Accordingly, the probability matrix of SNV allele pairs has size 4x4, where the probability of each pair is the product of their population frequencies.
Unknown allele (typically denoted as N in sequence files) is mapped to itself; thus, it is always preserved.

INDELs
In case of an insertion or a deletion (INDEL), the number of possible alleles depends on the number of different alleles found within alignments at the position of a variant. In order to find an actual INDEL allele within a particular alignment, population alleles defined by VOF records are iterated from the longest to the shortest one. In each iteration, the CIGAR string of the current allele is inferred from the difference between its length and the length of reference allele. Length of the shorter allele from the pair is considered to be a number of CIGAR match operations. The difference between the two lengths is either positive or negative, denoting the number of CIGAR insertions or the number of CIGAR deletions, respectively. The computed CIGAR string of the current allele is compared with the corresponding portion of CIGAR string describing the alignment. Likewise, the nucleotide sequence of the allele is compared with the corresponding subsequence of the alignment.
If both sequences and CIGAR strings match, the actual allele is found, and iteration is stopped.
The probability matrix of INDEL allele pairs is created from the found alleles, where probabilities are determined as in the SNV case. A personal allele is replaced with a masking allele affecting both nucleotide sequence and CIGAR string. The alignments without any detected personal allele remain unchanged.

Unmapped reads
Unmapped reads are encrypted completely using stream cypher encryption which produces a cypher with the same size as the input. At first, a secret key is randomly generated for all unmapped reads. This key is stored within the BDIFF file header. When an unmapped read is found, its template name and the secret key are hashed by SHA algorithm producing 512 bits long hash. The hash is then used to encrypt the sequence of the read. Every two bits of the hash are used to encrypt one DNA base, also encoded by 2 bits, from the input sequence using a simple XOR operation. Consequently, the key size is enough to encrypt a sequence of 256 bases uniquely. If the sequence is longer, the key is repeated. Unknown bases, represented by letter N, are skipped in the encryption.

Random masking
The provided VOF file and the masked mapped reads are considered public; therefore, everybody can tell which positions on a genome could be masked. As a consequence, rare variants not covered by the VOF file can be still abused by an adversary to infer personal data. This vulnerability is mitigated by the introduction of random, artificial SNV alleles into masked mapped reads by generating additional random VOF records before the masking process. Each generated VOF record has a random genomic position and contains allele counts representing approximate ratios of alleles in the human genome. Both generated and file contained VOF records are iterated together and processed in the same way. As a result, generated VOF records have a chance to mask or introduce novel variants in the same way as a population based record. The number of new variants should be high enough to disallow attacks in-between the variants from the VOF file. On the other hand, the size of BDIFF file and time cost of all operations linearly increases with the increasing number of variants.

CIGAR string and sequencing quality
Mapped reads do not contain only nucleotide sequences, but also other sensitive data that we process. When making a modification to alignment, the CIGAR string is modified accordingly; otherwise, it would be easy to guess the nature of an original alignment.
Moreover, mapped alignment typically contains sequence qualities that express confidence in each base. While a masking SNV allele does not change the length of the alignment, a masking INDEL allele often does, so it is necessary to adjust the length of a sequencing quality string to match the altered alignment. If the masked alignment is longer than the original one, the masking method provides artificial qualities to fill the gap. On the other hand, if the masked alignment is shorter than the original one, sequencing qualities are deleted and stored within a BDIFF file to keep the masking method reversible.

Masking effect
We examined the masking effect on personal alleles on non-Finnish European (NFE), African (AFR), and South Asian (SAS) populations separately with corresponding population allele frequencies (Figure 2-4). For this purpose we selected samples from the third phase of the 1000 Genomes Project mapped to the GRCh38 reference genome (https://www.internationalgenome.org/data-portal/sample). We provide the list of these samples in the Additional file 2.
Masked samples stay within the same population space and are shuffled within a main cluster. In the case of the SAS population (Figure 4), there is only a negligible shift from personal to masked samples, since its size is 1,526 genomes only. As can bee seen this number is not large enough for an effective masking and the masking effectiveness depends on the comprehensiveness of population allele frequencies. In comparison, NFE and AFR populations have 32,299 and 21,042 genomes respectively.
Moreover, we masked all the three populations of samples with all the The Genome Aggregation Database version 3 (GnomAD v3) population allele frequencies (https://gnomad.broadinstitute.org/downloads), causing shift towards a common area ( Figure 5). However, most masked samples do not overlap with masked samples from a different population, hence this approach can not be used for masking the population of a sample origin with current size of the gnomAD population allele frequencies.