 Research
 Open Access
 Published:
Secure tumor classification by shallow neural network using homomorphic encryption
BMC Genomics volume 23, Article number: 284 (2022)
Abstract
Background
Disclosure of patients’ genetic information in the process of applying machine learning techniques for tumor classification hinders the privacy of personal information. Homomorphic Encryption (HE), which supports operations between encrypted data, can be used as one of the tools to perform such computation without information leakage, but it brings great challenges for directly applying general machine learning algorithms due to the limitations of operations supported by HE. In particular, nonpolynomial activation functions, including softmax functions, are difficult to implement with HE and require a suitable approximation method to minimize the loss of accuracy. In the secure genome analysis competition called iDASH 2020, it is presented as a competition task that a multilabel tumor classification method that predicts the class of samples based on genetic information using HE.
Methods
We develop a secure multilabel tumor classification method using HE to ensure privacy during all the computations of the model inference process. Our solution is based on a 1layer neural network with the softmax activation function model and uses the approximate HE scheme. We present an approximation method that enables softmax activation in the model using HE and a technique for efficiently encoding data to reduce computational costs. In addition, we propose a HEfriendly data filtering method to reduce the size of largescale genetic data.
Results
We aim to analyze the dataset from The Cancer Genome Atlas (TCGA) dataset, which consists of 3,622 samples from 11 types of cancers, genetic features from 25,128 genes. Our preprocessing method reduces the number of genes to 4,096 or less and achieves a microAUC value of 0.9882 (85% accuracy) with a 1layer shallow neural network. Using our model, we successfully compute the tumor classification inference steps on the encrypted test data in 3.75 minutes. As a result of exceptionally high microAUC values, our solution was awarded cofirst place in iDASH 2020 Track 1: “Secure multilabel Tumor classification using Homomorphic Encryption”.
Conclusions
Our solution is the first result of implementing a neural network model with softmax activation using HE. Also, HE optimization methods presented in this work enable machine learning implementation using HE or other challenging HE applications.
Background
Cancer is a disease caused by the unlimited proliferation of certain cells in the human body, and the exact cause of cancer is not yet known. In 2020 alone, 19.3 million new cases were reported worldwide, and 10 million people died because of cancer [1]. For this reason, the prediction of cancer through genetic data analysis has been regarded as one of the most important tasks since early treatment can reduce the lethal effects of cancer on the human body.
Machine learning (ML) is one of the fields of artificial intelligence that learns the process of finding solutions on its own without human assistance to a given problem. Due to the difficulty in the process of diagnosing tumors through genes, tumor classification using MLbased on large amounts of genetic data contributes to decision making to diagnose and treat cancer. Some cancer genome studies using ML techniques on largescale data have shown the relationship between genetic modification and specific cancer types [2–5].
Since genetic data contains a lot of personal information and cannot be discarded or changed even if it is leaked, it is essential to protect the privacy of information about genetic data in the data analysis using ML. Various techniques, including differential privacy [6, 7] or multiparty computation [8], have been used to ensure privacy in the data analysis process. However, each of these approaches has the disadvantage of losing accuracy in the process of anonymizing the data or requiring multiple phases in the process of data sharing.
Homomorphic Encryption (HE) is a cryptographic scheme that enables us to perform arithmetic operations between encrypted data without decryption. HE has been considered one of the useful applications for privacypreserving ML, as it allows the computation of desired operations without disclosing information about the data [9, 10]. However, most HE libraries [11–14] mainly support only addition and multiplication of arithmetic operations. Although linear operation in ML, such as matrixvector multiplication, can be easily computed by using a suitable data packing method [15], many activation functions widely used in neural networks, such as sigmoid, ReLU, or softmax functions, cannot be directly operated for encrypted data. Most of the applications using HE overcomes such problems by approximating nonarithmetic operations with a polynomial that minimizes the error in a certain interval. The main problem with applying HE is to minimize computational increases occurring during the polynomial approximation process while limiting errors.
Integrating Data for Analysis, Anonymization and SHaring (iDASH) has held an annual secure genome analysis competition since 2014. Each year, important topics in the field of genetic analysis are selected to compete for the most effective solution. The problem of multilabel tumor classification using HE was one of the three tasks for the 2020 iDASH competition. Given the dataset with a total of 2,713 patients and their 25,128 genes, participants had to preprocess the given data, train the ML model with plain data, and obtain the highest microAUC score [16] within 5 minutes when the inference step was performed with encrypted test data of 909 patients.
Related works
Integrated analysis, which means integrating and classifying different types of data for samples in the same cohort, was developed with the emergence of ML techniques. Now various ML techniques are used in multilevel omics data integration as reviewed in [17–22] and classified depending on the learning method, data integration method, and feature selection method: supervised and unsupervised learning; horizontal and vertical data integration; supervised and unsupervised feature selection.
In the field of cancer type classification based on somatic mutations, many ML techniques are used to build suitable multilabel classifiers. Classifiers using unsupervised learning method such as cluster analysis exists [23]; however, supervised learningbased classifiers are more in our interest because accuracy can be increased substantially with the labeled data. Chen et al. [24] used a supervised learning technique named Support Vector Machine (SVM) to classify cancer types of given somatic mutation samples.
Yuan et al. [4, 25] proposed DeepGene and DeepCNA, which are multilabel cancer classifiers based on Deep Neural Networks (DNN) and Convolutional Neural Networks (CNN), respectively. In particular, DeepGene uses a feature selection technique called Clustered Gene Filtering (CGF) based on cluster analysis. Sun et al. [26] also used DNN model with 5 layers and reached 70.1% classification accuracy. Some classifiers use multiple ML techniques and ensemble them to reach higher accuracy. Lee et al. [27] introduced a classifier named CPEM, an ensemble of two ML techniques, Random Forest (RF) and DNN, and reached 84.7% accuracy. However, these previous classification techniques do not consider privacy protection and sensitive genetic data leaks to the untrusted classifier owner, unless the classifier owner gives his whole model to the client.
As a solution of privacypreserving ML for genetic data, ML over encrypted data with HE is drawing attention with the annual iDASH competition [9, 10, 28–35]. Kim et al. introduced a privacypreserving logistic regression model over HE [28]. The model uses modified Nesterov’s accelerated gradient descent method to reduce the number of iterations, since the depth of the homomorphic circuit highly affects to the computational cost. The classifier was selected as the best solution of Track 3 at iDASH competition 2017, where similar approaches were also proposed [9, 29, 30].
Some parallelized versions submitted for the second track of iDASH 2018 competition [10, 31–34]. In 2019, the HE task of iDASH competition was a secure genotype imputation using HE. The solution showed “ultrafast” HE models to verify genotype imputation that took less than 10 seconds for evaluation with only a 2–3% decrease in accuracy [35]. The authors also have confirmed that similar results can be obtained for various HE libraries, including BFV [36], CKKS, and TFHE [37].
Difficulty of ML using HE
Some limitations of computation using HE make the general ML techniques not directly applicable to the ML over HE since they are yet insecure nor impractical for some reasons. First, as mentioned above, most HE libraries [11–14] mainly support only addition and multiplication of arithmetic operations. Thus, HE requires a polynomial approximation of nonpolynomial functions such as ReLU, Sigmoid, softmax, and even division and comparison and those greatly amplify the amount of computation over HE. However, the supervised ML models [4, 25–27] generally have large circuit depths for high accuracy and use nonpolynomial functions such as ReLU or maxpooling, which have high time complexity when implemented with HE.
To overcome the above problems, the ML over HE is being studied in various ways, such as replacing a nonpolynomial such as sigmoid in logistic regression with a simple polynomial [28], or finding an algorithm consisting of only polynomial operations although it is less efficient than stateoftheart algorithms involving nonpolynomials [9]. In our case, we propose the efficient polynomial approximation of softmax function and our algorithm with HE that operates within a reasonable time.
Secondly, preprocessing methods included in the classifiers can leak personal information. To prevent leakage, preprocessing should be done by the client herself or the server in an encrypted state. However, clients with low computational capabilities cannot follow heavy preprocessing methods [24, 27]. Also in the case of servers, they should compute only with the data encrypted with HE, so the preprocessing will be very impractical due to logical/nonpolynomial functions in the preprocessing as in [4, 24–27] when using the CKKS [14] HE scheme. Therefore, the classifiers with HEfriendly models and light and secure preprocessing are important for privacypreserving classification.
In addition, if preprocessing method can be easily done by the server using above properties, there is another advantage to the client that does not have to compute preprocessing. In this case, the data can be used for various training models and so the client only needs to encrypts the original data and share them.
Table 1 summarizes the previous multilabel tumor type classifications and their weaknesses when applied to the privacypreserving classification scenario. HEfriendly Model indicates the applicability of the model over HE, and Client and Server (HE) indicate the hardness of secure preprocessing as mentioned above.
Scenario and security model
The scenario in this paper follows the model presented in the HE task of iDASH 2020, which is also commonly considered in standard HE application models. In our scenario, two parties are involved in: the private data owner and the service provider, and they are simply denoted by client and server, respectively. The server trains the model using his data(or public largescale data) and wants to provide a service to generate inference results, the prediction of the tumor type, from the client’s encrypted data. The client has limited computational power and wants to outsource the tumor type prediction; however, she does not want to leak her sensitive data. In addition, the client’s data needs to be preprocessed to fit the model computation while maintaining privacy in two possible ways: by the client with limited computation power or by the server in the encrypted state. Therefore, in our scenario, the server should not be aware of the client’s data when preprocessing as well as computing the inference results.
The detailed explanation of our scenario and method is illustrated in Fig. 1. Our protocol can be performed by using HE while ensuring client’s data privacy. Let m be the client’s private data and f be a function that computed the whole process of the model generated by the server. First, the client encrypts m by her own secret key and sends the ciphertext Enc(m) to server. Then, server evaluates his model using HE operations, so that he obtains the encrypted evaluation output Enc(f(m)) and sends it back to the client. Finally, the client performs HE decryption for Enc(f(m)) with the secret key to get the desired output f(m). Since the server can only access encrypted data, our protocol can achieve the desired security even on malicious servers.
In our implementation, all ciphertexts have a 128bit security^{Footnote 1} level regardless of the remained number of operations. This means that the ciphertext will not leak any information in a lifetime even if all the computing power in the world is used. Our ciphertext is based on the RingLearning with Error (LWE) problem which is attracting attention as it seems secure although quantum computers are developed. And we set the HE parameters to achieve 128bit security estimated by Albrecht’s LWE estimator[38].
Goal of this work
Our goal is to perform privacyprotecting machine learning computations on encrypted tumor datasets. In general, there are two main obstacles to achieving this goal. The first point is the size of dataset. Most HE schemes that support multiencoding are able to contain at most 2^{16} or less number of data in a ciphertext with a practical parameter. Since typical genomic dataset consists of largescale matrix with millions of elements, a large number of ciphertexts are required to encrypt data and a huge computational overhead occurs in the operation. Therefore, a preprocessing process that reduces the size while maintaining the properties of the initial data is essential, and because of the security problem, this method should be performed in a HEfriendly manner.
The second point is that the machine learning model should be performed using HE. Since the operations allowed in HE is limited, especially addition and multiplication, most of deep and complex models cannot be implemented using HE in practical time now. Even in simple cases such as singlelayer neural networks, since most activation functions use nonarithmetic operations such as comparison (ReLU) or exponential functions (sigmoid, softmax), additional time consuming and accuracy loss occurs in the process of computing approximation of these functions with polynomials.
Results
Summary of results
In this paper, we propose a privacypreserving multilabel classifier using a shallow neural network with a softmax activation function based on HE, which is also an outstanding solution of the first track of the iDASH 2020 competition. Our method is a supervised parallel integration method that uses unsupervised feature selection (cluster analysis) which provides the scalability on the parameter depending on the computational bound of HE. Our ideas can be categorized into three main subjects. First, we suggest data filtering method to reduce the size of the large raw data. This method is suitable for our scenario using HE because the data owner does not require additional operations other than filtering before encryption. Second, we modify the data packing method for matrixvector multiplication [15] by duplicating the data in encryption to make a tradeoff between the number of ciphertexts for packing and the number of rotations to reduce computational cost for HE. In multiplication between encrypted matrix and plain vector, the number of rotations occupies the most time cost, so we minimize the total time by choosing an appropriate number of copies in the encryption step. Lastly, we use elementary exponential approximation method to evaluate the microAUC value. We minimize the loss of the approximated microAUC from the real value not by focusing on minimizing the error in the approximation of the exponential function, but by using an approximation that shares properties with the exponential function.
Dataset description
Our scenario focuses on the dataset originated from The Cancer Genome Atlas (TCGA) database, which is widely used in genomic research. The data of the somatic Single Nucleotide Variation (SNV) and genelevel Copy Number Variation (CNV) information for TCGA samples are downloaded from publicly available datasets [39] and [40], respectively. The training and testing data are generated based on the sample metadata downloaded from the TCGA project (available on 8/13/2020) and extracted the cases for all available cancer types. The samples first are filtered for the ones that exist in both SNV and CNV datasets. For each of the remaining cancer types, the data were randomly divided into training and testing datasets with 75% and 25% of all samples for the corresponding cancer type, respectively. Cancer types with less than 100 training samples are filtered out, and the resulting dataset consists of cancers from 11 sites: Bladder, Breast, Bronchus and Lung, Cervix uteri, Colon, Corpus uteri, Kidney, Liver and Intrahepatic bile ducts, Ovary, Skin, and Stomach.
The dataset consists of total 3,622 samples (2,713 for train, 909 for test) and 25,128 genes, and each sample has one cancer type out of 11 types of cancer, consists of two types of data: Copy Number (CN) data and Variants data. The dataset is available in [41] which is generated in the same way with iDASH 2020 competition track I and the details of dataset including the number of samples and the number of mutation’s effects for each feature are disclosed in Table 2.
CN data consists of the copy number of the genes. The copy number data show the copy number variations (CNVs) of the genes as numbers, representing the state of the gene on the corresponding sample, whether the gene has duplication or deletion of a considerable number of base pairs. The data consist of 5 different levels 0, ±1,±2 where the negative and positive values represent deletion and duplication of the corresponding gene from the sample, respectively. Copy number with 0 means that the gene has no considerable variation on the corresponding gene from the sample.
Variants data consists of mutation data of selected pairs of samples and genes for each tumor type with various features: gene’s location on chromosomes, mutation type, whether the mutation is Single Nucleotide Polymorphism (SNP), and mutation’s effects separately predicted by two different methods.
Neural network model and parameter selection
For the training step, we first downsize each sample by using our proposed filtering algorithms, and used it for the training of our neural network model. More precisely, we feed the downsized samples into our shallow neural network model, which consists of one hidden layer with 64 nodes and linear activation function and output layer with 11 nodes. During the training step, we used batch size of 32, number of epochs of 50 and dropout rate of 0.9.
To suitably use the trained model in downstream tasks over encrypted data, we varied (d_{cn},k_{var}), the parameters for CN and variants dataset respectively, until the size of downsized data are less or equal to 2^{B} for each B from 9 to 12. In our algorithm, the size of model is determined by the size of filtered data under two parameters. For each B, we seek the best pair of parameters, d_{cn} among {0,0.01,⋯,0.24} and k_{var} among {0,10,⋯,590}; more precisely, we use 10fold cross validation to find the (d_{cn},k_{var}) pair with best microAUC such that the size of model is less or equal to 2^{B}. After selecting the parameters, d_{cn} and k_{var}, we train the model on the entire training dataset, and feed it into the inference step over encrypted data.
For the inference step over encrypted data, we adopt HEaaN library, the implementation of CKKS scheme [14]. The CKKS parameters are chosen by the ring dimension 2^{17} and the ciphertext modulus 2^{2670}. We used signed binary secret, which satisfy more than 128bit security according to Albrecht’s LWE estimator [38]. For the scal ing factor of CKKS, we choose the scaling factor by 2^{60} for ciphertexts and 2^{40} to encode plain vectors for constant multiplication or masking vectors.
For the approximation of Softmax, we use Goldschmidt algorithm with M=80 and d=30 for the inversion, and for the exponential function, we use (r,L)=(4,32) for B=9,10 and (4,64) for B=11,12, respectively. All experiments were performed on Intel Xeon CPU E52620v4 at 2.10GHz processor and used 8 threads. The detailed description of CKKS parameters is stated in Methods section and we refer Algorithm 4 for the parameters used in our algorithms.
Our codes for the training step can be found in https://github.com/jaihyunp/iDash2020, and those for the inference step can be found in docker repository swanhong/idash2020.
Experimental results
Training model with plain dataset
Gradually changing two parameters, our results are presented in Figs. 2 and 3. Figure 4 visualizes how (d_{cn},k_{var}) pair determines the size of model. The model tends to show the best performance in terms of microAUC on d_{cn}≈0.08, and larger k_{var} tends to show a better performance. However, to optimize the computational cost on the privacypreserving inference step based on homomorphic encryption, we seek the best parameters (d_{cn} and k_{var}) for the model of size less or equal to 2^{B} for each B=9,10,11, and 12.
The best (d_{cn},k_{var}) pair for each size of model, 2^{B}, and its performance on 10fold cross validation is given in Table 3.
Encrypted inference results
Using filtered gene in preprocessing step and the model built in training step, we compute inference step over encrypted data as stated in previous sections. As a result, we estimate the time cost for each roundtrip step including encryption, constant multiplication, rotate and summation, Approximate softmax evaluation, and decryption step. The results are stated in Table 4. In the table, the column Encoding Duplication means the parameter m that we used as the number of duplication in encryption. As m increases, the number of required ciphertexts increases linearly so the time cost for encryption also similarly increases and the number of rotations decreases as \(O\left (\frac {1}{m}\log \frac {1}{m}\right)\) scale. With such tradeoffs, denoting the bound of the number of genes by 2^{B}, we notice that the time cost is minimized when m=4 for B=9,m=2 for B=10,11, and m=4 for B=12, respectively. We illustrate the scalability of time cost from m in Fig. 5 for B=10. Moreover, we state the final microAUc value and accuracy for each B in the table. In summary, we get about 0.988 microAUC and 85% accuracy for test dataset, which is the relatively better result compared to other works using the dataset from same data source [4, 26].
Discussion
Our work in this paper was awarded fist place (with other teams including Desilo, Inpher and SamsungSDS) in the HE track of iDASH 2020 competition. Unlike other award winners who obtained relatively low microAUC values (close to 0.95) by performing linear activation within a very short time, our result shows a high microAUC value by proposing the only method for applying softmax activation within a practical time. In addition, our preprocessing method has scalability, making it possible to obtain a higher microAUC value through additional time consumption, which is different from other teams.
The main purpose of our data preprocessing is to remove irrelevant genes in the data. The genes removed have a similar effect to the remaining genes, or are have a weak effect on the cancer type classification. CN data filtering extracts the genes with the hamming distance of the samples’ copy number, based on the fact that the adjacent genes have copy number similarity. This filtering technique can be applied to other types of the genetic dataset with an appropriate threshold d_{cn} for each dataset. Variants data filtering can be also applied to other types of datasets that have a lot of empty information. Since various genetic datasets are ‘emptysparse’, not sparse with lots of 0s but no information, our filtering technique can be used to efficiently extract the relevant features. However, since our method focused only on the filtering method to select meaningful genes, it cna be expected that the high microAUC values can be obtained through other methodologies that perform simple operations. In future studies, combining our method with other HEfriendly preprocessing methods [4, 26] would be able to show better accuracy using the same dataset.
To our best knowledge, our softmax approximation algorithm is the first approach to use the softmax activation function for the neural network implemented with HE. However, the size of the input data must be adjusted through normalization, and it should be preceded that the process of predicting the range of the maximum and minimum values of the matrix product XW as a result of the train data to compute the approximation of exponential function and the Goldschmidt algorithm. In addition, our algorithm may not be suitable for computing deep neural network models with multiple softmax activation functions for a limited time, as it consumes a lot of HE depth in approximating the softmax. In further research, we expect that other popular activation functions, such as sigmoid or ReLU, can be combined with our neural network model to improve the final score.
Conclusions
In this paper, we propose the first result of privacypreserving multilabel classification for tumor data using neural network with softmax activation. To enable implementation using HE within a practical time, we suggest filtering method to reduce the size of genes from 25,128 to prefixed bound 2^{B} where B is in {9,10,11,12}. For encoding filtered dataset to ciphertexts, we suggest the duplicating method that encodes same data multiple time, which decreases the time cost for matrixvector multiplication for evaluating neural network model and provides a time cost tradeoff between encryption step and message rotation step. Also, our approximation method for the softmax function enables to apply the softmax activation function for shallow neural network model. As a result, we obtain inference results with microAUC values of about 0.988 to classify multilabel tumor data in 5 minutes. If the time given for the inference step is not limited, our preprocessing and approximation methods can be used as building blocks for general deep neural networks.
Methods
Notations
All logarithms are base 2 unless otherwise indicated. The vectors are denoted with upper arrow. We denote an entry of the vector by using the same character with index. We denote Hadamard multipcation between two vectors by \(\vec {a} \odot \vec {b}\). Also, for simplicity, we denote the [x]_{n} be the number in {1,⋯,n} satisfying [x]_{n}=x mod n.
For a matrix A, A(i,j) means the i th row and j th column element of A. Also, we denote the submatrix of A with i∈I th rows and j∈J th columns from A by A[I,J]. In this case, the colon(:) indicates the whole index set.
Approximate homomorphic encryption
Since HE enables anaylsis of encrypted data while preserving the privacy of message from operators, it is considered as one of the beneficial tool to delegate operations that requires sensitive data without revealing any information. Unlike other popular cryptographic tools including multiparty computation, which require protocol participants to continuously interact to each other through the scenario, HE has the advantage that no additional actions or online processes are required for the message owners after they encrypts and transmits data to the operator. Through these characteristics, HE has been exploited in various field such as ML that requires computation using sensitive information such as genomic data [35, 42, 43] or financial data [44]. Additionally, HE can also play an important role in the protection of data in the computation process in applications that require computations between real data, such as ML [45] or cyber physics system [46].
Since Gentry firstly suggested in his blueprint [47] in 2009, a number of HE schemes have been proposed to achieve useful properties for applications. Each scheme has advantages in operations in a particular message space, such as finite field operations [12, 36] or boolean circuit [37]. However, many wellknown deep learning algorithms [48, 49] cannot be directly implemented because of some limitations of HE. First, since most HE methods only support multiplication operations less than fixed number of depths, it is difficult to implement a method using a large number of layers in a neural network. Therefore, most HE applications focus on implementing shallow neural networks or logistic regression. Although it is possible to recover the depth of the ciphertext through an operation called bootstrapping, it requires a very high computational overhead compared to other basic operations. Second, each scheme does not support both common arithmetic operations and binary(or logistic) operations at the same time. While many ML algorithms require both operations such as matrixvector multiplication or ReLU function, the application of the HE scheme requires the way to efficiently perform unfavorable operations.
The approximate HE scheme, namely as CKKS scheme, is proposed by Cheon et al. [14]. The main feature of approximate HE scheme is that it deals with operations in complex numbers \(\mathbb {C}\) and it supports fixedpoint arithmetic operations between encrypted data. It also supports approximate arithmetic, which considers the noise of the ciphertext as part of the message to increase the efficiency of the operation. Since most ML algorithms mainly use fixedpoint operations on real data or noisefriendly algorithms such as gradient descent, CKKS scheme takes advantages of the most of ML applications compared to other HE schemes.
We remark that the word approximate does not mean that the homomorphic operations contain large errors and result in data loss, but rather that the very small errors are allowed in the message to increases efficiency of the operations in the scheme. In machine learning applications, such errors does not ruin the message as the scheme guarantees a sufficiently large precision if practical parameters are used.
For the rest of this section, we formally describe CKKS scheme. Let L be a level parameter that a fresh ciphertext is equipped with a modulus q_{L}=(2^{Δ})^{L}, and q_{ℓ}:=(2^{Δ})^{ℓ} for 1≤ℓ≤L for some scaling factor Δ. Let \(R:=\mathbb {Z}[X]/(X^{N}+1)\) be a cyclotomic ring for a poweroftwo N and R_{q} be a moduloq quotient ring of R, i.e., R_{q}=R/qR. The distribution χ_{enc} and χ_{err} denote the discrete Gaussian distribution with some fixed standard deviation. The distribution χ_{key} outputs a polynomial of {−1,0,1}coefficient. We denote the rounding function ⌊·⌉ and moduloq operation [·]_{q}.
CKKS scheme uses a plaintext vector \(\vec {m}\in \mathbb {C}^{N/2}\) and provides enrtywise operation, called SingleInstrumentMultipleData (SIMD) operation, such as addition, substitution, and Hadamard multiplication between vectors. To encrypt complex value, CKKS uses a field isomorphism \(\tau : \mathbb {R}[X]/(X^{N}+1)\rightarrow \mathbb {C}^{N/2}\) called canonical embedding.

KeyGen(params).

Sample s←χ_{key} and Set the secret key as sk=(1,s).

Sample \(a\leftarrow U(R_{q_{L}})\) and e←χ_{err}. Set the public key as \(\mathsf {pk} = (b,a)\in R_{q_{L}}^{2}\) where \(b=[a\cdot s + e]_{q_{L}}\).


Enc_{pk}(m). Given a message m∈R, sample v←χ_{enc} and e_{0},e_{1}←χ_{err}. Output the ciphertext \(\phantom {\dot {i}\!}\mathsf {ct} =[v\cdot \mathsf {pk} + (\mathsf {m}+e_{0},e_{1})]_{q_{L}}\).

Dec_{sk}(ct). Given a ciphertext \(\mathsf {ct}\in R_{q_{\ell }}^{2}\), output m^{′}=〈ct,sk〉.

Add/Sub(ct,ct^{′}). Given two ciphertext \(\mathsf {ct}, \mathsf {ct}' \in R_{q_{\ell }}^{2}\), output the ciphertext \(\phantom {\dot {i}\!}\mathsf {ct}_{add}/\mathsf {ct}_{sub} = [\mathsf {ct} \pm \mathsf {ct}']_{q_{\ell }}\) encrypting a plaintext vector \(\vec {m}_{1} \pm \vec {m}_{2}\).

Mult_{evk}(ct,ct ^{′}). Given two ciphertexts \(\mathsf {ct}, \mathsf {ct}' \in R_{q_{\ell }}^{2}\), output a leveldowned ciphertext \(\mathsf {ct}_{mult}\in R_{q_{\ell 1}}^{2}\) encrypting a plaintext vector \(\vec {m}_{1}\odot \vec {m}_{2}\).

Rot_{rk}(ct;r). For a ciphertext ct encrypting a plaintext vector \(\vec {m} = (m_{1},\cdots,m_{n})\), output a ciphertext ct^{′} encrypting a plaintext vector \(\vec {m}' = (m_{r+1},\cdots,m_{n},m_{1},\cdots,m_{r})\) which is the (left) rotated plaintext vector of ct by r positions.
For the remind of the paper, we may denote the operations between ciphertexts or ciphertext and plain vector by common symbols, such as Add(ct_{1},ct_{2})=ct_{1}+ct_{2} or \(\mathsf {CMult}(\mathsf {ct}, \vec {c}) = \mathsf {ct} \cdot \vec {c}\) for simplicity.
Data preprocessing
Given more than 25,000 genes, meaningful gene selection is essential for efficient ML using HE. Therefore, we introduce a new data preprocessing technique consisting of a feature selection method modified from [4] to obtain significant genes useful for tumor prediction. Our data preprocessing method consists of two filtering algorithms that extract specific features from two different data types. The resulting filtered data matrix is concatenated as shown in Fig. 6(c) and inserted into our neural network.
In the rest of paper, we denote the number of samples and genes from original data by s and G and the number of types of tumor by T. In our dataset, these notations contain values s=2,713(train) or 909(test), G=25,128 and T=11. Using the abstract notation, the CN and Variants data can be understood as a matrix of the size s×G and the T number of set of information. Since the number of genes G is too large to handle with HE efficiently, we propose the preprocessing method which is easy to compute both clientside and serverside in our scenario to reduce the number of filtered gene less than fixed bound.
CN data filtering
Our main approach for CN data filtering is to filter out the irrelevant genes with cluster analysis based on the similarity of the neighboring genes. Modified from CGF in [4], we introduce a new cluster analysis using hamming distance instead of Jaccard distance. To reduce the number of input features we take only one gene from each cluster, unlike CGF.
Let C∈{0,±1,±2}^{s×G} be the matrix of copy number data, where the s rows correspond to the s samples, and the g columns correspond to G genes; C(i,j) is a copy number that corresponds to gene j of sample i (see Fig. 6(a)). We initially sort the raw CN data according to the order of the gene positions on the chromosome.
Then, we cluster the genes into groups by their copy number similarity (line 310 in Algorithm 1). For two gene columns \(\vec {p}, \vec {q} \in \{0, \pm 1, \pm 2\}^{1 \times G}\), we use the Hamming distance as the similarity measure
where p_{i} and q_{i} are the i th entry of vectors \(\vec {p}\) and \(\vec {q}\), which represents the copy numbers of each gene for ith sample. Starting from the first gene in C, say a representative of the first group, we calculate the hamming distance with the following genes in order. Until the distance is less than the predetermined threshold d_{cn}, we merge each corresponding gene to the first group. If the distance first reached the threshold d_{cn}, the first group ends and the corresponding gene becomes a representative of the second group. Then start from the new representative, calculate the hamming distance with the following genes.
Repeating this algorithm until the genes are all clustered, each gene is in a unique group with neighboring genes. Finally, we filter the CN data with the representative genes, the first gene in each group, resulting in a matrix \(\tilde {C} \in \{0, \pm 1, \pm 2\}^{s \times \tilde {g}_{\mathsf {cn}}}\) where \(\tilde {g}_{\mathsf {cn}}\) is the number of groups as the result of our filtering. Note that \(\tilde {g}_{\mathsf {cn}}\) is much smaller than the number of original genes. We formally state our CN data filtering algorithm in Algorithm 1 and shown in Fig. 6(a).
Variants data filtering
For Variants data filtering, the feature selection mechanism uses only a mutation’s effect data classified in 4 different levels: LOW, MODERATE, MODIFIER, and HIGH. The Variants data filtering works based on the effectiveness of the mutations. It filters out the genes with less effective mutations.
In the Variants data, for each sample, only a few mutation effect data of individually selected genes are given. So the raw Variants data for whole samples and genes is almost empty, which cannot be used immediately as a neural network input. Hence our Variants data filtering is separately applied to each cancer type to reduce the empty spaces and fill in yet remaining spaces with zero.
Let V_{t} be the matrix of the raw data with mutation’s effect corresponding to the tumor type t (1≤t≤T). V_{t} is a s_{t}×G matrix with mutation’s effect as strings, where s_{t} rows correspond to s_{t} samples and G columns correspond to whole G genes for each t. V_{t}(i,j) is the effect of the mutation in j th gene of i th sample with t (see Fig. 6(b)). We first encode the string data to predetermined real values between 0 and 1: LOW = 0.2, MODERATE = 0.5, MODIFIER = 0.9, HIGH = 1.0, and 0 if no information exists. The resulting encoded Variants matrix is sparse matrix in \(\phantom {\dot {i}\!}E^{s_{t} \times G}\), where E={0,0.2,0.5,0.9,1.0} be a set of encoding values.
Secondly, we sum V_{t} by each gene column and if the sum is more than the predetermined threshold k_{var}, then put the gene in the list of the selected genes (line 28 in Algorithm 2). The columnwise summation \(\sum _{i=1}^{s_{t}}V_{t}(i,j)\) is a mutation’s effect of jth gene to the s_{t} samples, so the selected genes in list can be regarded as genes with considerable mutation effects. The union of the selected genes from each tumor type t is the set of filtered genes.
Finally, we filter the whole Variants data with the filtered genes for each t, resulting in a matrix \(\tilde {V} \in E^{s \times \tilde {g}_{\mathsf {var}}}\) with much smaller genes to whole samples The workflow of the Variants data filtering technique is shown in Fig. 6(b).
Roadmap for training step over training data
After preprocessing for CN and variants data, we get the data matrix \(X = [\tilde {C}\tilde {V}]\) with size s×g where \(g = \tilde {g}_{\mathsf {cn}} + \tilde {g}_{\mathsf {var}}\) by concatenating two filtered results. Then, with the tumor data Y, we train a neural network model from (X,Y) in plain while transform Y as the onehot encoded label matrix with size s×T. Here, we note that T is 11 in our dataset, which is relatively small than s and g. Our neural network model consists of one hidden layer with 64 nodes and linear activation function and output layer with 11 nodes. In the output layer, we use softmax activation function to output their predicted value. During the training phase, we used Keras library [50] in Python with batch size of 32, number of epochs of 50 and dropout rate of 0.9. Note that we used sufficiently big dropout rate in order to avoid overfitting since the layers in our model are small.
To select the best parameters, d_{cn} and k_{var}, for the preprocessing on the given train dataset, we train and evaluate shallow neural network models using 10fold cross validation with filtered data based on each pair of parameters. After selecting the best parameters, we train the shallow neural network model on the entire train dataset using the best parameters, and we use the trained model for the inference step over encrypted test dataset.
Roadmap for inference step over encrypted data
From now on, we state the method to compute inference step from our model with encrypted data. Precisely, input data matrix X with size s×g and weight matrix W with size g×T are given (recall that s,g and T refers the number of samples, genes, and tumors, respectively). Since T is relatively small than s and g in practice, we consider the matrixmatrix multiplication Y=X·W as matrixvector multiplications \(\vec {y}_{i} = X \cdot \vec {w}_{i}\) for 0≤i<T, where \(\vec {w}_{i}\) is the (i+1)th column of W. Then, for each row Y_{j} of Y (0≤j<s), we compute the softmax function to get final score of our model. Since the matrix Y is still encrypted, we need to compute an approximate function of softmax. In short, our Method can be divided into three steps:

1.
Data Packing : encrypt the matrix X to a number of ciphertexts {ct_{i}}.

2.
MatrixVector Multiplication : using {ct_{i}} and W, compute matrixvector multiplication to get the ciphertext ct_{Y} that contains Y=X·W.

3.
Softmax Evaluation : compute approximate softmax function for ct_{Y} to obtain softmax output for each row of Y.
We break the method down into the first two steps and the other, and explain the main ideas in each section.
Data packing and matrix multiplications
Although CKKS scheme supports SIMD operation between vectors, the SIMD operation cannot be directly applied to the matrixvector multiplication operation. Therefore, in order to efficiently perform a matrixvector multiplication operation using HE, a process of mapping a matrix to a number of vectors is required. In this section, focusing on the matrix multiplications in our scenario, we first state the naive approach and then suggests our optimization method.
The main idea of mapping a matrix to vectors for this computation is suggested in [15], which is that \(X \cdot \vec {w}_{i}\) can be understood as n times of slotwise vector multiplication between diagonal part of X and w_{i}. Concretely, we define \(\vec {x}_{j}\) by the jth diagonal part of X so that \(\vec {x}_{j} = (X(k, [j+k1]_{g})_{1 \le k \le s}\) for 0≤j<g. For the repeated rotation of each column \(\vec {w}_{i}\) of W, we define with superscript with parentheses as \(\vec {w}_{i}^{(j)} = (\vec {w}_{i}[j+k1]_{g})_{1\le k \le s}\) for 0≤j<g. Then, the matrixvector multiplication can be computed as \(\vec {y}_{i} = X \cdot \vec {w}_{i} = \sum _{j=0}^{g1} \vec {x}_{j} \odot \vec {w}_{i}^{(j)}.\) Hence, the multiplication can be done if we encrypt \(\vec {x}_{j}\)’s in several ciphertexts and compute constant multiplication between ciphertext and plain vector \(\vec {w}_{i}^{(j)}\) (see Fig. 7(a)).
Warmup with toy example
Before we state our explicit method for encodings and matrixmultiplications, we start with a toy example, simplifying the parameters as s=8,g=4, and T=4, and assumsing that the number of slots in one ciphertext is 4 times larger than the size of vector s. Then our goal is to compute \(\vec {y}_{i} = \sum _{j=0}^{3} \vec {x}_{j} \odot \vec {w}_{i}^{(j)}\) for 0≤i<4. In the naive approach, we encode and encrypt \(\vec {x}_{i}\)’s by \(\mathsf {ct} = \mathsf {Enc}(\vec {x}_{1} \ \vec {x}_{2} \ \vec {x}_{3} \ \vec {x}_{4})\) and define \(\tilde {w}_{i} = (\vec {w}_{i}^{(0)} \ \vec {w}_{i}^{(1)} \ \vec {w}_{i}^{(2)} \ \vec {w}_{i}^{(3)})\) for each i. Then we can compute Hadamard multiplication by \(\mathsf {ct}_{i}^{\mathsf {mult}} = \mathsf {ct} \cdot \tilde {w}_{i}\), which contains the message vector \((\vec {x}_{0} \odot \vec {w}_{i}^{(0)})\ \cdots \ (\vec {x}_{3} \odot \vec {w}_{i}^{(3)})\). Hence, the final sum is obtained by computing rotation twice as \(\mathsf {ct}_{i}' = \mathsf {ct}_{i}^{\mathsf {mult}} + \mathsf {Rot}(\mathsf {ct}_{i}^{\mathsf {mult}};2s)\) and \(\mathsf {ct}_{i}^{\mathsf {sum}} = \mathsf {ct}_{i}' + \mathsf {Rot}(\mathsf {ct}_{i}';s)\), where the real parts of first s slots in \(\mathsf {ct}_{i}^{\mathsf {sum}}\) contains \(\vec {y}_{i}\) (see Fig. 7(b)).
From the naive approach, we need T number of the matrixvector multiplication, so that the CMult and Rot should be computed T and 2T times, respectively. Here, the implementation cost of Rot is larger than CMult, so we suggest new packing method to reduce the number of rotation in the whole algorithm. Our main idea is to duplicate and encode X multiple times in each ciphertexts, which reduces the number of rotations since the number of summation between the slots in one ciphertexts is reduced.
Now, we propose a new approach for matrixvector multiplication to reduce the number of constant multiplications and rotations. Our main idea is that we copy each \(\vec {x}_{j}\)’s several times in encoding step. For the same condition as above example, we can encode the vectors \(\vec {x}_{i}\) twice so that we obtain two ciphertexts \(\mathsf {ct}_{0} = \mathsf {Enc}(\vec {x}_{0} \ \vec {x}_{0} \ \vec {x}_{1} \ \vec {x}_{1})\) and \(\mathsf {ct}_{1} = \mathsf {Enc}(\vec {x}_{2} \ \vec {x}_{2} \ \vec {x}_{3} \ \vec {x}_{3})\). In this case, we define \(\tilde {w}_{k, l}\) as in Eq. (2).
for 0≤k,l<2. Then we get \(\mathsf {ct}_{k, l}^{\mathsf {mult}} = \mathsf {ct}_{l} \cdot \tilde {w}_{k, l} = \mathsf {Enc}(\vec {x}_{2l} \odot \vec {w}_{2k}^{(2l)} \ \cdots \ \vec {x}_{2l+1} \odot \vec {w}_{2k+1}^{(2l+1)})\). Thus, by rotating and adding those ciphertexts as \(\mathsf {ct}_{k}^{\mathsf {sum}} = \sum _{l=0}^{1} \mathsf {ct}_{k,l}^{\mathsf {mult}}\) and \(\mathsf {ct}_{Y, k} = \mathsf {ct}_{k}^{\mathsf {sum}} + \mathsf {Rot}(\mathsf {ct}_{k}^{\mathsf {sum}}; 2s)\), the result ciphertexts contain message vectors as in Eqs. (3) and (4),
which are \(\vec {y}_{i}\)’s that we desired. In this method, the operations we need are 4 CMults and 2 Rots, while 2 ciphertexts are required for encryption instead of 1 (see Fig. 7(c)).
Using imaginary part of message space
The CKKS scheme supports operations between complex numbers, but in applications using real numbers only such as neural networks, the imaginary part is never used. Here, we can make the computation in the encrypted state more efficient by using the imaginary part, which is not used in the plain operation.
In [35], the authors suggested the method to reduce the number of multiplications using CKKS scheme. For four real numbers a, b, c, and d, the sum of the two products ab+cd is equal to the real part of one complex product (a+ib)(c−id), where \(\mathsf {i}= \sqrt {1}\). Since the matrixvector multiplication we need is also composed of the sum of Hadamard multiplications between vectors, we can reduce the number of ciphertext and the number of constant multiplications by half by combining each vector by two and encoding it into one complex number.
Rotate and sum algorithm
For the matrixvector multiplication, the addition between data packed in one ciphertext is needed. We use RotSum algorithm, which repeatedly computes rotation and addition to get summation of desired slots in a ciphertext.
Precisely, we define the algorithm RotSum(ct,s,t,d) outputs the ciphertext ct_{out} that is obtained by computing ct_{out}←ct and ct_{out}←ct_{out}+Rot_{rk}(ct_{out};st^{j}) for j=0,1,⋯,d−1 in order. As a result, if ct was an encryption of the message vector \(\vec {m} =(m_{i})_{1\le i\le n}\), then ct_{out} contains the message vector \(\vec {m}_{\mathsf {out}} = \left (\sum _{j=0}^{t^{d}1} m_{[i+js]_{n}}\right)_{1 \le i \le n}\).
Putting it all together
Combining all of the above methods, we explain the explicit method used for our implementation. For the rest of this section, we denote the concatenation of vectors by \(\_{i=s}^{t} \vec {a}_{i} = (\vec {a}_{s} \ \vec {a}_{s+1} \ \cdots \\vec {a}_{t})\) and the repeated concatenation of the same vector by \((\vec {a})^{r} = (\vec {a} \ \vec {a} \ \cdots \ \vec {a})\) (r times). Also, we can embed matrices X and W into large matrices with row and column lengths of powerof2, respectively, so we assume that s,g are powerof2’s. Recall that our goal is to compute \(\vec {y}_{j} = \sum _{i=1}^{g} \vec {x}_{i} \odot \vec {w}_{j}^{(i)}\) for 0≤j<t while \(\vec {x}_{i}\)’s are encrypted, so that the output \(\vec {y}_{j}\)’s are also encrypted.
As explained before, we copy each \(\vec {x}_{i}\)’s m times when encoding, where m is a powerof2. Since each vector has the size s and the number of slots in a ciphertext is n, each ciphertext will contain \(\ell = 2 \cdot \frac {n}{ms}\) number of different vectors (in both real and imaginary parts). Thus we need total \(\frac {g}{\ell } = m \cdot \frac {sg}{2n}\) number of ciphertexts for encoding.
Precisely, we encode \(\vec {x}_{i}\)’s to each ciphertext ct_{i} and plain vector \(\tilde {w}_{ji}\) by Eqs. (5) and (6) for \(0 \le i \le \frac {g}{\ell }  1\) and \(0 \le j \le \lceil \frac {T}{m} \rceil  1\), respectively. Note that the massage vector in ct_{i} sequentially contains m copies of vector \(\vec {x}_{i \cdot \ell + 2q} + \mathsf {i} \cdot \vec {x}_{i \cdot \ell + 2q + 1}\) for each q. Here we define \(\vec {w}_{jm+r} = \vec {0}\) if jm+r≥T.
Next, we compute the multiplication step. For each j, we multiply ct_{i} and \(\tilde {w}_{ji}\), add its conjugation and divide it by 2 to get \(\mathsf {ct}_{ji}' = \left ((\mathsf {ct}_{i} \cdot \tilde {w}_{ji}) + \overline {(\mathsf {ct}_{i} \cdot \tilde {w}_{ji})}\right) / 2\). Then, by adding them to get \(\mathsf {ct}_{j}^{\mathsf {sum}} = \sum _{i=0}^{\frac {g}{\ell }1} \mathsf {ct}_{ji}'\), we get the ciphertext \(\mathsf {ct}_{j}^{\mathsf {sum}}\) which is the encryption of the vector in Eq. (7) for each j.
Finally, we compute the following summations using RotSum so that the first ms slots contain the desired sums and are copied to the remaining slots. The Eq. (8) indicates the result of vector operation from Eq. (7) since the summation for (i·ℓ+wq+k)’s for all indices q,i, and k works as the summation for all indices from 0 to q−1. Hence, the result is the desired vector \(\vec {y}_{jm+r}\).
Then for each j, we compute the result ct_{Y,j} that contains jth group of m outputs \(\{\vec {y}_{jm + r}\}_{0\le r< m}\) using RotSum operation. From Eq. (8), the resulting ciphertext ct_{Y,j} can be computed as Eq. (9).
As a result, the desired vectors \(\vec {y}_{i} = X \cdot \vec {w}_{i}\) are contained in \(\lceil \frac {T}{m} \rceil \) number of ciphertexts \(\{\mathsf {ct}_{Y, j}\}_{0\le j < \lceil \frac {T}{m} \rceil }\). Note that the number of rotations used in the Eq. (9) is \(\log \frac {\ell }{2} = \log \frac {n}{ms}\) for each j, so total \(\log \frac {n}{ms} \cdot \lceil \frac {T}{m} \rceil.\)
In summary, if we duplicate \(\vec {x}_{j}\)’s m times for encryption, then the required number of rotations is reduced by m times, where the number of ciphertexts for encryption in increased by m times. Exactly, the number of rotations and multiplications are \(\log \frac {n}{ms} \cdot \lceil \frac {T}{m} \rceil \) and \( \frac {sg}{2n} \cdot m \lceil \frac {T}{m} \rceil \), respectively, while the number of ciphertexts for encryption is increased to \(m\cdot \frac {sg}{2n}\) (note that the value \(\frac {sg}{2n}\) implies the minimum number of ciphertexts to encrypt X). Since the number of multiplication does not change asymptotically for m, our method provides a time cost tradeoff between the number of rotations and encryptions. In our implementation environment, we can reduce the total time cost by using more memory to encrypt data since the rotation cost is the significant part in total matrixvector multiplication algorithm.
Approximation of softmax
While softmax layer substantially enhances the performance of the classification, it cannot be directly computed by CKKS scheme since it comprises several nonpolynomial operations. Recall that the softmax of a real vector \(\vec {v}=(v_{1}, \cdots,v_{t})\) is defined as Eq. (10).
Both exponential function and division are not polynomial, so we should replace them by their polynomial approximation. We introduce a proper polynomial approximation technique for each of exponential and division function. In general, in order to approximate a nonpolynomial operation by a polynomial, minimax method that minimizes the maximum error value within an interval or Chebyshev approximation that express the function by the series of Chebyshev polynomials are mainly used. However, the minimax approximation loses the increasing property of exponential and the division function, because the sign of the error is not constant. Hence, this method is not suitable in terms of calculating microAUC score. Instead, we suggest an approximation method of the softmax layer that is more suitable for microAUC score. In particular, our approximation method consists of a less number of squaring operations, so it has an advantage to be evaluated over homomorphically encrypted data.
Approximation of exponential function
Our approximation method for exponential function comes from the elementary definition of exp(x). Precisely, for some r, we approximately compute exp(x) as Eq. (11).
Note that this formula can be computed by squaring n times, so it mitigates the computational overhead accompanied by the HE computation. Also, our approximation of exp(x) is monotone on [−2^{r},∞), so it is more appropriate to approximate softmax functions in terms of getting a high microAUC score compared to other polynomial approximation techniques such as minimax and Chebyshev approximation.
Furthermore, to make the evaluation more stable, we rather consider the approximation of scaled exponential function, AE_{r,L}(x) as Eq. (12). From the definition, we expect this funciton to satisfy \( \texttt {AE}_{r,L}(x) \approx \left (\frac {L}{2^{r}}\right)^{2^{r}} \exp (x). \) If we carefully choose L large enough that satisfies \(\frac {2^{r} + x}{L}<1\) for the all possible choices of x, then the values will not rapidly grow during the evaluation. This makes the computation more stable to the HE implementation which usually supports fixed precision bits.
Goldschmidt’s algorithm
Goldschmidt’s divison algorithm [51] is one of the most popular approximation algorithm to compute the inverse of a real number. For x∈(0,2), Goldschmidt’s algorithm uses the property as in Eq. (13).
Note that the approximation converges rapidly as d grows, and it uses 2d−2 multiplications to evaluate the approximate polynomial of degree 2^{d}−1. The small number of multiplication gives a great advantage on implementation with HE, because HE multiplication between ciphertext is hugely time consuming. The detailed algorithm is described in Algorithm 3. Here, note that the Algorithm 3 quickly converges to 1/x since the ratio between the error and true value is \(\frac {a_{d}  1/x}{1/x} = (1x)^{2^{d+1}}\).
Moreover, we can easily exploit Goldschmidt’s algorithm Gol(·) to approximately evaluate 1/x on (0,2M), instead of (0,2), by using the equation \( \frac {1}{x} \approx \frac {1}{M} \texttt {Gol} \big (\frac {x}{M}\big). \) We note that the scaling factor M should be carefully chosen since the error accompanied by Goldschmidt’s algorithm becomes nonnegligiblly large as the input is near 0. Therefore, if we select an overly large M, the input values become smaller, resulting in an error that cannot be ignored.
To put approximate exponential function and Goldschmidt’s algorithm together, we now can approximately compute the softmax layer by using HE operations. Denoting \(L_{r} = \left (\frac {L}{2^{r}}\right)^{2^{r}}\), we utilize the Eq. (14) to compute approximate softmax function using HE.
The detailed algorithm is described in Algorithm 4.
Availability of data and materials
The dataset is generated and analyzed by the same way as iDASH competition 2020 from the The Cancer Genome Atlas (TCGA) dataset. The dataset is available in [41].
Notes
which means that an attacker needs more than 2^{128} operations to recover any secret message from a ciphertext with the current best algorithm.
Abbreviations
 HE:

Homomorphic encryption
 DNN:

Deep neural network
 CNN:

Convolutional neural network
 SIMD:

Single instrument multiple data
 CN:

Copy number
References
Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018; 68(6):394–424. https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21660.
Yu J, Ongarello S, Fiedler R, Chen X, Toffolo G, Cobelli C, Trajanoski Z. Ovarian cancer identification based on dimensionality reduction for highthroughput mass spectrometry data. Bioinformatics. 2005; 21(10):2200–09.
Nguyen C, Wang Y, Nguyen HN. Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic. J Biomed Sci Eng. 2013; 6:551–60. https://doi.org/10.4236/jbise.2013.65070.
Yuan Y, Shi Y, Li C, Kim J, Cai W, Han Z, Feng DD. Deepgene: an advanced cancer type classifier based on deep learning and somatic point mutations. BMC Bioinformatics. 2016; 17(17):243–56.
He Z, Zhang J, Yuan X, Zhang Y. Integrating somatic mutations for breast cancer survival prediction using machine learning methods. Front Genet. 2021; 11:1853.
Chaudhuri K, Monteleoni C, Sarwate AD. Differentially private empirical risk minimization. J Mach Learn Res. 2011; 12(3):1069–109.
Jagannathan G, Pillaipakkamnatt K, Wright RN. A practical differentially private random decision tree classifier. In: 2009 IEEE International Conference on Data Mining Workshops. Miami: IEEE: 2009. p. 114–21. https://doi.org/10.1109/ICDMW.2009.93.
Ball M, Carmer B, Malkin T, Rosulek M, Schimanski N. Garbled neural networks are practical. IACR Cryptol ePrint Arch. 2019; 2019:338.
Chen H, GiladBachrach R, Han K, Huang Z, Jalali A, Laine K, Lauter K. Logistic regression over encrypted data from fully homomorphic encryption. BMC Med Genet. 2018; 11(4):3–12.
Kim D, Son Y, Kim D, Kim A, Hong S, Cheon JH. Privacypreserving approximate gwas computation based on homomorphic encryption. BMC Med Genet. 2020; 13(7):1–12.
Microsoft SEAL (release 3.6). Redmond: Microsoft Research; 2020. https://github.com/Microsoft/SEAL. Accessed 1 Dec 2020.
Fan J, Vercauteren F. Somewhat practical fully homomorphic encryption. IACR Cryptol ePrint Arch. 2012; 2012:144.
Gentry C, Sahai A, Waters B. Homomorphic encryption from learning with errors: Conceptuallysimpler, asymptoticallyfaster, attributebased. In: Annual Cryptology Conference. Springer: 2013. p. 75–92. https://link.springer.com/chapter/10.1007/9783642400414_5.
Cheon JH, Kim A, Kim M, Song Y. Homomorphic encryption for arithmetic of approximate numbers. In: Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Application of Cryptology and Information Security. Springer: 2017. p. 409–37. https://link.springer.com/chapter/10.1007/9783319706948_15.
Halevi S, Shoup V. Algorithms in helib. In: Annual Cryptology Conference. Springer: 2014. p. 554–71. https://link.springer.com/chapter/10.1007/9783662443712_31.
Wu XZ, Zhou ZH. A unified view of multilabel performance measures. In: International Conference on Machine Learning. PMLR: 2017. p. 3780–88. https://dl.acm.org/doi/10.5555/3305890.3306072.
Yan KK, Zhao H, Pang H. A comparison of graph and kernelbased –omics data integration algorithms for classifying complex traits. BMC Bioinformatics. 2017; 18(1):539. https://doi.org/10.1186/s1285901719824.
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A selective review of multilevel omics data integration using variable selection. HighThroughput. 2019; 8(1):4. https://doi.org/10.3390/ht8010004.
Rappoport N, Shamir R. Multiomic and multiview clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 2018; 46(20) https://academic.oup.com/nar/articlepdf/46/20/10546/26817354/gky889.pdf. https://academic.oup.com/nar/articlepdf/46/20/10546/26817354/gky889.pdf. https://academic.oup.com/nar/articlepdf/46/20/10546/26817354/gky889.pdf.
Richardson S, Tseng GC, Sun W. Statistical methods in integrative genomics. Ann Rev Stat Appl. 2016; 3(1):181–209. https://doi.org/10.1146/annurevstatistics041715033506. http://arxiv.org/abs/https://doi.org/10.1146/annurevstatistics041715033506.
Li Y, Wu FX, Ngom A. A review on machine learning principles for multiview biological data integration. Brief Bioinform. 2016; 19(2):325–40. https://doi.org/10.1093/bib/bbw113. https://academic.oup.com/bib/articlepdf/19/2/325/25524236/bbw113.pdf
Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, Castellani G, Milanesi L. Methods for the integration of multiomics data: mathematical aspects. BMC Bioinformatics. 2016; 17(2):15. https://doi.org/10.1186/s1285901508579.
Wu D, Wang D, Zhang MQ, Gu J. Fast dimension reduction and integrative clustering of multiomics data using lowrank approximation: application to cancer molecular classification. BMC Genomics. 2015; 16(1):1022. https://doi.org/10.1186/s1286401522238.
Chen Y, Sun J, Huang LC, Xu H, Zhao Z. Classification of cancer primary sites using machine learning and somatic mutations. BioMed Res Int. 2015;2015. https://doi.org/10.1155/2015/491502.
Yuan Y, Shi Y, Su X, Zou X, Luo Q, Feng DD, Cai W, Han ZG. Cancer type prediction based on copy number aberration and chromatin 3d structure with convolutional neural networks. BMC Genomics. 2018; 19(6):1–8.
Sun Y, Zhu S, Ma K, Liu W, Yue Y, Hu G, Lu H, Chen W. Identification of 12 cancer types through genome deep learning. Sci Rep. 2019; 9(1):1–9.
Lee K, Jeong Ho, Lee S, Jeong WK. Cpem: Accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network. Sci Rep. 2019; 9(1):1–9.
Kim A, Song Y, Kim M, Lee K, Cheon JH. Logistic regression model training based on the approximate homomorphic encryption. BMC Med Genet. 2018; 11(4):23–31.
Bonte C, Vercauteren F. Privacypreserving logistic regression training. BMC Med Genet. 2018; 11(4):13–21.
Crawford JLH, Gentry C, Halevi S, Platt D, Shoup V. Doing real work with fhe: The case of logistic regression. In: Proceedings of the 6th Workshop on Encrypted Computing & Applied Homomorphic Cryptography. WAHC ’18. New York: Association for Computing Machinery: 2018. p. 1–12. https://doi.org/10.1145/3267973.3267974.
Blatt M, Gusev A, Polyakov Y, Rohloff K, Vaikuntanathan V. Optimized homomorphic encryption solution for secure genomewide association studies. BMC Med Genet. 2020; 13(7):1–13.
Kim M, Song Y, Li B, Micciancio D. Semiparallel logistic regression for gwas on encrypted data. BMC Med Genet. 2020; 13(7):1–13.
Sim JJ, Chan FM, Chen S, Tan BHM, Aung KMM. Achieving gwas with homomorphic encryption. BMC Med Genet. 2020; 13(7):1–12.
Carpov S, Gama N, Georgieva M, TroncosoPastoriza JR. Privacypreserving semiparallel logistic regression training with fully homomorphic encryption. BMC Med Genet. 2020; 13(7):1–10.
Kim M, Harmanci A, Bossuat JP, Carpov S, Cheon JH, Chillotti I, Cho W, Froelicher D, Gama N, Georgieva M, et al.Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation. Cell Syst. 2021; 12(11):1108–20.
Brakerski Z, Gentry C, Vaikuntanathan V. (leveled) fully homomorphic encryption without bootstrapping. ACM Trans Comput Theory. 2014; 6(3):1–36.
Chillotti I, Gama N, Georgieva M, Izabachène M. TFHE: Fast Fully Homomorphic Encryption Library. https://tfhe.github.io/tfhe/. Accessed 30 Oct 2020.
Albrecht MR, Player R, Scott S. On the concrete hardness of learning with errors. J Math Cryptol. 2015. https://bitbucket.org/malb/lweestimator. Accessed 30 Oct 2020.
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. https://gdc.cancer.gov/aboutdata/publications/mc32017. Accessed 11 Jun 2021.
Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers. https://gdc.cancer.gov/aboutdata/publications/PanCanMetabolism. Accessed 11 Jun 2021.
Harmanci AO. TCGA Dataset Processed in the Same Way as iDASH Competition 2020.https://drive.google.com/drive/folders/1r9VV5D6S0aOo9aW1wVE2vC_lrS9KTkfj?usp=sharing. Accessed 11 Jun 2021.
Kim A, Song Y, Kim M, Lee K, Cheon JH. Logistic regression model training based on the approximate homomorphic encryption. BMC Med Genet. 2018; 11(4):83.
Kim M, Song Y, Wang S, Xia Y, Jiang X. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR Med Inform. 2018; 6(2):e19. https://doi.org/10.2196/medinform.8805.
Han K, Hong S, Cheon JH, Park D. Logistic regression on homomorphic encrypted data at scale. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33: 2019. p. 9466–71. https://ojs.aaai.org/index.php/AAAI/article/view/5000.
Cheon JH, Kim D, Kim Y, Song Y. Ensemble method for privacypreserving logistic regression based on homomorphic encryption. IEEE Access. 2018; 6:46938–48.
Cheon JH, Han K, Hong SM, Kim HJ, Kim J, Kim S, Seo H, Shim H, Song Y. Toward a secure drone system: Flying with realtime homomorphic authenticated encryption. IEEE Access. 2018; 6:24325–339. https://doi.org/10.1109/ACCESS.2018.2819189.
Gentry C. A Fully Homomorphic Encryption Scheme: Stanford university; 2009. https://dl.acm.org/doi/10.5555/1834954.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 2016. p. 770–78. https://ieeexplore.ieee.org/document/7780459.
Simonyan K, Zisserman A. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. 2014.
Chollet F, et al.Keras. 2015. https://keras.io. Accessed 30 Oct 2020.
Goldschmidt RE. Applications of division by convergence. PhD thesis: Massachusetts Institute of Technology; 1964. https://dspace.mit.edu/handle/1721.1/11113.
Acknowledgements
We would like to acknowledge Dr. Duhyeong Kim for helping to optimize the HE algorithm and having a useful discussion. Furthermore, we would like to acknowledge Prof. Arif Harmanci, who helped us reprocess and publish the dataset in the same way it was used in the iDASH competition.
Funding
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.2020000840, Development and Library Implementation of Fully Homomorphic ML Algorithms supporting Neural Network Learning over Encrypted Data).
Author information
Authors and Affiliations
Contributions
SH designed the overall algorithms for this study. JHP contributed to build the structure of model and approximate softmax algorithm. WC contributed to train model and find the best parameters. HC contributed to preprocess data and construct filtered dataset. JHC contributed in writing the manuscript. All authors have read and approved the manuscript. SH takes final responsibility for the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hong, S., Park, J.H., Cho, W. et al. Secure tumor classification by shallow neural network using homomorphic encryption. BMC Genomics 23, 284 (2022). https://doi.org/10.1186/s1286402208469w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286402208469w
Keywords
 Homomorphic encryption
 Multilabel classification
 Privacy
 Neural network
 Softmax activation