StressGenePred: a twin prediction model architecture for classifying the stress types of samples and discovering stress-related genes in arabidopsis

Background Recently, a number of studies have been conducted to investigate how plants respond to stress at the cellular molecular level by measuring gene expression profiles over time. As a result, a set of time-series gene expression data for the stress response are available in databases. With the data, an integrated analysis of multiple stresses is possible, which identifies stress-responsive genes with higher specificity because considering multiple stress can capture the effect of interference between stresses. To analyze such data, a machine learning model needs to be built. Results In this study, we developed StressGenePred, a neural network-based machine learning method, to integrate time-series transcriptome data of multiple stress types. StressGenePred is designed to detect single stress-specific biomarker genes by using a simple feature embedding method, a twin neural network model, and Confident Multiple Choice Learning (CMCL) loss. The twin neural network model consists of a biomarker gene discovery and a stress type prediction model that share the same logical layer to reduce training complexity. The CMCL loss is used to make the twin model select biomarker genes that respond specifically to a single stress. In experiments using Arabidopsis gene expression data for four major environmental stresses, such as heat, cold, salt, and drought, StressGenePred classified the types of stress more accurately than the limma feature embedding method and the support vector machine and random forest classification methods. In addition, StressGenePred discovered known stress-related genes with higher specificity than the Fisher method. Conclusions StressGenePred is a machine learning method for identifying stress-related genes and predicting stress types for an integrated analysis of multiple stress time-series transcriptome data. This method can be used to other phenotype-gene associated studies.


Background
Recently, cellular molecule measurement technologies, such as microarray [1] and RNA-seq [2], can be used to measure the expression levels of tens of thousands of genes in a cell. Using these technologies, biologists have measured the change in gene expression levels under stress treatment over time. These time-series data are now available in databases such as ArrayExpress [3] and GEO [4]. To analyze of time-series transcriptome data, various methods were developed based on machine learning techniques such as linear regression, principal component analysis (PCA), naive Bayes, k-nearest neighbor analysis [5], simple neural network [6,7], naive Bayes methods [8], and ensemble model [9].
However, existing methods were designed to analyze gene expression data of a single stress, not of multiple stresses. Analyzing gene expression data of multiple stresses can identify stress-responsive genes with higher specificity because it can consider the effect of interference between stresses. However, since no method of integrating multiple stress gene expression data has been developed, this study aims to develop a method for an integrated analysis of transcriptome of multiple stress types.

Motivation
For the integrated analysis of transcriptome data of multiple stress, heterogeneous time-series analysis is should be considered [10]. Heterogeneous time-series analysis is a problem to analyze four-dimensional data of experimental condition (sample tissue, age, etc.), stress, time, and gene, where experimental condition axis and time axis are different among multiple time-series samples. Heterogeneous time-series analysis is explained in detail in the next section.
Many algorithms have been developed to analyze gene expression data. However, as far as we are aware of, there is no readily available machine learning algorithm for predicting stress types and detecting stress-related genes from multiple heterogeneous time-series data. Support vector machine (SVM) models are known to be powerful and accurate for classification tasks. Recently, SVMs are extended for multi-class problems and also for regression prediction. However, applying SVM for predicting stress-related genes and associating with phenotypes is not simple since the essence of the problem is to select small number of genes relevant to a few phenotypes. In fact, there is no known readily available prediction method for this research problem. Principal component analysis (PCA) is designed for predicting traits from the same structured input data, but it is not designed to analyze heterogeneous time-series data. Random forest (RF) is a sparse classification method, so how significant a gene is associated with stress is hard to be evaluated. Naive Bayes method [8] can measure the significance of genes, but it is not suitable for heterogeneous time-series data input. Clustering is one of the widely used machine learning approaches for gene expression data analysis. The STEM clustering method [11] clusters genes according to changes in expression patterns in time-series data analysis, but does not accept heterogeneous time-domain structure data.
Thus, we designed and implemented a neural network model, StressGenePred, to analyze heterogeneous timeseries gene expression data of multiple stresses. Our model used feature embedding methods to address the heterogeneous structure of data. In addition, the analysis of heterogeneous time-series gene expression data, on the computational side, is associated with the high-dimension and low-sample-size data problem, which is one of the major challenges in machine learning. The data consists of a large number of genes (roughly 20,000) and a small number of samples (about less than 100). To deal with the high-dimension and low-sample-size data problem, our model is designed to share a core neural network model between twin sub-neural network models: 1) biomarker gene discovery model 2) stress type prediction model. These two submodels perform tasks known in the computer field as feature (i.e., gene) selection and label (i.e., stress type) classification, respectively.

Multiple heterogeneous time-series gene expression data
Multiple stress time-series gene expression data is a set of time-series gene expression data. The k-th time-series gene expression data, D k , contains expression values for three dimensional axes: gene axis, G k = {g k1 , . . . , g k|G k | }, time axis, T k = {t k1 , . . . , t k|T k | }, experimental condition axis, F k = {f k1 , . . . , f k|F k | }. However, the structure and values of time dimension and experimental condition dimension can be different in multiple samples, called "heterogeneous time-series data. "

The time-series gene expression datasets of four stress types
In this paper, we analyze multiple heterogeneous timeseries data of four major environmental stresses: heat, cold, salt and drought. We collected the 138 sample timeseries data related to the four types of stress from Array-Express [3] and GEO [4]. Figure 1 shows the statistics of

Methods
StressGenePred is an integrated analysis method of multiple stress time-series data. StressGenePred (Fig. 2) includes two submodels : a biomarker gene discovery model (Fig. 3) and a stress type prediction model (Fig. 4).
To deal with the high-dimension and low-sample-size data problem, both models share a logical correlation layer with the same structure and the same model parameters. From a set of transcriptome data measured under various stress conditions, StressGenePred trains the biomarker gene discovery model and the stress type prediction model sequentially.

Submodel 1: biomarker gene discovery model
This model takes a set of stress labels, Y, and gene expression data, D, as input, and predicts which gene is a biomarker for each stress. This model consists of three parts: generation of an observed biomarker gene vector, generation of a predicted biomarker gene vector, and comparison of the predicted vector with the label vector. The architecture of the biomarker gene discovery model is illustrated in Fig. 3, and the process is described in detail as follows.

Generation of an observed biomarker gene vector
This part generates an observed biomarker vector, X k , from gene expression data of each sample k, D k . Since each time-series data is measured at different time points under different experimental conditions, a time-series gene expression data must be converted into a feature vector of the same structure and the same scale. This process is called feature embedding. For the feature embedding, we symbolize the change of expression before and after stress treatment by up, down, or non-regulation. In detail, a time-series data of sample k is converted into an observed biomarker gene vector of length 2n, X k = {x k1 , . . . , x k2n }, where x k2n−1 ∈ {0, 1} is 1 if gene n is downregulation or 0 otherwise, x k2n ∈ {0, 1} is 1 if gene n is up-regulation or 0 otherwise. For determining up, down, or non-regulation, we use the fold change information. First, if there are multiple expression values measured from replicate experiments at a time point, the mean of expression values is calculated for the time point. Then, the fold change value is computed by dividing the maximum or minimum expression values for a time-series data by the expression value at first time point. After that, the gene whose fold change value > 0.8 or < 1/0.8 is considered as up or down regulation gene. The threshold value of 0.8 is selected empirically. When the value of 0.8 is used, the fold change analysis generates at least 20 up or down regulation genes for all time-series data.

Generation of a predicted biomarker gene vector
This part generates a predicted biomarker gene vector, X k , from stress type label Y k . X k = {x k1 , . . . , x 2kn } is a vector of the same size as the observed biomarker gene vector X k . The values of X k ' means up or down regulation as same as X k . For example, x k2n−1 = 1 means gene n is predicted as a down-regulated biomarker, or x k2n = 1 means gene n is predicted as a up-regulated biomarker, for a specific stress Y k . A logical stress-gene correlation layer, W, measures the weights of association between genes and stress types. The predicted biomarker gene vector, X k , is generated by and a stress type prediction model (right). The two submodels share a "single NN layer". Two gray boxes on the left and right models output the predicted results, biomarker gene and stress type, respectively multiplying stress type of sample k and the logical stressgene correlation layer, i.e., Y k × W . In addition, we use the sigmoid function to summarize the output values between 0 to 1. The stress vector, Y k , is encoded as one-hot vector of l stresses, where each element indicates whether the sample k is each specific stress type or not. Finally, the predicted biomarker gene vector, X k , is generated like below: The logical stress-gene correlation layer has a single neural network structure. The weights of the logical stress-gene correlation layer are learned by minimizing the difference between observed biomarker gene vector, X k , and predicted biomarker gene vector, X k .

Comparison of the predicted vector with the label vector
Cross-entropy is a widely-used objective function in logistic regression problem because of its robustness to outlierincluding data [12]. Thus, we use cross-entropy as the objective function to measure the difference of observed biomarker gene vector, X k , and predicted biomarker gene vector, X k , as below: By minimizing the cross-entropy loss, logistic functions of the output prediction layer are learned to predict the true labels. Outputs of logistic functions can predict that a given gene responds to only one stress or to multiple stresses. Although it is natural for a gene to be involved in multiple stresses, we propose a new loss term because we aim to find a biomarker gene that is specific to a single stress. To control relationships between genes and stresses, we define a new group penalty loss. For each feature weight, the penalty is calculated based on how much stresses are involved. Given a gene n, a stress vector g n is defined as g n =[ g n1 , g n2 , ..., g nl ] with l stresses and g nl = max(w l,2n , w l,2n+1 ). Then, the a group penalty is defined as ( (g n )) 2 . Since we generate the output with a logistic function, g nl will have a value between 0 and 1. In other words, if g n is specific to a single stress, the group penalty will be 1. However, if the gene n reacts to multiple stresses, the penalty value will increase quickly. Using these characteristics, the group penalty loss is defined as below: On the group penalty loss, hyper-parameter α regulates effects of group penalty terms. Too large α imposes excessive group penalties, so genes that respond to multiple stresses are linked only to a single stress. On the other hand, if the α value is too small, most genes respond to multiple stresses. To balance this trade-off, we use wellknown stress-related genes to allow our model to predict the genes within the top 500 biomarker genes at each stress. Therefore, in our experiment, the α was set to 0.06, and the genes are introduced in "Ranks of biomarker genes and the group effect for gene selection" section.

Submodel 2: stress type prediction model
From biomarker gene discovery model, the relationships between stresses and genes are obtained by stress-gene correlation layer W. To build stress type prediction model from feature vectors, we utilize the transposed logical layer W T and define a probability model as below: Matrix W is calculated from a training process of the biomarker gene discovery model. A k means an activation value vector of stress types, and it shows very large deviations depending on the samples. Therefore, normalization is required and performed as below: For the logistic filter, these normalized embedded features vectors encapsulate average weight stress-feature relationship values that reduce variances among the vectors with different samples. As another effect of the normalization, absolute average weights are considered rather than relative indicator like softmax. So, false positive rates of predicted stress labels can be reduced. Using the normalized weights A norm k , logistic filter is defined to generate a probability as below: where a and b are general vector parameters of size L of logistic model g(x).
Learning of this logistic filer layer is started with normalization of the logistic filter outputs. This facilitates learning by regularizing the mean of the vectors. Then, to minimize loss of positive labels and entropy for negative labels, we adopted the Confident Multiple Choice Learning(CMCL) loss function [13] for our model as below: To avoid overfitting, a pseudo-parameter β is set by recommended setting from the original CMCL paper [13]. In our experiments, β = 0.01 ≈ 1/108 is utilized.

Results
In this paper, two types of experiments were conducted to evaluate the performance of StressGenePred.

Evaluation of stress type prediction
StressGenePred was evaluated for the task of stress type prediction. The total time-series dataset (138 samples) was divided randomly 20 times to build a training dataset (108 samples) and a test dataset (30 samples). For the training and test datasets, a combination analysis was performed between two feature embedding methods (fold change and limma) and three classification methods (StressGenePred, SVM, and RF). The accuracy measurement of the stress type prediction was repeated 20 times. Table 1 shows that feature embedding with fold change is more accurate in the stress type prediction than limma. Our prediction model, StressGenePred, more correctly predicted the stress types compared to other methods. Then, we further investigated in which cases our stress type prediction model predicted incorrectly. We divided the total dataset into 87 samples of training dataset and 51 samples of test dataset (28 cold stress and 23 heat stress samples). Then, we trained our model using training dataset and predicted stress types for the test dataset. Figure 5 shows three of 51 samples were predicted wrong in our model. Among them, two time-series data of cold stress type were predicted salt then cold stress types, and those samples were actually treated to both stresses [14]. This observation implied our prediction was not completely wrong.

Evaluation of biomarker gene discovery
The second experiment was to test how accurately biomarker genes can be predicted. Our method was compared with Fisher's method. The p-value of Fisher's method was calculated using the limma tool for each gene for each stress types (heat, cold, drought, salt). The genes were then sorted according to their p-value scores so that the most responsive genes came first.
Then, we collected known stress-responsive genes of each stress type in a literature search, investigated EST profiles of the genes, and obtained 44 known biomarker genes with high EST profiles. We compared the ranking results of our method and Fisher method with the known biomarker genes. The Table 2 shows that 30 of 44 genes ranked higher in the results of our method than the Fisher method. Our method was better in the biomarker gene discovery than Fisher method (p = 0.0019 for the Wilcoxon Signed-Rank test).
Our method is designed to exclude genes that respond to more than one stress whenever possible and to detect genes that only respond to one type of stress. To investigate how this works, we collected genes known to respond to more than one stress. Among them, we excluded genes that resulted in too low a ranking (> 3, 000) for all stress cases.
When comparing the results of our method to the Fisher method for these genes, 13 of 21 genes ranked lower in the result of our method than Fisher method (Table 3). This suggests that our model detects genes that respond only to one type of stress. Figure 6 shows a plot of changes in expression levels of some genes for multiple stresses. These genes responded to multiple stresses in the figure.

Literature-based investigation for discovered biomarker genes
In order to evaluate whether our method found the biomarker gene correctly, we examined in literature the relevance of each stress type to the top 40 genes. Our findings are summarized in this section and discussed further in the discussion section.  In the case of heat stress, we identified heat-related genes, including HSFA2, which are known to play an essential role in the plant's heat response. Heat shock protein genes such as HSP101, HSP15.7, HSP17.6, HSP20like, Hsp21, Hsp22, Hsp70B, and Hsp70T-2 we have identified are known to be highly related to heat stress. Mitochondrial heat shock protein genes such as AtHSP23.6 and MTHSC70-2 and chloroplast position genes such as HSP21 have also been identified. We predicted NADH dehydrogenases of energy metabolism which are related to heat stress.
In the case of salt stress, we have identified previously known ABA-related genes, such as ABI2, ABF1, HAI1 and HAI2, and late embryonic development-rich protein genes, such as AtLEA4-5, LEA7. Water biomarker genes as ATD18, NAC019, NAC047 and RAP2.6 were identified. We have also identified genes of common stress-response class genes, such as ALDH7B4 and ALDH2B7, AtMYB74, CYP707A1, and CYP94B3.
In the case of cold stress, we identified ADS2, AtGolS3, FP6, FRO3, GSTU18, UDP-glucosyl transferase, some lipid metabolism-related genes that are involved in a rearrangement of physical properties of the plasma membrane and cell wall. In addition, we identified genes related to development such as AGL20, BBX29, and GI. We also identified water biomarker genes such as ABF1, BBX25, and RAP2.1.
Finally, in the case of drought stress, we confirmed the involvement of well-known genes such as HIS1-3, NAC019 and SAUR63. Besides, we were able to identify common biomarker genes such as development-related AGL19 and CYP89A9. In addition, we predicted genes involved in microorganism development and differentiation such as ATHB-7, BRS1, GAMMA-VPE, GOLS2, MEE3, and PDCB3.

Discussion
In this section, we discuss gene-stress relationship in depth, referring to the current literature.

Biological function of heat stress-responsive genes
For heat stress, our model identified HSFA2, Hsp21, Hsp22, Hsp70B, Hsp70T-2, HSP101, HSP20-like, HSP17.6, HSP15.7, and NADH dehydrogenases. In heat stress, HSFA2 takes an essential part of heat response and may relate with histone methylation. HSFA2 is highly inducible and a direct target of HSFA1. HSFA2 is known to bind to the promoter of Hsp22 in vitro experiments [15]. Hsp22 is an endomembrane-localized protein during heat stress [16]. Hsp70 family proteins are well-known proteins, however functionally diversified. Hsp21 is small heat shock protein, which required for the development of chloroplasts [17] and associates with the thylakoid membranes [18]. HSP70 is a molecular chaperone and support To investigate that StressGenePred excludes genes that respond to more than one stress, 21 genes known to respond to more than one stress are collected. Among the 21 genes, 13 genes rank lower in the result of StressGenePred than Fisher method (Table 3) plastid protein translocation [19]. HSP70b may involve a protein accumulation in the cytosol [20] and inducible by heat shock, not by low temperature [21]. HSP101 is a member of the Hsp100/ClpB family of proteins, is thought to be involved in disaggregation of misfolded proteins [22]. HSP101 protects protein translation factors during heat stress [23]. HSP17.6 is induced by heat and osmotic stress, and overexpression of AtHSP17.6A increase salt and drought tolerance in Arabidopsis [24]. Hsp17.6CII is a peroxisome-localized catalase chaperone [23]. Also, HSP15.7 is inducible by heat shock and high light, detected in peroxisome [25]. Interestingly, both the chloroplast-located genes HSP21 and mitochondrial heat shock proteins such as AtHSP23.6 and MTHSC70-2 were identified.

Biological function of cold stress-responsive genes
For cold stress, our model predicted many genes involved in plasma membrane fluidity and cell wall rigidity. ADS2 gene adjusts the composition of membrane lipids, and confer chilling and freezing tolerance in Arabidopsis [26]. AtGolS3 codes galactinol synthase 3 which is only induced by cold stress and target of DREB1A [27]. FP6 is farnesylated protein 6, interacts with ACBP2, and the transgenic plants showed overexpression had Cd(II) tolerance [28]. FRO is an iron chelate reductase, and FRO3 is predicted to involve in iron metabolism and iron reduction in the root [29].

Biological function of salt stress-responsive genes
For salt stress, our model identified ABI2, ABF1, HAI1, HAI2, LEA7, AtLEA4-5, NAC019, NAC047, ATD18, RAP2.6, CYP707A1, CYP94B3, AtMYB74, ALDH7B4 and ALDH2B7 genes. In salt stress, many genes of downstream signal transduction or possibly related with ABA such as ABI2, ABF1, HAI1 and HAI2, late embryogenesis abundant proteins like LEA7 and AtLEA4-5. ABI2 is a protein phosphatase 2C, interacts with SOS2 and inhibits SOS2 activity [30]. ABI2 involved in ABA-mediated transcription of chloroplast genes and link nitrate uptake and utilization [31]. ABF1 regulates the induction of DREB2A [17] and is necessary for seedling establishment during winter. Expression of ABF1 is induced by cold, heat, and ABA [32]. HAI1 has roles in decreasing the low water potential signaling that controls proline and osmoregulatory solute accumulation [33]. HAI1 is involved in feedback regulation of ABA signaling and HAI2 is a positive regulator of ABA and related to Fig. 6 Visualization of gene expression for multiple stress associated genes. Genes that were investigated to be responsive to multiple stresses. In the visualization results, these genes responded to multiple stresses and were not suitable for biomarker genes of a single stress cell signaling mediated by ABA [34]. Late embryogenesis abundant proteins like LEA7 could protect the plasma membrane or organellar membrane. Its activity occurs at cytosol exposed side of the membrane [35]. AtLEA4-5 is a member of small, hydrophilic protein group, showing high expression levels in response hyperosmotic, drought, and ABA treatment [36]. NAC is a water stress-responsive transcription factor. NAC019 has ABRE-like motifs, and the motifs could induce expression in response to stress. NAC019 promoter interacts with a key mediator of ABA expression, ABI4, AP2 family transcription factors [37]. ATD18, also known as RAB18, is dehydrin family protein and required for ABA signal transduction. ATD18 expression is repressed by ethylene treatment [38]. RAP2.6 is induced by salt and osmotic stress. RAP2.6 promoter contains ABRE, DRE, MYBR, W-box, RAVbox, so seems like it may be an essential intersection in biotic and abiotic signaling [39]. CYP707A1 is a member of cytochrome P450 CYP707A family encoding ABA-8'-hydroxylases. CYP707As are working as structure modifiers of metabolites responsive to the abiotic stress, exogenous ABA treatment, and dehydration [40].

Biological function of drought stress-responsive genes
For drought stress, our model predicted many of early response genes against water stress. HIS1-3 has histone H1 globular domain and is expressed by dehydration and ABA [41]. SAUR63 is a member of early auxin-responsive genes family, promoting organ elongation by auxin stimulation in Arabidopsis [42]. AGL19 is expressed by a shortday photoperiod and vernalization [43]. Gamma-VPE is a type of vegetative VPE and induced during senescence, wounding, and pathogen infection [44]. Gamma-VPE has a cysteine protease activity and may be involved in plant hypersensitive cell death [41]. GOLS2 increase galactinol biosynthesis and improve oxidative stress tolerance. This gene regulated by HsfA3 [45]. AtGolS2-expressing transgenics displayed significantly improved drought tolerance [46]. MEE3 (Maternal Effect Embryo arrest 3) is a subfamily of single-MYB transcription factor and related to regulation of early photomorphogenesis [47]. BRS1 is involved in brassinosteroid signaling pathway. This gene was expressed strongly in the root and related to plant root development [48]. BRS1 gene encodes a serine carboxypeptidase II-like protein, secreted and active serine carboxypeptidase [49].

Stress responsive transcription factors
We examined genes that change expression levels with respect to temperature stress. Some of these genes were transcription factors, and they did not appear for other type stress because our predictive model predicted genes specifically associated with specific stresses. But what we can observe is that TFs, such as ARF, ERF, bZIP, which are involved in plant hormonal reactions, can be activated at both high and low temperatures when there are temperature-related stresses. Our model predicted NAD4L and NAD5 (NADH dehydrogenase subunits 4L and 5) and several unknown genes encoded in the mitochondrial genome that only affected heat stress. Some genes in mitochondria may be involved in the initial transcriptional response when under heat stress.
In the case of salt and drought stress, we predicted two TF genes, HD-ZIP (ATHB-5; AT2G468) and NAC (ANAC019: AT1G5289), which are associated with both stresses. These two genes are likely to respond early to water-related stress. NAC domain TF is prominent in salt stress, but not drought stress. We observed SAURs (small auxin upregulated RNA) in drought stress, which means that it is a small RNA that is actively involved in plant physiological regulation during long-term water deficiency.

Diversity of responses to multiple stresses
In this study, we selected four different types of stress to find and classify the affected genes. The effects of these environmental stresses are overwhelming, but they do not define specific parts of metabolism and physiological consequences. The characteristics of the four stresses we studied have in common with the physiological response associated with water. Although they react differently depending on the signaling pathways of each stress, they do not have complete separation because of the commonalities associated with using water. Many of the biomarker genes we have found have been shown to respond to multiple stresses, and have shown a variety of phenotypes for different stresses in plants that have been transfected with mutations or recombinant genes. The APX gene is a gene that responds to all four stresses, and other genes such as AREB, AtRIP, DREB, Gols and MAPs are well known as genes that respond to multiple stresses. In this study, the genes involved in the specific stresses we predicted were either identical in other stresses or related to multiple complex stresses.

Conclusion
This study presented StressGenePred, a method of analyzing a set of time-series transcriptome data for multiple types of stress. StressGenePred consists of twin classification models to achieve two analytic goals. The biomarker gene discovery model aims to discover genes that respond to specific stresses. The goal of the stress type prediction model is to classify samples into four types of stress, heat, cold, drought, and salt. The key problem in this study is to train the StressGenePred model from high-dimension (approximately 20,000 genes) and low-sample-size data (138 sample data in the study). Analysis of high-dimension and low-sample-size data is a difficult computational problem that many researchers are studying. In order to be trained with a small number of data, StressGenePred is designed to use a simplified architecture (only one logical layer) with a small number of parameters. StressGenePred is also designed so that twin classification models share the same logical layer and its parameters. In twin classification models, the logical layer is used symmetrically with respect to input and output. For example, the input and output in the biomarker gene discovery model are stress and genes, respectively, and the stress type prediction model is vice versa. When the logical layer is shared by both classification models, the parameters of the logical layer are trained redundantly in both models, reducing the number of data required.
In experiments using Arabidopsis stressed gene expression data, StressGenePred detected known stress-related genes at a higher rank compared to Fisher's method. StressGenePred showed better performance than random forest and support vector machine in stress type prediction.