A two-way rectification method for identifying differentially expressed genes by maximizing the co-function relationship

Background The identification of differentially expressed genes (DEGs) is an important task in many biological studies. The currently widely used methods often calculate a score for each gene by estimating the significance level in terms of the differential expression. However, biological experiments often have only three duplications, plus plenty of noises contain in gene expression datasets, which brings a great challenge to statistical analysis methods. Moreover, the abundance of gene expression levels are not evenly distributed. Thus, those low expressed genes are more easily to be detected by fold-change based methods, which may results in high false positives among the DEG list. Since phenotypical changes result from DEGs should be strongly related to several distinct cellular functions, a more robust method should be designed to increase the true positive rate of the functional related DEGs. Results In this study, we propose a two-way rectification method for identifying DEGs by maximizing the co-function relationships between genes and their enriched cellular pathways. An iteration strategy is employed to sequentially narrow down the group of identified DEGs and their associated biological functions. Functional analyses reveal that the identified DEGs are well organized in the form of functional modules, and the enriched pathways are very significant with lower p-value and larger gene count. Conclusions An integrative rectification method was proposed to identify key DEGs and their related functions simultaneously. The experimental validations demonstrate that the method has high interpretability and feasibility. It performs very well in terms of the identification of remarkable functional related genes.


Background
High-throughput experiments make it possible to evaluate the expression levels for thousands of genes in biological samples [1]. Gene expression data can reflect the gene expression level of the sample to be analyzed under different experimental conditions. Detecting differentially expressed genes (DEGs) across different experiments conditions is an essential step and sometimes the major goal in the statistical analysis of expression data [2]. It helps to understand the function of genes when cells respond to different conditions [3]. In addition, detecting DEGs can be a pre-step for clustering gene expression profiles or testing gene set enrichments [4,5].
Numerous methods have been developed for identifying DEGs. A typical approach based on fold change (FC) [6,7] calculates a ratio of the average expression values between control and test samples, where the threshold of 2-fold are usually employed to select genes under or above the threshold as DEGs [8]. Since the biological experiments often have limited number of duplications, plus plenty of noises contained in gene expression datasets, which makes the detection of FC based methods a little bit arbitrary [9]. To overcome this, many statistical approaches such as t-test, significance analysis of microarrays (SAM), etc, then become popular by modeling the distributional properties of gene expression levels. The SAM [10] method imposes a restriction on the variability of the genes by adding a value to the denominator of the t-statistic, excluding the genes that do not change or with high p-value. Another popularly used method called Moderated t-statistic (ModT) [9,11] uses a t-distribution with augmented degrees of freedom. It is the integration of linear model with empirical Bayes, aiming to obtain a p-value for each gene and choose a feasible false discovery rate (FDR). These methods can calculate a score for each gene, and each of them can result in a ranked list of genes in order of their estimated significance level [12]. The performance of the statistics based algorithms depends on the number of available duplications. If there are less number of duplications in biological experiments, it is difficult to assume the distribution of data from a statistical point of view. Although the above methods combine mean and sample variance with the availability of p-values in cope with the high level of noise of dataset, they ignore the interactions between genes and obtain a set of isolated genes in a biomolecular network that are easily enriched in many unrelated biological functions, which is hard to analyze those genes' functions from a system biology point of view and may become less efficiency to detect DEGs.
There are also methods that attempt to take into account the interactions between genes to reduce the effects of uneven distribution of the dataset. An influential approach called Characteristic Direction (ChDir) [12] towards the resolution of relationships between genes to increase statistical power. It incorporates a regularization scheme to maximize the use of dimensional information in expression data. Besides, the method provides an intuitive visualization of differential expression in terms of one single direction, which facilitates our subsequent analysis. The Min-Edge [13] method proposed in our previous study takes the interactions between genes into consideration, and detects DEGs based on the evaluation of differentially expressed edges, which is very useful for finding key genes related to disease.
We believe that giving two experimental conditions, DEGs should be involved in a certain biological function and may lead to a certain disease or phenotypic changes. Besides, they should be enriched in several distinct functions, such as functional related cellular pathways, GO terms or some common cellular functions [14]. Moreover, these pathways or functions should strongly related to each other under the biological experiment conditions. Thus, in this study, we propose a novel two-way rectification method to narrow down a set of potential genes and their associated cellular functions iteratively. The method starts from a set of high-confidence genes detected by any existed method, then one-way rectification is performed to search for the enriched cellular functions by conducting the genes into pathway enrichment analysis. Thirdly, an opposite direction of rectification is performed to rectify the DEG list by adding the meaningful genes that strongly related to target functions and removing the useless genes that related to only isolated functions. After conducting several rounds of this two-way rectification process, a set of closely related DEGs and pathways could be narrowed down by maximizing their co-function relationships.
The overall framework of the proposed two-way rectification method is illustrated in Fig. 1, where the cycle nodes represent a set of DEGs, and the diamond nodes represent the enriched pathways. Edge connects between a gene and a pathway means the gene is one of the annotated genes in that pathway.

Results
The performance of the proposed method for finding DEGs from microarray datasets is evaluated using "Platinum Spike" and GSE41089 dataset. "Platinum Spike" dataset with ground truth information is used for the comparison of performance of proposed method with other gene selection methods. The microarray dataset with accession GSE41089 is used to validate the ability of designed method through biology point of view. The performance of the proposed method is further applied and compared with various gene detection algorithms, such as the ModT and Min-Edge.
The data of four groups (ModT, Two-way rectification using ModT, Min-Edge, Two-way rectification using Min-Edge) are divided into two comparisons (ModT vs Fig. 1 The principle of this two-way rectification. The color of expression data represent the different expression values of genes in control and test samples. The solid dot nodes represent a set of DEGs, and the diamond nodes with blue color represent those enriched pathways. The color of circle node depicts the differential expression level of the gene, the more significant the differential expression level of the gene, the closer the color of the node to red. Edge connects between a gene and a pathway means the gene is one of the annotated genes in that pathway. The thicker the edge, the closer the relationship between the gene and the pathway Two-way rectification using ModT, Min-Edge vs Two-way rectification using Min-Edge), where Two-way rectification using ModT means the initial genes of Two-way rectification is selected from ModT. Similarly, the Twoway rectification using Min-Edge means the initial genes of Two-way rectification is selected from Min-Edge.

The "Platinum Spike" dataset Sensitivity verification of the parameters
In this section, we construct PPI networks based on the detected DEGs under different parameters of n, m and i. The ratio of the number of genes contained in the largest connected component in the PPI network to the number of DEGs detected at the i th round of rectification is used as the connectivity of the network. The connectivity of the network reflects whether these detected genes are contribute to the necessary biological processes.
We execute the algorithm with n from 1 to 100, m from 0.01 to 0.1. Because of each values of parameters correspond to a sub-figure, so we give part of the results in this paper, results of m = 0.05 and m = 0.1 under different values of n. Figure 2 shows the changes of the connectivity of the PPI network constructed by DEGs found by each method under the corresponding parameters. From Fig. 2a, for the first (i = 1) round of rectification with m = 0.05, we can see that when select the a smaller number of candidate genes from initial gene list, even no genes have PPI correlations of each method. For the second (i = 2) round of rectification, we can conclude that the connectivity of network obtained by Two-way recti-fication using Min-Edge is higher than Min-Edge first, but the connectivity of the network decreased with the increasing number of DEGs, which is due to the genes detected by Two-way rectification using Min-Edge contribute to some functional pathways but the centrality of genes in PPI network is small. Besides, it depicts the algorithm is not stable enough with smaller n. Moreover, the connectivity of the network is well performed when n ∈ (20, 100), which reflects the genes obtained by Two-way rectification based method are strongly interact with each other. When the number of initial genes n ∈ (80, 100), the algorithm obtained a lot of genes, these genes may co-expressed in the biological mechanisms because of the good influence on the connectivity of network. Considering the sensitivity of this convergence algorithm, we suggest that the values of n should be in (20,80). Figure 2b shows the connectivity of network with m = 0.1. Similarly, we can find that at the i th round of rectification, the Two-way rectification using Min-Edge performs well than Min-Edge, the Two-way rectification using ModT performs well than ModT. Comparing the Fig. 2a with b, more interacted genes can be detected when the values of m is 0.1 than 0.05.

Accuracy validation of the methods
The gene selection methods are employed for finding the DEGs based on differential expression between two experimental groups. And these two experimental groups are compared in terms of the AUC (area under the receiver operating characteristic (ROC) curve) scores. Algorithm with the highest AUC score performs the best. We computed true positive rate (TPR) as the number of true DEGs, true positive over the 1690 ground-truth DEGs, and the number of false positive gene (FPR) over the 11234 genes that are not differentially expressed, over all of the 12924 genes. Figure 3 shows the results of the ROC curves and the corresponding AUC scores of Two-way rectification using Min-Edge, Min-Edge, Two-way rectification using ModT and ModT, respectively. According to the figure, these four algorithms achieved AUC scores of 0.943028, 0.842122, 0.839329 and 0.801652, respectively, among which the AUC score of Two-way rectification using Min-Edge is higher than Min-Edge, and the AUC score of Two-way rectification using ModT is higher than the competing algorithm ModT.

The dataset of mice infected with T. cruzi The number of identified DEGs
By conducting the ModT and Min-Edge algorithm on 8088 genes, 1565 and 1864 ranked DEGs are detected as the candidate genes, respectively. Based on these data, we select the top n DEGs from each method and regard them as the start gene list of the two-way rectification method. We execute the algorithm on the dataset with different values of m and n. As is illustrated in previous section, we choose the result of middle value with n = 20, n = 50, m = 0.05, m = 0.1, i ∈ (1, 9) as display in Table 1. When we select the top 20 ranked genes detected by ModT method as the initial gene set of two-way rectification, we obtain 38 DEGs after the first (i = 1) round of rectification by using m = 0.05 while 84 genes are obtained after the second (i = 2) round of rectification. We can con- The red, orange, blue and purple lines depict the ROC curves of Two-way rectification using Min-Edge, Min-Edge, Two-way rectification using ModT and ModT, respectively. The AUC value of Two-way rectification using Min-Edge is 0.943028, which is higher than that of the other three algorithms clude that the number of DEGs be larger as the algorithm iterates more rounds.
As is clearly shown in Table 1, the number of DEGs is proportional to the selection of n, and m at the i th round of rectification. The larger the value of n and m we set, the larger the number of final DEG set will be obtained under the same conditions. When taking the same number of n genes from Min-Edge and ModT method, respectively, it is found that two-way rectification will get more genes from Min-Edge than from ModT after the i th round of rectification, since two-way rectification approach based on the pathway which takes the interactions between genes into consideration and provides the ranked genes correlated with functions while ModT only provides the isolated ranked genes.

Connectivity of PPI networks constructed by DEGs
PPI networks based on the detected DEGs under the different parameters are constructed for systematically view of connectivity. The number of DEGs in each largest connected network reflects whether these remarkable genes are involved in necessary biological mechanisms [15].
According to the results shown in Table 1, we compare ModT (Min-Edge) with two-way rectification in terms of the same number of DEGs obtained at the i th round of rectification. To be more specific, we select the top n = 20 ranked genes from ModT method as the initial gene set of two-way rectification and obtain 38 DEGs after the first round of rectification of the proposed approach with m = 0.05. Then select the same number of 38 DEGs from ranked genes identified by ModT method for the    Fig. 4a, we found that at each round of rectification, there are more genes in the largest component obtained by two-way rectification using ModT than by ModT, which depicts the PPI network obtained by two-way rectification is more connected than by ModT. Similarly, the PPI network is significantly connected in two-way rectification using Min-Edge compared to that in Min-Edge. The other three figures in Fig. 4 have same efficiency about this.
These results suggest that almost all the genes detected by two-way rectification method involve in the largest connected components of network, and reveal that these genes with closely correlations may play important roles in the activation of functional characteristics in mechanisms.

KEGG pathway enrichment analysis
Previous discussion confirms that the number of differentially expressed genes is proportional to the value of n and m in our method. In order to ensure the credibility and functional sensitivity of the differentially expressed genes, we give the results of taking different values for each parameter. In this section, we select the results when i = 6, n = 50 and m = 0.1 as an example for subsequent analysis.
When i = 6, n = 50 and m = 0.1, we obtain 873 DEGs by performing two-way rectification using ModT, while 961 DEGs are obtained by two-way rectification method when using the initial set of top 50 genes from the Min-Edge method. KEGG pathway analyses for the 873 (961) known DEGs from the two comparisons are performed, taking out the top 5 significant pathways of the ModT (Min-Edge) method and the two-way rectification using ModT (Min-Edge) respectively, and get 8 pathways after taking the combination for each comparison, as Fig. 5 shows. Figure 5a illustrates that for each enriched pathway, the p-value of which is smaller and it can annotate more DEGs by using the two-way rectification method, compared to that by using the ModT method. Figure 5b gives the comparison of DEGs between Min-Edge and two-way rectification method of which the start genes get from Min-Edge. For cytokine-cytokine receptor interaction pathway, there are more genes annotated and the pathway is more significant with smaller p-value enriched by two-way rectification than by ModT or Min-Edge. The results demonstrate that the pathways obtained by the two-way rectification method tend to be more significant with lower p-value and a large number of genes, among which are mainly likely to play important roles in the development of organisms.

Strong associations between pathways and T. cruzi infection
The gene expression dataset [16] we used in current study is generated to study the Trypanosoma cruzi infection on mice, with which the Chagas disease related pathways is strongly associated. We reviewed the literature and cated a catalog of some significant pathways including chemokine signaling pathway, cytokine-cytokine receptor interaction pathway, Toll-like receptor signaling pathway, NOD-like receptor signaling pathway and Chagas disease pathway, etc [16][17][18][19], which may contribute to pathogenesis of Chagas disease through the stimulation of proinflammatory cytokines and chemokines, leading to systemic alterations during the infection with T. cruzi parasites.
The DEGs detected by each method can enrich into the Chagas disease related pathways mentioned above. Table 2 shows the p-value corresponding to the pathway enriched by each method and the number of DEGs it annotates. We can see from the table that each method can enrich these pathways, among which the two-way rectification is more significant than the others, which demonstrate the genes we obtained by two-way rectification have high confidence in disease infection. Moreover, by using the two-way rectification, one can tend to enrich in a list of significant pathways with very low p-value and a large number of genes, such as cytokine-cytokine receptor interaction and Chagas disease pathway, which are strongly responsible to their real experiment conditions. We thus extract the well-known Chagas disease pathway at each round of rectification for further observation. Figure 6 depicts that the two-way rectification of which initial gene list from ModT with lower p-value and more Chagas disease related genes compared with ModT at each round of rectification. When the initial gene list selected from Min-Edge, the two-way rectification also performs better than Min-Edge, which indicates that the genes obtained at the i th (i=1, 2, ..., 8) round of rectification are more concerntrated in Chagas disease. Besides, the two-way rectification method result in a significant level increase in the susceptibility to T. cruzi infection after the i th round of rectification, revealing useful roles of DEGs detected by this way in againsting Chagas disease.

Identification of guilted DEGs for pathways
The number of DEGs obtained by two-way rectification using Min-Edge after the sixth rectification is 961 when n = 50, m = 0.1, among which 173 significant pathways are enriched. The interaction network between these pathways and DEGs is constructed to unravel the associations between key genes and functions by two-way rectification method of which the initial genes is obtained from Min-Edge. Edges in the network represent the genes contained in the corresponding pathways. The size of the node corresponds to the number of links, which reveals whether it is a significant pathway or functional related gene. After removing the genes with smaller degree, we obtained a network with 33 hub DEGs, 33 pathways and 497 edges. Figure 7 presents the associations between key pathways and DEGs detected by the proposed two-way rectification method. Each gene in the network has strong interactions with multiple cellular pathways, which indicates its potential probability as a driver gene of diseases due to its key role in the interaction network [20]. Previous studies   [16][17][18][19]21], which have been recognized as the crucial for host resistance against T. cruzi infection by mechanisms. According to these known information, DEGs (CASP2, CASP8, NOD1, MYD88, TLRs, IL-1s, NAIPs) involved in Chagas disease here are screened, which have been proven to be correlated with the activation or inhibition of multiple Chagas disease related pathways [22]. As expected, these genes are well mapped into the Chagas disease related pathways, which reveals these identified genes have emerged as important components of these signaling pathways that account for detection of intracellular microbial infection. And DEGs obtained after Fig. 7 The association network between pathways and DEGs. Cycle nodes in the middle represents the detected DEGs, size of which corresponds to the number of links. Square nodes besides the genes are enriched pathways, the color of which denotes the p-value of the enriched pathway, the larger the square node, the more genes are annotated in this pathway the sixth rectification by two-way rectification method provides a list of significant genes which may competent at producing cytokines for the inflammasome in the host protection against T. cruzi.

Conclusions
We have described a novel idea to identify DEGs by maximizing the co-functions between genes and pathways simultaneously. The method starts with known candidate genes and integrates pathway enrichment to obtain functional related genes. The basic assumption of the algorithm is the top ranked genes involved in pathways should play important roles in cellular functions.
To validate the feasibility of the proposed method, we use "Platinum Spike" dataset with ground truth information to compare the performance of proposed method with other gene selection methods, and the microarray dataset with accession GSE41089 is used to validate the ability and interpretability of designed method through biology piont of view. We identified the candidate genes and pathways related to Chagas disease for resisting T. cruzi, it is found that CASP2, CASP8, NOD1, MYD88, TLRs, IL-1s, NAIPs, etc, would be responsible for the T. cruzi infection. In addition, T. cruzi infection is regulated and mediated by genes related to chemokine signaling pathway, cytokine-cytokine receptor interaction pathway, Toll-like receptor signaling pathway, NOD-like receptor signaling pathway and Chagas disease pathway, etc. These findings establish the groundwork and imply that although the procedure relies on the genes that identified by existed methods, the proposed method outperforms other approaches in selecting functional related DEGs from microarray data, and it has high performance in either the number of DEGs enriched in pathways or the functionality of DEGs.
Given its excellent performance, we believe that the proposed method may shed new light on relevant biological mechanisms that would have remained undiscovered by the current methods. Further experiments will be focused on powerful larger samples with biological interpretation in identifying differentially expressed genes.

Data sources
The "Platinum Spike" dataset is downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (GEO) website (accession GSE21344) [23] that consists of 18 spike-in samples (9 controls versus 9 tests). The designated FC associated file [24] contains 18952 probes, among which 1940 are known as differentially expressed probes. The robust multi-array average (RMA) method is used to normalize the probes. After data cleaning steps, we obtained 12924 genes, among which 1690 genes are known DEGs.
The dataset can help us to validate the sensitivity of the proposed method and to evaluate the performance of different methods.
The second dataset is downloaded from the GEO website under the accession number GSE41089 [16]. This dataset contains 22,690 probe sets, 3 samples from uninfected mice (control), and 3 samples from infected mice (test). The probe-level data is analyzed using the MAS 5.0 [25] algorithm for the determination of the chip quality, including intensity value background correction, log2 transformation, and quantile normalization [26,27], etc. The results of interest are confirmed through the robust R language affy Bioconductor package (https://www.bioconductor.org/). After these preprocessed steps, 8088 genes are retained for further analysis.
The pathway dataset of mouse and drosophila melanogaster is obtained from the database of KEGG. There are 317 pathways of mouse, and the total number of genes consisting of those pathways is 8578. Among them, 3492 genes are overlapped with the above 8088 genes. For drosophila melanogaster, there are 137 pathways and 5659 genes are involved in those pathways. Among them, 2383 genes are overlapped with the above 12924 genes.
The protein-protein interactions (PPIs) [28] dataset of the mouse and drosophila melanogaster is derived from the database of STRING [29] with a much larger number of associations. In this study, only PPIs between those 8088 genes and 12924 genes are selected to construct a PPI network and the score criterion is 0.4. After deleting the duplicated edges between the same pair of nodes and the edges connecting to itself, there are 7028 genes with 220817 edges for mouse and 11052 genes with 375793 edges left in this study.

The two-way rectification method
Algorithm 1 gives the details of the presented DEGs identification method.
Giving two sets of control and test samples. Let DEGsik and Pathway-i represent different versions of the set of DEGs and pathways at the i th round of rectification, respectively, where k = 1, 2. The gene expression of higher species is not only tissue-specific and development-stage specific, but also affected by environmental factors. The genes expressed in a single cell account for only 15% of the total number of genes. These expressed genes include the expression of newly emerging genes and the expression value of genes with different expression levels. That means the number of truth DEGs is less than 15% or even less. Thus, we suppose the number of DEGs should less than the 15 percent of the researching data, or even less. We proposed N g as the 15 percentile of the number of the genes consist in researching data, which could more than the number of real DEGs. The two-way rectification method is designed to narrow down a set of DEGs and their related cellular functions, the algorithm is summarized as follows and the Fig. 8 shows the details.
1. The initial DEG list. Since the proposed method relies on an existing DEGs identification approach, firstly we need to use an existing method (such as SAM) to calculate a differential score for each gene, and order these genes according to these differential scores and denote these ranked genes as L. For the input of the algorithm, the top n genes are selected as the initial candidate genes DEGs-i1, where i represents the number of rounds with rectification, n represents the number of DEGs that is firstly selected from all candidate genes. 2. One-way rectification. This step aims to obtain a set of high-confidence pathways accroding to the genes. As is shown in Fig. 8, for the i th round of rectification, we use DEGs-i1 to conduct a KEGG pathway [30,31] enrichment analysis and obtain a list of significant pathways Pathways-i. We found a subset of DEGs-i1 are annotated in Pathways-i, we select and order these genes as the set U. However, Fig. 8 The details of the two-way rectification. The cycle nodes with color corresponding to the ranked genes that sorted according to the differential score, where the larger the differential score, the redder the color. P i1 or P i2 in a is a subset of P i1 or P i2 in b, cause only part of related genes will be found when conduct a gene list (such as DEGs-i1) into KEGG pathway enrichment, whereas P i1 or P i2 has more genes itself for each pathway in KEGG database, it includes but not only the genes in the set DEGs-i1, but also contains many annotated genes that are not shown in P i1 or P i2 , so we rank these genes as set C. 3. An opposite direction of rectification. Once the Pathways-i is obtained, the DEGs-i1 list is updated by adding high confidence genes through the equation defined below, Then the isolated functional related genes contained in DEGs-i1 and U are removed. The DEGs of D are obtained through the equation (1), where |C| and |U| represent the number of genes in C and U, respectively. The parameter m is a factor that controls the scale of expansion of DEGs-i1, which can help to reduce the noise arising from some weak functional related genes. It should be noticed that we aim to find genes that are strongly related to target functions. Hence, the value of m is usually very small. By doing this, the set of candidate functional hub genes is: Thus, genes in the set of DEGs-i2 are regarded as new candidate function-related DEGs. 4. Output. Let N DEGs-i2 be the number of genes in the set of DEGs-i2. Let DEGs-i1 = DEGs-i2, i = i + 1, repeat the step 2, 3, until N DEGs-i1 ≤ N g . Through the above steps, we finally get an updated set DEGs-i2 and Pathways-i, where DEGs-i2 are strongly related to functions in Pathways-i.