Network-based gene prediction for Plasmodium falciparum malaria towards genetics-based drug discovery

Background: Malaria is the most deadly parasitic infectious disease. Existing drug treatments have limited efficacy in malaria elimination, and the complex pathogenesis of the disease is not fully understood. Detecting novel malaria-associated genes not only helps reveal the disease pathogenesis, but also facilitates the discovery of new targets for anti-malaria drugs.
Methods: In this study, we developed a network-based approach to predict malaria-associated genes. We constructed a cross-species network to integrate human-human, parasite-parasite and human-parasite protein interactions. We then extended the random walk algorithm to this network and used known malaria genes as the seeds to find novel candidate genes for malaria.
Results: We validated our algorithm using 77 known malaria genes: the 14 human genes and 63 parasite genes were ranked on average within the top 2% and top 4% of the human and parasite genomes, respectively. We also evaluated our method for predicting novel malaria genes using a set of 27 genes with supporting evidence from the literature. Our approach ranked 12 genes within the top 1% and 24 genes within the top 5%. In addition, we demonstrated that top-ranked candidate genes were enriched for drug targets, and identified commonalities underlying top-ranked malaria genes through pathway analysis. In summary, the candidate malaria-associated genes predicted by our data-driven approach have the potential to guide genetics-based anti-malaria drug discovery.

The pathogens causing malaria are Plasmodium species. After being injected by mosquitoes into human skin, these parasites infect the liver and multiply using the host's cell resources. They then invade the red blood cells and cause the disease symptoms [12][13][14]. In both the liver and blood stages, the parasites trigger the host's innate immune responses and remodel the host cells to survive those responses [15][16][17][18][19]. The complex pathogenesis of malaria involves both human and parasite genomes [20][21][22][23][24], and is not yet fully understood [25][26][27].
Studies of human-parasite protein interactions have provided insights into the molecular signatures of malaria-specific host immune responses [20,28,29]. For example, studies show that the parasite protein PfEMP1 binds the human proteins CD36 [30][31][32] and ICAM1 [33][34][35], which play critical roles in the adhesion of infected red blood cells to endothelial cells and eventually lead to the disruption of the blood-brain barrier in cerebral malaria patients [36,37]. Another example shows that the PfRh family of parasite proteins directly interacts with the human protein CR1 during the invasion of red blood cells, and CR1 has the potential to become a target of blood-stage vaccines [38,39].
Large-scale data have now been accumulated on the human genome, the parasite genome and their interactions. Integration and systematic analysis of these cross-species genomic data may lead to novel discoveries about the genetic basis of malaria. In this study, we designed a data-driven approach to infer novel malaria-associated genes. Recent computational disease gene discovery algorithms have shown great potential in predicting disease causes [40][41][42][43][44][45][46][47][48]. They exploit the human protein interactome and assume that genes related to a disease phenotype tend to be located in the same neighborhood of the protein-protein interaction network [49]. However, these traditional methods are not sufficient for predicting genes for malaria, which naturally involves human-parasite protein interactions. Our approach represents the interacting human and parasite genomes as a heterogeneous network. We prioritized genes that are functionally related to the known malaria genes in the heterogeneous network and investigated whether the top-ranked genes have the potential to guide drug discovery for malaria. We made our results publicly accessible at nlp.case.edu/public/data/malaria.

Methods
Our experimental workflow is depicted in Figure 1 and consists of two steps: (1) prioritize genes through network analysis and (2) analyze the result. We first constructed separate genetic networks for the human genome and the parasite genome, and then connected them with host-pathogen protein interactions. We used genes that are known to be associated with malaria as the seeds and applied a random-walk-based algorithm to rank genes in the cross-species network. To validate our method for prioritizing malaria genes, we performed a "leave-one-out" cross validation analysis and examined the ranks of a set of malaria genes extracted from the literature. We then evaluated whether the top-ranked genes are druggable. Finally, we analyzed the functions of the prioritized genes by extracting pathways on the basis of the gene ranking.

Construct cross-species gene network
We constructed the genetic networks for human and Plasmodium falciparum (the species that causes the most dangerous form of malaria) from the STRING database [50,51]. STRING includes gene relationships for over a thousand species from four sources: protein-protein interaction (PPI) databases, PPIs mined from literature abstracts, curated pathway databases and co-expressed genes. We used all four sources to build comprehensive networks for both human and Plasmodium falciparum. The human network contains 20,770 proteins and 4,850,628 interactions, and the Plasmodium falciparum network contains 4,913 proteins and 1,007,938 interactions. In addition, we used the scores from STRING to weight the edges in the two genetic networks.
We connected these two protein networks with 36 interactions from PathogenPortal [52] and the literature [29,30,33]. These interactions are binary and cover physical associations, direct interactions and chemical reactions between the two species. The interaction pairs from the literature were curated manually. We mapped the gene identifiers in these interactions to those used in the human and parasite genetic networks using the HUGO Gene Nomenclature Committee [53] and PlasmoDB [54].
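The network construction described above can be illustrated with a minimal sketch. It assumes the two intra-species networks are stored as weighted, symmetric adjacency matrices and the host-pathogen interactions as a binary human-by-parasite matrix. The function names (`intra_adjacency`, `inter_adjacency`), the toy intra-species edges and all edge weights are hypothetical; only the protein names and the three host-pathogen pairs are taken from the text.

```python
import numpy as np

def intra_adjacency(nodes, weighted_edges):
    """Symmetric weighted adjacency matrix for one species' network."""
    index = {name: i for i, name in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for u, v, weight in weighted_edges:
        A[index[u], index[v]] = weight
        A[index[v], index[u]] = weight  # interactions are undirected
    return A

def inter_adjacency(human_nodes, parasite_nodes, pairs):
    """Binary human-by-parasite matrix of host-pathogen interactions."""
    h_index = {name: i for i, name in enumerate(human_nodes)}
    p_index = {name: i for i, name in enumerate(parasite_nodes)}
    A = np.zeros((len(human_nodes), len(parasite_nodes)))
    for human, parasite in pairs:
        A[h_index[human], p_index[parasite]] = 1.0
    return A

# Toy data: intra-species edges and weights are made up for illustration.
human_nodes = ["CD36", "ICAM1", "BSG"]
parasite_nodes = ["PfEMP1", "PfRh5"]
A_H = intra_adjacency(human_nodes, [("CD36", "ICAM1", 0.7), ("ICAM1", "BSG", 0.4)])
A_P = intra_adjacency(parasite_nodes, [("PfEMP1", "PfRh5", 0.5)])
A_HP = inter_adjacency(
    human_nodes, parasite_nodes,
    [("CD36", "PfEMP1"), ("ICAM1", "PfEMP1"), ("BSG", "PfRh5")],
)
```

At the scale reported here (millions of edges), a sparse matrix representation would likely be preferable to dense arrays; the dense form is used only to keep the sketch short.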

Predict candidate genes for malaria
We manually collected 77 known malaria genes and used them as the seeds in our algorithm to find additional malaria genes. Among the 77 seed genes, 14 human genes were extracted from Online Mendelian Inheritance in Man (OMIM). In addition, extensive literature evidence suggests that the Plasmodium falciparum proteins PfEMP1 [55][56][57], PfRh4 [38,58,59] and PfRh5 [60][61][62] are essential for parasite growth and red blood cell invasion. We extracted the 63 parasite genes encoding these three proteins and added them to the seed list.
We initiated a random walk on the cross-species genetic network from the seeds, and ranked all the genes by their probabilities of being reached from the seeds. We extended the algorithm by regulating the movements of the random walker between networks with a jumping probability λ. We denote the human and parasite genetic networks by H and P, respectively. When the random walker stands on a node in H that is connected with a node in P, it may jump to P with probability λ or stay in H with probability 1 − λ.
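We calculated the ranking score for each node as follows. Let $p_0$ be the vector of initial scores for each node and $p_k$ the score vector at step $k$, which was iteratively updated by:

$$p_{k+1} = (1 - \gamma)\, M^{T} p_k + \gamma\, p_0$$

where $\gamma$ is the probability that the random walker restarts from the seeds at each step, and $M$ is the transition matrix of the cross-species genetic network:

$$M = \begin{bmatrix} M_H & M_{HP} \\ M_{HP^T} & M_P \end{bmatrix}$$

The diagonal sub-matrices $M_H$ and $M_P$ consist of intra-network transition probabilities and were calculated as:

$$(M_i)_{kl} = \begin{cases} (A_i)_{kl} \big/ \sum_l (A_i)_{kl} & \text{if } x = 1 \\ (1 - \lambda)\,(A_i)_{kl} \big/ \sum_l (A_i)_{kl} & \text{if } x = 0 \end{cases}$$

where $i \in \{H, P\}$, $A_i$ is the adjacency matrix of the network H or P, $k$ is the row index, $l$ is the column index, $\lambda$ is the jumping probability between the two networks, and $x$ is an indicator variable that equals 1 if node $k$ has no inter-network edge, that is, $\sum_l (A_j)_{kl} = 0$ for the corresponding inter-network adjacency matrix $A_j$, and 0 otherwise. The off-diagonal sub-matrices $M_{HP}$ and $M_{HP^T}$ consist of inter-network transition probabilities and were calculated as:

$$(M_j)_{kl} = \begin{cases} 0 & \text{if } x = 1 \\ \lambda\,(A_j)_{kl} \big/ \sum_l (A_j)_{kl} & \text{if } x = 0 \end{cases}$$

where $j \in \{HP, HP^T\}$ and $x$ is the same indicator variable. Although the method can obtain a score for each human and parasite gene, we focus on ranking and analyzing the human genes in this study.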
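The extended random walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: `lam` and `gamma` stand for the jumping and restart probabilities, the seeds receive equal initial probability, and a node with inter-network edges but no intra-network edges is assumed to jump with probability 1 so that every row of the transition matrix sums to one. All function names and the toy network are hypothetical.

```python
import numpy as np

def row_normalize(A):
    """Divide each row by its sum; all-zero rows stay zero."""
    A = np.asarray(A, dtype=float)
    sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, sums, out=np.zeros_like(A), where=sums > 0)

def transition_matrix(A_H, A_P, A_HP, lam):
    """Row-stochastic transition matrix of the cross-species network."""
    A_PH = np.asarray(A_HP, dtype=float).T

    def scales(A_intra, A_inter):
        has_intra = np.asarray(A_intra).sum(axis=1) > 0
        has_inter = np.asarray(A_inter).sum(axis=1) > 0
        # Stay with probability 1 - lam only if a jump is possible.
        intra = np.where(has_inter, 1.0 - lam, 1.0)
        # Assumption: jump with probability 1 if there is nowhere to stay.
        inter = np.where(has_intra, lam, 1.0)
        return intra[:, None], inter[:, None]

    h_intra, h_inter = scales(A_H, A_HP)
    p_intra, p_inter = scales(A_P, A_PH)
    return np.block([
        [h_intra * row_normalize(A_H), h_inter * row_normalize(A_HP)],
        [p_inter * row_normalize(A_PH), p_intra * row_normalize(A_P)],
    ])

def random_walk_with_restart(M, seeds, gamma=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - gamma) * M^T p + gamma * p0 until convergence."""
    p0 = np.zeros(M.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)  # assumption: uniform seed weights
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - gamma) * (M.T @ p) + gamma * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy network: 3 human nodes (indices 0-2) and 2 parasite nodes (indices 3-4).
A_H = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A_P = np.array([[0, 1], [1, 0]], dtype=float)
A_HP = np.array([[1, 0], [0, 0], [0, 0]], dtype=float)

M = transition_matrix(A_H, A_P, A_HP, lam=0.8)
scores = random_walk_with_restart(M, seeds=[0], gamma=0.3)
ranking = np.argsort(-scores)  # node indices, highest score first
```

Because every row of `M` sums to one in this toy example, the score vector remains a probability distribution at each step, which is a convenient sanity check when experimenting with other values of `lam` and `gamma`.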

Evaluate the validity in predicting malaria genes
Before using our method to predict genes for malaria, we performed a "leave-one-out" cross validation analysis to validate it. Each time, we left out one malaria gene from the seed list, used the remaining seeds as the input, and examined the rank of the excluded seed among the genes from the human or parasite genome. We repeated the same procedure for each of the 77 seeds, and expected the excluded seeds to be ranked highly if the method works well.
We then used all 77 seeds as the input and evaluated whether our gene ranking can prioritize novel malaria genes (other than the seeds). We manually constructed an independent set of 27 human genes involved in malaria resistance and in the host immune responses triggered by malaria parasites. These genes were extracted from the literature references mentioned in the textual descriptions of malaria in OMIM, and have no overlap with the seed genes. We used this set as a proxy for novel malaria genes and evaluated the ranks of its genes among all human genes.
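The leave-one-out procedure can be sketched as below. The sketch assumes a precomputed row-stochastic transition matrix `M` and integer node indices; it also assumes that the remaining seeds are removed from the candidate pool before the held-out seed is ranked, which the text does not specify. The helper names and the six-node path graph are hypothetical.

```python
import numpy as np

def rwr(M, seeds, gamma=0.3, n_iter=200):
    """Random walk with restart using a fixed number of iterations."""
    p0 = np.zeros(M.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(n_iter):
        p = (1.0 - gamma) * (M.T @ p) + gamma * p0
    return p

def leave_one_out_ranks(M, seeds, candidates, gamma=0.3):
    """Rank of each held-out seed (1 = best) among the candidate nodes."""
    ranks = {}
    for held_out in seeds:
        training = [s for s in seeds if s != held_out]
        scores = rwr(M, training, gamma)
        # Assumption: the remaining seeds are not counted in the ranking.
        pool = [c for c in candidates if c not in training]
        ordered = sorted(pool, key=lambda node: scores[node], reverse=True)
        ranks[held_out] = ordered.index(held_out) + 1
    return ranks

# Toy example: a path graph 0-1-2-3-4-5 with seeds 0, 1 and 2.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
M = A / A.sum(axis=1, keepdims=True)

ranks = leave_one_out_ranks(M, seeds=[0, 1, 2], candidates=list(range(n)))
mean_rank = sum(ranks.values()) / len(ranks)
```

In the study, the candidate pool would be all genes of the same species as the held-out seed, so that human seeds are ranked among human genes and parasite seeds among parasite genes.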

Evaluate the ranks of druggable genes
Currently, only a subset of the human genome is druggable [63]. In this study, we investigated whether the top-ranked genes represent opportunities for drug discovery for malaria. We first extracted from DrugBank [64] the 1,935 human genes that are targets of any drug. All of these drug target genes appear in our genetic network and none overlaps with the seeds. We used all 77 seeds as the input and ranked the human genes. We then counted the number of target genes in each successive bin of 500 human genes, from the top of the ranking to the bottom, and plotted how this number varies along the ranking.
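The binned count described above is simple to compute. The following sketch assumes the ranking is a list of gene identifiers ordered from the top to the bottom and the drug targets are a set; the function name and the toy gene identifiers are hypothetical.

```python
def targets_per_bin(ranked_genes, targets, bin_size=500):
    """Count drug targets in each consecutive bin of the ranking."""
    counts = []
    for start in range(0, len(ranked_genes), bin_size):
        window = ranked_genes[start:start + bin_size]
        counts.append(sum(1 for gene in window if gene in targets))
    return counts

# Toy example with ten ranked genes and bins of five.
ranked = [f"g{i}" for i in range(10)]
targets = {"g0", "g1", "g5"}
counts = targets_per_bin(ranked, targets, bin_size=5)
```

Plotting `counts` against the bin index gives the kind of curve described in the text: a ranking that concentrates druggable genes at the top yields a decreasing trend.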

Extract and analyze malaria-specific pathways based on gene ranking
To better understand the functions of the prioritized genes, we linked the top 10% of human genes to their pathways. We downloaded 1,320 canonical pathways from MSigDB [65] and ranked them by the average random walk score of all the genes in each pathway. We manually examined whether the top pathways are associated with the host response to pathogen invasion.
In addition, we evaluated the impact of introducing the parasite genome into our gene prediction method. We removed the parasite genetic network and the host-parasite interactions from our method, and recalculated the random walk scores for human genes. We then re-ranked the pathways containing the top 10% of genes, compared the rank of each pathway before and after using the parasite genetic network, and extracted the pathways with the largest rank differences.
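The pathway ranking and the before/after comparison can be sketched as follows. Two details are assumptions because the text does not state them: genes without a score contribute zero to the pathway average, and the rank change is expressed relative to the rank obtained without the parasite network. Pathway and gene names are placeholders.

```python
def rank_pathways(pathways, scores):
    """Rank pathways (1 = best) by the mean score of their member genes."""
    mean_score = {
        name: sum(scores.get(gene, 0.0) for gene in genes) / len(genes)
        for name, genes in pathways.items()
    }
    ordered = sorted(mean_score, key=mean_score.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

def rank_improvement(rank_without, rank_with):
    """Relative rank gain after adding the parasite network (assumed definition)."""
    return {
        name: (rank_without[name] - rank_with[name]) / rank_without[name]
        for name in rank_without
    }

# Toy example: scores computed without and with the parasite network.
pathways = {"PATHWAY_A": ["g1", "g2"], "PATHWAY_B": ["g3"]}
scores_without = {"g1": 0.1, "g2": 0.1, "g3": 0.3}
scores_with = {"g1": 0.5, "g2": 0.3, "g3": 0.2}

rank_without = rank_pathways(pathways, scores_without)
rank_with = rank_pathways(pathways, scores_with)
improvement = rank_improvement(rank_without, rank_with)
```

Under this definition, a pathway that moves from rank 2 to rank 1 has an improvement of 0.5, and pathways can then be filtered by a threshold on this value.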

Results
Network-based approach allowed the prioritization of known malaria genes from both human and parasite genomes
Among the 77 seed genes, 14 were human genes and 63 were parasite genes. We evaluated the performance of our algorithm in ranking human and parasite seed genes separately with a leave-one-out cross validation analysis. Our method requires two parameters: the jumping probability λ between the human and parasite genetic networks and the probability γ that the random walker restarts from the seeds. We chose λ = 0.8 and γ = 0.3 to achieve the best performance in the cross validation, although different parameter values only slightly affect the result. We used the same values for the two parameters throughout all the analyses. Table 1 shows that the ranks of the excluded human seed genes were high. In nine cases, the excluded genes directly interact with another seed and were ranked within the top 1% of all human genes. Of these, two genes (CD36, ICAM1) were ranked in the top five. In 13 out of 14 cases, the excluded genes were ranked within the top 3%. The average rank of the excluded human seed genes is 336 (top 2% among all human genes).
We also evaluated the 63 parasite seed genes, and our approach ranked the excluded nodes within the top 5% in 56 out of 62 cases. Table 2 shows the top 10 parasite genes and their ranks in the cross validation. The average rank for the excluded parasite genes is 199 (top 4% among all parasite genes). Less comprehensive data in the parasite genome than in the human genome may contribute to the lower rank (in percentage) of the parasite seed genes. Overall, this analysis demonstrated the utility of the extended random walk to accurately prioritize known malaria genes.

Network-based approach prioritized novel malaria genes other than the seeds
A large body of literature has demonstrated strong associations between individual genes and malaria through transcriptional profiling, biological experiments and genome-wide association studies. These genes include inflammatory response genes, such as NF-κB and CXCL1 [66]; parasite protein receptors, such as BSG [67] and PROCR [57]; and genes involved in protection against malaria, such as HLA-B [68] and HAVCR1 [69]. We used all the seeds to generate our ranking of human genes and examined the ranks of 27 malaria genes that have been validated in previously published studies. Table 3 shows that 12 out of 27 genes were ranked within the top 1%, and a total of 24 genes within the top 5%. We also manually examined the top 50 human genes and found interesting predictions. Among them, TLR4 has been suggested to be protective against malaria in certain populations [70,71]. In addition, a recent mouse model experiment [72] demonstrated that P53 is critical in the liver-stage infection of malaria. Together, these results demonstrate that our gene ranking prioritized novel malaria-associated genes other than the seeds.
Table caption: We left out one malaria gene from the seed list each time, and determined the rank of this excluded gene using our method. We show the rank and percentage among all human genes.
Figure 2 shows that the top-ranked genes are enriched for drug targets. The top 500 human genes in our ranking include 235 drug targets, a 4.3-fold enrichment compared with the average of 100 random rankings ($p < 10^{-8}$). Among the 235 druggable genes, only 5 have been targeted by FDA-approved anti-malaria drugs, such as chloroquine, proguanil and mefloquine. This result indicates that the top-ranked candidate genes for malaria may provide unique opportunities for malaria drug discovery through novel disease genetics.
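The fold enrichment against random rankings can be estimated with a short permutation sketch. It assumes that a random ranking is a uniform shuffle of the same gene list and that the enrichment is the observed number of targets in the top genes divided by the mean number over the random rankings; the text does not say how the p-value was computed, so it is not reproduced here. The function name and toy data are hypothetical.

```python
import numpy as np

def fold_enrichment(ranked_genes, targets, top_n=500, n_random=100, seed=0):
    """Observed targets in the top_n genes divided by the random expectation."""
    is_target = np.array([gene in targets for gene in ranked_genes], dtype=float)
    observed = is_target[:top_n].sum()
    rng = np.random.default_rng(seed)
    random_counts = [
        is_target[rng.permutation(len(is_target))[:top_n]].sum()
        for _ in range(n_random)
    ]
    return observed / np.mean(random_counts)

# Toy example: all 50 targets sit at the very top of 1,000 ranked genes.
ranked = [f"gene{i}" for i in range(1000)]
targets = {f"gene{i}" for i in range(50)}
fold = fold_enrichment(ranked, targets, top_n=100)
```

In this toy setting, the top 100 genes contain all 50 targets while a random draw of 100 genes is expected to contain about 5, so the estimate should be close to 10.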

Pathway analysis shows functions of prioritized genes
To gain insight into the commonalities underlying the predicted malaria candidate genes, we analyzed the pathways associated with the top-ranked genes. The top-ranked pathways are associated with different aspects of malaria. For example, malaria parasites actively alter the immune function of B cells, which is represented by BIOCARTA BLYMPHOCYTE PATHWAY [73]. BIOCARTA LYM PATHWAY is a pathway of lymphocyte adhesion, and plays a central role in binding bacteria, parasites, viruses and tumor cells [74]. In addition, BIOCARTA STEM PATHWAY regulates hematopoiesis and induces hematopoietic activities in the presence of infection [75].
We compared the pathway rankings before and after introducing the parasite genetic network and found that nine pathways rose in rank by over 50%. Table 4 lists these pathways and their plausible associations with malaria pathogenesis and protection. Several of these pathways are directly related to parasite infection and inflammatory responses. REACTOME BASIGIN INTERACTIONS was prioritized through the interaction with the parasite protein PfRh5. Other pathways whose ranks improved by less than 50% may also be associated with malaria. For example, the rank of the REACTOME HDL MEDIATED LIPID TRANSPORT pathway improved by 40%. A recent meta-analysis showed that alteration of the host lipid profile is linked with malaria pathogenesis, although the precise pathway has not yet been elucidated [76].

Discussion
Malaria is caused by the invasion of deadly parasites into human skin, liver and blood. The parasites trigger human immune responses, but can also manipulate human cells for nutrient uptake and cell growth. Recent studies have shown that host-pathogen protein interactions illuminate the malaria-specific pathways in the human host. With the accumulation of data on both the human and parasite genomes, systematically analyzing these two interacting genomes may uncover new malaria-associated genes, which would pave the way to identifying novel anti-malaria drugs.
We developed a data-driven method to infer malaria genes based on random walking on the cross-species genetic networks. We demonstrated that the method can prioritize genes that are both drug targets and associated with malaria. Through comparing the result before and after adding the parasite genetic network into our method, we extracted specific pathways involving human-parasite interactions.
Our approach can be improved with a more comprehensive database of host-pathogen protein interactions. We manually curated 36 interactions, mostly from the literature, to connect the human and parasite genetic networks. Compared with human-human protein interactions, the coverage of human-parasite interactions is much lower and might be biased. As more data are introduced into the method, the global structure of the cross-species genetic network may change, which will affect the gene ranking. In the future, we plan to automatically mine human-parasite interactions from the literature and construct a database with better coverage. Since our approach prioritized a set of druggable genes that are associated with malaria, one example of subsequent work is to perform drug repositioning by matching the targets of approved drugs to the predicted genes. With this strategy, however, some of the candidate drugs may target generic inflammatory responses and may not be specific enough to kill the parasites. In addition, malaria is associated with different pathways when humans are infected by different parasite species (other than Plasmodium falciparum) or different strains [28]. To develop more effective agents against malaria, we need to dissect the genetic basis using more specific data.

Conclusions
The lack of effective anti-malaria drugs and the poorly understood genetics of the disease motivated our study of detecting novel malaria-associated genes from both human and parasite genomes, with the ultimate goal of discovering innovative anti-malaria drugs based on a new genetic understanding of the disease. We developed a data-driven approach to infer malaria-associated genes. Since malaria is caused by the interactions between parasites and humans, we constructed a cross-species genetic network to model these interactions, and prioritized related genes using network analysis. We demonstrated the validity of the method in predicting malaria genes, and showed the potential of the predicted genes in drug discovery. We also extracted pathways from the gene ranking, and found that these pathways reflect different aspects of malaria pathogenesis.

Table 4 (partial). Pathway | Plausible association with malaria
… | Associated with red blood cell adhesion to the endothelial cell and cerebral malaria [81,82]
BIOCARTA VDR PATHWAY | Controls cellular nutrient uptake, differentiation and apoptosis, which may be affected by parasites [13,83]
BIOCARTA MONOCYTE PATHWAY | Recruitment and activation of monocytes and macrophages are essential for both protection and pathology in malaria-infected individuals [84]
REACTOME PLATELET ADHESION TO EXPOSED COLLAGEN | Platelet adhesion and aggregation may play important roles in facilitating adhesion of infected red blood cells [85][86][87]