Pinpointing disease genes through phenomic and genomic data fusion
© Jiang et al.; licensee BioMed Central Ltd. 2015
Published: 21 January 2015
Pinpointing genes involved in inherited human diseases remains a great challenge in the post-genomics era. Although approaches have been proposed either based on the guilt-by-association principle or making use of disease phenotype similarities, the low coverage of both diseases and genes in existing methods has been preventing the scan of causative genes for a significant proportion of diseases at the whole-genome level.
To overcome this limitation, we proposed a rigorous statistical method called pgFusion to prioritize candidate genes by integrating one type of disease phenotype similarity derived from the Unified Medical Language System (UMLS) and seven types of gene functional similarities calculated from gene expression, gene ontology, pathway membership, protein sequence, protein domain, protein-protein interaction and regulation pattern, respectively. Our method covered a total of 7,719 diseases and 20,327 genes, achieving the highest coverage thus far for both diseases and genes. We performed leave-one-out cross-validation experiments to demonstrate the superior performance of our method and applied it to a real exome sequencing dataset of epileptic encephalopathies, showing the capability of this approach in finding causative genes for complex diseases. We further provided the standalone software and online services of pgFusion at http://bioinfo.au.tsinghua.edu.cn/jianglab/pgfusion.
pgFusion not only provided an effective way for prioritizing candidate genes, but also demonstrated feasible solutions to two fundamental questions in the analysis of big genomic data: the comparability of heterogeneous data and the integration of multiple types of data. Applications of this method in exome or whole genome sequencing studies would accelerate the finding of causative genes for human diseases. Other research fields in genomics could also benefit from the incorporation of our data fusion methodology.
Pinpointing genes causative for inherited human diseases is the primary step towards understanding the intrinsic mechanisms of such diseases. In the post-genomics era, the analysis of human genetic data is often combined with the mining of functional genomic data to facilitate the identification of potential causative genes [1, 2]. For example, via genome-wide association (GWA) studies, genetic factors related to a query disease can typically be located within a region of 10 million base pairs, containing about 100 candidate genes. The problem is then how to rank these genes according to their strength of association with the query disease. With the whole-exome sequencing technique, dozens or hundreds of de novo mutations can be screened for a query disease. The question is then how to infer true causative genes from the candidate genes that contain such mutations.
To meet these demands, two groups of computational approaches have been proposed for the prioritization of candidate genes. The first group is designed based on the guilt-by-association principle, which suggests that genes associated with the same type of disease are similar in their functions. Accordingly, candidate genes can be ranked according to their functional similarity to a set of seed genes that are known to be associated with the query disease. In existing studies belonging to this category, such similarities have been quantified based on gene expression, gene ontology, protein sequences, protein-protein interactions, and many others [10–12]. Methods have also been proposed to integrate multiple data sources for achieving high accuracy. Nevertheless, the requirement of a predefined set of seed genes may greatly restrict the scope of applications of these methods: according to the OMIM (Online Mendelian Inheritance in Man) database, the genetic bases for a significant proportion of human diseases are completely unknown, which makes the selection of seed genes for such diseases a problem.
To overcome this limitation, the second group of methods, with the hallmark of using disease phenotype similarity data, has been proposed. For example, Lage et al. proposed a Bayesian model to integrate phenotypic similarities and protein-protein interaction (PPI) data. Wu et al. suggested quantifying the strength of association between a disease and a gene using the correlation between phenotype similarities and gene proximities. Wu et al. further proposed to perform a local alignment of a phenotype network against a PPI network. Li and Patra adopted a random walk with restart model on an integrated network composed of both diseases and genes. Vanunu et al. proposed to simulate how disease status propagated through candidate genes. Chen et al. proposed to quantify the strength of association between a disease and a gene using the maximum information flow in a phenome-interactome network. These methods, though demonstrating higher accuracy and a wider scope of applications than the guilt-by-association approaches, are often restricted by two factors: 1) the availability of the phenotype similarity data and 2) the coverage of the gene similarity data. For example, there were a total of 7,719 diseases recorded in the OMIM database as of February 2014, whereas the most widely used phenotype similarity data covers only 5,080 (~66%) of such diseases. It is estimated that the human genome contains more than 20,000 genes, whereas the most widely used PPI data covers only 9,515 (< 50%) genes.
Motivated by these understandings, we propose a rigorous statistical model named pgFusion that integrates one type of phenotype similarity and seven types of gene similarities to pinpoint disease genes. The phenotype similarity data, which covers 7,719 diseases in the OMIM database, is derived using a text mining technique based on the Unified Medical Language System (UMLS) and is the most comprehensive one among such data. The seven types of gene similarity data, including gene expression, gene ontology, pathway membership, protein sequence, protein domain, protein-protein interaction and regulation pattern, cover as many as 20,327 human genes, making the whole-genome scan of causative genes for a query disease possible. Based on these data, our method resorts to a linear regression model and a hypothesis testing procedure to derive 7 scores that quantify the strength of association between a query disease and a candidate gene from different perspectives, and further adopts Fisher's method with dependence correction to combine these scores. We performed leave-one-out validation experiments to demonstrate the superior performance of pgFusion, and applied it to a real exome sequencing data set of epileptic encephalopathies, showing the capability of this approach in finding causative genes for complex diseases. We finally provided the standalone software and user-friendly online services of our method at http://bioinfo.au.tsinghua.edu.cn/jianglab/pgfusion.
Workflow of pgFusion
Derivation of phenotype similarity
We adopted the text mining technique to derive pairwise phenotype similarity between diseases. Briefly, we first extracted a total of 7,719 disease records from the OMIM database and split sentences in the TX and the CS fields of these records into words. Then, we mapped these words onto concepts in the UMLS database by using the MetaMap program. Next, for each OMIM record, we counted the frequency of occurrence of each concept in the record, obtaining a high dimensional numeric vector. Finally, we calculated pairwise phenotype similarity between diseases as the cosine of the angle between corresponding vectors. We assessed relationships between the phenotype similarity derived this way and several genotype similarities, and we found strong evidence to support the existence of correlations between the phenotype and genotype similarities.
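The counting and cosine steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the concept identifiers below are hypothetical placeholders for MetaMap output, and the function names are not taken from the pgFusion software.

```python
import math
from collections import Counter


def concept_vector(concepts):
    """Count how often each concept identifier occurs in one disease record."""
    return Counter(concepts)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty record is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical concept identifiers standing in for mapped UMLS concepts.
record_a = concept_vector(["C001", "C002", "C002", "C003"])
record_b = concept_vector(["C002", "C003", "C004"])
similarity_ab = cosine_similarity(record_a, record_b)
```

Storing each record as a sparse counter keeps the sketch practical when the concept vocabulary is large and most concepts are absent from any single record.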
Derivation of gene similarities
We derived gene functional similarity scores from 7 types of genomic data, including gene expression, gene ontology, pathway membership, protein sequence, protein domain, protein-protein interaction and regulation pattern. Each of such scores ranged from 0 to 1, denoting the lowest and highest similarities, respectively.
For gene expression data, the final score for two genes g and h was obtained from their raw score through an exponential transformation scaled by $\sigma^{(gexp)}$, the standard deviation of raw scores for all gene pairs. With this transformation, the highest raw score (1.0) remained the highest, while the lowest raw score (0.0) became $\exp[-(\sigma^{(gexp)})^{-2}]$, which was close to zero because the standard deviation $\sigma^{(gexp)}$ was typically small.
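The exact transformation is not legible in this text, so the following Python sketch assumes a Gaussian-type form, exp[-((1 - r)/σ)^2] for a raw score r. This form is an assumption: it reproduces the two boundary values stated above (1.0 stays at 1.0, and 0.0 becomes exp[-σ^-2]), but other forms could satisfy them as well. The use of the population standard deviation is likewise an assumption.

```python
import math
import statistics


def exponential_transform(raw_scores):
    """Map raw similarity scores in [0, 1] to final scores.

    Assumed form: exp(-((1 - r) / sigma) ** 2), with sigma the
    (population) standard deviation of all raw scores.
    """
    sigma = statistics.pstdev(raw_scores)
    if sigma == 0.0:
        raise ValueError("raw scores have zero spread; sigma is undefined")
    return [math.exp(-((1.0 - r) / sigma) ** 2) for r in raw_scores]


raw = [0.0, 0.5, 1.0]
final = exponential_transform(raw)
```

Under this assumed form, a small standard deviation pushes all but the highest raw scores towards zero, which matches the behaviour described in the text.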
Focusing on the biological process domain of the gene ontology and associated annotations, we collected a total of 25,616 concepts in the annotations and characterized each human gene using a numeric vector of that number of dimensions, with each element being the information content of the corresponding concept. For each pair of genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity score and further applied the aforementioned exponential transformation to convert raw scores into final similarity scores. Note that although there have been quite a few methods for calculating gene semantic similarity based on the gene ontology, it has been shown recently that the cosine measure, though simple, often produces reasonable results.
Focusing on human pathways in the KEGG database and discarding disease-related ones to avoid biases towards well-studied diseases, we obtained a total of 238 pathways and characterized each human gene using a binary vector of that number of dimensions. For each pair of genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity score and further applied the exponential transformation to obtain final similarity scores.
We extracted a total of 20,274 human protein sequences from the Swiss-Prot database and ran the Smith-Waterman algorithm implemented in SSEARCH to obtain their pairwise local sequence alignments. Then, we constructed a sequence similarity network of these proteins by connecting two proteins with an undirected edge if their alignment e-value was less than a predefined threshold ($10^{-4}$). Next, we calculated the shortest path distance for every pair of proteins (g and h) in this network and converted it to a similarity value in the range of 0 to 1. Finally, we applied the exponential transformation to obtain the similarity score. Note that the construction of a sequence similarity network in this procedure greatly reduced the sensitivity to the parameters involved and thus enhanced the robustness of this method.
We obtained a total of 14,831 domains from the Pfam database (Version 27.0) and characterized each human protein using a binary vector of that number of dimensions. For each pair of genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity score and further applied the exponential transformation to obtain final similarity scores.
We extracted a total of 403,514 interactions among 13,747 proteins from the STRING database (Version 9.1) and constructed a protein-protein interaction network accordingly. Then, we calculated the shortest path distance for every pair of proteins (g and h) in this network and converted it into a value in the range of 0 to 1. Finally, we applied the exponential transformation to obtain the similarity score.
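The shortest-path step shared by the sequence similarity network and the PPI network can be sketched with a breadth-first search over an unweighted, undirected graph. The protein identifiers below are hypothetical, and the subsequent conversion of distances into values between 0 and 1 is not attempted because its formula is not given in this text.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (a, b) edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical proteins; P5 and P6 form a separate component.
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]
adjacency = build_adjacency(edges)
distances_from_p1 = shortest_path_lengths(adjacency, "P1")
```

Proteins in a different connected component simply do not appear in the result, so a caller has to decide how to score unreachable pairs.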
We extracted a total of 218 high confidence position specific scoring matrices for the same number of vertebrate transcription factors from the TRANSFAC database. We then searched 1,000 base pairs upstream of each human gene using the program MATCH to identify potential binding sites for each transcription factor. Next, we characterized each gene using a numeric vector of 218 dimensions, with each element being the number of potential binding sites for the corresponding transcription factor. Finally, for each pair of genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity score and further applied the exponential transformation to obtain final similarity scores.
Scoring association strength by regression and hypothesis testing
Here, D and E denote the sets of genes known to be associated with diseases d and e, respectively, and $\varphi_{gh}$ denotes the functional similarity between genes g and h according to the genomic data in use.
Here, $\alpha$ and $\beta$ are the regression intercept and slope, respectively. The model relates the vector of phenotype similarities between d and all other n diseases in the similarity matrix to the vector of corresponding genotype similarities, where $I_i$ is the set of genes known to be associated with the i-th disease for i = 1,...,n, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ with $\varepsilon_i \sim N(0, \sigma^2)$ independent and identically distributed for i = 1,...,n.
Under the null hypothesis and the normality assumption, the test statistic follows a Student's t distribution with n − 2 degrees of freedom. The p-value of the proposed test can then be calculated as $P(T_{n-2} \ge t)$, with t the realized value of the statistic.
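Because the regression formulas are not legible in this text, the sketch below uses the textbook ordinary least squares test for a slope, t = slope / SE(slope) with n − 2 degrees of freedom, as a stand-in. The generic names x (predictor) and y (response) are assumptions, and which similarity vector plays which role in pgFusion is not asserted here.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = alpha + beta * x by least squares; return (beta, alpha, t)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length vectors with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, slope / standard_error


# Toy data, not taken from the paper.
beta, alpha, t_value = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

The one-sided p-value would then be the upper tail of a t distribution with n − 2 degrees of freedom evaluated at `t_value`, for example via `scipy.stats.t.sf(t_value, n - 2)` if SciPy is available.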
However, in the case that the normality assumption does not hold, the p-value obtained from the t distribution may not reliably reflect the true statistical significance. We therefore calibrated the p-value by simulating the distribution of raw p-values for all disease-gene pairs that were not included in annotated associations and calculating the adjusted p-value as the proportion of raw p-values in this distribution that were smaller than or equal to the raw p-value to be calibrated.
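The calibration described above amounts to evaluating an empirical cumulative distribution function. A minimal Python sketch follows, in which the background list stands for the raw p-values of unannotated disease-gene pairs (the numbers are invented for illustration).

```python
from bisect import bisect_right


def calibrate_p_value(raw_p, sorted_background):
    """Proportion of background raw p-values that are <= raw_p.

    sorted_background must be sorted in ascending order.
    """
    if not sorted_background:
        raise ValueError("background distribution is empty")
    return bisect_right(sorted_background, raw_p) / len(sorted_background)


# Invented background p-values, sorted once and reused for every query.
background = sorted([0.4, 0.1, 0.3, 0.2])
adjusted = calibrate_p_value(0.25, background)
```

Sorting the background once lets each calibration run in logarithmic time, which matters when many disease-gene pairs are scored.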
Fusion of association scores for multiple genomic data sources
We adopted Fisher's method to integrate the p-values derived from different types of genomic data into a single score, with an extra effort on the correction of dependence between the p-values.
where n is the sample size used to obtain $Z_i$.
We further applied multiple testing corrections to the combined p-values by controlling the positive false discovery rate (pFDR) of candidate genes through their q-values. Existing studies have shown the significant improvement in the test power of this method over the traditional approach of Benjamini-Hochberg that controls the false discovery rate (FDR). It is possible that some data sources are absent for a candidate gene. To deal with this problem, we ignored the missing data sources in Fisher's method and decreased the total number of p-values to be combined accordingly.
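The combination step and the handling of absent data sources can be illustrated with the classical Fisher's method, in which −2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom when the k p-values are independent. The sketch below deliberately omits the dependence correction used by pgFusion, whose formula is not legible in this text, and represents a missing data source by `None` as an illustrative convention.

```python
import math


def chi2_survival_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i<df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine p-values with Fisher's method, skipping missing sources (None)."""
    available = [p for p in p_values if p is not None]
    if not available:
        raise ValueError("no p-values available to combine")
    # Floor at a tiny positive value so that log(0) cannot occur.
    statistic = -2.0 * sum(math.log(max(p, 1e-300)) for p in available)
    return chi2_survival_even_df(statistic, 2 * len(available))


# One of three hypothetical data sources is missing for this gene.
combined = fisher_combine([0.05, None, 0.05])
```

Dropping a missing source reduces the degrees of freedom by two, which mirrors the statement that the number of p-values to be combined is decreased accordingly.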
We extracted a total of 7,719 diseases from the OMIM database (accessed in February 2014) and derived pairwise phenotype similarities of these diseases by applying the text mining technique to their OMIM records with the use of UMLS (version 2014AA) as the standard vocabulary. We extracted a total of 4,368 associations between 3,709 of these diseases and 2,870 genes using the tool BioMart.
Coverage and accuracy of individual data sources.
Phenotype similarity correlates with genotype similarity
From the figure, we clearly see a strong correlation between phenotype similarity and genotype similarity derived from each of the 7 genomic data sources. Taking gene expression as an example (Figure 2A), for disease pairs with very weak phenotype similarity (0.0~0.1), the genotype similarity is only 0.0145 on average. For disease pairs with strong phenotype similarity (0.9~1.0), the genotype similarity is as high as 0.2204 on average. For disease pairs with medium phenotype similarity (0.4~0.5), the genotype similarity is also at the medium level (0.0409). Furthermore, the genotype similarity increases as the phenotype similarity increases. For the other 6 genomic data sources, we observe a similar pattern. These results suggest that diseases having weak phenotypic overlap tend to have small genotypic overlap, while diseases having strong phenotypic overlap tend to have large genotypic overlap, in accord with one of our previous analyses.
To quantitatively measure the correlation between phenotype similarity and genotype similarity, we derived for each genomic data source two vectors, one composed of mean phenotype similarities of disease pairs in the 10 bins and the other consisting of corresponding mean genotype similarities. We then calculated Pearson's correlation coefficient of these two vectors for each type of data. Results show that the correlation coefficients are 0.9626 (p-value = $8.193 \times 10^{-6}$) for gene expression, 0.9341 (p-value = $7.607 \times 10^{-5}$) for gene ontology, 0.9404 (p-value = $5.133 \times 10^{-5}$) for KEGG pathway, 0.8987 (p-value = $4.076 \times 10^{-4}$) for protein sequence, 0.9449 (p-value = $3.778 \times 10^{-5}$) for protein domain, 0.9408 (p-value = $4.994 \times 10^{-5}$) for protein-protein interaction, and 0.9322 (p-value = $8.512 \times 10^{-5}$) for regulation pattern. We therefore conclude that the phenotype similarity positively correlates with the genotype similarity with strong statistical significance.
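The binning and correlation steps can be sketched as follows. The sketch assumes 10 equal-width bins over phenotype similarity in [0, 1], which is consistent with the ranges quoted above (0.0~0.1 through 0.9~1.0); the disease-pair values are invented for illustration.

```python
import math


def bin_means(pairs, n_bins=10):
    """Average (phenotype, genotype) similarity pairs within phenotype bins."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for phen, gen in pairs:
        index = min(int(phen * n_bins), n_bins - 1)  # 1.0 goes to the last bin
        sums[index][0] += phen
        sums[index][1] += gen
        sums[index][2] += 1
    phen_means = [s[0] / s[2] for s in sums if s[2] > 0]
    gen_means = [s[1] / s[2] for s in sums if s[2] > 0]
    return phen_means, gen_means


def pearson(u, v):
    """Pearson's correlation coefficient of two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Invented disease pairs: (phenotype similarity, genotype similarity).
pairs = [(0.05, 0.010), (0.07, 0.030), (0.45, 0.040), (0.95, 0.220)]
phen_means, gen_means = bin_means(pairs)
r = pearson(phen_means, gen_means)
```

Empty bins are skipped rather than filled with zeros, so the two vectors always have the same length.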
Data fusion improves prioritization performance
We then validated pgFusion using the 4,368 annotated associations between 3,709 diseases and 2,870 genes by a large-scale leave-one-out cross-validation experiment against a linkage interval. In each validation run, we focused on one disease-gene pair in an annotated association and tested whether our method could correctly identify the gene from a set of control genes. For this purpose, we took the disease as the query disease and the gene as the test gene, collected a set of 99 control genes that had the shortest distance to the test gene among all genes on the same chromosome as the test gene, and ranked the test gene against the control genes using our method. In this procedure, we removed all annotated associations between the query disease and genes in the regression model to simulate the circumstance that the genetic basis of the query disease is completely unknown.
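One validation run reduces to ranking a test gene among 99 controls, and the runs are then summarized. The sketch below assumes that MRR denotes the mean rank ratio (rank divided by the number of candidates), which is an assumption consistent with lower values being reported as better in this paper; ties are counted against the test gene as a conservative convention of this sketch.

```python
def rank_of_test_gene(test_score, control_scores):
    """Rank (1 = best) of the test gene; smaller scores (p-values) are better.

    Controls that tie with the test gene are ranked ahead of it.
    """
    return 1 + sum(1 for score in control_scores if score <= test_score)


def mean_rank_ratio(ranks, n_candidates=100):
    """Average of rank / n_candidates over all validation runs (assumed MRR)."""
    return sum(rank / n_candidates for rank in ranks) / len(ranks)


def top_k_count(ranks, k):
    """Number of validation runs whose test gene is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k)


# Invented scores for one run and invented ranks for three runs.
example_rank = rank_of_test_gene(0.01, [0.5, 0.02, 0.001])
example_mrr = mean_rank_ratio([1, 10, 100])
```

With 100 candidates per run, a random ranking would place the test gene at the top in about 1% of runs, which gives a baseline for interpreting the counts.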
We then compared the performance of pgFusion with that of individual genomic data sources. As shown in Table 1, among the 7 data sources, the gene ontology (gobp) yields the highest performance (MRR = 11.94%, AUC = 88.56%), followed by the protein-protein interaction (strg) (MRR = 12.64%, AUC = 88.04%). The regulation pattern (tsfc) yields the lowest performance (MRR = 26.38%, AUC = 74.97%), followed by the gene expression (gexp) (MRR = 23.99%, AUC = 76.47%). The improvements of pgFusion over individual data sources are then as high as 64.19% when compared with the regulation pattern and as low as 20.89% when compared with the gene ontology, in terms of the MRR. These results clearly demonstrate the vast improvement of pgFusion over individual genomic data sources in the prioritization accuracy and suggest the power of data fusion.
In exome sequencing studies, genetic variants are sequenced across the whole exome; it is therefore necessary to validate whether pgFusion is capable of identifying disease genes for a query disease from candidate genes spread over the entire genome. For this purpose, we performed a large-scale leave-one-out cross-validation experiment against random controls. Specifically, in each validation run, we focused on one disease-gene pair in an annotated association, took the disease as the query disease and the gene as the test gene, collected a set of 99 control genes that were selected at random from the entire genome, and ranked the test gene against the control genes using our method. We also removed all annotated associations between the query disease and genes in the regression model to simulate the circumstance that the genetic basis of the query disease is completely unknown. We summarized ranks of the test genes in this validation in Figure 2(C). In a total of 4,368 validation runs, pgFusion ranked 2,371 test genes at the top and 3,575 among the top 10. Considering that a random guess procedure is expected to rank only 43.7 test genes at the top and 436.8 among the top 10, the capability of our method in identifying disease genes from random controls is strongly supported. Besides, the low MRR (9.94%) and high AUC (90.48%) as shown in Table 1, together with the fast climbing shape of the ROC curve in Figure 3(D), further confirm the effectiveness of our method in this validation. Furthermore, comparison with individual data sources, as shown in Table 1, also demonstrates the vast improvement in the performance of pgFusion. For example, the improvement of pgFusion over the gene ontology (gobp) is 11.19% in terms of the MRR.
More importantly, the coverage of pgFusion also benefits from data fusion. For example, as shown in Table 1, KEGG covers only 6,468 genes, gene ontology (gobp) covers 14,465 genes, and protein-protein interaction (strg) covers 12,432 genes. Regulation pattern (tsfc), though covering 20,314 genes, achieves the lowest accuracy. With data fusion, however, pgFusion covers 20,327 genes, many more than most individual data sources, and thus makes it feasible to perform a whole-genome scan of disease genes for a query disease.
Contributions of individual data sources
Comparison with existing methods
We compared the performance of pgFusion with that of two representative methods for gene prioritization, Cipher and Endeavour. Briefly, Cipher represents a category of methods that rely on a single source of phenomic data and a single source of genomic data. This method is mathematically equivalent to our approach when using PPI information only. Therefore, our method outperforms Cipher in all evaluation criteria, as demonstrated in Table 1 and analysed in the above section.
Endeavour represents another category of methods that rely on multiple sources of genomic data to prioritize genes. This method was developed according to the guilt-by-association principle and thus required a set of seed genes known to be associated with a query disease as an extra input. To meet this requirement, for a query disease, we resorted to the phenotype similarity data to select 5 to 20 diseases that had the highest phenotype similarities with the query disease and then used genes known to be associated with these diseases as seed genes for the query disease. We repeated the leave-one-out cross-validation experiment against a linkage interval for Endeavour, using the same 7 sources of genomic data. Results show that Endeavour achieves its best performance (MRR = 16.61% and AUC = 83.64%) when seed genes are obtained from the 20 diseases that are most similar to the query one. When the 5, 10 and 15 most similar diseases are used to obtain seed genes, Endeavour achieves MRRs of 18.40%, 18.17% and 17.11%, respectively, and AUCs of 81.86%, 82.09% and 83.14%, respectively. All these results are considerably worse than those achieved by pgFusion (MRR = 9.45% and AUC = 91.37%), that is, a higher MRR and a lower AUC. We conjecture that this observation can be attributed to the fact that pgFusion uses phenomic data in a global way, while in our experiment Endeavour only partially uses such information.
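The seed-gene selection used to supply Endeavour with its extra input can be sketched as below. The data structures and identifiers are hypothetical: `similarity_to_query` is assumed to map every other disease to its phenotype similarity with the query disease, and `disease_genes` to map diseases to their known genes.

```python
def select_seed_genes(similarity_to_query, disease_genes, k):
    """Union of known genes of the k diseases most similar to the query."""
    ranked = sorted(similarity_to_query, key=similarity_to_query.get, reverse=True)
    seeds = set()
    for disease in ranked[:k]:
        # Diseases without known genes contribute nothing to the seed set.
        seeds.update(disease_genes.get(disease, ()))
    return seeds


# Invented similarities and annotations; the query disease itself is excluded.
similarity_to_query = {"D1": 0.9, "D2": 0.5, "D3": 0.7, "D4": 0.6}
disease_genes = {"D1": {"G1"}, "D2": {"G4"}, "D3": {"G2", "G3"}}
seed_genes = select_seed_genes(similarity_to_query, disease_genes, k=2)
```

Only the k nearest diseases influence the seed set in this scheme, which illustrates what is meant by Endeavour using the phenomic data only partially.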
Application to exome sequencing studies
Recent advancements in exome sequencing studies have demonstrated that the collection of de novo mutations affecting different genes in different individuals might explain a proportion of such common complex diseases as epileptic encephalopathies. We therefore applied our method to the exome sequencing data of this complex disease to demonstrate the power of our method in identifying disease genes.
Top 20 candidate genes for epileptic encephalopathies.
Whole-genome scan of disease genes
We further performed a whole-genome scan of causative genes for a total of 7,719 diseases in the phenotype similarity matrix. Focusing on genes collected in any of the seven genomic data sources, we extracted a total of 20,327 genes that spread over the entire genome and applied pgFusion to score these genes for each disease. Prediction results, together with an online service and the standalone software of pgFusion, are available at http://bioinfo.au.tsinghua.edu.cn/jianglab/pgfusion.
Conclusions and discussion
In this paper, we have proposed a bioinformatics approach called pgFusion that integrated one type of phenotype similarity and seven types of gene similarities for the inference of disease genes. The success of our method can be attributed to the carefully designed statistical model that relates the calculation of association strength to a hypothesis testing problem and combines multiple data sources with the consideration of their pairwise correlations. Grounded on the theoretical modelling, our method achieves not only high coverage but also superior accuracy, thereby providing a practical way in such analysis as the prioritization of candidate genes in whole-exome sequencing studies.
Certainly, our method can be further improved in the following aspects. First, although we currently focus on UMLS to derive phenotype similarity, other standard vocabularies such as the Medical Subject Headings (MeSH) and the human phenotype ontology (HPO) can also be adopted. Second, most existing methods for prioritizing candidate genes so far do not explicitly address the possible bias towards well-studied genes. This bias issue is alleviated with the integration of multiple types of data, because different data sources measure gene functions from different points of view and the inference does not depend on a single type of data. However, how to explicitly eliminate the influence of bias is still an open question worth exploration. Third, we currently do not weight different data sources. Although theoretically it is not hard to assign different weights to different data sources in Fisher's method, how to determine these weights is itself a problem that needs careful exploration. Finally, in the era of big data, the integration of multiple types of heterogeneous data is itself an important problem. The method we used in this paper provides a means for solving two basic questions: the comparability of heterogeneous data and the integration of multiple types of data. How to incorporate our method into other research fields in systems biology is one of our future focuses.
This research was partially supported by the National Basic Research Program of China (2012CB316504), the National High Technology Research and Development Program of China (2012AA020401), and the National Natural Science Foundation of China (61175002).
Publication of this article was funded by the National Basic Research Program of China (2012CB316504), the National High Technology Research and Development Program of China (2012AA020401), and the National Natural Science Foundation of China (61175002).
This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S2
- Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003, 33 (Suppl): 228-237.
- Perez-Iratxeta C, Bork P, Andrade MA: Association of genes to genetically inherited diseases using data mining. Nat Genet. 2002, 31 (3): 316-319.
- Meyre D, Delplanque J, Chevre JC, Lecoeur C, Lobbens S, Gallina S, Durand E, Vatin V, Degraeve F, Proenca C, et al: Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations. Nat Genet. 2009, 41 (2): 157-159. 10.1038/ng.301.
- Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nature Reviews Genetics. 2011, 12 (11): 745-755. 10.1038/nrg3031.
- Altshuler D, Daly M, Kruglyak L: Guilt by association. Nature genetics. 2000, 26 (2): 135-138. 10.1038/79839.
- Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, et al: Genetics of gene expression and its effect on disease. Nature. 2008, 452 (7186): 423-428. 10.1038/nature06758.
- Tiffin N, Kelso JF, Powell AR, Pan H, Bajic VB, Hide WA: Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic acids research. 2005, 33 (5): 1544-1552. 10.1093/nar/gki296.
- Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: Speeding disease gene discovery by sequence based candidate prioritization. BMC bioinformatics. 2005, 6: 55. 10.1186/1471-2105-6-55.
- Köhler S, Bauer S, Horn D, Robinson PN: Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics. 2008, 82 (4): 949-958. 10.1016/j.ajhg.2008.02.013.
- Freudenberg J, Propping P: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics. 2002, 18 (Suppl 2): S110-115. 10.1093/bioinformatics/18.suppl_2.S110.
- Turner FS, Clutterbuck DR, Semple CA: POCUS: mining genomic sequence annotation to predict disease genes. Genome biology. 2003, 4 (11): R75. 10.1186/gb-2003-4-11-r75.
- Lopez-Bigas N, Ouzounis CA: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic acids research. 2004, 32 (10): 3108-3114. 10.1093/nar/gkh605.
- Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, et al: Gene prioritization through genomic data fusion. Nat Biotechnol. 2006, 24 (5): 537-544. 10.1038/nbt1203.
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic acids research. 2005, 33 (suppl 1): D514-D517.
- Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tumer Z, Pociot F, Tommerup N, et al: A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007, 25 (3): 309-316. 10.1038/nbt1295.
- Wu X, Jiang R, Zhang MQ, Li S: Network-based global inference of human disease genes. Mol Syst Biol. 2008, 4: 189.
- Wu X, Liu Q, Jiang R: Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics. 2009, 25 (1): 98-104. 10.1093/bioinformatics/btn593.
- Li Y, Patra JC: Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010, 26 (9): 1219-1224. 10.1093/bioinformatics/btq108.
- Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R: Associating genes and protein complexes with disease via network propagation. PLoS computational biology. 2010, 6 (1): e1000641. 10.1371/journal.pcbi.1000641.
- Chen Y, Jiang T, Jiang R: Uncover disease genes by maximizing information flow in the phenome-interactome network. Bioinformatics. 2011, 27 (13): i167-176. 10.1093/bioinformatics/btr213.
- van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. European journal of human genetics: EJHG. 2006, 14 (5): 535-542. 10.1038/sj.ejhg.5201585.
- Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al: Human Protein Reference Database--2009 update. Nucleic acids research. 2009, 37 (Database): D767-772.
- Lindberg DA, Humphreys BL, McCray AT: The Unified Medical Language System. Methods of information in medicine. 1993, 32 (4): 281-291.
- Allen AS, Berkovic SF, Cossette P, Delanty N, Dlugos D, Eichler EE, Epstein MP, Glauser T, Goldstein DB, Han Y, et al: De novo mutations in epileptic encephalopathies. Nature. 2013, 501 (7466): 217-221. 10.1038/nature12439.
- Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings/AMIA Annual Symposium AMIA Symposium. 2001, 17-21.
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- Jiang R, Gan M, He P: Constructing a gene semantic similarity network for the inference of disease genes. BMC systems biology. 2011, 5 (Suppl 2): S2. 10.1186/1752-0509-5-S2-S2.
- Gan M: Correlating information contents of gene ontology terms to infer semantic similarity of gene products. Computational and mathematical methods in medicine. 2014, 2014: 891842.
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research. 2000, 28 (1): 27-30. 10.1093/nar/28.1.27.
- UniProt Consortium: The Universal Protein Resource (UniProt) in 2010. Nucleic acids research. 2010, 38 (Database): D142-148.
- Li W, McWilliam H, Goujon M, Cowley A, Lopez R, Pearson WR: PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics. 2012, 28 (12): 1650-1651. 10.1093/bioinformatics/bts240.
- Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al: The Pfam protein families database. Nucleic acids research. 2010, D211-222. 38 DatabaseGoogle Scholar
- Snel B, Lehmann G, Bork P, Huynen MA: STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic acids research. 2000, 28 (18): 3442-3444. 10.1093/nar/28.18.3442.PubMed CentralView ArticlePubMedGoogle Scholar
- Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic acids research. 2003, 31 (1): 374-378. 10.1093/nar/gkg108.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang JJ: Distribution of Fisher's combination statistic when the tests are dependent. Journal of Statistical Computation and Simulation. 2010, 80 (1): 1-12. 10.1080/00949650802412607.View ArticleGoogle Scholar
- Storey JD: The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics. 2003, 2013-2035.Google Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological). 1995, 289-300.Google Scholar
- Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A: BioMart Central Portal--unified access to biological data. Nucleic acids research. 2009, 37 (suppl 2): W23-W27.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.