Workflow of pgFusion
The proposed method, named pgFusion, was designed based on the assumption that genes associated with diseases that shared common clinical traits would also share similar properties across multiple genomic data sources. As shown in Figure 1, inputs of this method included a query disease and a set of candidate genes, and the objective was to rank these genes according to their strength of association with the query disease. For this purpose, pgFusion relied on the OMIM and UMLS databases to calculate a phenotype similarity matrix for a total of 7,719 diseases and resorted to 7 genomic data sources (gene expression, gene ontology, KEGG pathway, protein sequence, protein domain, protein-protein interaction and regulation pattern) to derive 7 gene functional similarity matrices for a total of 20,327 human genes. With such phenomic and genomic information available, pgFusion resorted to a regression model and the Fisher's method to examine one candidate gene at a time. In the regression model, pgFusion explained the phenotype similarity between two diseases using their genotype similarity, which was defined as the total functional similarities of their associated genes under a certain genomic data source. The strength of association between the query disease and a candidate gene was then assessed by a hypothesis testing procedure and quantified by the corresponding p-value. Final results were then 7 p-values, one for a genomic data source. In the Fisher's method, pgFusion integrated the 7 p-values to calculate a single p-value, with the consideration of the dependence between these p-values. A multiple testing correction procedure was then applied to the final p-values of all candidate genes to control the positive false discovery rate of the results by calculating q-values from p-values. Finally, candidate genes were sorted according to their q-values to produce the output ranking list.
Derivation of phenotype similarity
We adopted the text mining technique to derive pairwise phenotype similarity between diseases. Briefly, we first extracted a total of 7,719 disease records from the OMIM database and split sentences in the TX and the CS fields of these records into words. Then, we mapped these words onto concepts in the UMLS database by using the MetaMap program [25]. Next, for each OMIM record, we counted the frequency of occurrence of each concept in the record, obtaining a high dimensional numeric vector. Finally, we calculated pairwise phenotype similarity between diseases as the cosine of the angle between corresponding vectors. We assessed relationships between the phenotype similarity derived this way and several genotype similarities, and we found strong evidence to support the existence of correlations between the phenotype and genotype similarities.
Derivation of gene similarities
We derived gene functional similarity scores from 7 types of genomic data, including gene expression, gene ontology, pathway membership, protein sequence, protein domain, protein-protein interaction and regulation pattern. Each of such scores ranged from 0 to 1, denoting the lowest and highest similarities, respectively.
Gene expression
Focusing on whole-genome microarrays for a total of 44,775 transcripts across 79 tissues [26], we characterized each human gene with a 79-dimensional numeric vector that represented expression levels of the gene across the tissues. For a pair of two genes, we calculated the absolute value of the Pearson's correlation coefficient of the corresponding vectors to obtain their raw similarity scores. Considering that such raw scores may include noise in the original expression data, we further applied an exponential transformation to convert raw scores into final similarity scores, as
where was the final score for two genes g and h, the raw score, and σ(gexp) the standard deviation of raw scores for all gene pairs. With this transformation, the highest raw score (1.0) kept highest, while the lowest raw score (0.0) became exp[-(σ(gexp))-2], which was close to zero because the standard deviation σ(gexp) was typically small.
Gene ontology
Focusing on the biological process domain of the gene ontology and associated annotations [27], we collected a total of 25,616 concepts in the annotations and characterized each human gene using a numeric vector of such number of dimensions, with each element being the information content of the corresponding concept. For a pair of two genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity scores and further applied the aforementioned exponential transformation to convert raw scores into final similarity scores. Note that although there have been quite a few methods for calculating gene semantic similarity based on the gene ontology [28], it has been shown recently that the cosine measure, though simple, often produces reasonable results [29].
Pathway membership
Focusing on human pathways in the KEGG database [30] and discarding diseases-related ones to avoid biases towards well-studied diseases, we obtained a total of 238 pathways and characterized each human gene using a binary vector of such number of dimensions. For a pair of two genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity scores and further applied the exponential transformation to obtain final similarity scores.
Protein sequence
We extracted a total of 20,274 human protein sequences from the Swiss-prot database [31] and ran the Smith-Waterman algorithm implemented in SSEARCH [32] to obtain their pairwise local sequence alignments. Then, we constructed a sequence similarity network of these proteins by connecting two proteins with an undirected edge if their alignment e-value is less than a predefined threshold (10-4). Next, we calculated the shortest path distance () for every pair of proteins (g and h) in this network and converted it to a similarity value in the range of 0 and 1 (). Finally, we applied the exponential transformation to obtain the similarity score. Note that the construction of a sequence similarity network in this procedure greatly reduced the sensitivity to the parameters involved and thus enhanced the robustness of this method.
Protein domain
We obtained a total of 14,831 domains from the Pfam database (Version 27.0) [33] and characterized each human protein using a binary vector of such number of dimensions. For a pair of two genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity scores and further applied the exponential transformation to obtain final similarity scores.
Protein-protein interaction
We extracted a total of 403,514 interactions among 13,747 proteins from the STRING database (Version 9.1) [34] and constructed a protein-protein interaction network accordingly. Then, we calculated the shortest path distance () for every pair of proteins (g and h) in this network and converted it into a value in the range of 0 and 1 (). Finally, we applied the exponential transformation to obtain the similarity score.
Regulation pattern
We extracted a total of 218 high confidence position specific scoring matrices for the same number of vertebrate transcription factors from the TRANSFAC database [35]. We then searched 1,000 basepairs upstream for each human gene using the program MATCH to identify potential binding sites for each transcription factor. Next, we characterized each gene using a numeric vector of 218 dimensions, with each element indexing the number of potential binding sites for the corresponding transcription factor. Finally, for each pair of two genes, we calculated the cosine of the angle between the corresponding vectors to obtain their raw similarity scores and further applied the exponential transformation to obtain final similarity scores.
Scoring association strength by regression and hypothesis testing
Given the phenotype similarity matrix and a gene functional similarity matrix derived from a type of genomic data, we adopted a linear model as proposed in the literature [16] to explain the phenotype similarity between two diseases using functional similarities of genes associated with the diseases, as
where d and e indexes two diseases, Y
de
their phenotype similarity, ε
de
Gaussian noise, and x
de
their genotype similarity defined as
with D and E being sets of genes known as associated with diseases d and e, respectively, and φ
gh
the functional similarity between genes g and h according to the genomic data in use.
Particularly, suppose d to be the query disease and g a candidate gene, we assumed g would be the only gene associated with d and wrote a regression model as
where α and β are regression intercept and slope, respectively, the vector composed of phenotype similarities between d and all other n diseases in the similarity matrix, the vector of corresponding genotype similarities with and I
i
the set of genes known as associated with the i-th disease for i = 1,...,n, and ε = (ε1,...,ε
n
)T with ε
i
~ N(0,σ2) independent and identically distributed for i = 1,...,n.
With this regression model, we quantified the strength of association between the query disease d and the candidate gene g using the statistical significance of the hypothesis testing problem
Define the test statistic T as,
where , and . It is obvious that the statistic has a student's t distribution with n-2 degrees of freedom under the null hypothesis and the normal assumption. The p-value of the proposed test can then be calculated as P(Tn-2 ≥ t) with t the realized value of the statistic.
However, in the case that the normal assumption does not hold, the p-value obtained from the t distribution may not reliably reflect the true statistical significance. We therefore calibrated the p-value by simulating the distribution of raw p-values for all disease-gene pairs that were not included in annotated associations and calculating the adjusted p-value as the proportion of raw p-values in this distribution that was smaller than or equal to the raw p-value need to be calibrated.
Fusion of association scores for multiple genomic data sources
We adopted Fisher's method to integrate p-value derived from different types of genomic data to obtain a single score, with an extra effort on the correction of dependence between the p-values.
Specifically, given the p-values to be combined, denoted by p1,...,p
k
, where k = 7 is the total number of data sources, we defined the fisher's statistic as
It is clear that under the null hypothesis, p
i
~ Uniform(0,1) and . In the independent case, it is obvious that . In the dependent case, we follow the literature [36] to assume that T follows a scaled chi-squared distribution as . The problem is therefore how to estimate the scale r and the degrees of freedom v. Resorting to the method of moments, population mean and variance are given as
while the corresponding sample mean and variance are derived as
Matching these quantities for the population and the sample, we obtain
Covariances cov(V
i
,V
j
) can be estimated using a normal model as follows. Suppose p
i
= Φ(1 - z
i
), where Φ(·)is the cumulative distribution function of the standard normal distribution and Z
i
a statistic that has a standard normal distribution under the null hypothesis. As suggested in the literature [36], let
The covariance is then calculated as
where , n the sample size for obtaining Z
i
.
We further applied multiple testing corrections to the combined p-values by controlling the positive false discovery rate (pFDR) of candidate genes through their q-values [37]. Existing studies have shown the significant improvement in the test power of this method over the traditional approach of Benjamini-Hochberg that controls the false discovery rate (FDR) [38]. It is possible that some data sources are absent for a candidate gene. To deal with this problem, we ignored the missing data source in the Fisher's method and decreased the total number of p-values to be combined accordingly.