Construction of gene clusters resembling genetic causal mechanisms for common complex disease with an application to young-onset hypertension

Background
Lack of power and poor reproducibility are common problems in genetic association studies of common complex diseases. The heterogeneity of disease etiology demands that causal models consider the simultaneous involvement of multiple genes. Rothman’s sufficient-cause model, which is well known in epidemiology, provides a framework for such a concept. In the present work, we developed a three-stage algorithm to construct gene clusters resembling Rothman’s causal model for a complex disease, starting by finding influential gene pairs and then grouping homogeneous pairs.

Results
The algorithm was trained and tested on 2,772 hypertensives and 6,515 normotensives extracted from four large Caucasian and Taiwanese databases. Each constructed cluster featured a major gene interacting with many other genes and identified a distinct group of patients; the clusters were reproduced in both ethnic populations and across three genotyping platforms. We present the 14 largest gene clusters, which identified 19.3% of hypertensives in all the datasets and 41.8% if one dataset was excluded for lack of phenotype information. Although a few normotensives were also identified by the gene clusters, they usually carried fewer risky combinatory genotypes (insufficient causes) than their hypertensive counterparts. After establishing a cut-off percentage for risky combinatory genotypes in each gene cluster, the 14 gene clusters achieved a classification accuracy of 82.8% for all datasets and 98.9% if the information-short dataset was excluded. Furthermore, not only 10 of the 14 major genes but also many other contributing genes in the clusters are associated with either hypertension or hypertension-related diseases or functions.

Conclusions
We have shown with the constructed gene clusters that a multi-causal-pie, multi-component approach can indeed improve the reproducibility of genetic markers for complex disease. In addition, our novel findings, including a major gene in each cluster and sufficient risky genotypes in a cluster for disease onset (which coincides with Rothman’s sufficient-cause theory), may not only provide a new research direction for complex diseases but also help to reveal the disease etiology.


Cluster visualization
We developed the following steps to generate gene-subject cluster plots for the demonstration of Rothman's genetic causal pies:
Step 1 Construct a binary matrix for each dataset in which each row in the matrix represents a SNP pair and a non-zero element indicates a subject carrying a risky combinatory genotype associated with the SNP pair.
Step 2 Reorder rows in the matrix such that SNP pairs with a shared gene are grouped together.
Step 3 Sort the resulting gene clusters by their size in descending order.
Step 4 Merge rows (SNP pairs) of the same gene pairs in a gene cluster into a single row using the "OR" operator if a similar group of subjects is identified.
Step 5 Starting from the largest gene clusters, group columns that represent subjects carrying risky combinatory genotypes in the gene cluster.
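
The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name build_cluster_plot_matrix and the input layout (a list of (gene_a, gene_b, carriers) tuples) are hypothetical; each SNP pair is assigned to whichever of its two genes occurs in more SNP pairs; cluster size is taken as the number of merged gene-pair rows; and rows of the same gene pair are merged unconditionally, because the text does not define how similar the identified subject groups must be.

```python
from collections import Counter, defaultdict

def build_cluster_plot_matrix(snp_pairs, n_subjects):
    """snp_pairs: list of (gene_a, gene_b, carriers), where carriers is the set
    of subject indices carrying a risky combinatory genotype (hypothetical layout)."""
    # Step 1: one binary row per SNP pair, one column per subject.
    matrix = [[1 if s in carriers else 0 for s in range(n_subjects)]
              for _, _, carriers in snp_pairs]

    # Step 2: group SNP pairs by a shared gene. Assumption: the shared gene is
    # the gene of the pair that occurs in more SNP pairs (ties: first listed).
    gene_counts = Counter(g for a, b, _ in snp_pairs for g in {a, b})
    clusters = defaultdict(dict)  # shared gene -> {partner gene: merged row}
    for row, (a, b, _) in zip(matrix, snp_pairs):
        major, partner = (a, b) if gene_counts[a] >= gene_counts[b] else (b, a)
        # Step 4: merge rows of the same gene pair with the OR operator
        # (the similarity condition in the text is omitted here).
        previous = clusters[major].get(partner)
        clusters[major][partner] = (
            row if previous is None else [x | y for x, y in zip(previous, row)]
        )

    # Step 3: sort clusters by size (number of merged rows), largest first.
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    row_labels = [(major, partner)
                  for major, partners in ordered for partner in sorted(partners)]
    merged = [clusters[major][partner] for major, partner in row_labels]

    # Step 5: starting from the largest cluster, place together the columns of
    # subjects identified by that cluster; unidentified subjects go last.
    col_order, seen = [], set()
    for _, partners in ordered:
        for s in range(n_subjects):
            if s not in seen and any(r[s] for r in partners.values()):
                col_order.append(s)
                seen.add(s)
    col_order += [s for s in range(n_subjects) if s not in seen]

    reordered = [[r[s] for s in col_order] for r in merged]
    return row_labels, reordered, col_order
```

The returned matrix can be passed to any heat-map routine; row_labels gives the (shared gene, partner gene) label of each row and col_order gives the original subject index of each column.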

Robustness evaluation of the gene cluster construction algorithm
We tested the robustness of our gene cluster construction algorithm to small changes in the criterion for a risky combinatory genotype. Apart from the original setting of 2.0% of the case population, we tested the algorithm after increasing the setting to 2.5% and after decreasing it to 1.8% and 1.5%. These proportions were selected so as to change the numbers of cases in the two training datasets. That is, compared with the original 2% of the case population (7 and 4 cases in FHS_Affy500k and Taiwan_Affy500k, respectively), the 2.5%, 1.8% and 1.5% settings corresponded to (8 and 5), (6 and 4) and (5 and 3) cases in the two datasets. To compute the similarity of two lists of gene clusters (one obtained under the original setting and one under a modified setting), we first ranked the gene clusters in descending order of cluster size and then compared the shared genes corresponding to the top n gene clusters (n = 5, 10, 15, 20, 30, 50, 100). The top n gene clusters of the two lists were said to have 100% overlap if their corresponding shared genes were the same (regardless of the ranking order).
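
The top-n comparison described above can be illustrated with a short Python sketch. The function name top_n_overlap and the input format (a dictionary mapping each cluster's shared gene to its cluster size) are hypothetical. The text defines only the 100% case; expressing partial overlap as the fraction of shared genes common to both top-n lists is our assumption for illustration.

```python
def top_n_overlap(cluster_sizes_a, cluster_sizes_b, n):
    """Percentage overlap of the shared genes of the top-n clusters of two lists.

    Each argument maps a cluster's shared gene to the cluster size
    (hypothetical input format).
    """
    def top_genes(sizes):
        # Rank clusters by size, largest first; ties broken by gene name.
        ranked = sorted(sizes, key=lambda gene: (-sizes[gene], gene))
        return set(ranked[:n])

    genes_a = top_genes(cluster_sizes_a)
    genes_b = top_genes(cluster_sizes_b)
    denominator = max(len(genes_a), len(genes_b))
    if denominator == 0:
        return 0.0
    # Set intersection ignores the ranking order within the top n.
    return 100.0 * len(genes_a & genes_b) / denominator
```

For example, calling top_n_overlap(original, modified, n) for each n in (5, 10, 15, 20, 30, 50, 100) yields one overlap value per cut-off, where original and modified are the cluster-size dictionaries obtained under two settings.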
The robustness of the developed gene cluster construction algorithm to changes in sample size was also tested. Because Taiwan_Affy500k was already small in terms of NC sample size (184 NC subjects), and reducing it further could introduce a considerable number of false-positive SNP pairs, we only reduced the sample size of FHS_Affy500k to 90%, 80%, 70%, 60%, and 50% of its original size in our experiments. For each size, five sub-datasets were constructed, each of which was randomly drawn from the original dataset. The similarity between the gene clusters computed from the reduced sub-dataset and those computed from the original dataset was evaluated via the same procedure as that used to evaluate the effect of criterion changes.
The robustness test for changes in disease prevalence was similar to the test for changes in sample size, except that the NC sample size was left unchanged while the HT sample size in the FHS_Affy500k dataset was reduced to 90%, 80%, 70%, 60%, and 50% of its original size, so that the resulting prevalence was also reduced to 90%, 80%, 70%, 60%, and 50% of its original value. For each disease prevalence, five sub-datasets were constructed, each of which was randomly drawn from the original dataset. The similarity between the gene clusters computed from the reduced sub-dataset and those computed from the original dataset was evaluated via the same procedure as that used to evaluate the effect of criterion changes.
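
The two subsampling designs (sample size and disease prevalence) differ only in whether the NC group is reduced along with the HT group. A minimal Python sketch, assuming subject identifiers are held in lists, is given below. The function name draw_subdatasets, the reduce_nc flag, the seed, and the rounding of the target sizes are hypothetical choices; drawing HT and NC subjects separately at the same fraction in the sample-size test is also an assumption, since the text states only that sub-datasets were randomly drawn from the original dataset.

```python
import random

# Fractions of the original size used in both robustness tests.
FRACTIONS = (0.9, 0.8, 0.7, 0.6, 0.5)

def draw_subdatasets(ht_ids, nc_ids, fraction, reduce_nc, n_repeats=5, seed=0):
    """Draw n_repeats random sub-datasets at the given fraction.

    reduce_nc=True  -> sample-size test: HT and NC are both reduced.
    reduce_nc=False -> prevalence test: only HT is reduced, NC is unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the draws are repeatable
    subsets = []
    for _ in range(n_repeats):
        # Sampling without replacement from the original HT subjects.
        ht = rng.sample(ht_ids, round(fraction * len(ht_ids)))
        if reduce_nc:
            nc = rng.sample(nc_ids, round(fraction * len(nc_ids)))
        else:
            nc = list(nc_ids)
        subsets.append((ht, nc))
    return subsets
```

Each sub-dataset would then be passed through the cluster construction algorithm and compared with the clusters from the full dataset using the top-n overlap procedure.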