New methods for separating causes from effects in genomics data

Background: The discovery of molecular pathways is a challenging problem, and its solution relies on the identification of causal molecular interactions in genomics data. Causal molecular interactions can be discovered using randomized experiments; however, such experiments are often costly, infeasible, or unethical. Fortunately, algorithms that infer causal interactions from observational data have been in development for decades, predominantly in the quantitative sciences, and many of them have recently been applied to genomics data. While these algorithms can infer unoriented causal interactions between the involved molecular variables (i.e., without specifying which one is the cause and which one is the effect), causally orienting all inferred molecular interactions was assumed to be an unsolvable problem until recently. In this work, we use transcription factor-target gene regulatory interactions in three different organisms to evaluate a new family of methods that, given observational data for just two causally related variables, can determine which one is the cause and which one is the effect.

Results: We have found that a particular family of causal orientation methods (IGCI Gaussian) is often able to accurately infer the directionality of causal interactions, and that these methods usually outperform other causal orientation techniques. We also introduced a novel ensemble technique for causal orientation that combines the decisions of individual causal orientation methods. The ensemble method was found to be more accurate than the best individual causal orientation method in the tested data.

Conclusions: This work represents a first step towards establishing context for practical use of causal orientation methods in the genomics domain. We have found that some causal orientation methodologies yield accurate predictions of causal orientation in genomics data, and we have improved on this capability with a novel ensemble method.
Our results suggest that these methods have the potential to facilitate reconstruction of molecular pathways by minimizing the number of required randomized experiments to find causal directionality and by avoiding experiments that are infeasible and/or unethical.

log(P(Y|X)). Repeat the process, this time estimating P(Y) and P(X|Y), to obtain DL(Y→X). If DL(X→Y) < DL(Y→X), conclude X → Y, otherwise X ← Y.
 ANM-MML: Same as GPI-MML, except for a different method of estimating P(Y|X), where the covariance matrix used in the Gaussian process is constant with respect to the noise (which reflects the additive noise assumption). As before, we obtain the likelihood of the observed data given X → Y: DL(X→Y) = −log(P(X)) − log(P(Y|X)). Repeating the process, this time estimating P(Y) and P(X|Y), we obtain DL(Y→X). If DL(X→Y) < DL(Y→X), conclude X → Y, otherwise X ← Y.
 GPI: Similar to ANM, only we perform non-linear regression of Y on X and e. Since e is supposed to represent all the unobserved causes as well as noise, this can be thought of as accounting for latent variables. Estimate the noise e from the equation Y = f*(X, e), and test whether it is independent of X using a kernelized statistical independence test (HSIC) to obtain a p-value p1. Repeat the same process in the opposite direction to obtain a p-value p2. If p1 > p2, conclude X → Y, otherwise X ← Y.
 ANM-GAUSS: Same as ANM-MML, except for a different method of estimating P(X), using a single Gaussian rather than a mixture model. As before, we obtain the likelihood of the observed data given X → Y: DL(X→Y) = −log(P(X)) − log(P(Y|X)). Repeating the process, this time estimating P(Y) and P(X|Y), we obtain DL(Y→X). If DL(X→Y) < DL(Y→X), conclude X → Y, otherwise X ← Y.
 LINGAM: Estimate a model of the form Y = b2·X + e1 and X = b1·Y + e2, where e1 and e2 are independent, using independent component analysis (ICA). If b1 < b2, conclude X → Y, otherwise X ← Y.
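The additive-noise methods above share a common decision template: fit a model in each direction, score each fit, and orient toward the better score. The sketch below illustrates that template only, substituting a polynomial regression and a simple correlation-based proxy for the HSIC test and MML scoring used by the actual methods; all function names and parameters here are illustrative, not the authors' implementation.

```python
import numpy as np

def dependence_proxy(pred, resid):
    """Crude stand-in for an HSIC independence test: combines the linear
    correlation of residuals with the predictor and the correlation of
    their squares (the latter captures heteroscedastic structure)."""
    c1 = abs(np.corrcoef(pred, resid)[0, 1])
    c2 = abs(np.corrcoef(pred**2, resid**2)[0, 1])
    return c1 + c2

def anm_direction(x, y, degree=5):
    """Fit y ~ f(x) and x ~ g(y); under an additive noise model the
    residuals are (approximately) independent of the predictor only in
    the causal direction, so the weaker dependence wins."""
    res_xy = y - np.polyval(np.polyfit(x, y, degree), x)
    res_yx = x - np.polyval(np.polyfit(y, x, degree), y)
    if dependence_proxy(x, res_xy) < dependence_proxy(y, res_yx):
        return "X->Y"
    return "X<-Y"

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 2000)
y = x**3 + rng.normal(0, 1, 2000)   # simulated ground truth: X -> Y
print(anm_direction(x, y))
```

On simulated data with a nonlinear mechanism and additive noise, as above, the backward regression leaves residuals whose spread varies with the predictor, which the proxy detects.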

Results of causal orientation methods ANM, PNL, and GPI obtained by assessing statistical significance of the forward and backward causal models
Recall that all causal relations in the gold standard are of the type TF → G ("TF" stands for a transcription factor and "G" stands for its target gene). The tables below adopt the following notation:  TF → G: Number of times the method discovers that the model TF → G is statistically significant, while the model TF ← G is not statistically significant (at the given alpha level).
 TF ← G: Number of times the method discovers that the model TF ← G is statistically significant, while the model TF → G is not statistically significant (at the given alpha level).
 TF ↔ G: Number of times the method discovers that both models TF ← G and TF → G are statistically significant (at the given alpha level).
 TF G: Number of times the method discovers that neither model TF ← G nor TF → G is statistically significant (at the given alpha level).
 Accuracy*: Accuracy for confident decisions only, that is, computed as: Accuracy* = (TF → G) / ((TF → G) + (TF ← G)).

ECOLI results: Performance of IGCI methods (Gaussian/Entropy or Gaussian/Integral) in real data (red line in the graphs and second column in the tables) and in 1,000 random datasets drawn from a Normal distribution with mean 0 and standard deviation 1 (blue histograms), as well as the empirical probability of observing higher performance in the random data than the observed performance in the real data (third column in the tables).
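Given forward and backward p-values for each TF-target pair, the four counts and the confident-decision accuracy can be tallied as in the following sketch. The p-values are hypothetical, and the significance rule used here (p ≤ alpha) may need to be reversed depending on how a particular test defines model fit.

```python
def tally(pvals, alpha=0.05):
    """Classify each (p_forward, p_backward) pair into the four outcomes
    defined above; 'forward' is the gold-standard direction TF -> G.
    NOTE: 'significant' is taken as p <= alpha, a simplifying assumption."""
    counts = {"TF->G": 0, "TF<-G": 0, "TF<->G": 0, "neither": 0}
    for p_fwd, p_bwd in pvals:
        fwd_sig, bwd_sig = p_fwd <= alpha, p_bwd <= alpha
        if fwd_sig and not bwd_sig:
            counts["TF->G"] += 1
        elif bwd_sig and not fwd_sig:
            counts["TF<-G"] += 1
        elif fwd_sig and bwd_sig:
            counts["TF<->G"] += 1
        else:
            counts["neither"] += 1
    confident = counts["TF->G"] + counts["TF<-G"]
    # Accuracy*: correct confident decisions over all confident decisions
    acc_star = counts["TF->G"] / confident if confident else float("nan")
    return counts, acc_star

# Hypothetical p-value pairs for four TF-target interactions:
counts, acc_star = tally([(0.01, 0.40), (0.02, 0.03), (0.30, 0.01), (0.50, 0.60)])
print(counts, acc_star)
```

Here one pair is confidently correct and one confidently wrong, so Accuracy* is 0.5 even though only a quarter of all pairs were oriented correctly.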

 Figure S1: IGCI Gaussian/Entropy method assessed with the accuracy metric.
 Figure S2: IGCI Gaussian/Entropy method assessed with the AUC metric.
 Figure S3: IGCI Gaussian/Integral method assessed with the accuracy metric.
 Figure S4: IGCI Gaussian/Integral method assessed with the AUC metric.
Performance increase due to adding noise: We have plotted the output scores of the IGCI methods for each transcription factor both with and without added noise. The plots for 5%, 10%, 20%, 30%, 40%, and 50% noise are given in Figures S5-S10. To interpret these figures, we remind the reader that negative scores correspond to correct orientations, whereas positive scores correspond to incorrect orientations. As can be seen, adding noise causes both negative and positive scores (corresponding to correct and incorrect predictions, respectively) to converge to zero, as expected. However, as we increase the noise level, the IGCI outputs for the cause-effect pairs that were correctly predicted in the noiseless data (i.e., have negative scores) converge to zero more slowly than the IGCI outputs for the cause-effect pairs that were incorrectly predicted in the noiseless data (i.e., have positive scores).
As a result, for small amounts of noise, most correct predictions in the noiseless data are retained (they still have negative scores) while the incorrect predictions increasingly behave like random. Overall, this results in an increase of accuracy.
For example, assume that we have 100 cause-effect pairs and IGCI correctly predicted 80 of them in the noiseless data, resulting in 80% accuracy. Then with the addition of a small amount of noise, we retain 80 correct predictions while the 20 other predictions are now classified randomly, resulting in 10 correct and 10 incorrect. Overall, this leads to 90% accuracy, so we have a 10% increase.
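The arithmetic in this example assumes that correct noiseless predictions are retained while incorrect ones become coin flips; in expectation:

```python
def expected_accuracy_after_noise(n_pairs, n_correct_noiseless):
    """Correct predictions keep their (negative) scores; the remaining
    pairs are reclassified at random, i.e. correct with probability 1/2."""
    n_incorrect = n_pairs - n_correct_noiseless
    return (n_correct_noiseless + 0.5 * n_incorrect) / n_pairs

print(expected_accuracy_after_noise(100, 80))  # 0.9, as in the example
```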
 Figure S5: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 5% noise. Cyan points correspond to the IGCI output scores in the noiseless data. Grey points correspond to the IGCI output scores for each of the 20 noisy datasets. Magenta points correspond to the average IGCI output scores over all 20 noisy datasets. The results are plotted based on sorting of the IGCI output scores in the noiseless data; that is why the cyan points are monotonically increasing.
 Figure S6: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 10% noise.
 Figure S7: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 20% noise.
 Figure S8: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 30% noise.
 Figure S9: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 40% noise.
 Figure S10: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 50% noise.

Performance increase due to reducing sample size: We plotted graphs similar to those described above for sample sizes of 310, 220, 130, and 40. Decreasing the sample size causes both negative and positive scores (corresponding to correct and incorrect predictions, respectively) to converge to zero, as expected. However, as we decrease the sample size, the IGCI outputs for the cause-effect pairs that were correctly predicted using all samples (i.e., have negative scores) converge to zero more slowly than the IGCI outputs for the cause-effect pairs that were incorrectly predicted using all samples (i.e., have positive scores). As a result, for certain sample sizes, most correct predictions in the full-sample data are retained (they still have negative scores), while the incorrect predictions increasingly behave like random. Overall, this results in an increase of accuracy.
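For reference, a standard slope-based form of the IGCI score whose sign convention is discussed above can be sketched as follows. This follows the common formulation from the IGCI literature (standardizing both variables corresponds to the Gaussian reference measure) and is not necessarily the exact scaling used to produce these figures; it is best suited to (near-)deterministic, invertible relations.

```python
import numpy as np

def igci_slope(x, y):
    """Slope-based IGCI estimate for the direction x -> y, after
    standardizing both variables to zero mean and unit variance."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    order = np.argsort(x)
    dx = np.diff(x[order])
    dy = np.diff(y[order])
    keep = (dx != 0) & (dy != 0)           # skip ties to avoid log(0)
    return np.mean(np.log(np.abs(dy[keep] / dx[keep])))

def igci_score(cause, effect):
    """Difference of directional estimates; a negative score favours
    cause -> effect, matching the sign convention described above."""
    return igci_slope(cause, effect) - igci_slope(effect, cause)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 1000)
y = np.exp(x)                # deterministic nonlinear map: x -> y
print(igci_score(x, y))      # negative, so x -> y is inferred
```

For a linear relation the two directional estimates coincide and the score is near zero, which is consistent with scores shrinking toward zero as noise or subsampling washes out the nonlinear structure.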

 Figure S11: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data using 310 samples. Cyan points correspond to the IGCI output scores using all 530 samples. Grey points correspond to the IGCI output scores for each of the 20 datasets of size 310. Magenta points correspond to the average IGCI output scores over all 20 sampled datasets of size 310. The results are plotted based on sorting of the IGCI output scores in the full sample data; that is why cyan points are monotonically increasing.
 Figure S12: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 220 samples.
 Figure S13: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 130 samples.
 Figure S14: Scores for each cause-effect pair in YEAST gold standard obtained using the IGCI Gaussian/Entropy method in the data with 40 samples.