Evidence for Probabilistic Hypotheses: With Applications to Causal Modeling
Malcolm R. Forster, Department of Philosophy, University of Wisconsin-Madison
Vals, Switzerland, August 7, 2013
References
Forster, Malcolm R. (1984): Probabilistic Causality and the Foundations of Modern Science. Ph.D. Thesis, University of Western Ontario.
Forster, Malcolm R. (1988): "Sober's Principle of Common Cause and the Problem of Incomplete Hypotheses." Philosophy of Science 55: 538-559.
Forster, Malcolm R. (2006): "Counterexamples to a Likelihood Theory of Evidence." Minds and Machines 16: 319-338.
Whewell, William (1858): The History of Scientific Ideas, 2 vols. London: John W. Parker.
Wright, Sewall (1921): "Correlation and Causation." Journal of Agricultural Research 20: 557-585.
How to discover causes… TWO THESES
Thesis (a): Probabilistic independences provide a way to discover causal relations.
Thesis (b): Probabilistic independences provide the only way to discover causal relations.
The simplest way to argue against (b) is to show how data can favor X → Y over Y → X.
Back to first principles… Hypothesis testing in general...
Modus Tollens: Hypothesis H entails observation O; O is false; therefore H is false.
Probabilistic Modus Tollens: H entails that observation O is highly probable; O is false; therefore H is false.
THE PROBLEM: In most situations, all rival hypotheses give the total evidence E very low probability. Put O = not-E, run probabilistic modus tollens, and we end up rejecting EVERY hypothesis!
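To see how quickly THE PROBLEM bites, here is a minimal sketch in Python. It assumes a Bernoulli (coin-tossing) model and a made-up data set of 1000 tosses with 520 heads; the function name and the particular biases compared are illustrative choices, not taken from the talk.

```python
import math

def bernoulli_log_likelihood(tosses, p):
    """Natural-log probability of one exact 0/1 sequence under bias p."""
    heads = sum(tosses)
    tails = len(tosses) - heads
    return heads * math.log(p) + tails * math.log(1 - p)

# Hypothetical total evidence E: a fixed sequence of 1000 tosses, 520 heads.
tosses = [1] * 520 + [0] * 480

# Rival hypotheses H_p, one for each candidate bias p (illustrative values).
log_liks = {p: bernoulli_log_likelihood(tosses, p) for p in (0.3, 0.5, 0.52, 0.7)}

for p, ll in log_liks.items():
    # Report on a base-10 scale so the size of the exponent is easy to read.
    print(f"p = {p}: log10 P(E | H_p) = {ll / math.log(10):.1f}")
```

Even the best-fitting bias assigns the exact sequence a probability far below any conventional rejection threshold, so taking O = not-E would reject every hypothesis in the list.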
A response to the PROBLEM
(1) We should not focus exclusively on the total evidence E.
(2) We should focus on those aspects of the data O that are central to what the hypothesis says.
Example 1: The agreement of independent measurements of the parameters postulated by the model. E.g. in the Bernoulli model, or the agreement of independent measurements of the Earth's mass.
Example 2: The independencies entailed by d-separation in causal models.
A response to the PROBLEM …continued.
(3) We should look at what is entailed by the models by themselves, without the help of other data. Examples 1 and 2 meet this desideratum.
This also justifies a faithfulness principle: Favor models that entail an independency over ones that are merely able to accommodate it (even if the likelihoods go the other way). (I don't see this as appealing to non-empirical biases, such as simplicity.)
Now apply the agreement-of-measurements idea to the testing of causal models…
What does Forward, X → Y, entail?
The independencies entailed by a DAG are part of what a causal model entails. But the model often says something more…
It says something about the forward probabilities (or densities) p(y|x), and nothing (directly) about p(x) or p(x,y) or p(x|y).
X → Y says: If p_1(x), then p_1(x,y) = p_1(x) p(y|x); if p_2(x), then p_2(x,y) = p_2(x) p(y|x); and so on.
The key idea…
We can use data generated by p_1(x,y) to estimate parameters in p(y|x).
We can use data generated by p_2(x,y) to estimate the same parameters in p(y|x).
The two data clusters provide independent estimates of the parameters.
If the estimates agree, then we have an agreement of independent measurements.
The hypothesis "stuck its neck out": it risked falsification, it survived the test, and it is thereby confirmed.
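The key idea can be sketched with simulated data. The code below is only an illustration under stated assumptions: the data come from Y = X + U with U ~ N(0, 1) independent of X, the two input distributions are taken to be N(-10, 1) and N(10, 1), and the forward model's parameters are the slope and intercept of a least-squares regression of y on x. The helper names `ols` and `make_cluster`, the sample size, and the seed are hypothetical.

```python
import random

def ols(xs, ys):
    """Least-squares (slope, intercept) for regressing ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def make_cluster(rng, mean_x, n=5000):
    """Draw X ~ N(mean_x, 1) and Y = X + U, with U ~ N(0, 1) independent of X."""
    xs = [rng.gauss(mean_x, 1.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(0)            # fixed seed so the run is repeatable
x1, y1 = make_cluster(rng, -10.0)  # Cluster 1, input distribution p_1(x)
x2, y2 = make_cluster(rng, 10.0)   # Cluster 2, input distribution p_2(x)

# Two independent measurements of the same parameters of p(y|x).
forward_1 = ols(x1, y1)
forward_2 = ols(x2, y2)
print("Cluster 1 (slope, intercept):", forward_1)
print("Cluster 2 (slope, intercept):", forward_2)
```

Both clusters should return a slope near 1 and an intercept near 0, up to sampling noise, which is the agreement of independent measurements that the forward model predicts.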
Prediction versus Accommodation
[Figure: scatter plot of y against x, both axes running from -15 to 15. Cluster 1 is generated by p_1(x,y); Cluster 2 is generated by p_2(x,y).]
Both X → Y and Y → X are able to accommodate (that is, fit) the total evidence well. So a maximum likelihood comparison is not going to discriminate well.
But suppose we fit a model to Cluster 1, and then to Cluster 2, to see whether the independent measurements of the parameters agree.
The content of X → Y
X → Y says: If p_1(x), then p_1(x,y) = p_1(x) p(y|x); if p_2(x), then p_2(x,y) = p_2(x) p(y|x); and so on.
X → Y also says: If p_1(x), then p_1(x|y) = p_1(x,y)/p_1(y); if p_2(x), then p_2(x|y) = p_2(x,y)/p_2(y); and so on.
In general, p_1(x|y) ≠ p_2(x|y). That is, X → Y says that the backwards probabilities vary. If X → Y is right, then Y → X is wrong.
It's metaphysically possible that the forward model says that the forward probabilities depend on the input distribution. But we need to search for uniformities of nature…
The Asymmetry of Regression…
[Figure: scatter plot of y against x, with x running from about -14 to -6 and y from about -14 to -6.]
The data are generated from Y = X + U, where X is N(-10, 1), U is N(0, 1), and U is independent of X. The y on x regression is different from the x on y regression.
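The asymmetry can be worked out directly from the generating model on this slide. The derivation below assumes that "regression" means the population conditional expectation, which is linear here because X and Y are jointly Gaussian.

```latex
\begin{align*}
\mathbb{E}[Y \mid X = x] &= x + \mathbb{E}[U] = x, \\
\operatorname{Var}(Y) &= \operatorname{Var}(X) + \operatorname{Var}(U) = 2,
\qquad \operatorname{Cov}(X, Y) = \operatorname{Var}(X) = 1,
\qquad \mathbb{E}[Y] = -10, \\
\mathbb{E}[X \mid Y = y] &= \mathbb{E}[X]
  + \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)}\,\bigl(y - \mathbb{E}[Y]\bigr)
  = -10 + \tfrac{1}{2}(y + 10) = \tfrac{1}{2}\,y - 5.
\end{align*}
```

So the y on x line has slope 1 and intercept 0, while the x on y line has slope 1/2 and intercept -5. If X were N(μ, 1) instead, the same steps would give a backward intercept of μ/2, so the backward line moves with the input distribution while the forward line does not.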
Forward Model: X → Y
[Figure: scatter plot of y against x, both axes running from -15 to 15, showing Cluster 1 and Cluster 2.]
X → Y passes the test because… Independent measurements agree!
Backward Model: Y → X
[Figure: scatter plot of y against x, both axes running from -15 to 15.]
Y → X fails the test because… Not all independent measurements agree.
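The failure of the backward model can be illustrated the same way. This is a sketch under assumptions not stated on the slide: data generated from Y = X + U with U ~ N(0, 1), input distributions N(-10, 1) and N(10, 1) for the two clusters, and a backward model whose parameters are the slope and intercept of a least-squares regression of x on y. The helper names, sample size, and seed are hypothetical.

```python
import random

def ols(xs, ys):
    """Least-squares (slope, intercept) for regressing ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def make_cluster(rng, mean_x, n=5000):
    """Draw X ~ N(mean_x, 1) and Y = X + U, with U ~ N(0, 1) independent of X."""
    xs = [rng.gauss(mean_x, 1.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(1)            # fixed seed so the run is repeatable
x1, y1 = make_cluster(rng, -10.0)  # Cluster 1
x2, y2 = make_cluster(rng, 10.0)   # Cluster 2

# Backward model Y -> X: regress x on y separately within each cluster.
backward_1 = ols(y1, x1)
backward_2 = ols(y2, x2)
print("Cluster 1 (slope, intercept):", backward_1)
print("Cluster 2 (slope, intercept):", backward_2)
```

Under these assumptions the two slopes roughly agree (both near 1/2), but the intercepts land near -5 and +5, so not all of the independent measurements agree.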
Another way of seeing the same thing...
[Figure: scatter plot of y against x, both axes running from -10 to 10.]
The forwards model fits Cluster 2 (top right) better than the backwards model.
Summary Bullets
The phenomenon is completely general. It does not depend on any special features of the distribution, only on a judicious splitting of the data into clusters.
Bayesians (and likelihoodists) do not split data. (They consider only the likelihoods relative to the total evidence.)
If you don't split data, then it is more difficult to show that X → Y is right and Y → X is wrong.