Human causal induction Tom Griffiths Department of Psychology Cognitive Science Program University of California, Berkeley.

Three kinds of causal induction

contingency data

Dr. James Currie

Currie (1798), Medical Reports on the Effects of Water, Cold and Warm, as a Remedy in Fevers and Febrile Diseases: “The contagion spread rapidly and before its progress could be arrested, sixteen persons were affected of which two died. Of these sixteen, eight were under my care. On this occasion I used for the first time the affusion of cold water in the manner described by Dr. Wright. It was first tried in two cases... The effects corresponded exactly with those mentioned by him to have occurred in his own case and thus encouraged the remedy was employed in five other cases. It was repeated daily, and of these seven patients, the whole recovered.”

“Does the treatment cause recovery?” (Currie, 1798)

            Recovered   Died
Treated         7         0
Untreated       7         2

Causation from contingencies
“Does C cause E?” (rate on a scale from 0 to 100)

                  E present (e+)   E absent (e-)
C present (c+)          a                b
C absent (c-)           c                d

Three kinds of causal induction contingency data physical systems

Blicket detector (Gopnik, Sobel, and colleagues) See this? It’s a blicket machine. Blickets make it go. Let’s put this one on the machine. Oooh, it’s a blicket!

Three kinds of causal induction contingency data physical systems perceived causality

Michotte (1963)

Three kinds of causal induction contingency data physical systems perceived causality

Causation from contingencies
“Does C cause E?” (rate on a scale from 0 to 100)

                  E present (e+)   E absent (e-)
C present (c+)          a                b
C absent (c-)           c                d

Two models of causal judgment
Delta-P (Jenkins & Ward, 1965):  ΔP = P(e+|c+) − P(e+|c−)
Power PC (Cheng, 1997):  power = ΔP / (1 − P(e+|c−))

Applying both models to Currie’s treated/untreated recovery data:
P(e+|c+) = 7/7 = 1.00
P(e+|c−) = 7/9 = 0.78
ΔP = 1.00 − 0.78 = 0.22
power = 0.22 / (1 − 0.78) = 0.22/0.22 = 1.00
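
Both measures can be computed directly from the a, b, c, d cells of the contingency table; a minimal sketch (the function names are mine, not from the talk):

```python
def delta_p(a, b, c, d):
    """Delta-P from a 2x2 contingency table:
    a = e+ with c+, b = e- with c+, c = e+ with c-, d = e- with c-."""
    return a / (a + b) - c / (c + d)

def causal_power(a, b, c, d):
    """Cheng's (1997) causal power for a generative cause."""
    p_background = c / (c + d)               # P(e+|c-)
    return delta_p(a, b, c, d) / (1 - p_background)

# Currie's fever data: treated 7 recovered / 0 died; untreated 7 / 2
dp = delta_p(7, 0, 7, 2)          # 1.00 - 0.78 = 0.22
power = causal_power(7, 0, 7, 2)  # 0.22 / 0.22 = 1.00
```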

Buehner and Cheng (1997)
[Bar graphs: human judgments (“People”) vs. ΔP and causal power predictions]

Buehner and Cheng (1997)
[Bar graphs: human judgments (“People”) vs. ΔP and causal power predictions]
Constant ΔP, changing judgments

Buehner and Cheng (1997)
[Bar graphs: human judgments (“People”) vs. ΔP and causal power predictions]
Constant causal power, changing judgments

Buehner and Cheng (1997)
[Bar graphs: human judgments (“People”) vs. ΔP and causal power predictions]
ΔP = 0, changing judgments

What is the computational problem? ΔP and causal power both seem to capture some part of causal induction, and miss something else… How can we formulate the problem of causal induction that people are trying to solve? –perhaps this way we can discover what’s missing. A first step: a language for talking about causal relationships…

Bayesian networks
Four random variables:
 X1: coin toss produces heads
 X2: pencil levitates
 X3: friend has psychic powers
 X4: friend has two-headed coin
[Graph: X3 → X1, X4 → X1, X3 → X2; distributions P(x3), P(x4), P(x2|x3), P(x1|x3, x4)]
Nodes: variables. Links: dependency. Each node has a conditional probability distribution. Data: observations of x1, ..., x4.
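
Inference in such a network can be carried out by enumerating the joint distribution. A sketch of the psychic-friend example; the slide specifies only the structure, so the conditional probabilities below are made up for illustration:

```python
from itertools import product

# Hypothetical parameters (the network structure is from the slide; numbers are mine)
p_x3 = 0.01          # P(friend has psychic powers)
p_x4 = 0.05          # P(friend has two-headed coin)

def p_x1(x3, x4):    # P(heads | psychic powers, two-headed coin)
    if x4:
        return 1.0               # a two-headed coin always lands heads
    return 0.9 if x3 else 0.5    # a psychic friend can bias the toss

def p_x2(x3):        # P(pencil levitates | psychic powers)
    return 0.9 if x3 else 0.001

def joint(x1, x2, x3, x4):
    """P(x1, x2, x3, x4) = P(x3) P(x4) P(x1|x3,x4) P(x2|x3)."""
    p = (p_x3 if x3 else 1 - p_x3) * (p_x4 if x4 else 1 - p_x4)
    p *= p_x1(x3, x4) if x1 else 1 - p_x1(x3, x4)
    p *= p_x2(x3) if x2 else 1 - p_x2(x3)
    return p

# Inference by enumeration: P(psychic | heads, levitation) = P(x3=1 | x1=1, x2=1)
num = sum(joint(1, 1, 1, x4) for x4 in (0, 1))
den = sum(joint(1, 1, x3, x4) for x3, x4 in product((0, 1), repeat=2))
posterior = num / den
```

Even with a tiny prior on psychic powers, seeing both heads and a levitating pencil pushes the posterior close to 1, because only x3 explains the levitation.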

Causal Bayesian networks
Four random variables:
 X1: coin toss produces heads
 X2: pencil levitates
 X3: friend has psychic powers
 X4: friend has two-headed coin
[Graph: X3 → X1, X4 → X1, X3 → X2; distributions P(x3), P(x4), P(x2|x3), P(x1|x3, x4)]
Nodes: variables. Links: causality. Each node has a conditional probability distribution. Data: observations of, and interventions on, x1, ..., x4.

Interventions
Same four random variables: X1 coin toss produces heads, X2 pencil levitates, X3 friend has psychic powers, X4 friend has two-headed coin.
Intervention: hold down the pencil (set X2 directly).
Cut all incoming links for the node that we intervene on.
Compute probabilities with the “mutilated” Bayes net.

What is the computational problem?
Strength: how strong is a relationship?
Structure: does a relationship exist?
[Graphs: h1 = B → E ← C vs. h0 = B → E]

What is the computational problem?
Strength: how strong is a relationship?
[Graphs: h1 = B → E ← C vs. h0 = B → E]

What is the computational problem?
Strength: how strong is a relationship?
 –requires defining the nature of the relationship
[Graphs: h1 = B → E ← C vs. h0 = B → E]

Parameterization
Structures: h1 = [B → E ← C], h0 = [B → E] (C disconnected)
Generic (Bernoulli) parameterization:

 C B   h1: P(E=1 | C, B)   h0: P(E=1 | C, B)
 0 0        p00                 p0
 1 0        p10                 p0
 0 1        p01                 p1
 1 1        p11                 p1

Parameterization
Structures: h1 = [B → E ← C], h0 = [B → E] (C disconnected)
w0, w1: strength parameters for B, C
Linear parameterization:

 C B   h1: P(E=1 | C, B)   h0: P(E=1 | C, B)
 0 0        0                   0
 1 0        w1                  0
 0 1        w0                  w0
 1 1        w1 + w0             w0

Parameterization
Structures: h1 = [B → E ← C], h0 = [B → E] (C disconnected)
w0, w1: strength parameters for B, C
“Noisy-OR” parameterization:

 C B   h1: P(E=1 | C, B)   h0: P(E=1 | C, B)
 0 0        0                   0
 1 0        w1                  0
 0 1        w0                  w0
 1 1        w1 + w0 − w1w0      w0
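
The three parameterizations can be written as functions of the parent values c, b ∈ {0, 1}, with w0 as B’s strength and w1 as C’s, as above (a sketch; the function names are mine):

```python
def generic(p, c, b):
    """Generic (Bernoulli): one free parameter per parent configuration."""
    return p[(c, b)]

def linear(w0, w1, c, b):
    """Linear: strengths of the present causes add (must stay <= 1)."""
    return w1 * c + w0 * b

def noisy_or(w0, w1, c, b):
    """Noisy-OR: each present cause independently fails to produce E,
    so E is absent only if every present cause fails."""
    return 1 - (1 - w1) ** c * (1 - w0) ** b
```

For c = b = 1 the noisy-OR form reproduces the table entry w1 + w0 − w1w0.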

What is the computational problem?
Structure: does a relationship exist?
[Graphs: h1 = B → E ← C vs. h0 = B → E]

Approaches to structure learning
Constraint-based (Pearl, 2000; Spirtes et al., 1993):
 –dependency from statistical tests
 –deduce structure from dependencies

Approaches to structure learning E B C B Constraint-based: –dependency from statistical tests –deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993)

Approaches to structure learning E B C B Constraint-based: –dependency from statistical tests –deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993)

Approaches to structure learning E B C B Attempts to reduce inductive problem to deductive problem Constraint-based: –dependency from statistical tests –deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993)

Observationally equivalent
[a → b]  and  [a ← b]
P(a)P(b|a) = P(b)P(a|b)

Two distinguishable networks
Common cause: [b ← a → c]  P(a)P(b|a)P(c|a)
Common effect: [b → a ← c]  P(b)P(c)P(a|b,c)
These factorizations are not equal in general.

Indistinguishable from observation
[a → b → c]  [a ← b → c]  [a ← b ← c]

Effect of intervention a b c a b c a b c x x x

Effect of intervention
[The same three networks, with b’s incoming links cut]
…the networks become distinguishable.
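
This can be checked numerically: fit a reversed chain to exactly the same joint distribution as a chain a → b → c, then intervene on b. A sketch with arbitrary, purely illustrative CPT values:

```python
from itertools import product

# Chain a -> b -> c (the CPT numbers are made up for illustration)
p_a = 0.3
p_b_given_a = {0: 0.2, 1: 0.9}
p_c_given_b = {0: 0.1, 1: 0.8}

def joint_chain(a, b, c):
    pa = p_a if a else 1 - p_a
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pb * pc

# Fit the reversed chain a <- b <- c to the same joint distribution
p_c = sum(joint_chain(a, b, 1) for a, b in product((0, 1), repeat=2))
p_b_given_c = {c: sum(joint_chain(a, 1, c) for a in (0, 1)) /
                  sum(joint_chain(a, b, c) for a, b in product((0, 1), repeat=2))
               for c in (0, 1)}
p_a_given_b = {b: sum(joint_chain(1, b, c) for c in (0, 1)) /
                  sum(joint_chain(a, b, c) for a, c in product((0, 1), repeat=2))
               for b in (0, 1)}

def joint_reverse(a, b, c):
    pc = p_c if c else 1 - p_c
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    return pc * pb * pa

# Observationally equivalent: identical joint distributions
for a, b, c in product((0, 1), repeat=3):
    assert abs(joint_chain(a, b, c) - joint_reverse(a, b, c)) < 1e-12

# Under do(b=1) they come apart: cut b's incoming link, set b = 1
p_c_do_b_chain = p_c_given_b[1]   # b still causes c in the chain
p_c_do_b_rev = p_c                # c is upstream of b, so its marginal is unchanged
```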

Approaches to structure learning
Constraint-based (Pearl, 2000; Spirtes et al., 1993):
 –dependency from statistical tests
 –deduce structure from dependencies
Bayesian (Heckerman, 1998; Friedman, 1999):
 –compute posterior probability of structures, given observed data:
  P(h|d) ∝ P(d|h) P(h)
 –compare P(h1|data) and P(h0|data)

Bayesian Occam’s Razor
[Plot: P(d|h) over all possible data sets d, for h0 (no relationship) and h1 (relationship)]
For any model h, Σd P(d|h) = 1: a model that spreads its probability over more data sets must assign less to each one.

Causal structure vs. causal strength
Strength: how strong is a relationship? [B → E (w0), C → E (w1)]
Structure: does a relationship exist? [h1 = B → E ← C vs. h0 = B → E]

Causal strength
Assume the structure: [B → E (w0), C → E (w1)]
ΔP and causal power are both maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C):
 –linear → ΔP
 –Noisy-OR → causal power
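
This claim can be checked by grid search: treating the background cause B as always present, the maximum-likelihood w1 comes out as ΔP under the linear parameterization and as causal power under noisy-OR. A sketch with made-up counts (e+ on 9/12 cause-present trials, 6/12 cause-absent trials, so ΔP = 0.25 and power = 0.50):

```python
from math import log

# Hypothetical contingency data (not from the talk)
k1, n1 = 9, 12   # e+ counts, cause present
k0, n0 = 6, 12   # e+ counts, cause absent

def log_lik(p1, p0):
    """Binomial log-likelihood (constants dropped) of the two conditions."""
    return (k1 * log(p1) + (n1 - k1) * log(1 - p1) +
            k0 * log(p0) + (n0 - k0) * log(1 - p0))

def mle_w1(combine):
    """Grid-search MLE of (w0, w1) with P(e+|c+) = combine(w0, w1), P(e+|c-) = w0."""
    grid = [i / 100 for i in range(1, 100)]
    best, best_w1 = float('-inf'), None
    for w0 in grid:
        for w1 in grid:
            p1 = combine(w0, w1)
            if not 0 < p1 < 1:
                continue   # linear strengths can overshoot 1
            ll = log_lik(p1, w0)
            if ll > best:
                best, best_w1 = ll, w1
    return best_w1

linear_w1 = mle_w1(lambda w0, w1: w0 + w1)              # recovers Delta-P
noisy_or_w1 = mle_w1(lambda w0, w1: w0 + w1 - w0 * w1)  # recovers causal power
```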

Causal structure
Hypotheses: h1 = [B → E ← C], h0 = [B → E]
Bayesian causal inference: support = log [ P(d|h1) / P(d|h0) ]
The likelihood ratio (Bayes factor) gives evidence in favor of h1.
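
Causal support can be approximated by numerically integrating the strength parameters out of each hypothesis; following the model described here, I assume uniform priors on w0 and w1 and the noisy-OR parameterization for h1 (the midpoint-rule grid is my implementation choice):

```python
from math import log

def binom_lik(p, k, n):
    # Binomial coefficients are omitted: they cancel in the likelihood ratio
    return p ** k * (1 - p) ** (n - k)

def marginal_h0(k1, n1, k0, n0, steps=500):
    """P(d|h0): integrate over w0 (uniform prior); only the background acts."""
    total = 0.0
    for i in range(steps):
        w0 = (i + 0.5) / steps
        total += binom_lik(w0, k1, n1) * binom_lik(w0, k0, n0)
    return total / steps

def marginal_h1(k1, n1, k0, n0, steps=500):
    """P(d|h1): integrate over (w0, w1), noisy-OR in the cause-present condition."""
    total = 0.0
    for i in range(steps):
        w0 = (i + 0.5) / steps
        for j in range(steps):
            w1 = (j + 0.5) / steps
            p1 = w0 + w1 - w0 * w1
            total += binom_lik(p1, k1, n1) * binom_lik(w0, k0, n0)
    return total / steps ** 2

def support(k1, n1, k0, n0):
    return log(marginal_h1(k1, n1, k0, n0) / marginal_h0(k1, n1, k0, n0))

# Currie's data: 7/7 treated recovered, 7/9 untreated recovered
s = support(7, 7, 7, 9)   # positive: evidence in favor of h1
```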

Buehner and Cheng (1997)
[Bar graphs: human judgments (“People”) vs. model predictions]
ΔP (r = 0.89)  Power (r = 0.88)  Support (r = 0.97)

Generativity is essential
[Plot: Bayesian support predictions as P(e+|c+) and P(e+|c−) range over 0/8, 2/8, 4/8, 6/8, 8/8]
Predictions result from a “ceiling effect”, which only matters if you believe a cause increases the probability of an effect (follows from use of noisy-OR, after Cheng, 1997).

Three kinds of causal induction contingency data physical systems perceived causality

Blicket detector (Gopnik, Sobel, and colleagues) See this? It’s a blicket machine. Blickets make it go. Let’s put this one on the machine. Oooh, it’s a blicket!

“Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004)
 –Two objects: A and B
 –Trial 1: A and B on detector – detector active
 –Trial 2: A on detector – detector active
 –4-year-olds judge whether each object is a blicket
   A: a blicket (100% say yes)
   B: probably not a blicket (34% say yes)

Possible hypotheses
[All possible causal graphs over A, B, and E]

Bayesian inference
Evaluating causal models in light of data:
 P(h|d) = P(d|h) P(h) / Σh′ P(d|h′) P(h′)
Inferring a particular causal relation:
 P(A → E | d) = Σh P(A → E | h) P(h|d)

Bayesian inference
With a uniform prior on hypotheses, and the generic parameterization, integrating over parameters (Cooper & Herskovits, 1992):
[Plot: probability that A and B are blickets, after each trial]

Theory-based causal induction
Theory –generates→ Hypothesis space (causal graphs over X, Y, Z) –generates→ Data (cases of X, Y, Z)
Bayesian inference inverts this generative process.

An analogy to language
Theory –generates→ Hypothesis space –generates→ Data
Grammar –generates→ Parse trees –generates→ Sentence (“The quick brown fox …”)

Theory-based causal induction
Components of a theory, and what each generates:
 Ontology → Variables
 Plausible relations → Structure
 Functional form → Conditional probabilities
A causal theory is a hypothesis space generator.
Hypotheses are evaluated by Bayesian inference: P(h|data) ∝ P(data|h) P(h)

Theory
Ontology
 –Types: Block, Detector, Trial
 –Predicates: Contact(Block, Detector, Trial), Active(Detector, Trial)
[Detector E with blocks A and B]

Theory
Ontology
 –Types: Block, Detector, Trial
 –Predicates: Contact(Block, Detector, Trial), Active(Detector, Trial)
A = 1 if Contact(block A, detector, trial), else 0
B = 1 if Contact(block B, detector, trial), else 0
E = 1 if Active(detector, trial), else 0

Theory
Plausible relations
 –For any Block b and Detector d, with prior probability q: For all trials t, Contact(b,d,t) → Active(d,t)
Four hypotheses:
 h00: no blickets   P(h00) = (1 − q)²
 h10: A → E         P(h10) = q(1 − q)
 h01: B → E         P(h01) = (1 − q)q
 h11: A → E, B → E  P(h11) = q²
No hypotheses with E → B, E → A, A → B, etc.
“A → E” = “A is a blicket”

Theory
Functional form of causal relations
 –Causes of Active(d,t) are independent mechanisms, with causal strengths w_b. A background cause has strength w_0. Assume a deterministic mechanism: w_b = 1, w_0 = 0.

                     h00  h10  h01  h11
P(E=1 | A=0, B=0):    0    0    0    0
P(E=1 | A=1, B=0):    0    1    0    1
P(E=1 | A=0, B=1):    0    0    1    1
P(E=1 | A=1, B=1):    0    1    1    1

P(h00) = (1 − q)²  P(h10) = q(1 − q)  P(h01) = (1 − q)q  P(h11) = q²

Theory
Ontology
 –Types: Block, Detector, Trial
 –Predicates: Contact(Block, Detector, Trial), Active(Detector, Trial)
Plausible relations
 –For any Block b and Detector d, with prior probability q: For all trials t, Contact(b,d,t) → Active(d,t)
Functional form of causal relations
 –Causes of Active(d,t) are independent mechanisms, with causal strengths w_b. A background cause has strength w_0. Assume a deterministic mechanism: w_b = 1, w_0 = 0.

Bayesian inference
Evaluating causal models in light of data:
 P(h|d) = P(d|h) P(h) / Σh′ P(d|h′) P(h′)
Inferring a particular causal relation:
 P(A → E | d) = Σh P(A → E | h) P(h|d)

Modeling backwards blocking
Prior over hypotheses, and the deterministic likelihoods:

                     h00  h10  h01  h11
P(E=1 | A=0, B=0):    0    0    0    0
P(E=1 | A=1, B=0):    0    1    0    1
P(E=1 | A=0, B=1):    0    0    1    1
P(E=1 | A=1, B=1):    0    1    1    1

P(h00) = (1 − q)²  P(h10) = q(1 − q)  P(h01) = (1 − q)q  P(h11) = q²

Modeling backwards blocking
Trial 1: A and B on the detector, detector active.
P(E=1 | A=1, B=1) = 0 under h00, and 1 under h10, h01, h11, so h00 is eliminated and the remaining hypotheses are renormalized.

Modeling backwards blocking
Trial 2: A alone on the detector, detector active.
P(E=1 | A=1, B=0) = 1 under h10 and h11, but 0 under h01, so h01 is eliminated. Only h10 and h11 remain, giving
P(B → E | d) = q² / (q(1 − q) + q²) = q: B’s probability of being a blicket falls back to the prior.
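
The whole backwards-blocking computation fits in a few lines, enumerating the four hypotheses under the deterministic-detector theory above (q is a free parameter of the theory; 1/3 below is just for illustration):

```python
from itertools import product

q = 1 / 3   # prior probability that any given block is a blicket

# Hypotheses: (A is a blicket, B is a blicket), with independent priors
prior = {(a, b): (q if a else 1 - q) * (q if b else 1 - q)
         for a, b in product((0, 1), repeat=2)}

def detector(h, on_A, on_B):
    """Deterministic detector: active iff some blicket is on it (w_b = 1, w_0 = 0)."""
    a, b = h
    return int((on_A and a) or (on_B and b))

# Backwards-blocking trials: (A on?, B on?, detector active?)
trials = [(1, 1, 1), (1, 0, 1)]

post = dict(prior)
for on_A, on_B, e in trials:
    post = {h: p * (1.0 if detector(h, on_A, on_B) == e else 0.0)
            for h, p in post.items()}
    z = sum(post.values())
    post = {h: p / z for h, p in post.items()}

p_A = post[(1, 0)] + post[(1, 1)]   # P(A is a blicket | data): certain
p_B = post[(0, 1)] + post[(1, 1)]   # P(B is a blicket | data): back to the prior q
```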

Three kinds of causal induction contingency data physical systems perceived causality

Michotte (1963) Affected by… –timing of events –velocity of balls –proximity

A simple theory Newton+Noise

Strategy for evaluating the theory (Sanborn, Mansinghka, & Griffiths, 2009)
 –Generate a set of collisions varying in velocity, timing, and proximity
 –Ask people to judge each as “real” or “random”
 –Estimate P(d|real) from judgments of P(real|d), holding the prior fixed

Results
[Plots: human “real”/“random” judgments compared with the Newton + noise model]

Conclusions
Causal induction from contingency data is a well-studied problem, and one where learning structure as well as strength can provide insights.
Approaching causal induction as Bayesian inference with causal graphical models has the potential to explain many other phenomena:
 –learning from rates
 –learning complex causal structures
 –detecting coincidences
 –perceptual causality
 –inferences about physical causal systems
 –learning and reasoning about dynamic causal systems