
1 Bayesian models of inductive learning Josh Tenenbaum & Tom Griffiths MIT Computational Cognitive Science Group Department of Brain and Cognitive Sciences Computer Science and AI Lab (CSAIL)

2 What to expect What you'll get out of this tutorial: –Our view of what Bayesian models have to offer cognitive science. –In-depth examples of basic and advanced models: how the math works & what it buys you. –Some comparison to other approaches. –Opportunities to ask questions. What you won't get: –Detailed, hands-on how-to. –Where you can learn more:

3 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

4 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

5 Bayesian models in cognitive science Vision Motor control Memory Language Inductive learning and reasoning….

6 Everyday inductive leaps Learning concepts and words from examples [Images: horses, labeled "horse"]

7 Learning concepts and words [Images: novel objects, one labeled "tufa"] Can you pick out the tufas?

8 Inductive reasoning Input: Cows can get Hicks disease. Gorillas can get Hicks disease. (premises) All mammals can get Hicks disease. (conclusion) Task: Judge how likely the conclusion is to be true, given that the premises are true.

9 Inferring causal relations Input:

        Took vitamin B23   Headache
  Day 1 yes                no
  Day 2 yes                yes
  Day 3 no                 yes
  Day 4 yes                no

Does vitamin B23 cause headaches? Task: Judge probability of a causal link given several joint observations.

10 Everyday inductive leaps How can we learn so much about... –Properties of natural kinds –Meanings of words –Future outcomes of a dynamic process –Hidden causal properties of an object –Causes of a person's action (beliefs, goals) –Causal laws governing a domain... from such limited data?

11 The Challenge How do we generalize successfully from very limited data? –Just one or a few examples –Often only positive examples Philosophy: –Induction is a problem, a riddle, a paradox, a scandal, or a myth. Machine learning and statistics: –Focus on generalization from many examples, both positive and negative.

12 Rational statistical inference (Bayes, Laplace)

  P(h|d) = P(d|h) P(h) / Σ_{h' in H} P(d|h') P(h')

Posterior probability: P(h|d). Likelihood: P(d|h). Prior probability: P(h). The denominator sums over the space of hypotheses H.

13 Shepard (1987) –Analysis of one-shot stimulus generalization, to explain the universal exponential law. Anderson (1990) –Models of categorization and causal induction. Oaksford & Chater (1994) –Model of conditional reasoning (Wason selection task). Heit (1998) –Framework for category-based inductive reasoning. Bayesian models of inductive learning: some recent history

14 Rational statistical inference (Bayes): Learners' domain theories generate their hypothesis space H and prior p(h). –Well-matched to structure of the natural world. –Learnable from limited data. –Computationally tractable inference. Theory-Based Bayesian Models

15 What is a theory? Working definition –An ontology and a system of abstract principles that generates a hypothesis space of candidate world structures along with their relative probabilities. Analogy to grammar in language. Example: Newtons laws

16 Structure and statistics A framework for understanding how structured knowledge and statistical inference interact. –How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. –How simplicity trades off with fit to the data in evaluating structural hypotheses. –How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.

17 Structure and statistics A framework for understanding how structured knowledge and statistical inference interact. –How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. Hierarchical Bayes. –How simplicity trades off with fit to the data in evaluating structural hypotheses. Bayesian Occam's Razor. –How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance. Non-parametric Bayes.

18 Alternative approaches to inductive generalization Associative learning Connectionist networks Similarity to examples Toolkit of simple heuristics Constraint satisfaction Analogical mapping

19 Marr's Three Levels of Analysis Computation: What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out? Representation and algorithm: Cognitive psychology Implementation: Neurobiology

20 Why Bayes? A framework for explaining cognition. –How people can learn so much from such limited data. –Why process-level models work the way that they do. –Strong quantitative models with minimal ad hoc assumptions. A framework for understanding how structured knowledge and statistical inference interact. –How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. –How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam's razor). –How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.

21 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

22 Coin flipping

23 HHTHT HHHHH What process produced these sequences?

24 Bayes rule For data D and a hypothesis H, we have:

  P(H|D) = P(D|H) P(H) / P(D)

Posterior probability: P(H|D). Likelihood: P(D|H). Prior probability: P(H).

25 The origin of Bayes rule A simple consequence of using probability to represent degrees of belief For any two random variables:

  P(A & B) = P(A|B) P(B) = P(B|A) P(A)

so P(A|B) = P(B|A) P(A) / P(B).

26 Good statistics –consistency, and worst-case error bounds. Cox Axioms –necessary to cohere with common sense Dutch Book + Survival of the Fittest –if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord. Provides a theory of learning –a common currency for combining prior knowledge and the lessons of experience. Why represent degrees of belief with probabilities?

27 Bayes rule For data D and a hypothesis H, we have:

  P(H|D) = P(D|H) P(H) / P(D)

Posterior probability: P(H|D). Likelihood: P(D|H). Prior probability: P(H).

28 Hypotheses in Bayesian inference Hypotheses H refer to processes that could have generated the data D Bayesian inference provides a distribution over these hypotheses, given D P(D|H) is the probability of D being generated by the process identified by H Hypotheses H are mutually exclusive: only one process could have generated D

29 Hypotheses in coin flipping Describe processes by which D could be generated (statistical models): –Fair coin, P(H) = 0.5 –Coin with P(H) = p –Markov model –Hidden Markov model –... D = HHTHT

30 Hypotheses in coin flipping Describe processes by which D could be generated (generative models): –Fair coin, P(H) = 0.5 –Coin with P(H) = p –Markov model –Hidden Markov model –... D = HHTHT

31 Representing generative models Graphical model notation –Pearl (1988), Jordan (1998) Variables are nodes, edges indicate dependency Directed edges show causal process of data generation [Diagrams for D = HHTHT: fair coin, P(H) = 0.5, with independent nodes d1 d2 d3 d4 d5; Markov model with a chain d1 → d2 → d3 → d4]
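As an illustrative sketch of the directed factorization a Markov model implies (the 0.7/0.3 transition probabilities below are invented for the example, not from the slides):

```python
from itertools import product

# Hypothetical Markov model over coin flips: a distribution P(d1) for the
# first flip plus a transition distribution P(d_{i+1} | d_i).
p_first = {'H': 0.5, 'T': 0.5}
p_trans = {('H', 'H'): 0.7, ('H', 'T'): 0.3,
           ('T', 'H'): 0.3, ('T', 'T'): 0.7}

def joint(seq):
    """Directed factorization: P(d1) * product over i of P(d_{i+1} | d_i)."""
    prob = p_first[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_trans[(prev, cur)]
    return prob

# Probabilities of all length-3 sequences sum to 1, as they must.
total = sum(joint(''.join(s)) for s in product('HT', repeat=3))
print(joint('HHH'), total)
```

Each node contributes one conditional probability, so the joint over any sequence is just a product along the chain.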

32 Models with latent structure Not all nodes in a graphical model need to be observed Some variables reflect latent structure, used in generating D but unobserved [Diagrams for D = HHTHT: coin with P(H) = p, where p is a latent parent of d1 ... d5; hidden Markov model with latent states s1 → s2 → s3 → s4 emitting d1 ... d4]

33 Coin flipping Comparing two simple hypotheses –P( H ) = 0.5 vs. P( H ) = 1.0 Comparing simple and complex hypotheses –P( H ) = 0.5 vs. P( H ) = p Comparing infinitely many hypotheses –P( H ) = p Psychology: Representativeness

34 Coin flipping Comparing two simple hypotheses –P( H ) = 0.5 vs. P( H ) = 1.0 Comparing simple and complex hypotheses –P( H ) = 0.5 vs. P( H ) = p Comparing infinitely many hypotheses –P( H ) = p Psychology: Representativeness

35 Comparing two simple hypotheses Contrast simple hypotheses: –H1: fair coin, P(H) = 0.5 –H2: always heads, P(H) = 1.0 Bayes rule: with two hypotheses, use the odds form

36 Bayes rule in odds form

  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: data. H1, H2: models. P(H1|D): posterior probability H1 generated the data. P(D|H1): likelihood of data under model H1. P(H1): prior probability H1 generated the data.

37 Coin flipping HHTHT HHHHH What process produced these sequences?

38 Comparing two simple hypotheses

  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHTHT. H1, H2: fair coin, always heads. P(D|H1) = 1/2^5, P(H1) = 999/1000. P(D|H2) = 0, P(H2) = 1/1000. P(H1|D) / P(H2|D) = infinity

39 Comparing two simple hypotheses

  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHH. H1, H2: fair coin, always heads. P(D|H1) = 1/2^5, P(H1) = 999/1000. P(D|H2) = 1, P(H2) = 1/1000. P(H1|D) / P(H2|D) ≈ 30

40 Comparing two simple hypotheses

  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHHHHHHH. H1, H2: fair coin, always heads. P(D|H1) = 1/2^10, P(H1) = 999/1000. P(D|H2) = 1, P(H2) = 1/1000. P(H1|D) / P(H2|D) ≈ 1

41 Comparing two simple hypotheses Bayes rule tells us how to combine prior beliefs with new data –top-down and bottom-up influences As a model of human inference –predicts conclusions drawn from data –identifies point at which prior beliefs are overwhelmed by new experiences But… more complex cases?
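The posterior odds worked through on these slides can be reproduced in a few lines (a sketch, using the 999/1000 prior the slides assume; `posterior_odds` is just a name chosen here):

```python
from fractions import Fraction

def posterior_odds(n_heads, n_flips, prior_fair=Fraction(999, 1000)):
    """Posterior odds P(H1|D)/P(H2|D) for H1 = fair coin vs. H2 = always heads.

    P(D|H1) = (1/2)^n_flips for any sequence; P(D|H2) = 1 if the run is
    all heads, else 0.
    """
    if n_heads != n_flips:
        return float('inf')  # a single tail refutes the always-heads coin
    lik_fair = Fraction(1, 2) ** n_flips
    return lik_fair * prior_fair / (1 * (1 - prior_fair))

print(posterior_odds(3, 5))           # HHTHT: infinite odds in favor of fair
print(float(posterior_odds(5, 5)))    # HHHHH: 999/32, roughly 31, still favors fair
print(float(posterior_odds(10, 10)))  # ten heads: below 1, prior nearly overwhelmed
```

Note how the 1/1000 prior survives five heads but not ten, the "point at which prior beliefs are overwhelmed" mentioned above.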

42 Coin flipping Comparing two simple hypotheses –P( H ) = 0.5 vs. P( H ) = 1.0 Comparing simple and complex hypotheses –P( H ) = 0.5 vs. P( H ) = p Comparing infinitely many hypotheses –P( H ) = p Psychology: Representativeness

43 Comparing simple and complex hypotheses Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p? [Diagrams: fair coin, P(H) = 0.5, with nodes d1 ... d4, vs. the model with latent p as parent of d1 ... d4]

44 P( H ) = p is more complex than P( H ) = 0.5 in two ways: –P( H ) = 0.5 is a special case of P( H ) = p –for any observed sequence X, we can choose p such that X is more probable than if P( H ) = 0.5 Comparing simple and complex hypotheses

45 Comparing simple and complex hypotheses [Figure: probability of the observed sequence as a function of p]

46 Comparing simple and complex hypotheses [Figure: probability of HHHHH as a function of p, maximized at p = 1.0]

47 Comparing simple and complex hypotheses [Figure: probability of HHTHT as a function of p, maximized at p = 0.6]

48 P( H ) = p is more complex than P( H ) = 0.5 in two ways: –P( H ) = 0.5 is a special case of P( H ) = p –for any observed sequence X, we can choose p such that X is more probable than if P( H ) = 0.5 How can we deal with this? –frequentist: hypothesis testing –information theorist: minimum description length –Bayesian: just use probability theory! Comparing simple and complex hypotheses

49 Comparing simple and complex hypotheses

  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

Computing P(D|H1) is easy: P(D|H1) = 1/2^N. Compute P(D|H2) by averaging over p:

  P(D|H2) = ∫ P(D|p) p(p) dp

50 Comparing simple and complex hypotheses [Figure: the distribution P(D|H2) is an average over all values of p]

51 Comparing simple and complex hypotheses [Figure: the distribution P(D|H2) is an average over all values of p]

52 Comparing simple and complex hypotheses Simple and complex hypotheses can be compared directly using Bayes rule –requires summing over latent variables Complex hypotheses are penalized for their greater flexibility: Bayesian Occam's razor This principle is used in model selection methods in psychology (e.g. Myung & Pitt, 1997)
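The penalty for flexibility can be made concrete (a sketch, assuming a uniform prior on p for H2, so the average over p reduces to the Beta-function identity noted in the comment):

```python
from math import comb

def lik_fair(n_heads, n_tails):
    """P(D|H1): under a fair coin every length-N sequence has probability (1/2)^N."""
    return 0.5 ** (n_heads + n_tails)

def lik_flexible(n_heads, n_tails):
    """P(D|H2): average of p^NH * (1-p)^NT over a uniform prior on p.

    The integral is the Beta function B(NH+1, NT+1) = 1 / ((N+1) * C(N, NH)).
    """
    n = n_heads + n_tails
    return 1 / ((n + 1) * comb(n, n_heads))

# HHTHT (3 heads, 2 tails): the flexible model loses, because it spreads
# probability over sequences it did not see (Bayesian Occam's razor).
print(lik_fair(3, 2), lik_flexible(3, 2))
# HHHHH (5 heads, 0 tails): now the flexible model wins.
print(lik_fair(5, 0), lik_flexible(5, 0))
```

So even though some p fits HHTHT better than 0.5, averaging over all p makes the flexible hypothesis the worse account of that sequence.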

53 Coin flipping Comparing two simple hypotheses –P( H ) = 0.5 vs. P( H ) = 1.0 Comparing simple and complex hypotheses –P( H ) = 0.5 vs. P( H ) = p Comparing infinitely many hypotheses –P( H ) = p Psychology: Representativeness

54 Comparing infinitely many hypotheses Assume data are generated from a model with P(H) = p: What is the value of p? –each value of p is a hypothesis H –requires inference over infinitely many hypotheses [Diagram: latent p as parent of d1 ... d4]

55 Flip a coin 10 times and see 5 heads, 5 tails. P( H ) on next flip? 50% Why? 50% = 5 / (5+5) = 5/10. Future will be like the past. Suppose we had seen 4 heads and 6 tails. P( H ) on next flip? Closer to 50% than to 40%. Why? Prior knowledge. Comparing infinitely many hypotheses

56 Integrating prior knowledge and data

  P(p|D) ∝ P(D|p) P(p)

Posterior distribution P(p|D) is a probability density over p = P(H) Need to work out likelihood P(D|p) and specify prior distribution P(p)

57 Likelihood and prior Likelihood: P(D|p) = p^N_H (1-p)^N_T –N_H: number of heads –N_T: number of tails Prior: P(p) ∝ p^(F_H-1) (1-p)^(F_T-1) ?

58 A simple method of specifying priors Imagine some fictitious trials, reflecting a set of previous experiences –strategy often used with neural networks e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair In fact, this is a sensible statistical idea...

59 Likelihood and prior Likelihood: P(D|p) = p^N_H (1-p)^N_T –N_H: number of heads –N_T: number of tails Prior: P(p) ∝ p^(F_H-1) (1-p)^(F_T-1), i.e. Beta(F_H, F_T) –F_H: fictitious observations of heads –F_T: fictitious observations of tails

60 Conjugate priors Exist for many standard distributions –formula for exponential family conjugacy Define prior in terms of fictitious observations Beta is conjugate to Bernoulli (coin-flipping) [Figure: Beta densities for F_H = F_T = 1, F_H = F_T = 3, F_H = F_T = 1000]

61 Likelihood and prior Likelihood: P(D|p) = p^N_H (1-p)^N_T –N_H: number of heads –N_T: number of tails Prior: P(p) ∝ p^(F_H-1) (1-p)^(F_T-1) –F_H: fictitious observations of heads –F_T: fictitious observations of tails

62 Comparing infinitely many hypotheses Posterior is Beta(N_H+F_H, N_T+F_T) –same form as conjugate prior

  P(p|D) ∝ P(D|p) P(p) = p^(N_H+F_H-1) (1-p)^(N_T+F_T-1)

Posterior mean: (N_H+F_H) / (N_H+F_H+N_T+F_T). Posterior predictive distribution: P(H on next flip) equals the posterior mean.

63 Some examples e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004 + 1006) = 49.95% e.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75% Prior knowledge too weak
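The conjugate update on these slides is one line of arithmetic; a sketch (the function name is chosen here for illustration):

```python
def predictive_heads(n_heads, n_tails, f_heads, f_tails):
    """P(H on next flip): the mean of the Beta(N_H+F_H, N_T+F_T) posterior,
    treating fictitious observations F exactly like real ones."""
    return (n_heads + f_heads) / (n_heads + f_heads + n_tails + f_tails)

print(predictive_heads(4, 6, 1000, 1000))  # strong prior barely moves: ~0.4995
print(predictive_heads(4, 6, 3, 3))        # weak prior tracks the data: 0.4375
print(predictive_heads(2, 0, 0, 0))        # no prior at all: 1.0, a hasty conclusion
```

The last call reproduces the F = {} case from the thumbtack slide: with no fictitious observations, two heads in a row yield certainty.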

64 But… flipping thumbtacks e.g., F ={4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads After seeing 2 heads, 0 tails, P( H ) on next flip = 6 / (6+3) = 67% Some prior knowledge is always necessary to avoid jumping to hasty conclusions... Suppose F = { }: After seeing 2 heads, 0 tails, P( H ) on next flip = 2 / (2+0) = 100%

65 Origin of prior knowledge Tempting answer: prior experience Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails By assuming all coins (and flips) are alike, these observations of other coins are as good as observations of the present coin

66 Problems with simple empiricism Haven't really seen 2000 coin flips, or any flips of a thumbtack –Prior knowledge is stronger than raw experience justifies Haven't seen exactly equal number of heads and tails –Prior knowledge is smoother than raw experience justifies Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins –Prior knowledge is more structured than raw experience

67 A simple theory Coins are manufactured by a standardized procedure that is effective but not perfect. –Justifies generalizing from previous coins to the present coin. –Justifies smoother and stronger prior than raw experience alone. –Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin. Tacks are asymmetric, and manufactured to less exacting standards.

68 Limitations Can all domain knowledge be represented so simply, in terms of an equivalent number of fictitious observations? Suppose you flip a coin 25 times and get all heads. Something funny is going on… But with F = {1000 heads, 1000 tails}, P(H) on next flip = 1025 / (1025 + 1000) = 50.6%. Looks like nothing unusual

69 Hierarchical priors Higher-order hypothesis: is this coin fair or unfair? Example probabilities: –P(fair) = 0.99 –P(p|fair) is Beta(1000,1000) –P(p|unfair) is Beta(1,1) 25 heads in a row propagates up, affecting p and then P(fair|D):

  P(fair|25 heads) / P(unfair|25 heads) = [P(25 heads|fair) / P(25 heads|unfair)] × [P(fair) / P(unfair)] ≈ 9 × 10^-5

[Diagram: fair → p → d1 ... d4]
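That 9 × 10^-5 can be checked by integrating p out of each branch, which for a run of heads reduces to a chain of Beta-Binomial predictive probabilities (a sketch; `prob_all_heads` is a name chosen here):

```python
def prob_all_heads(n, f_heads, f_tails):
    """P(n heads in a row) with p integrated out under a Beta(F_H, F_T)
    prior: multiply successive posterior-predictive probabilities of heads."""
    prob = 1.0
    for i in range(n):
        prob *= (f_heads + i) / (f_heads + f_tails + i)
    return prob

p_fair = 0.99
odds = (prob_all_heads(25, 1000, 1000) * p_fair) / \
       (prob_all_heads(25, 1, 1) * (1 - p_fair))
print(odds)  # close to 9e-5: 25 heads overwhelm the 99% prior on "fair"
```

Despite the 99-to-1 prior for fairness, the run of heads leaves "fair" with negligible posterior probability, which is the propagation up the hierarchy the slide describes.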

70 More hierarchical priors Latent structure can capture coin variability: 10 flips from each of 200 coins is better than 2000 flips from a single coin, because it allows estimation of F_H, F_T [Diagram: shared F_H, F_T at the top level; each coin (Coin 1, Coin 2, ...) has its own p ~ Beta(F_H, F_T) generating its flips d1 ... d4]

71 Yet more hierarchical priors Discrete beliefs (e.g. symmetry) can influence estimation of continuous properties (e.g. F_H, F_T) [Diagram: physical knowledge → F_H, F_T → per-coin p → flips d1 ... d4]

72 Apply Bayes rule to obtain posterior probability density Requires prior over all hypotheses –computation simplified by conjugate priors –richer structure with hierarchical priors Hierarchical priors indicate how simple theories can inform statistical inferences –one step towards structure and statistics Comparing infinitely many hypotheses

73 Coin flipping Comparing two simple hypotheses –P( H ) = 0.5 vs. P( H ) = 1.0 Comparing simple and complex hypotheses –P( H ) = 0.5 vs. P( H ) = p Comparing infinitely many hypotheses –P( H ) = p Psychology: Representativeness

74 Which sequence is more likely from a fair coin? HHTHT HHHHH HHTHT seems more representative of a fair coin (Kahneman & Tversky, 1972)

75 What might representativeness mean? In the odds form

  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

with H1: random process (fair coin) and H2: alternative processes, evidence for a random generating process is the likelihood ratio P(D|H1) / P(D|H2).

76 A constrained hypothesis space Four hypotheses: –h1: fair coin (e.g. HHTHTTTH) –h2: always alternates (e.g. HTHTHTHT) –h3: mostly heads (e.g. HHTHTHHH) –h4: always heads (HHHHHHHH)

77 Representativeness judgments [Figure: representativeness judgments for sequences under the four hypotheses]

78 Results Good account of representativeness data, with three pseudo-free parameters, r = 0.91 –"always alternates" means alternating 99% of the time –"mostly heads" means P(H) = 0.85 –"always heads" means P(H) = 0.99 With scaling parameter, r = 0.95 (Tenenbaum & Griffiths, 2001)

79 The role of theories The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works. –Easy to imagine how a trick all-heads coin could work: high prior probability. –Hard to imagine how a trick HHTHT coin could work: low prior probability.

80 Summary Three kinds of Bayesian inference –comparing two simple hypotheses –comparing simple and complex hypotheses –comparing an infinite number of hypotheses Critical notions: –generative models, graphical models –Bayesian Occam's razor –priors: conjugate, hierarchical (theories)

81 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

82 Rules and similarity

83 Structure versus statistics Rules, Logic, Symbols vs. Statistics, Similarity, Typicality

84 A better metaphor


86 Structure and statistics Rules, Logic, Symbols together with Statistics, Similarity, Typicality

87 Structure and statistics Basic case study #1: Flipping coins –Learning and reasoning with structured statistical models. Basic case study #2: Rules and similarity –Statistical learning with structured representations.

88 The number game Program input: number between 1 and 100 Program output: yes or no

89 The number game Learning task: –Observe one or more positive (yes) examples. –Judge whether other numbers are yes or no.

90 The number game Examples of yes numbers: 60 [Figure: generalization judgments (N = 20)] Diffuse similarity

91 The number game [Figure: examples of yes numbers and generalization judgments (N = 20)] Diffuse similarity Rule: multiples of 10

92 The number game [Figure: examples of yes numbers and generalization judgments (N = 20)] Diffuse similarity Rule: multiples of 10 Focused similarity: numbers near 50-60

93 The number game [Figure: examples of yes numbers and generalization judgments (N = 20)] Diffuse similarity Rule: powers of 2 Focused similarity: numbers near 20

94 The number game Main phenomena to explain: –Generalization can appear either similarity-based (graded) or rule-based (all-or-none). –Learning from just a few positive examples [Figure labels: diffuse similarity; rule: multiples of 10; focused similarity: numbers near …]

95 Rule/similarity hybrid models Category learning –Nosofsky, Palmeri et al.: RULEX –Erickson & Kruschke: ATRIUM

96 Divisions into rule and similarity subsystems Category learning –Nosofsky, Palmeri et al.: RULEX –Erickson & Kruschke: ATRIUM Language processing –Pinker, Marcus et al.: Past tense morphology Reasoning –Sloman –Rips –Nisbett, Smith et al.

97 Rule/similarity hybrid models Why two modules? Why do these modules work the way that they do, and interact as they do? How do people infer a rule or similarity metric from just a few positive examples?

98 H: Hypothesis space of possible concepts: –h 1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (even numbers) –h 2 = {10, 20, 30, 40, …, 90, 100} (multiples of 10) –h 3 = {2, 4, 8, 16, 32, 64} (powers of 2) –h 4 = {50, 51, 52, …, 59, 60} (numbers between 50 and 60) –... Bayesian model Representational interpretations for H: – Candidate rules – Features for similarity – Consequential subsets (Shepard, 1987)

99 Inferring hypotheses from similarity judgments Additive clustering (Shepard & Arabie, 1977):

  s_ij = Σ_k w_k f_ik f_jk

s_ij: similarity of stimuli i, j. w_k: weight of cluster k. f_ik: membership of stimulus i in cluster k (1 if stimulus i in cluster k, 0 otherwise). Equivalent to similarity as a weighted sum of common features (Tversky, 1977).

100 Additive clustering for the integers 0-9 [Table: rank, weight, and stimulus membership per cluster; the recovered clusters, in order of weight, are interpreted as: powers of two; small numbers; multiples of three; large numbers; middle numbers; odd numbers; smallish numbers; largish numbers]

101 Three hypothesis subspaces for number concepts Mathematical properties (24 hypotheses): –Odd, even, square, cube, prime numbers –Multiples of small integers –Powers of small integers Raw magnitude (5050 hypotheses): –All intervals of integers with endpoints between 1 and 100. Approximate magnitude (10 hypotheses): –Decades (1-10, 10-20, 20-30, …)

102 Hypothesis spaces and theories Why a hypothesis space is like a domain theory: –Represents one particular way of classifying entities in a domain. –Not just an arbitrary collection of hypotheses, but a principled system. What's missing? –Explicit representation of the principles. Hypothesis spaces (and priors) are generated by theories. Some analogies: –Grammars generate languages (and priors over structural descriptions) –Hierarchical Bayesian modeling

103 H: Hypothesis space of possible concepts: –Mathematical properties: even, odd, square, prime,.... –Approximate magnitude: {1-10}, {10-20}, {20-30},.... –Raw magnitude: all intervals between 1 and 100. X = {x 1,..., x n }: n examples of a concept C. Evaluate hypotheses given data: –p(h) [prior]: domain knowledge, pre-existing biases –p(X|h) [likelihood]: statistical information in examples. –p(h|X) [posterior]: degree of belief that h is the true extension of C. Bayesian model

104 H: Hypothesis space of possible concepts: –Mathematical properties: even, odd, square, prime,.... –Approximate magnitude: {1-10}, {10-20}, {20-30},.... –Raw magnitude: all intervals between 1 and 100. X = {x 1,..., x n }: n examples of a concept C. Evaluate hypotheses given data: –p(h) [prior]: domain knowledge, pre-existing biases –p(X|h) [likelihood]: statistical information in examples. –p(h|X) [posterior]: degree of belief that h is the true extension of C. Bayesian model

105 Likelihood: p(X|h) Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases:

  p(X|h) = [1 / size(h)]^n  if x_1, …, x_n ∈ h (and 0 otherwise)

Follows from assumption of randomly sampled examples. Captures the intuition of a representative sample.

106 Illustrating the size principle [Figure: a small hypothesis h1 nested inside a larger hypothesis h2]

107 Illustrating the size principle [Figure: h1 and h2 with two examples] Data slightly more of a coincidence under h1

108 Illustrating the size principle [Figure: h1 and h2 with more examples] Data much more of a coincidence under h1

109 Bayesian Occam's Razor For any model M, the Law of Conservation of Belief:

  Σ over all possible data sets d of p(D = d | M) = 1

[Figure: p(D = d|M) across all possible data sets; a simple model M1 concentrates its probability on few data sets, a complex model M2 spreads it thinly]

110 Comparing simple and complex hypotheses [Figure: the distribution P(D|H2) is an average over all values of p]

111 Prior: p(h) Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses. Prevents overfitting by highly specific but unnatural hypotheses, e.g. multiples of 10 except 50 and 70.

112 Prior: p(h) Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses. Prevents overfitting by highly specific but unnatural hypotheses, e.g. multiples of 10 except 50 and 70. p(h) encodes relative weights of alternative theories within the total hypothesis space H: –H1: math properties (24 hypotheses: even numbers, powers of two, multiples of three, …), p(H1) = 1/5, so p(h) = p(H1)/24 –H2: raw magnitude (5050 hypotheses), p(H2) = 3/5, so p(h) = p(H2)/5050 –H3: approximate magnitude (10 hypotheses), p(H3) = 1/5, so p(h) = p(H3)/10

113 A more complex approach to priors Start with a base set of regularities R and combination operators C. Hypothesis space = closure of R under C. –C = {and, or}: H = unions and intersections of regularities in R (e.g., multiples of 10 between 30 and 70). –C = {and-not}: H = regularities in R with exceptions (e.g., multiples of 10 except 50 and 70). Two qualitatively similar priors: –Description length: number of combinations in C needed to generate hypothesis from R. –Bayesian Occam's Razor, with model classes defined by number of combinations: more combinations → more hypotheses → lower prior

114 Posterior: p(h|X) ∝ p(X|h) p(h) X = {60, 80, 10, 30} Why prefer multiples of 10 over even numbers? p(X|h). Why prefer multiples of 10 over multiples of 10 except 50 and 20? p(h). Why does a good generalization need both high prior and high likelihood?

115 Bayesian Occam's Razor Probabilities provide a common currency for balancing model complexity with fit to the data.

116 Generalizing to new objects Given p(h|X), how do we compute p(y ∈ C | X), the probability that C applies to some new stimulus y?

117 Generalizing to new objects Hypothesis averaging: Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X):

  p(y ∈ C | X) = Σ_h p(y ∈ C | h) p(h|X)
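Hypothesis averaging can be sketched end to end on a toy three-hypothesis space (a tiny, illustrative subset of the number-game hypothesis space; uniform priors are assumed here for simplicity):

```python
def generalize(y, examples, hypotheses, priors):
    """P(y in C | X): sum of [y in h] * p(h|X), with p(h|X) proportional to
    p(X|h) p(h), using the size-principle likelihood (1/|h|)^n."""
    def lik(h):
        if any(x not in h for x in examples):
            return 0.0
        return (1.0 / len(h)) ** len(examples)
    weights = [lik(h) * pr for h, pr in zip(hypotheses, priors)]
    z = sum(weights)
    return sum(w for h, w in zip(hypotheses, weights) if y in h) / z

evens = set(range(2, 101, 2))
mult10 = set(range(10, 101, 10))
pow2 = {2, 4, 8, 16, 32, 64}
hyps, priors = [evens, mult10, pow2], [1 / 3, 1 / 3, 1 / 3]

print(generalize(22, [60], hyps, priors))              # one example: graded, ~0.17
print(generalize(22, [60, 80, 10, 30], hyps, priors))  # four examples: rule-like, ~0.002
```

With one example the posterior is spread across hypotheses and generalization looks like a similarity gradient; with four, "multiples of 10" dominates and generalization becomes nearly all-or-none, exactly the broad-versus-narrow p(h|X) contrast summarized on the following slides.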

118 Examples: 16 [Figure: generalization judgments after the single example 16]

119 Connection to feature-based similarity Additive clustering model of similarity: s_ij = Σ_k w_k f_ik f_jk. Bayesian hypothesis averaging: p(y ∈ C|X) = Σ_h p(y ∈ C|h) p(h|X). Equivalent if we identify features f_k with hypotheses h, and weights w_k with the posterior p(h|X).

120 Examples: [Figure: generalization judgments for a further example set]

121 Examples: [Figure: generalization judgments for a further example set]

122 Model fits [Figure: examples of yes numbers; human generalization judgments (N = 20) vs. the Bayesian model, r = 0.96]

123 Model fits [Figure: examples of yes numbers; human generalization judgments (N = 20) vs. the Bayesian model, r = 0.93]

124 Summary of the Bayesian model How do the statistics of the examples interact with prior knowledge to guide generalization? Why does generalization appear rule-based or similarity-based? broad p(h|X): similarity gradient narrow p(h|X): all-or-none rule

125 Summary of the Bayesian model How do the statistics of the examples interact with prior knowledge to guide generalization? Why does generalization appear rule-based or similarity-based? Many h of similar size: broad p(h|X) One h much smaller: narrow p(h|X)

126 Alternative models Neural networks [Diagram: network with input features such as even, multiple of 10, power of 2, …]

127 Alternative models Neural networks Hypothesis ranking and elimination [Diagram: network with input features such as even, multiple of 10, power of 2, …; ranked list of hypotheses]

128 Alternative models Neural networks Hypothesis ranking and elimination Similarity to exemplars –Average similarity: [Figure: data vs. model, r = 0.80]

129 Alternative models Neural networks Hypothesis ranking and elimination Similarity to exemplars –Max similarity: [Figure: data vs. model, r = 0.64]

130 Alternative models Neural networks Hypothesis ranking and elimination Similarity to exemplars –Average similarity –Max similarity –Flexible similarity? Bayes.

131 Alternative models Neural networks Hypothesis ranking and elimination Similarity to exemplars Toolbox of simple heuristics –60: general similarity – : most specific rule (subset principle). – : similarity in magnitude Why these heuristics? When to use which heuristic? Bayes.

132 Summary Generalization from limited data possible via the interaction of structured knowledge and statistics. –Structured knowledge: space of candidate rules; theories generate hypothesis space (cf. hierarchical priors) –Statistics: Bayesian Occam's razor. Better understand the interactions between traditionally opposing concepts: –Rules and statistics –Rules and similarity Explains why central but notoriously slippery processing-level concepts work the way they do. –Similarity –Representativeness –Rules and representativeness

133 Why Bayes? A framework for explaining cognition. –How people can learn so much from such limited data. –Why process-level models work the way that they do. –Strong quantitative models with minimal ad hoc assumptions. A framework for understanding how structured knowledge and statistical inference interact. –How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. –How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam's razor). –How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.

134 Rational statistical inference (Bayes): Learners' domain theories generate their hypothesis space H and prior p(h). –Well-matched to structure of the natural world. –Learnable from limited data. –Computationally tractable inference. Theory-Based Bayesian Models

135 Looking towards the afternoon How do we apply these ideas to more natural and complex aspects of cognition? Where do the hypothesis spaces come from? Can we formalize the contributions of domain theories?


137 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

138 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

139 Marr's Three Levels of Analysis Computation: What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out? Representation and algorithm: Cognitive psychology Implementation: Neurobiology

140 Working at the computational level What is the computational problem? –input: data –output: solution statistical

141 Working at the computational level What is the computational problem? –input: data –output: solution What knowledge is available to the learner? Where does that knowledge come from? statistical

142 Rational statistical inference (Bayes): Learners' domain theories generate their hypothesis space H and prior p(h). –Well-matched to structure of the natural world. –Learnable from limited data. –Computationally tractable inference. Theory-Based Bayesian Models

143 Causality

144 Bayes nets and beyond... Increasingly popular approach to studying human causal inferences (e.g. Glymour, 2001; Gopnik et al., 2004) Three reactions: –Bayes nets are the solution! –Bayes nets are missing the point, not sure why… –what is a Bayes net?

145 Bayes nets and beyond... What are Bayes nets? –graphical models –causal graphical models An example: elemental causal induction Beyond Bayes nets… –other knowledge in causal induction –formalizing causal theories

146 Bayes nets and beyond... What are Bayes nets? –graphical models –causal graphical models An example: elemental causal induction Beyond Bayes nets… –other knowledge in causal induction –formalizing causal theories

147 Graphical models Express the probabilistic dependency structure among a set of variables (Pearl, 1988) Consist of –a set of nodes, corresponding to variables –a set of edges, indicating dependency –a set of functions defined on the graph that defines a probability distribution

148 Undirected graphical models Consist of –a set of nodes –a set of edges –a potential for each clique, multiplied together to yield the distribution over variables Examples –statistical physics: Ising model, spin glasses –early neural networks (e.g. Boltzmann machines) X1 X2 X3 X4 X5

149 Directed graphical models X3 X4 X5 X1 X2 Consist of –a set of nodes –a set of edges –a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables Constrained to directed acyclic graphs (DAGs) AKA: Bayesian networks, Bayes nets

150 Bayesian networks and Bayes Two different problems –Bayesian statistics is a method of inference –Bayesian networks are a form of representation There is no necessary connection –many users of Bayesian networks rely upon frequentist statistical methods (e.g. Glymour) –many Bayesian inferences cannot be easily represented using Bayesian networks

151 Properties of Bayesian networks Efficient representation and inference –exploiting dependency structure makes it easier to represent and compute with probabilities Explaining away –pattern of probabilistic reasoning characteristic of Bayesian networks, especially early use in AI

152 Three binary variables: Cavity, Toothache, Catch Efficient representation and inference

153 Three binary variables: Cavity, Toothache, Catch Specifying P(Cavity, Toothache, Catch) requires 7 parameters (1 for each set of values, minus 1 because it's a probability distribution) With n variables, we need 2^n − 1 parameters Here n=3. Realistically, many more: X-ray, diet, oral hygiene, personality,.... Efficient representation and inference

154 All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity In probabilistic terms: P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity) With n evidence variables, x_1, …, x_n, we need only 2n conditional probabilities: Conditional independence

155 Graphical representation of relations between a set of random variables: Probabilistic interpretation: factorizing complex terms P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity) A simple Bayesian network Cavity Toothache Catch
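The factorization this slide describes is easy to check numerically. A minimal Python sketch, with invented conditional probabilities (the slides give no numbers):

```python
# Factorization: P(Cavity, Toothache, Catch) =
#   P(Cavity) * P(Toothache | Cavity) * P(Catch | Cavity)
# All probability values below are made up for illustration.
p_cavity = 0.1
p_toothache_given = {True: 0.8, False: 0.05}   # P(Toothache=1 | Cavity)
p_catch_given     = {True: 0.9, False: 0.1}    # P(Catch=1 | Cavity)

def joint(cavity, toothache, catch):
    """Joint probability built from the three local tables."""
    p = p_cavity if cavity else 1 - p_cavity
    p *= p_toothache_given[cavity] if toothache else 1 - p_toothache_given[cavity]
    p *= p_catch_given[cavity] if catch else 1 - p_catch_given[cavity]
    return p

# 5 numbers specify the model instead of the 2^3 - 1 = 7 a full joint
# table needs, yet the factorized joint still sums to 1 over all 8 outcomes.
total = sum(joint(c, t, k) for c in (True, False)
            for t in (True, False) for k in (True, False))
```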

156 Joint distribution sufficient for any inference: P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S) A more complex system Battery Radio Ignition Gas Starts On time to work

157 Joint distribution sufficient for any inference: A more complex system Battery Radio Ignition Gas Starts On time to work

158 Joint distribution sufficient for any inference: General inference algorithm: local message passing (belief propagation; Pearl, 1988) –efficiency depends on sparseness of graph structure A more complex system Battery Radio Ignition Gas Starts On time to work
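Exact inference by summing out the joint, as these slides describe, can be sketched as follows. The network shape follows the slide (Battery → Radio, Battery → Ignition; Ignition, Gas → Starts), but every CPT number is invented for illustration, and the "On time to work" node is omitted for brevity:

```python
from itertools import product

# Hypothetical CPTs for the car-starting network on this slide.
P_b = 0.95                  # P(Battery ok)
P_r = {1: 0.9, 0: 0.0}      # P(Radio on | Battery)
P_i = {1: 0.95, 0: 0.0}     # P(Ignition | Battery)
P_g = 0.9                   # P(Gas)
def P_s(i, g): return 0.99 if (i and g) else 0.0   # P(Starts | Ignition, Gas)

def joint(b, r, i, g, s):
    """Product of the local conditional tables (the factorized joint)."""
    p = P_b if b else 1 - P_b
    p *= P_r[b] if r else 1 - P_r[b]
    p *= P_i[b] if i else 1 - P_i[b]
    p *= P_g if g else 1 - P_g
    p *= P_s(i, g) if s else 1 - P_s(i, g)
    return p

def prob(query, evidence):
    """P(query | evidence) by brute-force summation -- exponential, but exact."""
    num = den = 0.0
    for b, r, i, g, s in product((0, 1), repeat=5):
        world = dict(b=b, r=r, i=i, g=g, s=s)
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(b, r, i, g, s)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den

# Hearing the radio is evidence that the battery is ok, so Starts goes up:
p_starts = prob(dict(s=1), {})
p_starts_given_radio = prob(dict(s=1), dict(r=1))
```

Belief propagation reaches the same answers by passing local messages instead of enumerating all 2^5 worlds; the enumeration above is just the simplest correct baseline.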

159 Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on: W = R ∨ S Explaining away Rain Sprinkler Grass Wet

160 Explaining away Rain Sprinkler Grass Wet Compute probability it rained last night, given that the grass is wet:

161 Explaining away Rain Sprinkler Grass Wet Compute probability it rained last night, given that the grass is wet:

162 Explaining away Rain Sprinkler Grass Wet Compute probability it rained last night, given that the grass is wet:

163 Explaining away Rain Sprinkler Grass Wet Compute probability it rained last night, given that the grass is wet:

164 Explaining away Rain Sprinkler Grass Wet Compute probability it rained last night, given that the grass is wet: P(r|w) = P(w|r) P(r) / P(w) = P(r) / [P(r) + P(s) − P(r) P(s)] Denominator: between 1 and P(s), so P(r|w) ≥ P(r)

165 Explaining away Rain Sprinkler Grass Wet Compute probability it rained last night, given that the grass is wet and sprinklers were left on: P(r|w,s) = P(w|r,s) P(r|s) / P(w|s) = P(r) Both terms P(w|r,s) and P(w|s) = 1

166 Explaining away Rain Sprinkler Grass Wet Compute probability it rained last night, given that the grass is wet and sprinklers were left on:

167 Explaining away Rain Sprinkler Grass Wet Discounting to prior probability.
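The explaining-away pattern in these slides can be reproduced with a few lines of enumeration. The deterministic relation Wet = Rain or Sprinkler is from slide 159; the two priors are invented for illustration:

```python
# Explaining away in the Rain/Sprinkler/Wet network.
# Priors are made up; Wet is the deterministic OR of its parents.
p_rain, p_sprinkler = 0.3, 0.4

def posterior_rain(wet=True, sprinkler=None):
    """P(Rain | evidence) by enumerating the four (rain, sprinkler) worlds."""
    num = den = 0.0
    for r in (0, 1):
        for s in (0, 1):
            if sprinkler is not None and s != sprinkler:
                continue
            w = int(r or s)              # deterministic OR
            if w != int(wet):
                continue
            p = (p_rain if r else 1 - p_rain) * \
                (p_sprinkler if s else 1 - p_sprinkler)
            num += p * r
            den += p
    return num / den

p_r_given_w  = posterior_rain(wet=True)               # raised above the prior
p_r_given_ws = posterior_rain(wet=True, sprinkler=1)  # discounted back to it
```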

168 Contrast w/ production system Formulate IF-THEN rules: –IF Rain THEN Wet –IF Wet THEN Rain –IF Wet AND NOT Sprinkler THEN Rain Rules do not distinguish directions of inference Requires combinatorial explosion of rules Rain Sprinkler Grass Wet

169 Observing rain, Wet becomes more active. Observing grass wet, Rain and Sprinkler become more active. Observing grass wet and sprinkler, Rain cannot become less active. No explaining away! Excitatory links: Rain → Wet, Sprinkler → Wet Contrast w/ spreading activation Rain Sprinkler Grass Wet

170 Observing grass wet, Rain and Sprinkler become more active. Observing grass wet and sprinkler, Rain becomes less active: explaining away. Excitatory links: Rain → Wet, Sprinkler → Wet Inhibitory link: Rain – Sprinkler Contrast w/ spreading activation Rain Sprinkler Grass Wet

171 Each new variable requires more inhibitory connections. Interactions between variables are not causal. Not modular. –Whether a connection exists depends on what other connections exist, in non-transparent ways. –Big holism problem. –Combinatorial explosion. Contrast w/ spreading activation Rain Sprinkler Grass Wet Burst pipe

172 Graphical models Capture dependency structure in distributions Provide an efficient means of representing and reasoning with probabilities Allow kinds of inference that are problematic for other representations: explaining away –hard to capture in a production system –hard to capture with spreading activation

173 Bayes nets and beyond... What are Bayes nets? –graphical models –causal graphical models An example: causal induction Beyond Bayes nets… –other knowledge in causal induction –formalizing causal theories

174 Causal graphical models Graphical models represent statistical dependencies among variables (i.e., correlations) –can answer questions about observations Causal graphical models represent causal dependencies among variables –express underlying causal structure –can answer questions about both observations and interventions (actions upon a variable)

175 Observation and intervention Battery Radio Ignition Gas Starts On time to work Graphical model: P(Radio|Ignition) Causal graphical model: P(Radio|do(Ignition))

176 Observation and intervention Battery Radio Ignition Gas Starts On time to work Graphical model: P(Radio|Ignition) Causal graphical model: P(Radio|do(Ignition)) graph surgery produces a mutilated graph

177 Assessing interventions To compute P(Y|do(X=x)), delete all edges coming into X and reason with the resulting Bayesian network (do calculus; Pearl, 2000) Allows a single structure to make predictions about both observations and interventions
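Graph surgery can be illustrated on the Battery fragment of the car network: observing the ignition is evidence about the battery (and hence the radio), while do(Ignition) deletes the Battery → Ignition edge. All CPT numbers are invented:

```python
# Observation vs. intervention: P(Radio | Ignition) vs. P(Radio | do(Ignition)).
# Hypothetical numbers for the fragment Battery -> Radio, Battery -> Ignition.
P_b = 0.95                  # P(Battery ok)
P_r = {1: 0.9, 0: 0.0}      # P(Radio on | Battery)
P_i = {1: 0.95, 0: 0.0}     # P(Ignition | Battery)

def p_radio_given_ignition():
    """Observation: condition on Ignition = 1 in the intact graph."""
    num = den = 0.0
    for b in (0, 1):
        pb = P_b if b else 1 - P_b
        den += pb * P_i[b]
        num += pb * P_i[b] * P_r[b]
    return num / den

def p_radio_do_ignition():
    """Intervention: surgery deletes Battery -> Ignition, so forcing the
    ignition carries no information about the battery or the radio."""
    return sum((P_b if b else 1 - P_b) * P_r[b] for b in (0, 1))
```

Because a dead battery kills the ignition here, observing Ignition = 1 raises P(Radio) above its interventional value.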

178 Using a representation in which the direction of causality is correct produces sparser graphs Suppose we get the direction of causality wrong, thinking that symptoms cause diseases: Does not capture the correlation between symptoms: falsely believe P(Ache, Catch) = P(Ache) P(Catch). Causality simplifies inference Ache Catch Cavity

179 Using a representation in which the direction of causality is correct produces sparser graphs Suppose we get the direction of causality wrong, thinking that symptoms cause diseases: Inserting a new arrow allows us to capture this correlation. This model is too complex: we do not believe that one symptom directly influences the other. Ache Catch Cavity Causality simplifies inference

180 Using a representation in which the direction of causality is correct produces sparser graphs Suppose we get the direction of causality wrong, thinking that symptoms cause diseases: New symptoms require a combinatorial proliferation of new arrows. This reduces efficiency of inference. Ache Catch Cavity X-ray Causality simplifies inference

181 Strength: how strong is a relationship? Structure: does a relationship exist? E B C E B C B B Learning causal graphical models

182 Strength: how strong is a relationship? E B C E B C B B Causal structure vs. causal strength

183 Strength: how strong is a relationship? –requires defining nature of relationship E B C w0 w1 E B C w0 B B Causal structure vs. causal strength

184 Parameterization Structures: h1 = h0 = Parameterization: E B C E B C C B h1: P(E = 1 | C, B) = p00, p10, p01, p11 h0: P(E = 1 | C, B) = p0, p1, p0, p1 Generic

185 Parameterization Structures: h1 = h0 = Parameterization: E B C w0 w1 E B C w0 w0, w1: strength parameters for B, C h1: P(E = 1 | C, B) = 0, w1, w0, w1 + w0 h0: P(E = 1 | C, B) = 0, 0, w0, w0 Linear

186 Parameterization Structures: h1 = h0 = Parameterization: E B C w0 w1 E B C w0 w0, w1: strength parameters for B, C h1: P(E = 1 | C, B) = 0, w1, w0, w1 + w0 – w1 w0 h0: P(E = 1 | C, B) = 0, 0, w0, w0 Noisy-OR

187 Parameter estimation Maximum likelihood estimation: maximize ∏i P(bi, ci, ei; w0, w1) Bayesian methods: as in the Comparing infinitely many hypotheses example…
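A sketch of maximum-likelihood estimation for the noisy-OR strengths by brute-force grid search, with invented trial counts; it recovers Cheng's closed-form causal power as the MLE of w1:

```python
from math import log

# MLE of (w0, w1) under noisy-OR by grid search.  Trial counts are
# hypothetical: with C present, 15/20 trials show E; with C absent, 5/20.
n_c1, k_c1 = 20, 15
n_c0, k_c0 = 20, 5

def loglik(w0, w1):
    p1 = w0 + w1 - w0 * w1    # P(e+ | c+) under noisy-OR
    p0 = w0                   # P(e+ | c-): background alone
    return (k_c1 * log(p1) + (n_c1 - k_c1) * log(1 - p1)
            + k_c0 * log(p0) + (n_c0 - k_c0) * log(1 - p0))

grid = [i / 200 for i in range(1, 200)]       # avoid log(0) at 0 and 1
w0_hat, w1_hat = max(((w0, w1) for w0 in grid for w1 in grid),
                     key=lambda pair: loglik(*pair))

# Causal power = dP / (1 - P(e+|c-)) is the closed-form MLE of w1:
dP = k_c1 / n_c1 - k_c0 / n_c0
power = dP / (1 - k_c0 / n_c0)
```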

188 Structure: does a relationship exist? E B C E B C B B Causal structure vs. causal strength

189 Approaches to structure learning Constraint-based –dependency from statistical tests (e.g., χ²) –deduce structure from dependencies E B C B (Pearl, 2000; Spirtes et al., 1993)

190 Approaches to structure learning E B C B Constraint-based: –dependency from statistical tests (e.g., χ²) –deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993)

191 Approaches to structure learning E B C B Constraint-based: –dependency from statistical tests (e.g., χ²) –deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993)

192 Approaches to structure learning E B C B Attempts to reduce inductive problem to deductive problem Constraint-based: –dependency from statistical tests (e.g., χ²) –deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993)

193 Approaches to structure learning E B C B Bayesian: –compute posterior probability of structures, given observed data E B C E B C P(S|data) ∝ P(data|S) P(S) P(S1|data) P(S0|data) Constraint-based: –dependency from statistical tests (e.g., χ²) –deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993) (Heckerman, 1998; Friedman, 1999)

194 Causal graphical models Extend graphical models to deal with interventions as well as observations Respecting the direction of causality results in efficient representation and inference Two steps in learning causal models –parameter estimation –structure learning

195 Bayes nets and beyond... What are Bayes nets? –graphical models –causal graphical models An example: elemental causal induction Beyond Bayes nets… –other knowledge in causal induction –formalizing causal theories

196 Elemental causal induction To what extent does C cause E? Contingency table (rows: C present, C absent; columns: E present, E absent): a b / c d

197 Strength: how strong is a relationship? Structure: does a relationship exist? E B C w0 w1 E B C w0 B B Causal structure vs. causal strength

198 Causal strength Assume structure: Leading models (ΔP and causal power) are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C): –linear → ΔP, Noisy-OR → causal power E B C w0 w1 B

199 Hypotheses: h1 = h0 = Bayesian causal inference: support = log [P(d | h1) / P(d | h0)] E B C E B C B B Causal structure

200 People ΔP (r = 0.89) Power (r = 0.88) Support (r = 0.97) Buehner and Cheng (1997)
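The support model can be sketched numerically: the log ratio of the marginal likelihoods of h1 (a C → E link exists) and h0 (no link), integrating the noisy-OR strengths against uniform priors by simple grid integration. The counts passed in are up to the caller:

```python
from math import log

# Bayesian causal support: log P(d | h1) / P(d | h0), with noisy-OR
# likelihood and uniform priors over the strength parameters.
def lik(w0, w1, n_c1, k_c1, n_c0, k_c0):
    p1 = w0 + w1 - w0 * w1    # P(e+ | c+)
    p0 = w0                   # P(e+ | c-)
    return (p1 ** k_c1 * (1 - p1) ** (n_c1 - k_c1)
            * p0 ** k_c0 * (1 - p0) ** (n_c0 - k_c0))

def support(n_c1, k_c1, n_c0, k_c0, steps=100):
    """Midpoint-rule integration over w0 (for h0) and (w0, w1) (for h1)."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    m1 = sum(lik(w0, w1, n_c1, k_c1, n_c0, k_c0)
             for w0 in grid for w1 in grid) / steps ** 2
    m0 = sum(lik(w0, 0.0, n_c1, k_c1, n_c0, k_c0) for w0 in grid) / steps
    return log(m1 / m0)
```

Perfect contingency yields strongly positive support; zero contingency yields negative support, since the simpler h0 explains the same data with fewer free parameters.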

201 The importance of parameterization Noisy-OR incorporates mechanism assumptions: –generativity: causes increase probability of effects –each cause is sufficient to produce the effect –causes act via independent mechanisms (Cheng, 1997) Consider other models: –statistical dependence: χ² test –generic parameterization (Anderson, computer science)

202 Model fits: People, Support (Noisy-OR), χ², Support (generic)

203 Generativity is essential Predictions result from ceiling effect –ceiling effects only matter if you believe a cause increases the probability of an effect (Figure: Support predictions as P(e+|c+) and P(e+|c–) vary over 8/8, 6/8, 4/8, 2/8, 0/8)

204 Bayes nets and beyond... What are Bayes nets? –graphical models –causal graphical models An example: elemental causal induction Beyond Bayes nets… –other knowledge in causal induction –formalizing causal theories

205 Hamadeh et al. (2002) Toxicological Sciences. Clofibrate Wyeth 14,643 Gemfibrozil Phenobarbital p450 2B1 Carnitine Palmitoyl Transferase 1 chemicals genes

206 Clofibrate Wyeth 14,643 Gemfibrozil Phenobarbital p450 2B1 Carnitine Palmitoyl Transferase 1 X Hamadeh et al. (2002) Toxicological Sciences. chemicals genes

207 Clofibrate Wyeth 14,643 Gemfibrozil Phenobarbital p450 2B1 Carnitine Palmitoyl Transferase 1 Chemical X peroxisome proliferators Hamadeh et al. (2002) Toxicological Sciences. chemicals genes

208 Using causal graphical models Three questions (usually solved by researcher) –what are the variables? –what structures are plausible? –how do variables interact? How are these questions answered if causal graphical models are used in cognition?

209 Bayes nets and beyond... What are Bayes nets? –graphical models –causal graphical models An example: elemental causal induction Beyond Bayes nets… –other knowledge in causal induction –formalizing causal theories

210 Theory-based causal induction Causal theory –Ontology –Plausible relations –Functional form Z B Y X Z B Y X h0: h1: with priors P(h1) and P(h0) = 1 − P(h1) Hypothesis space of causal graphical models Generates P(h|data) ∝ P(data|h) P(h) Evaluated by statistical inference

211 Blicket detector (Gopnik, Sobel, and colleagues) See this? It's a blicket machine. Blickets make it go. Let's put this one on the machine. Oooh, it's a blicket!

212 –Two objects: A and B –Trial 1: A on detector – detector active –Trial 2: B on detector – detector inactive –Trials 3,4: A B on detector – detector active –3, 4-year-olds judge whether each object is a blicket A: a blicket B: not a blicket Blocking Trial 1 Trials 3, 4 AB Trial 2

213 A deductive inference? Causal law: detector activates if and only if one or more objects on top of it are blickets. Premises: –Trial 1: A on detector – detector active –Trial 2: B on detector – detector inactive –Trials 3,4: A B on detector – detector active Conclusions deduced from premises and causal law: –A: a blicket –B: not a blicket

214 –Two objects: A and B –Trial 1: A B on detector – detector active –Trial 2: A on detector – detector active –4-year-olds judge whether each object is a blicket A: a blicket (100% of judgments) B: probably not a blicket (66% of judgments) Backwards blocking (Sobel, Tenenbaum & Gopnik, 2004) Trial 1 Trial 2 AB

215 Ontology –Types: Block, Detector, Trial –Predicates: Contact(Block, Detector, Trial) Active(Detector, Trial) Constraints on causal relations –For any Block b and Detector d, with prior probability q : Cause(Contact(b,d,t), Active(d,t)) Functional form of causal relations –Causes of Active(d,t) are independent mechanisms, with causal strengths w i. A background cause has strength w 0. Assume a near-deterministic mechanism: w i ~ 1, w 0 ~ 0. Theory

216 Ontology –Types: Block, Detector, Trial –Predicates: Contact(Block, Detector, Trial) Active(Detector, Trial) Theory E A B

217 Ontology –Types: Block, Detector, Trial –Predicates: Contact(Block, Detector, Trial) Active(Detector, Trial) Theory E A B A = 1 if Contact(block A, detector, trial), else 0 B = 1 if Contact(block B, detector, trial), else 0 E = 1 if Active(detector, trial), else 0

218 Constraints on causal relations –For any Block b and Detector d, with prior probability q : Cause(Contact(b,d,t), Active(d,t)) Theory h00: h10: h01: h11: E A B E A B E A B E A B P(h00) = (1 – q)² P(h10) = q(1 – q) P(h01) = (1 – q)q P(h11) = q² No hypotheses with E → B, E → A, A → B, etc. A → E = A is a blicket

219 Functional form of causal relations –Causes of Active(d,t) are independent mechanisms, with causal strengths wb. A background cause has strength w0. Assume a near-deterministic mechanism: wb ~ 1, w0 ~ 0. Theory Activation law: E=1 if and only if A=1 or B=1. P(E=1 | A=0, B=0): P(E=1 | A=1, B=0): P(E=1 | A=0, B=1): P(E=1 | A=1, B=1): E B A E B A E B A E B A P(h00) = (1 – q)² P(h10) = q(1 – q) P(h01) = (1 – q)q P(h11) = q²

220 Bayesian inference Evaluating causal models in light of data: Inferring a particular causal relation:

221 Modeling backwards blocking P(E=1 | A=0, B=0): P(E=1 | A=1, B=0): P(E=1 | A=0, B=1): P(E=1 | A=1, B=1): E B A E B A E B A E B A P(h00) = (1 – q)² P(h10) = q(1 – q) P(h01) = (1 – q)q P(h11) = q²

222 P(E=1 | A=1, B=1): E B A E B A E B A E B A P(h00) = (1 – q)² P(h10) = q(1 – q) P(h01) = (1 – q)q P(h11) = q² Modeling backwards blocking

223 P(E=1 | A=1, B=0): P(E=1 | A=1, B=1): E B A E B A E B A P(h10) = q(1 – q) P(h01) = (1 – q)q P(h11) = q² Modeling backwards blocking
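The backwards-blocking posterior can be computed directly from the four hypotheses and the activation law. This sketch assumes the mechanism is exactly deterministic (wb = 1, w0 = 0) rather than merely near-deterministic:

```python
from itertools import product

# Posterior over the four blicket hypotheses h00..h11 from these slides,
# under the deterministic activation law E = 1 iff some blicket is on
# the detector.  q is the prior probability that an object is a blicket.
def posterior(trials, q):
    """trials: list of (A_on, B_on, E) tuples for each demonstration."""
    post = {}
    for a_blick, b_blick in product((0, 1), repeat=2):
        prior = (q if a_blick else 1 - q) * (q if b_blick else 1 - q)
        lik = 1.0
        for a, b, e in trials:
            pred = int((a_blick and a) or (b_blick and b))
            lik *= 1.0 if pred == e else 0.0   # deterministic likelihood
        post[(a_blick, b_blick)] = prior * lik
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

# Backwards blocking: A+B -> active, then A alone -> active.
post = posterior([(1, 1, 1), (1, 0, 1)], q=1/3)
p_a = post[(1, 0)] + post[(1, 1)]   # P(A is a blicket): certain
p_b = post[(0, 1)] + post[(1, 1)]   # P(B is a blicket): falls back to q
```

With q small (the "rare" condition), B's probability drops well below 50%, matching the graded judgments the slides describe.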

224 After each trial, adults judge the probability that each object is a blicket. Trial 1 Trial 2 BA I. Pre-training phase: Blickets are rare.... II. Backwards blocking phase: Manipulating the prior

225 Rare condition: First observe 12 objects on detector, of which 2 set it off.

226 Common condition: First observe 12 objects on detector, of which 10 set it off.

227 After each trial, adults judge the probability that each object is a blicket. Trial 1 Trial 2 B A I. Pre-training phase: Blickets are rare.... II. Two trials: A, B → detector; B, C → detector Inferences from ambiguous data C

228 Hypotheses: h000 = h100 = h010 = h001 = h110 = h011 = h101 = h111 = Likelihoods: E A B C E A B C E A B C E A B C E A B C E A B C E A B C E A B C P(E=1 | A, B, C; h) = 1 if A = 1 and A → E exists, or B = 1 and B → E exists, or C = 1 and C → E exists; else 0. Same domain theory generates hypothesis space for 3 objects:

229 Rare condition: First observe 12 objects on detector, of which 2 set it off.

230 The role of causal mechanism knowledge Is mechanism knowledge necessary? –Constraint-based learning using χ² tests of conditional independence. How important is the deterministic functional form of causal relations? –Bayes with noisy sufficient causes theory (cf. Cheng's causal power theory).

231 Bayes with correct theory: Bayes with noisy sufficient causes theory:

232 Theory-based causal induction Explains one-shot causal inferences about physical systems: blicket detectors Captures a spectrum of inferences: –unambiguous data: adults and children make all-or-none inferences –ambiguous data: adults and children make more graded inferences Extends to more complex cases with hidden variables, dynamic systems: come to my talk!

233 Summary Causal graphical models provide a language for asking questions about causality Key issues in modeling causal induction: –what do we mean by causal induction? –how do knowledge and statistics interact? Bayesian approach allows exploration of different answers to these questions

234 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

235 Property induction

236 Collaborators Charles KempNeville Sanjana Lauren Schmidt Amy Perfors Fei XuLiz Baraff Pat Shafto

237 The Big Question How can we generalize new concepts reliably from just one or a few examples? –Learning word meanings horse

238 The Big Question How can we generalize new concepts reliably from just one or a few examples? –Learning word meanings, causal relations, social rules, …. –Property induction How probable is the conclusion (target) given the premises (examples)? Gorillas have T4 cells. Squirrels have T4 cells. All mammals have T4 cells.

239 The Big Question How can we generalize new concepts reliably from just one or a few examples? –Learning word meanings, causal relations, social rules, …. –Property induction Gorillas have T4 cells. Squirrels have T4 cells. All mammals have T4 cells. Gorillas have T4 cells. Chimps have T4 cells. All mammals have T4 cells. More diverse examples → stronger generalization

240 Is rational inference the answer? Everyday induction often appears to follow principles of rational scientific inference. –Could that explain its success? Goal of this work: a rational computational model of human inductive generalization. –Explain peoples judgments as approximations to optimal inference in natural environments. –Close quantitative fits to peoples judgments with a minimum of free parameters or assumptions.

241 Rational statistical inference (Bayes): Learners' domain theories generate their hypothesis space H and prior p(h). –Well-matched to structure of the natural world. –Learnable from limited data. –Computationally tractable inference. Theory-Based Bayesian Models

242 The plan Similarity-based models Theory-based model Bayesian models –Empiricist Bayes –Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes –Learning with multiple domain theories –Learning domain theories

243 The plan Similarity-based models Theory-based model Bayesian models –Empiricist Bayes –Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes –Learning with multiple domain theories –Learning domain theories

244 20 subjects rated the strength of 45 arguments: X 1 have property P. X 2 have property P. X 3 have property P. All mammals have property P. 40 different subjects rated the similarity of all pairs of 10 mammals. An experiment ( Osherson et al., 1990)

245 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x

246 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x

247 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x

248 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x

249 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x Sum-Similarity: Σ_y Σ_{x ∈ X} sim(y, x)

250 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x Max-Similarity: Σ_y max_{x ∈ X} sim(y, x)

251 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x Max-Similarity:

252 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x Max-Similarity:

253 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x Max-Similarity:

254 Similarity-based models (Osherson et al.) strength(all mammals | X ) x x x Mammals: Examples: x Max-Similarity:

255 Sum-sim versus Max-sim Two models appear functionally similar: –Both increase monotonically as new examples are observed. Reasons to prefer Sum-sim: –Standard form of exemplar models of categorization, memory, and object recognition. –Analogous to kernel density estimation techniques in statistical pattern recognition. Reasons to prefer Max-sim: –Fit to generalization judgments....
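The two models can be written as code over a tiny invented similarity matrix (a sketch: the full Osherson et al. models also include a coverage term and free mixing parameters, omitted here):

```python
# Sum-sim vs. Max-sim over the members of the conclusion category.
# The similarity values below are made up for illustration.
sim = {('horse', 'cow'): 0.8, ('horse', 'seal'): 0.2,
       ('mouse', 'cow'): 0.4, ('mouse', 'seal'): 0.3}

def sum_sim(members, examples):
    """Sum-Similarity: total similarity of every member to every example."""
    return sum(sim[(m, x)] for m in members for x in examples)

def max_sim(members, examples):
    """Max-Similarity: each member counts only its closest example."""
    return sum(max(sim[(m, x)] for x in examples) for m in members)

members = ['horse', 'mouse']            # stand-in for "all mammals"
s_max = max_sim(members, ['cow', 'seal'])
s_sum = sum_sim(members, ['cow', 'seal'])
```

Both scores grow monotonically as examples are added; they differ in whether redundant examples keep adding strength (Sum-sim) or are absorbed by the max (Max-sim).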

256 Model Data Data vs. models Each point represents one argument: X 1 have property P. X 2 have property P. X 3 have property P. All mammals have property P.

257 Three data sets Max-sim Sum-sim Conclusion kind: Number of examples: all mammals horses 3 2 1, 2, or 3

258 Feature rating data (Osherson and Wilkie) People were given 48 animals, 85 features, and asked to rate whether each animal had each feature. E.g., elephant: 'gray' 'hairless' 'toughskin' 'big' 'bulbous' 'longleg' 'tail' 'chewteeth' 'tusks' 'smelly' 'walks' 'slow' 'strong' 'muscle 'quadrapedal' 'inactive' 'vegetation' 'grazer' 'oldworld' 'bush' 'jungle' 'ground' 'timid' 'smart' 'group'

259 Compute similarity based on Hamming distance, or cosine. Generalize based on Max-sim or Sum-sim. ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 Features New property ?

260 Three data sets Conclusion kind: Number of examples: all mammals horses 3 2 1, 2, or 3 Max-Sim: r = 0.77, r = 0.75, r = 0.94 Sum-Sim: r = –0.21, r = 0.63, r = 0.19

261 Problems for sim-based approach No principled explanation for why Max-Sim works so well on this task, and Sum-Sim so poorly, when Sum- Sim is the standard in other similarity-based models. Free parameters mixing similarity and coverage terms, and possibly Max-Sim and Sum-Sim terms. Does not extend to induction with other kinds of properties, e.g., from Smith et al., 1993: Dobermanns can bite through wire. German shepherds can bite through wire. Poodles can bite through wire. German shepherds can bite through wire.

262 Marr's Three Levels of Analysis Computation: What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out? Representation and algorithm: Max-sim, Sum-sim Implementation: Neurobiology

263 The plan Similarity-based models Theory-based model Bayesian models –Empiricist Bayes –Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes –Learning with multiple domain theories –Learning domain theories

264 Scientific biology: species generated by an evolutionary branching process. –A tree-structured taxonomy of species. Taxonomy also central in folkbiology (Atran). Theory-based induction

265 Begin by reconstructing intuitive taxonomy from similarity judgments: chimp gorilla horse cow elephant rhino mouse squirrel dolphin seal clustering Theory-based induction

266 How taxonomy constrains induction Atran (1998): Fundamental principle of systematic induction (Warburton 1967, Bock 1973) –Given a property found among members of any two species, the best initial hypothesis is that the property is also present among all species that are included in the smallest higher-order taxon containing the original pair of species.

267 elephant squirrel chimp gorilla horse cow rhino mouse dolphin seal all mammals Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. Strong: 0.76 [max = 0.82]

268 elephant squirrel chimp gorilla horse cow rhino mouse dolphin seal Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. Cows have property P. Horses have property P. Rhinos have property P. All mammals have property P. large herbivores Strong: 0.76 [max = 0.82] Weak: 0.17 [min = 0.14]

269 elephant squirrel chimp gorilla horse cow rhino mouse dolphin seal Seals have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. all mammals Strong: 0.76 [max = 0.82] Weak: 0.30 [min = 0.14]

270 Max-sim Sum-sim Conclusion kind: Number of examples: all mammals horses 3 2 1, 2, or 3 Taxonomic distance

271 The challenge Can we build models with the best of both traditional approaches? –Quantitatively accurate predictions. –Strong rational basis. Will require novel ways of integrating structured knowledge with statistical inference.

272 The plan Similarity-based models Theory-based model Bayesian models –Empiricist Bayes –Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes –Learning with multiple domain theories –Learning domain theories

273 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 New property ? Features

274 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? New property Generalization Hypothesis Features

275 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? New property Generalization Hypothesis Features

276 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? New property Generalization Hypothesis Features

277 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? New property Generalization Hypothesis Features

278 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? New property Generalization Hypothesis Features

279 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? New property Generalization Hypothesis Features

280 The Bayesian approach ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 p(h) New property Generalization Hypothesis h d p(d |h) Features

281 ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 New property Generalization Hypothesis h d Bayes' rule: p(h|d) ∝ p(d|h) p(h) Features

282 ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 New property Generalization Hypothesis h d Probability that property Q holds for species x: p(d |h) p(h) Features

283 ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 New property Generalization Hypothesis h d p(d|h) = 1/|h|ⁿ if d is consistent with h, 0 otherwise p(h) Features Size principle: |h| = # of positive instances of h, n = # of examples in d

284 The size principle h1 h2 even numbers multiples of 10

285 The size principle Data slightly more of a coincidence under h1 h1 h2 even numbers multiples of 10

286 The size principle Data much more of a coincidence under h1 h1 h2 even numbers multiples of 10

287 Illustrating the size principle Grizzly bears have property P. All mammals have property P. Grizzly bears have property P. Brown bears have property P. Polar bears have property P. All mammals have property P. Non-monotonicity Which argument is stronger?
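The size principle behind this comparison can be sketched with the number-game hypotheses from slides 284–286, taking 1–100 as the domain (an assumption; the slides do not fix the domain):

```python
# Size principle: each observed example consistent with h contributes a
# factor 1/|h|, so small hypotheses win rapidly when they keep fitting.
def likelihood(examples, h):
    if all(x in h for x in examples):
        return (1 / len(h)) ** len(examples)
    return 0.0

even = set(range(2, 101, 2))        # h1: even numbers, |h| = 50
tens = set(range(10, 101, 10))      # h2: multiples of 10, |h| = 10

# One example consistent with both: a mild preference for the smaller h.
r1 = likelihood([20], tens) / likelihood([20], even)
# Four such examples: the "coincidence" under h1 grows as (50/10)^4.
r4 = likelihood([20, 40, 60, 80], tens) / likelihood([20, 40, 60, 80], even)
```

The same logic drives the non-monotonicity on slide 287: three bear examples fit the small "bears" hypothesis so well that "all mammals" is penalized, whereas a single grizzly example leaves it more open.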

288 ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 New property Generalization Hypotheses h d... p(Q(x)|d) p(h) Probability that property Q holds for species x: p(d |h)

289 ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 New property Generalization Hypothesis h d Probability that property Q holds for species x: p(d |h) p(h) Features

290 Specifying the prior p(h) A good prior must focus on a small subset of all 2^n possible hypotheses, in order to: –Match the distribution of properties in the world. –Be learnable from limited data. –Be computationally efficient. We consider two approaches: –Empiricist Bayes: unstructured prior based directly on known features. –Theory-based Bayes: structured prior based on rational domain theory, tuned to known features.

291 ???????????????? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 p(h) = New property Generalization Hypothesis hd Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 h 1 h 2 h 3 h 4 h 5 h 6 h 7 h 8 h 9 h 10 h 11 h 12 Features Empiricist Bayes: (Heit, 1998)

292 Results. Max-Sim: r = 0.77, r = 0.75, r = 0.94. Empiricist Bayes: r = 0.38, r = 0.16, r = 0.79.

293 Why doesn't Empiricist Bayes work? With no structural bias, it requires too many features to estimate the prior reliably. An analogy: estimating a smooth probability density function by local interpolation (N = 5, N = 100, N = 500).

294 Why doesn't Empiricist Bayes work? With no structural bias, it requires too many features to estimate the prior reliably. An analogy: estimating a smooth probability density function by local interpolation (N = 5). Assuming an appropriately structured form for the density (e.g., Gaussian) leads to better generalization from sparse data.

295 Theory-based Bayes Theory: Two principles based on the structure of species and properties in the natural world. 1. Species generated by an evolutionary branching process. –A tree-structured taxonomy of species (Atran, 1998). 2. Features generated by stochastic mutation process and passed on to descendants. –Novel features can appear anywhere in tree, but some distributions are more likely than others.

296 Mutation process generates p(h|T): –Choose a label for the root. –The probability that the label mutates along branch b depends on λ (the mutation rate) and |b| (the length of branch b).

297 Mutation process generates p(h|T) (an example labeling, with mutations marked on three branches): λ = mutation rate, |b| = length of branch b.
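A label propagating down a tree and flipping along branches can be simulated directly. The slides do not give the exact flip probability, so a standard symmetric two-state form, (1 − e^(−2λt))/2 for a branch of length t, is assumed here, along with a made-up toy tree:

```python
import math
import random

# Two-state mutation process on a tree: the label flips along a branch of
# length t with probability (1 - exp(-2*lam*t)) / 2 (assumed form).
def flip_prob(lam, t):
    return (1 - math.exp(-2 * lam * t)) / 2

# A node is (name, children), where children is a list of (child, length).
tree = ("root", [
    (("A", []), 1.0),
    (("B", []), 1.0),
    (("mid", [(("C", []), 0.5), (("D", []), 0.5)]), 2.0),
])

def sample_labeling(node, label, lam, rng, out):
    name, children = node
    if not children:          # leaf: record the species' label
        out[name] = label
    for child, t in children:
        new = label ^ (rng.random() < flip_prob(lam, t))
        sample_labeling(child, new, lam, rng, out)
    return out

labels = sample_labeling(tree, 0, lam=0.3, rng=random.Random(0), out={})
print(labels)
```

Longer branches flip more often, which is exactly why labelings that change only along long branches come out more probable under the prior.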

298 Samples from the prior. Labelings that cut the data along fewer branches are more probable: monophyletic > polyphyletic.

299 Samples from the prior. Labelings that cut the data along longer branches are more probable: more distinctive > less distinctive.

300 Mutation process over tree T generates p(h|T). Message passing over tree T efficiently sums over all h. How do we know which tree T to use?

301 The same mutation process generates p(Features|T): –Assume each feature is generated independently over the tree. –Use MCMC to infer the most likely tree T and mutation rate λ given the observed features. –No free parameters!

302 Results. Max-Sim: r = 0.77, r = 0.75, r = 0.94. Empiricist Bayes: r = 0.38, r = 0.16, r = 0.79. Theory-based Bayes: r = 0.91, r = 0.95, r = 0.91.

303 Grounding in similarity. Reconstruct the intuitive taxonomy from similarity judgments by clustering: chimp, gorilla, horse, cow, elephant, rhino, mouse, squirrel, dolphin, seal.

304 Theory-based Bayes compared with Max-sim and Sum-sim. Conclusion kind: all mammals or horses; number of examples: 1, 2, or 3.

305 Explaining similarity. Why does Max-sim fit so well? It is an efficient and accurate approximation to the Theory-based Bayesian model. Theorem: nearest-neighbor classification approximates evolutionary Bayes in the limit of high mutation rate, if the domain is tree-structured. Correlation with Bayes on three-premise general arguments, over 100 simulated trees: mean r = 0.94.
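Max-sim and Sum-sim themselves are very simple rules; a minimal sketch with made-up pairwise similarities:

```python
# Max-sim vs. Sum-sim: the strength of generalizing a property to target x
# from a set of premise examples is the max (or sum) of the pairwise
# similarities. Similarity values below are illustrative assumptions.
sim = {("horse", "cow"): 0.8, ("horse", "seal"): 0.3, ("horse", "squirrel"): 0.4}

def pair_sim(a, b):
    return sim.get((a, b), sim.get((b, a), 0.0))

def max_sim(x, examples):
    return max(pair_sim(x, e) for e in examples)

def sum_sim(x, examples):
    return sum(pair_sim(x, e) for e in examples)

print(max_sim("horse", ["cow", "seal"]))  # driven by the single closest example
print(sum_sim("horse", ["cow", "seal"]))  # grows with every added example
```

The theorem on this slide says the nearest-neighbor behavior of `max_sim` tracks the full Bayesian computation when the domain is tree-structured and the mutation rate is high.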

306 Alternative feature-based models. Taxonomic Bayes (strictly taxonomic hypotheses, with no mutation process): monophyletic > polyphyletic.

307 Alternative feature-based models Taxonomic Bayes (strictly taxonomic hypotheses, with no mutation process) PDP network (Rogers and McClelland)

308 Results. PDP network: r = 0.41, r = 0.62, r = 0.71 (bias is too weak). Taxonomic Bayes: r = 0.51, r = 0.53, r = 0.85 (bias is too strong). Theory-based Bayes: r = 0.91, r = 0.95, r = 0.91 (bias is just right!). Note: PDP graph is mocked up, correlations OK. Tax Bayes graphs OK, not sure about correlations.

309 Mutation principle versus pure Occam's Razor. The mutation principle provides a version of Occam's Razor, by favoring hypotheses that span fewer disjoint clusters. Could we use a more generic Bayesian Occam's Razor, without the biological motivation of mutation?

310 Mutation process generates p(h|T): –Choose a label for the root. –The probability that the label mutates along branch b depends on λ (the mutation rate) and |b| (the length of branch b).

311 Mutation process generates p(h|T): –Choose a label for the root. –The probability that the label mutates along branch b depends on λ (the mutation rate) and |b| (the length of branch b).

312 Premise typicality effect (Rips, 1975; Osherson et al., 1990). Strong: Horses have property P, therefore all mammals have property P. Weak: Seals have property P, therefore all mammals have property P. Models compared: Bayes (taxonomy + mutation), Bayes (taxonomy + Occam), Max-sim; conclusion kind: all mammals; number of examples: 1.

313 Typicality meets hierarchies. Collins and Quillian: semantic memory is structured hierarchically. Traditional story: a simple hierarchical structure sits uncomfortably with typicality effects & exceptions. New story: typicality & exceptions are compatible with rational statistical inference over a hierarchy.

314 Intuitive versus scientific theories of biology Same structure for how species are related. –Tree-structured taxonomy. Same probabilistic model for traits –Small probability of occurring along any branch at any time, plus inheritance. Different features –Scientist: genes –People: coarse anatomy and behavior

315 Induction in Biology: summary Theory-based Bayesian inference explains taxonomic inductive reasoning in folk biology. Insight into processing-level accounts. –Why Max-sim over Sum-sim in this domain? –How is hierarchical representation compatible with typicality effects & exceptions? Reveals essential principles of domain theory. –Category structure: taxonomic tree. –Feature distribution: stochastic mutation process + inheritance.

316 The plan Similarity-based models Theory-based model Bayesian models –Empiricist Bayes –Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes –Learning with multiple domain theories –Learning domain theories

317 Theory: property type (generic essence) maps to a structure (taxonomic tree) over lion, cheetah, hyena, giraffe, gazelle, gorilla, monkey, ...

318 Theory: property types (generic essence; size-related; food-carried) map to structures (taxonomic tree; dimensional; directed acyclic network) over the same species: lion, cheetah, hyena, giraffe, gazelle, gorilla, monkey, ...

319 One-dimensional predicates. Q = have skins that are more resistant to penetration than most synthetic fibers. –Unknown relevant property: skin toughness. –Model the influence of known properties via the judged prior probability that each species has Q. (Skin toughness scale: house cat, camel, elephant, rhino; threshold for Q.)
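The threshold idea can be sketched with a grid prior over the unknown threshold; the toughness values and the uniform grid below are illustrative assumptions, not values from the tutorial:

```python
# One-dimensional threshold predicate: Q holds for species whose latent
# value (e.g. skin toughness) exceeds an unknown threshold. Observing a
# premise species with Q truncates the threshold's prior distribution.
values = {"house cat": 1.0, "camel": 3.0, "elephant": 6.0, "rhino": 7.0}
thresholds = [0.5 * k for k in range(1, 17)]  # uniform grid prior on [0.5, 8.0]

def p_has_Q(x, premises):
    # keep only thresholds consistent with every premise species having Q
    ok = [t for t in thresholds if all(values[s] >= t for s in premises)]
    return sum(values[x] >= t for t in ok) / len(ok)

print(p_has_Q("elephant", ["camel"]))   # tougher than the premise: certain
print(p_has_Q("house cat", ["camel"]))  # may fall below the threshold
```

Given "camels have Q", anything tougher than a camel gets Q with probability 1, while softer-skinned species get a graded probability, which is the asymmetry the 1D model captures and the taxonomic model cannot.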

320 One-dimensional predicates: Max-sim vs. Bayes (taxonomy + mutation) vs. Bayes (1D model).

321 Food web model fits (Shafto et al.). Disease and Property contexts, Mammals and Island scenarios: r = ?, r = 0.77, r = 0.82, r = ?.

322 Taxonomic tree model fits (Shafto et al.). Disease and Property contexts, Mammals and Island scenarios: r = 0.81, r = ?, r = 0.16, r = 0.62.

323 The plan Similarity-based models Theory-based model Bayesian models –Empiricist Bayes –Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes –Learning with multiple domain theories –Learning domain theories

324 Theory: species organized in a taxonomic tree structure; feature i generated by a mutation process with rate λ_i. Data: species 1–10 × features F1–F14, linked by p(S|T) and p(D|S); rates λ_i range from high to low (~ feature weight).

325 The same theory (taxonomic tree; feature i generated by mutation with rate λ_i), now with a new species X whose features are unobserved (marked ?).

326 The new species X is attached to the tree as SX, under the same theory: p(S|T), p(D|S).

327 Where does the domain theory come from? Innate. –Atran (1998): The tendency to group living kinds into hierarchies reflects an innately determined cognitive structure. Emerges (only approximately) through learning in unstructured connectionist networks. –McClelland and Rogers (2003).

328 Bayesian inference to theories Challenge to the nativist-empiricist dichotomy. –We really do have structured domain theories. –We really do learn them. Bayesian framework applies over multiple levels: –Given hypothesis space + data, infer concepts. –Given theory + data, infer hypothesis space. –Given X + data, infer theory.

329 Bayesian inference to theories Candidate theories for biological species and their features: –T0: Features generated independently for each species (cf. naive Bayes, Anderson's rational model). –T1: Features generated by mutation in a tree-structured taxonomy of species. –T2: Features generated by mutation in a one-dimensional chain of species. Score theories by their likelihood on the object-feature matrix, p(Data|T).
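Scoring theories by the likelihood they assign to an object-feature matrix can be sketched with maximum-likelihood fits. As a stand-in for the tree/mutation model (which is more involved), a crude "clustered" theory that shares a Bernoulli rate within each species cluster is assumed here; the data and clusters are made up, and a fuller treatment would integrate over parameters (a marginal likelihood) as the slides do:

```python
import math

# T0: every cell an independent Bernoulli(p), p fit by maximum likelihood.
# T_clust: per-cluster, per-feature Bernoulli rates (a crude structured
# alternative, assumed for illustration).
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

def clamp(p):
    return min(max(p, 1e-9), 1 - 1e-9)

def loglik(cells, p):
    return sum(math.log(p if c else 1 - p) for c in cells)

flat = [c for row in data for c in row]
ll_T0 = loglik(flat, clamp(sum(flat) / len(flat)))

ll_clust = 0.0
for cluster in ([0, 1], [2, 3]):
    for j in range(len(data[0])):
        cells = [data[i][j] for i in cluster]
        ll_clust += loglik(cells, clamp(sum(cells) / len(cells)))

print(ll_T0, ll_clust)  # the structured theory scores higher on clustered data
```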

330 T0: No organizational structure for species. Features distributed independently over species. Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 Features Data S1S2S3S4S5S6S7S8S9S10 F1 F2 F5 F8 F9 F2 F4 F6 F7 F9 F14 F1 F2 F3 F5 F7 F8 F10 F12 F13 F2 F4 F7 F9 F12 F14 F1 F5 F7 F13 F14 F1 F6 F7 F8 F9 F10 F13 F2 F4 F5 F12 F13 F14 F2 F3 F6 F11 F13 F1 F6 F8 F9 F12 F2 F4 F8 F9 F10 F11 F14

331 T0: No organizational structure for species. Features distributed independently over species. S1S2S3S4S5S6S7S8S9S10 F1 F6 F7 F8 F9 F11 F1 F6 F7 F8 F9 F10 F11 F3 F7 F8 F9 F11 F12 F14 F3 F7 F8 F9 F11 F12 F14 F4 F8 F9 F4 F8 F9 F5 F9 F10 F13 F14 F5 F9 F10 F13 F14 F2 F6 F7 F8 F9 F11 F2 F6 F7 F8 F9 F11

332 T1: Species organized in taxonomic tree structure. Features distributed via stochastic mutation process. T0: No organizational structure for species. Features distributed independently over species. S3S4S1S2S9S10S5S6S7S8 F1F2F3F4F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 S1S2S3S4S5S6S7S8S9S10 F1 F6 F7 F8 F9 F11 F1 F6 F7 F8 F9 F10 F11 F3 F7 F8 F9 F11 F12 F14 F3 F7 F8 F9 F11 F12 F14 F4 F8 F9 F4 F8 F9 F5 F9 F10 F13 F14 F5 F9 F10 F13 F14 F2 F6 F7 F8 F9 F11 F2 F6 F7 F8 F9 F11

333 T1: p(Data|T1) ~ 2.42 × 10^… Species organized in taxonomic tree structure. Features distributed via stochastic mutation process. T0: p(Data|T0) ~ 1.83 × 10^… No organizational structure for species. Features distributed independently over species. (Tree-ordered and unordered species × feature matrices as on slides 331–332.)

334 T0: No organizational structure for species. Features distributed independently over species. Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 Features Data S1S2S3S4S5S6S7S8S9S10 F1 F2 F5 F8 F9 F2 F4 F6 F7 F9 F14 F1 F2 F3 F5 F7 F8 F10 F12 F13 F2 F4 F7 F9 F12 F14 F1 F5 F7 F13 F14 F1 F6 F7 F8 F9 F10 F13 F2 F4 F5 F12 F13 F14 F2 F3 F6 F11 F13 F1 F6 F8 F9 F12 F2 F4 F8 F9 F10 F11 F14 T1: Species organized in taxonomic tree structure. Features distributed via stochastic mutation process. S2S4S7S10S8S1S9S6S3S5 F1 F2 F3 F4 F5 F7 F10 F11 F12 F13 F14 F2 F3 F5 F5 F7 F13 F6 F7 F8 F9 F8 F9 F10 F12 F13 F14 F11

335 T0: p(Data|T0) ~ 2.29 × 10^… No organizational structure for species. Features distributed independently over species. T1: p(Data|T1) ~ 4.38 × 10^… Species organized in taxonomic tree structure. Features distributed via stochastic mutation process. (Unordered and tree-ordered species × feature matrices as on slide 334.)

336 Empirical tests Synthetic data: 32 objects, 120 features –tree-structured generative model –linear chain generative model –unconstrained (independent features). Real data: –Animal feature judgments: 48 species, 85 features. –US Supreme Court decisions: 9 people, 637 cases.

337 Results: preferred model (Null / Tree / Linear) for each dataset.

338 Theory acquisition: summary So far, just a computational proof of concept. Future work: –Experimental studies of theory acquisition in the lab, with adult and child subjects. –Modeling developmental or historical trajectories of theory change. Sources of hypotheses for candidate theories: –What is innate? –Role of analogy?

339 Outline Morning –Introduction (Josh) –Basic case study #1: Flipping coins (Tom) –Basic case study #2: Rules and similarity (Josh) Afternoon –Advanced case study #1: Causal induction (Tom) –Advanced case study #2: Property induction (Josh) –Quick tour of more advanced topics (Tom)

340 Advanced topics

341 Structure and statistics Statistical language modeling –topic models Relational categorization –attributes and relations

342 Structure and statistics Statistical language modeling –topic models Relational categorization –attributes and relations

343 Statistical language modeling A variety of approaches to statistical language modeling are used in cognitive science –e.g. LSA (Landauer & Dumais, 1997) –distributional clustering (Redington, Chater, & Finch, 1998) Generative models have unique advantages –identify assumed causal structure of language –make use of standard tools of Bayesian statistics –easily extended to capture more complex structure

344 Generative models for language: latent structure generates observed data.

345 Generative models for language: meaning generates sentences.

346 Topic models Each document is a mixture of topics; each word is chosen from a single topic. Introduced by Blei, Ng, and Jordan (2001); a reinterpretation of PLSI (Hofmann, 1999). The idea of probabilistic topics is widely used (e.g., Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003).

347 Generating a document: a distribution over topics θ generates topic assignments z, which generate the observed words w.

348 Topic 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2 (SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS all 0.0). Topic 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2 (HEART, LOVE, SOUL, TEARS, JOY all 0.0). P(w|z = 1) = φ(1); P(w|z = 2) = φ(2).

349 Choose mixture weights θ = {P(z = 1), P(z = 2)} for each document, then generate a bag of words. Example documents for θ = {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
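This generative step is short enough to write out; the two toy topic distributions mirror the slides, and the code is an illustrative sketch:

```python
import random

# Generate a bag of words from a two-topic mixture: draw a topic
# z ~ theta for each word, then a word w ~ phi[z].
phi = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}

def generate_document(theta, n_words, rng):
    doc = []
    for _ in range(n_words):
        z = 1 if rng.random() < theta[1] else 2   # draw a topic from theta
        words, probs = zip(*phi[z].items())
        doc.append(rng.choices(words, weights=probs)[0])
    return doc

doc = generate_document({1: 0.25, 2: 0.75}, 10, random.Random(1))
print(doc)
```

With θ = {0.25, 0.75} roughly a quarter of the words come from the "love" topic and the rest from the "science" topic, as in the example documents above.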

350 THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE NATURE WORLD HUMAN PHILOSOPHY MORAL KNOWLEDGE THOUGHT REASON SENSE OUR TRUTH NATURAL EXISTENCE BEING LIFE MIND ARISTOTLE BELIEVED EXPERIENCE REALITY A selection of topics (from 500) THIRD FIRST SECOND THREE FOURTH FOUR GRADE TWO FIFTH SEVENTH SIXTH EIGHTH HALF SEVEN SIX SINGLE NINTH END TENTH ANOTHER

351 STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN A selection of topics (from 500) FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE


353 Learning topic hiearchies (Blei, Griffiths, Jordan, & Tenenbaum, 2004)

354 Syntax and semantics from statistics. Semantics: probabilistic topics (long-range, document-specific dependencies). Syntax: probabilistic regular grammar (short-range dependencies, constant across all documents). A factorization of language based on statistical dependency patterns. (Griffiths, Steyvers, Blei, & Tenenbaum, submitted)

355 Semantic classes: z = 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; z = 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2. Syntactic classes: x = 1: THE 0.6, A 0.3, MANY 0.1; x = 3: OF 0.6, FOR 0.3, BETWEEN 0.1; x = 2: the semantic (topic) slot.

356 THE … (first word drawn from syntactic class x = 1; distributions as on slide 355)

357 THE LOVE … (second word drawn from topic z = 1 via the semantic class x = 2; distributions as on slide 355)

358 THE LOVE OF … (third word drawn from syntactic class x = 3; distributions as on slide 355)

359 THE LOVE OF RESEARCH (fourth word drawn from topic z = 2 via the semantic class x = 2; distributions as on slide 355)
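The "THE LOVE OF RESEARCH" example can be sketched generatively. The deterministic transition chain and toy distributions below are illustrative assumptions in the spirit of the slides, not the paper's actual model:

```python
import random

# Composite model: an HMM over word classes x supplies the syntax;
# class x = 2 is a semantic slot that emits from a topic z drawn from
# the document's theta.
topics = {
    1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
    2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
}
classes = {
    1: (["THE", "A", "MANY"], [0.6, 0.3, 0.1]),      # determiners
    3: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1]),  # prepositions
}
transitions = {1: 2, 2: 3, 3: 2}  # det -> topic word -> prep -> topic word

def generate(theta, n, rng):
    out, x = [], 1
    for _ in range(n):
        if x == 2:  # semantic slot: pick a topic, then a word from it
            z = rng.choices([1, 2], weights=theta)[0]
            out.append(rng.choice(topics[z]))
        else:       # syntactic slot: pick a function word
            w, p = classes[x]
            out.append(rng.choices(w, weights=p)[0])
        x = transitions[x]
    return out

sentence = generate([0.5, 0.5], 4, random.Random(0))
print(sentence)  # shaped like "THE LOVE OF RESEARCH"
```

Only the x = 2 slots depend on the document's topic mixture; the function-word slots are constant across documents, which is the factorization the slide describes.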

360 FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS Semantic categories PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW

361 GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN Syntactic categories BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME

362 Statistical language modeling Generative models provide –transparent assumptions about causal process –opportunities to combine and extend models Richer generative models... –probabilistic context-free grammars –paragraph or sentence-level dependencies –more complex semantics

363 Structure and statistics Statistical language modeling –topic models Relational categorization –attributes and relations

364 Relational categorization Most approaches to categorization in psychology and machine learning focus on attributes - properties of objects –words in titles of CogSci posters But… a significant portion of knowledge is organized in terms of relations –co-authors on posters –who talks to whom (Kemp, Griffiths, & Tenenbaum, 2004)

365 Attributes and relations. Data: X (objects × attributes), Y (objects × objects). Models: mixture model (cf. Anderson, 1990), P(X) = Σ_z Π_i P(z_i) Π_{i,k} P(x_{ik} | z_i); stochastic blockmodel, P(Y) = Σ_z Π_i P(z_i) Π_{i,j} P(y_{ij} | z_i, z_j).

366 Stochastic blockmodels. Each entity has a type; for any pair of objects (i, j), the probability of a relation is determined by their classes (z_i, z_j), via a matrix η of link probabilities from type i to type j. Allows types of objects and class probabilities to be learned from data: P(Z, η | Y) ∝ P(Y | Z, η) P(Z) P(η).
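The blockmodel's generative step can be sketched in a few lines; the class assignments z and link-probability matrix eta below are made-up illustrative values:

```python
import random

# Stochastic blockmodel: object i has class z[i]; a link y_ij appears
# with probability eta[z[i]][z[j]], depending only on the pair of classes.
def sample_relations(z, eta, rng):
    n = len(z)
    return [[int(rng.random() < eta[z[i]][z[j]]) for j in range(n)]
            for i in range(n)]

z = [0, 0, 0, 1, 1, 1]        # two classes, three objects each
eta = [[0.9, 0.05],           # dense within-class, sparse between-class
       [0.05, 0.9]]
Y = sample_relations(z, eta, random.Random(0))
within = sum(Y[i][j] for i in range(6) for j in range(6) if z[i] == z[j])
between = sum(Y[i][j] for i in range(6) for j in range(6) if z[i] != z[j])
print(within, between)  # within-class links should dominate
```

Inference reverses this: given only Y, recover the classes z and the matrix eta, which is what the word-association and actor examples that follow do.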

367 Stochastic blockmodels: example block partitions of the relational matrix into types A, B, C, and a new type D.

368 Categorizing words Relational data: word association norms (Nelson, McEvoy, & Schreiber, 1998) 5018 x 5018 matrix of associations –symmetrized –all words with at least 10 associates –2513 nodes, … links

369

370

371 Categorizing words BAND INSTRUMENT BLOW HORN FLUTE BRASS GUITAR PIANO TUBA TRUMPET TIE COAT SHOES ROPE LEATHER SHOE HAT PANTS WEDDING STRING SEW MATERIAL WOOL YARN WEAR TEAR FRAY JEANS COTTON CARPET WASH LIQUID BATHROOM SINK CLEANER STAIN DRAIN DISHES TUB SCRUB

372 Categorizing actors Internet Movie Database (IMDB) data, from the start of cinema to 1960 (Jeremy Kubica) Relational data: collaboration 5000 x 5000 matrix of most prolific actors –all actors with at least 1 collaborator –2275 nodes, … links

373

374

375 Categorizing actors Albert Lieven Karel Stepanek Walter Rilla Anton Walbrook Moore Marriott Laurence Hanray Gus McNaughton Gordon Harker Helen Haye Alfred Goddard Morland Graham Margaret Lockwood Hal Gordon Bromley Davenport Gino Cervi Nadia Gray Enrico Glori Paolo Stoppa Bernardi Nerio Amedeo Nazzari Gina Lollobrigida Aldo Silvani Franco Interlenghi Guido Celano Archie Ricks Helen Gibson Oscar Gahan Buck Moulton Buck Connors Clyde McClary Barney Beasley Buck Morgan Tex Phelps George Sowards. Cluster labels: Germany, UK, British comedy, Italian, US Westerns.

376 Structure and statistics Bayesian approach allows us to specify structured probabilistic models Explore novel representations and domains –topics for semantic representation –relational categorization Use powerful methods for inference, developed in statistics and machine learning

377 Other methods and tools... Inference algorithms –belief propagation –dynamic programming –the EM algorithm and variational methods –Markov chain Monte Carlo More complex models –Dirichlet processes and Bayesian non-parametrics –Gaussian processes and kernel methods Reading list at

378 Taking stock

379 Bayesian models of inductive learning Inductive leaps can be explained with hierarchical Theory-based Bayesian models: Domain Theory Structural Hypotheses Data Probabilistic Generative Model Bayesian inference

380 Bayesian models of inductive learning Inductive leaps can be explained with hierarchical Theory-based Bayesian models: a theory T generates structural hypotheses S, each of which generates many datasets D.

381 Bayesian models of inductive learning Inductive leaps can be explained with hierarchical Theory-based Bayesian models. What the approach offers: –Strong quantitative models of generalization behavior. –Flexibility to model different patterns of reasoning in different tasks and domains, using differently structured theories but the same general-purpose Bayesian engine. –A framework for explaining why inductive generalization works, and where knowledge comes from as well as how it is used.

382 Bayesian models of inductive learning Inductive leaps can be explained with hierarchical Theory-based Bayesian models. Challenges: –Theories are hard.

383 Bayesian models of inductive learning Inductive leaps can be explained with hierarchical Theory-based Bayesian models: The interaction between structure and statistics is crucial. –How structured knowledge supports statistical learning, by constraining hypothesis spaces. –How statistics supports reasoning with and learning structured knowledge. –How complex structures can grow from data, rather than being fully specified in advance.

384

