1
Bayesian models of inductive learning
Josh Tenenbaum & Tom Griffiths MIT Computational Cognitive Science Group Department of Brain and Cognitive Sciences Computer Science and AI Lab (CSAIL)
2
What to expect What you’ll get out of this tutorial:
Our view of what Bayesian models have to offer cognitive science. In-depth examples of basic and advanced models: how the math works & what it buys you. Some comparison to other approaches. Opportunities to ask questions. What you won’t get: Detailed, hands-on how-to. Where you can learn more:
3
Outline
Morning: Introduction (Josh); Basic case study #1: Flipping coins (Tom); Basic case study #2: Rules and similarity (Josh)
Afternoon: Advanced case study #1: Causal induction (Tom); Advanced case study #2: Property induction (Josh); Quick tour of more advanced topics (Tom)
5
Bayesian models in cognitive science
Vision Motor control Memory Language Inductive learning and reasoning….
6
Everyday inductive leaps
Learning concepts and words from examples: “horse” … “horse” … “horse”. I’m going to tell you about a broad research program… The problems that intrigue me are all things which people do effortlessly and for the most part quite well, but which we still don’t know how to get computers to do -- a sign that we don’t understand the computational basis of how people do these things.
7
Learning concepts and words
“tufa” Can you pick out the tufas?
8
Inductive reasoning
Input: Cows can get Hick’s disease. Gorillas can get Hick’s disease. (premises) All mammals can get Hick’s disease. (conclusion)
Task: Judge how likely the conclusion is to be true, given that the premises are true.
9
Inferring causal relations
Input:
Day   Took vitamin B23   Headache
1     yes                no
2     yes                yes
3     no                 yes
4     yes                no
Does vitamin B23 cause headaches?
Task: Judge the probability of a causal link given several joint observations.
10
Everyday inductive leaps
How can we learn so much about . . . Properties of natural kinds Meanings of words Future outcomes of a dynamic process Hidden causal properties of an object Causes of a person’s action (beliefs, goals) Causal laws governing a domain . . . from such limited data?
11
The Challenge How do we generalize successfully from very limited data? Just one or a few examples Often only positive examples Philosophy: Induction is a “problem”, a “riddle”, a “paradox”, a “scandal”, or a “myth”. Machine learning and statistics: Focus on generalization from many examples, both positive and negative.
12
Rational statistical inference (Bayes, Laplace)
p(h|d) = p(d|h) p(h) / Σ_{h'∈H} p(d|h') p(h')
Posterior probability: p(h|d). Likelihood: p(d|h). Prior probability: p(h). The denominator sums over the space of hypotheses.
13
Bayesian models of inductive learning: some recent history
Shepard (1987) Analysis of one-shot stimulus generalization, to explain the universal exponential law. Anderson (1990) Models of categorization and causal induction. Oaksford & Chater (1994) Model of conditional reasoning (Wason selection task). Heit (1998) Framework for category-based inductive reasoning.
14
Theory-Based Bayesian Models
Rational statistical inference (Bayes): Learners’ domain theories generate their hypothesis space H and prior p(h). Well-matched to structure of the natural world. Learnable from limited data. Computationally tractable inference.
15
What is a theory?
Working definition: an ontology and a system of abstract principles that generates a hypothesis space of candidate world structures along with their relative probabilities. Analogy to grammar in language. Example: Newton’s laws.
16
Structure and statistics
A framework for understanding how structured knowledge and statistical inference interact. How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. How simplicity trades off with fit to the data in evaluating structural hypotheses. How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.
17
Structure and statistics
A framework for understanding how structured knowledge and statistical inference interact. How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. Hierarchical Bayes. How simplicity trades off with fit to the data in evaluating structural hypotheses. Bayesian Occam’s Razor. How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance. Non-parametric Bayes.
18
Alternative approaches to inductive generalization
Associative learning Connectionist networks Similarity to examples Toolkit of simple heuristics Constraint satisfaction Analogical mapping
19
Marr’s Three Levels of Analysis
Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?” Representation and algorithm: Cognitive psychology. Implementation: Neurobiology. Cognitive psychology traditionally focuses on the 2nd level. But this is unsatisfying, for the same reason as in vision: ad hoc models, with arbitrary assumptions and free parameters; lots of different models with little sense of how they all fit together, or what the real differences are. Cognitive neuroscience has focused on the link between levels 2 and 3. But that doesn’t address what is to me the biggest mystery: how we are able to succeed in these inductive inference tasks, given that induction has been called a puzzle, a paradox, a scandal, going back to Plato and Aristotle. Scandal though it might be, we do these things on a daily basis. That requires level 1. But outside of vision or language, there is not very much work on level 1, or on the link from level 1 to levels 2 and 3. No explanatory adequacy: such models describe how the mind works, but don’t explain why it works that way, or why these inference strategies lead to success in the real world.
20
Why Bayes? A framework for explaining cognition.
How people can learn so much from such limited data. Why process-level models work the way that they do. Strong quantitative models with minimal ad hoc assumptions. A framework for understanding how structured knowledge and statistical inference interact. How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam’s razor). How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.
21
Outline
Morning: Introduction (Josh); Basic case study #1: Flipping coins (Tom); Basic case study #2: Rules and similarity (Josh)
Afternoon: Advanced case study #1: Causal induction (Tom); Advanced case study #2: Property induction (Josh); Quick tour of more advanced topics (Tom)
22
Coin flipping
23
Coin flipping HHTHT HHHHH What process produced these sequences?
24
Bayes’ rule For data D and a hypothesis H, we have:
P(H|D) = P(D|H) P(H) / P(D)
“Posterior probability”: P(H|D). “Likelihood”: P(D|H). “Prior probability”: P(H).
25
The origin of Bayes’ rule
A simple consequence of using probability to represent degrees of belief. For any two random variables A and B: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A), which rearranges to Bayes’ rule: P(B|A) = P(A|B) P(B) / P(A).
26
Why represent degrees of belief with probabilities?
Good statistics: consistency, and worst-case error bounds.
Cox Axioms: necessary to cohere with common sense.
“Dutch Book” + Survival of the Fittest: if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
Provides a theory of learning: a common currency for combining prior knowledge and the lessons of experience.
27
Bayes’ rule For data D and a hypothesis H, we have:
P(H|D) = P(D|H) P(H) / P(D)
“Posterior probability”: P(H|D). “Likelihood”: P(D|H). “Prior probability”: P(H).
28
Hypotheses in Bayesian inference
Hypotheses H refer to processes that could have generated the data D Bayesian inference provides a distribution over these hypotheses, given D P(D|H) is the probability of D being generated by the process identified by H Hypotheses H are mutually exclusive: only one process could have generated D
29
Hypotheses in coin flipping
Describe processes by which D could be generated D = HHTHT Fair coin, P(H) = 0.5 Coin with P(H) = p Markov model Hidden Markov model ... statistical models
30
Hypotheses in coin flipping
Describe processes by which D could be generated D = HHTHT Fair coin, P(H) = 0.5 Coin with P(H) = p Markov model Hidden Markov model ... generative models
31
Representing generative models
Graphical model notation (Pearl, 1988; Jordan, 1998). Variables are nodes; edges indicate dependency; directed edges show the causal process of data generation. Fair coin, P(H) = 0.5: independent nodes d1, d2, d3, d4. Markov model: a chain d1 → d2 → d3 → d4. Observed data HHTHT corresponds to d1 d2 d3 d4 d5.
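As an illustrative aside, the two generative models above can be sketched in code. The stay-probability parameterization of the Markov model is an assumption made here for illustration, not the slides' specification:

```python
import random

def sample_fair_coin(n, rng):
    """Fair coin: each flip independent with P(H) = 0.5."""
    return "".join(rng.choice("HT") for _ in range(n))

def sample_markov(n, p_stay, rng):
    """First-order Markov model: each flip repeats the previous outcome
    with probability p_stay (illustrative parameterization)."""
    seq = [rng.choice("HT")]
    for _ in range(n - 1):
        if rng.random() < p_stay:
            seq.append(seq[-1])                       # stay in same state
        else:
            seq.append("H" if seq[-1] == "T" else "T")  # switch state
    return "".join(seq)

rng = random.Random(0)
print(sample_fair_coin(5, rng))
print(sample_markov(5, 0.9, rng))   # high p_stay gives runs like HHHHH
```

A fair coin makes each node independent; the Markov model makes each node depend on its predecessor, exactly the difference the directed edges encode.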
32
Models with latent structure
Not all nodes in a graphical model need to be observed. Some variables reflect latent structure, used in generating D but unobserved. P(H) = p: a latent parameter p generating flips d1, d2, d3, d4. Hidden Markov model: latent states s1, s2, s3, s4 generating the observed data d1 d2 d3 d4 d5 (HHTHT).
33
Coin flipping Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0 Comparing simple and complex hypotheses P(H) = 0.5 vs. P(H) = p Comparing infinitely many hypotheses P(H) = p Psychology: Representativeness
35
Comparing two simple hypotheses
Contrast two simple hypotheses: H1: “fair coin”, P(H) = 0.5; H2: “always heads”, P(H) = 1.0. Bayes’ rule: with two hypotheses, use the odds form.
36
Bayes’ rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
D: data. H1, H2: models. P(H1|D): posterior probability that H1 generated the data. P(D|H1): likelihood of the data under model H1. P(H1): prior probability that H1 generated the data.
37
Coin flipping HHTHT HHHHH What process produced these sequences?
38
Comparing two simple hypotheses
D: HHTHT. H1, H2: “fair coin”, “always heads”. P(D|H1) = 1/2^5, P(H1) = 999/1000; P(D|H2) = 0, P(H2) = 1/1000. P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] = infinity.
39
Comparing two simple hypotheses
D: HHHHH. H1, H2: “fair coin”, “always heads”. P(D|H1) = 1/2^5, P(H1) = 999/1000; P(D|H2) = 1, P(H2) = 1/1000. P(H1|D) / P(H2|D) ≈ 30.
40
Comparing two simple hypotheses
D: HHHHHHHHHH. H1, H2: “fair coin”, “always heads”. P(D|H1) = 1/2^10, P(H1) = 999/1000; P(D|H2) = 1, P(H2) = 1/1000. P(H1|D) / P(H2|D) ≈ 1.
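The three comparisons above can be sketched in a few lines, using the same prior odds (999:1) and likelihoods as the slides:

```python
# Posterior odds for H1 = "fair coin" vs. H2 = "always heads",
# via the odds form of Bayes' rule.

def posterior_odds(data, prior_h1=0.999, prior_h2=0.001):
    """Return P(H1|D) / P(H2|D) for a sequence of 'H'/'T' outcomes."""
    n = len(data)
    like_h1 = 0.5 ** n                            # fair coin: (1/2)^N
    like_h2 = 1.0 if set(data) <= {"H"} else 0.0  # always heads
    if like_h2 == 0.0:
        return float("inf")                       # one tail refutes H2
    return (like_h1 / like_h2) * (prior_h1 / prior_h2)

print(posterior_odds("HHTHT"))     # inf
print(posterior_odds("HHHHH"))     # ≈ 31: "fair" still favored ~30:1
print(posterior_odds("H" * 10))    # ≈ 0.98: the odds are now about even
```

Ten heads is roughly the point at which the data overwhelm the 999:1 prior, matching the slide's ratio of about 1.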
41
Comparing two simple hypotheses
Bayes’ rule tells us how to combine prior beliefs with new data top-down and bottom-up influences As a model of human inference predicts conclusions drawn from data identifies point at which prior beliefs are overwhelmed by new experiences But… more complex cases?
42
Coin flipping Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0 Comparing simple and complex hypotheses P(H) = 0.5 vs. P(H) = p Comparing infinitely many hypotheses P(H) = p Psychology: Representativeness
43
Comparing simple and complex hypotheses
Which provides a better account of the data: the simple hypothesis of a fair coin, P(H) = 0.5, or the complex hypothesis P(H) = p, with p a free latent parameter?
44
Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways: P(H) = 0.5 is a special case of P(H) = p for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
45
Comparing simple and complex hypotheses
(Plot: probability of the data as a function of p = P(H).)
46
Comparing simple and complex hypotheses
(Plot: probability of HHHHH as a function of p, maximized at p = 1.0.)
47
Comparing simple and complex hypotheses
(Plot: probability of HHTHT as a function of p, maximized at p = 0.6.)
48
Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways: P(H) = 0.5 is a special case of P(H) = p for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5 How can we deal with this? frequentist: hypothesis testing information theorist: minimum description length Bayesian: just use probability theory!
49
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Computing P(D|H1) is easy: P(D|H1) = 1/2^N. Compute P(D|H2) by averaging over p: P(D|H2) = ∫ P(D|p) P(p) dp.
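The averaging step has a closed form under a uniform prior on p: ∫ p^NH (1-p)^NT dp = NH! NT! / (NH + NT + 1)!. A sketch, assuming that uniform prior:

```python
from math import factorial

def like_simple(n_h, n_t):
    """P(D | fair coin) = (1/2)^N."""
    return 0.5 ** (n_h + n_t)

def like_complex(n_h, n_t):
    """P(D | P(H) = p), averaged over p with a uniform prior:
    NH! NT! / (NH + NT + 1)!."""
    return factorial(n_h) * factorial(n_t) / factorial(n_h + n_t + 1)

# HHTHT: the fair coin wins despite the complex model's free parameter.
print(like_simple(3, 2), like_complex(3, 2))   # 0.03125 vs ~0.0167
# HHHHH: the extra flexibility pays off only for extreme data.
print(like_simple(5, 0), like_complex(5, 0))   # 0.03125 vs ~0.167
```

This is the Bayesian Occam's razor in miniature: the complex model spreads its probability over many possible sequences, so it loses on unremarkable data like HHTHT.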
50
Comparing simple and complex hypotheses
Probability Distribution is an average over all values of p
51
Comparing simple and complex hypotheses
Probability Distribution is an average over all values of p
52
Comparing simple and complex hypotheses
Simple and complex hypotheses can be compared directly using Bayes’ rule requires summing over latent variables Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor” This principle is used in model selection methods in psychology (e.g. Myung & Pitt, 1997)
53
Coin flipping Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0 Comparing simple and complex hypotheses P(H) = 0.5 vs. P(H) = p Comparing infinitely many hypotheses P(H) = p Psychology: Representativeness
54
Comparing infinitely many hypotheses
Assume data are generated from a model with latent parameter p = P(H) producing the flips d1, d2, d3, d4. What is the value of p? Each value of p is a hypothesis H, so this requires inference over infinitely many hypotheses.
55
Comparing infinitely many hypotheses
Flip a coin 10 times and see 5 heads, 5 tails. P(H) on next flip? 50% Why? 50% = 5 / (5+5) = 5/10. “Future will be like the past.” Suppose we had seen 4 heads and 6 tails. P(H) on next flip? Closer to 50% than to 40%. Why? Prior knowledge.
56
Integrating prior knowledge and data
Posterior distribution P(p|D) is a probability density over p = P(H). Need to work out the likelihood P(D|p) and specify the prior distribution P(p). P(p|D) ∝ P(D|p) P(p).
57
Likelihood and prior
Likelihood: P(D|p) = p^NH (1-p)^NT (NH: number of heads; NT: number of tails).
Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) ?
58
A simple method of specifying priors
Imagine some fictitious trials, reflecting a set of previous experiences strategy often used with neural networks e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair In fact, this is a sensible statistical idea...
59
Likelihood and prior
Likelihood: P(D|p) = p^NH (1-p)^NT (NH: number of heads; NT: number of tails).
Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) (FH: fictitious observations of heads; FT: fictitious observations of tails). This is the Beta(FH, FT) distribution.
60
Conjugate priors
Conjugate priors exist for many standard distributions (there is a general formula for exponential-family conjugacy). Define the prior in terms of fictitious observations. The Beta distribution is conjugate to the Bernoulli (coin flipping). (Plots: Beta priors with FH = FT = 1, FH = FT = 3, FH = FT = 1000.)
61
Likelihood and prior
Likelihood: P(D|p) = p^NH (1-p)^NT (NH: number of heads; NT: number of tails).
Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) (FH: fictitious observations of heads; FT: fictitious observations of tails).
62
Comparing infinitely many hypotheses
P(p|D) ∝ P(D|p) P(p) = p^(NH+FH-1) (1-p)^(NT+FT-1). The posterior is Beta(NH+FH, NT+FT): the same form as the conjugate prior. Posterior mean: (NH+FH) / (NH+FH+NT+FT). The posterior predictive probability of heads on the next flip equals this posterior mean.
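The conjugate update can be sketched in one line, reproducing the worked numbers on the next slide:

```python
# Beta-Bernoulli conjugate update: the posterior is Beta(NH+FH, NT+FT),
# and the posterior predictive probability of heads is its mean.

def predictive_heads(n_h, n_t, f_h, f_t):
    """P(H on next flip) = (NH + FH) / (NH + NT + FH + FT)."""
    return (n_h + f_h) / (n_h + n_t + f_h + f_t)

# Strong fair-coin prior F = {1000 heads, 1000 tails}, then 4 heads, 6 tails:
print(predictive_heads(4, 6, 1000, 1000))   # 1004/2010 ~ 0.4995
# Weak prior F = {3 heads, 3 tails}, same data:
print(predictive_heads(4, 6, 3, 3))         # 7/16 = 0.4375
```

The fictitious observations simply get pooled with the real ones, which is why a strong prior barely moves while a weak one is pulled toward the data.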
63
Some examples. e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair. After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004 + 1006) = 49.95%. e.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair. After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%. Prior knowledge too weak.
64
But… flipping thumbtacks
e.g., F ={4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67% Some prior knowledge is always necessary to avoid jumping to hasty conclusions... Suppose F = { }: After seeing 2 heads, 0 tails, P(H) on next flip = 2 / (2+0) = 100%
65
Origin of prior knowledge
Tempting answer: prior experience Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails By assuming all coins (and flips) are alike, these observations of other coins are as good as observations of the present coin
66
Problems with simple empiricism
Haven’t really seen 2000 coin flips, or any flips of a thumbtack Prior knowledge is stronger than raw experience justifies Haven’t seen exactly equal number of heads and tails Prior knowledge is smoother than raw experience justifies Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins Prior knowledge is more structured than raw experience
67
A simple theory “Coins are manufactured by a standardized procedure that is effective but not perfect.” Justifies generalizing from previous coins to the present coin. Justifies smoother and stronger prior than raw experience alone. Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin. “Tacks are asymmetric, and manufactured to less exacting standards.”
68
Limitations. Can all domain knowledge be represented so simply, in terms of an equivalent number of fictional observations? Suppose you flip a coin 25 times and get all heads. Something funny is going on… But with F = {1000 heads, 1000 tails}, P(H) on next flip = 1025 / (1025 + 1000) = 50.6%. Looks like nothing unusual.
69
Hierarchical priors. Higher-order hypothesis: is this coin fair or unfair? Example probabilities: P(fair) = 0.99; P(p|fair) is Beta(1000,1000); P(p|unfair) is Beta(1,1). 25 heads in a row propagates up, affecting p and then P(fair|D). (Model: fair → p → d1, d2, d3, d4.)
P(fair|25 heads) / P(unfair|25 heads) = [P(25 heads|fair) / P(25 heads|unfair)] × [P(fair) / P(unfair)] ≈ 9 × 10^-5
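The 9 × 10^-5 odds can be reproduced with the standard Beta-Bernoulli marginal likelihood, B(NH+FH, NT+FT) / B(FH, FT), computed here in log space for numerical stability:

```python
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(n_h, n_t, f_h, f_t):
    """P(data | Beta(FH, FT) prior on p) = B(NH+FH, NT+FT) / B(FH, FT)."""
    return exp(log_beta(n_h + f_h, n_t + f_t) - log_beta(f_h, f_t))

p_fair = 0.99
odds = (marginal_likelihood(25, 0, 1000, 1000) * p_fair) / \
       (marginal_likelihood(25, 0, 1, 1) * (1 - p_fair))
print(odds)   # ≈ 9e-5: 25 heads in a row makes "unfair" the better bet
```

Even a 99% prior on "fair" is swamped, because the Beta(1000,1000) model assigns 25 straight heads roughly (1/2)^25 probability while Beta(1,1) assigns it 1/26.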
70
More hierarchical priors
Latent structure can capture coin variability. 10 flips from each of 200 coins is better than 2000 flips from a single coin: it allows estimation of FH, FT. (Model: shared hyperparameters FH, FT; each of Coin 1, Coin 2, …, Coin 200 has its own p ~ Beta(FH, FT) generating its own flips d1 … d4.)
71
Yet more hierarchical priors
Discrete beliefs drawn from physical knowledge (e.g. symmetry) can influence estimation of continuous properties (e.g. FH, FT). (Model: physical knowledge → FH, FT → a separate p for each coin → its flips.)
72
Comparing infinitely many hypotheses
Apply Bayes’ rule to obtain posterior probability density Requires prior over all hypotheses computation simplified by conjugate priors richer structure with hierarchical priors Hierarchical priors indicate how simple theories can inform statistical inferences one step towards structure and statistics
73
Coin flipping Comparing two simple hypotheses
P(H) = 0.5 vs. P(H) = 1.0 Comparing simple and complex hypotheses P(H) = 0.5 vs. P(H) = p Comparing infinitely many hypotheses P(H) = p Psychology: Representativeness
74
Psychology: Representativeness
Which sequence is more likely from a fair coin: HHTHT or HHHHH? HHTHT is judged more representative of a fair coin (Kahneman & Tversky, 1972).
75
What might representativeness mean?
Evidence for a random generating process. H1: random process (fair coin); H2: alternative processes. In the odds form of Bayes’ rule, P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)], the likelihood ratio P(D|H1) / P(D|H2) measures this evidence.
76
A constrained hypothesis space
Four hypotheses:
h1: fair coin (e.g. HHTHTTTH)
h2: “always alternates” (e.g. HTHTHTHT)
h3: “mostly heads” (e.g. HHTHTHHH)
h4: “always heads” (e.g. HHHHHHHH)
77
Representativeness judgments
78
Results. Good account of representativeness data, with three pseudo-free parameters: r = 0.91. “Always alternates” means 99% of the time; “mostly heads” means P(H) = 0.85; “always heads” means P(H) = 0.99. With a scaling parameter, r = 0.95 (Tenenbaum & Griffiths, 2001).
79
The role of theories The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works. Easy to imagine how a trick all-heads coin could work: high prior probability. Hard to imagine how a trick “HHTHT” coin could work: low prior probability.
80
Summary
Three kinds of Bayesian inference: comparing two simple hypotheses; comparing simple and complex hypotheses; comparing an infinite number of hypotheses.
Critical notions: generative models and graphical models; Bayesian Occam’s razor; priors (conjugate, hierarchical/theories).
81
Outline
Morning: Introduction (Josh); Basic case study #1: Flipping coins (Tom); Basic case study #2: Rules and similarity (Josh)
Afternoon: Advanced case study #1: Causal induction (Tom); Advanced case study #2: Property induction (Josh); Quick tour of more advanced topics (Tom)
82
Rules and similarity
83
Structure versus statistics
Rules Logic Symbols Statistics Similarity Typicality
84
A better metaphor
85
A better metaphor
86
Structure and statistics
Similarity Typicality Rules Logic Symbols
87
Structure and statistics
Basic case study #1: Flipping coins Learning and reasoning with structured statistical models. Basic case study #2: Rules and similarity Statistical learning with structured representations.
88
The number game Program input: number between 1 and 100
Program output: “yes” or “no”
89
The number game Learning task:
Observe one or more positive (“yes”) examples. Judge whether other numbers are “yes” or “no”.
90
The number game Examples of “yes” numbers Generalization
judgments (N = 20) 60 Diffuse similarity
91
The number game Examples of “yes” numbers Generalization
judgments (N = 20) 60 Diffuse similarity Rule: “multiples of 10”
92
The number game Examples of “yes” numbers Generalization
judgments (N = 20) 60 Diffuse similarity Rule: “multiples of 10” Focused similarity: numbers near 50-60
93
The number game Examples of “yes” numbers Generalization
judgments (N = 20) 16 Diffuse similarity Rule: “powers of 2” Focused similarity: numbers near 20
94
The number game Main phenomena to explain:
60 Diffuse similarity Rule: “multiples of 10” Focused similarity: numbers near 50-60 Main phenomena to explain: Generalization can appear either similarity-based (graded) or rule-based (all-or-none). Learning from just a few positive examples.
95
Rule/similarity hybrid models
Category learning Nosofsky, Palmeri et al.: RULEX Erickson & Kruschke: ATRIUM
96
Divisions into “rule” and “similarity” subsystems
Category learning Nosofsky, Palmeri et al.: RULEX Erickson & Kruschke: ATRIUM Language processing Pinker, Marcus et al.: Past tense morphology Reasoning Sloman Rips Nisbett, Smith et al.
97
Rule/similarity hybrid models
Why two modules? Why do these modules work the way that they do, and interact as they do? How do people infer a rule or similarity metric from just a few positive examples?
98
Bayesian model H: Hypothesis space of possible concepts:
h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”) h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”) h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”) h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”) . . . Representational interpretations for H: Candidate rules Features for similarity “Consequential subsets” (Shepard, 1987)
99
Inferring hypotheses from similarity judgment
Additive clustering (Shepard & Arabie, 1977): s_ij = Σ_k w_k f_ik f_jk, where s_ij is the similarity of stimuli i and j, w_k is the weight of cluster k, and f_ik is the membership of stimulus i in cluster k (1 if stimulus i is in cluster k, 0 otherwise). Equivalent to similarity as a weighted sum of common features (Tversky, 1977).
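A minimal sketch of additive-clustering similarity as a weighted count of shared features. The clusters and weights below are made up for illustration, not the fitted values from Shepard & Arabie (1977):

```python
# s_ij = sum over clusters k of w_k * f_ik * f_jk: a weighted count of
# the clusters (features) that stimuli i and j share.

def similarity(i, j, weighted_clusters):
    return sum(w for w, members in weighted_clusters
               if i in members and j in members)

clusters = [
    (0.6, {2, 4, 8}),        # "powers of two" (hypothetical weight)
    (0.4, {0, 1, 2, 3}),     # "small numbers" (hypothetical weight)
]
print(similarity(2, 4, clusters))   # share only "powers of two": 0.6
print(similarity(2, 3, clusters))   # share only "small numbers": 0.4
print(similarity(4, 3, clusters))   # share no cluster: 0
```

Since f_ik is 0/1, the product f_ik f_jk just tests shared membership, which is why the sum runs only over clusters containing both stimuli.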
100
Additive clustering for the integers 0-9:
(Table: additive-clustering solution for the integers 0-9, with clusters ranked by weight. Interpretations: powers of two; small numbers; multiples of three; large numbers; middle numbers; odd numbers; smallish numbers; largish numbers.)
101
Three hypothesis subspaces for number concepts
Mathematical properties (24 hypotheses): Odd, even, square, cube, prime numbers Multiples of small integers Powers of small integers Raw magnitude (5050 hypotheses): All intervals of integers with endpoints between 1 and 100. Approximate magnitude (10 hypotheses): Decades (1-10, 10-20, 20-30, …)
102
Hypothesis spaces and theories
Why a hypothesis space is like a domain theory: Represents one particular way of classifying entities in a domain. Not just an arbitrary collection of hypotheses, but a principled system. What’s missing? Explicit representation of the principles. Hypothesis spaces (and priors) are generated by theories. Some analogies: Grammars generate languages (and priors over structural descriptions) Hierarchical Bayesian modeling
103
Bayesian model H: Hypothesis space of possible concepts:
Mathematical properties: even, odd, square, prime, … Approximate magnitude: {1-10}, {10-20}, {20-30}, … Raw magnitude: all intervals between 1 and 100. X = {x1, …, xn}: n examples of a concept C. Evaluate hypotheses given data: p(h) [“prior”]: domain knowledge, pre-existing biases. p(X|h) [“likelihood”]: statistical information in examples. p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
105
Likelihood: p(X|h). Size principle: smaller hypotheses receive greater likelihood, and exponentially more so as n increases: p(X|h) = (1/size(h))^n if every example falls in h, and 0 otherwise. Follows from the assumption of randomly sampled examples. Captures the intuition of a representative sample.
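A sketch of the size principle, using two of the number-game hypotheses:

```python
# Size principle: with examples sampled uniformly from the concept,
# p(X|h) = (1/|h|)^n when all examples fall in h, else 0.

def size_likelihood(examples, hypothesis):
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)

even = set(range(2, 101, 2))        # 50 numbers
mult10 = set(range(10, 101, 10))    # 10 numbers
X = [60, 80, 10, 30]

# Both hypotheses contain all four examples, but the smaller one is
# (50/10)^4 = 625 times more likely, and the gap grows with n.
print(size_likelihood(X, mult10) / size_likelihood(X, even))   # ≈ 625
```

With one example the ratio is only 5; with four it is 625. That exponential growth is what turns a mild preference into an all-or-none rule.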
106
Illustrating the size principle
h2
107
Illustrating the size principle
h2 Data slightly more of a coincidence under h1
108
Illustrating the size principle
h2 Data much more of a coincidence under h1
109
Bayesian Occam’s Razor
Law of “Conservation of Belief”: for any model M, Σ_d p(D = d | M) = 1, where the sum ranges over all possible data sets d. (Plot: p(D = d | M) across all possible data sets for a simple model M1 and a more flexible model M2.)
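A sketch of conservation of belief for sequences of three flips, comparing the fair coin with the uniform-prior model from earlier:

```python
from itertools import product
from math import factorial

# Summed over every possible data set, each model's probabilities total 1,
# so a flexible model that boosts some sequences must drain others.

def p_fair(seq):
    return 0.5 ** len(seq)

def p_complex(seq):
    """Marginal probability under P(H) = p with a uniform prior on p:
    NH! NT! / (N + 1)!."""
    n_h = seq.count("H")
    n_t = len(seq) - n_h
    return factorial(n_h) * factorial(n_t) / factorial(len(seq) + 1)

seqs = ["".join(s) for s in product("HT", repeat=3)]
print(sum(p_fair(s) for s in seqs))      # both sums ≈ 1
print(sum(p_complex(s) for s in seqs))
print(p_complex("HHH"), p_fair("HHH"))   # 0.25 vs 0.125
```

The flexible model doubles its belief in HHH, but only by taking probability away from mixed sequences like HHT: that reallocation is the Bayesian Occam's razor.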
110
Comparing simple and complex hypotheses
Probability Distribution is an average over all values of p
111
Prior: p(h) Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses. Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”.
112
Prior: p(h). Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses. Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”. p(h) encodes relative weights of alternative theories within the total hypothesis space H:
H1: Math properties (24 hypotheses: even numbers, powers of two, multiples of three, …), with p(H1) = 1/5 and p(h) = p(H1) / 24.
H2: Raw magnitude (5050 hypotheses: 10-15, 20-32, 37-54, …), with p(H2) = 3/5 and p(h) = p(H2) / 5050.
H3: Approx. magnitude (10 hypotheses: 10-20, 20-30, 30-40, …), with p(H3) = 1/5 and p(h) = p(H3) / 10.
113
A more complex approach to priors
Start with a base set of regularities R and combination operators C. Hypothesis space = closure of R under C. C = {and, or}: H = unions and intersections of regularities in R (e.g., “multiples of 10 between 30 and 70”). C = {and-not}: H = regularities in R with exceptions (e.g., “multiples of 10 except 50 and 70”). Two qualitatively similar priors: Description length: number of combinations in C needed to generate a hypothesis from R. Bayesian Occam’s Razor, with model classes defined by number of combinations: more combinations → more hypotheses → lower prior.
114
Posterior: X = {60, 80, 10, 30} Why prefer “multiples of 10” over “even numbers”? p(X|h). Why prefer “multiples of 10” over “multiples of 10 except 50 and 20”? p(h). Why does a good generalization need both high prior and high likelihood? p(h|X) ~ p(X|h) p(h)
115
Bayesian Occam’s Razor
Probabilities provide a common currency for balancing model complexity with fit to the data.
116
Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X), the probability that C applies to some new stimulus y? The judgments people make are not directly about which hypothesis is correct for the meaning of the word “blicket”, but about which things are blickets. How do we use our knowledge about which hypotheses are likely to correspond to the extension of the word blicket, encoded in the posterior, to generalize the word to new objects? Bayes tells us to compute…
117
Generalizing to new objects
Hypothesis averaging: compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X): p(y ∈ C | X) = Σ_h p(y ∈ C | h) p(h|X), which reduces to summing p(h|X) over the hypotheses h that contain y. The judgments people make are not directly about which hypothesis is correct for the meaning of the word “blicket”, but about which things are blickets. How do we use our knowledge about which hypotheses are likely to correspond to the extension of the word blicket, encoded in the posterior, to generalize the word to new objects? Bayes tells us to compute…
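A sketch of hypothesis averaging in a stripped-down number game. The two hypotheses and equal priors are toy assumptions; the likelihood is the size principle from earlier:

```python
def generalize(y, X, hypotheses, priors):
    """P(y in C | X) = sum of p(h|X) over hypotheses h containing y."""
    post = {}
    for name, h in hypotheses.items():
        like = (1.0 / len(h)) ** len(X) if all(x in h for x in X) else 0.0
        post[name] = priors[name] * like            # unnormalized p(h|X)
    z = sum(post.values())
    return sum(p for name, p in post.items() if y in hypotheses[name]) / z

hypotheses = {"even": set(range(2, 101, 2)),
              "mult10": set(range(10, 101, 10))}
priors = {"even": 0.5, "mult10": 0.5}
X = [60, 80, 10, 30]

print(generalize(20, X, hypotheses, priors))   # in both hypotheses: 1.0
print(generalize(22, X, hypotheses, priors))   # only in "even": ≈ 0.0016
```

After four examples the posterior is concentrated on "multiples of 10", so generalization is nearly all-or-none; with one example it would be a graded blend of the two hypotheses.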
118
Examples: 16
119
Connection to feature-based similarity
Additive clustering model of similarity: s_ij = Σ_k w_k f_ik f_jk. Bayesian hypothesis averaging: equivalent if we identify the features f_k with hypotheses h, and the weights w_k with the posterior probabilities p(h|X).
120
Examples: 16 8 2 64
121
Examples: 16 23 19 20
122
Model fits Examples of “yes” numbers Generalization judgments (N = 20)
Bayesian Model (r = 0.96) 60
123
Model fits Examples of “yes” numbers Generalization judgments (N = 20)
Bayesian Model (r = 0.93) 16
124
Summary of the Bayesian model
How do the statistics of the examples interact with prior knowledge to guide generalization? Why does generalization appear rule-based or similarity-based? broad p(h|X): similarity gradient narrow p(h|X): all-or-none rule
125
Summary of the Bayesian model
How do the statistics of the examples interact with prior knowledge to guide generalization? Why does generalization appear rule-based or similarity-based? Many h of similar size: broad p(h|X) One h much smaller: narrow p(h|X)
126
Alternative models
Neural networks: (diagram: the input number 60 coded over features such as “even”, “multiple of 10”, “power of 2”, with training examples 60, 80, 10, 30).
127
Alternative models Neural networks Hypothesis ranking and elimination
(Diagram: candidate hypotheses “even”, “multiple of 10”, “multiple of 3”, “power of 2”, …, ranked and eliminated against the examples 60, 80, 10, 30.)
128
Alternative models Neural networks Hypothesis ranking and elimination
Similarity to exemplars Average similarity: 60 Data Model (r = 0.80)
129
Alternative models Neural networks Hypothesis ranking and elimination
Similarity to exemplars Max similarity: 60 Data Model (r = 0.64)
130
Alternative models Neural networks Hypothesis ranking and elimination
Similarity to exemplars Average similarity Max similarity Flexible similarity? Bayes.
131
Alternative models Neural networks Hypothesis ranking and elimination
Similarity to exemplars Toolbox of simple heuristics 60: “general” similarity : most specific rule (“subset principle”). : similarity in magnitude Why these heuristics? When to use which heuristic? Bayes.
132
Summary
Generalization from limited data is possible via the interaction of structured knowledge and statistics. Structured knowledge: a space of candidate rules; theories generate the hypothesis space (cf. hierarchical priors). Statistics: Bayesian Occam’s razor. This helps us better understand the interactions between traditionally opposing concepts: rules and statistics; rules and similarity. It also explains why central but notoriously slippery processing-level concepts work the way they do: similarity; representativeness; rules and representativeness.
133
Why Bayes? A framework for explaining cognition.
How people can learn so much from such limited data. Why process-level models work the way that they do. Strong quantitative models with minimal ad hoc assumptions. A framework for understanding how structured knowledge and statistical inference interact. How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam’s razor). How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.
134
Theory-Based Bayesian Models
Rational statistical inference (Bayes): Learners’ domain theories generate their hypothesis space H and prior p(h). Well-matched to structure of the natural world. Learnable from limited data. Computationally tractable inference.
135
Looking towards the afternoon
How do we apply these ideas to more natural and complex aspects of cognition? Where do the hypothesis spaces come from? Can we formalize the contributions of domain theories?
137
Outline Morning Afternoon Introduction (Josh)
Basic case study #1: Flipping coins (Tom) Basic case study #2: Rules and similarity (Josh) Afternoon Advanced case study #1: Causal induction (Tom) Advanced case study #2: Property induction (Josh) Quick tour of more advanced topics (Tom)
138
Outline Morning Afternoon Introduction (Josh)
Basic case study #1: Flipping coins (Tom) Basic case study #2: Rules and similarity (Josh) Afternoon Advanced case study #1: Causal induction (Tom) Advanced case study #2: Property induction (Josh) Quick tour of more advanced topics (Tom)
139
Marr’s Three Levels of Analysis
Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?” Representation and algorithm: Cognitive psychology Implementation: Neurobiology Cognitive psychology traditionally focuses on the 2nd level. But this is unsatisfying, for the same reason as in vision: - Ad hoc models, with arbitrary assumptions and free parameters. - Lots of different models with little sense of how they all fit together, or what the real differences are. Cognitive neuroscience has focused on the link between levels 2 and 3. But that doesn’t address what is to me the biggest mystery: how we are able to succeed in these inductive inference tasks, given that induction has been called a puzzle, a paradox, a scandal! Going back to Plato, Aristotle. Scandal though it might be, we do these things on a daily basis. That requires level 1. But outside of vision or language, not very much work on level 1, or the link from level 1 to levels 2 and 3. No explanatory adequacy. Describe how the mind works, but don’t explain why it works that way. Why these inference strategies lead to success in the real world.
140
Working at the computational level
statistical What is the computational problem? input: data output: solution
141
Working at the computational level
statistical What is the computational problem? input: data output: solution What knowledge is available to the learner? Where does that knowledge come from?
142
Theory-Based Bayesian Models
Rational statistical inference (Bayes): Learners’ domain theories generate their hypothesis space H and prior p(h). Well-matched to structure of the natural world. Learnable from limited data. Computationally tractable inference.
143
Causality
144
Bayes nets and beyond... Increasingly popular approach to studying human causal inferences (e.g. Glymour, 2001; Gopnik et al., 2004) Three reactions: Bayes nets are the solution! Bayes nets are missing the point, not sure why… what is a Bayes net?
145
Bayes nets and beyond... What are Bayes nets?
graphical models causal graphical models An example: elemental causal induction Beyond Bayes nets… other knowledge in causal induction formalizing causal theories
146
Bayes nets and beyond... What are Bayes nets?
graphical models causal graphical models An example: elemental causal induction Beyond Bayes nets… other knowledge in causal induction formalizing causal theories
147
Graphical models Express the probabilistic dependency structure among a set of variables (Pearl, 1988) Consist of a set of nodes, corresponding to variables a set of edges, indicating dependency a set of functions defined on the graph that defines a probability distribution
148
Undirected graphical models
X3 X4 X1 Consist of a set of nodes a set of edges a potential for each clique, multiplied together to yield the distribution over variables Examples statistical physics: Ising model, spin glasses early neural networks (e.g. Boltzmann machines) X2 X5
149
Directed graphical models
X3 X4 X1 Consist of a set of nodes a set of edges a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables Constrained to directed acyclic graphs (DAG) AKA: Bayesian networks, Bayes nets X2 X5
150
Bayesian networks and Bayes
Two different problems Bayesian statistics is a method of inference Bayesian networks are a form of representation There is no necessary connection many users of Bayesian networks rely upon frequentist statistical methods (e.g. Glymour) many Bayesian inferences cannot be easily represented using Bayesian networks
151
Properties of Bayesian networks
Efficient representation and inference exploiting dependency structure makes it easier to represent and compute with probabilities Explaining away pattern of probabilistic reasoning characteristic of Bayesian networks, especially early use in AI
152
Efficient representation and inference
Three binary variables: Cavity, Toothache, Catch
153
Efficient representation and inference
Three binary variables: Cavity, Toothache, Catch Specifying P(Cavity, Toothache, Catch) requires 7 parameters (1 for each set of values, minus 1 because it’s a probability distribution) With n variables, we need 2^n – 1 parameters Here n = 3. Realistically, many more: X-ray, diet, oral hygiene, personality, …
154
Conditional independence
All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity In probabilistic terms: P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity) With n evidence variables, x1, …, xn, we need 2n conditional probabilities: P(xi | Cavity) and P(xi | ¬Cavity) for each xi
155
A simple Bayesian network
Graphical representation of relations between a set of random variables: Probabilistic interpretation: factorizing complex terms P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity) Cavity Toothache Catch
156
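The factorization on this slide can be checked with a minimal Python sketch. Only the Cavity → {Toothache, Catch} structure comes from the slides; the CPT numbers here are hypothetical, for illustration only.

```python
# Factorized joint from the slide:
#   P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity)
# CPT values below are hypothetical.
p_cavity = 0.1
p_ache = {True: 0.8, False: 0.05}    # P(Toothache = 1 | Cavity)
p_catch = {True: 0.9, False: 0.1}    # P(Catch = 1 | Cavity)

def joint(cavity, ache, catch):
    p = p_cavity if cavity else 1 - p_cavity
    p *= p_ache[cavity] if ache else 1 - p_ache[cavity]
    p *= p_catch[cavity] if catch else 1 - p_catch[cavity]
    return p

# 1 + 2 + 2 = 5 parameters instead of 2**3 - 1 = 7, and the factorized
# joint still sums to 1 over the 8 states.
total = sum(joint(c, a, k) for c in (True, False)
            for a in (True, False) for k in (True, False))
```

With more variables the savings grow: a naive joint over n binary variables needs 2^n – 1 numbers, while a sparse network needs one small CPT per node.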
A more complex system
Battery Radio Ignition Gas Starts On time to work Joint distribution sufficient for any inference:
157
A more complex system
Battery Radio Ignition Gas Starts On time to work Joint distribution sufficient for any inference:
158
A more complex system
Battery Radio Ignition Gas Starts On time to work Joint distribution sufficient for any inference: General inference algorithm: local message passing (belief propagation; Pearl, 1988) efficiency depends on sparseness of graph structure
159
Explaining away Rain Sprinkler Grass Wet Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
160
Explaining away Rain Sprinkler Grass Wet
Compute probability it rained last night, given that the grass is wet:
161
Explaining away Rain Sprinkler Grass Wet
Compute probability it rained last night, given that the grass is wet:
162
Explaining away Rain Sprinkler Grass Wet
Compute probability it rained last night, given that the grass is wet:
163
Explaining away Rain Sprinkler Grass Wet
Compute probability it rained last night, given that the grass is wet:
164
Explaining away Rain Sprinkler Grass Wet
Compute probability it rained last night, given that the grass is wet: Between 1 and P(s)
165
Explaining away Rain Sprinkler Grass Wet
Compute probability it rained last night, given that the grass is wet and sprinklers were left on: Both terms = 1
166
Explaining away Rain Sprinkler Grass Wet
Compute probability it rained last night, given that the grass is wet and sprinklers were left on:
167
Explaining away Rain Sprinkler Grass Wet “Discounting” to prior probability.
168
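The discounting pattern above can be checked numerically. A minimal sketch, assuming hypothetical priors for Rain and Sprinkler and the slide's deterministic rule (grass is wet iff it rained or the sprinkler was on):

```python
# Explaining away with Wet = 1 iff Rain = 1 or Sprinkler = 1.
# Priors are hypothetical.
p_r, p_s = 0.2, 0.4

p_w = p_r + p_s - p_r * p_s        # P(Wet = 1)
p_r_given_w = p_r / p_w            # P(Rain | Wet): raised above the prior
p_r_given_ws = p_r                 # P(Rain | Wet, Sprinkler): the sprinkler
                                   # alone explains Wet, so Rain "discounts"
                                   # back to its prior probability
```

Observing Wet raises the probability of Rain; additionally observing Sprinkler pushes it back down to the prior, exactly the explaining-away pattern the spreading-activation model fails to produce.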
Contrast w/ production system
Rain Sprinkler Grass Wet Formulate IF-THEN rules: IF Rain THEN Wet IF Wet THEN Rain Rules do not distinguish directions of inference Requires combinatorial explosion of rules IF Wet AND NOT Sprinkler THEN Rain
169
Contrast w/ spreading activation
Rain Sprinkler Grass Wet Excitatory links: Rain → Wet, Sprinkler → Wet Observing rain, Wet becomes more active. Observing grass wet, Rain and Sprinkler become more active. Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
170
Contrast w/ spreading activation
Rain Sprinkler Grass Wet Excitatory links: Rain → Wet, Sprinkler → Wet Inhibitory link between Rain and Sprinkler Observing grass wet, Rain and Sprinkler become more active. Observing grass wet and sprinkler, Rain becomes less active: explaining away.
171
Contrast w/ spreading activation
Rain Burst pipe Sprinkler Grass Wet Each new variable requires more inhibitory connections. Interactions between variables are not causal. Not modular. Whether a connection exists depends on what other connections exist, in non-transparent ways. Big holism problem. Combinatorial explosion.
172
Graphical models Capture dependency structure in distributions
Provide an efficient means of representing and reasoning with probabilities Allow kinds of inference that are problematic for other representations: explaining away, which is hard to capture in a production system and hard to capture with spreading activation
173
Bayes nets and beyond... What are Bayes nets?
graphical models causal graphical models An example: causal induction Beyond Bayes nets… other knowledge in causal induction formalizing causal theories
174
Causal graphical models
Graphical models represent statistical dependencies among variables (i.e., correlations) can answer questions about observations Causal graphical models represent causal dependencies among variables express underlying causal structure can answer questions about both observations and interventions (actions upon a variable)
175
Observation and intervention
Battery Radio Ignition Gas Starts On time to work Graphical model: P(Radio|Ignition) Causal graphical model: P(Radio|do(Ignition))
176
Observation and intervention
Battery Radio Ignition Gas Starts On time to work Graphical model: P(Radio|Ignition) Causal graphical model: P(Radio|do(Ignition)) “graph surgery” produces “mutilated graph”
177
Assessing interventions
To compute P(Y|do(X=x)), delete all edges coming into X and reason with the resulting Bayesian network (“do calculus”; Pearl, 2000) Allows a single structure to make predictions about both observations and interventions
178
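A sketch of "graph surgery" on the Battery → {Radio, Ignition} fork from these slides: observing Ignition is evidence about Battery (and hence Radio), while do(Ignition) deletes the incoming edge, so Ignition carries no evidence and Radio reverts to its marginal. All probabilities here are hypothetical.

```python
# Hypothetical CPTs: Battery -> Radio, Battery -> Ignition.
p_batt = 0.9
p_radio = {True: 0.95, False: 0.0}   # P(Radio = 1 | Battery)
p_ign = {True: 0.97, False: 0.0}     # P(Ignition = 1 | Battery)

def p_radio_given_ignition():
    # Observation: condition on Ignition = 1, which is evidence about Battery.
    num = sum((p_batt if b else 1 - p_batt) * p_ign[b] * p_radio[b]
              for b in (True, False))
    den = sum((p_batt if b else 1 - p_batt) * p_ign[b]
              for b in (True, False))
    return num / den

def p_radio_do_ignition():
    # Intervention: delete the Battery -> Ignition edge ("mutilated graph").
    # Ignition is set exogenously and tells us nothing about Battery.
    return sum((p_batt if b else 1 - p_batt) * p_radio[b]
               for b in (True, False))
```

Here P(Radio | Ignition) = 0.95 exceeds P(Radio | do(Ignition)) = 0.855: the same structure answers both kinds of question, with different results.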
Causality simplifies inference
Using a representation in which the direction of causality is correct produces sparser graphs Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”: Does not capture the correlation between symptoms: falsely believe P(Ache, Catch) = P(Ache) P(Catch). Ache Catch Cavity
179
Causality simplifies inference
Using a representation in which the direction of causality is correct produces sparser graphs Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”: Inserting a new arrow allows us to capture this correlation. This model is too complex: we do not believe that Ache → Catch Ache Catch Cavity
180
Causality simplifies inference
Using a representation in which the direction of causality is correct produces sparser graphs Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”: New symptoms require a combinatorial proliferation of new arrows. This reduces efficiency of inference. Ache X-ray Catch Cavity
181
Learning causal graphical models
Strength: how strong is a relationship? Structure: does a relationship exist? B E B C E B C B
182
Causal structure vs. causal strength
Strength: how strong is a relationship? B E B C E B C B
183
Causal structure vs. causal strength
Strength: how strong is a relationship? requires defining nature of relationship B E B C w0 w1 E B C w0 B
184
Parameterization Generic Structures: h1 = h0 = Parameterization: C B B
h1: P(E = 1 | C, B) h0: P(E = 1| C, B)
185
Parameterization Linear Structures: h1 = h0 = Parameterization: C B B
w0, w1: strength parameters for B, C Linear parameterization: P(E = 1 | C = 0, B = 1) = w0; P(E = 1 | C = 1, B = 1) = w1 + w0 C B h1: P(E = 1 | C, B) h0: P(E = 1 | C, B)
186
Parameterization “Noisy-OR” Structures: h1 = h0 = Parameterization: C
B C w0, w1: strength parameters for B, C “Noisy-OR” parameterization: P(E = 1 | C = 0, B = 1) = w0; P(E = 1 | C = 1, B = 1) = w1 + w0 – w1 w0 h1: P(E = 1 | C, B) h0: P(E = 1 | C, B)
187
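The two parameterizations can be written as one small function. A sketch, assuming (as in the slides) that the background cause B is always present:

```python
def p_effect(c, w0, w1, form):
    """P(E = 1 | C = c), with the background cause B always present.
    w0, w1 are the strength parameters for B and C."""
    if form == "linear":
        return w0 + w1 * c                     # linear: strengths add
                                               # (can exceed 1 if unchecked)
    if form == "noisy_or":
        return 1 - (1 - w0) * (1 - w1) ** c    # noisy-OR: independent
                                               # generative mechanisms
    raise ValueError(form)
```

For c = 1 the noisy-OR expression expands to w1 + w0 – w1·w0, matching the slide's table.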
Parameter estimation Maximum likelihood estimation: maximize ∏i P(bi, ci, ei; w0, w1) Bayesian methods: as in the “Comparing infinitely many hypotheses” example…
188
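A sketch of maximum likelihood estimation by grid search under the noisy-OR parameterization, on hypothetical contingency counts. It also checks the deck's claim that the MLE of w1 under noisy-OR is Cheng's causal power:

```python
import math
from itertools import product

# Hypothetical 2x2 contingency counts (a, b, c, d as in the slides' table).
a, b, c, d = 6, 2, 2, 6   # e+/c+, e-/c+, e+/c-, e-/c-

def loglik(w0, w1):
    # Noisy-OR, with the background cause always present.
    p1 = 1 - (1 - w0) * (1 - w1)   # P(e+ | c+)
    p0 = w0                        # P(e+ | c-)
    return (a * math.log(p1) + b * math.log(1 - p1) +
            c * math.log(p0) + d * math.log(1 - p0))

grid = [i / 200 for i in range(1, 200)]
w0_hat, w1_hat = max(product(grid, grid), key=lambda w: loglik(*w))

# Closed form: under noisy-OR the MLE of w1 is causal power,
# (P(e+|c+) - P(e+|c-)) / (1 - P(e+|c-)).
power = (a / (a + b) - c / (c + d)) / (1 - c / (c + d))
```

The grid maximum lands at w0 ≈ P(e+|c–) and w1 ≈ causal power, up to grid resolution.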
Causal structure vs. causal strength
Structure: does a relationship exist? B E B C E B C B
189
Approaches to structure learning
Constraint-based: dependency from statistical tests (e.g., χ2) deduce structure from dependencies B B C E (Pearl, 2000; Spirtes et al., 1993)
190
Approaches to structure learning
Constraint-based: dependency from statistical tests (e.g., χ2) deduce structure from dependencies B B C E (Pearl, 2000; Spirtes et al., 1993)
191
Approaches to structure learning
Constraint-based: dependency from statistical tests (e.g., χ2) deduce structure from dependencies B B C E (Pearl, 2000; Spirtes et al., 1993)
192
Approaches to structure learning
Constraint-based: dependency from statistical tests (e.g., χ2) deduce structure from dependencies B B C E (Pearl, 2000; Spirtes et al., 1993) Attempts to reduce inductive problem to deductive problem
193
Approaches to structure learning
Constraint-based: dependency from statistical tests (e.g., χ2) deduce structure from dependencies B B C E (Pearl, 2000; Spirtes et al., 1993) Bayesian: compute posterior probability of structures, given observed data B C B C E E P(S1|data) P(S0|data) P(S|data) ∝ P(data|S) P(S) (Heckerman, 1998; Friedman, 1999)
194
Causal graphical models
Extend graphical models to deal with interventions as well as observations Respecting the direction of causality results in efficient representation and inference Two steps in learning causal models parameter estimation structure learning
195
Bayes nets and beyond... What are Bayes nets?
graphical models causal graphical models An example: elemental causal induction Beyond Bayes nets… other knowledge in causal induction formalizing causal theories
196
Elemental causal induction
             C present   C absent
E present        a           c
E absent         b           d
“To what extent does C cause E?”
197
Causal structure vs. causal strength
Strength: how strong is a relationship? Structure: does a relationship exist? B E B C w0 w1 E B C w0 B
198
Causal strength Assume structure:
Leading models (ΔP and causal power) are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C): linear → ΔP, Noisy-OR → causal power B E B C w0 w1
199
Causal structure Hypotheses: h1 = h0 = Bayesian causal inference: B E
support = log [ P(data | h1) / P(data | h0) ] B E B C E B C B
200
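Causal support can be approximated numerically: the marginal likelihood of h1 (noisy-OR, with strengths integrated out under uniform priors) against h0 (background cause only). A sketch on hypothetical contingency counts:

```python
import math

# Hypothetical 2x2 counts: e+/c+, e-/c+, e+/c-, e-/c-.
a, b, c, d = 6, 2, 2, 6

def lik_h1(w0, w1):
    # h1: B -> E <- C, noisy-OR with background always present.
    p1 = 1 - (1 - w0) * (1 - w1)   # P(e+ | c+)
    return p1**a * (1 - p1)**b * w0**c * (1 - w0)**d

def lik_h0(w0):
    # h0: B -> E only; C is causally irrelevant.
    return w0**(a + c) * (1 - w0)**(b + d)

# Riemann-sum approximation of the marginal likelihoods (uniform priors).
K = 200
ws = [(i + 0.5) / K for i in range(K)]
m1 = sum(lik_h1(w0, w1) for w0 in ws for w1 in ws) / K**2   # P(data | h1)
m0 = sum(lik_h0(w0) for w0 in ws) / K                       # P(data | h0)
support = math.log(m1 / m0)   # > 0: evidence for a causal relationship
```

For these counts (ΔP = 0.5 with 8 trials per condition), support comes out positive, favoring the causal structure.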
Buehner and Cheng (1997) People ΔP (r = 0.89) Power (r = 0.88)
Support (r = 0.97)
201
The importance of parameterization
Noisy-OR incorporates mechanism assumptions: generativity: causes increase probability of effects each cause is sufficient to produce the effect causes act via independent mechanisms (Cheng, 1997) Consider other models: statistical dependence: χ2 test generic parameterization (Anderson, computer science)
202
People Support (Noisy-OR) χ2 Support (generic)
203
Generativity is essential
[Figure: Support as a function of P(e+|c+) ∈ {8/8, 6/8, 4/8, 2/8, 0/8} and P(e+|c–)] Predictions result from “ceiling effect” ceiling effects only matter if you believe a cause increases the probability of an effect
204
Bayes nets and beyond... What are Bayes nets?
graphical models causal graphical models An example: elemental causal induction Beyond Bayes nets… other knowledge in causal induction formalizing causal theories
205
Hamadeh et al. (2002) Toxicological sciences.
chemicals genes Clofibrate Wyeth 14,643 Gemfibrozil Phenobarbital Carnitine Palmitoyl Transferase 1 p450 2B1
206
Hamadeh et al. (2002) Toxicological sciences.
chemicals genes X Clofibrate Wyeth 14,643 Gemfibrozil Phenobarbital Carnitine Palmitoyl Transferase 1 p450 2B1
207
Hamadeh et al. (2002) Toxicological sciences.
chemicals genes peroxisome proliferators Chemical X Clofibrate Wyeth 14,643 Gemfibrozil Phenobarbital + Carnitine Palmitoyl Transferase 1 p450 2B1
208
Using causal graphical models
Three questions (usually solved by researcher) what are the variables? what structures are plausible? how do variables interact? How are these questions answered if causal graphical models are used in cognition?
209
Bayes nets and beyond... What are Bayes nets?
graphical models causal graphical models An example: elemental causal induction Beyond Bayes nets… other knowledge in causal induction formalizing causal theories
210
Theory-based causal induction
Causal theory Ontology Plausible relations Functional form P(h|data) ∝ P(data|h) P(h) Evaluated by statistical inference Z B Y X h0: h1: P(h1) = r P(h0) = 1 – r Hypothesis space of causal graphical models Generates
211
Blicket detector (Gopnik, Sobel, and colleagues)
See this? It’s a blicket machine. Blickets make it go. Let’s put this one on the machine. Oooh, it’s a blicket!
212
“Blocking” Two objects: A and B
Trial 1 Trial 2 Trials 3, 4 Two objects: A and B Trial 1: A on detector – detector active Trial 2: B on detector – detector inactive Trials 3,4: A B on detector – detector active 3, 4-year-olds judge whether each object is a blicket A: a blicket B: not a blicket
213
A deductive inference? Causal law: detector activates if and only if one or more objects on top of it are blickets. Premises: Trial 1: A on detector – detector active Trial 2: B on detector – detector inactive Trials 3,4: A B on detector – detector active Conclusions deduced from premises and causal law: A: a blicket B: not a blicket
214
“Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004)
Trial 1 Trial 2 Two objects: A and B Trial 1: A B on detector – detector active Trial 2: A on detector – detector active 4-year-olds judge whether each object is a blicket A: a blicket (100% of judgments) B: probably not a blicket (66% of judgments)
215
Theory Ontology Constraints on causal relations
Types: Block, Detector, Trial Predicates: Contact(Block, Detector, Trial) Active(Detector, Trial) Constraints on causal relations For any Block b and Detector d, with prior probability q : Cause(Contact(b,d,t), Active(d,t)) Functional form of causal relations Causes of Active(d,t) are independent mechanisms, with causal strengths wi. A background cause has strength w0. Assume a near-deterministic mechanism: wi ~ 1, w0 ~ 0.
216
Theory Ontology Types: Block, Detector, Trial Predicates:
Contact(Block, Detector, Trial) Active(Detector, Trial) E A B
217
Theory Ontology Types: Block, Detector, Trial Predicates:
Contact(Block, Detector, Trial) Active(Detector, Trial) A B E A = 1 if Contact(block A, detector, trial), else 0 B = 1 if Contact(block B, detector, trial), else 0 E = 1 if Active(detector, trial), else 0
218
Theory h00 : h10 : h01 : h11 : Constraints on causal relations
For any Block b and Detector d, with prior probability q : Cause(Contact(b,d,t), Active(d,t)) P(h00) = (1 – q)^2 P(h10) = q(1 – q) h00 : h10 : h01 : h11 : B A B No hypotheses with E → B, E → A, A → B, etc. A E E P(h01) = (1 – q) q P(h11) = q^2 = “A is a blicket” E A A B E A B E
219
Theory Functional form of causal relations Causes of Active(d,t) are independent mechanisms, with causal strengths wb. A background cause has strength w0. Assume a near-deterministic mechanism: wb ~ 1, w0 ~ 0. P(h00) = (1 – q)^2 P(h01) = (1 – q) q P(h10) = q(1 – q) P(h11) = q^2 A B A B A B A B E E E E P(E=1 | A=0, B=0): P(E=1 | A=1, B=0): P(E=1 | A=0, B=1): P(E=1 | A=1, B=1): “Activation law”: E=1 if and only if A=1 or B=1.
220
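A minimal sketch of backwards blocking under this theory: prior q per object and the deterministic activation law (wi ~ 1, w0 ~ 0). The value q = 1/3 is a hypothetical stand-in for the learner's prior.

```python
def posterior_blickets(q, trials):
    """Posterior probabilities that A and B are blickets.
    Each trial is (A in contact?, B in contact?, detector active?).
    Deterministic activation law: detector fires iff a blicket is on it."""
    post = {(ba, bb): (q if ba else 1 - q) * (q if bb else 1 - q)
            for ba in (0, 1) for bb in (0, 1)}
    for a, b, e in trials:
        for h in post:
            predicted = int((a and h[0]) or (b and h[1]))
            if predicted != e:
                post[h] = 0.0      # hypothesis ruled out by this trial
    z = sum(post.values())
    post = {h: p / z for h, p in post.items()}
    return post[(1, 0)] + post[(1, 1)], post[(0, 1)] + post[(1, 1)]

# Backwards blocking: AB -> active, then A alone -> active.
p_a, p_b = posterior_blickets(q=1/3, trials=[(1, 1, 1), (1, 0, 1)])
```

A is judged a blicket with certainty, while B falls back to its prior probability q, which is why manipulating the prior (rare vs. common pre-training) shifts judgments about B.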
Bayesian inference Evaluating causal models in light of data:
Inferring a particular causal relation:
221
Modeling backwards blocking
P(h00) = (1 – q)^2 P(h01) = (1 – q) q P(h10) = q(1 – q) P(h11) = q^2 A B A B A B A B E E E E P(E=1 | A=0, B=0): P(E=1 | A=1, B=0): P(E=1 | A=0, B=1): P(E=1 | A=1, B=1):
222
Modeling backwards blocking
P(h00) = (1 – q)^2 P(h01) = (1 – q) q P(h10) = q(1 – q) P(h11) = q^2 A B A B A B A B E E E E P(E=1 | A=1, B=1):
223
Modeling backwards blocking
P(h01) = (1 – q) q P(h10) = q(1 – q) P(h11) = q^2 A B A B A B E E E P(E=1 | A=1, B=0): P(E=1 | A=1, B=1):
224
Manipulating the prior
I. Pre-training phase: Blickets are rare II. Backwards blocking phase: B A Trial 1 Trial 2 After each trial, adults judge the probability that each object is a blicket.
225
“Rare” condition: First observe 12 objects on detector, of which 2 set it off.
226
“Common” condition: First observe 12 objects on detector, of which 10 set it off.
227
Inferences from ambiguous data
I. Pre-training phase: Blickets are rare II. Two trials: A B detector, B C detector A B C Trial 1 Trial 2 After each trial, adults judge the probability that each object is a blicket.
228
Same domain theory generates hypothesis space for 3 objects:
Hypotheses: h000 = h100 = h010 = h001 = h110 = h011 = h101 = h111 = Likelihoods: E E A B C A B C E E A B C A B C E E A B C A B C E E P(E=1 | A, B, C; h) = 1 if A = 1 and A → E exists, or B = 1 and B → E exists, or C = 1 and C → E exists, else 0.
229
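The same theory applied to the ambiguous three-object data (AB → active, then BC → active) yields graded rather than all-or-none inferences. A sketch, with a hypothetical "rare" prior q:

```python
from itertools import product

# Three objects A, B, C; deterministic activation law as in the slides.
q = 1/6   # hypothetical prior from "blickets are rare" pre-training
post = {h: q**sum(h) * (1 - q)**(3 - sum(h))
        for h in product((0, 1), repeat=3)}   # h[i] = 1 iff object i is a blicket

trials = [((1, 1, 0), 1),   # A B on detector -> active
          ((0, 1, 1), 1)]   # B C on detector -> active

for contact, e in trials:
    for h in post:
        predicted = int(any(c and b for c, b in zip(contact, h)))
        if predicted != e:
            post[h] = 0.0
z = sum(post.values())
post = {h: p / z for h, p in post.items()}

# Marginal probability that each object is a blicket.
p_blicket = [sum(p for h, p in post.items() if h[i]) for i in range(3)]
```

B, which explains both trials by itself, comes out most probable; A and C receive equal, intermediate probabilities.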
“Rare” condition: First observe 12 objects on detector, of which 2 set it off.
230
The role of causal mechanism knowledge
Is mechanism knowledge necessary? Constraint-based learning using χ2 tests of conditional independence. How important is the deterministic functional form of causal relations? Bayes with “noisy sufficient causes” theory (cf. Cheng’s causal power theory).
231
Bayes with correct theory:
Bayes with “noisy sufficient causes” theory:
232
Theory-based causal induction
Explains one-shot causal inferences about physical systems: blicket detectors Captures a spectrum of inferences: unambiguous data: adults and children make all-or-none inferences ambiguous data: adults and children make more graded inferences Extends to more complex cases with hidden variables, dynamic systems: come to my talk!
233
Summary Causal graphical models provide a language for asking questions about causality Key issues in modeling causal induction: what do we mean by causal induction? how do knowledge and statistics interact? Bayesian approach allows exploration of different answers to these questions
234
Outline Morning Afternoon Introduction (Josh)
Basic case study #1: Flipping coins (Tom) Basic case study #2: Rules and similarity (Josh) Afternoon Advanced case study #1: Causal induction (Tom) Advanced case study #2: Property induction (Josh) Quick tour of more advanced topics (Tom)
235
Property induction
236
Collaborators Charles Kemp Neville Sanjana Lauren Schmidt Amy Perfors
Fei Xu Liz Baraff Pat Shafto
237
The Big Question How can we generalize new concepts reliably from just one or a few examples? Learning word meanings “horse” “horse” “horse”
238
The Big Question How can we generalize new concepts reliably from just one or a few examples? Learning word meanings, causal relations, social rules, …. Property induction How probable is the conclusion (target) given the premises (examples)? Gorillas have T4 cells. Squirrels have T4 cells. All mammals have T4 cells.
239
The Big Question How can we generalize new concepts reliably from just one or a few examples? Learning word meanings, causal relations, social rules, …. Property induction More diverse examples → stronger generalization Gorillas have T4 cells. Squirrels have T4 cells. All mammals have T4 cells. Gorillas have T4 cells. Chimps have T4 cells. All mammals have T4 cells.
240
Is rational inference the answer?
Everyday induction often appears to follow principles of rational scientific inference. Could that explain its success? Goal of this work: a rational computational model of human inductive generalization. Explain people’s judgments as approximations to optimal inference in natural environments. Close quantitative fits to people’s judgments with a minimum of free parameters or assumptions.
241
Theory-Based Bayesian Models
Rational statistical inference (Bayes): Learners’ domain theories generate their hypothesis space H and prior p(h). Well-matched to structure of the natural world. Learnable from limited data. Computationally tractable inference.
242
The plan Similarity-based models Theory-based model Bayesian models
“Empiricist” Bayes Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes Learning with multiple domain theories Learning domain theories Size principle: genericity, nonaccidental, Occam.
243
The plan Similarity-based models Theory-based model Bayesian models
“Empiricist” Bayes Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes Learning with multiple domain theories Learning domain theories Size principle: genericity, nonaccidental, Occam.
244
An experiment (Osherson et al., 1990)
20 subjects rated the strength of 45 arguments: X1 have property P. X2 have property P. X3 have property P. All mammals have property P. 40 different subjects rated the similarity of all pairs of 10 mammals.
245
Similarity-based models (Osherson et al.)
strength(“all mammals” | X ) x x x Mammals: Examples: x
246
Similarity-based models (Osherson et al.)
strength(“all mammals” | X ) x x x Mammals: Examples: x
247
Similarity-based models (Osherson et al.)
strength(“all mammals” | X ) x x x Mammals: Examples: x
248
Similarity-based models (Osherson et al.)
strength(“all mammals” | X ) x x x Mammals: Examples: x
249
Similarity-based models (Osherson et al.)
Sum-Similarity: strength(“all mammals” | X ) x x S x Mammals: Examples: x
250
Similarity-based models (Osherson et al.)
Max-Similarity: strength(“all mammals” | X ) x x max x Mammals: Examples: x
251
Similarity-based models (Osherson et al.)
Max-Similarity: strength(“all mammals” | X ) x x x Mammals: Examples: x
252
Similarity-based models (Osherson et al.)
Max-Similarity: strength(“all mammals” | X ) x x x Mammals: Examples: x
253
Similarity-based models (Osherson et al.)
Max-Similarity: strength(“all mammals” | X ) x x x Mammals: Examples: x
254
Similarity-based models (Osherson et al.)
Max-Similarity: strength(“all mammals” | X ) x x x Mammals: Examples: x
255
Sum-sim versus Max-sim
Two models appear functionally similar: Both increase monotonically as new examples are observed. Reasons to prefer Sum-sim: Standard form of exemplar models of categorization, memory, and object recognition. Analogous to kernel density estimation techniques in statistical pattern recognition. Reasons to prefer Max-sim: Fit to generalization judgments
256
Data vs. models. Data Model Each “ ” represents one argument:
X1 have property P. X2 have property P. X3 have property P. All mammals have property P.
257
Three data sets Max-sim Sum-sim Conclusion kind: “all mammals”
“horses” “horses” Number of examples: 1, 2, or 3
258
Feature rating data (Osherson and Wilkie)
People were given 48 animals, 85 features, and asked to rate whether each animal had each feature. E.g., elephant: 'gray' 'hairless' 'toughskin' 'big' 'bulbous' 'longleg' 'tail' 'chewteeth' 'tusks' 'smelly' 'walks' 'slow' 'strong' 'muscle’ 'quadrapedal' 'inactive' 'vegetation' 'grazer' 'oldworld' 'bush' 'jungle' 'ground' 'timid' 'smart' 'group'
259
Compute similarity based on Hamming distance, or cosine.
? Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? Features New property Compute similarity based on Hamming distance, or cosine. Generalize based on Max-sim or Sum-sim.
260
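The similarity-based models above can be sketched on made-up binary feature vectors (cosine similarity; the species and features here are hypothetical stand-ins for the Osherson–Wilkie ratings):

```python
import math

# Made-up 5-feature binary vectors, for illustration only.
features = {
    "horse":    [1, 1, 0, 1, 0],
    "cow":      [1, 1, 0, 1, 1],
    "dolphin":  [0, 0, 1, 0, 1],
    "squirrel": [1, 0, 0, 0, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def max_sim(target, examples):
    # Max-Similarity: generalize by the nearest example.
    return max(cosine(features[target], features[x]) for x in examples)

def sum_sim(target, examples):
    # Sum-Similarity: generalize by total similarity to all examples.
    return sum(cosine(features[target], features[x]) for x in examples)

# strength("cow has P" | {horse, dolphin}) under each rule:
s_max = max_sim("cow", ["horse", "dolphin"])
s_sum = sum_sim("cow", ["horse", "dolphin"])
```

Hamming distance could be swapped in for cosine with one-line changes; the Max vs. Sum contrast is the part that matters for the model comparisons that follow.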
Three data sets r = 0.77 r = 0.75 r = 0.94 Max-Sim r = – 0.21 r = 0.63
Sum-Sim Conclusion kind: “all mammals” “horses” “horses” Number of examples: 1, 2, or 3
261
Problems for sim-based approach
No principled explanation for why Max-Sim works so well on this task, and Sum-Sim so poorly, when Sum-Sim is the standard in other similarity-based models. Free parameters mixing similarity and coverage terms, and possibly Max-Sim and Sum-Sim terms. Does not extend to induction with other kinds of properties, e.g., from Smith et al., 1993: Dobermanns can bite through wire. German shepherds can bite through wire. Poodles can bite through wire. German shepherds can bite through wire.
262
Marr’s Three Levels of Analysis
Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?” Representation and algorithm: Max-sim, Sum-sim Implementation: Neurobiology Cognitive psychology traditionally focuses on the 2nd level. But this is unsatisfying, for the same reason as in vision: - Ad hoc models, with arbitrary assumptions and free parameters. - Lots of different models with little sense of how they all fit together, or what the real differences are. Cognitive neuroscience has focused on the link between levels 2 and 3. But that doesn’t address what is to me the biggest mystery: how we are able to succeed in these inductive inference tasks, given that induction has been called a puzzle, a paradox, a scandal! Going back to Plato, Aristotle. Scandal though it might be, we do these things on a daily basis. That requires level 1. But outside of vision or language, not very much work on level 1, or the link from level 1 to levels 2 and 3. No explanatory adequacy. Describe how the mind works, but don’t explain why it works that way. Why these inference strategies lead to success in the real world.
263
The plan Similarity-based models Theory-based model Bayesian models
“Empiricist” Bayes Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes Learning with multiple domain theories Learning domain theories Size principle: genericity, nonaccidental, Occam.
264
Theory-based induction
Scientific biology: species generated by an evolutionary branching process. A tree-structured taxonomy of species. Taxonomy also central in folkbiology (Atran).
265
Theory-based induction
Begin by reconstructing intuitive taxonomy from similarity judgments: clustering chimp gorilla horse cow rhino seal elephant mouse squirrel dolphin
266
How taxonomy constrains induction
Atran (1998): “Fundamental principle of systematic induction” (Warburton 1967, Bock 1973) Given a property found among members of any two species, the best initial hypothesis is that the property is also present among all species that are included in the smallest higher-order taxon containing the original pair of species.
267
Strong (0.76 [max = 0.82]) elephant squirrel chimp gorilla horse cow
rhino mouse dolphin seal “all mammals” Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. Strong (0.76 [max = 0.82])
268
Strong: 0.76 [max = 0.82] Weak: 0.17 [min = 0.14] elephant squirrel
chimp gorilla horse cow rhino mouse dolphin seal “large herbivores” Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. Cows have property P. Horses have property P. Rhinos have property P. All mammals have property P. Strong: 0.76 [max = 0.82] Weak: 0.17 [min = 0.14]
269
Strong: 0.76 [max = 0.82] Weak: 0.30 [min = 0.14] elephant squirrel
chimp gorilla horse cow rhino mouse dolphin seal “all mammals” Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. Seals have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. Strong: 0.76 [max = 0.82] Weak: 0.30 [min = 0.14]
270
Taxonomic distance Max-sim Sum-sim Conclusion kind: “all mammals”
“horses” “horses” Number of examples: 1, 2, or 3
271
The challenge Can we build models with the best of both traditional approaches? Quantitatively accurate predictions. Strong rational basis. Will require novel ways of integrating structured knowledge with statistical inference.
272
The plan Similarity-based models Theory-based model Bayesian models
“Empiricist” Bayes Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes Learning with multiple domain theories Learning domain theories Size principle: genericity, nonaccidental, Occam.
273
The Bayesian approach ? Features New property Species 1 Species 2 ?
274
The Bayesian approach ? Features New property Generalization
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? Features Generalization Hypothesis New property
275
The Bayesian approach h d p(h) p(d |h) Features New property
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? Features Generalization Hypothesis New property
281
Bayes’ rule: h d p(h) p(d |h) Features New property Generalization
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? Features Generalization Hypothesis New property
282
Probability that property Q holds for species x:
p(h) p(d |h) h d Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? Features Generalization Hypothesis New property
283
“Size principle”: p(d | h) = (1 / |h|)^n if d is consistent with h, and 0 otherwise, where n is the number of examples in d and |h| = # of positive instances of h.
284
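The size principle on the previous slide can be written directly in code. This is an illustrative sketch of the likelihood, not the tutorial's implementation; the number-concept hypotheses are the ones used in the next slides:

```python
# The size principle: if n examples are sampled uniformly from a
# hypothesis h, then p(d | h) = (1/|h|)**n when every example lies in h,
# and 0 otherwise, so smaller consistent hypotheses score higher.

def likelihood(examples, hypothesis):
    """p(d | h) under strong sampling from h."""
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)

evens = set(range(2, 101, 2))       # "even numbers" in 1..100, |h| = 50
mult10 = set(range(10, 101, 10))    # "multiples of 10", |h| = 10

data = [10, 30, 50]                 # consistent with both hypotheses
print(likelihood(data, evens))      # (1/50)**3, roughly 8e-06
print(likelihood(data, mult10))     # (1/10)**3, roughly 1e-03
```

With three examples the smaller hypothesis is favored by a factor of 125, which is the "coincidence" argument of the following slides made quantitative.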
The size principle h1 h2 “even numbers” “multiples of 10”
285
The size principle Data slightly more of a coincidence under h1 h1 h2
“even numbers” “multiples of 10” Data slightly more of a coincidence under h1
286
The size principle Data much more of a coincidence under h1 h1 h2
“even numbers” “multiples of 10” Data much more of a coincidence under h1
287
Illustrating the size principle
Which argument is stronger? “Non-monotonicity” Grizzly bears have property P. All mammals have property P. Grizzly bears have property P. Brown bears have property P. Polar bears have property P. All mammals have property P.
288
Probability that property Q holds for species x, averaging over hypotheses: p(Q(x) | d) = Σ p(h | d), where the sum ranges over all hypotheses h that include x.
289
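Hypothesis averaging can be sketched as follows. This is a toy illustration, not the model fit in the talk: the two hypotheses, their priors, and the size-principle likelihood are all invented stand-ins:

```python
# Bayesian generalization by hypothesis averaging:
#   p(Q(x) | d) = sum over hypotheses h containing x of p(h | d).

def posterior(hypotheses, prior, examples):
    """p(h | d), with size-principle likelihood p(d|h) = (1/|h|)^n."""
    scores = {}
    for name, h in hypotheses.items():
        if all(x in h for x in examples):
            scores[name] = prior[name] * (1.0 / len(h)) ** len(examples)
        else:
            scores[name] = 0.0
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

def p_generalize(x, hypotheses, prior, examples):
    """p(Q(x) | d): total posterior mass of hypotheses containing x."""
    post = posterior(hypotheses, prior, examples)
    return sum(post[name] for name, h in hypotheses.items() if x in h)

hypotheses = {
    "all mammals": {"cow", "horse", "dolphin", "seal", "squirrel"},
    "large herbivores": {"cow", "horse"},
}
prior = {"all mammals": 0.5, "large herbivores": 0.5}

# Premises "cow, horse" fit the small hypothesis well, so generalization
# to dolphins stays modest.
print(p_generalize("dolphin", hypotheses, prior, examples=["cow", "horse"]))
```

Replacing one premise with "dolphin" would knock out the small hypothesis and push generalization to all mammals toward 1, which is the pattern in the earlier strong/weak argument slides.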
Specifying the prior p(h)
A good prior must focus on a small subset of all 2^n possible hypotheses, in order to: Match the distribution of properties in the world. Be learnable from limited data. Be computationally efficient. We consider two approaches: “Empiricist” Bayes: unstructured prior based directly on known features. “Theory-based” Bayes: structured prior based on a rational domain theory, tuned to known features.
291
“Empiricist” Bayes: h d p(h) = (Heit, 1998) Features New property
h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 p(h) = h d Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? Features Generalization Hypothesis New property
292
Results. “Empiricist” Bayes: r = 0.38, r = 0.16, r = 0.79. Max-Sim: r = 0.77.
293
Why doesn’t “Empiricist” Bayes work?
With no structural bias, requires too many features to estimate the prior reliably. An analogy: Estimating a smooth probability density function by local interpolation. N = 100 N = 500 N = 5
294
Why doesn’t “Empiricist” Bayes work?
With no structural bias, requires too many features to estimate the prior reliably. An analogy: Estimating a smooth probability density function by local interpolation. Assuming an appropriately structured form for density (e.g., Gaussian) leads to better generalization from sparse data. N = 5 N = 5
295
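The density-estimation analogy on this slide can be made concrete. A minimal sketch, with invented data: an unstructured local estimate (a histogram) versus a structured parametric fit (assume a Gaussian form) from only five samples:

```python
# With n = 5 samples from a Gaussian, fitting the assumed parametric
# form generalizes far better than raw local interpolation.

import math
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(5)]

# Unstructured estimate: a coarse histogram over [-4, 4].
def hist_density(x, data, bins=8, lo=-4.0, hi=4.0):
    width = (hi - lo) / bins
    b = min(bins - 1, max(0, int((x - lo) / width)))
    count = sum(1 for d in data if lo + b * width <= d < lo + (b + 1) * width)
    return count / (len(data) * width)

# Structured estimate: fit mean and variance, assume Gaussian form.
def gauss_density(x, data):
    m = sum(data) / len(data)
    v = sum((d - m) ** 2 for d in data) / len(data)
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

# The histogram assigns zero density to any empty bin, while the
# Gaussian fit interpolates smoothly between the sparse samples.
print(hist_density(3.5, samples), gauss_density(3.5, samples))
```

The structured estimator wins precisely because its form matches how the data were generated; the same logic motivates the structured ("theory-based") prior over hypotheses.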
“Theory-based” Bayes Theory: Two principles based on the structure of species and properties in the natural world. 1. Species generated by an evolutionary branching process. A tree-structured taxonomy of species (Atran, 1998). 2. Features generated by stochastic mutation process and passed on to descendants. Novel features can appear anywhere in tree, but some distributions are more likely than others.
296
Mutation process generates p(h|T): Choose a label for the root. The label then mutates along each branch b with a probability determined by the mutation rate λ and the branch length |b|.
297
Samples from the prior >
Labelings that cut the data along fewer branches are more probable: > “monophyletic” “polyphyletic”
299
Samples from the prior >
Labelings that cut the data along longer branches are more probable: > “more distinctive” “less distinctive”
300
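The two properties just illustrated (fewer cuts, longer branches) fall out of simulating the mutation process. This sketch is illustrative only: the toy tree, branch lengths, and the exact flip probability 1 − exp(−λ|b|) are assumptions, not the authors' formula:

```python
# Sampling candidate extensions h from a tree-based mutation prior.
# A binary label starts at the root and mutates along each branch with
# probability increasing in lam * branch length (assumed form).

import math
import random

random.seed(1)

# A toy taxonomy: node -> children; BRANCH gives each child's branch length.
TREE = {
    "root": ["ungulates", "rodents"],
    "ungulates": ["cow", "horse"],
    "rodents": ["mouse", "squirrel"],
}
BRANCH = {"ungulates": 1.0, "rodents": 1.0,
          "cow": 0.5, "horse": 0.5, "mouse": 0.5, "squirrel": 0.5}

def sample_hypothesis(lam=0.3):
    """Sample a leaf labeling by simulating mutations down TREE."""
    labels = {}
    def descend(node, label):
        children = TREE.get(node, [])
        if not children:
            labels[node] = label
        for c in children:
            p_flip = 1.0 - math.exp(-lam * BRANCH[c])  # assumed flip prob.
            new = 1 - label if random.random() < p_flip else label
            descend(c, new)
    descend("root", random.choice([0, 1]))
    return {leaf for leaf, v in labels.items() if v == 1}

# Sampled labelings tend to cut few branches: nearby species (cow and
# horse) usually receive the same label, distant ones less often.
print(sample_hypothesis())
```

Repeating the sampling many times shows cow and horse agreeing far more often than cow and mouse, which is exactly the "fewer cuts are more probable" bias of the prior.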
T h d Mutation process over tree T generates p(h|T).
Message passing over tree T efficiently sums over all h. How do we know which tree T to use? s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 T p(h|T) h d Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? Features Generalization Hypothesis New property
301
The same mutation process generates p(Features|T): Assume each feature is generated independently over the tree. Use MCMC to infer the most likely tree T and mutation rate λ given the observed features. No free parameters!
302
Results. “Theory-based” Bayes: r = 0.91, r = 0.95, r = 0.91. “Empiricist” Bayes: r = 0.38. Max-Sim: r = 0.77, r = 0.75, r = 0.94.
303
Grounding in similarity
Reconstruct intuitive taxonomy from similarity judgments: clustering chimp gorilla horse cow rhino seal elephant mouse squirrel dolphin
304
Theory-based Bayes Max-sim Sum-sim Conclusion kind: “all mammals”
“horses” “horses” Number of examples: 1, 2, or 3
305
Explaining similarity
Why does Max-sim fit so well? It is an efficient and accurate approximation to the Theory-based Bayesian model. Theorem: nearest-neighbor classification approximates evolutionary Bayes in the limit of high mutation rate, if the domain is tree-structured. Correlation with Bayes on three-premise general arguments, over 100 simulated trees: mean r = 0.94.
306
Alternative feature-based models
Taxonomic Bayes (strictly taxonomic hypotheses, with no mutation process) > “monophyletic” “polyphyletic”
307
Alternative feature-based models
Taxonomic Bayes (strictly taxonomic hypotheses, with no mutation process) PDP network (Rogers and McClelland) Features Species
308
Results. Taxonomic Bayes: bias too strong (r = 0.51, 0.53, 0.85). PDP network: bias too weak (r = 0.41, 0.62, 0.71). Theory-based Bayes: bias just right!
309
Mutation principle versus pure Occam’s Razor
Mutation principle provides a version of Occam’s Razor, by favoring hypotheses that span fewer disjoint clusters. Could we use a more generic Bayesian Occam’s Razor, without the biological motivation of mutation?
310
Premise typicality effect (Rips, 1975; Osherson et al., 1990): Strong: Horses have property P. All mammals have property P. Weak: Seals have property P. All mammals have property P. Models compared: Bayes (taxonomy + mutation), Bayes (taxonomy + Occam), Max-sim. Conclusion kind: “all mammals”. Number of examples: 1.
313
Typicality meets hierarchies
Collins and Quillian: semantic memory is structured hierarchically. Traditional story: simple hierarchical structure sits uncomfortably with typicality effects & exceptions. New story: typicality & exceptions are compatible with rational statistical inference over a hierarchy.
314
Intuitive versus scientific theories of biology
Same structure for how species are related: a tree-structured taxonomy. Same probabilistic model for traits: a small probability of arising along any branch at any time, plus inheritance. Different features: scientists use genes; people use coarse anatomy and behavior.
315
Induction in Biology: summary
Theory-based Bayesian inference explains taxonomic inductive reasoning in folk biology. Insight into processing-level accounts: Why Max-sim over Sum-sim in this domain? How is hierarchical representation compatible with typicality effects & exceptions? Reveals essential principles of the domain theory. Category structure: taxonomic tree. Feature distribution: stochastic mutation process + inheritance.
316
The plan Similarity-based models Theory-based model Bayesian models
“Empiricist” Bayes Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes Learning with multiple domain theories Learning domain theories Size principle: genericity, nonaccidental, Occam.
317
Property type: generic “essence”. Theory structure: taxonomic tree . . .
(figure: taxonomic tree over Lion, Cheetah, Hyena, Giraffe, Gazelle, Gorilla, Monkey)
318
Property type: generic “essence” / size-related / food-carried. Theory structure: taxonomic tree / dimensional / directed acyclic network.
(figures: a taxonomic tree, a one-dimensional ordering, and a food-web network, each over Lion, Cheetah, Hyena, Giraffe, Gazelle, Gorilla, Monkey)
319
One-dimensional predicates
Q = “Have skins that are more resistant to penetration than most synthetic fibers.” Unknown relevant property: skin toughness. Model the influence of known properties via the judged prior probability that each species has Q. (figure: house cat, camel, elephant, rhino ordered by skin toughness, with a threshold for Q)
320
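A minimal sketch of this one-dimensional model, with invented toughness values and a hypothetical uniform prior over candidate thresholds (none of these numbers come from the talk):

```python
# One-dimensional predicate model: Q holds for a species iff its value
# on a single dimension (skin toughness) exceeds an unknown threshold.
# Averaging over thresholds consistent with the data gives graded
# generalization.

toughness = {"house cat": 1.0, "camel": 4.0, "elephant": 7.0, "rhino": 9.0}

def p_has_q(species, known_positive, thresholds):
    """p(Q(species) | Q(known_positive)): fraction of thresholds that
    are consistent with the known positive example and that the target
    species also exceeds."""
    consistent = [t for t in thresholds if toughness[known_positive] >= t]
    if not consistent:
        return 0.0
    return sum(toughness[species] >= t for t in consistent) / len(consistent)

thresholds = [i * 0.5 for i in range(1, 21)]  # uniform prior over thresholds

# Learning that elephants have Q makes rhinos near-certain to have it,
# and house cats much less so.
print(p_has_q("rhino", "elephant", thresholds))
print(p_has_q("house cat", "elephant", thresholds))
```

This reproduces the asymmetry the slide describes: generalization runs "up" the dimension much more readily than "down" it.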
One-dimensional predicates
Bayes (taxonomy + mutation) Max-sim Bayes (1-D model)
321
Food web model fits (Shafto et al.)
Disease: r = 0.77 (mammals), r = 0.82 (island). Property: r = -0.35 (mammals), r = -0.05 (island).
322
Taxonomic tree model fits (Shafto et al.)
Disease: r = -0.12 (mammals), r = 0.16 (island). Property: r = 0.81 (mammals), r = 0.62 (island).
323
The plan Similarity-based models Theory-based model Bayesian models
“Empiricist” Bayes Theory-based Bayes, with different theories Connectionist (PDP) models Advanced Theory-based Bayes Learning with multiple domain theories Learning domain theories Size principle: genericity, nonaccidental, Occam.
324
Theory: species organized in a taxonomic tree structure; feature i generated by a mutation process with rate λi.
Structure: p(S|T), a tree over species S1–S10 with features F1–F14 arising on its branches (λ10 high ~ weight low).
Data: p(D|S), the observed features of Species 1–10.
325
Theory: species organized in a taxonomic tree structure; feature i generated by a mutation process with rate λi.
Structure: p(S|T).
Data: p(D|S), now including a novel Species X with all of its feature values unobserved (?).
326
Theory: species organized in a taxonomic tree structure; feature i generated by a mutation process with rate λi.
Structure: p(S|T), with Species X (SX) placed at its inferred location in the tree.
Data: p(D|S).
327
Where does the domain theory come from?
Innate. Atran (1998): The tendency to group living kinds into hierarchies reflects an “innately determined cognitive structure”. Emerges (only approximately) through learning in unstructured connectionist networks. McClelland and Rogers (2003).
328
Bayesian inference to theories
Challenge to the nativist-empiricist dichotomy. We really do have structured domain theories. We really do learn them. Bayesian framework applies over multiple levels: Given hypothesis space + data, infer concepts. Given theory + data, infer hypothesis space. Given X + data, infer theory.
329
Bayesian inference to theories
Candidate theories for biological species and their features: T0: features generated independently for each species (cf. naive Bayes, Anderson’s rational model). T1: features generated by mutation over a tree-structured taxonomy of species. T2: features generated by mutation over a one-dimensional chain of species. Score theories by their likelihood on the object-feature matrix: p(D|T) = Σ_S p(D|S) p(S|T).
330
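The idea of scoring theories by likelihood can be shown with a deliberately tiny toy example. This is not the model in the talk: it compares T0 (every cell an independent coin flip) against an assumed "clustered" theory in which each feature takes one shared value per cluster of species:

```python
# Scoring candidate theories by (log) likelihood on a species x feature
# matrix. Data with strong cluster structure favors the clustered theory.

import math

data = [
    [1, 1, 0, 0],   # species 1
    [1, 1, 0, 0],   # species 2
    [0, 0, 1, 1],   # species 3
    [0, 0, 1, 1],   # species 4
]
clusters = [[0, 1], [2, 3]]  # assumed grouping of species

def log_lik_T0(data):
    """T0: each cell is Bernoulli(0.5), independently."""
    n_cells = len(data) * len(data[0])
    return n_cells * math.log(0.5)

def log_lik_clustered(data, clusters):
    """Clustered theory: per feature and cluster, marginalize over the
    cluster's shared value; probability 0.5 if all members agree, else 0."""
    total = 0.0
    for j in range(len(data[0])):
        for cl in clusters:
            vals = {data[i][j] for i in cl}
            if len(vals) > 1:
                return float("-inf")  # theory cannot produce this matrix
            total += math.log(0.5)
    return total

print(log_lik_T0(data), log_lik_clustered(data, clusters))
```

The clustered theory pays for only 8 binary choices instead of 16, so it assigns this matrix a much higher likelihood; the tree and chain theories on the following slides play the same game with richer structures.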
T0: No organizational structure for species. Features distributed independently over species. (figure: a species × feature matrix over Species 1–10 and F1–F14, with features scattered independently)
331
T0: No organizational structure for species. Features distributed independently over species. (figure: a species × feature matrix over Species 1–10 and F1–F14 whose features in fact cluster across species)
332
T0: No organizational structure for species; features distributed independently over species. T1: Species organized in a taxonomic tree structure; features distributed via a stochastic mutation process. (figure: the same species × feature matrix viewed under both theories, with T1’s tree over S1–S10)
333
T0: p(Data|T0) ≈ 1.83 × 10^-41. No organizational structure for species; features distributed independently over species. T1: p(Data|T1) ≈ 2.42 × 10^-32. Species organized in a taxonomic tree structure; features distributed via a stochastic mutation process. The tree-structured data favors T1.
334
T0: No organizational structure for species; features distributed independently over species. T1: Species organized in a taxonomic tree structure; features distributed via a stochastic mutation process. (figure: a shuffled species × feature matrix viewed under both theories)
335
T0: p(Data|T0) ≈ 2.29 × 10^-42. No organizational structure for species; features distributed independently over species. T1: p(Data|T1) ≈ 4.38 × 10^-53. Species organized in a taxonomic tree structure; features distributed via a stochastic mutation process. The unstructured (shuffled) data favors T0.
336
Empirical tests. Synthetic data: 32 objects, 120 features, generated from a tree-structured model, a linear-chain model, or an unconstrained model (independent features). Real data: animal feature judgments (48 species, 85 features); US Supreme Court decisions (9 people, 637 cases).
337
Results. Preferred model for each dataset: null, tree, or linear.
338
Theory acquisition: summary
So far, just a computational proof of concept. Future work: Experimental studies of theory acquisition in the lab, with adult and child subjects. Modeling developmental or historical trajectories of theory change. Sources of hypotheses for candidate theories: What is innate? Role of analogy?
339
Outline Morning Afternoon Introduction (Josh)
Basic case study #1: Flipping coins (Tom) Basic case study #2: Rules and similarity (Josh) Afternoon Advanced case study #1: Causal induction (Tom) Advanced case study #2: Property induction (Josh) Quick tour of more advanced topics (Tom)
340
Advanced topics
341
Structure and statistics
Statistical language modeling topic models Relational categorization attributes and relations
342
Structure and statistics
Statistical language modeling topic models Relational categorization attributes and relations
343
Statistical language modeling
A variety of approaches to statistical language modeling are used in cognitive science e.g. LSA (Landauer & Dumais, 1997) distributional clustering (Redington, Chater, & Finch, 1998) Generative models have unique advantages identify assumed causal structure of language make use of standard tools of Bayesian statistics easily extended to capture more complex structure
344
Generative models for language
latent structure observed data
345
Generative models for language
meaning sentences
346
Topic models: Each document is a mixture of topics.
Each word is chosen from a single topic. Introduced by Blei, Ng, and Jordan (2001); a reinterpretation of PLSI (Hofmann, 1999). The idea of probabilistic topics is widely used (e.g., Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003).
347
Generating a document: θ (distribution over topics) → z (topic assignment for each word) → w (observed words).
348
Topic 1: P(w|z = 1) = φ(1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS all 0.0. Topic 2: P(w|z = 2) = φ(2): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2; HEART, LOVE, SOUL, TEARS, JOY all 0.0.
349
Choose mixture weights for each document, generate “bag of words”
θ = {P(z = 1), P(z = 2)}: {0, 1} {0.25, 0.75} {0.5, 0.5} {0.75, 0.25} {1, 0} MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
350
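The generative process just shown is easy to simulate. A sketch using the two toy topics from the slide (the seed and document length are arbitrary choices, not from the talk):

```python
# Generating a "bag of words" document from a two-topic model: pick a
# topic z ~ theta for each word, then a word w ~ phi[z].

import random

random.seed(0)

phi = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}

def generate_document(theta, n_words):
    """Sample n_words, each via topic z ~ theta then word w ~ phi[z]."""
    words = []
    for _ in range(n_words):
        z = random.choices([1, 2], weights=[theta[1], theta[2]])[0]
        w = random.choices(list(phi[z]), weights=list(phi[z].values()))[0]
        words.append(w)
    return words

# A document with theta = {0.75, 0.25} is dominated by topic-1 words.
print(generate_document({1: 0.75, 2: 0.25}, n_words=20))
```

Inference in the topic model runs this process in reverse: given only the bags of words, recover θ for each document and φ for each topic.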
A selection of topics (from 500)
THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE NATURE WORLD HUMAN PHILOSOPHY MORAL KNOWLEDGE THOUGHT REASON SENSE OUR TRUTH NATURAL EXISTENCE BEING LIFE MIND ARISTOTLE BELIEVED EXPERIENCE REALITY THIRD FIRST SECOND THREE FOURTH FOUR GRADE TWO FIFTH SEVENTH SIXTH EIGHTH HALF SEVEN SIX SINGLE NINTH END TENTH ANOTHER
351
A selection of topics (from 500)
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
352
Learning topic hierarchies
(Blei, Griffiths, Jordan, & Tenenbaum, 2004)
354
Syntax and semantics from statistics
Factorization of language based on statistical dependency patterns: long-range, document-specific dependencies capture semantics (probabilistic topics, variables θ and z); short-range dependencies constant across all documents capture syntax (a probabilistic regular grammar, variables x). (Griffiths, Steyvers, Blei, & Tenenbaum, submitted)
355
The class-conditional distributions: x = 1 is the semantic class, emitting from topics z = 1 (HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2) and z = 2 (SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2); x = 2 emits function words (OF 0.6, FOR 0.3, BETWEEN 0.1); x = 3 emits determiners (THE 0.6, A 0.3, MANY 0.1). The remaining numbers (0.8, 0.7, 0.3, 0.1, 0.2, 0.9) are transition probabilities between classes.
Generating a sentence word by word from the composite model: THE → THE LOVE → THE LOVE OF → THE LOVE OF RESEARCH …… (a determiner from x = 3, a topic word from x = 1, a function word from x = 2, then a topic word from x = 1 again).
360
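The composite syntax-plus-semantics model can be sketched as a small simulation. The topic and function-word vocabularies come from the slide, but the class-transition structure and all probabilities here are invented stand-ins for the garbled numbers:

```python
# Composite model: an HMM over word classes x chooses, at each position,
# either the semantic class (emit via the topic model) or a syntactic
# class (function word or determiner).

import random

random.seed(2)

topics = {
    1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
    2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
}
syntax_emissions = {
    2: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1]),
    3: (["THE", "A", "MANY"], [0.6, 0.3, 0.1]),
}
transition = {3: [1, 2], 1: [2, 3], 2: [1, 3]}  # assumed class graph

def generate_sentence(n_words, topic_weights):
    words, x = [], 3            # start with the determiner class
    for _ in range(n_words):
        if x == 1:              # semantic class: emit via the topic model
            z = random.choices([1, 2], weights=topic_weights)[0]
            words.append(random.choice(topics[z]))
        else:                   # syntactic class: emit a function word
            vocab, w = syntax_emissions[x]
            words.append(random.choices(vocab, weights=w)[0])
        x = random.choice(transition[x])
    return " ".join(words)

print(generate_sentence(4, topic_weights=[0.5, 0.5]))
```

Sentences like "THE LOVE OF RESEARCH" arise when the class sequence is determiner, topic word, function word, topic word; only the topic-word slots carry document-specific semantics.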
Semantic categories FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY
MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW
361
Syntactic categories BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP
KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN
362
Statistical language modeling
Generative models provide transparent assumptions about causal process opportunities to combine and extend models Richer generative models... probabilistic context-free grammars paragraph or sentence-level dependencies more complex semantics
363
Structure and statistics
Statistical language modeling topic models Relational categorization attributes and relations
364
Relational categorization
Most approaches to categorization in psychology and machine learning focus on attributes: properties of objects (e.g., words in the titles of CogSci posters). But a significant portion of knowledge is organized in terms of relations: co-authors on posters, who talks to whom. (Kemp, Griffiths, & Tenenbaum, 2004)
365
Attributes and relations
Data and models: attributes X (objects × features) are modeled with a mixture model (cf. Anderson, 1990): P(X) = Π_i Σ_{z_i} P(z_i) Π_k P(x_ik | z_i). Relations Y (objects × objects) are modeled with a stochastic blockmodel: P(Y) = Σ_Z Π_{i,j} P(y_ij | z_i, z_j) Π_i P(z_i).
366
Stochastic blockmodels
For any pair of objects (i, j), the probability of a relation is determined by their classes (z_i, z_j), via a matrix Λ = [λ_ab] of link probabilities from type a to type b; each entity has a type, collected in Z. This allows the types of objects and the class probabilities to be learned from data: P(Z, Λ | Y) ∝ P(Y | Z, Λ) P(Z) P(Λ).
367
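A minimal sketch of the stochastic blockmodel, with invented types and link probabilities (two types, three objects each):

```python
# Stochastic blockmodel: each object i has a type z_i, and the
# probability of a link y_ij depends only on the pair of types, through
# a matrix eta[z_i][z_j].

import math
import random

random.seed(3)

z = [0, 0, 0, 1, 1, 1]            # three objects of each of two types
eta = [[0.9, 0.1],                # within-type links likely,
       [0.1, 0.9]]                # between-type links rare

def sample_relations(z, eta):
    """Sample a binary relation matrix Y from the blockmodel."""
    n = len(z)
    return [[int(random.random() < eta[z[i]][z[j]]) for j in range(n)]
            for i in range(n)]

def log_lik(Y, z, eta):
    """log p(Y | z, eta): Bernoulli likelihood, cell by cell."""
    total = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            p = eta[z[i]][z[j]]
            total += math.log(p if y else 1.0 - p)
    return total

Y = sample_relations(z, eta)
# The true type assignment scores far higher than a shuffled one.
print(log_lik(Y, z, eta), log_lik(Y, [0, 1, 0, 1, 0, 1], eta))
```

Searching over assignments Z (and the entries of eta) to maximize this likelihood, with priors as on the slide, is what recovers the word and actor clusters shown next.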
Stochastic blockmodels
368
Categorizing words. Relational data: word association norms (Nelson, McEvoy, & Schreiber, 1998). A 5018 × 5018 matrix of associations, symmetrized; all words with < 50 and > 10 associates; 2513 nodes, links.
371
Categorizing words BAND INSTRUMENT BLOW HORN FLUTE BRASS GUITAR PIANO
TUBA TRUMPET TIE COAT SHOES ROPE LEATHER SHOE HAT PANTS WEDDING STRING SEW MATERIAL WOOL YARN WEAR TEAR FRAY JEANS COTTON CARPET WASH LIQUID BATHROOM SINK CLEANER STAIN DRAIN DISHES TUB SCRUB
372
Categorizing actors. Internet Movie Database (IMDB) data, from the start of cinema to 1960 (Jeremy Kubica). Relational data: collaboration. A 5000 × 5000 matrix of the most prolific actors; all actors with < 400 and > 1 collaborators; 2275 nodes, links.
375
Categorizing actors Albert Lieven Karel Stepanek Walter Rilla
Anton Walbrook Moore Marriott Laurence Hanray Gus McNaughton Gordon Harker Helen Haye Alfred Goddard Morland Graham Margaret Lockwood Hal Gordon Bromley Davenport Gino Cervi Nadia Gray Enrico Glori Paolo Stoppa Bernardi Nerio Amedeo Nazzari Gina Lollobrigida Aldo Silvani Franco Interlenghi Guido Celano Archie Ricks Helen Gibson Oscar Gahan Buck Moulton Buck Connors Clyde McClary Barney Beasley Buck Morgan Tex Phelps George Sowards Germany UK British comedy Italian US Westerns
376
Structure and statistics
Bayesian approach allows us to specify structured probabilistic models Explore novel representations and domains topics for semantic representation relational categorization Use powerful methods for inference, developed in statistics and machine learning
377
Other methods and tools...
Inference algorithms belief propagation dynamic programming the EM algorithm and variational methods Markov chain Monte Carlo More complex models Dirichlet processes and Bayesian non-parametrics Gaussian processes and kernel methods Reading list at
378
Taking stock
379
Bayesian models of inductive learning
Inductive leaps can be explained with hierarchical Theory-based Bayesian models: a domain theory generates structural hypotheses, which generate data; Bayesian inference inverts this probabilistic generative model.
380
Bayesian models of inductive learning
Inductive leaps can be explained with hierarchical Theory-based Bayesian models: (figure: a theory T generating structures S, each generating data D)
381
Bayesian models of inductive learning
Inductive leaps can be explained with hierarchical Theory-based Bayesian models. What the approach offers: Strong quantitative models of generalization behavior. Flexibility to model different patterns of reasoning in different tasks and domains, using differently structured theories but the same general-purpose Bayesian engine. A framework for explaining why inductive generalization works, and where knowledge comes from as well as how it is used.
382
Bayesian models of inductive learning
Inductive leaps can be explained with hierarchical Theory-based Bayesian models. Challenges: Theories are hard.
383
Bayesian models of inductive learning
Inductive leaps can be explained with hierarchical Theory-based Bayesian models: The interaction between structure and statistics is crucial. How structured knowledge supports statistical learning, by constraining hypothesis spaces. How statistics supports reasoning with and learning structured knowledge. How complex structures can grow from data, rather than being fully specified in advance.