Bayesian approaches to cognitive sciences: word learning, Bayesian property induction, theory-based causal inference.

Word Learning

Some constraints on word learning: 1. Very few examples are required. 2. Learning is possible with only positive examples. 3. Word meanings overlap. 4. Learning is often graded.

Word Learning Given a few instances of a particular word, say 'dog', how do we generalize to new instances? Hypothesis elimination: use deductive logic (along with prior knowledge) to eliminate hypotheses that are inconsistent with the use of the word.

Word Learning Given a few instances of a particular word, say 'dog', how do we generalize to new instances? Connectionist (associative) approach: compute the probability of co-occurrence of object features and the corresponding word.

Word learning Alternative: rational statistical inference with a structured hypothesis space. Suppose you see a Dalmatian and you hear 'fep'. Does 'fep' refer to all dogs or just to Dalmatians? What if you hear 3 more examples, all corresponding to Dalmatians? Then it should be clear that 'fep' means Dalmatians, because the observations would be a 'suspicious coincidence' if 'fep' referred to all dogs. Therefore logic is not enough; you also need probabilities. However, you don't need many examples. And co-occurrence frequencies are not enough either (in our example, 'fep' is associated 100% of the time with Dalmatians whether you see one example or three). We need structured prior knowledge.

Word Learning Suppose objects are organized in taxonomic trees: animals ⊃ dogs ⊃ Dalmatians.

Word learning We're given N examples of a word C. The goal of learning is to determine whether C corresponds to the subordinate, basic, or superordinate level. The level in the taxonomy is what we mean by 'meaning'. h: hypothesis, i.e., a 'word meaning'. The set of possible hypotheses is strongly constrained by the tree structure.

Word Learning H: hypotheses; T: tree structure (animals ⊃ dogs ⊃ Dalmatians).

Word learning Inference just follows Bayes' rule: p(h | x, T) = p(x | h, T) p(h | T) / p(x | T). h: hypothesis, e.g. "is this a 'dog'/basic-level word?"; T: type of representation being assumed (e.g. tree structure); x: data, e.g. a set of labeled images of animals.

Word learning Inference just follows Bayes' rule. Likelihood function: the probability of the data given the hypothesis. Prior: strongly constrained by the tree structure; only some hypotheses are possible (the ones corresponding to the hierarchical levels in the tree).

Word learning Likelihood functions and the 'size principle'. Assume you're given n examples of a particular group (e.g. 3 examples of dogs, or 3 examples of Dalmatians). Then the likelihood of those examples depends on the size of the group:

Word learning Let's assume there are 100 dogs in the world, 10 of them Dalmatians. If examples are drawn randomly with replacement from those pools, we have p(X | h = Dalmatians) = (1/10)^n and p(X | h = dogs) = (1/100)^n.

Word learning More generally, the probability of getting n examples consistent with a particular hypothesis h is given by p(X | h) = (1 / size(h))^n. This is known as the 'size principle': multiple examples drawn from smaller sets are more likely.
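The size principle is easy to check numerically. Below is a minimal Python sketch: the sizes 10 and 100 for 'dalmatian' and 'dog' come from the slides, while the 'animal' set size of 1000 and the flat prior over the three levels are illustrative assumptions.

```python
# Hedged sketch of the size principle: p(X | h) = (1 / size(h)) ** n
# for n examples drawn with replacement from hypothesis h.
# Sizes for 'dalmatian' (10) and 'dog' (100) come from the slides;
# the 'animal' size of 1000 is an illustrative assumption.
SIZES = {"dalmatian": 10, "dog": 100, "animal": 1000}

def posterior(n_examples, sizes=SIZES):
    """Posterior over taxonomy levels after n examples that are all
    consistent with every level (e.g. n Dalmatians), flat prior."""
    likelihood = {h: (1.0 / s) ** n_examples for h, s in sizes.items()}
    z = sum(likelihood.values())
    return {h: l / z for h, l in likelihood.items()}

p1 = posterior(1)  # one Dalmatian: subordinate level favored, but not certain
p4 = posterior(4)  # four Dalmatians: subordinate level with near certainty
```

With one Dalmatian example the subordinate level already gets about 90% of the posterior; with four examples it is essentially certain. This is the 'suspicious coincidence' effect.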

Word learning Let's say you're given 1 example of a Dalmatian. Conclusion: it's very likely to be a subordinate word.

Word learning Let's say you're given 4 examples, all Dalmatians. Conclusion: it's a subordinate word ('dalmatian') with near certainty, or else it's a very 'suspicious coincidence'!

Word learning Let's say you're given 5 examples, 2 Dalmatians and 3 German Shepherds. Conclusion: it's a basic-level word (dog) with near certainty. (There is also some probability that images got mislabeled, assumed to be very small.)

Word Learning Subject is shown one Dalmatian and told it's a 'fep'. Subordinate match: subject is shown a new Dalmatian and asked if it's a 'fep'. Basic match: subject is shown a new dog (non-Dalmatian) and asked whether it's a 'fep'.

Word Learning Subject is shown three Dalmatians and told they are 'feps'. Subordinate match: subject is shown a new Dalmatian and asked if it's a 'fep'. Basic match: subject is shown a new dog (non-Dalmatian) and asked whether it's a 'fep'. As more subordinate examples are collected, the probabilities for the basic and superordinate levels go down.

Word Learning As more basic-level examples are collected, the probability for the basic level goes up. With only one example, the subordinate level is favored.

Word Learning The model produces similar behavior.

Bayesian property induction

Given that gorillas and chimpanzees have gene X, do macaques have gene X? If cheetahs and giraffes carry disease X, do polar bears carry disease X? Classic approach: Boolean logic. Problem: such questions are inherently probabilistic; answering yes or no would be very misleading. Fuzzy logic?

Bayesian property induction C: concept (e.g. 'mammals can get disease X'), i.e., a set of animals defined by a particular property. H: hypothesis space, the space of all possible concepts, i.e., all possible sets of animals; with 10 animals, H contains 2^10 sets. h: a particular set; note that there is an h for which h = C. y: a particular statement ('dolphins can get disease X'), that is, a subset of some hypotheses. X: a set of observations drawn from the concept C.

Bayesian property induction The goal of inference is to determine whether y belongs to a concept C for which we have samples X. E.g., given that gorillas and chimpanzees have gene X, do macaques have gene X? X = {gorillas, chimpanzees}, y = {macaques}, C = the set of all animals with gene X. Note that we don't know the full list of animals in set C; C is a hidden (or latent) variable, so we need to integrate it out.

Bayesian property induction Animals = {buffalo, zebra, giraffe, seal} = {b, z, g, s}. X = {b, z} have the property; y = g: do giraffes have the property? p(y ∈ h | h) is the probability that g has the property given that h contains g; it must equal 1. p(y | X) is the probability that g has the property given that b and z have it.

Bayesian property induction More formally: p(y ∈ C | X) = Σ_h p(y ∈ h | h) p(h | X). Warning: p(X | h) is the probability that X will be observed given h; it is not the probability that members of X belong to h.

Bayesian property induction The likelihood function. Animals = {buffalo, zebra, giraffe, seal} = {b, z, g, s}. If h = {b, z}: p(X = b | h) = 0.5, p(X = {b, z} | h) = 0.5 × 0.5 = 0.25, and p(X = g | h) = 0; this rules out all hypotheses that do not contain all of the observations. If h = {b, z, g}: p(X = {b, z} | h) = (1/3)² ≈ 0.11. The larger the set, the smaller the likelihood: Occam's razor.

Bayesian property induction The likelihood function is nonzero only if X is a subset of h. Note that sets that contain none or only some, but not all, of the elements of X are ruled out. Also, data are less likely to come from large sets (Occam's razor?).
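The likelihood and the resulting posterior predictive can be sketched in Python for the four-animal example. This is a hedged illustration using a flat prior over all 16 subsets, not the taxonomic or evolutionary priors discussed below.

```python
from itertools import combinations

ANIMALS = ["buffalo", "zebra", "giraffe", "seal"]

# Hypothesis space: all 2^4 = 16 subsets of the animals.
HYPOTHESES = [set(c) for r in range(len(ANIMALS) + 1)
              for c in combinations(ANIMALS, r)]

def p_has_property(y, X, hypotheses=HYPOTHESES):
    """p(y has the property | examples X), flat prior over hypotheses,
    size-principle likelihood p(X | h) = (1/|h|)^n when X is a subset of h."""
    num = den = 0.0
    for h in hypotheses:
        if not set(X) <= h:
            continue  # hypotheses missing an observed example have likelihood 0
        like = (1.0 / len(h)) ** len(X)
        den += like
        if y in h:
            num += like
    return num / den

p = p_has_property("giraffe", ["buffalo", "zebra"])  # 25/77, about 0.32
```

Every hypothesis inconsistent with X drops out, and among the rest the smaller sets carry more weight — exactly the size principle.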

Bayesian property induction The prior: the prior should embed knowledge of the domain. Naive approach: if we have 10 animals and we consider all possible sets, we end up with 2^10 = 1024 sets. A flat prior over these would yield a prior of 1/1024 for each set.

Bayesian property induction Taxonomic prior. Note: only the 19 sets corresponding to nodes of the taxonomic tree have nonzero prior; the remaining 1024 − 19 = 1005 sets have a prior of zero. A HUGE constraint!

The taxonomic prior is not enough. Why? 1. Seals and squirrels catch disease X, so horses are also susceptible. 2. Seals and cows catch disease X, so horses are also susceptible. Most people say that statement 2 is stronger.

Bayesian property induction 1. Seals and squirrels catch disease X, so horses are also susceptible. 2. Seals and cows catch disease X, so horses are also susceptible. In each case, the only hypothesis in the taxonomy that can contain all three animals is the same.

Bayesian property induction This approach therefore does not distinguish the two statements: the only hypothesis that can contain all three animals is the same in both cases.

Bayesian property induction Evolutionary Bayes

Bayesian property induction Evolutionary Bayes: all 1024 sets of animals are possible, but they differ in their prior. Sets that contain animals that are nearby in the tree are more likely.

Bayesian property induction Comparison (figures): a likely set can be generated in two ways on the tree, giving prior p + p²; an unlikely set can be generated in only one way, giving prior p².

Bayesian property induction 1. Seals and squirrels catch disease X, so horses are also susceptible. 2. Seals and cows catch disease X, so horses are also susceptible. Under the evolutionary prior, there are more scenarios compatible with the second statement.

Bayesian property induction The evolutionary prior explains the data better than any other model (but not by much…).

Other priors

How can we learn the structure of the prior? Syntactic rules for growing graphs.

Theory-based causal inference

Can we use this framework to infer causality?

Theory-based causal inference The 'blicket' detector activates when a 'blicket' is placed on it. Observation 1: B1 and B2 together: detector on. Most kids say that B1 and B2 are blickets. Observation 2: B1 alone: detector on. Now all kids say B1 is a blicket but B2 is not. This is known as extinction or 'explaining away'.

Theory-based causal inference This is impossible to capture with the usual learning algorithms because there aren't enough trials to learn all the probabilities involved. Simple reasoning could be used, along with Occam's razor (e.g., B1 alone is enough to explain all the data), but it's hard to formalize (how do we define Occam's razor?).

Theory-based causal inference Alternative: assume the data were generated by a causal process. We observed two trials, d1 = {e=1, x1=1, x2=1} and d2 = {e=1, x1=1, x2=0}. What kind of Bayesian net can account for these data? There are only four possible networks: no links (h00), x1→e only (h10), x2→e only (h01), or both links (h11).

Theory-based causal inference Which network is most likely to explain the data? Bayesian approach: compute the posterior over networks given the data. If we assume that the probability of any object being a blicket is π, the prior over Bayesian nets is given by p(h00) = (1−π)², p(h10) = p(h01) = π(1−π), p(h11) = π².

Theory-based causal inference Let's consider what happens after we observe d1 = {e=1, x1=1, x2=1}. If we assume the machine does not go off by itself (p(e=1 | no blicket present) = 0), we have p(d1 | h00) = 0.

Theory-based causal inference Let's consider what happens after we observe d1 = {e=1, x1=1, x2=1}. For the other nets: p(d1 | h10) = p(d1 | h01) = p(d1 | h11) = 1.

Theory-based causal inference Therefore, we're left with three networks, and for each of them the posterior is proportional to the prior. Assuming blickets are rare (π < 0.5), the most likely explanations are the ones in which only one object is a blicket (h10 and h01). Therefore, object 1 or object 2 is a blicket (but it's unlikely that both are blickets!).
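The slide's ordering claim can be written out. Assuming the deterministic detector and writing the prior as p(h_{b1 b2}) = π^{b1+b2} (1−π)^{2−b1−b2} (my notation for which objects are blickets):

```latex
p(h \mid d_1) \propto p(d_1 \mid h)\, p(h)
\quad\Rightarrow\quad
p(h_{10} \mid d_1) : p(h_{01} \mid d_1) : p(h_{11} \mid d_1)
  = \pi(1-\pi) : \pi(1-\pi) : \pi^2 .
```

For π < 1/2 we have π(1−π) > π², so each single-blicket network is individually more probable than h11.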

Theory-based causal inference We now observe d2 = {e=1, x1=1, x2=0}. Again, assuming the machine does not go off when no blicket is present, we have p(d2 | h01) = 0.

Theory-based causal inference We observed two trials, d1 = {e=1, x1=1, x2=1} and d2 = {e=1, x1=1, x2=0}. h00 and h01 are inconsistent with these data, so we're left with the other two.

Theory-based causal inference We observed two trials, d1 = {e=1, x1=1, x2=1} and d2 = {e=1, x1=1, x2=0}, and we're left with two networks (h10 and h11). Assuming blickets are rare (π < 0.5), the network in which only one object is a blicket (h10) is the most likely explanation.

Theory-based causal inference But what happens to the probability that X2 is a blicket? To compute this we need to sum over networks: p(x2→e | d) = Σ_{jk} p(x2→e | h_jk) p(h_jk | d). Either the link x2→e is in network h_jk or it's not; the hypotheses for which this link exists are h01 and h11.

Theory-based causal inference But what happens to the probability that X2 is a blicket? After the first observation, p(x2→e | d1) = (π² + π(1−π)) / (π² + 2π(1−π)) = 1/(2−π) = 3/5. (The kids' data suggest that π = 1/3.)

Theory-based causal inference Probability that X2 is a blicket after the second observation: p(x2→e | d1, d2) = π² / (π² + π(1−π)) = π = 1/3. Therefore the probability that X2 is a blicket went down after the second observation (from 3/5 to 1/3), which is consistent with kids' reports. Occam's razor comes from assuming π < 0.5.
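The whole blicket computation fits in a short Python sketch (deterministic detector and prior π = 1/3 as in the slides; all variable names are mine):

```python
from itertools import product

PI = 1.0 / 3.0  # prior probability that any single object is a blicket

def likelihood(data, h):
    """p(data | h) for a deterministic detector: e = 1 iff a present object is a blicket."""
    b1, b2 = h
    out = 1.0
    for e, x1, x2 in data:
        predicted = 1 if (b1 and x1) or (b2 and x2) else 0
        out *= 1.0 if predicted == e else 0.0
    return out

def prior(h):
    """p(h): each object is independently a blicket with probability PI."""
    b1, b2 = h
    return (PI if b1 else 1.0 - PI) * (PI if b2 else 1.0 - PI)

def p_x2_blicket(data):
    """Posterior probability that object 2 is a blicket, summing over all 4 nets."""
    posts = {h: prior(h) * likelihood(data, h) for h in product([0, 1], repeat=2)}
    z = sum(posts.values())
    return sum(p for (b1, b2), p in posts.items() if b2) / z

d1 = [(1, 1, 1)]              # both objects placed, detector fires
d2 = [(1, 1, 1), (1, 1, 0)]   # then object 1 alone, detector fires

p_after_1 = p_x2_blicket(d1)  # 3/5
p_after_2 = p_x2_blicket(d2)  # 1/3
```

The posterior for X2 drops from 3/5 to 1/3 between the two observations, reproducing the 'explaining away' pattern in the kids' judgments.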

Theory-based causal inference This approach can be generalized to much more complicated generative models.