1 Graphical Models for Psychological Categorization David Danks Carnegie Mellon University; and Institute for Human & Machine Cognition

2 A Puzzle
Concepts & causation are intertwined:
- Concepts and categorization depend (in part) on causal beliefs and inferences
- Causal learning and reasoning depend (in part) on the particular concepts we have
But the most prevalent theories in the two fields use quite different formalisms.
Q: Can categorization and causal inference be represented in a common "language"?

3 Central Theoretical Claim Many psychological theories of categorization are equivalent to (special cases of) Bayesian categorization of probabilistic graphical models (and so the answer to the previous question is “Yes” – they can share the language of graphical models)

4 Overview
- Bayesian Categorization of Probabilistic Graphical Models (PGMs)
- Psychological Theories of Categorization
- Theoretical & Experimental Implications

5 Bayesian Categorization
- Set of exclusive, exhaustive models M
  - For each model m, a prior probability P(m) and a distribution P(X | m) (perhaps a generative model)
- Given X, update the model probabilities using Bayes' rule:
  P(m | X) = P(X | m) P(m) / Σ_m' P(X | m') P(m')
  (and use the updated probabilities for choices)
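A minimal sketch of this updating step in Python; the two models, their names, and all of the numbers are illustrative placeholders, not taken from the talk:

```python
# Minimal sketch of Bayesian categorization: given observed features X, compute
# posterior model probabilities from priors and likelihoods via Bayes' rule.
def update_models(priors, likelihoods, x):
    """priors: {model: P(m)}; likelihoods: {model: function x -> P(x | m)}."""
    unnormalized = {m: priors[m] * likelihoods[m](x) for m in priors}
    total = sum(unnormalized.values())
    return {m: p / total for m, p in unnormalized.items()}

# Two toy models over a single binary feature.
priors = {"A": 0.5, "B": 0.5}
likelihoods = {
    "A": lambda x: 0.9 if x == 1 else 0.1,  # under model A the feature is usually present
    "B": lambda x: 0.2 if x == 1 else 0.8,  # under model B it is usually absent
}
print(update_models(priors, likelihoods, 1))  # posterior shifts toward model A
```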

6 Probabilistic Graphical Models
PGMs were developed to provide compact representations of probability distributions.
All PGMs are defined for a set of variables V, and are composed of:
- A graph over (nodes corresponding to) V
- A probability distribution/density over V

7 Probabilistic Graphical Models
- Markov assumption: the graph entails certain (conditional and unconditional) independence constraints on the probability distribution
  - Markov assumptions imply a decomposition of the probability distribution into a product of simpler terms (i.e., fewer parameters)
- Different PGM-types have different graph-types and/or Markov assumptions

8 Probabilistic Graphical Models
- Also assume Faithfulness/Stability: the only probabilistic independencies are those implied by the Markov assumption
  - If we do not assume this, then every probability distribution can be represented by every PGM-type
  - Faithfulness is assumed explicitly or implicitly by all PGM learning algorithms
- Def'n: A graph is a perfect map iff it is Markov & Faithful to the probability distribution

9 Probabilistic Graphical Models
For a particular PGM-type, the set of probability distributions with a perfect map in that PGM-type forms a natural group:
- This set will almost always be non-exhaustive
- Shorthand: "probability distribution for a PGM" will mean "probability distribution for which there is a perfect map in the PGM-type"

10 Bayesian Networks
- Directed acyclic graph
- Markov: each variable in V is independent of its non-descendants conditional on its parents
- Example: P(F1, F2, F3, F4) = P(F1) · P(F2) · P(F4 | F1, F2) · P(F3 | F4)
  [Figure: DAG over F1, F2, F3, F4 with edges F1 → F4, F2 → F4, and F4 → F3]
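A sketch of this factorization in code; the graph (F1 → F4 ← F2, F4 → F3) is read off the slide's factorization, but the conditional probability tables are made-up numbers:

```python
from itertools import product

# Joint over binary F1..F4 under the factorization
# P(F1, F2, F3, F4) = P(F1) * P(F2) * P(F4 | F1, F2) * P(F3 | F4).
P_F1 = {1: 0.3, 0: 0.7}                                       # P(F1 = 1 / 0)
P_F2 = {1: 0.6, 0: 0.4}                                       # P(F2 = 1 / 0)
P_F4 = {(1, 1): 0.9, (1, 0): 0.5, (0, 1): 0.4, (0, 0): 0.1}   # P(F4 = 1 | F1, F2)
P_F3 = {1: 0.8, 0: 0.2}                                       # P(F3 = 1 | F4)

def joint(f1, f2, f3, f4):
    p4 = P_F4[(f1, f2)] if f4 == 1 else 1 - P_F4[(f1, f2)]
    p3 = P_F3[f4] if f3 == 1 else 1 - P_F3[f4]
    return P_F1[f1] * P_F2[f2] * p4 * p3

# The factored joint is a proper distribution: it sums to 1 over all 16 states.
assert abs(sum(joint(*v) for v in product((0, 1), repeat=4)) - 1.0) < 1e-12
```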

11 Markov Random Fields
- Undirected graph (i.e., no arrowheads)
- Markov: each variable in V is independent of its non-neighbors conditional on its neighbors
- Example: P(F1, F2, F3, F4) = P(F1, F4) · P(F2, F4) · P(F3, F4)
  [Figure: undirected graph over F1, F2, F3, F4 with edges F1–F4, F2–F4, and F3–F4]
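A sketch of the same structure in code. The slide writes the factors as pairwise marginals; the version below uses the general Markov random field form — generic nonnegative potentials with an explicit normalizing constant Z — and illustrative factor values:

```python
from itertools import product

# MRF over binary F1..F4 with edges F1-F4, F2-F4, F3-F4:
# the joint is a normalized product of one pairwise factor per edge.
phi14 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi24 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}
phi34 = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def score(f1, f2, f3, f4):
    return phi14[(f1, f4)] * phi24[(f2, f4)] * phi34[(f3, f4)]

Z = sum(score(*v) for v in product((0, 1), repeat=4))   # normalizing constant

def joint(f1, f2, f3, f4):
    return score(f1, f2, f3, f4) / Z

print(joint(0, 0, 0, 0))   # probability of the all-zeros state
```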

12 Bayesian Categorization of PGMs
- Use the standard updating equation (Bayes' rule, as above), and require the P(X | a) distributions to be distributions for that PGM-type
  - I.e., the PGM-type supplies the generative model

13 Simple Example
Suppose we have two equiprobable models:
- Left: no edge between F1 and F2, with P(F1 = 1) = 0.1 and P(F2 = 1) = 0.2
- Right: F1 → F2, with P(F1 = 1) = 0.8, P(F2 = 1 | F1 = 1) = 0.8, and P(F2 = 1 | F1 = 0) = 0.6

14 Simple Example
Same two equiprobable models as on the previous slide.
- Observe 11, and conclude Right: P(Left | 11) = 0.03 << P(Right | 11) = 0.97

15 Simple Example
Same two equiprobable models.
- Observe 11, and conclude Right: P(Left | 11) = 0.03 << P(Right | 11) = 0.97
- Observe 00, and conclude Left: P(Left | 00) = 0.90 >> P(Right | 00) = 0.10
- and so on…

16 Overview
- Bayesian Categorization of Probabilistic Graphical Models (PGMs)
- Psychological Theories of Categorization
- Theoretical & Experimental Implications

17 Psychological Theories
- All assume a fixed set of input features
  - Usually binary-, sometimes continuous-valued
- For the purposes of this talk, I will focus on static theories of categorization
  - That is, focus on the categories that are learned, as opposed to the learning process itself
  - The learning processes can also be captured/explained in this framework

18 Shared Theoretical Structure
For many psychological theories, categorization of a novel instance involves:
- For each category under consideration, determine the similarity (according to a specific metric) between the category and the novel instance
- Then use the category similarities to generate a response probability for each category
  - Alternately, use a deterministic choice rule but assume noise in the perceptual system (e.g., Ashby)

19 Shared Theoretical Structure
In this high-level picture:
- We get different categorization theories by having (i) different classes of similarity metrics, and/or (ii) different response rules
- Within a particular theory, different particular categories result from different actual similarity metrics (i.e., different parameter values)

20 Unconsidered Theories
- Not every categorization theory has this particular high-level structure
  - In particular, arbitrary neural network models don't
- For practical reasons, I will focus on models with analytically defined similarity metrics
  - Excludes models such as RULEX & SUSTAIN that can only be investigated with simulations
- Finally, I won't explore obvious connections with Anderson's rational analysis model

21 Returning to the High-Level Picture…
Step 2: "Use the category similarities to generate a response probability"
The most common second-stage rule is the weighted Luce-Shepard rule:
  P(respond a | X) = b_a · Sim(a, X) / Σ_a' b_a' · Sim(a', X)
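A sketch of this response rule in code, using the ingredients described on the next slide (b_a weights and category similarities); a response-scaling exponent that some versions of the rule include is omitted, and the values are illustrative:

```python
# Weighted Luce-Shepard response rule: the probability of responding with
# category a is its weighted similarity, normalized over all candidate categories.
def luce_shepard(similarities, weights):
    """similarities: {category: Sim(a, X)}; weights: {category: b_a}."""
    scores = {a: weights[a] * similarities[a] for a in similarities}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

# Category A is twice as similar to the instance as B; equal weights.
print(luce_shepard({"A": 0.6, "B": 0.3}, {"A": 1.0, "B": 1.0}))  # A gets 2/3
```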

22 Luce-Shepard & Bayesian Updating
L-S is equivalent to Bayesian updating if, for each a, Sim(a, X) is a probability distribution:
- Sim(a, X) represents P(X | m)
- The (normalized) b_a weights represent base rates
Note: Unweighted L-S corresponds to equal base rates for the categories

23 Similarities as Probabilities
When do similarities represent probabilities? The answer turns out to be "Always":
- Similarity metrics are defined for arbitrary combinations of category features
- So, from the point of view of response probabilities, we can renormalize any similarity metric to produce a probability distribution (see also Myung, 1994; Ashby & Alfonso-Reese, 1995; and Rosseel, 2002)
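A sketch of that renormalization move: for a fixed category a, divide Sim(a, X) by its sum over every possible feature combination, which turns the similarity metric into a distribution P(X | a). The particular similarity function below (closeness to the pattern (1, 1)) is illustrative:

```python
from itertools import product

def normalize_similarity(sim, n_features):
    """Turn a similarity function over binary feature tuples into P(X | a)."""
    total = sum(sim(x) for x in product((0, 1), repeat=n_features))
    return lambda x: sim(x) / total

sim_a = lambda x: 2.0 ** -(abs(x[0] - 1) + abs(x[1] - 1))  # decays with distance from (1, 1)
p_x_given_a = normalize_similarity(sim_a, 2)
print(p_x_given_a((1, 1)))   # the pattern itself gets the largest share (~0.44)
```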

24 Categorization as Bayesian Updating
All psychological theories of categorization with this high-level structure are special cases of Bayesian updating:
- "Special cases" because they restrict the possible similarities (and so probability distributions)
- Note: I focused on weighted L-S, but similar conclusions can be drawn for other response probability rules
Common thread: treat similarities as probabilities (perhaps because of noise in the perceptual system)

25 Psychological Categorization & PGMs
Claim: For each psychological theory, [class of similarity metrics] is equivalent to [probability distributions for (sub-classes of) a PGM-type]
Three examples:
- Causal Model Theory
- Exemplar-based models (specifically, GCM)
- Prototype-based models (first- and second-order)

26 Causal Model Theory
Causal Model Theory:
- Categories are defined by causal structures, represented as arbitrary causal Bayes nets
- Similarity of an instance to a category is explicitly: Sim(m, X) = P(X | m), where m is a Bayesian network

27 Causal Model Theory
CMT categorization (with weighted L-S) is equivalent to Bayesian updating with arbitrary Bayes nets as the generating PGMs
- Varying weights in the L-S rule correspond to different category base rates

28 Exemplar-Based Models
Generalized Context Model:
- Categories are defined by a set of exemplars E_j
  - Exemplars are actually observed category instances

29 Exemplar-Based Models
Generalized Context Model:
- Categories are defined by a set of exemplars E_j
  - Exemplars are actually observed category instances
- Similarity is the (weighted) average (exponential of) distance between the instance and the exemplars
  - Multiple distance metrics are used (e.g., weighted city-block)
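A sketch of a GCM-style similarity computation with a weighted city-block distance and exponential decay; the exemplars, weights, and sensitivity parameter c are illustrative, not from the talk:

```python
import math

# GCM-style similarity: a weighted average of exp(-c * distance) over the
# category's stored exemplars, using a weighted city-block distance.
def gcm_similarity(x, exemplars, exemplar_weights, feature_weights, c=1.0):
    total = 0.0
    for e, w_e in zip(exemplars, exemplar_weights):
        dist = sum(w_f * abs(xi - ei) for w_f, xi, ei in zip(feature_weights, x, e))
        total += w_e * math.exp(-c * dist)
    return total

exemplars = [(1, 1, 0), (1, 0, 0), (1, 1, 1)]        # observed category instances
print(gcm_similarity((1, 1, 0), exemplars, [1/3, 1/3, 1/3], [1.0, 1.0, 1.0], c=2.0))
```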

30 Exemplar-Based Models
There is an equivalence between:
- GCM-similarity functions; and
- Probability distributions for Bayes nets with the graph below, plus a regularity constraint on the distribution terms
  [Figure: Bayes net with an unobserved node E pointing to each of F1, F2, …, Fn]

31 Exemplar-Based Models
GCM categorization (with weighted L-S) is equivalent to Bayesian updating with fixed-structure Bayes nets (+ constraint) as the generating PGMs

32 Prototype-Based Models
First-order Multiplicative Prototype Model:
- Categories are defined by a prototypical instance Q
  - The prototype need not be actually observed

33 Prototype-Based Models
First-order Multiplicative Prototype Model:
- Categories are defined by a prototypical instance Q
  - The prototype need not be actually observed
- Similarity is the (weighted exponential of the) distance between the instance and the prototype
  - Again, different distance metrics can be used
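A sketch of the first-order multiplicative prototype model, written in the equivalent exponential-of-weighted-distance form; the prototype, feature weights, and decay parameter c are illustrative:

```python
import math

# Similarity of an instance to a category decays exponentially with its
# weighted distance from the category's single prototype Q.
def prototype_similarity(x, prototype, feature_weights, c=1.0):
    dist = sum(w * abs(xi - qi) for w, xi, qi in zip(feature_weights, x, prototype))
    return math.exp(-c * dist)

Q = (1, 1, 0, 1)   # prototype; need not be an actually observed instance
print(prototype_similarity((1, 0, 0, 1), Q, [1.0, 0.5, 1.0, 1.0]))  # exp(-0.5)
```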

34 Prototype-Based Models
There is an equivalence between:
- FOMPM-similarity functions; and
- Probability distributions for empty-graph Markov random fields (and a regularity constraint on the distribution terms)
Note: The "no-edge Markov random field" probability distributions are identical with the "no-edge Bayes net" probability distributions

35 Prototype-Based Models
- First-order models fail to capture the intuition of "prototype as summary of observations"
  - Inter-feature correlations cannot be captured
- Second-order models add interaction terms:
  - Define features F_ij whose value depends on the state of F_i and F_j
  - Assume the similarity function is still factorizable into feature-based terms
    - A non-trivial assumption, but not particularly restrictive
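A sketch of one possible second-order feature definition for binary inputs; the "agreement" definition below is an illustrative choice of mine, since the talk leaves the exact definition of F_ij open:

```python
# Second-order interaction features: F_ij indicates whether F_i and F_j take
# the same value, so a similarity function can remain a product of per-feature
# terms while still registering inter-feature correlation.
def second_order_features(x):
    n = len(x)
    return {(i, j): int(x[i] == x[j]) for i in range(n) for j in range(i + 1, n)}

print(second_order_features((1, 0, 1)))   # {(0, 1): 0, (0, 2): 1, (1, 2): 0}
```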

36 Prototype-Based Models
There is an equivalence between:
- SOMPM-similarity functions; and
- Probability distributions for arbitrary-graph Markov random fields (and a regularity constraint on the distribution terms)
The constraint details are highly dependent on the exact second-order feature definition and the similarity metric

37 Prototype-Based Models
First-order prototype-based categorization (with weighted L-S) is equivalent to Bayesian updating with no-edge Markov random fields (+ constraint) as the generating PGMs
- And second-order prototypes are equivalent to Bayesian updating with arbitrary-graph Markov random fields

38 Summary of Theoretical Results
Many psychological theories of categorization are equivalent to Bayesian updating, assuming a particular generative model-type. Significant instances:
- CMT ↔ Arbitrary-graph Bayes nets
- GCM ↔ Fixed-graph Bayes net (+ constraint)
- Prototype ↔ Empty- or arbitrary-graph Markov random field (+ constraint)

39 Overview
- Bayesian Categorization of Probabilistic Graphical Models (PGMs)
- Psychological Theories of Categorization
- Theoretical & Experimental Implications

40 Common Representational Language
A common representational language for:
- Many psychological theories of concepts and categorization; and
- Psychological theories of causal inference and belief based on Bayes nets
This shared language arguably facilitates the development of a unified theory of the psychological domains
- Unfortunately, just a promissory note right now

41 Multiple Categorization Systems
Several recent papers have argued (roughly):
- Each psychological theory is empirically superior for some problems in some domains
- ⇒ There must be multiple categorization systems (corresponding to the different theories)

42 Multiple Categorization Systems
Bayes nets and Markov random fields are special cases of chain graphs – PGMs with directed and undirected edges
- So we can model each categorization theory as a special case of Bayesian updating on a chain graph

43 Multiple Categorization Systems
If all categorization is Bayesian updating on chain graphs, then we have one cognitive system with many different possible "parameters" (i.e., generative models)
- Note: This possibility does not show that the "multiple systems" view is wrong, but it does blunt the inference from multiple confirmed theories

44 Concepts as Chain Graphs How can we test “concepts as chain graphs”?

45 Concepts as Chain Graphs
How can we test "concepts as chain graphs"?
- Use a probability distribution for chain graphs with no Bayes net or Markov random field perfect map
- Example: [Figure: a chain graph over F1, F2, F3, F4 with a mix of directed and undirected edges]

46 Concepts as Chain Graphs
How can we test "concepts as chain graphs"?
- Use a probability distribution for chain graphs with no Bayes net or Markov random field perfect map
- Example: [Figure: the same chain graph over F1, F2, F3, F4]
- Experimental question: How accurately can people learn categories based on this graph?

47 Expanded Equivalence Results
- These results extend known equivalencies to include (i) causal model theory; and (ii) second-order prototype models
- These various theoretical equivalencies can guide experimental design
  - Use them to determine whether a particular category structure can be equally well-modeled by multiple psychological theories

48 Expanded Equivalence Results
Bayes nets and Markov random fields represent overlapping sets of distributions
- Specifically, Bayes nets with no colliders are equivalent to Markov random fields with no cycles

49 Expanded Equivalence Results
Bayes nets and Markov random fields represent overlapping sets of distributions
- Specifically, Bayes nets with no colliders are equivalent to Markov random fields with no cycles
- [Figure: a four-node graph over F1, F2, F3, F4 – equal CMT & SOMPM model fits for this concept]

50 Expanded Equivalence Results
Bayes nets and Markov random fields represent overlapping sets of distributions
- Specifically, Bayes nets with no colliders are equivalent to Markov random fields with no cycles
- [Figure: a four-node graph over F1, F2, F3, F4 – equal CMT & SOMPM model fits for this concept]
- [Figure: a second four-node graph over F1, F2, F3, F4 – different CMT & SOMPM model fits for this concept]

51 Novel Suggested Theories
Recall that the PGMs for both the GCM and SOMPM have additional constraints
- These constraints have a relatively natural computational motivation
Idea: Investigate generalized versions of the psychological theories
- E.g., do we get significantly better model fits? How accurately do people learn concepts that violate the regularity constraints? And so on…

52 Conclusion Many psychological theories of categorization are equivalent to (special cases of) Bayesian categorization of probabilistic graphical models (and those equivalencies have implications for both (a) theory development & testing, and (b) experimental design & practice)


54 Appendix: GCM & Bayes Nets
Example of the regularity constraint:
- City-block distance metric, continuous features: for each F_i, each P(F_i | E = j) is a Laplace (double exponential) distribution with the same scale parameter, and possibly distinct means
- E (in the Bayes net) has as many values as there are exemplars (in the category)
  - P(E = j) is the exemplar weight
- In the limit of infinite exemplars, we can represent arbitrary probability distributions
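A sketch of this construction in code: P(X | category) is a mixture over the latent exemplar node E, with each feature given E = j Laplace-distributed around that exemplar's value and sharing one scale parameter. The exemplars, weights, and scale value are illustrative:

```python
import math

def laplace_pdf(x, mean, scale):
    """Density of a Laplace (double exponential) distribution."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)

def p_x_given_category(x, exemplars, exemplar_weights, scale):
    total = 0.0
    for e, w in zip(exemplars, exemplar_weights):   # P(E = j) = exemplar weight
        lik = 1.0
        for xi, ei in zip(x, e):                    # features independent given E
            lik *= laplace_pdf(xi, ei, scale)
        total += w * lik
    return total

exemplars = [(0.2, 1.3), (0.4, 0.9)]                # two continuous-valued exemplars
print(p_x_given_category((0.3, 1.1), exemplars, [0.5, 0.5], scale=0.5))
```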

