How should we represent visual scenes? Common-Sense Core, Probabilistic Programs Josh Tenenbaum MIT Brain and Cognitive Sciences CSAIL Joint work with Noah Goodman, Chris Baker, Rebecca Saxe, Tomer Ullman, Peter Battaglia, Jess Hamrick and others.
Core of common-sense reasoning Human thought is structured around a basic understanding of physical objects, intentional agents, and their relations. “Core knowledge” (Spelke, Carey, Leslie, Baillargeon, Gergely…) Intuitive theories (Carey, Gopnik, Wellman, Gelman, Gentner, Forbus, McCloskey…) Primitives of lexical semantics (Pinker, Jackendoff, Talmy, Pustejovsky) Visual scene understanding (Everyone here…) The key questions: (1) What is the form and content of human common-sense theories of the physical world, intentional agents, and their interaction? (2) How are these theories used to parse visual experience into representations that support reasoning, planning, communication? From scenes to stories…
A developmental perspective A 3-year-old and her dad: Dad: “What's this a picture of?” Sarah: “A bear hugging a panda bear.”... Dad: “What is the second panda bear doing?” Sarah: “It's trying to hug the bear.” Dad: “What about the third bear?” Sarah: “It’s walking away.” But this feels too hard to approach now, so what about looking at younger children (e.g. 12 months or younger)?
Common sense in infancy 1980s-90s: Wynn, Spelke, Baillargeon,…
Heider and Simmel, 1944 Southgate and Csibra, 2009 (13 month olds) Intuitive physics and psychology
Intuitive physics (Whiting et al) (Gupta, Efros, Hebert)
Probabilistic generative models, early 1990s to early 2000s
– Bayesian networks: model the causal processes that give rise to observations; perform reasoning, prediction, planning via probabilistic inference.
– The problem: not sufficiently flexible or expressive.
Scene understanding as an inverse problem The “inverse Pixar” problem: [Diagram: World state (t) → graphics → Image (t)]
Scene understanding as an inverse problem The “inverse Pixar” problem: [Diagram: … → World state (t-1) → World state (t) → World state (t+1) → …, linked by physics; each world state produces Image (t-1), Image (t), Image (t+1) through graphics]
Probabilistic programs
Probabilistic models a la Laplace.
– The world is fundamentally deterministic (described by a program), and perfectly predictable if we could observe all relevant variables.
– Observations are always incomplete or indirect, so we put probability distributions on what we can’t observe.
Compare with Bayesian networks.
– Thick nodes. Programs defined over unbounded sets of objects, their properties, states and relations, rather than traditional finite-dimensional random variables.
– Thick arrows. Programs capture fine-grained causal processes unfolding over space and time, not simply directed statistical dependencies.
– Recursive. Probabilistic programs can be arbitrarily manipulated inside other programs (e.g. perceptual inferences about entities that make perceptual inferences, entities with goals and plans re: other agents’ goals and plans).
Compare with grammars or logic programs.
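To make the "thick nodes" idea concrete, here is a minimal sketch of a probabilistic program in plain Python. It is not code from the talk: the block scene, the prior parameters, and the function names (`sample_scene`, `total_width`, `posterior_num_blocks`) are all illustrative assumptions. The generative program draws an unbounded number of objects, a deterministic function maps the world state to an observation, and rejection sampling stands in for whatever inference method the authors actually use.

```python
import random

def sample_scene(rng):
    """Generative program: an unbounded number of blocks with latent properties.

    The priors (geometric count, uniform width, log-normal mass) are assumptions
    chosen only for illustration.
    """
    blocks = []
    while rng.random() < 0.6:  # geometric prior over the number of objects
        blocks.append({
            "width": rng.uniform(0.5, 2.0),
            "mass": rng.lognormvariate(0.0, 0.5),  # latent, not observed below
        })
    return blocks

def total_width(blocks):
    """Deterministic consequence of the world state (the 'observable')."""
    return sum(b["width"] for b in blocks)

def posterior_num_blocks(observed_width, tol=0.25, n=20000, seed=0):
    """Rejection sampling: keep scenes whose simulated observation matches the data."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        scene = sample_scene(rng)
        if abs(total_width(scene) - observed_width) <= tol:
            counts[len(scene)] = counts.get(len(scene), 0) + 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: v / total for k, v in sorted(counts.items())}

if __name__ == "__main__":
    print(posterior_num_blocks(3.0))
```

The program itself is deterministic once its random choices are fixed; all uncertainty lives in the unobserved variables (how many blocks, how wide each one is), which is the Laplacian reading described above.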
Laplace’s demon
We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.
(Pierre Simon Laplace, A Philosophical Essay on Probabilities)
Probabilistic programs for “inverse Pixar” scene understanding
World state: CAD++
Graphics
– Approximate rendering
   Simple surface primitives
   Rasterization rather than ray tracing (for each primitive, which pixels does it affect?)
   Image features rather than pixels
– Probabilities:
   Image noise, image features
   Unseen objects (e.g., due to occlusion)
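The approximate-rendering step can be sketched as follows. This is an assumption-laden toy, not the renderer used in the work: primitives are axis-aligned rectangles, the "image" is a small grid of intensities, and image noise is modeled as independent Gaussian noise per pixel. The names `rasterize` and `log_likelihood` are hypothetical.

```python
import math

def rasterize(rects, width, height):
    """For each primitive, mark the pixels it affects.

    rects: list of (x0, y0, x1, y1, intensity) with half-open pixel bounds.
    Primitives are drawn in order, so later ones occlude earlier ones.
    """
    img = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1, val in rects:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                img[y][x] = val
    return img

def log_likelihood(observed, rendered, noise_sd=0.1):
    """Gaussian image-noise model: log p(observed image | hypothesized scene)."""
    ll = 0.0
    for row_o, row_r in zip(observed, rendered):
        for o, r in zip(row_o, row_r):
            ll += -0.5 * ((o - r) / noise_sd) ** 2
            ll -= math.log(noise_sd * math.sqrt(2.0 * math.pi))
    return ll

if __name__ == "__main__":
    observed = rasterize([(1, 1, 3, 3, 1.0)], 5, 4)
    for hypothesis in ([(1, 1, 3, 3, 1.0)], [(2, 1, 4, 3, 1.0)]):
        print(hypothesis, log_likelihood(observed, rasterize(hypothesis, 5, 4)))
```

Scoring a hypothesized world state then amounts to rendering it and comparing against the observed image, so a scene that matches the data receives a higher log-likelihood than a shifted one.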
Probabilistic programs for “inverse Pixar” scene understanding
World state: CAD++
Graphics
Physics
– Approximate Newton (physical simulation toolkit, e.g. ODE)
   Collision detection: zone of interaction
   Collision response: transient springs
   Dynamics simulation: only for objects in motion
– Probabilities:
   Latent properties (e.g., mass, friction)
   Latent forces
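A minimal sketch of "approximate Newton with latent properties", assuming a far simpler setting than the slide describes: a single block sliding on a flat surface, integrated with small Euler steps rather than a toolkit such as ODE, with the friction coefficient treated as the latent variable. The function names and the uniform prior on friction are illustrative assumptions.

```python
import random

def slide_distance(v0, mu, g=9.81, dt=0.001):
    """Deterministic simulation: distance a block slides before friction stops it."""
    x, v = 0.0, v0
    while v > 0.0:
        v = max(0.0, v - mu * g * dt)  # friction decelerates the block
        x += v * dt
    return x

def predicted_distance(v0, n_samples=200, seed=0):
    """Average the simulation over samples of the latent friction coefficient.

    The prior Uniform(0.2, 0.8) is an assumption for illustration only.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        mu = rng.uniform(0.2, 0.8)
        total += slide_distance(v0, mu)
    return total / n_samples

if __name__ == "__main__":
    print(slide_distance(2.0, 0.5))   # close to the analytic v0**2 / (2 * mu * g)
    print(predicted_distance(2.0))
```

The simulator is deterministic; probability enters only through the unobserved physical property, which mirrors the split between "Approximate Newton" and "Probabilities" on the slide.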
Modeling stability judgments
[Diagram, built up over several slides: … → World state (t-1) → World state (t) → World state (t+1) → …, linked by physics (then "Prob. approx. Newton"); each world state produces Image (t-1), Image (t), Image (t+1) through graphics (then "Prob. approx. rendering"); the final build adds the annotation "= perceptual uncertainty".]
Modeling stability judgments
Perception: Approximate posterior with block positions normally distributed around ground truth, subject to global stability.
Reasoning: Draw multiple samples from perception. Simulate forward with deterministic approx. Newton (ODE).
Decision: Expectations of various functions evaluated on simulation outputs.
(Hamrick, Battaglia, Tenenbaum, CogSci 2011)
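The perception, reasoning, and decision steps can be sketched as a short Monte Carlo loop. This is a simplified stand-in, not the published model: towers are one-dimensional stacks of equal-mass, equal-width blocks, a static center-of-mass check replaces the ODE forward simulation, and the "subject to global stability" constraint on perception is omitted. All names and parameter values are assumptions.

```python
import random

def fraction_fallen(xs, width=1.0):
    """Static stand-in for forward simulation.

    xs[i] is the horizontal center of block i, with block i resting on block i-1.
    The stack above an interface topples if its center of mass lies outside the
    supporting block; everything above the lowest failing interface falls.
    """
    n = len(xs)
    for i in range(1, n):
        above = xs[i:]
        center_of_mass = sum(above) / len(above)
        if abs(center_of_mass - xs[i - 1]) > width / 2:
            return (n - i) / n
    return 0.0

def expected_fall(true_xs, sigma=0.1, n_samples=2000, seed=0):
    """Perception: Gaussian noise around ground-truth positions.
    Reasoning: run the deterministic stability check on each sample.
    Decision: expected proportion of the tower that falls.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        perceived = [x + rng.gauss(0.0, sigma) for x in true_xs]
        total += fraction_fallen(perceived)
    return total / n_samples

if __name__ == "__main__":
    print(expected_fall([0.0, 0.0, 0.0]))   # well-aligned tower
    print(expected_fall([0.0, 0.4, 0.8]))   # staircase tower
```

Other judgments (direction of fall, distance of fall) would, under the same assumptions, be different functions evaluated on the same simulation samples.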
Results [Plot: model prediction (expected proportion of tower that will fall) against mean human stability judgment]
The flexibility of common sense (“infinite use of finite means”, “visual Turing test”) Which way will the blocks fall? How far will the blocks fall? If this tower falls, will it knock that one over? If you bump the table, will more red blocks or yellow blocks fall over? If this block had (not) been present, would the tower (still) have fallen over? Which of these blocks is heavier or lighter than the others? …
Direction of fall
Direction and distance of fall
If you bump the table…
If you bump the table… [Plot: model prediction (expected proportion of red vs. yellow blocks that fall) against mean human judgment] (Battaglia & Tenenbaum, in prep)
Experiment 1: Cause/Prevention Judgments (Gerstenberg, Tenenbaum, Goodman, et al., in prep)
Modeling people’s cause/prevention judgments
Physics Simulation Model: p(B|A) - p(B| not A)
p(B|A): 0 if ball misses, 1 if ball goes in
p(B| not A): assume sparse latent Gaussian perturbations on B’s velocity.
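The difference p(B|A) - p(B| not A) can be sketched with a counterfactual simulation. The geometry here is an assumption made for illustration: ball B starts at the origin and, with A absent, travels in a straight line toward a gate at a fixed distance; a single Gaussian perturbation on B's velocity stands in for the "sparse latent Gaussian perturbations". The names `goes_in`, `p_b_without_a`, and `causal_judgment` are hypothetical.

```python
import random

def goes_in(vx, vy, gate_x=10.0, gate_half_width=1.0):
    """Straight-line motion from the origin: does B pass through the gate?"""
    if vx <= 0.0:
        return False
    y_at_gate = vy * gate_x / vx
    return abs(y_at_gate) <= gate_half_width

def p_b_without_a(vx, vy, noise_sd=0.05, n=5000, seed=0):
    """p(B | not A): simulate B's path with A removed, under noisy velocity."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        if goes_in(vx + rng.gauss(0.0, noise_sd), vy + rng.gauss(0.0, noise_sd)):
            hits += 1
    return hits / n

def causal_judgment(actual_went_in, vx, vy, **kwargs):
    """p(B|A) - p(B| not A): positive suggests 'A caused', negative 'A prevented'."""
    p_b_given_a = 1.0 if actual_went_in else 0.0
    return p_b_given_a - p_b_without_a(vx, vy, **kwargs)

if __name__ == "__main__":
    # B was heading wide of the gate but went in after A hit it.
    print(causal_judgment(True, vx=1.0, vy=0.5))
    # B was heading straight for the gate but missed after A hit it.
    print(causal_judgment(False, vx=1.0, vy=0.0))
```

Here `vx` and `vy` are B's velocity just before the collision with A, so the counterfactual run answers "what would B have done if A had not been there?"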
Conclusions From scenes to stories… What contents of stories are routinely accessed through visual scenes? How can we represent that content for reasoning, communication, prediction and planning? Focus on core knowledge present in preverbal infants: intuitive physics, intuitive psychology. Representations using probabilistic programs: thick nodes (e.g. CAD++), thick arrows (physics, graphics, planning), recursive (inference about inference, goals about goals). Challenges for future work: (1) Integrating physics and psychology. (2) Efficient inference. (3) Learning.