A Neural Signature of Hierarchical Reinforcement Learning

Presentation transcript:

A Neural Signature of Hierarchical Reinforcement Learning
José J.F. Ribas-Fernandes, Alec Solway, Carlos Diuk, Joseph T. McGuire, Andrew G. Barto, Yael Niv, Matthew M. Botvinick
Neuron, Volume 71, Issue 2, Pages 370-379 (July 2011). DOI: 10.1016/j.neuron.2011.05.042. Copyright © 2011 Elsevier Inc.

Figure 1. Illustration of HRL Dynamics. At t1, a primitive action (a) is selected. Based on the consequent state, an RPE is computed (green arrow from t2 to t1) and used to update the action policy (π) for the preceding state, as well as the value (V) of that state (an estimate of the expected future reward when starting from that state). At t2 a subroutine (σ) is selected and remains active through t5. During that period, primitive actions are selected as dictated by σ (lower tier). A PPE is computed after each action (lower green arrows from t5 to t2) and used to update the subroutine-specific action policy (πσ) and state values (Vσ). These PPEs are computed with respect to the pseudo-reward received at the end of the subroutine (yellow asterisk). Once the subgoal state of σ is reached, σ is terminated. An RPE is computed for the entire subroutine (upper green arrow from t5 to t2) and used to update the value and policy, V and π, associated with the state in which σ was initiated. A new action is then selected at the top level, yielding primary reward (red asterisk). Adapted from Botvinick et al. (2009).
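The update scheme in this legend lends itself to a compact sketch. The Python below is illustrative only: a tabular two-level learner with hypothetical helpers (env_step, select_primitive, and a sigma object with a subgoal_reached method), not the authors' simulation code. It shows PPEs computed against pseudo-reward inside the subroutine, and a single RPE spanning the whole subroutine at termination.

```python
import collections

ALPHA, GAMMA = 0.1, 0.95                  # learning rate and discount factor (assumed values)
V = collections.defaultdict(float)        # top-level state values
V_sigma = collections.defaultdict(float)  # subroutine-specific state values

def run_subroutine(sigma, state, env_step, select_primitive, pseudo_reward=1.0):
    """Execute subroutine sigma until its subgoal, updating V_sigma with PPEs,
    then update the top-level V with one RPE spanning the whole subroutine."""
    start_state, total_reward, k = state, 0.0, 0
    while not sigma.subgoal_reached(state):
        action = select_primitive(sigma, state)          # action dictated by sigma's policy
        next_state, reward = env_step(state, action)
        # Pseudo-reward is delivered only when the subgoal state is attained.
        pseudo = pseudo_reward if sigma.subgoal_reached(next_state) else 0.0
        ppe = pseudo + GAMMA * V_sigma[(sigma, next_state)] - V_sigma[(sigma, state)]
        V_sigma[(sigma, state)] += ALPHA * ppe           # PPE updates subroutine-level values
        total_reward += (GAMMA ** k) * reward
        state, k = next_state, k + 1
    rpe = total_reward + (GAMMA ** k) * V[state] - V[start_state]
    V[start_state] += ALPHA * rpe                        # RPE updates the initiating state's value
    return state
```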

Figure 2. Task and Predictions from HRL and RL. The left view shows the task display and the underlying geometry of the delivery task. The right view shows the prediction-error signals generated by standard RL and by HRL in each category of jump event. Gray bars mark the time step immediately preceding a jump event. Dashed time courses indicate the PPE generated in C and D jumps that change the subgoal's distance by a smaller amount. For simulation methods, see Experimental Procedures.
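A minimal way to see why the two accounts diverge, assuming (as a simplification, not the authors' exact simulation) that values are proportional to negative remaining distance: an RPE tracks changes in distance to the ultimate goal, whereas a PPE tracks changes in distance to the active subgoal.

```python
def jump_prediction_errors(goal_dist_before, goal_dist_after,
                           subgoal_dist_before, subgoal_dist_after):
    """Illustrative approximation: value ~ -(remaining distance).
    Returns (rpe, ppe) for a single jump event."""
    rpe = -goal_dist_after + goal_dist_before        # change in top-level value
    ppe = -subgoal_dist_after + subgoal_dist_before  # change in subroutine-level value
    return rpe, ppe

# A jump that displaces only the subgoal (overall distance to the goal unchanged)
# yields a nonzero PPE but a zero RPE -- the signature HRL predicts and flat RL does not.
print(jump_prediction_errors(10.0, 10.0, 4.0, 6.0))  # -> (0.0, -2.0)
```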

Figure 3. Results of EEG Experiment. The left view shows evoked potentials at electrode Cz, aligned to jump events and averaged across participants. D and E refer to jump destinations in Figure 2. The data series labeled D-E shows the difference between curves D and E, isolating the PPE effect. The right view shows the scalp topography for condition D with baseline condition E subtracted (topography plotted on the same grid used in Yeung et al. [2005]).
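The D-E series is a standard condition-difference wave; a minimal sketch of that subtraction (assuming NumPy arrays of per-participant epochs, not the authors' analysis pipeline) is:

```python
import numpy as np

def grand_average_difference(erp_d, erp_e):
    """erp_d, erp_e: arrays of shape (n_participants, n_timepoints), epochs at Cz
    aligned to jump events. Averaging within condition and subtracting E from D
    isolates the PPE-related component shown as the D-E series."""
    return erp_d.mean(axis=0) - erp_e.mean(axis=0)
```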

Figure 4. Results of fMRI Experiment 1. Shown are regions displaying a positive correlation with the PPE, independent of subgoal displacement. Talairach coordinates of the peak are 0, 9, and 39 for the dorsal ACC, and 45, 12, and 0 for the right anterior insula. Not shown are foci in the left anterior insula (−45, 9, −3) and the lingual gyrus (0, −66, 0). Color indicates general linear model parameter estimates, ranging from 3.0 × 10^−4 (palest yellow) to 1.2 × 10^−3 (darkest orange).

Figure 5. Results of Behavioral Experiment. The left view is an example of a choice display. Subgoal 1 always lay on an ellipse defined by the house and the truck. In this example, subgoal 2 has a smaller overall distance to the goal and a larger distance to the truck relative to subgoal 1 (labels not shown to participants). The right view shows the results of a logistic regression on choices and of the comparison between two RL models. Choices were driven significantly by the ratio of the two subgoals' distances to the goal (left box; the central mark is the median, box edges correspond to the 25th and 75th percentiles, whiskers to extreme values, and individual dots outside the box and whiskers are outliers; each colored dot represents a single participant's data), whereas the ratio of distances to the subgoal did not significantly explain participants' choices (middle box). Bayes factors favored the model with reward only for goal attainment and no reward for the subgoal over the one with reward for both subgoal and goal attainment (right box).
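In outline, the choice analysis summarized in the left and middle boxes is a logistic regression of subgoal choice on distance ratios. The sketch below is a plausible reconstruction with hypothetical variable names, using statsmodels; it is not the authors' analysis code.

```python
import numpy as np
import statsmodels.api as sm

def fit_choice_model(chose_subgoal2, goal_dist1, goal_dist2, truck_dist1, truck_dist2):
    """chose_subgoal2: 0/1 per trial; *_dist*: per-trial distances for the two subgoals.
    Predictors are log distance ratios. The pattern in Figure 5 corresponds to a
    significant coefficient on the goal-distance ratio and a nonsignificant one on
    the subgoal (truck) distance ratio."""
    X = np.column_stack([
        np.log(goal_dist2 / goal_dist1),    # ratio of distances to the goal
        np.log(truck_dist2 / truck_dist1),  # ratio of distances to the subgoal (truck)
    ])
    X = sm.add_constant(X)
    return sm.Logit(chose_subgoal2, X).fit()
```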