Model-based RL (+ action sequences): maybe it can explain everything

Niv lab meeting, 6/11/2012
Stephanie Chan

Goal-directed vs. habitual instrumental actions

Goal-directed:
- After moderate training
- Choose action based on expected outcome
- PFC & DMS (caudate)
- Usually modeled as: model-based RL

Habitual:
- After extensive training
- Choose action based on previous actions/stimuli
- Sensorimotor cortices + DLS (putamen)
- Not sensitive to: reinforcer devaluation, changes in action-outcome contingency
- Usually modeled as: model-free RL
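As a rough illustration of the mapping in this comparison (mine, not from the talk), the sketch below contrasts the two computations in Python: a model-free learner caches Q(s,a) and nudges it with a TD error, while a model-based learner keeps a transition/reward model and recomputes values from it, so a change in outcome value propagates immediately to choice. All names and constants are placeholders.

```python
import numpy as np

n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.1

# Model-free (habitual): cache Q(s,a) and nudge it with a TD error.
Q = np.zeros((n_states, n_actions))

def model_free_update(s, a, r, s_next):
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error   # outcome identity/value is never consulted at choice time

# Model-based (goal-directed): keep a model (P, R) and recompute values on demand,
# so a change in the reward model R shows up in the very next choice.
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # learned transition model
R = np.zeros((n_states, n_actions))                            # learned reward model

def model_based_q(V):
    return R + gamma * np.einsum('ijk,k->ij', P, V)            # Q(s,a) from the model

V = np.zeros(n_states)
for _ in range(50):                 # short value iteration over the learned model
    V = model_based_q(V).max(axis=1)
```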

Goal-directed vs. habitual instrumental actions
What do real animals do?

Model-free RL
- Explains resistance to devaluation: the devaluation test occurs in "extinction", so there is no feedback and no TD error
- Does NOT explain resistance to changes in action-outcome contingency
- In fact, habitual behavior should be MORE sensitive to changes in contingency
- Maybe: update rates become small after extended training
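To make the first bullet concrete, here is a toy sketch (mine, not the presenter's; the numbers are arbitrary): a cached value is only changed by experienced TD errors, so devaluing the outcome off-line and then testing in extinction leaves the value, and hence responding, unchanged.

```python
alpha = 0.1
q_press = 0.0

# Training: lever press -> food, experienced reward r = 1.
for _ in range(200):
    q_press += alpha * (1.0 - q_press)      # TD error = r - Q(press)

# Devaluation happens outside the task (e.g., food paired with illness), so the
# press -> devalued-outcome pairing is never experienced; the test is in extinction,
# so no outcome is delivered, no TD error occurs, and no update is applied.
print(f"Q(press) at test: {q_press:.2f}")   # ~1.0 -> the model-free agent keeps pressing
```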

Alternative explanation
- We don't need model-free RL
- Habit formation = association of individual actions into "action sequences"
- More parsimonious
- What's needed: a means of modeling action sequences

Over the course of training
- Exploration -> exploitation
- Variability -> stereotypy
- Errors and RT -> decrease
- Individual actions -> "chunked" sequences
- PFC + associative striatum -> sensorimotor striatum
- "Closed loop" -> "open loop"

When should actions get chunked?
- Q-learning with dwell times: Q(s,a) = R(s) + E[V(s')] - D(s)<R>
- Chunk when costs (possible mistakes) are outweighed by benefits (decreased decision time)
- Cost: C(s,a,a') = E[Q(s',a') - V(s')] = E[A(s',a')]
- Efficient way to compute this: TD_t = [r_t - d_t<R> + V(s_{t+1})] - V(s_t), which is a sample of A(s_t,a_t)
- Benefit: (# timesteps saved) × <R>
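A minimal sketch of how this criterion could be computed online, based on my reading of the slide (it is not the presenter's actual code; the learning rates, running-average updates, and the `saved_time` parameter are assumptions): the average-reward TD error is accumulated as a running sample of A(s', a'), and a pair is chunked when the time saved times <R> outweighs the magnitude of that cost.

```python
from collections import defaultdict

alpha, eta = 0.1, 0.01                        # learning rates (assumed values)
Q = defaultdict(lambda: defaultdict(float))   # Q[s][a], average-reward action values
R_bar = 0.0                                   # <R>, average reward per unit time
C = defaultdict(float)                        # C[(s, a, a')] ~ running estimate of E[A(s', a')]

def V(s):
    """State value: max over cached action values (0 for unseen states)."""
    return max(Q[s].values(), default=0.0)

def update(s, a, r, dwell, s_next, prev_sa=None):
    """One transition: update Q and <R>, and sample the chunking cost for the previous pair."""
    global R_bar
    td = (r - dwell * R_bar + V(s_next)) - V(s)      # TD_t: a sample of A(s, a)
    Q[s][a] += alpha * ((r - dwell * R_bar + V(s_next)) - Q[s][a])
    R_bar += eta * (r / max(dwell, 1e-6) - R_bar)
    if prev_sa is not None:                          # prev_sa = (s_prev, a_prev) that led into s
        key = (*prev_sa, a)
        C[key] += alpha * (td - C[key])              # C(s_prev, a_prev, a) tracks E[A(s, a)]

def worth_chunking(s, a, a_next, saved_time=1.0):
    """Chunk (s, a, a_next) when benefit = saved_time * <R> outweighs the cost |E[A(s', a_next)]|."""
    return saved_time * R_bar > -C[(s, a, a_next)]
```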

When do they get unchunked?
- C(s,a,a') is insensitive to changes in the environment: primitive actions are no longer evaluated, so there is no TD error and no new samples for C
- But <R> is sensitive to changes...
- Action sequences get unchunked when the environment changes so as to decrease <R>
- No unchunking if the environment changes to present a better alternative that would increase <R>
- Ostlund et al. 2009: rats are immediately sensitive to devaluation of the state that the macro action lands on, but not of the intermediate states
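Continuing the sketch above (again my assumption about the mechanism, with a hypothetical tolerance parameter), the unchunking check would compare the current <R> against its level when the sequence was formed:

```python
def maybe_unchunk(sequences, R_bar_now, R_bar_at_chunking, tolerance=0.05):
    """Dissolve macro actions only if average reward has dropped since they were formed.

    A change that merely makes some other option better does not lower <R>, so the
    habit persists until the alternative is actually sampled.
    """
    if R_bar_now < R_bar_at_chunking - tolerance:
        return []            # unchunk: fall back to deliberating over primitive actions
    return sequences         # <R> not reduced -> the sequences persist
```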

Simulations I: SRTT-like (serial reaction time) task

Simulations II: Instrumental conditioning
- Reinforcer devaluation
- Non-contingent
- Omission