COMP 2208 Dr. Long Tran-Thanh University of Southampton Reinforcement Learning

Decision making
[Diagram: the agent–environment loop. Perception of the environment feeds into categorising inputs, updating the belief model, and updating the decision-making policy; decision making then produces behaviour that acts back on the environment.]

Sequential decision making
Environment → Perception → Decision making → Behaviour/Action → Environment
Repeatedly making decisions: one decision per round.
Uncertainty: the outcome of a decision is not known in advance and is noisy.

Example: getting out of the maze
At each time step: choose a direction, make a step, check whether it is the exit.
Goal: find a way out of the maze.
This is a standard search problem.

Example: getting out of the maze
At each time step: choose a direction, make a step, check whether it is the exit.
New goal: find the shortest path from A (the entrance) to B (the exit).
The robot can try several times, comparing candidate paths.

Example: getting out of the maze
At each time step: choose a direction, make a step, check whether it is the exit.
What we want is a policy (of behaviour): in each situation, it tells us what to do.
Here, the policy is a system of shortest paths: from each point in the maze to the exit.

Example: getting out of the maze
At each time step: choose a direction, make a step, check whether it is the exit.
Supervised vs. unsupervised learning? Unsupervised.
Offline vs. online learning? Online.

Reinforcement learning
A specific unsupervised, online learning problem (some authors treat it as distinct from unsupervised learning, e.g., Bishop, 2006).
The setting: the agent repeatedly interacts with the environment and gets some feedback (e.g., positive or negative) = reinforcement.
The learning problem: obtain a good policy based on this feedback.

Motivations

Difficulty of reinforcement learning

How do we know which actions led us to victory? And what about those that made us lose the game?
How can we measure which action is the best to take at each time step?
We need to be able to evaluate the actions we take (and the states we are in).

States, actions, and rewards
Think about the world as a set of (discrete) states; an action moves us from one state to another (e.g., an action = physically moving between locations).
Some states are good and some are bad: e.g., the exit of the maze = good, other locations = bad (or not so good).
Reward = the feedback of the environment; it measures the "goodness" of the action taken. If the new state is "good", the reward is high, and vice versa.
Goal: maximise the sum of collected rewards over time.

Simple example
A simple maze: a 2×3 grid of six states, with A, B, C in the top row and D, E, F in the bottom row.
F is a terminal state: any action that takes us to F receives a reward of +100, and the game then starts again at A.
All other states give zero reward.
Maximising rewards over time = finding the shortest path.
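To make the later update rules concrete, here is a minimal Python sketch of this maze. The grid adjacency is inferred from the 2×3 layout described above and is an assumption (the original slide figure is not reproduced here); the names STATES, NEIGHBOURS and reward are illustrative only.

```python
# A 2x3 grid maze: A B C (top row), D E F (bottom row).
# Moving into F yields +100; every other move yields 0.
STATES = ["A", "B", "C", "D", "E", "F"]

# Assumed adjacency of the 2x3 grid (horizontal and vertical neighbours).
NEIGHBOURS = {
    "A": ["B", "D"],
    "B": ["A", "C", "E"],
    "C": ["B", "F"],
    "D": ["A", "E"],
    "E": ["B", "D", "F"],
    "F": [],  # terminal: the game restarts at A
}

def reward(next_state):
    """Reward for entering a state: +100 for the exit F, 0 otherwise."""
    return 100.0 if next_state == "F" else 0.0
```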

An intuition for learning good policies
At the beginning we have no prior knowledge, so we start with a simple policy: just move randomly in each state.
Of course, we pay attention to the rewards we have collected so far.
At some point we will eventually arrive at F, for which move we receive +100.
Reasoning: what was the last state before I got to F? It must be a good state too (since it leads to F), so we update the value of that state to be "good".

Temporal difference learning
We maintain a value V_i for each state i; it represents how valuable that state is (in the long run).
We update these values as we go along.

Temporal difference learning
How are we going to update our estimate of each state's value?
The immediate reward is important: if moving into a certain state gives us a reward or punishment, that obviously needs recording.
But future reward is important too: a state may give us nothing now, but we still like it if it is linked to future reinforcement.

Temporal difference learning
It is also important not to let any single learning experience change our opinion too much: we want to change our V estimates gradually, to let the long-run picture emerge, so we need a learning rate.
The formula for TD learning combines all of these factors: the current reward, the future reward, and progressive learning.

Temporal difference learning
Moving from state i to state j and receiving reward r, we update:
V_i(t+1) = V_i(t) + α · ( r + V_j(t) − V_i(t) )
where V_i(t+1) is the new estimate of V_i at time step t+1, V_i(t) is the old estimate, α is the learning rate, r is the current reward, and V_j(t) − V_i(t) is the temporal difference: the difference in value between the old and new states.
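As a concrete illustration, here is a minimal Python sketch of this update rule (undiscounted, matching the formula above). The function name td_update and the plain dict of values are assumptions for illustration, not from the slides.

```python
def td_update(V, i, j, r, alpha=0.1):
    """One TD step: we moved from state i to state j and received reward r.

    V maps each state to its current value estimate; alpha is the learning rate.
    The update adds the reward plus the temporal difference V[j] - V[i], scaled by alpha.
    """
    V[i] += alpha * (r + V[j] - V[i])

# All value estimates start at zero for the six maze states.
V = {s: 0.0 for s in "ABCDEF"}
```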

Temporal difference learning
Suppose α = 0.1 (the learning rate) and we move from A to B.
We will receive 0 for a while, so the value estimates stay at 0 …

Temporal difference learning
Suppose at some point we make a C → F move (F = +100).
The new value of V_C is 10!
Since F is the exit, we restart the game at the next step (we do not update V_F), but we keep all the V_i values we have learned so far …
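The arithmetic behind that number, as a self-contained snippet (illustrative variable names):

```python
alpha = 0.1
V = {s: 0.0 for s in "ABCDEF"}           # all estimates start at zero
# C -> F move with reward +100: V_C <- V_C + alpha * (r + V_F - V_C)
V["C"] += alpha * (100.0 + V["F"] - V["C"])
print(V["C"])                            # 10.0
```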

Temporal difference learning
After a fair number of time steps, we start to get a sense of where the high-value states are.
But then, after 500K steps: [figure: the learned value grid after 500K steps, showing values around 100 and 0]

Temporal difference learning
So what is the reason? From any state we can eventually get to F sooner or later, so in the long term all the states look equally valuable.
We need a way to distinguish paths that require fewer moves: states that lie on shorter paths should have higher value.

Temporal difference learning
Solution: discount future rewards.
Rewards in the far future are less valuable; the current reward and rewards in the near future matter more.
This introduces a discount factor γ (with 0 < γ < 1) into the update:
V_i ← V_i + α · ( r + γ · V_j − V_i )
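In code, the only change to the earlier TD sketch is the extra discount factor (gamma = 0.9 below is an arbitrary illustrative choice):

```python
def td_update_discounted(V, i, j, r, alpha=0.1, gamma=0.9):
    """TD step with discounting: the future value V[j] is weighted by gamma < 1."""
    V[i] += alpha * (r + gamma * V[j] - V[i])
```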

Temporal difference learning
Now, even after 500K steps, the values remain informative.
What is the best policy then? Always move to the neighbour with the highest value.
We also need to know which actions are required to follow this policy.
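As a sketch, this greedy policy over the learned values is a one-liner (assuming a neighbours mapping like the NEIGHBOURS dict in the earlier maze sketch):

```python
def greedy_move(state, V, neighbours):
    """Follow the learned values: step to the neighbouring state with the highest V."""
    return max(neighbours[state], key=lambda s: V[s])
```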

Q-learning
In some cases we also need to learn the outcomes of the actions: we do not know in advance which action takes us to which state.
Put differently, we want to learn the value of taking each action in each state as well: the Q value Q(i, k) of taking action k in state i.
The expected value of getting to state j is the maximum Q value we could obtain for any action x taken at j.
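A minimal Q-learning update sketch in Python, under the same assumptions as the earlier snippets (the function name and the dict-of-dicts representation of Q are illustrative, not from the slides):

```python
def q_update(Q, i, k, j, r, alpha=0.1, gamma=0.9):
    """One Q-learning step: in state i we took action k, landed in state j, got reward r.

    Q[i][k] is the estimated value of taking action k in state i.
    The target uses the best Q value obtainable from the next state j.
    """
    best_next = max(Q[j].values()) if Q[j] else 0.0   # terminal states offer no actions
    Q[i][k] += alpha * (r + gamma * best_next - Q[i][k])
```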

Markov Decision Processes
So far, our actions have led us deterministically from one state to another.
But in many real-world situations, the state transitions are stochastic.

Markov Decision Processes
How do we capture this uncertainty? With a Markov decision process: states, actions, and rewards as usual, but the state transition is stochastic: taking action k in state i can lead to any of several next states J1, …, Jm, each with some probability.
Markov property: the probability of arriving at state j as the next state depends only on the current state and the current action; the past has no influence on the near future.
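Written out as a formula (a standard formulation, not spelled out on the slide):
P( s_{t+1} = j | s_t = i, a_t = k, s_{t−1}, a_{t−1}, …, s_0 ) = P( s_{t+1} = j | s_t = i, a_t = k )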

Markov Decision Processes
The next state j is now drawn according to the state transition probability P(j | i, k).
TD-learning: V_i ← V_i + α · ( r + γ · V_j − V_i ), with j sampled from P(j | i, k).
Q-learning: Q(i, k) ← Q(i, k) + α · ( r + γ · max_x Q(j, x) − Q(i, k) ), with j sampled from P(j | i, k).

How do we update the values in MDPs?
Since the state transition is stochastic, we do not know in advance what the next state will be when we take action k.
If the system is real and we can only control the actions, we simply take the action and observe the next state.
However, in many cases the system is a simulation, so we also need to generate the state transitions ourselves.
How can we simulate the state transition process?

Monte Carlo simulation
Monte Carlo simulation (Fermi, von Neumann, Ulam, Metropolis).
Example: from state i, action k leads to J1 with probability P1 = 0.5, to J2 with P2 = 0.2, and to J3 with P3 = 0.3.
Draw a random number between 0 and 1:
if it is below 0.5 → choose J1; if it is between 0.5 and 0.7 → choose J2; if it is above 0.7 → choose J3.
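A minimal Python sketch of this sampling step (inverse-CDF style), using the illustrative probabilities above; the function name sample_next_state is an assumption:

```python
import random

def sample_next_state(transition_probs):
    """Sample a next state from a dict {state: probability} whose values sum to 1.

    Draw u uniformly in [0, 1) and walk through the cumulative probabilities.
    """
    u = random.random()
    cumulative = 0.0
    for state, p in transition_probs.items():
        cumulative += p
        if u < cumulative:
            return state
    return state  # guard against floating-point rounding

# Example from the slide: taking action k in state i.
next_state = sample_next_state({"J1": 0.5, "J2": 0.2, "J3": 0.3})
```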

Which actions/states should we update?
So far we have only dealt with how to update the V and Q values.
Question: which one should we choose to update next? (i.e., which action should we choose next?)
If it is all about learning the values as accurately as possible, uniformly randomly choosing the actions works.
In some cases, however, the actual rewards collected along the way count as well (there is no separate training phase).

Exploration vs. exploitation
The dilemma of exploration vs. exploitation:
Exploration: we want to learn as accurately as possible in order to make better decisions (the longer we learn, the better).
Exploitation: we want to exploit the best actions as soon as possible (the less exploration, the better).
How do we resolve this paradox? Remember the bandit algorithms: we combine epsilon-greedy with TD-learning or Q-learning.
We choose the action with the highest estimate with probability (1 − epsilon), and uniformly randomly choose another one with probability epsilon, as in the sketch below.
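A minimal epsilon-greedy selection sketch in Python, under the same illustrative Q representation as before (action names are made up for the example):

```python
import random

def epsilon_greedy(Q_state, epsilon=0.1):
    """Pick an action from Q_state, a dict {action: estimated value}.

    With probability epsilon explore (uniform random action);
    otherwise exploit (action with the highest current estimate).
    """
    if random.random() < epsilon:
        return random.choice(list(Q_state.keys()))
    return max(Q_state, key=Q_state.get)

# Example: estimates for the actions available in some state.
action = epsilon_greedy({"up": 1.2, "down": 0.4, "left": 0.0, "right": 2.5})
```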

Extensions of reinforcement learning
Partially observable MDPs (PoMDPs): we do not fully observe the state we are in, so we maintain a (Bayesian) belief over the possible states we could be in.
Decentralised MDPs (DecMDPs): multiple agents working together in a decentralised manner.
DecPoMDPs: the combination of the two above.
Inverse RL: classical RL asks, given the system, what is the optimal policy? Inverse RL asks, given the optimal policy, what is the underlying system?

Applications

Teaching a helicopter to perform inverted hovering (Stanford).
Smart home heating (Southampton).

Summary
Reinforcement learning: an important learning problem; unsupervised, online, with feedback from the environment.
State value update: TD-learning.
Action value update: Q-learning.
Which action/state to choose next: epsilon-greedy.
Extensions: MDP, PoMDP, DecMDP, etc.