Learning: Reinforcement Learning Russell and Norvig: ch 21 CMSC421 – Fall 2005.


Project Teams: 2-3 members. You should have emailed me your team members! Two components: define the environment and the learning agent.

Example: Agent_with_Personality. State: mood (happy, sad, mad, bored) and sensor (smile, cry, glare, snore). Actions: smile, hit, tell-joke, tickle. Define: S × A × S′ × P, with probabilities and an output string for each transition. Define: S × {-10, 10} (rewards).

Example, continued. States: happy (s0), sad (s1), mad (s2), bored (s3); sensor outputs: smile (p0), cry (p1), glare (p2), snore (p3). Actions: smile (a0), hit (a1), tell-joke (a2), tickle (a3). Define: S × A × S′ × P, with probabilities and an output string, i.e. "It makes me happy when you smile", "Argh! Quit smiling at me!!!", "Oh, I'm so happy I don't care if you hit me", "HEY!!! Quit hitting me", "Boo hoo, don't be hitting me". Define: S × {-10, 10} (rewards), i.e. …
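To make the expected format concrete, here is a minimal Python sketch of one way such an environment could be encoded. The states, actions, and output strings come from the slides; the specific transition probabilities, reward values, and the step function are invented for illustration.

    import random

    # States and actions from the slides; probabilities and rewards below are made up.
    STATES = ["happy", "sad", "mad", "bored"]          # s0..s3
    ACTIONS = ["smile", "hit", "tell-joke", "tickle"]  # a0..a3

    # Transition model S x A x S' x P: (s, a) -> list of (s', probability, output string)
    TRANSITIONS = {
        ("bored", "smile"): [("happy", 0.8, "It makes me happy when you smile"),
                             ("mad",   0.2, "Argh! Quit smiling at me!!!")],
        ("happy", "hit"):   [("happy", 0.3, "Oh, I'm so happy I don't care if you hit me"),
                             ("mad",   0.5, "HEY!!! Quit hitting me"),
                             ("sad",   0.2, "Boo hoo, don't be hitting me")],
        # ... the remaining (s, a) pairs would be filled in the same way
    }

    # Reward mapping S -> {-10, ..., 10} (values are illustrative)
    REWARDS = {"happy": 10, "sad": -5, "mad": -10, "bored": 0}

    def step(state, action):
        """Sample s' from the transition distribution; return (s', reward, output string)."""
        outcomes = TRANSITIONS[(state, action)]
        next_states, probs, strings = zip(*outcomes)
        i = random.choices(range(len(outcomes)), weights=probs)[0]
        return next_states[i], REWARDS[next_states[i]], strings[i]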

Example: Robot Navigation. State: location. Actions: forward, back, left, right. State → Reward: define the rewards of the states in your grid. State × Action → State: defined by the movements.
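A similar sketch for the navigation example; the grid size, reward values, step cost, and the mapping of forward/back/left/right onto grid moves are all assumptions rather than part of the assignment.

    # Hypothetical 3x4 grid; a state is a (row, col) location.
    GRID_ROWS, GRID_COLS = 3, 4
    REWARDS = {(0, 3): 10, (1, 3): -10}   # assumed goal and pit cells
    DEFAULT_REWARD = -0.1                 # assumed small step cost elsewhere

    # Assumed encoding of the four actions as (row delta, col delta).
    MOVES = {"forward": (-1, 0), "back": (1, 0), "left": (0, -1), "right": (0, 1)}

    def step(state, action):
        """State x Action -> State via deterministic movement, plus the new state's reward."""
        row, col = state
        dr, dc = MOVES[action]
        new = (min(max(row + dr, 0), GRID_ROWS - 1),
               min(max(col + dc, 0), GRID_COLS - 1))  # clamp to stay inside the grid
        return new, REWARDS.get(new, DEFAULT_REWARD)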

Learning Agent. Calls the environment program to get a training set. Outputs a Q function: Q(S × A). We will evaluate the output of your learning program by using it to execute actions and computing the reward obtained.
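The evaluation described here might look roughly like the sketch below, assuming an environment step function in the style of the earlier sketches and a learned Q table keyed by (state, action) pairs; the greedy action choice and the fixed episode length are assumptions.

    def evaluate(Q, step, start_state, actions, n_steps=100):
        """Run the greedy policy implied by Q(s, a) and total up the reward obtained."""
        state, total = start_state, 0.0
        for _ in range(n_steps):
            # Act greedily with respect to the learned Q function.
            action = max(actions, key=lambda a: Q.get((state, a), 0.0))
            state, reward = step(state, action)[:2]  # some environments also return a string
            total += reward
        return total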

Schedule. Monday, Dec. 5: electronically submit your environment. Monday, Dec. 12: submit your learning agent. Wednesday, Dec. 13: submit your write-up.

Reinforcement Learning. Supervised learning is the simplest and best-studied type of learning. Another type of learning task is learning behaviors when we don't have a teacher to tell us how. The agent has a task to perform; it takes some actions in the world, and at some later point it gets feedback telling it how well it did at the task. The agent performs the same task over and over again. It gets carrots for good behavior and sticks for bad behavior. This is called reinforcement learning because the agent gets positive reinforcement for tasks done well and negative reinforcement for tasks done poorly.

Reinforcement Learning The problem of getting an agent to act in the world so as to maximize its rewards. Consider teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem. We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs.

Reinforcement Learning: for blackjack, for robot motion, for controllers.

Formalization. We have a state space S and a set of actions a1, …, ak. We want to learn which action to take at every state in the space. At the end of a trial, we get some reward, positive or negative. We want the agent to learn how to behave in the environment: a mapping from states to actions. Example: ALVINN. State: configuration of the car; learn a steering action for each state.

Reactive Agent Algorithm (accessible or observable state):
    repeat:
        s ← sensed state
        if s is terminal then exit
        a ← choose action (given s)
        perform a

Policy (Reactive/Closed-Loop Strategy). A policy π is a complete mapping from states to actions.

Reactive Agent Algorithm (with policy π):
    repeat:
        s ← sensed state
        if s is terminal then exit
        a ← π(s)
        perform a
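A direct transcription of this loop into Python; the helpers sense, is_terminal, and perform are placeholders for whatever the environment actually provides.

    def reactive_agent(policy, sense, is_terminal, perform):
        """Repeat: sense the state, exit if it is terminal, otherwise perform policy(s)."""
        while True:
            s = sense()          # s <- sensed state
            if is_terminal(s):   # if s is terminal then exit
                return
            a = policy(s)        # a <- pi(s)
            perform(a)           # perform a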

Approaches: learn the policy directly (a function mapping from states to actions), or learn utility values for states (the value function).

Value Function An agent knows what state it is in and it has a number of actions it can perform in each state. Initially it doesn't know the value of any of the states. If the outcome of performing an action at a state is deterministic then the agent can update the utility value U() of a state whenever it makes a transition from one state to another (by taking what it believes to be the best possible action and thus maximizing): U(oldstate) = reward + U(newstate) The agent learns the utility values of states as it works its way through the state space.

Exploration The agent may occasionally choose to explore suboptimal moves in the hopes of finding better outcomes. Only by visiting all the states frequently enough can we guarantee learning the true values of all the states. A discount factor is often introduced to prevent utility values from diverging and to promote the use of shorter (more efficient) sequences of actions to attain rewards. The update equation using a discount factor gamma is: U(oldstate) = reward + gamma * U(newstate) Normally gamma is set between 0 and 1.
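A minimal sketch of the discounted update just described, keeping U in a dictionary (an assumption) with unseen states defaulting to 0; gamma = 1 recovers the undiscounted rule from the previous slide, and the default of 0.9 is only illustrative.

    def update_utility(U, old_state, new_state, reward, gamma=0.9):
        """Deterministic-transition update: U(oldstate) = reward + gamma * U(newstate)."""
        U[old_state] = reward + gamma * U.get(new_state, 0.0)
        return U[old_state]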

Q-Learning augments value iteration by maintaining a utility value Q(s,a) for every action at every state. The utility of a state, U(s) or Q(s), is simply the maximum Q value over all the possible actions at that state.

Q-Learning algorithm:
    for each state s, for each action a: Q(s,a) = 0
    s = current state
    do forever:
        a = select an action
        do action a
        r = reward from doing a
        t = resulting state from doing a
        Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (r + gamma * Q(t))
        s = t
Notice that a learning coefficient, alpha, has been introduced into the update equation. Normally alpha is set to a small positive constant less than 1.
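The same procedure written out as runnable Python, assuming an environment step(s, a) that returns the resulting state and reward, and reading Q(t) as the maximum of Q(t, a') over the actions, consistent with the previous slide. The random action selection and the defaults for alpha, gamma, and the number of steps are illustrative choices, not part of the slide.

    import random
    from collections import defaultdict

    def q_learning(step, actions, start_state, alpha=0.1, gamma=0.9, n_steps=10000):
        """Tabular Q-learning: Q(s,a) = (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(t,a'))."""
        Q = defaultdict(float)            # Q(s, a) = 0 for every state and action
        s = start_state
        for _ in range(n_steps):          # "do forever", truncated for the sketch
            a = random.choice(actions)    # select an action (exploration policy left open)
            t, r = step(s, a)[:2]         # resulting state and reward from doing a
            best_next = max(Q[(t, a2)] for a2 in actions)  # Q(t) = max over actions at t
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = t
        return dict(Q)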

Selecting an Action. Simply choose the action with the highest expected utility? Problem: an action has two effects: it gains reward on the current sequence, and it yields information that is used in learning for future sequences. There is a trade-off between immediate good and long-term well-being: you can get stuck in a rut, or jump off a cliff just because you've never done it before…

Exploration Policy. Wacky approach: act randomly, in hopes of eventually exploring the entire environment. Greedy approach: act to maximize utility using the current estimate. We need to find some balance: act more wacky when the agent has little idea of the environment, and more greedy when the model is close to correct. Example: one-armed bandits…
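One standard way to strike this balance (not named on the slide) is an epsilon-greedy policy: act randomly with probability epsilon and greedily otherwise, shrinking epsilon as the estimates improve. A short sketch, reusing a Q table keyed by (state, action):

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Explore ("wacky") with probability epsilon; otherwise exploit the current Q estimate ("greedy")."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))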

RL Summary. Reinforcement learning is an active area of research in both OR and AI. There are several more sophisticated algorithms that we have not discussed. It is applicable to game playing, robot controllers, and other domains.