Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 7


Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 7 Sep 23, 2016

Course Announcements
Assignment 1 has been posted. It covers:
- ValueOfInfo and ValueOfControl
- MDPs: Value Iteration
- POMDPs: Belief State Update
POMDPs were initially formalized by the control theory and operations research communities to optimally control stochastic dynamical systems. More recently, the artificial intelligence community also considered POMDPs for planning under uncertainty. To date, a wide range of sequential decision problems such as robot navigation, preference elicitation, stochastic resource allocation, maintenance scheduling, spoken-dialog systems, and many others have been modeled using POMDPs.

422 Big Picture (Representation / Reasoning Technique)
Concept-map slide; recoverable labels: Deterministic vs. Stochastic; Query, Planning; Logics, First Order Logics, Ontologies, Full Resolution, SAT; Belief Nets (Approx.: Gibbs); Markov Chains and HMMs (Forward, Viterbi; Approx.: Particle Filtering); Undirected Graphical Models, Conditional Random Fields; Markov Decision Processes and Partially Observable MDPs (Value Iteration, Approx. Inference); Reinforcement Learning; Temporal rep.; Applications of AI.
Speaker notes: First Order Logics (will build a little on Prolog and 312); temporal reasoning from NLP 503 lectures; Reinforcement Learning (not done in 340, at least in the latest offering).

Lecture Overview
- Start Reinforcement Learning
- Start Q-learning
- Estimate by Temporal Differences

MDP and Reinforcement Learning
Markov decision process:
- Set of states S, set of actions A
- Transition probabilities to next states P(s'|s,a)
- Reward function R(s), R(s,a), or R(s,a,s')
RL is based on MDPs, but:
- the transition model is not known
- the reward model is not known
While for MDPs we can compute an optimal policy, RL learns an optimal policy.
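To make these ingredients concrete, here is a minimal sketch of how an MDP could be represented; it is not from the slides, and the container class and the tiny two-state example are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    """Minimal MDP container: states S, actions A, transition model P, reward R."""
    states: List[State]
    actions: List[Action]
    # transitions[(s, a)] maps each successor state s' to P(s' | s, a)
    transitions: Dict[Tuple[State, Action], Dict[State, float]]
    # rewards[(s, a, s')] is the reward for doing a in s and ending up in s'
    rewards: Dict[Tuple[State, Action, State], float]
    discount: float = 0.9

# Hypothetical two-state example: action "go" usually moves s0 to s1 for reward 1.
tiny = MDP(
    states=["s0", "s1"],
    actions=["go", "stay"],
    transitions={
        ("s0", "go"): {"s1": 0.8, "s0": 0.2},
        ("s0", "stay"): {"s0": 1.0},
        ("s1", "go"): {"s1": 1.0},
        ("s1", "stay"): {"s1": 1.0},
    },
    rewards={
        ("s0", "go", "s1"): 1.0, ("s0", "go", "s0"): 0.0,
        ("s0", "stay", "s0"): 0.0, ("s1", "go", "s1"): 0.0, ("s1", "stay", "s1"): 0.0,
    },
)
```

In RL, the agent acts in such an environment but does not get to see the transitions and rewards dictionaries; it only observes the states and rewards that actually occur.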

Search-Based Approaches to RL
Policy search (stochastic local search):
- Start with an arbitrary policy
- To evaluate a policy, try it out in the world
- Generate some neighbours ...
Problems with these evolutionary approaches:
- The policy space can be huge: with n states and m actions there are m^n policies
- Policies are evaluated as a whole, based on how the agent survives: the use of experience is wasteful, because we cannot directly take into account locally good/bad behaviours

Q-learning
Contrary to search-based approaches, Q-learning learns after every action.
It learns components of a policy, rather than the policy itself:
Q(s,a) = expected value of doing action a in state s and then following the optimal policy.
Using the discounted reward we have seen in MDPs, this can be written as
Q(s,a) = R(s) + γ Σ_s' P(s'|s,a) V*(s')
where R(s) is the reward in s, P(s'|s,a) is the probability of getting to s' from s via a, the sum ranges over the states s' reachable from s by doing a, and V*(s') is the expected value of following the optimal policy π in s'.

Q values

        s0          s1          ...    sk
a0      Q[s0,a0]    Q[s1,a0]    ...    Q[sk,a0]
a1      Q[s0,a1]    Q[s1,a1]    ...    Q[sk,a1]
...
an      Q[s0,an]    Q[s1,an]    ...    Q[sk,an]

If the agent had the complete Q-function, would it know how to act in every state?
A. Yes  B. No
(If more than one action achieves the max, pick one of the max actions randomly.)
But how to learn the Q-values?

Q values

        s0          s1          ...    sk
a0      Q[s0,a0]    Q[s1,a0]    ...    Q[sk,a0]
a1      Q[s0,a1]    Q[s1,a1]    ...    Q[sk,a1]
...
an      Q[s0,an]    Q[s1,an]    ...    Q[sk,an]

Once the agent has a complete Q-function, it knows how to act in every state: it selects one of the actions that maximizes Q[s,a].
By learning what to do in each state, rather than the complete policy as in search-based methods, learning becomes linear rather than exponential in the number of states.
But how to learn the Q-values?
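A minimal sketch of "select one of the actions that maximizes Q[s,a], breaking ties randomly"; the dictionary-based Q-table and the state/action names are illustrative assumptions, not from the slides.

```python
import random
from collections import defaultdict

# Q-table: Q[s][a] holds the current estimate of Q[s, a]; unseen entries default to 0.
Q = defaultdict(lambda: defaultdict(float))

def greedy_action(Q, s, actions):
    """Select an action that maximizes Q[s, a]; break ties randomly."""
    best = max(Q[s][a] for a in actions)
    return random.choice([a for a in actions if Q[s][a] == best])

# Hypothetical usage: two actions with equal value in state "s0".
actions = ["a0", "a1"]
Q["s0"]["a0"] = 0.5
Q["s0"]["a1"] = 0.5
print(greedy_action(Q, "s0", actions))  # prints a0 or a1, chosen at random
```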

Q values
Q(s,a) are known as Q-values, and are related to the utility of state s as shown in equations (1) and (2) below.
From (1) and (2) we obtain a constraint between the Q value in state s and the Q values of the states reachable from s by doing a.
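The equations referred to as (1) and (2) appear as images in the original slides; a standard reconstruction, assuming the R(s) reward convention used earlier, is:

```latex
% (1) The utility of a state is the value of its best action
U(s) = \max_{a} Q(s,a)

% (2) The Q-value of (s,a): immediate reward plus discounted expected utility of the successor
Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s,a)\, U(s')

% Combining (1) and (2): constraint between Q(s,a) and the Q-values of reachable states
Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a')
```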

Learning the Q values
Can we exploit the relation between Q values in "adjacent" states?
A. Yes  B. No
No, because we don't know the transition probabilities P(s'|s,a) and the reward function.
We'll use a different approach that relies on the notion of Temporal Difference (TD).

Average Through Time (one of the main elements of Q-learning)
Suppose we have a sequence of values (your sample data): v_1, v_2, ..., v_k, and we want a running approximation of their expected value (e.g., given a sequence of grades, estimate the expected value of the next grade).
A reasonable estimate is the average of the first k values:
A_k = (v_1 + v_2 + ... + v_k) / k

Average Through Time (continued)
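The algebra on this slide is an image in the transcript; rewriting the running average in the incremental form used on the next slide (a standard derivation, with α_k = 1/k) gives:

```latex
A_k = \frac{v_1 + \dots + v_k}{k}
\quad\Rightarrow\quad
k\,A_k = v_1 + \dots + v_{k-1} + v_k = (k-1)\,A_{k-1} + v_k

A_k = \Bigl(1 - \tfrac{1}{k}\Bigr) A_{k-1} + \frac{v_k}{k}
    = A_{k-1} + \alpha_k\,(v_k - A_{k-1}),
\qquad \text{with } \alpha_k = \tfrac{1}{k}
```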

Estimate by Temporal Differences
A_k = A_{k-1} + α_k (v_k - A_{k-1})
(v_k - A_{k-1}) is called a temporal difference error, or TD-error: it specifies how different the new value v_k is from the prediction given by the previous running average A_{k-1}.
The new estimate (average) is obtained by updating the previous average by α_k times the TD error.
This intuitively makes sense:
- If the new value v_k is higher than the average at k-1, then the old average was too small; the new average is generated by increasing the old average by a measure proportional to the error.
- Do the opposite if the new v_k is smaller than the average at k-1.
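A minimal sketch of the incremental average written in TD-error form, with α_k = 1/k; the function name and the example sequence are illustrative, not from the slides.

```python
def running_average(values):
    """Incrementally compute the average of a sequence using the TD-error form:
    A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}), with alpha_k = 1/k."""
    A = 0.0
    for k, v in enumerate(values, start=1):
        alpha = 1.0 / k          # decaying learning rate
        A = A + alpha * (v - A)  # update old estimate by alpha times the TD error
    return A

# Example: the incremental form matches the ordinary mean of the sequence.
grades = [80, 90, 70, 100]
print(running_average(grades))    # 85.0
print(sum(grades) / len(grades))  # 85.0 (same result)
```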

Q-learning: General Idea
Learn from the history of interaction with the environment, i.e., a sequence of state-action-rewards <s0, a0, r1, s1, a1, r2, s2, a2, r3, ...>
The history is seen as a sequence of experiences, i.e., tuples <s, a, r, s'>: the agent does action a in state s, receives reward r, and ends up in s'.
These experiences are used to estimate the value of Q(s,a), expressed as
Q(s,a) = r + γ max_a' Q(s',a')

Q-learning: General Idea (continued)
But this is only an approximation: a single experience samples just one successor state s'. The real link between Q(s,a) and Q(s',a') is the constraint derived from (1) and (2) above:
Q(s,a) = R(s) + γ Σ_s' P(s'|s,a) max_a' Q(s',a')

Q-learning: Main Steps
- Store Q[S, A] for every state S and action A in the world
- Start with arbitrary estimates in Q^(0)[S, A]
- Update them by using experiences: each experience <s, a, r, s'> provides one new data point on the actual value of Q[s, a], namely
  r + γ max_a' Q[s', a']
  (which uses the current estimated value of Q[s', a'], where s' is the state the agent arrives at in the current experience)
How do you merge this new value of Q[s,a] with the previous estimate?

Q-learning: Update Step
This is the TD formula applied to Q[s,a]:
Q[s,a] ← Q[s,a] + α_k ( (r + γ max_a' Q[s',a']) - Q[s,a] )
Here Q[s,a] on the right-hand side is the previous estimated value of Q[s,a], (r + γ max_a' Q[s',a']) is the new value for Q[s,a] from the experience <s,a,r,s'>, and the left-hand side is the updated estimated value of Q[s,a].
α_k can actually be a constant, but then you do not have assurance of convergence.
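A minimal sketch of this update as a function, assuming the dictionary-based Q-table from the earlier sketch; the constant alpha is an illustrative simplification and, as noted above, forgoes the convergence assurance of a decaying α_k.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single experience <s, a, r, s_next>."""
    # New data point for Q[s, a] from this experience
    target = r + gamma * max(Q[s_next][a2] for a2 in actions)
    # TD update: move the old estimate toward the target by alpha times the TD error
    Q[s][a] = Q[s][a] + alpha * (target - Q[s][a])
```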

Q-learning: Algorithm
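The algorithm on this slide is an image in the transcript; below is a compact sketch of tabular Q-learning assembled from the pieces above. The environment interface (env.reset, env.step) and the epsilon-greedy action selection are assumptions for illustration; how the agent should choose actions (exploration vs. exploitation) is treated separately in the course.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sketch of tabular Q-learning.
    Assumes a hypothetical environment with env.reset() -> s and
    env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(lambda: defaultdict(float))  # arbitrary initial estimates (here: 0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Select an action: mostly exploit current Q, sometimes explore (assumed scheme)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                best = max(Q[s][a2] for a2 in actions)
                a = random.choice([a2 for a2 in actions if Q[s][a2] == best])
            s_next, r, done = env.step(a)          # one experience <s, a, r, s_next>
            target = r + gamma * max(Q[s_next][a2] for a2 in actions)
            Q[s][a] += alpha * (target - Q[s][a])  # TD update
            s = s_next
    return Q
```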

Learning Goals for Today's Class
You can:
- Describe and criticize search-based approaches to RL
- Motivate Q-learning
- Justify the estimate by temporal differences
- Explain, trace, and implement Q-learning

TODO for Mon
Do Practice Exercise on Reinforcement Learning: Exercise 11.A: Q-learning
http://www.aispace.org/exercises.shtml