Reinforcement Learning in Partially Observable Environments. Michael L. Littman.


Machine Learning RL1: Reinforcement Learning in Partially Observable Environments. Michael L. Littman

Machine Learning RL2: Temporal Difference Learning (1)
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference: Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
Why not two steps? Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
Or n? Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)
Blend all of these: Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ... ]
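
To make the blend concrete, here is a minimal Python sketch (not from the original slides) that computes the λ-return from a recorded trajectory, assuming a tabular Q̂ stored as a dict keyed by (state, action); the function names and data layout are illustrative assumptions.

```python
def n_step_backup(trajectory, t, n, Q, actions, gamma):
    """Q^(n)(s_t, a_t): n discounted rewards plus a bootstrapped max-Q tail.
    trajectory is a list of (state, action, reward) tuples; Q maps
    (state, action) to the current estimate (illustrative layout)."""
    G = sum((gamma ** k) * trajectory[t + k][2] for k in range(n))
    bootstrap_state = trajectory[t + n][0]
    G += (gamma ** n) * max(Q[(bootstrap_state, a)] for a in actions)
    return G


def lambda_return(trajectory, t, Q, actions, gamma, lam):
    """Blend of all n-step backups: (1 - lam) * sum_n lam^(n-1) * Q^(n),
    truncated where the recorded trajectory ends."""
    horizon = len(trajectory) - t - 1          # largest usable n
    blend = sum((lam ** (n - 1)) * n_step_backup(trajectory, t, n, Q, actions, gamma)
                for n in range(1, horizon + 1))
    return (1.0 - lam) * blend
```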

Machine Learning RL3: Temporal Difference Learning (2)
The TD(λ) algorithm uses the above training rule.
– Sometimes converges faster than Q learning.
– Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992].
– Tesauro's TD-Gammon uses this algorithm.
– Bias-variance tradeoff [Kearns & Singh, 2000].
– Implemented using "eligibility traces" [Sutton, 1988].
– Helps overcome non-Markov environments [Loch & Singh, 1998].
Equivalent (recursive) expression: Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
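
The eligibility-trace formulation mentioned above can be sketched as follows; this is an illustrative TD(λ) value-prediction loop with accumulating traces, not the slides' exact algorithm, and the env_step interface is an assumption.

```python
def td_lambda_episode(env_step, V, alpha, gamma, lam, start_state, max_steps=1000):
    """One episode of TD(lambda) with accumulating eligibility traces.
    env_step(state) -> (reward, next_state, done) is an assumed interface;
    V is a dict mapping states to value estimates, updated in place."""
    traces = {s: 0.0 for s in V}      # eligibility trace e(s)
    state = start_state
    for _ in range(max_steps):
        reward, next_state, done = env_step(state)
        target = reward + (0.0 if done else gamma * V[next_state])
        delta = target - V[state]     # one-step TD error
        traces[state] += 1.0          # accumulate trace for the visited state
        for s in V:                   # every state shares the error, scaled by its trace
            V[s] += alpha * delta * traces[s]
            traces[s] *= gamma * lam  # traces decay toward zero
        if done:
            break
        state = next_state
```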

Machine Learning RL4: Non-Markov Examples. Can you solve them?

Machine Learning RL5: Markov Decision Processes
Recall MDP:
– finite set of states S, set of actions A
– at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
– it then receives reward r_t, and the state changes to s_{t+1}
Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
– r_t and s_{t+1} depend only on the current state and action
– the functions δ and r may be nondeterministic
– the functions δ and r are not necessarily known to the agent
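
As a concrete, illustrative reading of this definition, a tabular MDP can be coded along these lines; the class and field names are assumptions, not part of the slides.

```python
import random

class TabularMDP:
    """Minimal MDP interface matching the slide's definition (illustrative).
    delta maps (state, action) to a list of (probability, next_state) pairs,
    and reward maps (state, action) to an immediate reward."""

    def __init__(self, states, actions, delta, reward):
        self.states = states
        self.actions = actions
        self.delta = delta
        self.reward = reward

    def step(self, state, action):
        """Sample r_t and s_{t+1}; both depend only on (s_t, a_t)."""
        r = self.reward[(state, action)]
        probs, next_states = zip(*[(p, s2) for p, s2 in self.delta[(state, action)]])
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        return r, s_next
```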

Machine Learning RL6: Partially Observable MDPs
Same as an MDP, but with an additional observation function Ω that translates the state into what the learner can observe: o_t = Ω(s_t).
Transitions and rewards still depend on the state, but the learner only sees a "shadow" of it.
How can we learn what to do?
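
Building on the previous sketch, a POMDP differs only in what the agent gets back from a step; again the names (including omega for the observation function) are illustrative assumptions.

```python
class TabularPOMDP(TabularMDP):
    """POMDP sketch: identical dynamics, but the agent receives only an
    observation of the hidden state. Here omega is a deterministic dict
    from state to observation symbol (an assumption for simplicity)."""

    def __init__(self, states, actions, delta, reward, omega):
        super().__init__(states, actions, delta, reward)
        self.omega = omega

    def step(self, state, action):
        r, s_next = super().step(state, action)  # the true state still drives dynamics
        o_next = self.omega[s_next]              # ...but only its "shadow" is observed
        return r, o_next, s_next                 # s_next returned only for simulation
```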

Machine Learning RL7: State Approaches to POMDPs
Candidate notions of "state" to feed Q learning (dynamic programming):
– observations
– short histories
– learn a POMDP model: most likely state
– learn a POMDP model: information state
– learn a predictive model: predictive state
– experience as state
Advantages, disadvantages?

Machine Learning RL8: Learning a POMDP
Input: a history (action-observation sequence). Output: a POMDP that "explains" the data.
EM, an iterative algorithm (Baum et al., 1970; Chrisman, 1992), alternates between state occupation probabilities and the POMDP model:
– E step: forward-backward (compute state occupation probabilities given the current model)
– M step: fractional counting (re-estimate the model from those probabilities)
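
A minimal sketch of the E step (forward-backward) for a discrete POMDP given one recorded action-observation history; the array layout, variable names, and per-step normalization are assumptions, and full numerical-stability handling is omitted.

```python
import numpy as np

def forward_backward(actions_seq, obs_seq, T, O, b0):
    """E step: smoothed state-occupation probabilities for one history.

    T[a] is an |S| x |S| transition matrix for action a, O[a] is an
    |S| x |observations| matrix giving the probability of each observation
    after taking a and landing in each state, and b0 is the initial state
    distribution. This layout is an illustrative assumption."""
    n, S = len(obs_seq), len(b0)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))

    # Forward pass: alpha[t] is proportional to P(s_t | o_1..o_t, actions).
    prior = np.asarray(b0, dtype=float)
    for t, (a, o) in enumerate(zip(actions_seq, obs_seq)):
        pred = prior @ T[a]               # predicted state distribution
        alpha[t] = pred * O[a][:, o]      # weight by observation likelihood
        prior = alpha[t] / alpha[t].sum() # rescale to avoid underflow

    # Backward pass: beta[t, s] is proportional to P(o_{t+1}..o_n | s_t = s, actions).
    for t in range(n - 2, -1, -1):
        a_next, o_next = actions_seq[t + 1], obs_seq[t + 1]
        beta[t] = T[a_next] @ (O[a_next][:, o_next] * beta[t + 1])

    occupation = alpha * beta             # unnormalized smoothed posteriors
    return occupation / occupation.sum(axis=1, keepdims=True)
```

The M step (not shown) would re-estimate T and O by fractional counting of transitions and observations weighted by these occupation probabilities.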

Machine Learning RL9: EM Pitfalls
– Each iteration increases the data likelihood, but EM can get stuck in local maxima (Shatkay & Kaelbling, 1997; Nikovski, 1999).
– It rarely learns a good model in practice.
– The hidden states are truly unobservable.

Machine Learning RL10: Information State
Assumes: an objective reality and a known "map" (model).
Also called a belief state: it represents the agent's location as a vector of probabilities, one for each state.
Easily updated if the model is known. (Example.)
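
The belief-state update the slide alludes to is a one-line Bayes filter: b'(s') ∝ O(o | a, s') Σ_s T(s' | s, a) b(s). A sketch using the same assumed numpy array layout as in the forward-backward example:

```python
def belief_update(belief, action, observation, T, O):
    """Bayes-filter update of an information (belief) state.

    Reuses the assumed layout from the forward-backward sketch: T[a] is an
    |S| x |S| transition matrix and O[a][:, o] gives the probability of
    observation o in each successor state."""
    predicted = belief @ T[action]                    # push the belief through the dynamics
    updated = predicted * O[action][:, observation]   # weight by observation likelihood
    return updated / updated.sum()                    # renormalize to a probability vector
```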

Machine Learning RL11: Plan with Information States
Now the learner is 50% here and 50% there instead of in any particular state.
– Good news: Markov in these vectors.
– Bad news: the states are continuous.
– Good news: can be solved.
– Bad news: ...slowly.
– More bad news: the model is approximate!

Machine Learning RL12: Predictions as State
Idea: key information can come from the distant past, but is never needed too far in the future. (Littman et al., 2002)
Example from the slide's figure: start at blue: down red (left red) odd up __?
history: forget / up blue / left red / up not blue / left not red / up blue / left not red
predict: up blue? / left red? / up not blue / left red

Machine Learning RL13: Experience as State
Nearest sequence memory (McCallum, 1995): relate the current episode to past experience.
The k longest matching histories are considered to be the same for purposes of estimating value and updating.
Current work: extend TD(λ), extend the notion of similarity (allow for soft matches, sensors).
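
A toy sketch of the matching step in nearest sequence memory: score each past time step by the length of the action-observation suffix it shares with the present, and keep the k best. The names and history layout are assumptions; value estimation and updating are not shown.

```python
def suffix_match_length(history, i, j):
    """Length of the longest common suffix of history[:i+1] and history[:j+1],
    where history is a list of (action, observation, reward) tuples."""
    length = 0
    while i - length >= 0 and j - length >= 0 and history[i - length] == history[j - length]:
        length += 1
    return length


def k_nearest_sequence_neighbors(history, k):
    """Indices of the k past time steps whose preceding experience best
    matches the present: the core matching step of nearest sequence memory."""
    now = len(history) - 1
    scored = [(suffix_match_length(history, now, j), j) for j in range(now)]
    scored.sort(reverse=True)        # longest matches first
    return [j for _, j in scored[:k]]
```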

Machine Learning RL14: Classification Dialog (Keim & Littman, 1999)
Does the user want to travel to Roma, Torino, or Merino?
States: S_R, S_T, S_M, done. Transitions to done.
Actions:
– QC (What city?)
– QR, QT, QM (Going to X?)
– R, T, M (I think X.)
Observations:
– yes, no (more reliable), R, T, M (T/M confusable)
Objective:
– reward for correct classification, cost for asking questions
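
One possible plain-data encoding of this dialog POMDP is shown below; every numeric value is a hypothetical placeholder, since the slides do not give the actual rewards, costs, or confusion probabilities.

```python
# Hypothetical encoding of the Keim & Littman dialog POMDP; all numbers
# are placeholder assumptions, not values from the slides.
dialog_pomdp = {
    "states": ["S_R", "S_T", "S_M", "done"],
    "actions": ["QC", "QR", "QT", "QM", "R", "T", "M"],
    "observations": ["yes", "no", "R", "T", "M"],
    # Classification actions end the dialog (transition to done).
    "ends_dialog": {"R", "T", "M"},
    # Placeholder objective: reward for the correct class, cost per question.
    "reward_correct": 10.0,
    "reward_wrong": -10.0,
    "question_cost": -1.0,
    # Placeholder observation noise: yes/no answers are more reliable than
    # city names, and T/M are confusable with each other.
    "obs_noise": {"yes_no": 0.05, "city_name": 0.2, "T_M_confusion": 0.3},
}
```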

Machine Learning RL15: Incremental Pruning Output
The optimal plan varies with the priors (shown with S_R = S_M). (Figure: plot over the belief simplex with corners S_T, S_R, S_M.)

Machine Learning RL16: S_T = 0.00

Machine Learning RL17: S_T = 0.02

Machine Learning RL18: S_T = 0.22

Machine Learning RL19: S_T = 0.76

Machine Learning RL20: S_T = 0.90

Machine Learning RL21: Wrap Up
Reinforcement learning: get the right answer without being told.
Hard, and less developed than supervised learning.
Lecture slides on the web: ml02-rl1.ppt, ml02-rl2.ppt