
1 Reinforcement Learning in Partially Observable Environments
Michael L. Littman

2 Temporal Difference Learning (1)
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
  Q^(1)(s_t, a_t) = r_t + γ max_a Q̂(s_{t+1}, a)
Why not two steps?
  Q^(2)(s_t, a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
Or n?
  Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + ⋯ + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)
Blend all of these:
  Q^λ(s_t, a_t) = (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ⋯ ]

3 Temporal Difference Learning (2)
The TD(λ) algorithm uses the above training rule.
- Sometimes converges faster than Q learning.
- Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992].
- Tesauro's TD-Gammon uses this algorithm.
- Bias-variance tradeoff [Kearns & Singh, 2000].
- Implemented using "eligibility traces" [Sutton, 1988].
- Helps overcome non-Markov environments [Loch & Singh, 1998].
Equivalent recursive expression:
  Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
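A minimal sketch of how this blend of n-step returns is implemented with eligibility traces, here for state values V rather than Q values; the env.reset()/env.step() interface and the fixed policy argument are assumptions of this sketch, not part of the slides.

```python
import collections

def td_lambda(env, policy, gamma=0.9, lam=0.8, alpha=0.1, episodes=500):
    """Tabular TD(lambda) with accumulating eligibility traces.

    Assumed interface: env.reset() -> state, env.step(action) ->
    (next_state, reward, done), and policy(state) -> action.
    """
    V = collections.defaultdict(float)           # value estimates
    for _ in range(episodes):
        e = collections.defaultdict(float)       # eligibility traces
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # One-step TD error; the bootstrap term is 0 at terminal states.
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                           # accumulate trace for s
            # Every recently visited state shares the update, which
            # effectively blends 1-step, 2-step, ..., n-step returns.
            for state in list(e):
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam           # decay traces by gamma * lambda
            s = s_next
    return V
```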

4 Non-Markov Examples
Can you solve them?

5 Markov Decision Processes
Recall MDP:
- finite set of states S, set of actions A
- at each discrete time t, the agent observes state s_t ∈ S and chooses action a_t ∈ A
- it receives reward r_t, and the state changes to s_{t+1}
Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
- r_t and s_{t+1} depend only on the current state and action
- functions δ and r may be nondeterministic
- functions δ and r are not necessarily known to the agent
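As a small illustration of this interaction, a sketch assuming hypothetical delta, reward, and policy callables (none are given on the slide):

```python
def run_mdp(delta, reward, policy, s0, steps=10):
    """Roll out the agent-environment interaction of an MDP.

    delta(s, a) and reward(s, a) stand in for the (possibly
    nondeterministic, possibly unknown) functions on the slide.
    """
    s = s0
    trajectory = []
    for t in range(steps):
        a = policy(s)           # agent observes s_t and chooses a_t
        r = reward(s, a)        # receives reward r_t
        s_next = delta(s, a)    # state changes to s_{t+1}
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```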

6 Partially Observable MDPs
Same as an MDP, but with an additional observation function ω that translates the state into what the learner can observe: o_t = ω(s_t)
Transitions and rewards still depend on the state, but the learner only sees a "shadow" of it.
How can we learn what to do?
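Continuing the hypothetical loop above, a sketch of what changes under partial observability; the name obs stands in for the observation function and, like the rest of the interface, is an assumption of this sketch:

```python
def run_pomdp(delta, reward, obs, policy, s0, steps=10):
    """Same interaction loop, but the policy sees only o_t = obs(s_t)."""
    s = s0
    trajectory = []
    for t in range(steps):
        o = obs(s)              # the "shadow" the learner actually sees
        a = policy(o)           # the action may depend only on observations
        r = reward(s, a)        # reward still depends on the hidden state
        s = delta(s, a)         # and so do transitions
        trajectory.append((o, a, r))
    return trajectory
```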

7 State Approaches to POMDPs
Candidate "states" for Q learning (dynamic programming):
- observations
- short histories
- learn a POMDP model: most likely state
- learn a POMDP model: information state
- learn a predictive model: predictive state
- experience as state
Advantages, disadvantages?

8 Learning a POMDP
Input: history (action-observation sequence).
Output: a POMDP that "explains" the data.
EM, an iterative algorithm (Baum et al. 70; Chrisman 92), alternates:
- E step (forward-backward): compute state occupation probabilities from the current POMDP model.
- M step (fractional counting): re-estimate the POMDP model from those probabilities.
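A compact sketch of this EM loop, simplified to a plain hidden Markov model (actions and rewards dropped, per-step scaling added for numerical stability). It is not Chrisman's full POMDP learner, only an illustration of the E (forward-backward) and M (fractional counting) alternation.

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, iters=50, seed=0):
    """EM for an HMM: E step = forward-backward, M step = fractional counts."""
    obs = np.asarray(obs)                       # observation symbols in [0, n_symbols)
    rng = np.random.default_rng(seed)
    T = rng.dirichlet(np.ones(n_states), size=n_states)    # transitions T[i, j]
    O = rng.dirichlet(np.ones(n_symbols), size=n_states)   # emissions  O[i, k]
    pi = rng.dirichlet(np.ones(n_states))                   # initial distribution
    n = len(obs)

    for _ in range(iters):
        # --- E step: scaled forward-backward ---
        alpha = np.zeros((n, n_states))
        beta = np.zeros((n, n_states))
        scale = np.zeros(n)
        alpha[0] = pi * O[:, obs[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, n):
            alpha[t] = (alpha[t - 1] @ T) * O[:, obs[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]
        beta[-1] = 1.0
        for t in range(n - 2, -1, -1):
            beta[t] = (T @ (O[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        gamma = alpha * beta                    # state occupation probabilities
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Expected transition counts xi[t, i, j] for each step t.
        xi = (alpha[:-1, :, None] * T[None, :, :] *
              (O[:, obs[1:]].T * beta[1:])[:, None, :])
        xi /= xi.sum(axis=(1, 2), keepdims=True)

        # --- M step: fractional counting ---
        pi = gamma[0]
        T = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            O[:, k] = gamma[obs == k].sum(axis=0)
        O /= gamma.sum(axis=0)[:, None]
    return pi, T, O
```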

9 EM Pitfalls
- Each iteration increases data likelihood, yet EM converges only to local maxima (Shatkay & Kaelbling 97; Nikovski 99).
- It rarely learns a good model in practice.
- Hidden states are truly unobservable.

10 Information State
Assumes an objective reality and a known "map" (model).
Also called a belief state: it represents the agent's location as a vector of probabilities, one for each state.
Easily updated if the model is known (example: see the sketch below).
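A minimal sketch of that update (assuming known transition and observation matrices; the array layout below is chosen for illustration, not taken from the slides):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update of a belief state after taking action a and observing o.

    b[s]        -- current probability of each hidden state
    T[a][s, s'] -- transition probabilities P(s' | s, a)
    O[a][s', o] -- observation probabilities P(o | s', a)
    """
    predicted = b @ T[a]                 # push the belief through the dynamics
    b_next = predicted * O[a][:, o]      # weight by likelihood of the observation
    return b_next / b_next.sum()         # renormalize to a probability vector
```

Starting from a uniform belief, repeated calls to belief_update produce exactly the "50% here, 50% there" vectors that the next slide plans over.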

11 Plan with Information States
Now the learner is "50% here and 50% there" instead of in any particular state.
- Good news: the process is Markov in these belief vectors.
- Bad news: the state space is now continuous.
- Good news: it can still be solved.
- Bad news: ...slowly.
- More bad news: the model is approximate!

12 Predictions as State
Idea: key information from the distant past can be captured by predictions that never look too far into the future (Littman et al. 02).
[Worked example on the slide: from a history of moves (up, down, left) and color observations (blue, red), the agent keeps predictions such as "if I go up, will I see blue?" and "if I go left, will I see red?", while forgetting the raw history itself.]
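For concreteness, a sketch of how a linear predictive state representation maintains such a prediction vector after each action-observation pair; the linear form and the variable names are assumptions of this sketch, since the slide only presents the idea.

```python
import numpy as np

def psr_update(p, a, o, M, m):
    """One step of a linear PSR update.

    p          -- current prediction vector: probabilities of the core tests
    M[(a, o)]  -- matrix whose column j gives the weights for core test j
                  extended by action a and observation o
    m[(a, o)]  -- weight vector for the one-step test "do a, see o"
    """
    p_ao = p @ m[(a, o)]                 # predicted probability of seeing o after a
    p_next = (p @ M[(a, o)]) / p_ao      # condition every core test on (a, o)
    return p_next
```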

13 Experience as State
Nearest sequence memory (McCallum 1995): relate the current episode to past experience.
The k longest-matching past sequences are treated as the same state for purposes of estimating value and updating.
Current work: extend TD(λ); extend the notion of similarity (allow soft matches, sensors).
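A rough sketch of the matching step, assuming experience is stored as a list of (observation, action, reward) tuples; the value-estimation details of McCallum's algorithm are omitted.

```python
def suffix_match_length(history, t1, t2):
    """Length of the common suffix of history[:t1 + 1] and history[:t2 + 1]."""
    n = 0
    while t1 - n >= 0 and t2 - n >= 0 and history[t1 - n] == history[t2 - n]:
        n += 1
    return n

def nearest_sequence_neighbors(history, k):
    """Indices of the k past time steps whose preceding experience best
    matches the present (longest common suffix), as in nearest sequence
    memory; their value estimates are then pooled and updated together.
    """
    t_now = len(history) - 1
    scores = [(suffix_match_length(history, t_now, t), t)
              for t in range(t_now)]
    scores.sort(reverse=True)            # longest matches first
    return [t for _, t in scores[:k]]
```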

14 Classification Dialog (Keim & Littman 99)
Does the user want to travel to Roma, Torino, or Merino?
States: S_R, S_T, S_M, done. Transitions go to done.
Actions:
- QC ("What city?")
- QR, QT, QM ("Going to X?")
- R, T, M ("I think X")
Observations:
- yes, no (more reliable); R, T, M (T/M confusable)
Objective:
- reward for a correct classification, cost for each question
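To make the model concrete, a structural skeleton of this POMDP; the state, action, and observation sets come from the slide, but the numeric reward and cost values (and the zero payoff for a wrong guess) are placeholders, since the slide gives none.

```python
# Structural skeleton of the classification-dialog POMDP.
# Numeric values are illustrative placeholders, not from Keim & Littman (1999).

STATES       = ["S_R", "S_T", "S_M", "done"]
ACTIONS      = ["QC", "QR", "QT", "QM",      # questions
                "R", "T", "M"]               # classification guesses
OBSERVATIONS = ["yes", "no", "R", "T", "M"]  # yes/no more reliable; T/M confusable

CORRECT_REWARD = 1.0     # placeholder reward for guessing the right city
QUESTION_COST  = -0.1    # placeholder cost of asking any question

def reward(state, action):
    """Reward structure only: questions cost, a correct guess pays off."""
    if action in ("QC", "QR", "QT", "QM"):
        return QUESTION_COST
    # Guess actions: compare the guess against the hidden city;
    # a wrong guess is left at 0 here for lack of a value on the slide.
    return CORRECT_REWARD if state == "S_" + action else 0.0
```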

15 Incremental Pruning Output
The optimal plan varies with the priors (here S_R = S_M).
[Figure, labeled S_T, S_R, S_M.]

16 S_T = 0.00

17 S_T = 0.02

18 S_T = 0.22

19 S_T = 0.76

20 S_T = 0.90

21 Wrap Up
Reinforcement learning: get the right answer without being told.
Hard, and less developed than supervised learning.
Lecture slides on the web:
http://www.cs.rutgers.edu/~mlittman/talks/
  ml02-rl1.ppt
  ml02-rl2.ppt

