© Daniel S. Weld 1 Logistics
- No reading
- First tournament Wed; details TBA

© Daniel S. Weld 2 Topics
- Agency
- Problem Spaces
- Search
- Knowledge Representation
- Planning
- Perception
- NLP
- Multi-agent
- Robotics
- Uncertainty
- MDPs
- Supervised Learning
- Reinforcement Learning

© Daniel S. Weld 3 Review: MDPs
- S = set of states (|S| = n)
- A = set of actions (|A| = m)
- Pr = transition function Pr(s, a, s'), represented by a set of m stochastic n x n matrices; each row defines a distribution over successor states
- R(s) = bounded, real-valued reward function, represented by an n-vector
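As a concrete illustration of this representation, here is a minimal Python sketch with made-up numbers (n = 3 states, m = 2 actions); the values are purely illustrative and not from the slides.

import numpy as np

n, m = 3, 2  # illustrative sizes: 3 states, 2 actions

# Pr: one n x n stochastic matrix per action; row s is a distribution over successor states.
Pr = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],   # action 1
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])

# R: bounded, real-valued reward function represented as an n-vector.
R = np.array([0.0, -0.04, 1.0])

assert Pr.shape == (m, n, n)
assert np.allclose(Pr.sum(axis=2), 1.0)   # every row sums to 1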

© Daniel S. Weld 4 Reinforcement Learning

© Daniel S. Weld 5 How is learning to act possible when…
- Actions have non-deterministic effects, and are initially unknown
- Rewards / punishments are infrequent, often at the end of long sequences of actions
- The learner must decide what actions to take
- The world is large and complex

© Daniel S. Weld 6 RL Techniques
- Temporal-difference learning
  - Learns a utility function on states or on (state, action) pairs
  - Similar to backpropagation: treats the difference between expected and actual reward as an error signal that is propagated backward in time
- Exploration functions
  - Balance exploration vs. exploitation
- Function approximation (generalization)
  - Compress a large state space into a small one
  - Linear function approximation, neural nets, …

© Daniel S. Weld 7 Example:

© Daniel S. Weld 8 Passive RL
- Given a policy π, estimate U^π(s)
- Not given the transition matrix, nor the reward function!
- Epochs: training sequences, e.g.
  (1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (1,2) → (2,2) → (3,2)  -1
  (1,1) → (1,2) → (1,3) → (2,3) → (2,2) → (2,3) → (3,3)  +1
  (1,1) → (1,2) → (1,1) → (1,2) → (1,1) → (2,1) → (2,2) → (2,3) → (3,3)  +1
  (1,1) → (1,2) → (2,2) → (1,2) → (1,3) → (2,3) → (1,3) → (2,3) → (3,3)  +1
  (1,1) → (2,1) → (2,2) → (2,1) → (1,1) → (1,2) → (1,3) → (2,3) → (2,2) → (3,2)  -1
  (1,1) → (2,1) → (1,1) → (1,2) → (2,2) → (3,2)  -1

© Daniel S. Weld 9 Value Function

© Daniel S. Weld 10 Approach 1: Direct estimation
- Estimate U^π(s) as the average total reward of epochs containing s (calculated from s to the end of the epoch)
- Pros / Cons?
  - Requires a huge amount of data
  - Doesn't exploit the Bellman constraints: expected utility of a state = its own reward + expected utility of its successors
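A minimal sketch of direct estimation, assuming each epoch is given as a list of (state, reward) pairs; the function name and data layout are illustrative, not from the slides.

from collections import defaultdict

def direct_utility_estimate(epochs, gamma=1.0):
    """Estimate U(s) under the fixed policy as the average observed reward-to-go
    from each visit of s to the end of its epoch."""
    returns = defaultdict(list)
    for epoch in epochs:
        reward_to_go = 0.0
        # walk the epoch backwards so each visited state sees its total future reward
        for state, reward in reversed(epoch):
            reward_to_go = reward + gamma * reward_to_go
            returns[state].append(reward_to_go)
    return {s: sum(g) / len(g) for s, g in returns.items()}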

© Daniel S. Weld 11 Approach 2: Adaptive Dynamic Programming (ADP)
- Requires a fully observable environment
- Estimate the transition function M from training data
- Solve the Bellman equations with modified policy iteration
- Pros / Cons:
  - Requires complete observations
  - We don't usually need the value of all states
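A rough sketch of the ADP idea for a fixed policy: count observed transitions to estimate the model, then solve the resulting Bellman equations. For brevity this uses simple iterative policy evaluation rather than full modified policy iteration; all names and the data layout are assumptions.

from collections import defaultdict

def adp_policy_evaluation(epochs, states, gamma=0.9, sweeps=200):
    """Each epoch is a list of (state, reward) pairs produced by the fixed policy."""
    counts = defaultdict(lambda: defaultdict(int))
    R = {}
    for epoch in epochs:
        for (s, r), (s_next, _) in zip(epoch, epoch[1:]):
            counts[s][s_next] += 1   # tally observed transitions under the policy
            R[s] = r
        last_state, last_reward = epoch[-1]
        R[last_state] = last_reward

    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            total = sum(counts[s].values())
            if total == 0:           # terminal or never-visited state
                U[s] = R.get(s, 0.0)
                continue
            # Bellman equation for the fixed policy, using estimated probabilities c / total
            U[s] = R.get(s, 0.0) + gamma * sum(
                (c / total) * U[s_next] for s_next, c in counts[s].items())
    return U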

© Daniel S. Weld 12 Approach 3: Temporal Difference (TD) Learning
- Do backups on a per-action basis
- Don't try to estimate the entire transition function!
- For each observed transition from s to s', update:
  U^π(s) ← U^π(s) + α [ R(s) + γ U^π(s') - U^π(s) ]
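The update above translates almost directly into code; a minimal sketch, with illustrative values for α and γ:

def td0_update(U, s, reward, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup for an observed transition s -> s_next."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
    return U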

© Daniel S. Weld 13 Notes
- Once U is learned, the updates become 0: "TD(0)"
- One-step lookahead; can also do 2-step, 3-step, …

© Daniel S. Weld 14 TD(λ)
- Or… take it to the limit!
- Compute a weighted average of all future states; the backup becomes a weighted average
- Implementation
  - Propagate the current weighted TD error onto past states
  - Must memorize the states visited from the start of the epoch

© Daniel S. Weld 15 TD(λ)
- The TD rule just described does 1-step lookahead: called TD(0)
- Could also look ahead 2 steps = TD(1), 3 steps = TD(2), …
- TD(λ): compute a weighted average of all future states
- Implementation
  - Propagate the current weighted temporal difference onto past states
  - Requires memorizing the states visited from the beginning of the epoch (or from the point where the weight λ^i is not too tiny)
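One common way to realize this weighting is the backward view with eligibility traces, which decays each visited state's weight by γλ per step; a rough sketch, not necessarily the exact formulation used in the lecture:

def td_lambda_epoch(U, transitions, alpha=0.1, gamma=0.9, lam=0.7):
    """Backward-view TD(lambda) over one epoch.
    transitions is a list of (s, reward, s_next) tuples observed in order."""
    eligibility = {}
    for s, reward, s_next in transitions:
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        delta = reward + gamma * U[s_next] - U[s]     # current temporal difference
        eligibility[s] = eligibility.get(s, 0.0) + 1.0
        for state, e in list(eligibility.items()):
            U[state] += alpha * delta * e             # propagate onto past states
            eligibility[state] = gamma * lam * e      # older states get exponentially smaller weight
    return U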

© Daniel S. Weld 16 Notes II
- Online: update immediately after each action
  - Works even if epochs are infinitely long
- Offline: wait until the end of an epoch
  - Can perform updates in any order, e.g. backwards in time
  - (Converges faster if rewards come only at the epoch end.) Why?!?
  - Prioritized sweeping heuristic

© Daniel S. Weld 17 Q-Learning
- A version of TD learning where, instead of learning a value function on states, we learn one on (state, action) pairs
- Why do this?
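A minimal sketch of the tabular Q-learning backup; the point is that no transition model is needed, and a greedy policy can be read directly off Q. Names and constants are illustrative.

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning backup for the observed (s, a, reward, s_next) experience."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q

def greedy_action(Q, s, actions):
    """The policy implied by Q: in state s, pick the action with the highest Q value."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))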

© Daniel S. Weld 18 Part II
- So far, we have assumed the agent had a policy
- Now suppose the agent must learn the policy, while acting in an uncertain world

© Daniel S. Weld 19 Active Reinforcement Learning
- Suppose the agent must make a policy while learning
- First approach:
  - Start with an arbitrary policy
  - Apply Q-learning
  - New policy: in state s, choose the action a that maximizes Q(a, s)
- Problem?

© Daniel S. Weld 20 Utility of Exploration
- Too easily stuck in a non-optimal space: the "exploration versus exploitation tradeoff"
- Solution 1: with fixed probability, perform a random action
- Solution 2: increase the estimated expected value of infrequently tried states
  U+(s) ← R(s) + γ max_a f( Σ_s' P(s' | a, s) U+(s'), N(a, s) )
- Properties of f(u, n)?
  - If n > N_e, return u (i.e. the normal utility)
  - Else, return R+ (i.e. the maximum possible reward)
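The f(u, n) described above is a one-liner; a sketch with assumed constants R_plus (an upper bound on reward) and N_e (the visit threshold):

def exploration_f(u, n, R_plus=1.0, N_e=5):
    """Optimistic utility: pretend actions tried at most N_e times lead to the best possible reward."""
    return u if n > N_e else R_plus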

© Daniel S. Weld 21 Part III
- The problem of large state spaces remains
  - Never enough training data!
  - Learning takes too long
- What to do??

© Daniel S. Weld 22 Function Approximation
- Never enough training data!
  - Must generalize what is learned to new situations
- Idea: replace the large state table with a smaller, parameterized function
  - Updating the value of one state will change the values assigned to many other, similar states

© Daniel S. Weld 23 Linear Function Approximation
- Represent U(s) as a weighted sum of features (basis functions) of s:
  U_θ(s) = θ_1 f_1(s) + θ_2 f_2(s) + … + θ_n f_n(s)
- Update each parameter separately, e.g. with the TD update:
  θ_i ← θ_i + α [ R(s) + γ U_θ(s') - U_θ(s) ] ∂U_θ(s)/∂θ_i
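For a linear approximator the partial derivative of U_θ(s) with respect to θ_i is simply the i-th feature value, so the per-parameter update collapses to a short vector operation. A sketch; the feature function and constants are assumptions.

import numpy as np

def linear_td_update(theta, features, s, reward, s_next, alpha=0.01, gamma=0.9):
    """TD update for U_theta(s) = theta . features(s)."""
    theta = np.asarray(theta, dtype=float)
    f_s = np.asarray(features(s), dtype=float)
    f_next = np.asarray(features(s_next), dtype=float)
    delta = reward + gamma * (theta @ f_next) - (theta @ f_s)   # temporal-difference error
    return theta + alpha * delta * f_s                          # gradient of U wrt theta is f_s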

© Daniel S. Weld 24 Example
- U(s) = θ_0 + θ_1 x + θ_2 y
- Learns a good approximation

© Daniel S. Weld 25 But What If…
- U(s) = θ_0 + θ_1 x + θ_2 y + θ_3 z
- Computed feature: z = √((x_g - x)² + (y_g - y)²), the distance to the goal
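A sketch of such a computed feature vector, assuming states are (x, y) grid coordinates and (x_g, y_g) is the goal cell:

import math

def features(state, x_g, y_g):
    """Feature vector [1, x, y, z]; z is the straight-line distance to the goal."""
    x, y = state
    z = math.sqrt((x_g - x) ** 2 + (y_g - y) ** 2)
    return [1.0, x, y, z]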

© Daniel S. Weld 26 Neural Nets
- Can create powerful function approximators
  - Nonlinear
  - Possibly unstable
- For TD learning, apply the difference signal to the neural net's output and perform backpropagation

© Daniel S. Weld 27 Policy Search
- Represent the policy in terms of Q functions
- Gradient search
  - Requires differentiability
  - Stochastic policies; softmax
- Hillclimbing
  - Tweak the policy and evaluate it by running it
- Replaying experience

© Daniel S. Weld 28 ~World's Best Player
- Neural network with 80 hidden units
- Used computed features
- 300,000 games against self

© Daniel S. Weld 29 Pole Balancing Demo

© Daniel S. Weld 30 Applications to the Web: Focused Crawling
- Limited resources: fetch the most important pages first
- Topic-specific search engines: only want pages that are relevant to the topic
- Minimize stale pages: efficient re-fetching to keep the index timely
  - How do we track the rate of change for pages?

© Daniel S. Weld 31 Standard Web Search Engine Architecture
- Crawl the web; store documents, check for duplicates, extract links
- Create an inverted index (DocIds)
- Search engine servers answer user queries and show results to the user
(Slide adapted from Marty Hearst / UC Berkeley)

© Daniel S. Weld 32 RL for Focused Crawling
- Huge, unknown environment
- Large, dynamic number of "actions"
- Consumable rewards
- Questions:
  - How do we represent states and actions?
  - What is the right reward? Discount factor?
  - How do we represent Q functions?
  - Dynamic programming on a state space the size of the WWW?

© Daniel S. Weld 33 Rennie & McCallum (ICML-99)
- Disregard state; action == bag of words in the link neighborhood
- Q function:
  - Train offline on a crawled corpus
  - Compute the reward as f(action), with γ = 0.5
  - Discretize rewards into bins
  - Learn the Q function directly (supervised learning):
    - Add the words in the link neighborhood to the bin's bag
    - Train a naïve Bayes classifier on the bins

© Daniel S. Weld 34 Performance

© Daniel S. Weld 35 Zoom on Performance

© Daniel S. Weld 36 Applications to the Web II: Improving PageRank

© Daniel S. Weld 37 Methods
- Agent types
  - Utility-based
  - Action-value based (Q function)
  - Reflex
- Passive learning
  - Direct utility estimation
  - Adaptive dynamic programming
  - Temporal difference learning
- Active learning
  - Choose a random action 1/n-th of the time
  - Exploration by adding to the utility function
  - Q-learning (learn the action-value function directly; model free)
- Generalization
  - ADDs
  - Function approximation (linear functions or neural networks)
- Policy search
  - Stochastic policy representation / softmax
  - Reusing past experience

© Daniel S. Weld 38 Summary
- Use reinforcement learning when the model of the world is unknown and/or rewards are delayed
- Temporal difference learning: a simple and efficient training rule
- Q-learning eliminates the need for an explicit transition model
- Large state spaces can (sometimes!) be handled by function approximation, using linear functions or neural nets