
1 CPSC 7373: Artificial Intelligence Lecture 11: Reinforcement Learning Jiang Bian, Fall 2012 University of Arkansas at Little Rock

2 Reinforcement Learning
In MDPs, we learned how to determine an optimal sequence of actions for an agent in a stochastic environment.
– An agent that knows the correct model of the environment can navigate it, finding its way to the positive rewards and avoiding the negative penalties.
Reinforcement learning can guide the agent to an optimal policy even though it knows nothing about the rewards when it starts out.

3 Reinforcement Learning
(Grid-world figure: columns 1–4, rows a–c, with a +100 reward, a -100 penalty, and a START square.)
What if we don’t know where the +100 and -100 rewards are when we start?
A reinforcement learning agent can learn to explore the territory, find where the rewards are, and then learn an optimal policy.
An MDP solver can only do that once it knows exactly where the rewards are.

4 RL Example
Backgammon is a stochastic game. In the 1990s, Gerald Tesauro at IBM wrote a program to play backgammon.
– Approach #1: try to learn the utility of a game state, using examples labeled by human expert backgammon players.
Only a small number of states were labeled.
The program tried to generalize from the labels using supervised learning.
– Approach #2: no human expertise and no supervision.
One copy of the program played against another; at the end of each game, the winner got a positive reward and the loser a negative one.
It performed at the level of the very best players in the world, learning from about 200,000 games.

5 Forms of Learning
Supervised:
– (x1, y1), (x2, y2), … -> learn y = f(x)
Unsupervised:
– x1, x2, … -> learn P(X = x)
Reinforcement:
– s, a, s, a, …; r -> learn the optimal policy: what is the right thing to do in any of the states.

6 Forms of Learning
Examples: classify each as Supervised (S), Unsupervised (U), or Reinforcement (R):
– Speech recognition: examples of voice recordings and the corresponding transcript text for each recording; from these we try to learn a model of language.
– Star data: for each star, a list of all the different emission frequencies of light reaching Earth; we analyze the spectral emissions and try to cluster stars into similar types that may be of interest to astronomers.
– Lever pressing: a rat is trained to press a lever to get a release of food when certain conditions are met.
– Elevator controller: a sequence of button presses and the wait time we are trying to minimize; a bank of elevators in a building needs some policy to decide which elevator goes up and which goes down in response to the percepts, which are the button presses at various floors.

7 MDP Review
Markov Decision Processes:
– List of states: s1, …, sn
– List of actions: a1, …, ak
– State transition matrix: T(s, a, s') = P(s' | s, a)
– Reward function: R(s') or R(s, a, s')
– Finding the optimal policy π(s): in each state, look at all possible actions and choose the one with the highest expected utility, weighting each outcome by its transition probability.
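A minimal Python sketch of extracting the optimal policy from a known transition model P and utility estimates U (the data structures and names here are illustrative assumptions, not from the slides):

```python
# Sketch: pi(s) = the action maximizing the expected utility of the successor state.
# P[(s, a)] is assumed to be a list of (probability, next_state) pairs; U maps state -> utility.

def optimal_policy(states, actions, P, U):
    pi = {}
    for s in states:
        # Pick the action whose expected successor utility is highest.
        pi[s] = max(actions,
                    key=lambda a: sum(p * U[s2] for p, s2 in P[(s, a)]))
    return pi
```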

8 Agents of RL
Problem with MDPs: the reward function R and the transition model P may be unknown.

Agent type           | Know | Learn   | Use
Utility-based agent  | P    | R -> U  | U
Q-learning agent     | -    | Q(s, a) | Q
Reflex agent         | -    | π(s)    | π

Utility-based agent:
– Learns R (given P), and uses P and R to learn the utility function U -> solve as an MDP.
Q-learning agent:
– Learns a utility function Q(s, a) over state-action pairs.
Reflex agent:
– Learns the policy directly (stimulus-response).

9 Passive and Active

Agent type           | Know | Learn   | Use
Utility-based agent  | P    | R -> U  | U
Q-learning agent     | -    | Q(s, a) | Q
Reflex agent         | -    | π(s)    | π

Passive RL: the agent has a fixed policy and executes that policy.
– e.g., your friend is driving from Little Rock to Dallas; you learn R (a shortcut), but you can’t change your friend’s driving behavior (the policy).
Active RL: change the policy as you progress.
– e.g., you take over control of the car and adjust the policy based on what you have learned.
– It also gives you the possibility to explore.

10 Passive Temporal Difference Learning
(Grid-world figure: columns 1–4, rows a–c, with terminal rewards and a START square.)
The agent maintains the policy π, utility estimates U(s), visit counts N(s), and observed rewards r.
– If s' is new then U[s'] <- r'
– If s is not null then
increment Ns[s]
U[s] <- U[s] + α(Ns[s])(r + γU[s'] - U[s])
α(): learning rate (e.g., 1/(N+1))
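A minimal Python sketch of the passive TD update above, applied to one observed transition; the state representation and default discount are assumptions for illustration:

```python
# Passive temporal-difference learning: following a fixed policy, update the
# utility estimate U from one observed transition (s, r) -> (s', r').
# alpha decays with the visit count, e.g. 1/(N+1) as suggested on the slide.

def td_update(U, N, s, r, s_prime, r_prime, gamma=0.9):
    if s_prime not in U:            # "If s' is new then U[s'] <- r'"
        U[s_prime] = r_prime
    if s is not None:               # "If s is not null"
        N[s] = N.get(s, 0) + 1      # increment visit count Ns[s]
        alpha = 1.0 / (N[s] + 1)    # learning rate alpha(Ns[s])
        U[s] = U[s] + alpha * (r + gamma * U[s_prime] - U[s])
    return U, N
```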

11 Passive Agent Results

12 Weakness
True or false?
– Long convergence time?
– Limited by the policy?
– Missing states?
– Poor estimates?
Problem: fixed policy!
(Grid-world figure: columns 1–4, rows a–c, with terminal rewards and a START square.)

13 Active RL: Greedy
π <- π': after each utility update, recompute the new optimal policy.
How should the agent behave? Always choose the action with the highest expected utility?
Exploration vs. exploitation: occasionally try “suboptimal” actions!
– Random? (See the sketch below.)
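One common way to “occasionally try suboptimal actions” is epsilon-greedy selection. This is an illustrative sketch of that idea, not the specific scheme from the lecture:

```python
import random

# Epsilon-greedy action selection: with probability epsilon take a random
# (exploratory) action, otherwise exploit the action the current policy prefers.

def epsilon_greedy(s, actions, pi, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)   # explore
    return pi[s]                        # exploit the current policy
```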

14 Errors in Utility Tracking
The agent maintains π, U(s), and N(s).
Reasons for errors:
– Not enough samples (random fluctuations)
– Not a good policy
Questions:
– Can the errors make U too low?
– Can they make U too high?
– Are they improved with more samples N?

15 Exploration Agents
An exploration agent will:
– be more proactive about exploring the world when it is uncertain, and
– fall back to exploiting the (sub-)optimal policy when it becomes more certain about the world.
If s' is new then U[s'] <- r'
If s is not null then
increment Ns[s]
U[s] <- U[s] + α(Ns[s])(r + γU[s'] - U[s])
Exploration utility estimate:
U(s) = +R (a large optimistic reward) when Ns[s] < e (a visit-count threshold); otherwise use the learned U(s).
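A small sketch of that optimistic exploration estimate: states visited fewer than e times are pretended to have a large reward +R, which pulls the agent toward them. The particular values of big_R and e are assumptions for illustration:

```python
# Optimistic utility used by the exploring agent: return a large reward +R for
# states visited fewer than e times, otherwise the learned estimate U[s].

def exploration_utility(s, U, N, big_R=100.0, e=5):
    if N.get(s, 0) < e:
        return big_R          # optimistic: treat under-explored states as great
    return U[s]               # otherwise exploit the learned utility
```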

16 Exploratory Agent Results

17 Q-Learning
U -> π:
– the policy for each state is determined by the expected utility of each action’s outcomes, which requires the transition model P.
What if P is unknown? -> Q-learning, which learns Q(s, a) directly, so no transition model is needed to choose actions.
(Grid-world figure: learned utility values such as .81, .89, .91, …, .39 over columns 1–4, rows a–c, with +1/-1 terminal states.)

18 Q-learning
Update rule:
Q(s,a) <- Q(s,a) + α(R(s) + γ max_a' Q(s',a') - Q(s,a))
(Grid-world figure: the Q-table initialized to all zeros, one entry per state-action pair, columns 1–4, rows a–c.)
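A minimal Python sketch of one Q-learning update for an observed transition (s, a, r, s'); the dictionary representation of Q and the default learning rate and discount are illustrative assumptions:

```python
# Q is a dict mapping (state, action) -> value; actions is the set of actions.
# One update moves Q(s, a) toward r + gamma * max_a' Q(s', a').

def q_update(Q, s, a, r, s_prime, actions, alpha=0.1, gamma=0.9):
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_prime, a2), 0.0) for a2 in actions)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q
```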

19 Conclusion
If we know P, we can learn R and derive U -> solve as an MDP.
If we don’t know P or R, we can use Q-learning, which uses Q(s, a) as a utility function.
We learned about the trade-off between exploration and exploitation.

