1
Reinforcement Learning. Mitchell, Ch. 13 (see also the Sutton & Barto book, available online)
2
Rationale
–learning from experience; adaptive control
–examples are not explicitly labeled, and feedback is delayed
–problem of credit assignment: which action(s) led to the payoff?
–tradeoff: short-term thinking (immediate reward) vs. long-term consequences
3
Agent Model
–Transition function T: S×A → S (the environment)
–Reward function R: S×A → ℝ (the payoff)
–Stochastic but Markov
–Policy = decision function π: S → A
–"rationality": maximize long-term expected reward
–Discounted long-term reward (a convergent series)
–Alternatives: finite time horizon, uniform weights
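Written out, the discounted long-term reward for a policy π is the standard definition consistent with the slide (γ is the discount factor):

```latex
V^{\pi}(s) \;=\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0}=s,\ \pi\right],
\qquad 0 \le \gamma < 1
```

With bounded rewards and γ < 1 the series converges, which is why the discounted formulation is preferred over uniform weights.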
4
[Slide figure: agent-environment interaction; R and T are supplied by the environment]
5
Markov Decision Processes (MDPs)
–if R and T (= P) are known, solve for the value function V^π(s): policy evaluation
–Bellman equations
–dynamic programming (|S| equations in |S| unknowns)
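For a fixed policy the Bellman equations are linear, so policy evaluation reduces to one linear solve of |S| equations in |S| unknowns. A minimal sketch on a hypothetical 2-state MDP (all numbers invented):

```python
import numpy as np

# Hypothetical 2-state MDP: T[s, a, s'] are transition probabilities,
# R[s, a] are expected rewards, pi[s] is a fixed deterministic policy.
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([0, 1])               # action chosen in each state

# Bellman equations for a fixed policy: V = R_pi + gamma * P_pi V,
# i.e. (I - gamma * P_pi) V = R_pi  -- |S| linear equations in |S| unknowns.
P_pi = T[np.arange(2), pi]          # P_pi[s, s'] = T(s, pi(s), s')
R_pi = R[np.arange(2), pi]
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)
```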
6
Finding optimal policies (MDPs)
–Value iteration: update V(s) iteratively until the greedy policy π(s) = argmax_a [R(s,a) + γ Σ_s' T(s,a,s') V(s')] stops changing
–Policy iteration: alternate between choosing π and updating V over all states
–Monte Carlo sampling: run random scenarios using π and take the average rewards as V(s)
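Value iteration from the first bullet can be sketched in a few lines; this reuses the same style of invented 2-state MDP as above:

```python
import numpy as np

# Value iteration on a hypothetical toy MDP (all numbers invented).
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # T[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]

V = np.zeros(2)
for _ in range(1000):
    # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
    Q = R + gamma * T @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
pi = Q.argmax(axis=1)   # greedy policy; iteration has stopped changing it
print(V, pi)
```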
7
Q-learning: model-free
–Q-function: reformulate the value function in terms of both S and A, independent of explicit models of R and T (= P)
8
Q-learning algorithm
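The algorithm body did not survive extraction; below is a hedged sketch of Mitchell's deterministic tabular update, Q(s,a) ← r + γ max_a' Q(s',a'), on an invented 4-state chain (states 0..3, reward 1 on reaching state 3; all setup details are mine):

```python
import random

# Tabular Q-learning sketch for a deterministic toy chain (hypothetical setup).
gamma = 0.9
n_states, n_actions = 4, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Deterministic environment: action 0 = left, 1 = right."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

random.seed(0)
for _ in range(200):
    s = 0
    while s != n_states - 1:
        a = random.randrange(n_actions)          # explore randomly
        s2, r = step(s, a)
        # Mitchell's deterministic update: Q(s,a) <- r + gamma * max_a' Q(s',a')
        Q[s][a] = r + gamma * max(Q[s2])
        s = s2
print(Q)
```

After enough random episodes the table converges to Q*(s, right) = γ^(distance to goal - 1) along the chain.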
10
Convergence
–Theorem: Q̂ converges to Q*, provided each state-action pair is visited infinitely often (assuming |r| < ∞)
–Proof sketch: with each iteration in which all of S×A is visited, the magnitude of the largest error in the Q̂ table decreases by at least a factor of γ
11
Training
“on-policy”
–exploitation vs. exploration
–will the relevant parts of the space be explored if we stick to the current (sub-optimal) policy?
–ε-greedy policies: choose the action with the max Q value most of the time, or a random action a fraction ε of the time
“off-policy”
–learn from simulations or traces
–SARSA: training examples are ⟨s, a, r, s′, a′⟩ tuples, e.g. from a database
–Actor-critic
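The ε-greedy rule from the slide can be sketched directly (the function name and Q-row layout are hypothetical):

```python
import random

# epsilon-greedy action selection over one row of a Q table:
# exploit the current Q estimate most of the time, explore at random otherwise.
def epsilon_greedy(Q_row, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(Q_row))                   # explore
    return max(range(len(Q_row)), key=lambda a: Q_row[a])  # exploit

random.seed(1)
Q_row = [0.2, 0.7, 0.1]
actions = [epsilon_greedy(Q_row, 0.1) for _ in range(1000)]
print(actions.count(1) / 1000)   # mostly the greedy action (index 1)
```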
12
Non-deterministic case
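Only the slide title survived. In the non-deterministic case Mitchell replaces the deterministic update with a decaying-rate running average (standard formulation from the text; the visit-count notation is mine):

```latex
\hat{Q}_{n}(s,a) \leftarrow (1-\alpha_{n})\,\hat{Q}_{n-1}(s,a)
  + \alpha_{n}\Big[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')\Big],
\qquad \alpha_{n} = \frac{1}{1 + \mathit{visits}_{n}(s,a)}
```

The decaying α_n averages over the stochastic rewards and transitions, which restores convergence to Q*.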
13
Temporal Difference Learning
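Only the title survived here as well. A minimal TD(0) value-estimation sketch on an invented 4-state chain (all setup details are mine): the estimate is nudged by the temporal difference r + γV(s′) − V(s).

```python
import random

# TD(0) value estimation on a hypothetical right-biased random walk:
# V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
gamma, alpha = 0.9, 0.1
n_states = 4
V = [0.0] * n_states

def step(s):
    """Move right with prob 0.8, else left; reward 1 on entering the last state."""
    s2 = min(n_states - 1, s + 1) if random.random() < 0.8 else max(0, s - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

random.seed(0)
for _ in range(2000):
    s = 0
    while s != n_states - 1:
        s2, r = step(s)
        V[s] += alpha * (r + gamma * V[s2] - V[s])   # TD error update
        s = s2
print(V)
```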
14
Representation
–convergence is not the problem; representing a large Q table is (domains with many states or continuous actions)
–how to represent large Q tables?
–neural networks
–function approximation
–basis functions
–hierarchical decomposition of the state space
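One of the options above, function approximation with basis functions, can be sketched as follows: Q(s,a) is represented as a per-action weight vector dotted with a small feature vector φ(s) instead of a table entry (the environment, features, and all numbers are invented for illustration):

```python
import random

# Linear function approximation: Q(s, a) = w_a . phi(s) over a continuous state.
gamma, alpha = 0.9, 0.05
n_actions = 2

def phi(s):
    """Basis functions of a continuous state s in [0, 1]."""
    return [1.0, s, s * s]

w = [[0.0] * 3 for _ in range(n_actions)]   # one weight vector per action

def q(s, a):
    return sum(wi * fi for wi, fi in zip(w[a], phi(s)))

def update(s, a, r, s2, done):
    """Gradient-style step toward the Q-learning target."""
    target = r if done else r + gamma * max(q(s2, b) for b in range(n_actions))
    error = target - q(s, a)
    for i, fi in enumerate(phi(s)):
        w[a][i] += alpha * error * fi

random.seed(0)
for _ in range(3000):
    s = random.random()                       # sample a state uniformly
    a = random.randrange(n_actions)           # 0 = step left, 1 = step right
    s2 = min(1.0, s + 0.1) if a == 1 else max(0.0, s - 0.1)
    done = s2 >= 1.0                          # reward for reaching the right end
    update(s, a, 1.0 if done else 0.0, s2, done)
print(w)
```

The learned approximation assigns higher value near the rewarded end of the state space without ever storing a table over the continuous states.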