Reinforcement learning

Szepesvári Csaba: Megerősítéses tanulás [Reinforcement Learning] (2004)
Szita István, Lőrincz András: Megerősítéses tanulás [Reinforcement Learning] (2005)
Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998)

Pavlov: Nomad 200 robot and Nomad 200 simulator (Sridhar Mahadevan, UMass)

Control tasks
Planning of multiple actions
Learning from interaction
Objective: maximising reward (i.e. task-specific)
[Figure: a chain of states s1 … s9 with actions a1 … a9 and rewards r1 … r9, e.g. +50, -1, +3]

Supervised vs Reinforcement learning
Both are machine learning.
Supervised: prompt supervision; passive learning (the training dataset is given).
Reinforcement: late, indirect reinforcement; active learning (the system takes actions, which are then reinforced).

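time: $t = 1, 2, \dots$
states: $s_t \in S$
actions: $a_t \in A$
reward: $r_t \in \mathbb{R}$
policy (strategy):
deterministic: $\pi : S \to A$
stochastic: $\pi : S \times A \to [0, 1]$, where $\pi(s, a)$ is the probability of choosing action $a$ in state $s$
(infinite horizon)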
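The difference between the two kinds of policy can be sketched in a few lines of Python. The dictionary layout and the function names `act_deterministic` and `act_stochastic` are illustrative assumptions, not notation from the slides.

```python
import random


def act_deterministic(policy, state):
    """Deterministic policy: a plain mapping from state to action."""
    return policy[state]


def act_stochastic(policy, state, rng=random):
    """Stochastic policy: policy[state][action] is the probability of
    choosing that action in that state; one action is sampled from it."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical two-action example.
deterministic_policy = {"s1": "a1", "s2": "a2"}
stochastic_policy = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 0.5, "a2": 0.5},
}
```

Calling `act_stochastic(stochastic_policy, "s1")` repeatedly would return `"a1"` in roughly 70% of the calls under these assumed probabilities.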

process: $s_1, a_1, r_1, s_2, a_2, r_2, \dots$
model of the environment: transition probabilities and reward
objective: find a policy which maximises the expected value of the total reward
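As a small illustration of "total reward", the sketch below sums a reward sequence with a discount factor. Discounting is an assumption here (the car rental example later uses a discount rate of 0.9); with `gamma = 1.0` the function returns the plain sum. The name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite
    reward sequence, accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

The objective is then to find a policy for which the expected value of this quantity is as large as possible.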

Markov assumption → the dynamics of the system can be given by:
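One common way to write this, assuming the symbols P (transition probabilities) and R (expected reward) that describe the environment, is sketched below. The exact form used on the slide may differ, for example R may also depend on the next state.

```latex
% Markov property: the next state depends only on the current state and action.
P(s_{t+1} = s' \mid s_t = s, a_t = a, s_{t-1}, a_{t-1}, \dots)
  = P(s_{t+1} = s' \mid s_t = s, a_t = a) =: P(s' \mid s, a)

% Expected immediate reward for taking action a in state s.
R(s, a) = \mathbb{E}\left[ r_t \mid s_t = s, a_t = a \right]
```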

Markov Decision Processes (MDPs)
Stochastic transitions
[Figure: transition diagram with states 1 and 2, actions a1 and a2, and rewards r = 0 and r = 2]

Markov Decision Process
The environment changes according to P and R.
The agent takes an action.
We are looking for the optimal policy $\pi^*$ which maximises the expected total reward.
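When P and R are known and the state space is small, an optimal policy can be computed by dynamic programming. The sketch below uses value iteration, which is one standard DP method and is not named on the slide. The two-state MDP, its probabilities and rewards, and the discount factor 0.9 are all hypothetical.

```python
# mdp[state][action] is a list of (probability, next_state, reward) outcomes.
# All numbers below are made up for illustration.
EXAMPLE_MDP = {
    0: {"a1": [(1.0, 0, 0.0)],
        "a2": [(0.8, 1, 2.0), (0.2, 0, 0.0)]},
    1: {"a1": [(1.0, 1, 1.0)],
        "a2": [(1.0, 0, 0.0)]},
}


def q_value(mdp, values, state, action, gamma):
    """Expected reward plus discounted value of the successor state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in mdp[state][action])


def value_iteration(mdp, gamma=0.9, tol=1e-10):
    """Repeat the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            best = max(q_value(mdp, values, s, a, gamma) for a in mdp[s])
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The greedy policy with respect to the converged values.
    policy = {
        s: max(mdp[s], key=lambda a: q_value(mdp, values, s, a, gamma))
        for s in mdp
    }
    return values, policy
```

For this made-up MDP the greedy policy takes `a2` in state 0 and `a1` in state 1.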

The exploration – exploitation dilemma
The k-armed bandit
[Figure: an agent facing several bandit arms, with the rewards observed so far on each arm]
Rewards 0, 0, 5, 10, 35: avg. reward 10
Rewards 5, 10, -15, -15, -10: avg. reward -5
Rewards -20, 0, 50
100
To maximise the reward in the long term, we first have to explore the world's dynamics; then we can exploit this knowledge and collect reward.
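One simple way to balance exploration and exploitation is an epsilon-greedy rule: with a small probability pull a random arm, otherwise pull the arm with the best average so far. This strategy is not named on the slide; it is used here only as an illustrative sketch, and the Gaussian rewards and default parameter values are assumptions.

```python
import random


def epsilon_greedy_bandit(true_means, steps=2000, epsilon=0.1, noise=0.5, seed=0):
    """Play a k-armed bandit with Gaussian rewards (assumed) and return
    the per-arm average rewards, the pull counts and the total reward."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick any arm
        else:
            arm = max(range(k), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(true_means[arm], noise)
        counts[arm] += 1
        # Incremental update of the running average for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total
```

With hypothetical arm means such as `[1.0, 2.0, 3.0]`, most pulls should end up on the last arm once its average has been estimated.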

Jack's Car Rental Problem: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables with parameter λ. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be 0.9 and formulate this as an MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
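The ingredients of this MDP can be written down directly. The sketch below encodes the states, the allowed actions and the expected immediate reward from the numbers in the text. The function names, the sign convention (a positive action moves cars from the first location to the second) and the truncation of the Poisson sum at 60 terms are assumptions; car returns are left out.

```python
import math

MAX_CARS = 20      # at most 20 cars per location
MAX_MOVE = 5       # at most 5 cars moved overnight
RENT_REWARD = 10   # dollars credited per rented car
MOVE_COST = 2      # dollars per car moved
REQUEST_LAMBDA = (3, 4)  # expected rental requests at locations 1 and 2

# State: number of cars at each location at the end of the day.
STATES = [(c1, c2) for c1 in range(MAX_CARS + 1) for c2 in range(MAX_CARS + 1)]


def poisson_pmf(n, lam):
    """Probability that a Poisson(lam) variable equals n."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def valid_actions(state):
    """Net cars moved from location 1 to 2 (negative: from 2 to 1)."""
    c1, c2 = state
    return [a for a in range(-MAX_MOVE, MAX_MOVE + 1) if a <= c1 and -a <= c2]


def expected_rentals(cars, lam, max_requests=60):
    """Expected number of cars rented when `cars` are available."""
    return sum(min(n, cars) * poisson_pmf(n, lam) for n in range(max_requests))


def expected_reward(state, action):
    """Expected rental income for the next day minus the moving cost."""
    c1, c2 = state
    cars1 = min(c1 - action, MAX_CARS)  # surplus cars leave the problem
    cars2 = min(c2 + action, MAX_CARS)
    income = RENT_REWARD * (
        expected_rentals(cars1, REQUEST_LAMBDA[0])
        + expected_rentals(cars2, REQUEST_LAMBDA[1])
    )
    return income - MOVE_COST * abs(action)
```

There are 21 × 21 = 441 states and at most 11 actions per state, so this problem is still small enough for dynamic programming.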

Regression-based reinforcement learning
If the number of states and/or actions is large, the problem becomes intractable.
The same applies to continuous state and/or action spaces.
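The idea behind the regression-based approach is to replace a table of values with a fitted function. As a minimal sketch, assuming a one-dimensional continuous state and a linear model V(s) ≈ w0 + w1·s (both assumptions, not taken from the slide), the value function can be fitted to sampled (state, return) pairs by ordinary least squares:

```python
def fit_linear_value(states, returns):
    """Least-squares fit of V(s) = w0 + w1 * s to observed returns."""
    n = len(states)
    mean_s = sum(states) / n
    mean_g = sum(returns) / n
    var_s = sum((s - mean_s) ** 2 for s in states)
    cov = sum((s - mean_s) * (g - mean_g) for s, g in zip(states, returns))
    w1 = cov / var_s
    w0 = mean_g - w1 * mean_s
    return w0, w1


def predict_value(weights, state):
    """Evaluate the fitted value function at any (possibly unseen) state."""
    w0, w1 = weights
    return w0 + w1 * state
```

The fitted function generalises to states that were never visited, which a lookup table cannot do. TD-gammon and AlphaGo Zero use neural networks as the regression model instead of a straight line.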

TD-gammon
TD learning, neural network with a single hidden layer
1,500,000 games played between variants
Achieved the level of the world champion
Backgammon state space: ~10^20, DP doesn't work!

AlphaGo Zero
Approximate policy iteration, deep learning (79 layers)
5M self-play games
Go state space: ~10^170
