Slide 1: Reinforcement learning
References:
- Szepesvári Csaba: Megerősítéses tanulás [Reinforcement Learning] (2004)
- Szita István, Lőrincz András: Megerősítéses tanulás [Reinforcement Learning] (2005)
- Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998)
Slides 2-4: Reinforcement learning (image-only slides)
Slide 5: Reinforcement learning
Pavlov demo: the Nomad 200 robot and the Nomad 200 simulator (Sridhar Mahadevan, UMass)
Slide 6: Reinforcement learning
- Control tasks
- Planning of sequences of actions
- Learning from interaction
- Objective: maximising reward (i.e. task-specific)

[Figure: an interaction trajectory of states $s_1, s_2, \dots, s_9$, actions $a_1, a_2, \dots, a_9$, and rewards $r_1, r_4, r_5, r_9$ with example values +3, -1, +50]
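The interaction loop on this slide is easy to sketch in Python. `Environment` and `Agent` below are hypothetical stand-ins (a toy chain world with a +50 goal reward and a -1 step cost, echoing the figure), not an API from the slides:

```python
import random

class Environment:
    """Toy chain world: states 0..8; reaching state 8 ends the episode.
    Each step costs -1; the goal pays +50 (echoing the figure)."""
    def __init__(self, n_states=9):
        self.n_states = n_states
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right)
        self.state = min(max(self.state + action, 0), self.n_states - 1)
        done = (self.state == self.n_states - 1)
        reward = 50 if done else -1
        return self.state, reward, done

class Agent:
    def act(self, state):
        return random.choice([-1, +1])  # placeholder random policy

env, agent = Environment(), Agent()
s, total = env.state, 0
for t in range(100):  # one episode: s1, a1, r1, s2, a2, r2, ...
    a = agent.act(s)
    s, r, done = env.step(a)
    total += r
    if done:
        break
print("total reward:", total)
```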
Slide 7: Supervised vs. reinforcement learning
Both are machine learning.

  Supervised                                        Reinforcement
  prompt (immediate) supervision                    late, indirect reinforcement
  passive learning (a training dataset is given)    active learning (the system takes actions, which are then reinforced)
Slide 8: Reinforcement learning
- time: $t = 0, 1, 2, \dots$ (infinite horizon)
- states: $s_t \in S$
- actions: $a_t \in A$
- reward: $r_t \in \mathbb{R}$
- policy (strategy):
  - deterministic: $\pi : S \to A$
  - stochastic: $\pi(s, a)$ is the probability that we choose action $a$ being in state $s$
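As a concrete illustration, the two kinds of policy might be represented like this on a toy state/action set (the names and probabilities are made up):

```python
import random

# deterministic policy  pi: S -> A
pi_det = {"s1": "a1", "s2": "a2"}

# stochastic policy  pi(s, a) = probability of choosing action a in state s
pi_sto = {"s1": {"a1": 0.9, "a2": 0.1},
          "s2": {"a1": 0.5, "a2": 0.5}}

def sample_action(pi, s):
    """Draw an action from a stochastic policy in state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

print(pi_det["s1"])                 # always a1
print(sample_action(pi_sto, "s1"))  # a1 with prob. 0.9, a2 with prob. 0.1
```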
Slide 9

- process: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots$
- model of the environment: transition probabilities $P(s' \mid s, a)$ and reward $R(s, a)$
- objective: find a policy which maximises the expected value of the total (discounted) reward, $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
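With a discount factor $\gamma$ the objective is concrete enough to compute by hand; the reward sequence here is invented for illustration:

```python
gamma = 0.9                      # discount factor
rewards = [3, -1, 0, 50]         # an invented reward sequence r_0, r_1, ...
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)                         # 3 + 0.9*(-1) + 0.81*0 + 0.729*50 = 38.55
```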
Slide 10

Markov assumption → the dynamics of the system can be given by:

$$P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} = s' \mid s_t, a_t)$$
Slide 11: Markov Decision Processes (MDPs)
[Figure: a two-state MDP with stochastic transitions between states 1 and 2; action $a_1$ yields reward $r = 0$, action $a_2$ yields reward $r = 2$]
Slide 12: Markov Decision Process
- the environment changes according to $P$ and $R$: $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
- the agent takes an action: $a_t = \pi(s_t)$
- we are looking for the optimal policy $\pi^*$ which maximises $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
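When $P$ and $R$ are known, one standard way to find such a policy is value iteration, which repeatedly applies the Bellman optimality backup. The sketch below runs it on a tiny two-state MDP loosely following the diagram on slide 11; the transition probabilities are invented:

```python
S = [0, 1]
A = ["a1", "a2"]
gamma = 0.9

# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
P = {0: {"a1": [(0, 0.5), (1, 0.5)], "a2": [(1, 1.0)]},
     1: {"a1": [(0, 1.0)],           "a2": [(1, 1.0)]}}
R = {0: {"a1": 0.0, "a2": 2.0},
     1: {"a1": 0.0, "a2": 2.0}}

def q(V, s, a):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])

V = {s: 0.0 for s in S}
for _ in range(500):                               # Bellman optimality backups
    V = {s: max(q(V, s, a) for a in A) for s in S}

pi = {s: max(A, key=lambda a: q(V, s, a)) for s in S}
print(V)   # both states converge to 2 / (1 - 0.9) = 20
print(pi)  # a2 is optimal in both states
```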
Slide 13: The exploration-exploitation dilemma: the k-armed bandit
[Figure: an agent facing a k-armed bandit; each arm has produced a different sequence of rewards, e.g. 0, 0, 5, 10, 35 (avg. reward 10) vs. 5, 10, -15, -15, -10 (avg. reward -5) vs. -20, 0, 50]

To maximise reward in the long term we have to explore the world's dynamics; then we can exploit this knowledge and collect reward.
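A classic answer to the dilemma is the epsilon-greedy rule: with probability eps pull a random arm (explore), otherwise pull the arm with the highest estimated value (exploit). A minimal sketch, with invented arm means:

```python
import random

means = [10, -5, 10]           # true mean reward of each arm (unknown to the agent)
Q = [0.0] * len(means)         # estimated value of each arm
N = [0] * len(means)           # number of pulls per arm
eps, total = 0.1, 0.0

for t in range(10_000):
    if random.random() < eps:                         # explore
        a = random.randrange(len(means))
    else:                                             # exploit
        a = max(range(len(means)), key=lambda i: Q[i])
    r = random.gauss(means[a], 5.0)                   # noisy reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                         # incremental sample average
    total += r

print("estimates:", [round(q, 1) for q in Q])
print("total reward:", round(total))
```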
Slide 14: Jack's Car Rental

Problem: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved.

We assume that the number of cars requested and returned at each location are Poisson random variables with parameter λ. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night.

We take the discount rate to be 0.9 and formulate this as an MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
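The ingredients of this formulation map directly onto code. Below is a minimal sketch using the numbers from the text; the variable names and the `poisson` helper are mine, not from the slide:

```python
from math import exp, factorial

MAX_CARS, MAX_MOVE = 20, 5
RENT_REWARD, MOVE_COST, GAMMA = 10, 2, 0.9
REQUEST_LAMBDA = (3, 4)        # expected rental requests at locations 1 and 2
RETURN_LAMBDA = (3, 2)         # expected returns at locations 1 and 2

def poisson(n, lam):
    """P(X = n) for X ~ Poisson(lam)."""
    return lam**n * exp(-lam) / factorial(n)

# state: (cars at location 1, cars at location 2) at the end of the day
states = [(a, b) for a in range(MAX_CARS + 1) for b in range(MAX_CARS + 1)]

# action: net number of cars moved from location 1 to location 2 overnight
actions = range(-MAX_MOVE, MAX_MOVE + 1)

# e.g. the probability that exactly 3 cars are requested at location 1:
print(poisson(3, REQUEST_LAMBDA[0]))   # ~0.224
```

A full solution would add the transition probabilities built from these Poisson terms and run policy iteration over the 21 x 21 states and 11 actions.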
Slide 15: Regression-based reinforcement learning
If the number of states and/or actions is high, tabular methods become intractable. For continuous state and/or action spaces the value function has to be approximated, e.g. by a regression model.
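One common instantiation (my choice for this sketch, not necessarily the method the slide has in mind) is semi-gradient Q-learning with a linear regression model over state features. The toy dynamics, the features, and the constants below are all invented for illustration:

```python
import random

ACTIONS = [-1.0, +1.0]
alpha, gamma, eps = 0.05, 0.95, 0.1

def features(s, a):
    """Polynomial features of a 1-D continuous state, one block per action."""
    base = [1.0, s, s * s]
    if a == ACTIONS[0]:
        return base + [0.0, 0.0, 0.0]
    return [0.0, 0.0, 0.0] + base

w = [0.0] * 6                       # weights of the linear model

def Q(s, a):
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

def step(s, a):
    """Invented dynamics: the agent is rewarded for staying near 0."""
    s2 = max(-1.0, min(1.0, s + 0.1 * a + random.gauss(0.0, 0.01)))
    return s2, -abs(s2)

s = random.uniform(-1.0, 1.0)
for t in range(20_000):
    # epsilon-greedy action selection
    if random.random() < eps:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda b: Q(s, b))
    s2, r = step(s, a)
    # semi-gradient Q-learning update of the regression weights
    td_error = r + gamma * max(Q(s2, b) for b in ACTIONS) - Q(s, a)
    w = [wi + alpha * td_error * xi for wi, xi in zip(w, features(s, a))]
    s = s2

print([round(wi, 2) for wi in w])
```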
Slide 16: TD-Gammon

TD learning; a neural network with a single hidden layer.
1,500,000 games of self-play between variants; it reached the level of the world champion. Backgammon state space: ~10^20, so DP doesn't work!
Slide 17: AlphaGo Zero

Approximate policy iteration with deep learning (79 layers).
5M self-play games. Go state space: ~10^170.