
1 Robot Learning
Jeremy Wyatt, School of Computer Science, University of Birmingham

2 Plan
Why and when
What we can do:
– Learning how to act
– Learning maps
– Evolutionary Robotics
How we do it:
– Supervised Learning
– Learning from punishments and rewards
– Unsupervised Learning

3 Learning How to Act
What can we do?
– Reaching
– Road following
– Box pushing
– Wall following
– Pole-balancing
– Stick juggling
– Walking

4 Learning How to Act: Reaching
We can learn from reinforcement or from a teacher (supervised learning); the two feedback signals are contrasted in the sketch below.
Reinforcement Learning:
– Action: Move your arm (θ)
– You received a reward of 2.1
Supervised Learning:
– Action: Move your hand to (x, y, z)
– You should have moved to (x′, y′, z′)
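To make the distinction concrete, here is a minimal sketch of the two feedback signals for the reaching task (all names and numbers are illustrative assumptions, not from the slides):

    import numpy as np

    theta = np.array([0.3, -1.2, 0.7])      # joint angles the learner chose

    # Reinforcement learning: the environment returns only a scalar score.
    reward = 2.1

    # Supervised learning: a teacher returns the correct answer itself.
    hand = np.array([0.41, 0.10, 0.55])     # where the hand actually went (x, y, z)
    target = np.array([0.40, 0.12, 0.50])   # where the teacher says it should have gone
    error = target - hand                   # a full error vector, not just a score

The key difference: the reward says only how well the arm did, while the teacher says exactly what it should have done.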

5 Learning How to Act: Driving
ALVINN learned to drive in 5 minutes.
It learns to copy the human response.
A feedforward multilayer neural network maps a 30×32 image to a steering wheel position.

6 Learning How to Act: Driving
The network outputs form a Gaussian whose mean encodes the driving direction.
Compare this with the "correct" human action.
Compute the error for each output unit given the desired Gaussian (sketched below).
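A minimal sketch of this output encoding, assuming 30 output units and an arbitrary width for the Gaussian bump (both values are assumptions; the slides give no numbers):

    import numpy as np

    def gaussian_target(correct_unit, n_units=30, sigma=2.0):
        """Target activations: a Gaussian bump centred on the output
        unit that encodes the human's steering direction."""
        units = np.arange(n_units)
        return np.exp(-((units - correct_unit) ** 2) / (2 * sigma ** 2))

    # Suppose the human's steering direction is encoded by unit 12.
    target = gaussian_target(correct_unit=12)

    output = np.random.rand(30)   # stand-in for the network's actual outputs
    errors = target - output      # per-unit error to backpropagate

Encoding the direction as a bump rather than a single hot unit means nearby steering directions produce similar targets, which makes the error signal smoother.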

7 Learning How to Act: Driving
The distribution of training examples from on-the-fly learning causes problems:
– the network doesn't see how to cope with misalignments;
– the network can forget a situation if it doesn't see it for a while.
Answer: generate new examples from the on-the-fly images.

8 Learning How to Act: Driving
– Use the camera geometry to work out the new field of view.
– Fill in the missing pixels using information about the road structure.
– Transform the target steering direction to match.
– Present the result as a new training example (a rough sketch follows below).
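A rough sketch of the idea, where a simple pixel shift stands in for the full camera-geometry reprojection and the steering correction gain is an assumption (the real system computed both properly):

    import numpy as np

    def synthesize_example(image, steering, shift_px):
        """Pretend the vehicle drifted sideways by shift_px pixels (> 0):
        shift the image, patch the revealed border by repeating the edge
        column (a crude stand-in for road structure), and correct the
        target steering back towards the road centre."""
        assert shift_px > 0
        shifted = np.roll(image, -shift_px, axis=1)           # stand-in for reprojection
        shifted[:, -shift_px:] = shifted[:, [-shift_px - 1]]  # fill from the road edge
        corrected = steering + 0.05 * shift_px                # assumed correction gain
        return shifted, corrected

Each recorded image can then be turned into several misaligned variants, so the network keeps seeing recovery situations that the human driver never demonstrates.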

9 Learning How to Act: Driving

10 Learning How to Act
Obelix learns to push boxes using Reinforcement Learning.

11 What is Reinforcement Learning?
Learning from punishments and rewards.
The agent moves through the world, observing states and rewards, and adapts its behaviour to maximise some function of the reward.
[Figure: a trajectory of states s1, s2, s3, ..., actions a1, a2, a3, ..., and rewards r1, ..., with example rewards of +3 and +50.]

12 Return: Long-term Performance
Let's assume our agent acts according to some rules, called a policy, π.
The return R_t is a measure of the long-term reward collected after time t.
The expected return for a state-action pair is called a Q value, Q(s, a).
[Figure: rewards r1, r4, r5, r9 collected along a trajectory, e.g. +3 and +50.]
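The slide leaves "long-term" informal; the standard definitions consistent with it introduce a discount factor γ (an assumption here, since the slides never name one):

    % Discounted return after time t; gamma is the (assumed) discount factor
    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
        = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1

    % Q value: expected return from taking action a in state s, then following policy pi
    Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, R_t \mid s_t = s,\ a_t = a \,\right]

With γ < 1, rewards far in the future count for less, which keeps the return finite even for never-ending tasks.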

13 One-step Q-learning
– Guess how good state-action pairs are.
– Take an action.
– Watch the new state and reward.
– Update the state-action value (sketched in code below).
[Figure: in state s_t the agent takes action a_t, then observes reward r_{t+1} and new state s_{t+1}.]
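A minimal tabular sketch of these four steps (the table sizes, alpha, and gamma are illustrative assumptions; the slide names only the steps):

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))   # step 1: initial guess for Q(s, a)
    alpha, gamma = 0.1, 0.9               # learning rate and discount factor

    def q_update(s, a, r, s_next):
        """Steps 2-4: after taking action a in state s and observing
        reward r and new state s_next, move Q[s, a] towards the
        one-step target r + gamma * max over a' of Q[s_next, a']."""
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

Repeating this update along the robot's experience drives Q towards the values of the best policy (under the usual convergence conditions), which is how a learner like Obelix can improve from reward alone.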

14 Obelix
– Won't converge with a single controller.
– Works if you divide the task into behaviours.
– But ...

15 Evolutionary Robotics

16 Learning Maps

