# Reinforcement Learning Applications in Robotics Gerhard Neumann, Seminar A, SS 2006.



Overview

- Policy Gradient Algorithms
  - RL for Quadruped Locomotion
- PEGASUS Algorithm
  - Autonomous Helicopter Flight
  - High Speed Obstacle Avoidance
- RL for Biped Locomotion
  - Poincaré-Map RL
  - Dynamic Planning
- Hierarchical Approach
  - RL for Acquisition of Robot Stand-Up Behavior

RL for Quadruped Locomotion [Kohl04]

- Simple policy-gradient example: optimize a gait for the Sony Aibo robot
- Uses a parameterized policy with 12 parameters:
  - Front and rear locus (height, x-pos, y-pos)
  - Height of the front and the rear of the body
  - …

Quadruped Locomotion

- Policy: no notion of state – open-loop control!
- Start with an initial policy
- Generate t = 15 random policies R_i near the current policy by perturbing its parameters
- Evaluate the value of each policy on the real robot
- Estimate the gradient for each parameter
- Update the policy in the direction of the gradient
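The update loop above can be sketched as follows. This is a minimal sketch, not the exact procedure of [Kohl04]: `evaluate` stands in for measuring walking speed on the robot, and `eps`, `step`, and the per-class averaging details are illustrative assumptions.

```python
import random

def estimate_gradient(theta, evaluate, eps=0.1, n_policies=15):
    """Score random perturbations of the current policy and, for each
    parameter, compare the average score of the +eps and -eps
    perturbation classes to estimate a gradient direction."""
    dim = len(theta)
    perturbations = [[random.choice((-eps, 0.0, eps)) for _ in range(dim)]
                     for _ in range(n_policies)]
    scores = [evaluate([t + d for t, d in zip(theta, p)]) for p in perturbations]

    def avg(xs):
        return sum(xs) / len(xs) if xs else 0.0

    gradient = []
    for j in range(dim):
        plus = avg([s for s, p in zip(scores, perturbations) if p[j] > 0])
        minus = avg([s for s, p in zip(scores, perturbations) if p[j] < 0])
        gradient.append(plus - minus)
    return gradient

def policy_gradient_ascent(theta, evaluate, iterations=20, step=0.05):
    """Repeatedly estimate the gradient and move the parameters a fixed
    step in its normalized direction."""
    for _ in range(iterations):
        g = estimate_gradient(theta, evaluate)
        norm = sum(x * x for x in g) ** 0.5 or 1.0
        theta = [t + step * x / norm for t, x in zip(theta, g)]
    return theta
```

Because each evaluation is a noisy robot experiment, the class-average comparison trades gradient accuracy for a small, fixed number of trials per iteration.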

Quadruped Locomotion

- Estimating the walking speed of a policy is an automated process on the Aibos
- Each policy is evaluated 3 times
- One iteration (3 × 15 evaluations) takes 7.5 minutes

Quadruped Gait: Results Better than the best known gait for AIBO!

Pegasus [Ng00]

Policy gradient algorithms:
- Use a finite time horizon and evaluate the value of the policy
- The value of a policy in a stochastic environment is hard to estimate => stochastic optimization process

PEGASUS:
- For all policy evaluation trials, use a fixed set of start states (scenarios)
- Use "fixed randomization" for policy evaluation (only works in simulation!)
- The same conditions hold for each evaluation trial => deterministic optimization process!
- Can be solved by any optimization method; commonly used: gradient ascent, random hill climbing
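The "scenarios + fixed randomization" idea can be sketched as follows. This is a minimal sketch: `simulate`, the scenario format, and the toy reward are illustrative assumptions, not the notation of [Ng00].

```python
import random

def pegasus_value(policy, simulate, scenarios, horizon=50):
    """PEGASUS-style evaluation: each scenario fixes a start state and a
    random seed, so the same policy always receives exactly the same
    score and policy search becomes a deterministic optimization."""
    total = 0.0
    for start_state, seed in scenarios:
        rng = random.Random(seed)  # fixed randomization for this scenario
        state, ret = start_state, 0.0
        for _ in range(horizon):
            state, reward = simulate(state, policy(state), rng)
            ret += reward
        total += ret
    return total / len(scenarios)
```

Evaluating the same policy twice yields identical values, which is what allows generic optimizers (gradient ascent, hill climbing) to be applied directly.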

Autonomous Helicopter Flight [Ng04a, Ng04b]

- Autonomously learn to fly an unmanned helicopter ($70,000 => catastrophic exploration!)
- Learn the dynamics from observation of a human pilot
- Use PEGASUS to:
  - Learn to hover
  - Learn to fly complex maneuvers
  - Inverted helicopter flight

Helicopter Flight: Model Identification

- 12-dimensional state space: world coordinates (position + rotation) + velocities
- 4-dimensional actions:
  - 2 rotor-plane pitch
  - Rotor blade tilt
  - Tail rotor tilt
- Actions are selected every 20 ms

Helicopter Flight: Model Identification

- Human pilot flies the helicopter, data is logged: 391 s of training data
- State reduced to 8 dimensions (position can be estimated from velocities)
- Learn transition probabilities P(s_{t+1} | s_t, a_t):
  - Supervised learning with locally weighted linear regression
  - Gaussian noise is modeled for the stochastic model
- A simulator was implemented for model validation
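Locally weighted linear regression, as used for the model identification, can be sketched generically. This sketch handles scalar inputs and one output dimension; the helicopter model's features, bandwidths, and dimensionality are not reproduced here.

```python
import math

def lwr_predict(query, xs, ys, bandwidth=1.0):
    """Locally weighted linear regression for scalar inputs: weight each
    training sample by a Gaussian kernel on its distance to the query,
    then solve the 2x2 weighted least-squares system for the local
    slope and intercept and evaluate the line at the query."""
    w = [math.exp(-(x - query) ** 2 / (2.0 * bandwidth ** 2)) for x in xs]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * sxx - sx * sx
    slope = (sw * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / sw
    return slope * query + intercept
```

Because the fit is redone around every query point, the method captures non-linear dynamics with a collection of local linear models, which is why it suits learning a transition model from logged flight data.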

Helicopter Flight: Hover Control

- Desired hovering position
- Very simple policy class:
  - Edges are obtained from human prior knowledge
  - Learns more or less the linear gains of the controller
- Quadratic reward function: punishment for deviation from the desired position and orientation

Helicopter Flight: Hover Control

Results: better performance than the human expert (red)

Helicopter Flight: Flying Maneuvers

- Fly 3 maneuvers from the most difficult RC helicopter competition class
- Trajectory following:
  - Punish the distance from the projected point on the trajectory
  - Additional reward for making progress along the trajectory

Helicopter Flight: Results

Videos: Video1, Video2

Helicopter Flight: Inverted Flight

- Very difficult for humans – unstable!
- Recollect data for inverted flight, use the same methods as before
- Learned in 4 days, from data collection to flight experiment
- Stable inverted flight controller with sustained position
- Video

High Speed Obstacle Avoidance [Michels05]

- Obstacle avoidance with an RC car in unstructured environments
- Estimate depth information from monocular cues
- Learn a controller for obstacle avoidance with PEGASUS in a graphical simulation: does it work in the real environment?

Estimating Depth Information: Supervised Learning

- Divide the image into 16 vertical stripes
- Use features of the stripe and the neighboring stripes as input vectors
- Target values (shortest distance within a stripe) come either from the simulation or from laser range finders
- Linear regression
- Output of the vision system:
  - Angle of the stripe with the largest distance
  - Distance of that stripe
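The vision system's two outputs map naturally onto a steering rule; a minimal sketch follows. The field of view, stripe layout, and angle convention are illustrative assumptions, not values from [Michels05].

```python
def steering_target(predicted_distances, field_of_view_deg=90.0):
    """Given per-stripe distance predictions (ordered left to right),
    return the steering angle of the stripe with the largest predicted
    distance, plus that distance - mirroring the vision system's two
    outputs. Angle 0 is straight ahead; negative is left."""
    n = len(predicted_distances)
    best = max(range(n), key=lambda i: predicted_distances[i])
    # map the center of stripe `best` into [-fov/2, +fov/2]
    angle = (best + 0.5) / n * field_of_view_deg - field_of_view_deg / 2.0
    return angle, predicted_distances[best]
```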

Obstacle Avoidance: Control

- Policy: 6 parameters – again, a very simple policy is used
- Reward: deviation from the desired speed, number of crashes

Obstacle Avoidance: Results

Using a graphical simulation to train the vision system also works in outdoor environments. Video

RL for Biped Robots

- RL is often used only for simplified planar models:
  - Poincaré-map based RL [Morimoto04]
  - Dynamic planning [Stilman05]
- Other examples of RL on real robots strongly simplify the problem: [Zhou03]

Poincaré Map-Based RL

- Improve walking controllers with RL
- Poincaré map: the intersection points of an n-dimensional trajectory with an (n-1)-dimensional hyperplane
- Predict the state of the biped a half cycle ahead, at fixed phases of the gait

Poincaré Map

- Learn the mapping:
  - Input space: x = (d, d′), the distance between stance foot and body and its velocity
  - Action space: modulate the via-points of the joint trajectories
- Function approximator: Receptive Field Weighted Regression (RFWR) with a fixed grid

Via-Points

- Nominal trajectories come from human walking patterns
- The control output modulates the via-points marked with a circle
- Hand-selected via-points; the via-points of one joint are incremented by the same amount

Learning the Value Function

- Reward function:
  - 0.1 if the height of the robot > 0.35 m
  - -0.1 otherwise
- Standard semi-MDP update rules; the value function only needs to be learned at the two section phases
- Model-based actor-critic approach with actor A and its update rule

Results

- Stable walking performance after 80 trials
- Videos: Beginning of Learning, End of Learning

Dynamic Programming for Biped Locomotion [Stilman05]

- 4-link planar robot
- Dynamic programming in reduced-dimensional spaces:
  - Manual temporal decomposition of the problem into phases of single and double support
  - Use intuitive reductions of the state space for both phases

State-Increment Dynamic Programming

- 8-dimensional state space, discretized by a coarse grid
- Use dynamic programming; the interval ε is defined as the minimum time interval required for any state index to change
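Dynamic programming on a discretized grid follows the standard value-iteration sweep; a minimal tabular sketch is below. It assumes deterministic transitions and a toy chain problem, far smaller than the 8-dimensional biped space.

```python
def value_iteration(n_states, actions, transition, reward, gamma=0.95, tol=1e-6):
    """Dynamic programming over a discretized state space: sweep every
    grid cell, backing up the best one-step return, until the value
    function stops changing; then read off the greedy policy."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(reward(s, a) + gamma * V[transition(s, a)] for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = [max(actions, key=lambda a: reward(s, a) + gamma * V[transition(s, a)])
              for s in range(n_states)]
    return V, policy
```

The cost of each sweep is (number of cells) × (number of actions), which is why the state-space reductions on the following slides matter so much.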

State Space Considerations

- Decompose into 2 state space components (DS + SS): there are important distinctions between the dynamics of double support (DS) and single support (SS)
- Periodic system: DP cannot be applied separately to the state space components
- Establish a mapping between the components for the DS and SS transitions

State Space Reduction

- Double support:
  - Constant step length d_f: it cannot change during DS, but can change after the robot completes SS
  - Equivalent to a 5-bar linkage model; the entire state space can be described by 2 DoF (use k_1 and k_2)
  - 5-D state space: 10×16×16×12×12 grid => 368,640 states

State Space Reduction

- Single support:
  - Compass 2-link model
  - Assume k_1 and k_2 are constant: the stance knee angle k_1 has a small range in human walking; the swing knee k_2 has a strong effect on d_f but can be prescribed in accordance with h_2 with little effect on the robot's CoM
  - 4-D state space: 35×35×18×18 grid => 396,900 states

State-Space Reduction

- Phase transitions:
  - The DS-to-SS transition occurs when the rear foot leaves the ground, with a mapping between the two state spaces
  - The SS-to-DS transition occurs when the swing leg makes contact, with a corresponding mapping

Action Space, Rewards

- Use discretized torques
- DS: the hip and both knee joints can accelerate the CoM
  - Fix the hip action to zero to gain better resolution for the knee joints
  - Discretize the 2-D action space from ±5.4 Nm into 7×7 intervals
- SS: only choose the hip torque, 17 intervals in the range of ±1.8 Nm
- States × actions: 398,640×49 + 396,900×17 = 26,280,660 cells (!!)
- Reward:

Results

- 11 hours of computation
- The computed policy locates a limit cycle through the space

Performance under Error

- Alter different properties of the robot in simulation without relearning the policy
- A wide range of disturbances is tolerated, even if the used model of the dynamics is incorrect!
- The wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle

RL for a CPG-Driven Biped Robot

- CPG controller:
  - Recurrent neural-oscillator network
  - State dynamics
  - Sensory input
  - Torque output
- State of the system: neural state v + physical state x
- The weights of the CPG have already been optimized [Taga01]

CPG Actor-Critic

- 2 modules: actor + CPG
- W_ij^act and A_jk are trained by RL; the rest of the system is fixed

CPG Actor-Critic

- The actor outputs an indirect control u to the CPG-coupled system (CPG + physical system)
- The actor is a linear controller without any feedback connections
- Learning with the Natural Actor-Critic algorithm

CPG Actor-Critic: Experiments CPG-Driven Biped Robot

Learning a Stand-Up Behavior [Morimoto00]

- Learning to stand up with a 3-link planar robot
- 6-D state space: angles + velocities
- Hierarchical reinforcement learning:
  - Task decomposition by sub-goals
  - Decompose the task into a non-linear problem in a lower-dimensional space and a nearly-linear problem in a high-dimensional space

Upper-Level Learning

- Coarse discretization of postures; no speed information in the state space (3-D state space)
- Actions: select sub-goals

Upper-Level Learning

- Reward function:
  - Reward the success of the stand-up
  - Reward also the success of a sub-goal; choosing sub-goals that are easier to reach from the current state is preferred
- Use Q(λ)-learning to learn the sequence of sub-goals
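A tabular Q(λ) update can be sketched as follows. This is a simplified variant with replacing eligibility traces and without Watkins-style trace cutting, and the state/action encoding is illustrative, not the posture/sub-goal discretization of [Morimoto00].

```python
def q_lambda_step(Q, E, s, a, r, s_next, alpha=0.2, gamma=0.9, lam=0.8):
    """One tabular Q(lambda) update: compute the TD error toward the
    best next action, mark (s, a) in the eligibility trace, and spread
    the error over all recently visited state-action pairs."""
    delta = r + gamma * max(Q[s_next]) - Q[s][a]
    E[s][a] = 1.0                           # replacing traces
    for si in range(len(Q)):
        for ai in range(len(Q[si])):
            Q[si][ai] += alpha * delta * E[si][ai]
            E[si][ai] *= gamma * lam        # decay all traces
```

The traces let credit for reaching a sub-goal flow back over the whole sequence of earlier sub-goal choices, which suits the upper level's short, episodic decision chains.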

Lower-Level Learning

- The lower level is free to choose at which speed to reach a sub-goal (desired posture)
- 6-D state space
- Use Incremental Normalized Gaussian networks (ING-nets) as function approximator: an RBF network with a rule for allocating new RBF centers
- Action space: torque vector
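The two ingredients of an ING-net, normalized Gaussian basis functions and incremental center allocation, can be sketched as below. The activation threshold and width are illustrative assumptions, not the allocation criteria used in [Morimoto00].

```python
import math

def normalized_gaussian_features(x, centers, width=1.0):
    """Normalized Gaussian basis functions: Gaussian activations divided
    by their sum, so the feature vector always sums to one."""
    acts = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     / (2.0 * width ** 2)) for c in centers]
    total = sum(acts)
    return [a / total for a in acts]

def maybe_add_center(x, centers, activation_threshold=0.4, width=1.0):
    """Illustrative incremental allocation rule: add a new center when no
    existing Gaussian is sufficiently activated by the input."""
    acts = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     / (2.0 * width ** 2)) for c in centers]
    if not centers or max(acts) < activation_threshold:
        centers.append(list(x))
    return centers
```

The normalization makes the features a soft partition of the state space, and incremental allocation keeps units only where the robot actually visits, which matters in a 6-D state space.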

Lower-Level Learning

- Reward: -1.5 if the robot falls down
- Continuous-time actor-critic learning [Doya99]; actor and critic are learned with ING-nets
- Control output: combination of a linear servo controller and a non-linear feedback controller

Results

- Simulation results: the hierarchical architecture learns 2x faster than the plain architecture
- Real robot videos: Before Learning, During Learning, After Learning
- Learned on average in 749 trials (7/10 learning runs succeeded)
- Used on average 4.3 sub-goals

The End

For people interested in using RL: RL-Toolbox, www.igi.tu-graz.ac.at/ril-toolbox

Thank you

Literature

- [Kohl04] N. Kohl and P. Stone: Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, 2005
- [Ng00] A. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs, 2000
- [Ng04a] A. Ng et al.: Autonomous inverted helicopter flight via reinforcement learning, 2004
- [Ng04b] A. Ng et al.: Autonomous helicopter flight via reinforcement learning, 2004
- [Michels05] J. Michels, A. Saxena and A. Ng: High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning, 2005
- [Morimoto04] J. Morimoto and C. Atkeson: A Simple Reinforcement Learning Algorithm for Biped Walking, 2004

Literature

- [Stilman05] M. Stilman, C. Atkeson and J. Kuffner: Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, 2005
- [Morimoto00] J. Morimoto and K. Doya: Acquisition of Stand-Up Behavior by a Real Robot using Hierarchical Reinforcement Learning, 2000
- [Morimoto98] J. Morimoto and K. Doya: Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, 1998
- [Zhou03] C. Zhou and Q. Meng: Dynamic Balance of a Biped Robot using Fuzzy Reinforcement Learning Agents, 2003
- [Doya99] K. Doya: Reinforcement Learning in Continuous Time and Space, 1999

