
1 Value Function Approximation on Non-linear Manifolds for Robot Motor Control
Masashi Sugiyama 1),2), Hirotaka Hachiya 1),2), Christopher Towell 2), Sethu Vijayakumar 2)
1) Computer Science, Tokyo Institute of Technology  2) School of Informatics, University of Edinburgh

2 Maze Problem: Guide Robot to Goal
In this talk we consider a maze problem like this: we want to guide a robot to the goal. The possible actions are up, down, left, and right, and the robot knows its position (x, y) but does not know which direction to go. We do not teach the robot the best action to take at each position; we only give a reward at the goal. The task is to make the robot select the optimal action at each position.

3 Markov Decision Process (MDP)
The maze problem can be formulated as a Markov Decision Process (MDP). An MDP consists of (S, A, P, R): S is a set of states, A is a set of actions, P(s'|s, a) is the transition probability, and R(s, a, s') is the reward. The action a the robot takes at state s is specified by a policy π(a|s). The goal is to make the robot learn the optimal policy π*.
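To make this concrete, here is a minimal sketch (not the authors' code) of a grid-maze MDP in Python; the maze layout, goal position, reward of +1 at the goal, and discount factor are illustrative assumptions.

```python
import numpy as np

# Illustrative 5x5 maze: 0 = free cell, 1 = wall; goal position is an assumption.
MAZE = np.array([
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
GOAL = (0, 4)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9  # discount factor (assumed value)

STATES = [(i, j) for i in range(MAZE.shape[0])
                 for j in range(MAZE.shape[1]) if MAZE[i, j] == 0]

def step(state, action):
    """Deterministic transition P(s'|s,a): move unless blocked by a wall or border."""
    di, dj = ACTIONS[action]
    ni, nj = state[0] + di, state[1] + dj
    if 0 <= ni < MAZE.shape[0] and 0 <= nj < MAZE.shape[1] and MAZE[ni, nj] == 0:
        next_state = (ni, nj)
    else:
        next_state = state                      # bump into a wall: stay put
    reward = 1.0 if next_state == GOAL else 0.0  # reward only at the goal
    return next_state, reward
```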

4 Definition of Optimal Policy
To define the optimal policy more formally, we first define the action-value function Q^π(s, a): the discounted sum of future rewards when taking action a in state s and following π thereafter. The optimal value function is Q*(s, a) = max_π Q^π(s, a), and the optimal policy is π*(s) = argmax_a Q*(s, a). Thus π* can be computed if Q* is given. Question: how do we compute Q*?
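The formulas themselves were not preserved in this transcript; in standard notation the definitions on this slide read as follows (symbols may differ slightly from the original slide):

```latex
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \,\middle|\, s_{0}=s,\ a_{0}=a,\ \pi \right],
\qquad
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a),
\qquad
\pi^{*}(s) = \operatorname*{arg\,max}_{a \in A} Q^{*}(s,a).
```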

5 Policy Iteration (Sutton & Barto, 1998)
Q* can be computed by policy iteration. Starting from some initial policy π, iterate Steps 1 and 2 until convergence:
Step 1. Compute Q^π for the current π.
Step 2. Update π by π(s) ← argmax_a Q^π(s, a).
Policy iteration always converges to the optimal policy π* if Q^π in Step 1 can be computed. However, Q^π is expressed as an infinite sum, so the question is: how do we compute Q^π in Step 1?
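A minimal sketch of this two-step loop, reusing STATES and ACTIONS from the maze sketch above; `evaluate` is a placeholder for any routine that returns Q^π (one way to implement it is sketched after the next slide):

```python
def greedy_policy(q):
    """Step 2: greedy improvement, pi(s) = argmax_a Q(s, a)."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}

def policy_iteration(evaluate, init_action="up", max_iter=100):
    """Alternate policy evaluation (Step 1) and greedy improvement (Step 2).

    `evaluate(policy)` must return a dict {(s, a): Q^pi(s, a)}.
    """
    policy = {s: init_action for s in STATES}
    for _ in range(max_iter):
        q = evaluate(policy)           # Step 1: compute Q^pi for the current pi
        new_policy = greedy_policy(q)  # Step 2: update pi greedily w.r.t. Q^pi
        if new_policy == policy:       # converged: the policy no longer changes
            return new_policy, q
        policy = new_policy
    return policy, q
```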

6 Bellman Equation
Q^π can be recursively expressed as Q^π(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) Σ_{a'} π(a'|s') Q^π(s', a'). This is called the Bellman equation, and Q^π can be computed by solving it. The drawback of this approach is that the dimensionality of the Bellman equation (|S| × |A|) becomes huge in large state and action spaces, so the computational cost is high.
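In the tabular case, the Bellman equation is a linear system in the |S|·|A| unknowns Q^π(s, a) and can be solved directly. A minimal sketch under the assumptions of the maze code above (deterministic transitions, NumPy already imported):

```python
def evaluate_policy_exactly(policy):
    """Solve (I - gamma * P_pi) q = r for the |S|*|A|-dimensional vector q."""
    sa_pairs = [(s, a) for s in STATES for a in ACTIONS]
    index = {sa: i for i, sa in enumerate(sa_pairs)}
    n = len(sa_pairs)
    P = np.zeros((n, n))
    r = np.zeros(n)
    for (s, a), i in index.items():
        s_next, reward = step(s, a)      # deterministic P(s'|s,a)
        r[i] = reward
        a_next = policy[s_next]          # action chosen by pi in s'
        P[i, index[(s_next, a_next)]] = 1.0
    q_vec = np.linalg.solve(np.eye(n) - GAMMA * P, r)
    return {sa: q_vec[i] for sa, i in index.items()}
```

Plugged into the earlier loop as `policy, q = policy_iteration(evaluate_policy_exactly)`, this reproduces plain policy iteration; the drawback noted on the slide is visible here, since the matrix being solved is |S|·|A| × |S|·|A|.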

7 Least-Squares Policy Iteration
(Lagoudakis and Parr, 2003) To overcome this drawback, least-squares policy iteration (LSPI) has been proposed. It uses a linear architecture: Q^π(s, a) is approximated by the linear combination Σ_{k=1}^{K} w_k φ_k(s, a), where the φ_k are fixed basis functions, the w_k are parameters, and K is the number of basis functions. The weights w are learned so as to optimally approximate the Bellman equation in the least-squares sense. The number of parameters is only K, which is much smaller than |S| × |A|. LSPI works well if we choose appropriate basis functions φ. Question: how do we choose φ?
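A schematic sketch of the least-squares fixed-point step used inside LSPI (often called LSTD-Q); the sample format, feature map `phi`, and ridge term are assumptions, not the authors' implementation:

```python
def lstdq(samples, phi, policy, gamma=0.9, ridge=1e-6):
    """Least-squares fixed-point approximation of Q^pi (as used in LSPI).

    samples: list of (s, a, r, s_next) transitions
    phi:     feature map phi(s, a) -> np.ndarray of length K
    policy:  dict mapping state -> action (the policy being evaluated)
    Returns the weight vector w such that Q^pi(s, a) ~= w . phi(s, a).
    """
    k = len(phi(*samples[0][:2]))
    A = ridge * np.eye(k)   # small ridge term for numerical stability (assumption)
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy[s_next])   # features of (s', pi(s'))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```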

8 Popular Choice: Gaussian Kernel (GK)
A popular choice of basis function is the Gaussian kernel K(s, s_c) = exp(-ED(s, s_c)^2 / (2σ^2)), where ED is the Euclidean distance and s_c is a centre state. The Gaussian kernel is smooth, but its tail goes over the partitions (the walls of the maze), and this causes a serious problem.
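A sketch of such Gaussian basis functions over the maze states; pairing each state kernel with an action indicator block is one common way to build φ(s, a), and the centres and width below are illustrative (ACTIONS comes from the maze sketch above):

```python
def euclidean_gaussian(s, centre, sigma=1.5):
    """Ordinary Gaussian kernel: exp(-ED(s, c)^2 / (2 sigma^2))."""
    ed = np.hypot(s[0] - centre[0], s[1] - centre[1])
    return np.exp(-ed ** 2 / (2.0 * sigma ** 2))

def make_phi(centres, kernel, actions=tuple(ACTIONS)):
    """phi(s, a): state kernels copied into the block of the taken action."""
    def phi(s, a):
        k_vals = np.array([kernel(s, c) for c in centres])
        out = np.zeros(len(centres) * len(actions))
        block = actions.index(a)
        out[block * len(centres):(block + 1) * len(centres)] = k_vals
        return out
    return phi
```

For example, `phi = make_phi(STATES[::3], euclidean_gaussian)` would give K = (number of centres) × (number of actions) basis functions that can be fed to `lstdq`.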

9 Approximated Value Function by GK
Optimal value function vs. approximation by GK (log scale, 20 randomly located Gaussians). When we approximate the optimal value function with Gaussian kernels, the values around the partitions are clearly not approximated well.

10 Policy Obtained by GK
Optimal policy vs. GK-based policy. Compared with the optimal policy, the Gaussian kernel provides an undesired policy around the partition.

11 Aim of This Research
As shown in the example, Gaussian tails go over the partition, so ordinary Gaussians are not suited for approximating discontinuous value functions. In this research, we propose a new Gaussian kernel to overcome this problem.

12 State Space as a Graph (Mahadevan, ICML 2005)
The ordinary Gaussian kernel uses the Euclidean distance, which does not incorporate the state space structure; the tail problem occurs precisely because this structure is ignored. We instead represent the state space structure by a graph and use the graph to define Gaussian kernels.
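One possible way to build such a graph from the maze above (illustrative; nodes are free cells, edges connect neighbouring free cells with unit length):

```python
def build_state_graph():
    """Adjacency list over free cells; edges only between reachable neighbours."""
    graph = {s: [] for s in STATES}
    for (i, j) in STATES:
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nb = (i + di, j + dj)
            if nb in graph:                      # neighbour exists and is not a wall
                graph[(i, j)].append((nb, 1.0))  # unit edge length (assumption)
    return graph
```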

13 Geodesic Gaussian Kernels
Gaussian kernels depend on a distance, and a natural distance on the graph is the shortest path SP(s, s_c). We therefore use the shortest path inside the Gaussian function, K(s, s_c) = exp(-SP(s, s_c)^2 / (2σ^2)), and call this kernel the geodesic Gaussian kernel. Two points that are close in Euclidean distance (the blue line) may not be directly connected on the graph, so their shortest path (the red line) can be much longer. The shortest path can be computed efficiently by Dijkstra's algorithm.
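A sketch of the geodesic Gaussian kernel: Dijkstra's algorithm gives the shortest-path distance from a kernel centre to every state of the graph built above, and that distance replaces the Euclidean distance inside the Gaussian (illustrative code, not the authors' implementation):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distance from `source` to every node of the graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def geodesic_gaussian(graph, centre, sigma=1.5):
    """Kernel k(s, c) = exp(-SP(s, c)^2 / (2 sigma^2)) built from graph distances."""
    sp = dijkstra(graph, centre)
    return lambda s: np.exp(-sp.get(s, float("inf")) ** 2 / (2.0 * sigma ** 2))
```

To reuse `make_phi` from the earlier sketch, one can wrap this as `kernel = lambda s, c: geodesic_gaussian(graph, c)(s)`; in practice the per-centre distance maps would be computed once and cached.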

14 Example of Kernels
Ordinary Gaussian vs. geodesic Gaussian. For example, when we put the kernel centre near a partition, the geodesic Gaussian kernel's tail does not go across the partition and its values decrease smoothly along the maze.

15 Comparison of Value Functions
Optimal, ordinary Gaussian, and geodesic Gaussian. When we approximate the optimal value function with geodesic Gaussian kernels, unlike with ordinary Gaussians, the values near the partition are well approximated and the discontinuity across the partition is preserved.

16 Comparison of Policies
Ordinary Gaussian vs. geodesic Gaussian. Geodesic Gaussian kernels provide the desired policies near the partition.

17 Experimental Result
Fraction of optimal states vs. number of kernels, averaged over 100 runs. For the ordinary Gaussian, a small kernel width avoids the tail problem but is less smooth, while a large width suffers from the tail problem. For the geodesic Gaussian, a small width is again less smooth, but a large width is both smooth and free of the tail problem.

18 Robot Arm Reaching Task: move the end effector to reach the object
The arm has two joints, and the task is to lead the end effector to a target object while avoiding an obstacle. The reward is +1 when the end effector touches the object and 0 otherwise. The state space is spanned by the two joint angles (joint 1 and joint 2, in degrees); a pose corresponds to a point in this space, and as the robot moves its joints the point moves through the free (white) areas. When the arm hits the obstacle the point hits the blocked (black) areas, and when the end effector reaches the object the point enters the goal (red) area.

19 Robot Arm Reaching: Ordinary Gaussian vs. Geodesic Gaussian
With the policy obtained by the ordinary Gaussian, the arm moves directly towards the object without avoiding the obstacle and hits it. With the policy obtained by the geodesic Gaussian, the arm successfully avoids the obstacle and reaches the object.

20 Khepera Robot Navigation
The Khepera robot has 8 infrared (IR) sensors measuring the distance to obstacles. The task is to explore an unknown maze without colliding with the various obstacles. The robot takes four actions: move forward, rotate left, rotate right, and move backward. To make the Khepera explore the world, the rewards are +1 for moving forward, -2 for a collision, and 0 otherwise.

21 State Space and Graph
We discretize the 8-dimensional sensor state space by a self-organizing map; for visualization, the state space is projected onto a 2-D subspace. We can see that there are partitions in this state space as well.
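The slide uses a self-organizing map to discretize the 8-dimensional sensor space; below is a minimal, generic SOM sketch in NumPy (grid size, learning schedule, and the random stand-in sensor data are assumptions, not the authors' settings):

```python
import numpy as np

def train_som(data, grid_w=10, grid_h=10, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal self-organizing map: maps D-dim samples onto a 2-D grid of units."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.uniform(data.min(0), data.max(0), size=(grid_h, grid_w, d))
    gy, gx = np.mgrid[0:grid_h, 0:grid_w]       # grid coordinates for the neighbourhood
    for t in range(n_iter):
        lr = lr0 * (1.0 - t / n_iter)
        sigma = sigma0 * (1.0 - t / n_iter) + 1e-3
        x = data[rng.integers(n)]
        # Best-matching unit (BMU): the unit whose weight is closest to the sample.
        dist = np.linalg.norm(weights - x, axis=2)
        by, bx = np.unravel_index(np.argmin(dist), dist.shape)
        # Gaussian neighbourhood around the BMU on the grid.
        h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights

def discretize(data, weights):
    """Assign each sample to the index of its best-matching unit (a discrete state)."""
    flat = weights.reshape(-1, weights.shape[-1])
    return np.argmin(np.linalg.norm(flat[None] - data[:, None], axis=2), axis=1)

# Example: 8-D sensor readings (random stand-ins here) mapped to discrete states.
sensor_data = np.random.rand(500, 8)
som_weights = train_som(sensor_data)
states = discretize(sensor_data, som_weights)
```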

22 Khepera Robot Navigation
Ordinary Gaussian vs. geodesic Gaussian. With the policy obtained by the ordinary Gaussian, when the robot faces an obstacle it goes backward and then forward again, over and over. With the policy obtained by the geodesic Gaussian, when the robot faces an obstacle it makes a turn and then goes forward.

23 Experimental Results
Averaged over 30 runs, the geodesic Gaussian outperforms the ordinary Gaussian.

24 Conclusion
Value function approximation needs good basis functions. The tail of the ordinary Gaussian kernel goes over discontinuities, whereas the geodesic Gaussian kernel is smooth along the state space. Through the experiments, we showed that the geodesic Gaussian kernel is promising in high-dimensional continuous problems. Thank you for listening to my presentation.

