Octopus Arm Mid-Term Presentation
Dmitry Volkinshtein & Peter Szabo
Supervised by: Yaki Engel

Contents
1. Project’s Goal
2. Octopus Arm and its Model
3. The Learning Process
4. The On-Line GPTD Algorithm
5. Project Development Stages
6. Work Done So Far
7. Program Structure
8. Work Left to Be Done

1. Project’s Goal
Teach an octopus arm model to reach a given point in space.

2. Octopus Arm and its Model (1/4)
An octopus is a carnivorous eight-armed sea creature with a large soft head and two rows of suckers on the underside of each arm. It usually lives on the bottom of the ocean.

2. Octopus Arm and its Model (2/4)
An octopus arm is a muscular hydrostat – an organ capable of exerting force with the sole use of muscles, without requiring a rigid skeleton.

2. Octopus Arm and its Model (3/4)
The model simulates the physical behavior of the arm in its natural environment. It gives us the position of the arm, taking into account:
–Muscle forces.
–Internal forces that keep the arm’s volume constant (the arm is filled with liquid).
–Gravity and buoyancy (the vertical forces).
–The drag of the water.

2. Octopus Arm and its Model (4/4)
The real octopus arm is continuous. This model approximates the arm by dividing it into segments and calculating the forces on each segment separately. The model we were given is the outcome of a previous project in this lab. It is 2-dimensional and written in C.

3. The Learning Process (1/4)
We use Reinforcement Learning (RL) methods to teach our model:
–Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
–RL lets us program an agent by reward and punishment, without needing to specify how the task is to be achieved.

3. The Learning Process (2/4)
In our case:
–The agent chooses which muscles to activate at a given time.
–The model provides the result of the activation (the next state of the arm).
–The reward the agent gets depends on the arm’s state.
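The interaction loop above can be sketched as follows. This is an illustrative toy stand-in, not the project’s actual code: ToyArmEnv is an invented 1-D "arm" whose state is just the tip position, and GreedyAgent is a hand-coded (non-learning) agent.

```python
class ToyArmEnv:
    """Illustrative 1-D stand-in for the arm model: state is the tip position."""
    def __init__(self, goal=5.0):
        self.goal = goal
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):            # action: -1, 0, or +1
        self.state += action           # "next state of the arm"
        return self.state

    def reward(self, state):           # reward depends on the arm's state
        return -abs(self.goal - state)

class GreedyAgent:
    """Hand-coded agent that always moves the tip toward the goal."""
    def __init__(self, goal=5.0):
        self.goal = goal

    def choose_action(self, state):
        if state < self.goal:
            return 1
        if state > self.goal:
            return -1
        return 0

def run_episode(env, agent, steps=10):
    """One episode of the agent-model-reward loop described above."""
    state = env.reset()
    total = 0.0
    for _ in range(steps):
        action = agent.choose_action(state)  # which muscles to activate
        state = env.step(action)             # model returns the next state
        total += env.reward(state)           # reward from the arm's state
    return state, total
```

A real run would replace ToyArmEnv with the physical arm model and GreedyAgent with a learning agent.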

3. The Learning Process (3/4)
In RL, the agent chooses its action in each state according to a “policy”. In order to improve the policy, we must estimate the “value” of each state under that policy. For this we use an Optimistic Policy Iteration (OPI) algorithm, meaning the policy changes in each iteration without waiting for the value estimate to converge.
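In the spirit of OPI, the sketch below interleaves single value updates with immediate greedy policy improvement, rather than running value estimation to convergence first. The toy chain MDP, the opi function, and all parameters are invented for illustration and are not the project’s code.

```python
import random

N_STATES, GOAL = 5, 4   # toy chain: states 0..4, goal at the right end

def chain_step(state, action):
    """Deterministic toy dynamics: action is -1 (left) or +1 (right)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def opi(episodes=200, alpha=0.2, gamma=0.9, epsilon=0.2, seed=0):
    """Optimistic-style iteration: the greedy policy is re-derived from
    the current value estimates after every single update."""
    random.seed(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in (-1, 1)}
    for _ in range(episodes):
        s = random.randrange(GOAL)          # random start speeds exploration
        while s != GOAL:
            if random.random() < epsilon:   # occasional exploration
                a = random.choice((-1, 1))
            else:                           # exploit current estimates
                a = max((-1, 1), key=lambda act: q[(s, act)])
            nxt, r = chain_step(s, a)
            best_next = max(q[(nxt, -1)], q[(nxt, 1)])
            # one value update; the greedy policy improves immediately
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = nxt
    return q
```

After training, the greedy action in every non-goal state points toward the goal, even though the values were never fully converged between policy changes.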

3. The Learning Process (4/4)
For the OPI, we will try two exploration methods:
–Probabilistic greedy
–Softmax
Since the model’s state space is continuous, we use the On-Line GPTD algorithm for the value estimation.
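The two exploration rules might look like the sketch below. Here “probabilistic greedy” is read as epsilon-greedy; the function names, parameters, and that reading are assumptions for illustration.

```python
import math
import random

def softmax_probs(values, temperature=1.0):
    """Boltzmann distribution over actions: P(a) is proportional to
    exp(V(a) / T). Subtracting the max keeps exp() from overflowing."""
    mx = max(values)
    prefs = [math.exp((v - mx) / temperature) for v in values]
    total = sum(prefs)
    return [p / total for p in prefs]

def softmax_action(values, temperature=1.0):
    """Sample an action index from the softmax distribution."""
    r, cum = random.random(), 0.0
    for action, p in enumerate(softmax_probs(values, temperature)):
        cum += p
        if r < cum:
            return action
    return len(values) - 1

def epsilon_greedy_action(values, epsilon=0.1):
    """Probabilistic greedy (epsilon-greedy reading): with probability
    epsilon explore uniformly, otherwise take the best-valued action."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])
```

Low temperatures make softmax nearly greedy; high temperatures make it nearly uniform, which gives a smoother exploration knob than epsilon-greedy’s all-or-nothing switch.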

4. On-Line GPTD Algorithm (1/4)
TD(λ) – a family of algorithms in which temporal differences are used to estimate the value function on-line.
GPTD – Gaussian Processes for TD learning: we assume that the sequence of rewards is a Gaussian random process (with noise), and that the rewards we get are samples of that process. We can then estimate the value function using Gaussian estimation and a kernel function.
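As a sketch of the Gaussian-estimation idea (plain GP regression rather than the full GPTD construction): given noisy samples y_i of an unknown function at inputs x_i, the posterior mean at a new point x* is k(x*)^T (K + sigma^2 I)^{-1} y, where K is the kernel matrix. The kernel choice and helper names below are illustrative.

```python
import math

def kernel(a, b, width=1.0):
    """Gaussian (RBF) kernel on scalar inputs."""
    return math.exp(-(a - b) ** 2 / (2 * width ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c]
                              for c in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior_mean(xs, ys, x_star, noise=0.01):
    """Posterior mean k(x*)^T (K + noise*I)^{-1} y of a GP regressor."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, ys)                     # (K + noise*I)^{-1} y
    return sum(alpha[i] * kernel(xs[i], x_star) for i in range(n))
```

With zero noise the posterior mean interpolates the samples exactly; with noise it smooths them, which is the behavior GPTD exploits for noisy reward sequences.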

4. On-Line GPTD Algorithm (2/4)
GPTD disadvantages:
–Space consumption of O(t²).
–Time consumption of O(t³).
The proposed solution: on-line sparsification applied to the GPTD algorithm.

4. On-Line GPTD Algorithm (3/4)
On-line sparsification: instead of keeping a large number of results of a vector-valued function (a function applied to a vector, yielding a vector), we keep a “dictionary” of input vectors whose images can span, up to an accuracy threshold, the space of the original function’s results.
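The dictionary test can be sketched as an approximate-linear-dependence check in the kernel’s feature space: a new input is added only if the current dictionary cannot approximate its feature vector within the accuracy threshold nu. This minimal version (scalar inputs, Gaussian kernel, all names illustrative) follows that idea.

```python
import math

def kernel(a, b, width=1.0):
    """Gaussian (RBF) kernel on scalar inputs."""
    return math.exp(-(a - b) ** 2 / (2 * width ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c]
                              for c in range(i + 1, n))) / M[i][i]
    return x

def maybe_add(dictionary, x, nu=0.1):
    """Approximate-linear-dependence test: add x to the dictionary only
    if its kernel feature vector cannot be spanned by the dictionary's
    features to within the accuracy threshold nu."""
    if not dictionary:
        dictionary.append(x)
        return True
    K = [[kernel(a, b) for b in dictionary] for a in dictionary]
    k_x = [kernel(a, x) for a in dictionary]
    coeffs = solve(K, k_x)                  # best linear combination
    # residual error of approximating x's feature vector
    delta = kernel(x, x) - sum(c * k for c, k in zip(coeffs, k_x))
    if delta > nu:
        dictionary.append(x)
        return True
    return False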

4. On-Line GPTD Algorithm (4/4)
Applying on-line sparsification to the GPTD algorithm yields:
–Recursive update rules.
–No matrix inversion needed.
–Matrix dimensions that depend on m_t (the dictionary size at time t), which generally does not grow linearly with t.
Using these, we can calculate the value estimate and its variance in O(m_t) and O(m_t²) time, respectively.

5. Project Development Stages
1. Learning the usage of the octopus arm model.
2. Understanding the theoretical basis (RL & On-Line GPTD).
3. Adjusting the model program to our needs.
4. Implementing the On-Line GPTD algorithm for general purposes.
5. Implementing an agent that will use the model and the On-Line GPTD algorithm to perform the RL task.
6. Testing the learning program with different parameters to find optimal and interesting results:
–Model parameters (activations, times, lengths, number of segments, etc.)
–On-Line GPTD parameters (kernel functions, Gaussian noise variance, discount factor, accuracy threshold)
–Agent parameters (state exploration methods, goals, reward functions)
7. Conclusions.

6. Work Done So Far
–Model code learned and adjusted to our needs.
–After studying the theoretical basis, a generic On-Line GPTD module was implemented.
–An agent supporting different exploration methods was implemented.
–All modules were successfully integrated in the C++ environment.

7. Program Structure
[Diagram: Explorer Agent, Arm Model, On-Line GPTD, Environment]

8. Work Left to Be Done
–Testing the learning program with different parameters to find optimal and interesting results, as specified earlier.
–Conclusions.
