
Octopus Arm Mid-Term Presentation Dmitry Volkinshtein & Peter Szabo Supervised by: Yaki Engel

Contents
1. Project’s Goal
2. Octopus Arm and its Model
3. The Learning Process
4. The On-Line GPTD Algorithm
5. Project Development Stages
6. Program Structure
7. Our work so far
8. What’s left to be done

1. Project’s Goal Teach an octopus arm model to reach a given point in space.

2. Octopus Arm and its Model (1/4) An octopus is a carnivorous eight-armed sea creature with a large soft head and two rows of suckers on the underside of each arm (it usually lives on the bottom of the ocean).

2. Octopus Arm and its Model (2/4) An octopus arm is a muscular hydrostat: an organ capable of exerting force using muscles alone, without requiring a rigid skeleton.

2. Octopus Arm and its Model (3/4) The model simulates the physical behavior of the arm in its natural environment. It gives us the position of the arm, taking into account:
– Muscle forces.
– Internal forces that keep the arm’s volume constant (the arm is filled with liquid).
– Gravity and buoyancy (the vertical forces).
– The drag of the water.

2. Octopus Arm and its Model (4/4) The real octopus arm is continuous. This model approximates the arm by dividing it into segments and calculating the forces on each segment separately. The model we were given is the outcome of a previous project in this lab. It is 2-dimensional and written in C.

3. The Learning Process (1/4) We use Reinforcement Learning (RL) methods to teach our model:
– Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
– RL lets us program the agent by reward and punishment, without needing to specify how the task is to be achieved.

3. The Learning Process (2/4) In our case:
– The agent chooses which muscles to activate at a given time.
– The model provides the result of that activation (the next state of the arm).
– The reward the agent receives depends on the arm’s state.
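To make this interaction concrete, below is a minimal sketch of one learning episode as described above. This is not project code: the types and function names (ArmState, stepModel, chooseAction, updateValue and so on) are placeholders standing in for the actual model, agent and value-estimation modules.

```cpp
// Minimal sketch of the RL interaction loop (hypothetical interfaces, stubbed so it compiles).
#include <vector>

struct ArmState   { std::vector<double> segmentPositions; };  // arm configuration
struct Activation { std::vector<double> muscleForces; };      // chosen muscle commands

// Stubs standing in for the real modules.
ArmState   resetArm()                                            { return {}; }   // start pose
ArmState   stepModel(const ArmState& s, const Activation&)       { return s; }    // physics step
double     reward(const ArmState&)                               { return 0.0; }  // e.g. closeness of tip to goal
Activation chooseAction(const ArmState&)                         { return {}; }   // exploring policy
void       updateValue(const ArmState&, const ArmState&, double) {}               // e.g. On-Line GPTD update

void runEpisode(int maxSteps) {
    ArmState state = resetArm();
    for (int t = 0; t < maxSteps; ++t) {
        Activation a  = chooseAction(state);   // agent picks which muscles to activate
        ArmState next = stepModel(state, a);   // model returns the resulting arm state
        double r      = reward(next);          // reward depends on the arm's state
        updateValue(state, next, r);           // on-line value estimate update
        state = next;
    }
}
```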

3. The Learning Process (3/4) In RL, the agent chooses its action in each state according to a “policy”. To improve the policy, we need to estimate the “value” of each state under that policy. For this we use an Optimistic Policy Iteration (OPI) algorithm, meaning the policy changes at every iteration without waiting for the value estimate to converge.
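For reference, the “value” of a state under a policy is the standard discounted-return quantity: the expected sum of discounted rewards obtained when starting from that state and following the policy (γ is the same discount factor that appears in GPTD below):

```latex
V^{\pi}(x) \;=\; \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; x_0 = x,\ \pi \right], \qquad 0 \le \gamma < 1
```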

3. The Learning Process (4/4) For the OPI we will try two exploration methods:
– Probabilistic greedy
– Softmax
Since the model’s state space is continuous, we use the On-Line GPTD algorithm for value estimation.
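As an illustration of the two exploration rules over a finite set of candidate muscle activations, here is a small sketch (assuming “probabilistic greedy” refers to an ε-greedy rule; the discrete action set and the function names are ours, not the project’s code):

```cpp
// Sketch of the two exploration rules over a finite set of candidate actions.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

std::mt19937 rng{std::random_device{}()};

// Epsilon-greedy: with probability eps pick a random action, otherwise the best-valued one.
std::size_t epsilonGreedy(const std::vector<double>& actionValues, double eps) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < eps) {
        std::uniform_int_distribution<std::size_t> any(0, actionValues.size() - 1);
        return any(rng);
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < actionValues.size(); ++i)
        if (actionValues[i] > actionValues[best]) best = i;
    return best;
}

// Softmax: sample an action with probability proportional to exp(value / temperature).
std::size_t softmaxSelect(const std::vector<double>& actionValues, double temperature) {
    const double vmax = *std::max_element(actionValues.begin(), actionValues.end());
    std::vector<double> weights;
    weights.reserve(actionValues.size());
    for (double v : actionValues)
        weights.push_back(std::exp((v - vmax) / temperature));  // shift by max for numerical stability
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

A high temperature (or a high ε) makes the agent explore more; lowering them over time shifts it toward exploiting the current value estimate.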

4. On-Line GPTD Algorithm (1/4) TD(λ) – a family of algorithms in which temporal differences are used to estimate the value function on-line. GPTD – Gaussian Processes for TD learning: assume that the sequence of rewards is a Gaussian random process (with noise), and that the rewards we observe are samples of that process. The value function can then be estimated using Gaussian estimation and a kernel function.
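A sketch of the GPTD generative model (following Engel et al.’s formulation as we understand it): each observed reward is tied to the values of two consecutive states through a Bellman-like relation, the value function V is given a Gaussian-process prior with kernel k, and the noise is Gaussian:

```latex
r(x_t) \;=\; V(x_t) - \gamma\, V(x_{t+1}) + N_t,
\qquad V \sim \mathcal{GP}\bigl(0,\; k(\cdot,\cdot)\bigr),
\qquad N_t \sim \mathcal{N}(0, \sigma^2)
```

Conditioning this model on the observed rewards gives a Gaussian posterior over V, whose mean serves as the value estimate and whose variance measures our uncertainty.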

4. On-Line GPTD Algorithm (2/4) GPTD disadvantages (the kernel matrix grows with every observed sample, and must be stored and inverted):
– Space consumption of O(t^2).
– Time consumption of O(t^3).
The proposed solution: on-line sparsification applied to the GPTD algorithm.

4. On-Line GPTD Algorithm (3/4) On-line sparsification: instead of keeping a large number of results of a vector function (a function applied to a vector, yielding a vector), we keep a “dictionary” of input vectors whose images span, up to an accuracy threshold, the space of the original vector function.
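The admission test behind this dictionary is the approximate linear dependence (ALD) criterion: a new input is added only if its feature-space image is not already well approximated (up to the threshold) by the images of the current dictionary members. A sketch of the criterion, in our notation:

```latex
\delta_t \;=\; k(x_t, x_t) - \mathbf{k}_t^{\top} K_{t-1}^{-1} \mathbf{k}_t \;>\; \nu
\quad\Longrightarrow\quad \text{add } x_t \text{ to the dictionary}
```

where k_t is the vector of kernel values between x_t and the current dictionary members, K_{t-1} is the kernel matrix of the dictionary, and ν is the accuracy threshold.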

4. On-Line GPTD Algorithm (4/4) Applying on-line sparsification to the GPTD algorithm yields:
– Recursive update rules.
– No matrix inversion needed.
– Matrix dimensions that depend on m_t (the dictionary size at time t), which in general does not grow linearly with t.
Using these, we can compute the value estimate and its variance in O(m_t) and O(m_t^2) time, respectively.

5. Project Development Stages
1. Learning how to use the octopus arm model.
2. Understanding the theoretical basis (RL & On-Line GPTD).
3. Adjusting the model program to our needs.
4. Implementing the On-Line GPTD algorithm as a general-purpose module.
5. Implementing an agent that uses the model and the On-Line GPTD algorithm to perform the RL task.
6. Testing the learning program with different parameters to find optimal and interesting results:
– Model parameters (activations, times, lengths, number of segments, etc.)
– On-Line GPTD parameters (kernel functions, Gaussian noise variance, discount factor, accuracy threshold).
– Agent parameters (state exploration methods, goals, reward functions).
7. Conclusions.

6. Work done so far
– The model code was studied and adjusted to our needs.
– After studying the theoretical basis, a generic On-Line GPTD module was implemented.
– An agent supporting different exploration methods was implemented.
– All modules were successfully integrated in the C++ environment.

7. Program Structure [Block diagram showing the program components: Explorer, Agent, Arm Model, On-Line GPTD, Environment.]
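The original slide shows only a block diagram; as a purely illustrative guess at how those five components might be arranged in C++, here is a stubbed skeleton. Every class name, method and relationship below is our own assumption for illustration, not the project’s actual interface.

```cpp
// Illustrative skeleton of the diagrammed components (stubbed so it compiles).
#include <vector>

struct State  { std::vector<double> x; };   // arm configuration
struct Action { std::vector<double> u; };   // muscle activations

class ArmModel {                       // wraps the C physics model
public:
    State step(const State& s, const Action&) { return s; }        // stub
};

class Environment {                    // exposes the model to the agent, computes rewards
public:
    explicit Environment(ArmModel& m) : model(m) {}
    State  step(const State& s, const Action& a) { return model.step(s, a); }
    double reward(const State&) { return 0.0; }                    // stub, e.g. tip-to-goal distance
private:
    ArmModel& model;
};

class OnlineGPTD {                     // value estimation over the continuous state space
public:
    void   update(const State&, const State&, double) {}           // stub for the recursive update
    double value(const State&) { return 0.0; }                     // stub
};

class Explorer {                       // probabilistic greedy / softmax action selection
public:
    Action select(const State&, OnlineGPTD&) { return {}; }        // stub
};

class Agent {                          // ties everything together for one learning step
public:
    Agent(Environment& e, OnlineGPTD& v, Explorer& x) : env(e), gptd(v), explorer(x) {}
    State learnStep(const State& s) {
        Action a   = explorer.select(s, gptd);
        State next = env.step(s, a);
        gptd.update(s, next, env.reward(next));
        return next;
    }
private:
    Environment& env;
    OnlineGPTD&  gptd;
    Explorer&    explorer;
};
```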

8. Work left to be done
– Testing the learning program with different parameters to find optimal and interesting results, as specified earlier.
– Drawing conclusions.