1 Reinforcement Learning Rafy Michaeli, Assaf Naor. Supervisor: Yaakov Engel. For more info, visit the project's home page at: http://www.technion.ac.il/~smily/rl/index.html

2 Project Goals Study the field of Reinforcement Learning (RL) Gain practical experience with implementing RL algorithms Examine the influence of various parameters on the performance of RL algorithms

3 Overview Reinforcement Learning In RL problems, an agent (a decision-maker) attempts to control a dynamic system by choosing an action at every time interval

4 Overview, cont. Reinforcement Learning The agent receives feedback (a reinforcement) with every action it executes

5 Overview, cont. Reinforcement Learning The ultimate goal of the agent is to learn a strategy for selecting actions such that the overall performance is optimized according to a given criterion

6 Overview, cont. The Value function Given a fixed policy, which determines the action to be performed at a given state, this function assigns a value to every state in the state space (all possible states the system can have)

7 Overview, cont. The Value function The value of a state is defined as the weighted sum of the reinforcements received when starting at that state and following the given policy to a final state, where short-term reinforcements are weighted more heavily than long-term ones

8 Overview, cont. The Value function Or mathematically:
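In standard RL notation this definition can be written as follows (a conventional form; the discount factor γ with 0 ≤ γ < 1, the reinforcement r_t at step t, and the final step T are assumed symbols, not necessarily those used on the original slide):

$$V^{\pi}(s) = E\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t} \;\middle|\; s_{0}=s,\ \pi\right]$$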

9 Overview, cont. The Action Value Function or Q-Function Given a fixed policy, this function assigns a value to every (state, action) pair in the (state, action) space

10 Overview, cont. The Action Value Function or Q-Function The value of a pair (state s, action a) is defined as the weighted sum of the reinforcements received when executing action a at state s and then following the given policy for selecting actions in subsequent states

11 Overview, cont. The Action Value Function or Q-Function Or mathematically:
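In the same assumed notation, a conventional form of the Q-function is:

$$Q^{\pi}(s,a) = E\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t} \;\middle|\; s_{0}=s,\ a_{0}=a,\ \pi\right]$$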

12 Overview, cont. The learning algorithm Uses experiences to progressively learn the optimal value function, i.e. the function that predicts the best long-term outcome an agent can expect from a given state

13 Overview, cont. The learning algorithm The agent learns the optimal value function by continually exercising its current, non-optimal estimate of the value function and improving this estimate after every experience

14 Overview, cont. The learning algorithm Given the optimal value function, the agent can then derive the optimal policy by performing:
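A conventional form of this greedy policy extraction, assuming the agent can predict the reinforcement r(s, a) and the successor state s' for each candidate action (assumed notation, not necessarily the slide's exact expression), is:

$$\pi^{*}(s) = \arg\max_{a}\,\bigl[r(s,a) + \gamma\, V^{*}(s')\bigr]$$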

15 Description Surveyed the field of machine learning in general and focused on RL algorithms Implemented various RL algorithms on a chosen task, aiming to teach the agent the best way to perform the task

16 Description, cont. The task of the agent Given a car's initial location and velocity, bring it to a desired location with zero velocity, as quickly as possible!

17 Description, cont. The task of the agent System description: The car can move either forwards or backwards The agent can control the car's acceleration at every time interval

18 Description, cont. The task of the agent System description: Walls are placed on both sides of the track When the car hits a wall, it bounces back at the same speed it had prior to the collision
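The slides do not give the exact system equations; the following Python sketch illustrates one plausible discrete-time simulation of such a car, with a bounded track, controllable acceleration, and elastic walls (all constants here are hypothetical, not the project's values):

```python
# Hypothetical constants; the slides do not give the project's actual values.
DT = 0.1                    # length of one time interval
X_MIN, X_MAX = -1.0, 1.0    # wall positions on both sides of the track
X_GOAL, TOL = 0.5, 0.05     # desired location and tolerance on position/velocity

def step(x, v, a):
    """Advance the car one time interval under the chosen acceleration a."""
    v = v + a * DT
    x = x + v * DT
    # Elastic walls: the car bounces back at the same speed it had before the collision.
    if x < X_MIN:
        x, v = X_MIN, -v
    elif x > X_MAX:
        x, v = X_MAX, -v
    done = abs(x - X_GOAL) < TOL and abs(v) < TOL
    return x, v, done
```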

19 Description, cont. A sketch of the system

20 Description, cont. The code was written in MATLAB Performed experiments to determine the influence of different parameters on the learning algorithm (mainly on convergence and on how quickly the system learns the optimal policy) Tested the performance of CMAC as a function approximator (for both 1D and 2D functions)

21 Implementation issues Function approximators - Representing the Value/Q Function –Lookup Tables A finite ordered set of elements (a possible implementation is an array). Each element is uniquely associated with an index and is accessed through that index. Each region of a continuous state space is mapped to an element of the lookup table; thus all states within a region are aggregated into one table element and are therefore assigned the same value.
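A minimal sketch of this state aggregation, written in Python rather than the project's MATLAB, with hypothetical grid sizes and state ranges:

```python
import numpy as np

N_POS, N_VEL = 20, 20             # hypothetical number of cells per dimension
X_LIM, V_LIM = 1.0, 2.0           # hypothetical ranges of position and velocity
table = np.zeros((N_POS, N_VEL))  # one stored value per aggregated region

def cell_index(x, v):
    """Map a continuous (position, velocity) state to its lookup-table cell (uniform grid)."""
    i = int(np.clip((x + X_LIM) / (2 * X_LIM) * N_POS, 0, N_POS - 1))
    j = int(np.clip((v + V_LIM) / (2 * V_LIM) * N_VEL, 0, N_VEL - 1))
    return i, j

# All states falling inside the same region share the same stored value:
value = table[cell_index(0.3, -0.7)]
```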

22 Implementation issues, cont. Function approximators - Representing the Value/Q Function –Lookup Tables This mapping from the state space to the Lookup Table can be uniform or non-uniform. An example of a uniform mapping of the state space to cells

23 Implementation issues, cont. Function approximators - Representing the Value/Q Function –Cerebellar Model Articulation Controller (CMAC) Each state activates a specific set of memory locations (features); the value of the stored function is the arithmetic sum of their contents.
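A rough Python illustration of this idea, with hypothetical tiling sizes (a sketch, not the project's actual CMAC code): each state activates one memory location per tiling, and the stored value is the sum of the activated weights.

```python
import numpy as np

N_TILINGS, N_TILES = 8, 16        # hypothetical CMAC sizes
X_LIM, V_LIM = 1.0, 2.0           # hypothetical state ranges
weights = np.zeros((N_TILINGS, N_TILES, N_TILES))

def active_features(x, v):
    """Return the memory locations (one per tiling) activated by state (x, v)."""
    feats = []
    for t in range(N_TILINGS):
        offset = t / N_TILINGS                       # each tiling is shifted slightly
        i = int((x + X_LIM) / (2 * X_LIM) * (N_TILES - 1) + offset) % N_TILES
        j = int((v + V_LIM) / (2 * V_LIM) * (N_TILES - 1) + offset) % N_TILES
        feats.append((t, i, j))
    return feats

def value(x, v):
    """The stored function's value is the arithmetic sum of the activated locations."""
    return sum(weights[f] for f in active_features(x, v))
```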

24 A CMAC structure realization

25 Implementation issues, cont. Learning the optimal Value Function We wish to learn the optimal Value Function, from which we can deduce the optimal action policy

26 Implementation issues, cont. Learning the optimal Value Function Our learning algorithm was based on Temporal Difference (TD) methods

27 Implementation issues, cont. Learning the optimal Value Function We define the temporal difference as:
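In conventional notation (with value estimate V̂, reinforcement r_{t+1} received on the transition from s_t to s_{t+1}, and discount factor γ; the project's exact symbols may differ), the temporal difference at step t is usually written as:

$$\delta_t = r_{t+1} + \gamma\, \hat V(s_{t+1}) - \hat V(s_t)$$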

28 Implementation issues, cont. Learning the optimal Value Function At each time step we update the estimated Value Function by calculating:
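A conventional form of this update, with learning rate α (an assumed symbol):

$$\hat V(s_t) \leftarrow \hat V(s_t) + \alpha\, \delta_t$$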

29 Implementation issues, cont. Learning the optimal Value Function By definition, the optimal policy satisfies:
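One standard way to state this defining property (the optimal policy is at least as good as any other policy at every state):

$$V^{\pi^{*}}(s) \ge V^{\pi}(s) \quad \text{for every state } s \text{ and every policy } \pi$$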

30 Implementation issues, cont. Learning the optimal Value Function –TD(λ) and Eligibility Traces The TD rule as presented above is really an instance of a more general class of algorithms called TD(λ), corresponding to λ = 0.

31 Implementation issues, cont. Learning the optimal Value Function –TD(λ) and Eligibility Traces The general TD(λ) rule is similar to the TD rule given above, except that every state is updated in proportion to its eligibility trace.
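A conventional form of the TD(λ) update with accumulating eligibility traces e_t(s) (again assumed notation, not necessarily the slide's exact expression):

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \\ \gamma\lambda\, e_{t-1}(s) & s \neq s_t \end{cases} \qquad\qquad \hat V(s) \leftarrow \hat V(s) + \alpha\, \delta_t\, e_t(s) \ \ \text{for all } s$$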

32 Implementation issues, cont. Look-Up Table Implementation –We used a Look-Up Table to represent the Value Function, and acquired the optimal policy by applying the TD(λ) algorithm. –We used a non-uniform mapping of the state space to cells in the Look-Up Table, which enabled us to keep a rather small number of cells while still having a fine quantization around the origin.
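One way to realize such a non-uniform mapping (a hypothetical Python sketch, not the project's MATLAB code: bin edges are spaced quadratically so that cells are finer near the origin):

```python
import numpy as np

N_CELLS = 15   # hypothetical number of cells along one state dimension
# Quadratic spacing: many narrow cells near the origin, fewer wide cells near the track limits.
u = np.linspace(-1.0, 1.0, N_CELLS + 1)
edges = np.sign(u) * u ** 2

def nonuniform_index(x):
    """Index of the cell containing x, with finer quantization around the origin."""
    return int(np.clip(np.searchsorted(edges, x) - 1, 0, N_CELLS - 1))
```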

33 Implementation issues, cont. CMAC Implementation - 1 –CMAC is used to represent the Value Function and TD(λ) is the learning algorithm. CMAC Implementation - 2 –CMAC is used to represent the Q-Function and TD(λ) is the learning algorithm.

34 Implementation issues, cont. System simulation description We simulated each of the three implementations over a range of values of the learning parameters: for each value of one parameter we tested the system with several values of the other. The remaining parameter was held fixed at 1 throughout the simulations.

35 Simulation Results We define: –Success rate: the percentage of all tries in which the agent successfully brought the car to its destination with zero velocity

36 Simulation Results We define: –Average way: the average number of time intervals it took the agent to bring the car to its destination

37 Simulation Results Look-Up Table results –A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to the goal as learning progresses. –Holding one of the learning parameters fixed, it is hard to observe any differences between the results for different values of the other. –As the former parameter increases, the learning process improves, i.e. for a given try number, the success-rate and average-way results are better.

38 Simulation Results Look-Up Table results –It is noted that eventually, in all cases, the success rate reaches 100%, i.e. the agent consistently succeeds in bringing the car to its goal.

39 Look-Up Table performance summary

40 Simulation Results CMAC Q-Function results –A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to the goal as learning progresses. –For a given value of one learning parameter, better results were obtained for larger values of the other. –As the former parameter increases, the learning process is generally better.

41 Simulation Results CMAC Q-Function results –In most cases a 100% success rate is not reached, though it is reached in some cases. –In some cases the success rate decreases over a stretch of tries and then increases again.

42 CMAC Q-Learning performance summary

43 Simulation Results CMAC Value Iteration results –The figure below shows the results obtained by the CMAC Value Iteration implementation compared to the results already obtained for the CMAC Q-Learning implementation. The results are for the best parameter pair found in the previous experiments.

44 Simulation Results CMAC Value Iteration results A comparison between CMAC Q-Learning and CMAC Value Iteration performance

45 Simulation Results Learning process examples –Figure 1 shows the process of learning for a specific starting state and specific learning parameters. The figure shows the movement of the car after every few tries, for 150 consecutive time intervals. –Figure 2 demonstrates the system's ability (at the end of learning) to direct the car to its goal starting from different states.

46 Figure 1: The progress of learning for a specific starting state and a specific set of learning parameters

47 Figure 2: The system's performance from different starting states after try 20, for a fixed set of learning parameters

48 Conclusions In this project we implemented a family of RL algorithms, TD(λ), with two different function approximators: CMAC and a Look-Up Table.

49 Conclusions We examined the effect of the learning parameters on the overall performance of the system. –In the Look-Up Table implementation: one of the parameters does not have a significant impact on the results; as the other increases, the success rate increases more rapidly.

50 Conclusions We examined the effect of the learning parameters on the overall performance of the system. –In the CMAC implementation: as either parameter increases, the success rate increases and the average way decreases more rapidly.
