Optimal Tuning of Continual Online Exploration in Reinforcement Learning. Youssef Achbany, Francois Fouss, Luh Yen, Alain Pirotte & Marco Saerens, Information Systems Research Unit (ISYS), Université de Louvain.

Presentation transcript:

Optimal Tuning of Continual Online Exploration in Reinforcement Learning Youssef Achbany, Francois Fouss, Luh Yen, Alain Pirotte & Marco Saerens Information Systems Research Unit (ISYS) Université de Louvain Belgium

Achbany Youssef - UCL 2 Outline: Introduction, Mathematical concepts, Modelling exploration by entropy, Optimal policy, Preliminary experiments, Conclusion and further work.

Achbany Youssef - UCL 3 Introduction One of the challenges of reinforcement learning is to manage the tradeoff between exploration and exploitation. Exploitation aims to capitalize on already well-established solutions. Exploration aims to continually try new ways of solving the problem, and is relevant when the environment is changing.

Achbany Youssef - UCL 4 Introduction Simple routing problem: the goal is to reach a destination node (13) from an initial node (1) while minimizing costs. Each node has a set of admissible actions, each with an associated weight (cost). We define a probability distribution on the set of admissible actions (a toy sketch follows below).
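As a purely illustrative sketch (the intermediate node numbers and all costs are invented; only the initial node 1 and destination 13 come from the slide), such a routing problem and its initial randomized action choice could be represented as follows:

```python
# Hypothetical routing graph: for each node, the admissible actions are the
# outgoing edges, each with an associated weight (cost).
graph = {
    1: {2: 3.0, 3: 1.0},        # from node 1 we may move to node 2 (cost 3) or node 3 (cost 1)
    2: {13: 2.0},
    3: {2: 1.0, 13: 4.0},
    13: {},                     # destination node: no further action needed
}

def uniform_policy(graph):
    """Initial probability distribution over the admissible actions of each node."""
    return {k: {i: 1.0 / len(a) for i in a} for k, a in graph.items() if a}

print(uniform_policy(graph))
```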

Achbany Youssef - UCL 5 Mathematical concepts We have a set of states, S = {1, 2, …, n}; s_t = k means that the system is in state k at time t. In each state s = k, we have a set of admissible control actions, U(k), so that u(k) ∈ U(k) is a control action available in state k.

Achbany Youssef - UCL 6 Mathematical concepts When we choose action u(s_t) in state s_t, a bounded cost C(u(s_t)|s_t) < ∞ is incurred and the system jumps to state s_{t+1} = f(u(s_t)|s_t), where f is a function. We suppose the network of states does not contain any negative cycle.

Achbany Youssef - UCL 7 Mathematical concepts For each state s, we define a probability distribution on the set of admissible actions, P(u(s)|s), meaning that the choice of action is randomized. This introduces exploration, not only exploitation, and is the main contribution of our work.

Achbany Youssef - UCL 8 Mathematical concepts For instance, if in state s = k there are three admissible actions u_k^1, u_k^2, u_k^3, the probability distribution P(u(k)|s = k) involves three values: P(u_k^1|k), P(u_k^2|k) and P(u_k^3|k). [Slide figure: node k with its three admissible actions and their probabilities.]

Achbany Youssef - UCL 9 Mathematical concepts The policy π is defined as the set of all probability distributions for all states.

Achbany Youssef - UCL 10 Mathematical concepts The goal is to reach a destination state, s = d, from an initial state, s_0 = k_0, while minimizing the total expected cost. The expectation is taken over the policy, that is, over all the random variables u(k) associated with the states.

Achbany Youssef - UCL 11 Mathematical concepts In other words, we have to determine the best policy π that minimizes V^π(k_0), that is, the best probability distributions. This is standard, except for the fact that we introduce choice randomisation.

Achbany Youssef - UCL 12 Mathematical concepts We now introduce a way to control exploration: the degree of exploration, E_k, defined for each state k, which is the entropy of the probability distribution of actions in this state.

Achbany Youssef - UCL 13 Modelling exploration by entropy The degree of exploration, E_k, is defined as the entropy of the action distribution at state k (see the formula below). The minimum is 0 (no exploration); the maximum is log(n_k), where n_k is the number of admissible actions in state k (full exploration).
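The formula itself was displayed as an image on the slide; the standard Shannon entropy it refers to, consistent with the stated bounds 0 and log(n_k), is

    E_k = -\sum_{i \in U(k)} P(i \mid k) \, \log P(i \mid k)

where the sum runs over the admissible actions in state k.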

Achbany Youssef - UCL 14 Modelling exploration by entropy The exploration rate is defined by normalizing E_k (see the formula below) and takes its value between 0 (no exploration) and 1 (full exploration).
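The defining formula did not transcribe; normalizing the entropy by its maximum is the natural reading consistent with the stated 0 to 1 range:

    \text{exploration rate}_k = \frac{E_k}{\log(n_k)}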

Achbany Youssef - UCL 15 Modelling exploration by entropy The goal now is to determine the optimal policy under exploration constraints, that is, to seek, among all policies, the policy π* for which the expected cost V^π(k_0) is minimal while guaranteeing a given degree of exploration (entropy) in each state k.

Achbany Youssef - UCL 16 Modelling exploration by entropy In other words, we minimize the expected cost subject to the entropy constraints (stated formally below), where the E_k are provided/fixed by the user/designer; they control the degree of exploration at each node k.
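The displayed optimization problem was an image; reconstructed from the surrounding text (the exact notation is an assumption), it reads

    \pi^* = \operatorname*{arg\,min}_{\pi} V^{\pi}(k_0) \quad \text{subject to} \quad -\sum_{i \in U(k)} P(i \mid k) \log P(i \mid k) = E_k \quad \text{for every state } k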

Achbany Youssef - UCL 17 Modelling exploration by entropy Thus, we route the agents as fast as possible, while exploring the network

Achbany Youssef - UCL 18 Optimal policy Here are the necessary optimality conditions (for a local minimum), very similar to Bellman's equations (a reconstruction is given below). V^{π*}(k) is the optimal expected cost from state k, and P(i|k) is the probability of choosing action i, satisfying the entropy constraint through the parameter θ_k.
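The conditions themselves were displayed as an image. Based on this slide's description and the two limiting cases on slides 21 and 22, they plausibly take the following Boltzmann (softmax) form, writing c(k,i) for the cost of action i in state k and f(k,i) for the resulting state; this is a reconstruction, not a verbatim copy of the slide:

    P(i \mid k) = \frac{\exp\!\left[-\theta_k \left(c(k,i) + V^{\pi^*}(f(k,i))\right)\right]}{\sum_{j \in U(k)} \exp\!\left[-\theta_k \left(c(k,j) + V^{\pi^*}(f(k,j))\right)\right]}

    V^{\pi^*}(k) = \sum_{i \in U(k)} P(i \mid k) \left[c(k,i) + V^{\pi^*}(f(k,i))\right]

with θ_k chosen in each state so that the entropy of P(· | k) equals the prescribed E_k.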

Achbany Youssef - UCL 19 Optimal policy These conditions lead to the following updating rules (an illustrative sketch follows below); convergence has been proved in a stationary environment.
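As an illustration only (not the authors' code), here is a minimal Python sketch of how such updates could be iterated on a toy routing graph. The graph, the fixed θ value, and the sweep count are assumptions made for the example, whereas the paper tunes θ_k per state so that the entropy of P(·|k) matches the prescribed E_k:

```python
import math

# Toy routing graph: graph[k] maps each admissible action (successor node)
# to its cost; node 13 plays the role of the destination, as in the slides.
graph = {
    1: {2: 3.0, 3: 1.0},
    2: {13: 2.0},
    3: {2: 1.0, 13: 4.0},
    13: {},
}
dest = 13
theta = 2.0          # fixed inverse temperature (assumption; the paper adapts it per state)

V = {k: 0.0 for k in graph}      # expected cost-to-go estimates
for _ in range(100):             # repeated sweeps until the values stabilize
    for k, actions in graph.items():
        if k == dest or not actions:
            continue
        # cost of each action plus current estimate of the cost-to-go from the successor
        q = {i: c + V[i] for i, c in actions.items()}
        # Boltzmann (softmax) action probabilities; subtract the minimum for numerical stability
        m = min(q.values())
        w = {i: math.exp(-theta * (q[i] - m)) for i in q}
        z = sum(w.values())
        P = {i: w[i] / z for i in w}
        # value update: expected cost under the randomized policy
        V[k] = sum(P[i] * q[i] for i in P)

print(V)   # approximate expected cost-to-go from each node under exploration
```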

Achbany Youssef - UCL 20 Optimal policy This updating rule has a nice interpretation: route the agents preferentially (with probability P(i|k)) towards the states from which the expected cost is minimal, including the direct cost for reaching those states.

Achbany Youssef - UCL 21 Optimal policy If θ_k is large (zero entropy: no exploration), we obtain the recurrence below, which is the common value iteration algorithm, or Bellman's equation, for finding the shortest path.
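In that limit the softmax concentrates all probability on the best action, so the update (again a reconstruction of the slide's displayed equation) reduces to

    V^{\pi^*}(k) = \min_{i \in U(k)} \left[ c(k,i) + V^{\pi^*}(f(k,i)) \right]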

Achbany Youssef - UCL 22 Optimal policy If θ_k is zero (maximum entropy: full exploration), we perform a blind exploration: we estimate the « average first passage time » without taking the costs into consideration when choosing actions (see the formula below, where n_k is the number of admissible actions in state k).
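In that limit the action choice is uniform over the n_k admissible actions, so the update plausibly reads (a reconstruction of the slide's displayed equation)

    V^{\pi^*}(k) = \frac{1}{n_k} \sum_{i \in U(k)} \left[ c(k,i) + V^{\pi^*}(f(k,i)) \right]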

Achbany Youssef - UCL 23 Advantages of our algorithm Our strategy could be interesting if the environment is changing and there is a need for continuous exploration. Indeed, if no exploration is performed, the agent will not notice the changes unless they occur on the shortest path, so that the policy will not be adjusted. In other words, we propose an optimal exploration/exploitation trade-off.

Achbany Youssef - UCL 24 Preliminary experiments Simple network routing problem in a dynamic, uncertain environment.

Achbany Youssef - UCL 25 Preliminary experiments Exploration rate of 0% for all nodes (no exploration)

Achbany Youssef - UCL 26 Preliminary experiments Entropy rate of 30% for all nodes

Achbany Youssef - UCL 27 Preliminary experiments Entropy rate of 60% for all nodes

Achbany Youssef - UCL 28 Preliminary experiments Entropy rate of 90% for all nodes

Achbany Youssef - UCL 29 Preliminary experiments Other experimental simulations are provided in: Tuning continual exploration in reinforcement learning (Technical report submitted for publication). /Achbany2005a.pdf

Achbany Youssef - UCL 30 Conclusion In this work, we presented a model integrating both exploration and exploitation in a common framework. The exploration rate is controlled by the entropy of the choice probability distribution defined on the states of the system. When no exploration is performed (zero entropy on each node), the model reduces to the common value iteration algorithm computing the minimum cost policy. On the other hand, when full exploration is performed (maximum entropy on each node), the model reduces to a "blind" exploration, without considering the costs.

Achbany Youssef - UCL 31 Further work This model has been extended to stochastic shortest-path problems, discounted problems, acyclic graphs, and edit-distances between strings. We are also developing links with Q-learning.

Achbany Youssef - UCL 32 Thank you !!!