Strategic Decisions Using Dynamic Programming


Strategic Decisions Using Dynamic Programming Nikolaos E. Pratikakis, Matthew Realff and Jay H. Lee The main objective of this talk is to introduce the intuition behind the dynamic programming methodology. By characterizing a decision as "strategic", we denote a decision that considers future uncertainty and tries to maximize the position of the system in the long run. The goal of DP is to maximize the expected reward (or minimize the expected cost) over that horizon.

Agenda Motivation Exemplary Manufacturing Job Shop Under Uncertain Demand and Product Yield Understanding Dynamic Programming (DP) via Chess Curse of Dimensionality (COD) A New Proposed Approach Based on Real Time DP (RTDP) Results Conclusions and Future Work First I will highlight the problems for which DP methods can bring significant performance improvements. Then I will introduce an exemplary manufacturing job shop under uncertain demand and product yield.

Decision Hierarchy in Process Industries [figure: hierarchy in process system industries, with Management decisions on the year/month/week time scale and Plant Operation Control on the day/min/sec time scale, along a complexity axis]

Manufacturing Job Shop Scheduling [figure: job shop flow diagram — Station 1 queue, Main Processing, Station 2 queue, Station 3 queue, Testing Area, Reconstruction Area, Completed Jobs, demand D, and the recirculation rate] Assume that our manufacturing problem runs with a given configuration at this time period. Our decisions will be effective from the next time period. So first we should decide how much raw material to order for the next time period, how many machines to have available at station 1 in the next time period, and the percentage of those that will be used. The raw materials are queued at station 1 and, once processed by the main processing area, are queued at station 2; they are then extensively tested at the testing area to see whether they meet the market requirements. The portion that does not meet the market requirements is reconstructed and forwarded back to station 2, while the portion that does meet the requirements is stocked to satisfy the demand. Likewise, at stations 2 and 3 the decision maker has the same decisions as at station 1. The uncertainty lies in the recirculation rate as well as the demand rate. We studied this example by considering two modes for the recirculation rate: the high recirculation mode (low throughput), which corresponds to the bad operational status of the system, and the low recirculation mode (high throughput), which corresponds to the good operational status of the system. We will formulate this problem as an MDP, but first we will show what the elements of an MDP formulation are.
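To make the state and decision variables of this job shop concrete, here is a minimal Python sketch of one possible encoding; the field names and the choice of what goes into the state and the action are illustrative assumptions, not the exact MDP formulation used in the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobShopState:
    """Illustrative MDP state for the job shop (assumed encoding)."""
    queue_1: int          # jobs waiting at station 1
    queue_2: int          # jobs waiting at station 2
    queue_3: int          # jobs waiting at station 3
    finished_stock: int   # completed jobs available to satisfy demand
    recirc_mode: int      # 0 = low recirculation (good status), 1 = high recirculation (bad status)

@dataclass(frozen=True)
class JobShopAction:
    """Illustrative per-period decision, effective from the next period."""
    raw_material_order: int   # raw material ordered into the station 1 queue
    machines_st1: int         # machines made available at station 1
    machines_st2: int         # machines made available at station 2
    machines_st3: int         # machines made available at station 3
    utilization_st1: float    # fraction of available station 1 machines used
    utilization_st2: float
    utilization_st3: float
```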

An Analogy…

System State for Chess A state is a configuration of the pieces on the board. [figure: several example board configurations, each marked as a system state]

System State for Job Shop [figure: the job shop diagram again — Station 1, 2 and 3 queues, Main Processing, Testing Area, Reconstruction Area, Completed Jobs, demand D, recirculation rate — with the system state indicated on it]

Control in DP Terms (2) Which control or action will maximize my future position? Action 1? Action 2? An expert can help you decide! How? By scoring the successor configurations of the board.

"Curse of Dimensionality" Curse of Dimensionality (COD): the size of S (a storage issue); for complex applications S is countably infinite; a large number of controls per system state. The research branch that focuses on alleviating the COD is termed Approximate DP. The obstacles to applying DP in its pure form are the size of the state space and the large number of controls. The computations for a single value-iteration sweep scale roughly as O(|S|^2 |A|), so if we do not have a finite state and action space it is virtually impossible to perform even one value iteration. The stream of literature that tries to minimize the effect of all these elements is called ADP. While heuristic policies have been used in most practical cases, dynamic programming (DP) [Put94, BeI, BeII] stands as the basic paradigm for constructing optimal policies for multi-stage optimization problems under uncertainty. The main difficulty in applying the DP approach in its pure form is the "curse of dimensionality", referring to the exponential increase in the size of the state space with the number of state variables, and of the action space with the number of decision variables. The curse of dimensionality has two direct implications for the applicability of the DP approach. First is the space complexity of storing the value function (e.g., the profit-to-go function) defined over the state space: it may be very difficult or even impossible to store the values for all states even with today's technology. Second, and more important, is the time needed to converge to an optimal solution. The computation for a single iteration pass increases with the size of both the state space and the action space. In general, we have exponential growth of computational time with respect to the dimensions of the state space and the action space, and solving the problem using a textbook approach like value iteration or policy iteration is largely impractical [BeII, Chang]. In Chapter 3 of this proposal, we propose an alternative way to address MDP formulations using a real-time dynamic programming approach for solving large-scale multi-stage decision problems under uncertainty. This approach can be used to find optimal solutions, but in many cases doing so will be prohibitively expensive. Instead we will use it to find suboptimal policies that improve significantly upon existing heuristics.
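As a rough illustration of how quickly the table blows up (with made-up discretization levels, not the actual problem sizes from the talk), even a modest encoding of the job shop state is far too large to enumerate:

```python
# Hypothetical curse-of-dimensionality arithmetic:
# suppose each of the three station queues and the finished stock
# can hold 0..50 jobs and the recirculation rate has 2 modes.
queue_levels = 51          # 0..50 inclusive
stock_levels = 51
recirc_modes = 2

num_states = queue_levels ** 3 * stock_levels * recirc_modes
print(f"states to tabulate: {num_states:,}")   # 13,530,402 value-table entries
# ...and every one of those states may admit thousands of candidate controls.
```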

Formal Definition of Value Function Given a policy $\pi$, the value function for state $s$ is the expected reward $V^\pi(s) = \mathbb{E}^\pi\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s\right]$. The optimal value function corresponds to $V^*(s) = \max_\pi V^\pi(s)$. Value functions are the solution of the optimality equations $V^*(s) = \max_a \mathbb{E}\left[r(s,a) + \gamma V^*(s')\right]$. This is a very important slide and defines the key concept of DP, which is the value function. Simply put, the value function of a policy π is its expected reward. Every policy π (Markovian, deterministic or randomized) corresponds, for the same state, to a different value function. The optimal or converged value function is attained by the policy π* that achieves the maximum expected reward. How can this policy π* be retrieved? If we simultaneously solve these optimality equations, first proposed by R. Bellman in 1957, we end up with the optimal value function for each of the states. Then we can find the optimal control using the optimal (converged) value function information; this is exactly what the equation $a^*(s) = \arg\max_a \mathbb{E}\left[r(s,a) + \gamma V^*(s')\right]$ denotes. A converged value function can be obtained with well-known methodologies such as value iteration, policy iteration, or linear programming. We will stress the obstacles of DP by analyzing the value iteration approach. The optimal action can easily be computed from the optimal value function.
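A minimal value-iteration sketch for a small tabular MDP, to make the Bellman recursion above concrete. The data layout (P and R as nested dictionaries) is an assumption for illustration, not the formulation used in the talk.

```python
def value_iteration(states, actions, P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a small MDP.

    P[s][a] -- list of (next_state, probability) pairs (assumed layout)
    R[s][a] -- expected one-step reward (assumed layout)
    Returns the converged value table V and a greedy policy.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: score every action and keep the best
            q_values = [R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                        for a in actions]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # greedy policy extracted from the converged value function
    policy = {
        s: max(actions, key=lambda a: R[s][a] + gamma *
               sum(p * V[s2] for s2, p in P[s][a]))
        for s in states
    }
    return V, policy
```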

Real Time Approximate Dynamic Programming [figure: loop diagram showing the Adaptive Action Set (AAS), the candidate optimal action α* for the current state, the possible successor states, and the uncertainty driving the transition; steps 1-6 of the loop are annotated on the diagram] We first start with a random state x_i, construct its Adaptive Action Set (AAS), and check every action in the AAS using the Bellman equations and our stored value table. We pick the greedy action a*, update the value table, store a* as the best known action for state x_i, and then sample from all the possible successor states that correspond to the transition from x_i under a*. We set the sampled successor as the current state (x_i ← x_j) and continue with the loop. Pratikakis, N.E., Realff, M.J. and Lee, J.H., "Strategic Capacity Decisions In Manufacturing Using Real-Time Adaptive Dynamic Programming", submitted to Naval Research Logistics.
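A hedged sketch of the real-time loop just described, using a plain Python dictionary as the value table. The helpers adaptive_action_set, transition_model and reward are hypothetical placeholders for the problem-specific pieces, so this shows the control flow rather than the authors' exact algorithm.

```python
import random

def rtadp_loop(initial_state, adaptive_action_set, transition_model, reward,
               gamma=0.95, n_iterations=10_000):
    """Real-time (trajectory-following) ADP sketch.

    adaptive_action_set(s)  -> iterable of candidate actions for state s
    transition_model(s, a)  -> list of (next_state, probability) pairs
    reward(s, a)            -> expected one-step reward
    All three are assumed, problem-specific callables.
    """
    V = {}                      # value table, grown only for visited states
    best_action = {}            # best known action per visited state
    s = initial_state
    for _ in range(n_iterations):
        # Steps 1-3: score every action in the AAS with a Bellman backup
        def q(a):
            return reward(s, a) + gamma * sum(p * V.get(s2, 0.0)
                                              for s2, p in transition_model(s, a))
        a_star = max(adaptive_action_set(s), key=q)
        # Step 4: greedy update of the value table for the current state only
        V[s] = q(a_star)
        best_action[s] = a_star
        # Steps 5-6: sample the next state from the transitions under a_star
        successors, probs = zip(*transition_model(s, a_star))
        s = random.choices(successors, weights=probs, k=1)[0]
    return V, best_action
```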

Results: Saturation of System States

Results: Performance

Conclusions & Future Directions RTADP is a computationally amenable way to create a high-quality policy for any given system. The quality of the solution exceeds that of traditional deterministic approaches. Extend the current framework to incorporate risk issues (risk measure: CVaR); a risk-aware RTADP framework promises to generate multiple strategies accounting for risk. We conclude that the RTADP approach is a computationally amenable way to create a high-quality policy for any given system. The adaptive action set with random actions ensures convergence in the long run without sacrificing performance. This generic algorithmic framework can be used for multistage optimization problems under uncertainty.
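Since CVaR is only named on the slide, here is a tiny illustrative snippet (an assumption about how the risk measure might be estimated from simulation, not part of the talk's framework) showing how CVaR at level alpha is computed from sampled episode rewards:

```python
import numpy as np

def cvar(rewards, alpha=0.05):
    """Conditional Value-at-Risk of sampled rewards at level alpha:
    the mean of the worst alpha fraction of outcomes."""
    rewards = np.sort(np.asarray(rewards, dtype=float))
    n_tail = max(1, int(np.ceil(alpha * len(rewards))))
    return rewards[:n_tail].mean()

# example: CVaR_5% of 10,000 simulated episode rewards
samples = np.random.default_rng(0).normal(100.0, 15.0, size=10_000)
print(cvar(samples))
```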

Questions …?

Approximate Dynamic Programming Sampling of the "relevant" state space through simulation (with known suboptimal policies). Fit a function approximator to the value function data for interpolation. Global [1,2] vs. local approximators [3]. Barto et al. [4] introduced real-time DP. To reduce the COD with respect to the state space, people have sampled the state space using suboptimal policies and then performed value iteration on the constructed state space, hoping that it overlaps with the relevant state space. The relevant state space is the set of states one would sample via simulation if the optimal policy π* were known. This approach can find very high-quality policies that exhibit performance very close to the one obtained with full DP. Another approach is to fit a global approximator to the value function data, but one needs a huge amount of data to accomplish that. Recently, Lee and Lee compared the usage of global approximators (neural networks) vs. local approximators (k-NN) in terms of stability and monotonicity during value iteration and suggested the usage of local approximators. A different approach, mainly applied to robotic path planning, is the RTDP approach proposed by Barto, who showed convergence of this approach for stochastic shortest path problems even if one works only on the relevant state space. Barto's approach is the parent of our suggested approach, and next we will show an overview of the RTDP algorithm. Several approaches have been proposed to reduce the computational burden of DP. A recent one is neuro-dynamic programming (NDP). The typical approach in the NDP and RL literature is to fit a global approximator (e.g., a neural network) to the value function data. There have been many examples in the literature of failure of such an approximator; the failure is attributed to "over-extending" the value function approximation, which was first explained by Thrun and Schwartz. The k-NN based approach in these examples exhibited stability, monotonicity, and a fast rate of convergence, whereas the neural network based approach did not exhibit such nice convergence behavior during value iteration.
References: 1. Bertsekas, D. P., Encyclopedia of Optimization, Kluwer, 2001. 2. Thrun, S. and Schwartz, A., Proceedings of the Fourth Connectionist Models Summer School, Hillsdale, NJ, Lawrence Erlbaum, 1993. 3. Lee, J. M. and Lee, J. H., International Journal of Control, Automation and Systems, vol. 2, no. 3, pp. 263-278, 2004. 4. Barto, A., Bradtke, S., and Singh, S., Artificial Intelligence, vol. 72, pp. 81-138, 1995.
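A small sketch of the local (k-NN) value-function approximation idea mentioned above: the value of a state not in the table is interpolated from the k nearest stored states. The feature encoding of states and the distance metric are illustrative assumptions, not those of Lee and Lee.

```python
import numpy as np

def knn_value(query_state, stored_states, stored_values, k=5):
    """Local value-function approximation by k-nearest neighbours.

    stored_states: (N, d) array of visited states as feature vectors
    stored_values: (N,) array of their current value estimates
    Returns a distance-weighted average of the k nearest stored values.
    """
    dists = np.linalg.norm(stored_states - query_state, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-8)     # closer states weigh more
    return float(np.average(stored_values[nearest], weights=weights))
```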

Overview of RTDP Algorithm The controller always follows a policy that is greedy with respect to the most recent estimate of J. Simulate the dynamics of the system. Update J according to $J(s) \leftarrow \max_a \mathbb{E}\left[r(s,a) + \gamma J(s')\right]$ for the currently visited state s only. In RTDP, the controller always follows a policy that is greedy with respect to the most recent estimate of J. Then we simulate the dynamics of the system and update J only for this state, preserving the rest of the value table. Before I present our proposed approach, we will emphasize its main differences from the previously mentioned schemes.

Future Directions The future directions are to evolve this approach by considering localized function approximations (k-Nearest Neighbors) and by using mathematical programming (e.g., MILP or chance-constrained programming) for generating candidate optimal controls. The preliminary results of this architecture are quite promising. We are also pursuing a systematic study of the effect of initialization on the convergence rate and on the exploration of the state space; I also have some preliminary results for that if you are interested offline. With that I will conclude my talk. Thank you for attending this talk…