Stochastic and non-Stochastic Games – a survey

Presentation transcript:

Stochastic and non-Stochastic Games – a survey Uri Zwick Tel Aviv University Based on joint work with Oliver Friedmann, Thomas Dueholm Hansen, Marcin Jurdzinski, Peter Bro Miltersen, Mike Paterson

More precisely… 2-player, zero-sum, finite-state, full-information, turn-based, infinite-horizon, total-payoff stochastic games. For short: 2TBSG.

Overview of talk What are stochastic games? Why are they interesting? How can we solve stochastic games? Relation to other problems Many intriguing open problems

Turn-based Stochastic Games [Shapley ’53] [Gillette ’57] … [Condon ’92] Total reward version Discounted version Limiting average version Both players have optimal positional strategies Can optimal strategies be found in polynomial time?
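For the discounted version, the values satisfy the Bellman-Shapley optimality equations; a standard way to write them (states i, action sets A(i), rewards r, transition probabilities p, and discount factor γ; this notation is assumed here, not taken from the slides):

    v(i) = \max_{a \in A(i)} \Big[ r(i,a) + \gamma \sum_j p(j \mid i,a)\, v(j) \Big]  \quad \text{if } i \text{ is controlled by MAX},
    v(i) = \min_{a \in A(i)} \Big[ r(i,a) + \gamma \sum_j p(j \mid i,a)\, v(j) \Big]  \quad \text{if } i \text{ is controlled by MIN}.

Positional optimal strategies are then obtained by picking, at every state, an action attaining the max (respectively the min).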

Strategies A deterministic strategy specifies which action to take given every possible history. A mixed strategy is a probability distribution over deterministic strategies. A memoryless strategy is a strategy that depends only on the current state. A positional strategy is a deterministic memoryless strategy.

Values When both strategies σ and τ are positional we get a Markov chain, and y_i(σ,τ) is easy to compute. Both players have positional optimal strategies: there are positional strategies that are optimal for every starting position.

Markov Decision Processes [Bellman ’57] [Howard ’60] … Total reward version Discounted version Limiting average version Optimal positional strategies can be found using LP Is there a strongly polynomial time algorithm?

Stochastic shortest paths (SSPs) Minimize the expected cost of getting to the target

SG ∈ NP ∩ co-NP Deciding whether the value of a state is at least (at most) v is in NP ∩ co-NP. To show that the value is ≥ v, guess an optimal strategy σ for MAX. Find an optimal counter-strategy τ for MIN by solving the resulting MDP. Is the problem in P?

Turn-based non-Stochastic Games [Ehrenfeucht-Mycielski (1979)]… Total reward version Discounted version Limiting average version Both players have optimal positional strategies Still no polynomial time algorithms known!

Why are stochastic games interesting? Intrinsically interesting… Model long term planning in adversarial and stochastic environments. Applications in AI and OR. Applications in automatic verification and automata theory. Competitive versions of well-known optimization problems. Perhaps the simplest combinatorial problem in NP ∩ co-NP not known to be in P. Many intriguing open problems.

Algorithms for stochastic games Value Iteration: Solve a finite-horizon game using dynamic programming. Let the horizon tend to infinity. Polynomial, but not strongly polynomial, in the game description and in 1/(1−γ).
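A minimal sketch of value iteration for the one-player (MDP) case, assuming a tabular model in which actions[s] lists the actions at state s, r[s][a] is a reward and P[s][a] is a dict mapping successor states to probabilities; all names are illustrative, not from the slides:

    def value_iteration(n_states, actions, P, r, gamma, iters):
        """Tabular value iteration for a discounted MDP.
        Returns approximate optimal values and a greedy positional policy."""
        v = [0.0] * n_states
        for _ in range(iters):
            # One dynamic-programming (Bellman) update per state, using the previous values.
            v = [max(r[s][a] + gamma * sum(p * v[t] for t, p in P[s][a].items())
                     for a in actions[s])
                 for s in range(n_states)]
        # Read off a positional policy that is greedy with respect to the final values.
        policy = [max(actions[s],
                      key=lambda a: r[s][a] + gamma * sum(p * v[t] for t, p in P[s][a].items()))
                  for s in range(n_states)]
        return v, policy

For the two-player game the same update is used, with min instead of max at the states controlled by the minimizer.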

Algorithms for stochastic games Policy/Strategy Iteration: Construct a sequence of strategies for one of the players, each improving on the previous one. Strategies improved by performing local improving steps. Guaranteed to terminate in a finite number of steps with an optimal strategy. How exactly do we improve a strategy? Is number of steps polynomial? [Howard (1960)]

Evaluating an MDP strategy/policy MDP + policy ⇒ Markov chain. Values of a fixed policy can be found by solving a system of linear equations.
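Concretely, fixing a policy π induces a Markov chain with transition matrix P_π and one-step rewards r_π, and the values solve (I − γP_π)v = r_π; a minimal numpy sketch (the matrix and vector are assumed to be given):

    import numpy as np

    def evaluate_policy(P_pi, r_pi, gamma):
        """Values of a fixed policy in a discounted MDP.
        P_pi: n-by-n transition matrix of the induced Markov chain.
        r_pi: length-n vector of one-step rewards under the policy.
        Solves the linear system (I - gamma * P_pi) v = r_pi."""
        n = len(r_pi)
        return np.linalg.solve(np.eye(n) - gamma * np.asarray(P_pi), np.asarray(r_pi))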

Improving switches

Best improving switches

Dual LP formulation for stochastic shortest paths [d’Epenoux (1964)]

Primal LP formulation for stochastic shortest paths [d’Epenoux (1964)]
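The LP itself is shown only as an image in the transcript; one standard primal formulation for SSPs, assuming every policy reaches the target t with probability 1, with costs c(i,a) and transition probabilities p(j|i,a), is:

    \max \sum_{i \ne t} v_i
    \text{subject to } v_i \le c(i,a) + \sum_j p(j \mid i,a)\, v_j \quad \forall i \ne t,\ a \in A(i), \qquad v_t = 0.

At the optimum, v_i is the minimum expected cost of reaching the target from i; the dual variables can be read as expected numbers of uses of the state-action pairs.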

Multiple improving switches

Policy iteration for MDPs [Howard ’60]
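The slide's own pseudocode is not reproduced in the transcript; a minimal sketch of Howard's Switch-All policy iteration for a discounted MDP, combining exact policy evaluation with all improving switches (same illustrative tabular model as in the value-iteration sketch above; actions[s] is assumed to be a list):

    import numpy as np

    def policy_iteration(n_states, actions, P, r, gamma):
        """Howard's policy iteration (Switch-All) for a discounted MDP."""
        policy = [actions[s][0] for s in range(n_states)]      # arbitrary initial policy
        while True:
            # Evaluate the current policy exactly: solve (I - gamma * P_pi) v = r_pi.
            P_pi = np.array([[P[s][policy[s]].get(t, 0.0) for t in range(n_states)]
                             for s in range(n_states)])
            r_pi = np.array([r[s][policy[s]] for s in range(n_states)])
            v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

            # One-step lookahead value of playing action a at state s and then following v.
            def lookahead(s, a):
                return r[s][a] + gamma * sum(p * v[t] for t, p in P[s][a].items())

            # Perform every improving switch (greedy choice at every state).
            new_policy = [max(actions[s], key=lambda a: lookahead(s, a))
                          for s in range(n_states)]
            if all(lookahead(s, new_policy[s]) <= v[s] + 1e-9 for s in range(n_states)):
                return v, policy                                # no improving switch: optimal
            policy = new_policy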

Policy iteration variants

Policy iteration for 2-player SGs

Policy iteration for 2-player SGs Correctness Let (σ1,σ2) be the policies of the two players at the start of some iteration. Let (σ′1,σ′2) be the policies of the two players at the end of the iteration. Need to show that (σ1,σ2) <_1 (σ′1,σ′2), i.e., that the values improve for player 1. Since σ2 is an optimal counter-strategy against σ1, all switches of σ2 are improving for player 1. Only improving switches are performed on σ1.

The RANDOM FACET algorithm [Kalai (1992)] [Matousek-Sharir-Welzl (1992)] [Ludwig (1995)] A randomized strategy improvement algorithm Initially devised for LP and LP-type problems Applies to all turn-based games Sub-exponential complexity Fastest known for non-discounted 2-player games Performs only one improving switch at a time

The RANDOM FACET algorithm

RANDOM FACET for binary games [Kalai (1992)] [Matousek-Sharir-Welzl (1992)] Binary game – two actions per state. Policy – an n-bit vector. In the binary case, removing an action is equivalent to fixing the other action from the corresponding state. Sort the states so that the best policy with the i-th bit fixed is better than the best policy with the (i+1)-st bit fixed.
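A sketch of the recursion in this binary setting; the oracle is_improving(policy, i), which says whether flipping bit i of a policy is an improving switch (computed, for example, by evaluating the policy against an optimal counter-strategy), is an assumption of this sketch, not something named on the slides:

    import random

    def random_facet(policy, free, is_improving):
        """RANDOM FACET for a binary game.
        policy: list of bits (the current positional strategy).
        free:   set of positions that may still be switched."""
        if not free:
            return policy
        i = random.choice(sorted(free))                        # choose a random facet: fix bit i
        sub = random_facet(policy, free - {i}, is_improving)   # optimum within that facet
        if not is_improving(sub, i):
            return sub                                         # bit i is already correct: done
        switched = list(sub)
        switched[i] = 1 - switched[i]                          # the single improving switch
        return random_facet(switched, free - {i}, is_improving)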

RANDOM FACET for binary games [Kalai (1992)] [Matousek-Sharir-Welzl (1992)] All correct! Would never be switched!

Complexity of Policy Iteration? Essentially no answer for almost 50 years! Algorithm works very well in practice. Is there a variant of PI that solves 2-player games in polynomial time? Some answers obtained in recent years.

Upper bounds for Policy Iteration Dantzig’s pivoting rule and Howard’s policy iteration algorithm, Switch-All, require only O*(mn/(1−γ)) iterations for discounted MDPs [Ye ’10]. Howard’s policy iteration algorithm, Switch-All, requires only O*(m/(1−γ)) iterations for discounted turn-based 2-player Stochastic Games [Hansen-Miltersen-Z ’11].

Lower bounds for Policy Iteration Switch-All for Parity Games is exponential [Friedmann ’09] Switch-All for MDPs is exponential [Fearnley ’10] Random-Facet for Parity Games is sub-exponential [Friedmann-Hansen-Z ’11] Random-Facet and Random-Edge for MDPs and hence for LPs are sub-exponential [FHZ ’11]

Parity Games (PGs) A simple example: a game graph with priorities 2, 3, 2, 1, 4, 1. EVEN wins if the largest priority seen infinitely often is even.

Parity Games (PGs) EVEN wins if the largest priority seen infinitely often is even. Equivalent to many interesting problems in automata and verification: non-emptiness of ω-tree automata, modal μ-calculus model checking.

Parity Games (PGs) → Mean Payoff Games (MPGs) [Stirling (1993)] [Puri (1995)] Replace priority k by payoff (−n)^k. Move payoffs to outgoing edges.
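The correctness of this reduction is usually stated as follows (a well-known fact, not quoted from the slides): on any cycle the largest priority dominates the sum of all smaller payoffs, so with payoff (−n)^k for priority k,

    \text{EVEN wins the parity game from } v \iff \text{the value of the mean payoff game from } v \text{ is } \ge 0.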

Deterministic Algorithms (summary table of iteration bounds for Switch-Best and Switch-All, in the non-discounted and discounted cases) [Condon ’93] [Ye ’10] [Friedmann ’09] [Hansen-Miltersen-Z ’11] [Fearnley ’10]

Randomized Algorithms (summary table of upper and lower bounds for Random-Edge and Random-Facet) [Kalai ’92] [Matousek-Sharir-Welzl ’92] [Friedmann-Hansen-Z ’11]

Related problems 2TBSGs can be reduced to Generalized P-matrix Linear Complementarity Problems. 2TBSG ∈ CLS ⊆ PLS ∩ PPAD, but unlikely to be CLS-complete. 2TBSG is an LP-type problem. Another LP-type problem, not known to be in P, is Smallest Enclosing Ball. Nice abstract formulations that capture many of these problems are USOs and AUSOs.

Summary The hope that simple deterministic or randomized versions of the policy iteration algorithm would solve 2-player stochastic games in polynomial time failed to materialize. New techniques are probably needed. Interior-point methods? Hardness results?

Many intriguing and important open problems Polynomial Policy Iteration Algorithms ??? Polynomial algorithms for SGs ? nSGs ? PGs ? Strongly Polynomial algorithms for LPs ? MDPs ?

THE END

3-bit counter (−N)15

3-bit counter 1

Observations Five “decisions” in each level Process always reaches the sink Values are expected lengths of paths to sink

3-bit counter – Improving switches Random-Edge can choose either one of these improving switches… 1

Cycle gadgets Cycles close one edge at a time Shorter cycles close faster

Cycle gadgets Cycles open “simultaneously”

3-bit counter 23 1 1

From b to b+1 in seven phases Bk-cycle closes Ck-cycle closes U-lane realigns Ai-cycles and Bi-cycles for i<k open Ak-cycle closes W-lane realigns Ci-cycles of 0-bits open

3-bit counter 34 1 1

Size of cycles Various cycles and lanes compete with each other. Some are trying to open while some are trying to close. We need to make sure that our candidates win! Length of all A-cycles = 8n. Length of all C-cycles = 22n. Length of Bi-cycles = 25i²n. O(n⁴) vertices for an n-bit counter. Can be improved using a more complicated construction and an improved analysis (work in progress).

Lower bound for Random-Facet Implement a randomized counter

Optimal Values in 2½-player games Both players have positional optimal strategies. There are strategies that are optimal for every starting position.

2½G ∈ NP ∩ co-NP Deciding whether the value of a state is at least (at most) v is in NP ∩ co-NP. To show that the value is ≥ v, guess an optimal strategy σ for MAX. Find an optimal counter-strategy τ for MIN by solving the resulting MDP. Is the problem in P?

Discounted 2½-player games Strategy improvement correct version

Perfect Information Stochastic Games, Lecture 1½: Upper bounds for the policy iteration algorithm. Day 1, Monday, October 31. Uri Zwick (武熠) (Tel Aviv University).

Complexity of Strategy Improvement Greedy strategy improvement for non-discounted 2-player and 1½-player games is exponential! [Friedmann ’09] [Fearnley ’10] A randomized strategy improvement algorithm for 2½-player games runs in sub-exponential time [Kalai (1992)] [Matousek-Sharir-Welzl (1992)]. Greedy strategy improvement for 2½-player games with a fixed discount factor is strongly polynomial [Ye ’10] [Hansen-Miltersen-Z ’11].

The RANDOM FACET algorithm Analysis

The RANDOM FACET algorithm Analysis

The RANDOM FACET algorithm Analysis

The RANDOM FACET algorithm Analysis
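The analysis slides above show only their titles; the recurrence usually derived for the binary case bounds the expected number of improving switches f(n) made with n free positions (this is the standard form of the argument and may differ from the slides' exact notation):

    f(n) \le f(n-1) + \frac{1}{n} \sum_{i=1}^{n} \bigl( 1 + f(n-i) \bigr), \qquad f(0) = 0,

whose solution is sub-exponential, f(n) = 2^{O(\sqrt{n})}: the term f(n−1) is the recursive call inside the randomly chosen facet, and when that facet happens to be the i-th best of the n, at most one switch and a recursion on n−i positions remain.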

Perfect Information Stochastic Games, Lecture 3: Lower bounds for the policy iteration algorithm. Day 1, Monday, October 31. Uri Zwick (武熠) (Tel Aviv University).

Subexponential lower bounds for randomized pivoting rules for the simplex algorithm Oliver Friedmann – Univ. of Munich, Thomas Dueholm Hansen – Aarhus Univ., Uri Zwick (武熠) – Tel Aviv Univ.

Linear Programming Maximize a linear objective function subject to a set of linear equalities and inequalities
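In symbols, one standard form (the slide itself only shows a picture):

    \max\ c^{\top} x \quad \text{subject to} \quad A x \le b, \quad x \ge 0.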

Linear Programming Simplex algorithm (Dantzig 1947) Ellipsoid algorithm (Khachiyan 1979) Interior-point algorithm (Karmarkar 1984)

The Simplex Algorithm Dantzig (1947) Move up, along an edge to a neighboring vertex, until reaching the top

Pivoting Rules – Where should we go? Largest improvement? Largest slope? …

Deterministic pivoting rules Largest improvement Largest slope Dantzig’s rule – Largest modified cost Bland’s rule – avoids cycling Lexicographic rule – also avoids cycling All known to require an exponential number of steps in the worst case: Klee-Minty (1972), Jeroslow (1973), Avis-Chvátal (1978), Goldfarb-Sit (1979), …, Amenta-Ziegler (1996).

Klee-Minty cubes (1972) (figure taken from a paper by Gärtner-Henk-Ziegler)

Is there a polynomial pivoting rule? Is the diameter polynomial? Hirsch conjecture (1957): The diameter of a d-dimensional, n-faceted polytope is at most n−d. Refuted recently by Santos (2010)! Diameter is still believed to be polynomial (see, e.g., the polymath3 project). Quasi-polynomial (n^(log d + 1)) upper bound: Kalai-Kleitman (1992).

Randomized pivoting rules Random-Edge Choose a random improving edge Random-Facet If there is only one improving edge, take it. Otherwise, choose a random facet containing the current vertex and recursively find the optimum within that facet. [Kalai (1992)] [Matoušek-Sharir-Welzl (1996)] Random-Facet is sub-exponential! Are Random-Edge and Random-Facet polynomial ???

Abstract objective functions (AOFs) Acyclic Unique Sink Orientations (AUSOs) Every face should have a unique sink

USOs and AUSOs Stickney, Watson (1978) Morris (2001) Szabó, Welzl (2001) Gärtner (2002) AUSOs of n-cubes: 2n facets, 2^n vertices. The diameter is exactly n. Bypassing the diameter issue!

AUSO results Random-Facet is sub-exponential [Kalai (1992)] [Matoušek-Sharir-Welzl (1996)] Sub-exponential lower bound for Random-Facet [Matoušek (1994)] Sub-exponential lower bound for Random-Edge [Matoušek-Szabó (2006)] Lower bounds do not correspond to actual linear programs Can geometry help?

LP results Explicit LPs on which Random-Facet and Random-Edge make an expected sub-exponential number of iterations. Technique: Consider LPs that correspond to Markov Decision Processes (MDPs). Observe that the simplex algorithm on these LPs corresponds to the Policy Iteration algorithm for MDPs. Obtain sub-exponential lower bounds for the Random-Facet and Random-Edge variants of the Policy Iteration algorithm for MDPs, relying on similar lower bounds for Parity Games (PGs).

Dual LP formulation for MDPs

Primal LP formulation for MDPs
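Both LP slides show the formulations as images; a standard pair for discounted MDPs (d'Epenoux-style; which of the two the slides label primal and which dual is not recoverable from the transcript) is, with rewards r(i,a), transition probabilities p(j|i,a) and discount factor γ:

    \min \sum_i v_i \quad \text{s.t.} \quad v_i \ge r(i,a) + \gamma \sum_j p(j \mid i,a)\, v_j \quad \forall i,\ a \in A(i),

    \max \sum_{i,a} r(i,a)\, x_{i,a} \quad \text{s.t.} \quad \sum_a x_{j,a} - \gamma \sum_{i,a} p(j \mid i,a)\, x_{i,a} = 1 \quad \forall j, \qquad x \ge 0.

The basic feasible solutions of the second LP correspond to positional policies, which is what makes the simplex algorithm on these LPs coincide with policy iteration (the vertex-policy correspondence of the next slide).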

Vertices and policies