# Learning in Games (Georgios Piliouras)


Learning in Games Georgios Piliouras

Games (i.e. Multi-Body Interactions)

- Interacting entities
- Pursuing their own goals
- Lack of centralized control
- Prediction?

Games (review)

- n players
- Set of strategies $S_i$ for each player i
- Possible states (strategy profiles) $S = \times_i S_i$
- Utility $u_i : S \to \mathbb{R}$
- Social welfare $Q : S \to \mathbb{R}$
- Extend to allow probabilities $\Delta(S_i)$, $\Delta(S)$: $u_i(\Delta(S)) = E(u_i(S))$, $Q(\Delta(S)) = E(Q(S))$

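Zero-Sum Games & Equilibria (review)

Rock-Paper-Scissors (each cell: row player's payoff, column player's payoff):

| | Rock | Paper | Scissors |
|---|---|---|---|
| Rock | 0, 0 | -1, 1 | 1, -1 |
| Paper | 1, -1 | 0, 0 | -1, 1 |
| Scissors | -1, 1 | 1, -1 | 0, 0 |

Nash: A product of mixed strategies s.t. no player has a profitable deviating strategy. In Rock-Paper-Scissors, each player plays each action with probability 1/3.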
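The Nash condition above can be checked mechanically for Rock-Paper-Scissors. The following is a minimal sketch, not part of the original slides; the function names `expected_payoff` and `is_nash` are illustrative, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

# Row player's payoff for (Rock, Paper, Scissors); the column player gets the negation.
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def expected_payoff(x, y):
    """Row player's expected payoff when row mixes with x and column mixes with y."""
    return sum(x[i] * A[i][j] * y[j] for i in range(3) for j in range(3))

def is_nash(x, y):
    """True if no pure deviation improves either player's expected payoff."""
    value = expected_payoff(x, y)
    # Row player: no pure action may earn more than the current value.
    row_ok = all(sum(A[i][j] * y[j] for j in range(3)) <= value for i in range(3))
    # Column player earns -A, so a deviation helps if it lowers the row payoff.
    col_ok = all(sum(x[i] * A[i][j] for i in range(3)) >= value for j in range(3))
    return row_ok and col_ok

uniform = [Fraction(1, 3)] * 3
always_rock = [Fraction(1), Fraction(0), Fraction(0)]
print(is_nash(uniform, uniform))      # both players mix uniformly
print(is_nash(always_rock, uniform))  # column player would deviate to Paper
```

Checking only pure deviations is enough here, because a mixed deviation is a convex combination of pure ones and cannot do better than the best of them.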

Why do we study Nash eq.?

- Nash eq. have a simple intuitive definition.
- Nash eq. are applicable to all games.
- In some classes of games, Nash eq. is a reasonably good predictor of rational self-interested behavior (e.g. zero-sum games).
- Even in general games, Nash eq. analysis seems like a natural, albeit optimistic, first step in understanding rational behavior.

Why is it optimistic? Nash eq. analysis presumes that agents can resolve issues regarding:

- Convergence: Agent behavior will converge to a Nash eq.
- Coordination: If there are many Nash eq., agents can coordinate on one of them.
- Communication: Agents are fully aware of each other's utilities/rationality.
- Complexity: Computing a Nash eq. can be hard even from a centralized perspective.

Today: Learning in Games

Agent behavior is an online learning algorithm/dynamic.

- Input: Current state of environment/other agents (+ history)
- Output: Chosen (randomized) action

Analyze the evolution of systems of coupled dynamics, as a way to predict interacting agent behavior.

- Advantages: Weaker assumptions. If the dynamic converges ⇒ Nash equilibrium (but it may not converge).
- Disadvantages: Harder to analyze.

Today: Learning in Games

Agent behavior is an online learning algorithm/dynamic (input: current state of environment/other agents, plus history; output: chosen, possibly randomized, action).

- Class 1: Best (Better) Response Dynamics
- Class 2: No-regret dynamics (e.g. Weighted Majority/Hedge dynamic)

Best Response Dynamics (BR)

- Start from an arbitrary state $s = (s_i)_i$.
- Choose an arbitrary agent i.
- Agent i deviates to a best (better) response given the strategies of the others.

Advantages: Simple, widely applicable. Disadvantages: No intelligence/learning. Does this work?

Congestion Games

- n players and m resources ("edges")
- Each strategy corresponds to a set of resources ("paths")
- Each edge has a cost function $c_e(x)$ that determines the cost as a function of the number of players using it.
- Cost experienced by a player = sum of edge costs

[Figure: example network with edge cost functions x and 2x; Cost(red) = 6, Cost(green) = 8.]

Potential Games

A potential game is a game that admits a function $\Phi : S \to \mathbb{R}$ s.t. for every $s \in S$, every agent i and every deviation $s_i'$,

$$u_i(s_i', s_{-i}) - u_i(s) = \Phi(s_i', s_{-i}) - \Phi(s)$$

Every congestion game is a potential game. This implies that any such game has a pure NE and that best response dynamics converge. Speed?
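The convergence claim can be illustrated on a small load-balancing game (each strategy is a single resource). This is a sketch under assumptions not stated on the slide: it uses Rosenthal's potential, $\Phi(s) = \sum_e \sum_{k=1}^{x_e(s)} c_e(k)$ with $x_e(s)$ the load on resource e, as the potential function, and the names `loads`, `rosenthal_potential` and `best_response_dynamics` are illustrative.

```python
import random

def loads(profile, m):
    """Number of players on each of the m resources."""
    counts = [0] * m
    for r in profile:
        counts[r] += 1
    return counts

def rosenthal_potential(profile, m, cost):
    """Phi(s) = sum over resources of cost(1) + ... + cost(load)."""
    return sum(cost(k) for load in loads(profile, m) for k in range(1, load + 1))

def best_response_dynamics(profile, m, cost, rng):
    """One agent at a time moves to a strictly better best response; stops at a pure NE."""
    profile = list(profile)
    potentials = [rosenthal_potential(profile, m, cost)]
    while True:
        counts = loads(profile, m)
        improving = []
        for i, current in enumerate(profile):
            def cost_if(r):
                # Cost player i pays on resource r (moving there adds one to its load).
                return cost(counts[r] + (0 if r == current else 1))
            best = min(range(m), key=cost_if)
            if cost_if(best) < cost_if(current):
                improving.append((i, best))
        if not improving:
            return profile, potentials
        i, best = rng.choice(improving)  # arbitrary agent among those who can improve
        profile[i] = best
        potentials.append(rosenthal_potential(profile, m, cost))

# Hypothetical instance: 4 players, 4 machines, c(x) = x, everyone starts on machine 0.
final, potentials = best_response_dynamics([0, 0, 0, 0], m=4, cost=lambda x: x,
                                           rng=random.Random(0))
print(sorted(loads(final, 4)), potentials)
```

Each improving move lowers the potential by exactly the mover's cost improvement, so the recorded sequence of potentials is strictly decreasing and the loop must terminate.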

BR Cycles in Zero-Sum Games

[Figure: successive frames of the Rock-Paper-Scissors payoff matrix, illustrating a best-response cycle.]
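The cycling behavior can be reproduced in a few lines. This is an illustrative sketch, assuming the players alternate best responses starting with the column player; the starting state and function names are hypothetical.

```python
# Row player's payoff for actions 0 = Rock, 1 = Paper, 2 = Scissors; column gets -A.
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def best_response_row(col):
    """Row action maximizing the row payoff against a fixed column action."""
    return max(range(3), key=lambda i: A[i][col])

def best_response_col(row):
    """Column action maximizing the column payoff (-A) against a fixed row action."""
    return max(range(3), key=lambda j: -A[row][j])

def br_path(state, steps):
    """Alternate best responses (column first) and record the visited states."""
    row, col = state
    path = [(row, col)]
    for t in range(steps):
        if t % 2 == 0:
            col = best_response_col(row)
        else:
            row = best_response_row(col)
        path.append((row, col))
    return path

path = br_path((0, 0), 7)  # start from (Rock, Rock)
print(path)
```

After the first move the path visits six distinct off-diagonal states and then returns to the first of them, so best-response dynamics never settle in this game.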

No Regret Learning

Regret(T) in a history of T periods = (total profit of the best fixed action in hindsight) - (total profit of the algorithm).

An algorithm is characterized as "no regret" if for every input sequence the regret grows sublinearly in T. [Blackwell 56], [Hannan 57], [Fudenberg, Levine 94], …

No single action significantly outperforms the dynamic.

Example (weather), with payoff matrix rows (1, 0) and (0, 1):

| | Profit |
|---|---|
| Algorithm | 3 |
| Umbrella | 3 |
| Sunscreen | 1 |
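The regret definition translates directly into code. The sketch below is illustrative: the weather sequence and the algorithm's choices are hypothetical, picked only so that the totals match the slide (Algorithm 3, Umbrella 3, Sunscreen 1), and it assumes the 2×2 payoff matrix means "profit 1 when the action matches the weather".

```python
def regret(payoffs, choices):
    """payoffs[t][a] = profit of action a in period t; choices[t] = action played."""
    algorithm_total = sum(payoffs[t][a] for t, a in enumerate(choices))
    n_actions = len(payoffs[0])
    best_fixed_total = max(sum(row[a] for row in payoffs) for a in range(n_actions))
    return best_fixed_total - algorithm_total

UMBRELLA, SUNSCREEN = 0, 1
RAIN = [1, 0]  # assumed: umbrella earns 1, sunscreen earns 0
SUN = [0, 1]   # assumed: umbrella earns 0, sunscreen earns 1

weather = [RAIN, RAIN, SUN, RAIN]                     # hypothetical sequence
choices = [UMBRELLA, SUNSCREEN, SUNSCREEN, UMBRELLA]  # hypothetical play

print(regret(weather, choices))
```

On this sequence the algorithm earns 3, which equals the profit of the best fixed action (always Umbrella), so the regret is 0.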

The Multiplicative Weights Algorithm (a.k.a. Hedge, a.k.a. Weighted Majority) [Littlestone, Warmuth '94], [Freund, Schapire '99]

Pick s with probability proportional to $(1-\epsilon)^{\mathrm{total}(s)}$, where total(s) denotes the cumulative cost of s in all past periods.

Why is it regret minimizing? Proof on the board.
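A minimal sketch of the update rule follows. It is not the proof given on the board; it assumes costs in [0, 1] and full information (every action's cost is observed each period), and it compares the expected cost against the standard bound $(1+\epsilon)\min_s \mathrm{total}(s) + \ln(n)/\epsilon$, valid for $\epsilon \le 1/2$. The function name `hedge` and the random cost sequence are illustrative.

```python
import math
import random

def hedge(costs, epsilon, rng):
    """costs[t][s] in [0, 1]. Returns (expected total cost, total cost per action, samples)."""
    n = len(costs[0])
    total = [0.0] * n        # cumulative cost of each action so far
    expected_cost = 0.0
    choices = []
    for cost_t in costs:
        # Probability of s is proportional to (1 - epsilon) ** total(s).
        weights = [(1 - epsilon) ** total[s] for s in range(n)]
        z = sum(weights)
        probs = [w / z for w in weights]
        choices.append(rng.choices(range(n), weights=probs)[0])
        expected_cost += sum(p * c for p, c in zip(probs, cost_t))
        for s in range(n):
            total[s] += cost_t[s]
    return expected_cost, total, choices

# Hypothetical input: 500 periods, 3 actions, i.i.d. uniform costs.
cost_rng = random.Random(1)
costs = [[cost_rng.random() for _ in range(3)] for _ in range(500)]
eps = 0.1
expected_cost, total, choices = hedge(costs, eps, random.Random(2))
bound = (1 + eps) * min(total) + math.log(3) / eps
print(expected_cost, min(total), bound)
```

Choosing $\epsilon$ on the order of $\sqrt{\ln n / T}$ balances the two terms of the bound and gives regret growing like $\sqrt{T \ln n}$, which is sublinear in T.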

BREAK

No Regret and Equilibria Do no-regret algorithms converge to Nash equilibria in general games? Do no-regret algorithms converge to other equilibria in general games?

Other Equilibrium Notions (review), illustrated on Rock-Paper-Scissors (win 1, loss -1, tie 0)

- Nash: A probability distribution over outcomes that is a product of mixed strategies, s.t. no player has a profitable deviating strategy. In Rock-Paper-Scissors each player plays each action with probability 1/3, i.e. each of the nine (green) outcomes is chosen uniformly (prob. 1/9).
- Coarse Correlated Equilibria (CCE): A probability distribution over outcomes, s.t. no player has a profitable deviating strategy. Example: choose any of the six green outcomes uniformly (prob. 1/6).
- Correlated Equilibria (CE): A probability distribution over outcomes, s.t. no player has a profitable deviating strategy even if they can condition on the advice from the distribution. Is the uniform distribution over the six green outcomes (prob. 1/6) a CE? No.
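The difference between the two notions can be checked numerically. The sketch below assumes that the six green outcomes are the off-diagonal (non-tie) cells, which the transcript does not show; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Row player's payoff for (Rock, Paper, Scissors); the column player gets -A.
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def payoff(player, row, col):
    return A[row][col] if player == 0 else -A[row][col]

def is_cce(dist):
    """No player gains by committing to one fixed action, ignoring the advice."""
    for player in (0, 1):
        current = sum(p * payoff(player, r, c) for (r, c), p in dist.items())
        for dev in range(3):
            deviated = sum(
                p * (payoff(0, dev, c) if player == 0 else payoff(1, r, dev))
                for (r, c), p in dist.items())
            if deviated > current:
                return False
    return True

def is_ce(dist):
    """No player gains by replacing one advised action with another action."""
    for player in (0, 1):
        for advised, dev in product(range(3), repeat=2):
            gain = 0
            for (r, c), p in dist.items():
                mine = r if player == 0 else c
                if mine != advised:
                    continue
                new = payoff(0, dev, c) if player == 0 else payoff(1, r, dev)
                gain += p * (new - payoff(player, r, c))
            if gain > 0:
                return False
    return True

# Assumed "green" outcomes: the six non-tie cells, each with probability 1/6.
off_diagonal = {(r, c): Fraction(1, 6) for r in range(3) for c in range(3) if r != c}
uniform_all = {(r, c): Fraction(1, 9) for r in range(3) for c in range(3)}

print(is_cce(off_diagonal), is_ce(off_diagonal))
print(is_cce(uniform_all), is_ce(uniform_all))
```

Under this assumption the check agrees with the slide: a player advised to play Rock knows the opponent plays Paper or Scissors, and switching to Scissors is then strictly better, so the distribution is a CCE but not a CE.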

Other Equilibrium Notions (review): Pure NE ⊆ (mixed) NE ⊆ CE ⊆ CCE

No-regret & CCE

- A history of no-regret algorithms is a sequence of outcomes s.t. no agent has a single deviating action that can increase her average payoff.
- A Coarse Correlated Equilibrium is a probability distribution over outcomes s.t. no agent has a single deviating action that can increase her expected payoff.

No Regret and Equilibria Do no-regret algorithms converge to Nash equilibria in general games? Do no-regret algorithms converge to other equilibria in general games? Do no-regret algorithms converge to Nash equilibria in interesting games?

CCE in Zero-Sum Games

- In general games, CCE ⊇ conv(NE). Why?
- In zero-sum games, the marginals and utilities of CCE and NE agree. Why?
- What does it imply for no-regret algs?
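One standard way to answer the two "Why?" questions is sketched below. This is an outline under the usual definitions, not necessarily the argument given in the lecture; $\sigma$ denotes a CCE with marginals $\sigma_1, \sigma_2$, and $v$ denotes the value of the zero-sum game for player 1 (both symbols are introduced here).

```latex
% CCE condition: linear in the distribution \sigma, hence the set of CCE is convex.
\mathbb{E}_{s\sim\sigma}[u_i(s)] \;\ge\; \mathbb{E}_{s\sim\sigma}[u_i(s_i', s_{-i})]
\quad \text{for all players } i \text{ and all } s_i' \in S_i .
% Every NE satisfies these constraints, so conv(NE) \subseteq CCE.

% Zero-sum case (u_1 + u_2 = 0):
\mathbb{E}_\sigma[u_1] \;\ge\; \max_{s_1'} u_1(s_1', \sigma_2) \;\ge\; v,
\qquad
\mathbb{E}_\sigma[u_2] \;\ge\; \max_{s_2'} u_2(\sigma_1, s_2') \;\ge\; -v .
% Adding the two chains gives 0 \ge 0, so every inequality is tight:
\mathbb{E}_\sigma[u_1] = v, \qquad
\max_{s_1'} u_1(s_1', \sigma_2) = v, \qquad
\min_{s_2'} u_1(\sigma_1, s_2') = v .
% Hence \sigma_1 and \sigma_2 are maxmin/minmax strategies and (\sigma_1, \sigma_2) is a NE.
```

For no-regret algorithms this suggests that in zero-sum games the time-averaged strategies approach Nash equilibrium strategies and the average payoffs approach the value of the game, even if the day-to-day behavior does not converge.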

BREAK 2

Can learning beat Nash equilibria by an arbitrary factor?

CCE in Congestion Games

Load balancing: n balls, n bins, c(x) = x. Makespan: expected maximum latency over all links.

- Pure Nash: makespan 1 (one ball per bin).
- Mixed Nash: makespan Θ(log n / log log n) (each ball picks each bin with probability 1/n). [Koutsoupias, Mavronicolas, Spirakis '02], [Czumaj, Vöcking '02]
- Coarse Correlated Equilibria: makespan exponentially worse, Ω(√n). [Blum, Hajiaghayi, Ligett, Roth '08]
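The mixed Nash makespan can be estimated by simulation. This is a rough Monte Carlo sketch, assuming the fully mixed equilibrium in which every ball picks a bin uniformly at random; the values of n, the number of trials and the seed are arbitrary choices.

```python
import random

def max_load(n, rng):
    """Throw n balls into n bins uniformly at random and return the fullest bin's load."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return max(counts)

def estimate_makespan(n, trials, rng):
    """Average maximum load over independent trials (estimate of the expected makespan)."""
    return sum(max_load(n, rng) for _ in range(trials)) / trials

rng = random.Random(0)
n = 1000
estimate = estimate_makespan(n, trials=100, rng=rng)
print(estimate, n ** 0.5)
```

The estimate sits well above the pure Nash makespan of 1 and well below √n, consistent with the ordering of the three bounds listed above.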

No-Regret Algs in Congestion Games

Since worst-case CCE can be reproduced by worst-case no-regret algs, worst-case no-regret algorithms do not converge to Nash equilibria in general.

(Multiplicative Weights) Algorithm in (Potential) Games

Write $p(t) \in \Delta(S)$ for the current state of the system (a tuple of randomized strategies, one for each player). Each player tosses their coins and a specific outcome is realized. Depending on the outcome of these random events, we transition to the next state $p(t+1)$, at distance $O(\epsilon)$ from $p(t)$. This is an infinite Markov chain with infinitely many states.

Problem 1: Hard to get intuition about the problem, let alone analyze it. Let's try to come up with a "discounted" version of the problem. Ideas?

Idea 1: Analyze expected motion. The system evolution is now deterministic, i.e. there exists a function f s.t.

$$E[p(t+1)] = f(p(t), \epsilon)$$

I wish to analyze this function (e.g. find fixed points).

Problem 2: The function f is still rather complicated.

Idea 2: I wish to analyze the MWA dynamics for small ε. Use Taylor expansion to find a first-order approximation to f:

$$f(p(t), \epsilon) = f(p(t), 0) + \epsilon \, f'(p(t), 0) + O(\epsilon^2)$$

As ε → 0,

$$\frac{f(p(t), \epsilon) - f(p(t), 0)}{\epsilon} \to f'(p(t), 0)$$

which specifies a vector at each point of our state space (i.e. a vector field). This vector field defines a system of ODEs which we are going to analyze.

Deriving the ODE

Take expectations of the update, differentiate w.r.t. ε, and take the expected value. The result is the replicator dynamic studied in evolutionary game theory.
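The formulas of this slide did not survive in the transcript; the following is a reconstruction sketch under stated assumptions rather than the slide's own derivation. Notation introduced here: $p_{i,a}$ is the probability that player i plays action a, $c_i(a, s_{-i})$ is the cost of a against the realized actions $s_{-i}$ of the others, and $c_i(a, p_{-i})$, $c_i(p)$ are the corresponding expected costs.

```latex
% MWA update of player i after the others' realized actions s_{-i}:
p_{i,a}(t+1) \;=\;
\frac{p_{i,a}(t)\,(1-\epsilon)^{c_i(a,\,s_{-i})}}
     {\sum_{b} p_{i,b}(t)\,(1-\epsilon)^{c_i(b,\,s_{-i})}}

% Differentiate w.r.t. \epsilon at \epsilon = 0,
% using \frac{d}{d\epsilon}(1-\epsilon)^{c}\big|_{\epsilon=0} = -c:
\frac{\partial p_{i,a}(t+1)}{\partial \epsilon}\bigg|_{\epsilon=0}
\;=\; p_{i,a}\Big(\sum_{b} p_{i,b}\,c_i(b, s_{-i}) \;-\; c_i(a, s_{-i})\Big)

% Take the expectation over s_{-i} \sim p_{-i}:
\dot p_{i,a} \;=\; p_{i,a}\,\big(c_i(p) - c_i(a, p_{-i})\big)

% Equivalently, with utilities u_i = -c_i:
\dot p_{i,a} \;=\; p_{i,a}\,\big(u_i(a, p_{-i}) - u_i(p)\big)
```

In the utility form, the probability of an action grows exactly when the action does better than the player's current mixed strategy.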

Motivating Example: two balls, two bins (machines), c(x) = x

Each player's mixed strategy is summarized by a single number (the probability of picking machine 1), so the mixed strategy profile can be plotted in $\mathbb{R}^2$.

[Figure: mixed strategy profiles plotted in the plane; labels: Pure Nash, Mixed Nash.]

Even in the simplest case of two balls, two bins with linear utility the replicator equation has a nonlinear form.

The potential function

The congestion game has a potential function Φ. Let Ψ = E[Φ]. A calculation shows that Ψ decreases except when every player randomizes over paths of equal expected cost (i.e. Ψ is a Lyapunov function of the dynamics). [Monderer-Shapley '96]

Analyzing the spectrum of the Jacobian shows that in "generic" congestion games only pure Nash are stable. [Kleinberg-Piliouras-Tardos '09]
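The calculation itself is not in the transcript. One way it can go is sketched below, assuming Rosenthal's potential and the replicator dynamic in cost form, $\dot p_{i,a} = p_{i,a}\,(c_i(p) - c_i(a, p_{-i}))$; the symbols $x_e(s)$ (load on edge e) and $K_i$ are introduced here for illustration.

```latex
% Rosenthal's potential and its expectation under the mixed profile p:
\Phi(s) = \sum_{e} \sum_{k=1}^{x_e(s)} c_e(k),
\qquad
\Psi(p) = \mathbb{E}_{s\sim p}[\Phi(s)]

% \Psi is linear in each player's own probabilities:
\frac{\partial \Psi}{\partial p_{i,a}}
= \mathbb{E}_{s_{-i}\sim p_{-i}}[\Phi(a, s_{-i})]
= c_i(a, p_{-i}) + K_i(p_{-i})

% Along the dynamic, \sum_a \dot p_{i,a} = 0, so the term K_i drops out:
\frac{d\Psi}{dt}
= \sum_i \sum_a c_i(a, p_{-i})\, p_{i,a}\,\big(c_i(p) - c_i(a, p_{-i})\big)
= -\sum_i \operatorname{Var}_{a\sim p_i}\!\big[c_i(a, p_{-i})\big] \;\le\; 0
```

The variance of player i's cost vanishes only when all actions in the support of $p_i$ have the same expected cost, which matches the exception stated above.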

Cyclic Matching Pennies (Jordan's game) [Jordan '93]

Three players i ∈ {0,1,2}, each choosing H or T. Player i gets a profit of 1 if they mismatch their opponent (player i-1); 0 otherwise.

Payoff of player i:

| | player i-1 plays H | player i-1 plays T |
|---|---|---|
| player i plays H | 0 | 1 |
| player i plays T | 1 | 0 |

- Nash equilibrium: every player plays (½, ½). Social welfare of NE: 3/2.
- Best response cycle: (H,H,T), (H,T,T), (H,T,H), (T,T,H), (T,H,H), (T,H,T), (H,H,T). Social welfare: 2.

Asymmetric Cyclic Matching Pennies [Jordan '93]

Payoff of player i, i ∈ {0,1,2}:

| | player i-1 plays H | player i-1 plays T |
|---|---|---|
| player i plays H | 0 | 1 |
| player i plays T | M | 0 |

- Nash equilibrium: every player plays H with probability 1/(M+1) and T with probability M/(M+1). Social welfare of NE: 3M/(M+1) < 3.
- Best response cycle: (H,H,T), (H,T,T), (H,T,H), (T,T,H), (T,H,H), (T,H,T), (H,H,T). Social welfare: M+1.
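The numbers on this slide can be verified by direct enumeration. The sketch assumes the payoff orientation that is consistent with the stated equilibrium: player i earns 1 for playing H against T by player i-1, and M for playing T against H. The value M = 3 and the function names are illustrative.

```python
from fractions import Fraction

H, T = 0, 1

def payoff(own, prev, M):
    """Payoff of player i given its own action and the action of player i-1 (assumed orientation)."""
    if own == H and prev == T:
        return 1
    if own == T and prev == H:
        return M
    return 0

def social_welfare(profile, M):
    # profile[i - 1] wraps around, so player 0's opponent is player 2.
    return sum(payoff(profile[i], profile[i - 1], M) for i in range(3))

def best_response(prev, M):
    return max((H, T), key=lambda a: payoff(a, prev, M))

def best_response_cycle(start, M):
    """Players 1, 2, 0, 1, ... best-respond in turn until the start profile recurs."""
    profile, cycle, mover = list(start), [tuple(start)], 1
    while True:
        profile[mover] = best_response(profile[mover - 1], M)
        cycle.append(tuple(profile))
        if tuple(profile) == tuple(start):
            return cycle
        mover = (mover + 1) % 3

def nash_welfare(M):
    q = Fraction(1, M + 1)           # Pr(H) for every player
    u_H = (1 - q) * payoff(H, T, M)  # H pays only against T
    u_T = q * payoff(T, H, M)        # T pays only against H
    assert u_H == u_T                # indifference, as required in a mixed equilibrium
    return 3 * u_H

M = 3
cycle = best_response_cycle((H, H, T), M)
print(cycle)
print([social_welfare(s, M) for s in cycle], nash_welfare(M))
```

Every state on the cycle has welfare M+1, while the Nash welfare 3M/(M+1) stays below 3, so the gap grows without bound as M increases.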

Replicator Dynamics

The state is the mixed strategy profile, summarized by Pr(player i plays H) for each player i. The rate of growth of action H is determined by the utility of player i.

Survival of the fittest: The probability of an action increases iff it outperforms the current (mixed) strategy.

Simple, well-studied dynamic with desirable properties: it is the limit of the weighted majority algorithm with step size → 0 (e.g. no-regret).

Optimal social welfare: SW = M+1.

Unique Nash: all play H with probability 1/(M+1); SW = 3M/(M+1) << M+1.

Experiment 1: Starting on the diagonal we converge to Nash

Experiment 2: Starting off the diagonal we converge to 6-cycle Proofs?
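Both experiments can be approximated numerically. This is a sketch only: it integrates the two-action replicator equation with a plain Euler scheme, assumes the payoff orientation in which player i earns 1 for H against T and M for T against H, and uses hypothetical values for M, the step size and the starting points.

```python
import math

def replicator_step(p, M, dt):
    """One Euler step of dp_i/dt = p_i (1 - p_i) (u_i(H) - u_i(T)), with p_i = Pr(i plays H)."""
    new_p = []
    for i in range(3):
        prev = p[i - 1]        # Pr(player i-1 plays H); index -1 wraps to player 2
        u_H = (1 - prev) * 1   # H earns 1 against T
        u_T = prev * M         # T earns M against H
        new_p.append(p[i] + dt * p[i] * (1 - p[i]) * (u_H - u_T))
    return new_p

def simulate(p0, M, dt=0.01, steps=20000):
    p = list(p0)
    for _ in range(steps):
        p = replicator_step(p, M, dt)
    return p

def distance_to_nash(p, M):
    q = 1 / (M + 1)
    return math.sqrt(sum((x - q) ** 2 for x in p))

M = 3
on_diagonal = simulate([0.7, 0.7, 0.7], M)   # Experiment 1: start on the diagonal
start_off = [0.3, 0.25, 0.2]                 # Experiment 2: start off the diagonal
off_diagonal = simulate(start_off, M)

print(on_diagonal, distance_to_nash(on_diagonal, M))
print(off_diagonal, distance_to_nash(off_diagonal, M))
```

For two actions the replicator equation $\dot p = p\,(u(H) - \bar u)$ simplifies to $p(1-p)(u(H) - u(T))$, which is the form used in `replicator_step`. A run of this kind only illustrates the behavior; it does not replace the proofs asked for on the slide.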

As M increases, the region where SW is not a Lyapunov function grows as well.

Example: M=3

[Figure: regions where SW is less than SW(Nash).]

Proof structure: SW is still a Lyapunov function in these regions.

Starting off the diagonal, we always escape these regions and we get trapped in the area with SW>SW(Nash)

New Lyapunov function implies convergence to the boundary. Proof idea: if SW > SW(Nash), then there is a new Lyapunov function.

- Step 1: Partition the space into regions where the growth rates of all strategies have fixed signs.
- Step 2: Establish that trajectories cycle infinitely around a restricted neighborhood of the boundary.
- Step 3: Specialized Lyapunov functions & "stitching" arguments.

The Game* is ON *by game I mean projects

