# Online learning and game theory Adam Kalai (joint with Sham Kakade)


How do we learn? Goal: learn a function f: X → Y, where Y = {–,+} (distribution-free learning).

- Batch (offline) model: get training data (x_1,y_1),…,(x_n,y_n) drawn independently from some distribution μ over X × Y; we output f: X → Y with low error P_μ[f(x) ≠ y].
- Online (repeated game) model, for i = 1,2,…,n: observe the ith example x_i ∈ X, predict its label, then observe the true label y_i ∈ {–,+}. Goal: make as few mistakes as possible.

Outline

1. Online/batch learnability of F
   - Online learnability ⇒ batch learnability
   - Finite F: batch and online learning (via weighted majority)
   - Batch learnability ⇒ online learnability?
2. Online learning in repeated games
   - Zero-sum: weighted majority
   - General-sum: no "internal regret" ⇒ correlated equilibrium

Online learning (X = R², Y = {–,+})

- Adversary picks (x_1,y_1) ∈ X × Y; we see x_1, predict z_1, then see y_1.
- …
- Adversary picks (x_n,y_n) ∈ X × Y; we see x_n, predict z_n, then see y_n.

An online algorithm A predicts A(x_1,y_1,…,x_{i-1},y_{i-1},x_i) = z_i. Its "empirical error" is err(A,data) = |{i : z_i ≠ y_i}| / n.
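As a toy illustration of this protocol, the loop below runs an online algorithm and tallies its empirical error; the "predict the majority label seen so far" rule is my own choice for the example, not from the talk.

```python
# Run the online protocol: for each example, predict before seeing the true
# label, then count mistakes. The predictor here is a trivial
# majority-label-so-far rule, used only to make the loop concrete.
def run_online(data):
    mistakes = 0
    counts = {'+': 0, '-': 0}
    for x, y in data:
        z = '+' if counts['+'] >= counts['-'] else '-'  # prediction z_i, before y_i
        if z != y:
            mistakes += 1
        counts[y] += 1                                   # now observe y_i
    return mistakes / len(data)                          # empirical error err(A, data)
```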

Batch Learning (X = R², Y = {–,+})

Data (x_1,y_1),…,(x_n,y_n) is drawn from a distribution μ over X × Y; a learning algorithm A outputs f: X → Y. [Slide shows labeled +/– points in the plane and the learned classifier.]

Batch Learning (continued)

- "Generalization error": err(f,μ) = Pr_μ[f(x) ≠ y]
- "Empirical error": err(f,data) = |{i : f(x_i) ≠ y_i}| / n

Online/batch learnability of F

Family F of functions f: X → Y (Y = {–,+}), data drawn from μ over X × Y.

- Algorithm A learns F online if ∃ k,c > 0 such that for all data (x_1,y_1),…,(x_n,y_n): E[Regret(A,data)] ≤ k/n^c, where Regret(A,data) = err(A,data) − min_{g∈F} err(g,data).
- Algorithm B batch learns F if ∃ k,c > 0 such that for all μ: on input (x_1,y_1),…,(x_n,y_n) drawn independently from μ, B outputs f ∈ F with E_data[Regret(f,μ)] ≤ k/n^c, where Regret(f,μ) = err(f,μ) − min_{g∈F} err(g,μ).

Online learnable ⇒ Batch learnable

Given online learning algorithm A, define batch learning algorithm B:

- Input: (x_1,y_1),(x_2,y_2),…,(x_n,y_n) drawn from μ
- Let f_i: X → Y be f_i(x) = A(x_1,y_1,…,x_{i-1},y_{i-1},x)
- Pick i ∈ {1,2,…,n} at random and output f_i

Analysis: E[Regret(B,μ)] = E[err(B,μ)] − min_{g∈F} err(g,μ), and E[err(B,μ)] = E[err(A,data)] (a random f_i's generalization error is A's average per-round error), while E[min_{g∈F} err(g,data)] ≤ min_{g∈F} err(g,μ). Hence E[Regret(B,μ)] ≤ E[Regret(A,data)].
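The reduction can be sketched in a few lines; here the online algorithm is assumed to be given as a function `A(history, x) -> label`, and the toy `majority_A` is my own stand-in, not from the talk.

```python
import random

# Online-to-batch reduction: pick a random prefix length i, freeze the
# history, and return the hypothesis f_i(x) = A(x_1,y_1,...,x_{i-1},y_{i-1},x).
def online_to_batch(A, sample, rng=random):
    i = rng.randrange(len(sample))          # prefix of length 0..n-1
    history = list(sample[:i])              # frozen training history
    return lambda x: A(history, x)          # the hypothesis f_{i+1}

# Toy online algorithm: predict the majority label seen so far.
def majority_A(history, x):
    pos = sum(1 for _, y in history if y == '+')
    return '+' if pos >= len(history) - pos else '-'
```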


Outline

1. Online/batch learnability of F
   - Online learnability ⇒ batch learnability ✓
   - Finite F: batch and online learning (via weighted majority)
   - Batch learnability ⇏ online learnability
   - Batch learnability ⇒ transductive online learnability
2. Online learning in repeated games
   - Zero-sum: weighted majority ⇒ equilibrium
   - General-sum: no "internal regret" ⇒ correlated equilibrium

Online majority algorithm

Suppose some perfect f* ∈ F exists (err(f*,data) = 0), and |F| = F.

- Predict according to the majority of the f's consistent with all labels seen so far.
- Each mistake Maj makes eliminates ≥ ½ of the live f's.
- So Maj's #mistakes ≤ log2(F), and err(Maj,data) ≤ log2(F)/n.
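A minimal sketch of this halving-style algorithm; the threshold function class used in the test is my own toy choice for illustration.

```python
# Halving / online majority in the realizable case: keep the set of functions
# still consistent with all observed labels, predict by majority vote.
# Each mistake removes at least half of the live set, so the number of
# mistakes is at most log2(|F|).
def halving(F, data):
    live = list(F)                    # functions consistent with labels so far
    mistakes = 0
    for x, y in data:
        votes = sum(1 if f(x) == '+' else -1 for f in live)
        z = '+' if votes >= 0 else '-'
        if z != y:
            mistakes += 1
        live = [f for f in live if f(x) == y]   # discard inconsistent functions
    return mistakes
```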

Naive batch learning

Suppose some perfect f* ∈ F exists (err(f*,data) = 0), and |F| = F. Select any f consistent with the data. For a g ≠ f* with err(g,μ) ≈ log(F)/n, P[err(g,data) = 0] ≈ (1 − log(F)/n)^n ≈ 1/F, so functions with error much above log(F)/n are unlikely to survive as consistent. Wow! Online looks like batch.

Naive batch learning (F = |F|)

- Naive batch algorithm: choose f ∈ F that minimizes err(f,data).
- For any fixed f ∈ F: P[|err(f,data) − err(f,μ)| > ε] ≤ 2e^{−2nε²} (Chernoff bound).
- Union bound with ε = 10√(ln F / n): P[∃ f ∈ F: |err(f,data) − err(f,μ)| > ε] ≤ 2F·e^{−200 ln F} ≤ 2^{−100}.
- Hence E[Regret(n.b.,μ)] ≤ c√(log F / n).

Weighted majority' [LW89] (F = |F|)

Assign a weight to each f: ∀ f ∈ F, w(f) = 1. On period i = 1,2,…,n:
- Predict the weighted majority vote of the f's.
- For each f: if f(x_i) ≠ y_i, set w(f) := w(f)/2.

Analysis: whenever WM' errs, the total weight decreases by ≥ 25%, so the final total weight is ≤ F·(3/4)^{#mistakes(WM')}. The final total weight is also ≥ 2^{−min_f #mistakes(f)}. Combining: #mistakes(WM') ≤ 2.41·(min_f #mistakes(f) + log2(F)).

Weighted majority [LW89] (F = |F|)

Assign a weight to each f: ∀ f ∈ F, w(f) = 1. On period i = 1,2,…,n:
- Predict the weighted majority vote of the f's.
- For each f: if f(x_i) ≠ y_i, set w(f) := w(f)·(1 − ε).

Thm: with ε tuned appropriately, E[Regret(WM,data)] ≤ 2√(ln F / n). Wow! Online looks like batch.
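A sketch of the multiplicative (1 − ε) update; the threshold class in the test is again my own toy choice.

```python
# Weighted majority [LW89]: every expert starts with weight 1; predict the
# weighted majority vote, then multiply the weight of each mistaken expert
# by (1 - eps). With eps = 1/2 this recovers the WM' variant above.
def weighted_majority(F, data, eps):
    w = [1.0] * len(F)
    mistakes = 0
    for x, y in data:
        vote = sum(wi if f(x) == '+' else -wi for wi, f in zip(w, F))
        z = '+' if vote >= 0 else '-'
        if z != y:
            mistakes += 1
        w = [wi * (1 - eps) if f(x) != y else wi for wi, f in zip(w, F)]
    return mistakes
```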

Weighted majority extensions: tracking

On any window W of periods, E[Regret(WM,W)] stays low (of order √(log F / |W|)), so the algorithm can track a best expert that changes over time.

Weighted majority extensions: multi-armed bandit

- You don't see x_i; you pick an f and find out only whether it erred (not the alternatives).
- The regret is still small: E[Regret] ≤ c√(F·log F / n).

Outline

1. Online/batch learnability of F
   - Online learnability ⇒ batch learnability ✓
   - Finite F: batch and online learning (via weighted majority) ✓
   - Batch learnability ⇏ online learnability
   - Batch learnability ⇒ (transductive) online learnability
2. Online learning in repeated games
   - Zero-sum: weighted majority ⇒ equilibrium
   - General-sum: no "internal regret" ⇒ correlated equilibrium

Batch  Online Define f c : [0,1] ! {+,–}, f c (x) = sgn(x – c) Simple threshold functions F = {f c | c 2 [0,1]} Batch learnable: Yes Online learnable: ?  Adversary does a “random binary search”  Each label is equally likely to be +/–  E[Regret]=½ for any online algorithm x 1 =.5 + x2x2 – x3x3 + 01 x4x4 – No!  x5x5

Key idea: transductive online learning [KakadeK05]

We see x_1,x_2,…,x_n ∈ X in advance; the labels y_1,y_2,…,y_n ∈ {+,–} are revealed online.

Key idea: transductive online learning [KakadeK05] (X = R², Y = {–,+})

- Adversary picks (x_1,y_1),…,(x_n,y_n) ∈ X × Y and reveals x_1,x_2,…,x_n up front.
- We predict z_1, see y_1; predict z_2, see y_2; …; predict z_n, see y_n.

A transductive online algorithm T predicts T(x_1,y_1,…,x_{i-1},y_{i-1},x_i,x_{i+1},…,x_n) = z_i, with "empirical error" err(T,data) = |{i : z_i ≠ y_i}| / n.

Algorithm for transductive online learning [KK05]

We see x_1,x_2,…,x_n ∈ X in advance; y_1,y_2,…,y_n ∈ {+,–} are revealed online.

- There are L distinct labelings (f(x_1),f(x_2),…,f(x_n)) over all f ∈ F, so the effective size of F is L.
- Run WM with these L labelings as the experts.
- E[Regret(WM,data)] ≤ 2√(ln L / n).
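Since the x's are fixed, each expert is just a label vector; the sketch below runs the same weighted-majority update over label vectors (the threshold-style labelings in the test are my own toy example).

```python
# Transductive weighted majority: the experts are the L distinct labelings
# (f(x_1),...,f(x_n)) of the known points, represented as label tuples.
def transductive_wm(labelings, ys, eps):
    w = [1.0] * len(labelings)
    mistakes = 0
    for i, y in enumerate(ys):
        vote = sum(wi if lab[i] == '+' else -wi
                   for wi, lab in zip(w, labelings))
        z = '+' if vote >= 0 else '-'
        if z != y:
            mistakes += 1
        # penalize every labeling that disagrees with the revealed label
        w = [wi * (1 - eps) if lab[i] != y else wi
             for wi, lab in zip(w, labelings)]
    return mistakes
```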

Candidate efficient algorithm

[Slide shows the table of labelings, with the true labels revealed so far and random labels filled in for the remaining positions.]

How many labelings? Shattering & VC

- Def: S ⊆ X is shattered by F if there are 2^{|S|} ways to label S using f ∈ F.
- VC(F) = max{|S| : S is shattered by F}.
- Example: the VC dimension captures the complexity of F.

How many labelings? Shattering & VC

Sauer's lemma: the number of labelings is L = O(n^{VC(F)}), so E[Regret(WM,data)] ≤ 2√(VC(F)·ln n / n), up to constants.

Cannot batch learn faster than VC(F)/n

Take a shattered set S with |S| = VC(F), put probability 1/n on each point of S, and draw a batch training set of size n. Each x ∈ S is absent from the training set with probability (1 − 1/n)^n ≈ e^{−1}, and on absent points the algorithm can do no better than guess, so E[Regret(B,μ)] ≥ c·VC(F)/n.

Putting it together

- Transductive online: E[Regret(WM,data)] = O(√(VC(F)·log n / n)) — almost identical to the standard VC bound.
- Batch: E[Regret(B,μ)] ≥ c·VC(F)/n.

So: transductive online learnable ⇔ batch learnable ⇔ finite VC(F).

Learnability conclusions

- Finite VC(F) characterizes batch and transductive online learnability.
- Open problem: what property of F characterizes (non-transductive) online learnability?
- Efficiency!? The WM algorithm requires enumeration of F. Thm [KK05]: if one can efficiently find the lowest-error f ∈ F, then one can design an efficient online learning algorithm.

Online learning in repeated games

Repeated games

Example: rock-paper-scissors, with payoffs (Pl. 1, Pl. 2):

|       | R    | P    | S    |
| ----- | ---- | ---- | ---- |
| **R** | 0,0  | -1,1 | 1,-1 |
| **P** | 1,-1 | 0,0  | -1,1 |
| **S** | -1,1 | 1,-1 | 0,0  |

Rounds i = 1,2,…,n:
- Players simultaneously choose actions.
- Players receive payoffs; goal: maximize total payoff.
- Learning: players need not know the opponent/game.
- Feedback: a player only finds out the payoff of his action and the alternatives (not the opponent's action).

(Mixed) Nash Equilibrium

- Each player chooses a distribution over actions.
- Players are mutually optimizing relative to their opponent(s).
- In rock-paper-scissors, the equilibrium plays each action with probability 1/3.

Online learning in 0-sum games (Schapire recap)

- Payoff is A(i,j) for Pl. 1 and −A(i,j) for Pl. 2.
- Going first is a disadvantage: max_i min_j A(i,j) ≤ min_j max_i A(i,j).
- Mixed strategies: max_σ min_τ A(σ,τ) ≤ min_τ max_σ A(σ,τ); the min-max theorem says "=".

Online learning in 0-sum games (Schapire recap)

Each player uses weighted majority (assume payoffs are in [−1,1]):
- Maintain a weight on each action, initially equal.
- Choose an action with probability proportional to its weight.
- Find out the payoffs of each action.
- For each action: weight ← weight·(1 + ε·payoff).

Regret = the possible improvement (in hindsight) from always playing a single action. WM ⇒ regret is low.
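The update can be sketched with both players running it in self-play on rock-paper-scissors; the step size, horizon, and sampling details below are my own illustrative choices.

```python
import random

# Row player's payoff in rock-paper-scissors (column player gets -A[i][j]).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

# Multiplicative-weights self-play: each player keeps a weight per action,
# samples proportionally to the weights, and applies
# weight <- weight * (1 + eps * payoff) with full-information feedback.
def mw_selfplay(n, eps=0.05, seed=0):
    rng = random.Random(seed)
    w1, w2 = [1.0] * 3, [1.0] * 3
    total = 0.0
    for _ in range(n):
        i = rng.choices(range(3), w1)[0]
        j = rng.choices(range(3), w2)[0]
        total += A[i][j]
        w1 = [w1[k] * (1 + eps * A[k][j]) for k in range(3)]   # row player
        w2 = [w2[k] * (1 - eps * A[i][k]) for k in range(3)]   # column player
    return total / n
```

Because both players have vanishing regret, the average payoff approaches the value of the game, which is 0 for rock-paper-scissors.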

Online learning in 0-sum games (Schapire recap)

Actions are (a_1,b_1),(a_2,b_2),…,(a_n,b_n). The regret of Pl. 1 is

max_i (1/n)·Σ_t A(i,b_t) − (1/n)·Σ_t A(a_t,b_t).

Let σ̂, τ̂ be the empirical distributions of the actions a_1,…,a_n and b_1,…,b_n, respectively; then max_i (1/n)·Σ_t A(i,b_t) = max_σ A(σ,τ̂).

WM ⇒ "min-max" theorem

max_σ min_τ A(σ,τ) = min_τ max_σ A(σ,τ) = "value of the game"

Using WM, each player guarantees regret → 0, regardless of the opponent:
- Can beat an idiot in tic-tac-toe.
- A reasonable strategy to use.
- Justifies how such equilibria might arise.

Justifying equilibrium in games

Online learning gives a plausible explanation for how Nash equilibrium might arise.

- Zero-sum games: unique value; fast and easy to "learn".
- General-sum games: not unique. Fast and easy to "learn"?? Polynomial-time algorithm to find one??

General-sum games

- No unique "value"; many very different equilibria.
- Can't naively improve a "no regret" algorithm (by playing a single mixed strategy).
- Low regret for both players ⇏ equilibrium. E.g., the coordination game:

|       | A   | B   |
| ----- | --- | --- |
| **A** | 1,1 | 0,0 |
| **B** | 0,0 | 2,2 |

General-sum games

Low regret ⇏ Nash equilibrium: e.g., the alternating play (1,1),(2,2),(1,1),(2,2),(1,1),(2,2),… can give both players low regret without being a Nash equilibrium. [Slide shows a 4×4 payoff matrix with actions 1–4.]

Refined notion of regret

- Can't naively improve a "no regret" algorithm by playing a single mixed strategy.
- But one might naively improve it by replacing actions: "when the algorithm suggests 1, play 3". [Slide shows the same 4×4 payoff matrix.]

Internal regret

- Internal regret IR(i,j) is how much we could have improved by replacing all occurrences of action i with action j (e.g., "play row 1" → "play column 2's best response").
- No internal regret ⇒ correlated equilibrium.
- Calibration ⇒ correlated equilibrium [FosterVohra].
- Correlated equilibrium [Aumann]: a mediator draws a joint action from a public distribution and tells each player only his own action; the best strategy is to listen to the fish. [Slide shows a payoff matrix with a correlating distribution putting weight 1/6 on entries.]
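Internal regret has a direct computation from a play sequence; the sketch below evaluates every replacement rule "swap i for j" against the realized opponent actions (the rock-paper-scissors payoffs in the test are my own example).

```python
# Internal regret: IR(i -> j) is the per-round gain the row player would have
# obtained by replacing every play of action i with action j, holding the
# opponent's realized actions fixed. If max_{i,j} IR(i,j) <= 0, the empirical
# joint distribution of play is a correlated equilibrium for that player.
def internal_regret(payoff, plays):
    # payoff[a][b]: row player's payoff for action a vs opponent action b
    # plays: list of (a, b) joint actions
    n_actions = len(payoff)
    worst = 0.0
    for i in range(n_actions):
        for j in range(n_actions):
            gain = sum(payoff[j][b] - payoff[i][b]
                       for a, b in plays if a == i)
            worst = max(worst, gain / len(plays))
    return worst          # max over all replacement rules i -> j
```

For contrast, ordinary ("external") regret only compares against always playing one fixed action; internal regret is the finer notion needed for correlated equilibrium.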

Low internal regret → correlated equilibrium

Think of a sequence of joint plays like (1,1),(2,1),(3,2),… as a distribution P over joint actions. No internal regret means no replacement rule "when told i, play j" helps either player, which is exactly the condition for P to be a correlated equilibrium.

Online learning in games: conclusions

- Zero-sum games: weighted majority (low regret) achieves the value of the game.
- General-sum games: low internal regret achieves correlated equilibrium.
- Open problems: Are there natural dynamics ⇒ Nash equilibrium? Is correlated equilibrium going to take over?