Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback Noga Alon, TAU Nicolo Cesa-Bianchi, Milan Claudio Gentile, Insubria Shie Mannor, Technion.


1 Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback Noga Alon, TAU Nicolo Cesa-Bianchi, Milan Claudio Gentile, Insubria Shie Mannor, Technion Yishay Mansour, TAU and MSR Ohad Shamir, Weizmann

2 Nonstochastic sequential decision-making
K actions and T time steps
l_t(a): loss of action a at time t
At time t, the player
– picks action X_t
– incurs loss l_t(X_t)
– observes feedback on the losses
Multi-armed bandit: only l_t(X_t)
Experts (full information): l_t(j) for every j

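3 Nonstochastic sequential decision-making
Goal:
– minimize losses
– benchmark: the best single action, i.e. the action j that minimizes the loss
– no stochastic assumptions on the losses
Regret [formula not preserved in transcript]
Known regret bounds:
– MAB [formula not preserved in transcript]
– Experts [formula not preserved in transcript]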
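The regret formula and the two bounds on this slide are missing from the transcript. As a hedged reminder, the standard textbook definitions for this setting are sketched below; the symbol R_T is an assumption of this note, and the slide's own formulas may have been stated with different constants or logarithmic factors.

```latex
% Expected regret against the best single action (standard definition)
R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t(X_t)\right]
      \;-\; \min_{j \in \{1,\dots,K\}} \sum_{t=1}^{T} \ell_t(j)

% Classical orders of magnitude for losses in [0,1]
\text{MAB (Exp3):}\quad R_T = O\!\left(\sqrt{T K \ln K}\right)
\qquad
\text{Experts:}\quad R_T = O\!\left(\sqrt{T \ln K}\right)
```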

4 Motivation – observability [figure panels: undirected, directed]

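5 Undirected observation graph [figure: eight actions, all losses unobserved]

6 [figure: one action is played and its loss, 3, is observed; the other seven losses are unobserved]

7 [figure: the losses of the played action's neighbors (5, 1, 7) are observed as well; four losses stay unobserved]

8 MAB: no edges. Experts: clique. [figure: with no edges only the played action's loss, 3, is observed; with a clique all eight losses are observed: 5, 3, 6, 1, 4, 7, 8, 2]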

9 Modeling
Directed vs. undirected
– different types of dependencies
– different measures: independent set, dominating set, maximum acyclic subgraph
Informed vs. uninformed: when does the learner observe the graph?
– before
– after, and only the neighbors

10 Our results
Uninformed setting
Undirected graph
– only the neighbors of the node
– independent sets
Directed graph
– maximum acyclic subgraph (not tight)
– random Erdős–Rényi graphs
Informed setting
Directed graphs
– regret characterization: dominating sets and independent sets
– both in expectation and with high probability

11 EXP3-SET
Online algorithm [formulas not preserved in transcript]
Theorem [statement not preserved in transcript]
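Since the algorithm's formulas are missing from the transcript, here is a minimal Python sketch of an exponential-weights update with graph feedback in the spirit of EXP3-SET. It assumes the usual importance-weighted estimate: the loss of action i is divided by q_i, the probability that i is observed. The names `exp3_set` and `observers_of` and the max-weight renormalization are illustrative choices, not taken from the slides.

```python
import math
import random


def exp3_set(losses, observers_of, eta, seed=0):
    """Exponential weights with graph-structured feedback (sketch).

    losses[t][i]    : loss of action i at round t, assumed to lie in [0, 1]
    observers_of[i] : set of actions whose play reveals the loss of i
                      (assumed to contain i itself)
    eta             : learning rate
    """
    rng = random.Random(seed)
    num_actions = len(observers_of)
    weights = [1.0] * num_actions
    total_loss = 0.0
    for loss_t in losses:
        total_weight = sum(weights)
        p = [w / total_weight for w in weights]
        played = rng.choices(range(num_actions), weights=p)[0]
        total_loss += loss_t[played]
        for i in range(num_actions):
            if played in observers_of[i]:
                # q_i = probability that the loss of i is observed this round
                q_i = sum(p[j] for j in observers_of[i])
                weights[i] *= math.exp(-eta * loss_t[i] / q_i)
        # Rescale for numerical stability; the distribution p is unaffected.
        top = max(weights)
        weights = [w / top for w in weights]
    return total_loss, weights
```

With `observers_of[i] = {i}` this reduces to a bandit-style update, and with every set equal to all actions it reduces to the full-information (experts) update.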

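12 EXP3-SET regret – key lemma
Lemma [statement not preserved in transcript]
Note: MAB: Q = K; full information: Q = 1
Proof: build an independent set S
– consider the action a with minimal Pr[a observed]
– add a to S
– delete a and its neighbors
Note [formula not preserved in transcript]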
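The greedy construction in the proof can be sketched in Python. The sketch assumes an undirected graph, strictly positive probabilities, and that Q denotes the sum over actions of p_i / q_i, where q_i is the probability that action i is observed (p_i plus the probabilities of its neighbors). That reading of Q is consistent with the note "MAB: Q = K; full information: Q = 1", but it is an assumption, since the lemma itself is missing from the transcript. The function name is hypothetical.

```python
def greedy_independent_set(neighbors, p):
    """Build an independent set S as in the proof sketch and return (S, Q).

    neighbors[i] : list of neighbors of action i (undirected, no self-loops)
    p[i]         : probability of playing action i (assumed > 0)
    """
    num_actions = len(p)
    # q[i] = Pr[loss of i is observed] = p[i] + sum of p over i's neighbors
    q = [p[i] + sum(p[j] for j in neighbors[i]) for i in range(num_actions)]
    remaining = set(range(num_actions))
    independent = []
    while remaining:
        # Pick the remaining action that is least likely to be observed.
        a = min(remaining, key=lambda i: (q[i], i))
        independent.append(a)
        # Delete a together with its neighbors.
        remaining -= {a} | set(neighbors[a])
    big_q = sum(p[i] / q[i] for i in range(num_actions))
    return independent, big_q
```

Each deleted group contributes at most 1 to Q, because every deleted action j has q_j at least q_a and the deleted probabilities sum to at most q_a. Hence Q is at most |S|, which is at most the independence number α(G).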

13 EXP3-SET: directed case
Directed graph – the lemma does not hold
Example:
– tournament graph: j → i iff j < i
– probabilities p_i = 2^{-i}
– α(G) = 1
Random graph
– Erdős–Rényi with edge parameter r
– Regret [bound not preserved in transcript]
– MAB: r = 0; Experts: r = 1
– Note [formula not preserved in transcript]
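The tournament example can be checked numerically. The sketch below makes three assumptions. First, Q is the sum of p_i / q_i, with q_i the probability that action i is observed. Second, the leftover probability mass is placed on the last action so that p sums to 1. Third, because the slide does not spell out the orientation convention of the arrow, both readings are computed. The function name is hypothetical.

```python
def tournament_q(num_actions, observed_by_larger):
    """Return Q = sum_i p_i / q_i on a transitive tournament (sketch).

    p_i = 2^{-i} for i = 1..K-1, and the leftover mass 2^{-(K-1)} on action K.
    observed_by_larger=True : action i is revealed by playing any j >= i
    observed_by_larger=False: action i is revealed by playing any j <= i
    """
    p = [2.0 ** -i for i in range(1, num_actions)]
    p.append(2.0 ** -(num_actions - 1))
    total = 0.0
    for i in range(num_actions):
        if observed_by_larger:
            q_i = sum(p[i:])       # observers are the low-probability actions
        else:
            q_i = sum(p[:i + 1])   # observers are the high-probability actions
        total += p[i] / q_i
    return total
```

Under the first reading Q equals (K + 1) / 2, which grows linearly in K even though α(G) = 1, so a bound of the form Q ≤ α(G) fails. Under the second reading Q stays below 2, so only the first reading gives a counterexample.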

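14 EXP3-SET: directed case
Upper bound – directed
– mas(G) = maximum acyclic subgraph of G
– tournament: mas(G) = K and α(G) = 1
– Regret [bound not preserved in transcript]
Lower bound – directed
– any fixed graph G
– Regret [bound not preserved in transcript], even if the learner knows the graph in advance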

15 Dominating set – directed graph [figure: eight actions with unobserved losses]

16 Dominating set – directed graph [figure: eight actions with unobserved losses]

17 EXP3-DOM
Simplified version
– fixed graph G
– D is a dominating set (logarithmic approximation)
Main modification
– add probability to D to induce observability
Probabilities: [formula not preserved in transcript]
Select X_t using p_t
Observe l_t(a) for a in S_{X_t,t}
Weights: [formula not preserved in transcript]
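The two ingredients named on this slide can be sketched in Python under stated assumptions. The dominating set is found by the standard greedy set-cover heuristic, which gives a logarithmic approximation. The probabilities are obtained by mixing the normalized weights with the uniform distribution over D at rate γ. The slide's own formulas are missing from the transcript, so this mixing rule is an assumed form, and both function names are hypothetical.

```python
def greedy_dominating_set(out_neighbors):
    """Greedy (set-cover style) dominating set of a directed graph.

    out_neighbors[i] : actions whose losses are revealed when i is played
                       (i always reveals itself)
    """
    num_actions = len(out_neighbors)
    covers = [set(out_neighbors[i]) | {i} for i in range(num_actions)]
    uncovered = set(range(num_actions))
    dominating = []
    while uncovered:
        # Take the action revealing the most still-uncovered actions;
        # ties are broken in favor of the smaller index.
        best = max(range(num_actions),
                   key=lambda i: (len(covers[i] & uncovered), -i))
        dominating.append(best)
        uncovered -= covers[best]
    return dominating


def mix_with_dominating_set(weights, dominating, gamma):
    """p_i = (1 - gamma) * w_i / W + (gamma / |D|) * 1[i in D]  (assumed form)."""
    total_weight = sum(weights)
    bonus = gamma / len(dominating)
    members = set(dominating)
    return [(1.0 - gamma) * w / total_weight + (bonus if i in members else 0.0)
            for i, w in enumerate(weights)]
```

With this mixing every action is observed with probability at least γ / |D|, because some member of D reveals it.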

18 EXP3-DOM: simple example
Transitive observability – tournament
– action 1 observes all actions
– D = {1}
EXP3-DOM
– sample action 1 with probability γ: action 1 is the exploration
– otherwise run a MAB algorithm, specifically EXP3-SET
Intuition – action 1 replaces the mixture with the uniform distribution

19 Conclusion
Observability model
– between MAB and Experts
– more work to be done
Uninformed setting – undirected graph
Informed setting – directed graph
[Kocák, Neu, Valko and Munos] improved the uninformed setting

20

21 EXP3-DOM – main theorem
Theorem [statement not preserved in transcript]; tuning γ
Corollary [statement not preserved in transcript]

23 Outline
– Model and motivation
– Symmetric observability
– Non-symmetric observability

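24 EXP3-DOM: key lemma
Lemma [inequality not preserved in transcript]
– G directed graph
– d_i^-: indegree of i
– α = α(G)
Turán’s theorem [inequality not preserved in transcript]
– undirected graph G(V,E)
Proof (high level)
– shrink the graph: G_K, G_{K-1}, …
– delete one node per step; at step s, delete the node with maximum indegree
– from Turán’s theorem [inequality not preserved in transcript]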
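For reference, Turán's theorem in the form relevant to independence numbers, together with one way it can yield a high-indegree node, is sketched below. The second display is a reading of the proof outline, not a formula recovered from the slide. In it, E denotes the edge set of the undirected skeleton of G, and α(G) is taken with directions ignored.

```latex
% Turán's theorem for an undirected graph G = (V, E)
\alpha(G) \;\ge\; \frac{|V|}{1 + 2|E|/|V|}
\qquad\Longleftrightarrow\qquad
\frac{2|E|}{|V|} \;\ge\; \frac{|V|}{\alpha(G)} - 1

% Consequence for a directed graph: every skeleton edge comes from at least
% one arc, so the average (hence the maximum) indegree is at least |E|/|V|
\max_{i \in V} d_i^- \;\ge\; \frac{1}{|V|}\sum_{i \in V} d_i^-
\;\ge\; \frac{|E|}{|V|}
\;\ge\; \frac{1}{2}\left(\frac{|V|}{\alpha(G)} - 1\right)
```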

25 EXP3-DOM: key lemma (proof)
Completing the proof [derivation not preserved in transcript]
Note [inequality not preserved in transcript], due to edge elimination

26 EXP3-DOM: key lemma (modified)
Lemma (what we really need!) [inequality not preserved in transcript]
G(V,E) directed graph
– IN_i: indegree of i
– r: size of the dominating set; α: size of the independent set
– p: distribution over V with p_i ≥ β

27 EXP3-DOM: changing graphs
Simple case
– all dominating sets have the same size
– or approximately the same size
Problem
– dominating sets of different sizes: the size can be 1 or K
Solution
– keep logarithmically many levels, depending on ⌊log_2 |D_t|⌋
– one algorithm per level
Complications
– parameters depend on the level
– setting the learning rate needs a delicate doubling
Main technical challenge
– handling a dynamic adversary

28 EXP3-DOM
Receive the observation graph
– find a dominating set D_t (logarithmic approximation)
Run the right copy (log copies)
– let b_t = ⌊log_2 |D_t|⌋
– run copy b_t
For copy b_t
– parameters depend on b_t
– probabilities: [formula not preserved in transcript]
– select X_t using p
– observe l_t(a) for a in S_{X_t,t}
– weights: [formula not preserved in transcript]
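The routing of rounds to copies can be sketched in Python. It assumes the level is the floor of log_2 of the dominating-set size, so that copy b handles sizes from 2^b to 2^(b+1) - 1. The rounding brackets are garbled in the transcript, so the floor is an assumption, and both function names are hypothetical.

```python
def level_of(dominating_set_size):
    """Level b with 2**b <= size < 2**(b + 1), i.e. floor(log2(size))."""
    if dominating_set_size < 1:
        raise ValueError("a dominating set has at least one element")
    return dominating_set_size.bit_length() - 1


def route_rounds(dominating_set_sizes):
    """Group round indices by the copy (level) that would handle them."""
    rounds_per_level = {}
    for t, size in enumerate(dominating_set_sizes):
        rounds_per_level.setdefault(level_of(size), []).append(t)
    return rounds_per_level
```

With K actions, sizes range from 1 to K, so at most floor(log2 K) + 1 copies are ever created, which matches the "log copies" remark.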

29 EXP3-DOM – main theorem
Theorem [statement not preserved in transcript]; tuning γ_b

30 Independent set
Independent set α(G) [Mannor & Shamir 2012]
Tight regret [bound not preserved in transcript] – α(G) “replaces” K
Cons:
– requires observing G
– solves an LP at each step
[figure: eight actions with unobserved losses]

