Published by Bertina McCarthy. Modified over 7 years ago.

1
Machine Learning in the Bandit Setting: Algorithms, Evaluation, and Case Studies
  Lihong Li
  Machine Learning, Yahoo! Research
  SEWM, 2012-05-25

2
[Diagram: labels DATA, KNOWLEDGE, ACTION, UTILITY, MORE DATA; annotations “Statistics, ML, DM, …” and “Reinforcement Learning”]

3
Outline: Introduction; Basic Solutions; Advanced algorithms; Advanced Offline Evaluation; Conclusions

4
Yahoo-User Interaction
  CONTEXT: gender, age, …
  ACTION: ads, news, ranking, …
  REWARD: click, conversion, revenue, …
  POLICY: serving strategy
  Goal: maximize total REWARD by optimizing POLICY

5
Today Module @ Yahoo! Front Page
  “Featured Article”
  A small pool of articles chosen by editors

6
Objectives and Challenges

7
Challenge: Explore/Exploit
  Observation: only displayed articles get user click feedback
  [Figure: article CTR estimates]
  How to trade off?
    › … with dynamic article pools
    › … while considering user interests

8
Insufficient Exploration Example
  One arm always pays $5/round
  The other arm pays $100 a quarter of the time (so $25/round on average)
  [Figure: rounds 1 to 8 with observed payoffs such as $5, $0, $5]
  It turns out… [Figure: revealed payoffs such as $100, $0, $5]
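The failure mode on this slide can be reproduced with a purely greedy learner. The sketch below is illustrative only: the scripted payoff sequences `SAFE` and `RISKY` and the function name `greedy_play` are assumptions chosen to match the $5 and $100-a-quarter-of-the-time description, not data from the talk.

```python
def greedy_play(payoffs):
    """Try each arm once, then always play the arm with the best sample mean.

    payoffs[arm][t] is what `arm` would pay in round t (a hypothetical script).
    Returns (total money earned, list of arms chosen per round).
    """
    n_arms = len(payoffs)
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    chosen = []
    for t in range(len(payoffs[0])):
        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            arm = untried[0]
        else:
            # Pure exploitation: no exploration bonus of any kind.
            arm = max(range(n_arms), key=lambda a: sums[a] / counts[a])
        sums[arm] += payoffs[arm][t]
        counts[arm] += 1
        chosen.append(arm)
    return sum(sums), chosen


# Assumed scripts: the safe arm pays $5 every round; the risky arm pays
# $100 in 2 of 8 rounds ($25/round on average) but $0 when first tried.
SAFE = [5.0] * 8
RISKY = [0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 100.0, 0.0]

total, chosen = greedy_play([SAFE, RISKY])
```

After one unlucky $0 from the risky arm, the greedy learner never returns to it, even though that arm is five times better on average.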

9
Contextual Bandit Formulation
  Multi-armed contextual bandit [LZ’08], in each round:
    › Observe K arms A_t and “context”
    › Select an arm
    › Receive reward
  In Today Module: [mapping not recoverable from the slide]
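The observe/select/receive protocol can be sketched as a plain loop. The names `interact`, `choose`, `learn`, and `reward_of` are hypothetical; the one property taken from the slides is that the learner sees the reward of the selected arm only.

```python
def interact(rounds, choose, learn):
    """Run the contextual bandit protocol over a sequence of rounds.

    rounds: iterable of (context, arms, reward_of), where reward_of(arm)
            reveals the reward of one arm (assumed interface).
    choose: function (context, arms) -> arm, i.e. the policy.
    learn:  function (context, arm, reward) called with bandit feedback.
    """
    total = 0.0
    for context, arms, reward_of in rounds:
        arm = choose(context, arms)   # select
        reward = reward_of(arm)       # receive reward for that arm only
        learn(context, arm, reward)   # rewards of other arms stay unknown
        total += reward
    return total
```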

10
Another Example – Display Ads

11
Yet Another Example - Ranking

12
Related Work
  Standard information retrieval and collaborative filtering
    › Also concerned with (personalized) recommendation
    › But with (almost) static users/items
    › Training often done in batch/offline mode
    › No need for online exploration
  Full reinforcement learning
    › General: includes bandit problems as special cases
    › Needs to tackle “temporal credit assignment”

13
Outline: Introduction; Basic Solutions (› Algorithms › Evaluation › Experiments); Advanced algorithms; Advanced Offline Evaluation; Conclusions

14
Prior Bandit Algorithms
  Regret minimization (focus of this talk): Herbert Robbins, Tze Leung Lai
  Bayesian optimal solution: John Gittins

15
Traditional K-armed Bandits
  Assumption: CTR (click-through rate) not affected by user features
  CTR estimates = #clicks / #impressions
  The more arm “a” has been displayed, the less uncertainty in its CTR estimate
  No contexts, so no personalization
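A context-free learner that uses these CTR estimates, with an uncertainty bonus that shrinks as an arm is displayed more often, can be sketched with the standard UCB1 index (sample mean plus sqrt(2 ln t / n)). The class name and method names are illustrative assumptions, not the talk's implementation.

```python
import math


class UCB1:
    """Context-free K-armed bandit using the standard UCB1 index."""

    def __init__(self, n_arms):
        self.clicks = [0.0] * n_arms
        self.impressions = [0] * n_arms

    def select(self):
        # Display every arm once before trusting any estimate.
        for arm, n in enumerate(self.impressions):
            if n == 0:
                return arm
        t = sum(self.impressions)

        def index(arm):
            ctr = self.clicks[arm] / self.impressions[arm]  # #clicks / #impressions
            bonus = math.sqrt(2.0 * math.log(t) / self.impressions[arm])
            return ctr + bonus  # the bonus shrinks as the arm is shown more

        return max(range(len(self.impressions)), key=index)

    def update(self, arm, click):
        self.impressions[arm] += 1
        self.clicks[arm] += click
```

Because no user features enter the index, every user is shown the same arm at a given time, which is the "no personalization" limitation noted on the slide.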

16
Contextual Bandit Algorithms
  EXP4 [ACFS’02], EXP4.P [BLLRS’11], elimination [ADKLS’12]
    › Strong theoretical guarantees
    › But computationally expensive
  Epoch-greedy [LZ’08]
    › Similar to ε-greedy
    › Simple, general and less expensive
    › But not most effective
  This talk: algorithms with compact, parametric models
    › Both efficient and effective
    › Extension of UCB1 to linear models
    › … and to generalized linear models
    › Randomized algorithm with Thompson sampling

17
LinUCB: UCB for Linear Models
  Linear model assumption: [equation missing]
  Standard least-squares ridge regression: [equation missing]
  Reward prediction for new user: [equation missing], with a prediction error term
  Whether to explore requires quantifying parameter uncertainty

18
LinUCB: UCB for Linear Models (II)
  Recall... [equation missing]
  LinUCB always selects an arm with highest UCB: [equation missing], with one term to exploit and one term to explore
  LinRel [Auer 2002] works similarly but in a more complicated way.
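Since the slide's formulas did not survive, here is a minimal sketch of a per-arm LinUCB under stated assumptions: a separate ridge regression per arm with regularizer 1, so A = I + sum of x x^T and b = sum of r x, and score = theta^T x + alpha * sqrt(x^T A^{-1} x). The inverse is maintained with the Sherman-Morrison rank-one update. Class and function names, and the default `alpha`, are illustrative.

```python
import math


class LinUCBArm:
    """Ridge-regression model for one arm, with an upper confidence bound."""

    def __init__(self, dim):
        # A starts as the identity, so A^{-1} is the identity too.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def ucb(self, x, alpha):
        theta = self._A_inv_times(self.b)              # ridge estimate
        mean = sum(t * xi for t, xi in zip(theta, x))  # term used to exploit
        A_inv_x = self._A_inv_times(x)
        width = math.sqrt(sum(xi * v for xi, v in zip(x, A_inv_x)))  # term used to explore
        return mean + alpha * width

    def update(self, x, reward):
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1}x)(A^{-1}x)^T / (1 + x^T A^{-1} x)
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, A_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def select_arm(arms, x, alpha=1.0):
    """Pick the index of the arm with the highest upper confidence bound."""
    return max(range(len(arms)), key=lambda a: arms[a].ucb(x, alpha))
```

The rank-one update keeps each step at O(d^2) per arm instead of re-inverting a d-by-d matrix.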

19
Outline: Introduction; Basic Solutions (› Algorithms › Evaluation › Experiments); Advanced algorithms; Advanced Offline Evaluation; Conclusions

20
Evaluation of Bandit Algorithms
  Goal: estimate the average reward of running a policy with iid x
    › Static policy
    › Adaptive policy
  Gold standard
    › Run the policy in the real system and see how well it works
    › … but expensive and risky

21
Offline Evaluation
  Benefits
    › Cheap and risk-free!
    › Avoid frequent bucket tests
    › Replicable / fair comparisons
  Common in non-interactive learning problems (e.g., classification)
    › Benchmark data organized as (input, label) pairs
  … but not straightforward for interactive learning problems
    › Data in bandits usually consists of (context, arm, reward) triples
    › No reward signal for other arm’ ≠ arm

22
Common/Prior Evaluation Approaches
  [Diagram: a user model is built (classification, regression, density estimation) and used to evaluate the bandit algorithm]
    › This (difficult) step is often biased, so the evaluation is unreliable
  In contrast, our approach
    › avoids explicit user modeling (simple)
    › gives unbiased evaluation results (reliable)

23
Our Evaluation Method: “Replay”
  [Diagram: logged data replayed to the bandit algorithm]
  Key requirement for data collection: [condition missing]
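A minimal sketch of the replay idea, assuming (as the later "random bucket" and "uniformly random data" slides suggest) that each logged arm was chosen uniformly at random. The policy under evaluation is stepped through the log; an event counts only when the policy picks the arm that was actually shown. Function and parameter names are hypothetical.

```python
def replay_evaluate(select, update, logged_events):
    """Estimate a policy's average reward from uniformly random logged data.

    select: function (context) -> arm, the policy being evaluated.
    update: function (context, arm, reward), lets an adaptive policy learn.
    logged_events: iterable of (context, logged_arm, reward) triples.
    Returns (estimated reward per matched event, number of matched events).
    """
    total_reward = 0.0
    matched = 0
    for context, logged_arm, reward in logged_events:
        chosen = select(context)
        if chosen != logged_arm:
            continue  # no reward was observed for the policy's choice: skip
        update(context, chosen, reward)
        total_reward += reward
        matched += 1
    if matched == 0:
        return float("nan"), 0
    return total_reward / matched, matched
```

With K arms logged uniformly, roughly a 1/K fraction of events match, so the number of usable events shrinks as the arm pool grows.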

24
Theoretical Guarantees
  Thm 1: Our estimator is unbiased
    › Mathematically, [equation missing]
    › So on average it reflects real, online performance
  Thm 2: Estimation error → 0 with more data
    › Mathematically, [equation missing]
    › So accuracy is guaranteed with a large volume of data

25
Case Study in Today Module [LCLW’11]
  Data:
    › Large volume of real user traffic in Today Module
  Policies being evaluated:
    › EMP [ACE’09]
    › SEMP/CEMP: personalized EMP variants
    › Use policies’ online bucket CTR as “truth”
  Random bucket data for evaluation:
    › 40M visits, K ≈ 20 on average
    › Use it to offline-evaluate policies’ CTR
  Are they close?

26
Unbiasedness (Article nCTR)
  [Figure: Estimated nCTR vs. Recorded Online nCTR]
  The offline estimate is indeed unbiased!

27
Unbiasedness (Daily nCTR)
  [Figure: Recorded Online nCTR and Estimated nCTR over ten days in November 2009]
  The offline estimate is indeed unbiased!

28
Estimation Error
  [Figure: nCTR Estimation Error vs. Number of Data (L)]
  Recall our theoretical error bound: [equation missing]

29
Unbiased Offline Evaluation: Recap
  What we have shown
    › A principled method for benchmark data collection
    › which allows reliable/unbiased evaluation
    › of any bandit algorithm
  Analogue: UCI, Caltech101, ... datasets for supervised learning
  The first such benchmark was released by Yahoo!
    http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
  2nd and 3rd versions available for PASCAL2 Challenge
    › ICML 2012 workshop

30
Outline: Introduction; Basic Solutions (› Algorithms › Evaluation › Experiments); Advanced algorithms; Advanced Offline Evaluation; Conclusions

31
Experiment Setup: Architecture
  [Diagram: “Learning Bucket” (5%), where E/E happens; “Deployment Bucket” (95%), exploitation only]
  Model updated every 5 minutes
  Main metric: overall normalized CTR in deployment bucket
    › nCTR = CTR * secretNumber (to protect sensitive business information)

32
Experiment Setup: Data
  May 1, 2009 data for parameter tuning
  May 3-9, 2009 data for performance evaluation (33M visits)
  Number of candidate articles per user visit is about 20
  Dimension reduction on user features [CBP+’09]: 6 features
  Data available from Yahoo! Research’s Webscope program
    http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

33
CTR in Deployment Bucket [LCLS’10]
  [Figure: CTR by algorithm; annotations: “cheating” policy, (no feature)]
  UCB-type algorithms do better than ε-greedy counterparts
  CTR improved significantly when features/contexts are considered

34
Article CTR Lift
  [Figure: panels “no context” and “linear model”; legend: + ε-greedy, o UCB]

35
Outline: Introduction; Basic Solutions; Advanced algorithms (› Hybrid linear models › Generalized linear models › Thompson sampling › Theory); Advanced Offline Evaluation; Conclusions

36
LinUCB for Hybrid Linear Models
  Hybrid model combines
    › information shared by all articles (e.g., teens like articles about Harry Potter)
    › article-specific information (e.g., Californian males like this article)
  Advantage: learns faster when there are few data
  Challenge: seems to require unbounded computation complexity
  Good news! Efficient implementation made possible by block matrix manipulations

37
Overall CTR in Deployment Bucket
  [Figure: overall CTR by algorithm; annotation: advantage of hybrid model]
  UCB-type algorithms do better than ε-greedy counterparts
  CTR improved significantly when features/contexts are considered
  Hybrid model is better when data are scarce

38
Outline: Introduction; Basic Solutions; Advanced algorithms (› Hybrid linear models › Generalized linear models › Thompson sampling › Theory); Advanced Offline Evaluation; Conclusions

39
Extensions to GLMs
  Linear models are unnatural for binary events
  Generalized linear models (GLMs): [equation missing], using an “inverse link function”
    › Logistic regression: logistic function
    › Probit regression: Φ (CDF of standard Gaussian)
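The two inverse link functions named on this slide have standard closed forms, sketched below. The helper `predict_ctr` and its argument names are illustrative assumptions; it only shows how a linear score is mapped into (0, 1), which a plain linear model does not guarantee.

```python
import math


def logistic(z):
    """Inverse link for logistic regression: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def probit(z):
    """Inverse link for probit regression: the standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def predict_ctr(theta, x, inverse_link):
    """Map the linear score theta^T x to a click probability."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return inverse_link(score)
```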

40
Model Fitting in GLMs
  Maintain a Bayesian posterior of parameter θ_a by N(μ_a, Σ_a)
  Use Bayes’ formula with new data (x, r):
    › New posterior ∝ Current posterior × Likelihood
    › Laplace approximation keeps the new posterior Gaussian

41
UCB Heuristics for GLMs
  Use posterior N(μ_a, Σ_a) to derive (approximate) upper confidence bounds [LCLMW’12]

42
Experiment Setup
  [Diagram: “Learning Bucket” (5%), where E/E happens; “Deployment Bucket” (95%), exploitation only]
  One week of data from June 2009 (34M user visits)
  About 20 candidate articles per user visit
  Features: 20 features by PCA on raw binary user features
  Model updated every 5 minutes
  Main metric: overall (normalized) CTR in deployment bucket

43
GLM Comparisons
  [Figure: ε-greedy exploration vs. UCB exploration for linear, logistic, and probit models]
  Obs #1: active exploration is necessary
  Obs #2: Logistic/probit > linear
  Obs #3: UCB > ε-greedy

44
Outline: Introduction; Basic Solutions; Advanced algorithms (› Hybrid linear models › Generalized linear models › Thompson sampling › Theory); Advanced Offline Evaluation; Conclusions

45
Limitations of UCB Exploration
  Exploration can be too much
    › may explore the whole space exhaustively
    › difficult to use prior knowledge
  Exploration is deterministic
    › Poor performance when rewards are delayed
  Deriving an (approx.) UCB is not always easy

46
Thompson Sampling (1933)
  Algorithmic idea: “probability matching”
    › Pr(a|x) = Pr(a is optimal for x)
  Randomized action selection (by definition)
    › More robust to reward delay
  Straightforward to implement [CL’12]
    › Maintain parameter posterior: [equation missing]
    › Draw random models: [equation missing]
    › Act accordingly: [equation missing]
  Easily combined with other (non-)parametric models
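The three steps (maintain a posterior, draw a random model, act on the draw) can be sketched in the simplest conjugate setting. Note the swap: the talk uses logistic regression with Gaussian posteriors, whereas this sketch assumes context-free Bernoulli clicks with a Beta(1, 1) prior per arm, purely because that posterior has a closed form. All names are illustrative.

```python
import random


class BetaThompson:
    """Thompson sampling for Bernoulli rewards with Beta posteriors."""

    def __init__(self, n_arms, seed=None):
        # Maintain parameter posterior: Beta(1 + clicks, 1 + non-clicks) per arm.
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        # Draw random models: one CTR sample per arm from its posterior.
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        # Act accordingly: play the arm that is best under the sampled model.
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, click):
        if click:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0
```

Because selection is random, the learner keeps choosing different arms even when `update` calls arrive late or in batches, which is one intuition for the robustness to reward delay claimed on the slide.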

47
Thompson Sampling
  One-week data from Today Module on Yahoo!’s front page
  Logistic regression with Gaussian posteriors
  Obs #1: TS is competitive uniformly
  Obs #2: TS is more robust to reward delay

48
Outline: Introduction; Basic Solutions; Advanced algorithms (› Hybrid linear models › Generalized linear models › Thompson sampling › Theory); Advanced Offline Evaluation; Conclusions

49
Regret-based Competitive Analysis
  Regret = (the best we could do if we knew all) − (achieved by algorithm): [equation missing]
  An algorithm “learns” if [condition missing]
  An algorithm “learns fast” if [quantity missing] is small
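The slide's formulas were lost in extraction. One standard way to write the regret comparison is given below; the notation (T rounds, contexts x_t, chosen arm a_t, best arm a_t^*, reward r_{t,a}) is an assumption and may differ from what the slide showed.

```latex
% Regret after T rounds: best achievable expected reward minus the algorithm's.
R(T) \;=\;
  \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} r_{t,a_t^*}\Big]}_{\text{best we could do if we knew all}}
  \;-\;
  \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} r_{t,a_t}\Big]}_{\text{achieved by algorithm}},
\qquad
a_t^* \in \arg\max_{a} \, \mathbb{E}\big[r_{t,a} \mid x_t\big].

% "Learns": the per-round regret vanishes.
\lim_{T \to \infty} \frac{R(T)}{T} = 0.

% "Learns fast": R(T) itself is small, e.g. grows sublinearly in T.
```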

50
Regret Bounds
  LinUCB [CLRS’11]: [bound missing], with matching lower bound
  Generalized LinUCB: still open
    › A variant [FCGSz’11]: [bound missing]
  Thompson sampling
    › A variant [L’12]: [bound missing]

51
Outline: Introduction; Basic Solutions; Advanced algorithms; Advanced Offline Evaluation (› Importance weighting › Doubly robust technique); Conclusions

52
Extensions
  Uniformly random data sometimes are a luxury…
    › System/cost constraints, user experience considerations, …
  Randomized log suffices (by importance weighting)
  Variance reduction with the “doubly robust” technique [DLL’11]
  Better bias/variance tradeoff by soft rejection sampling [DDLL’12]
  [Annotation: controls bias/variance trade-off [SLLK 2011]]

53
Offline Evaluation with Non-Uniform Data
  Key idea: importance reweighting [equation missing]
  Can use weighted empirical average with estimated p(a|x)
  [Annotation on a parameter of the estimator: controls bias/variance trade-off [SLLK 2011]]
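Importance reweighting can be sketched as a weighted empirical average over the log. The threshold `tau` is an assumption standing in for the parameter annotated as controlling the bias/variance trade-off: clipping small probabilities from below lowers variance at the price of bias. The function name and the tuple layout of the log are hypothetical.

```python
def ips_value(policy, logged, tau=0.0):
    """Importance-weighted estimate of a policy's average reward.

    policy: function (context) -> arm.
    logged: list of (context, arm, reward, prob), where prob is the
            (possibly estimated) probability p(arm | context) under the
            logging policy; it must be positive when tau is 0.
    tau:    lower clip on prob (assumed knob; 0 means no clipping).
    """
    total = 0.0
    for context, arm, reward, prob in logged:
        if policy(context) == arm:
            # Events shown rarely by the logger count for more.
            total += reward / max(prob, tau)
    return total / len(logged)
```

With `tau=0` and exact probabilities this is the unclipped importance-weighted average; raising `tau` shrinks the largest weights.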

54
Results in Today Module Data [SLLK’11]

55
Outline: Introduction; Basic Solutions; Advanced algorithms; Advanced Offline Evaluation (› Importance weighting › Doubly robust technique); Conclusions

56
Doubly Robust Estimation
  Importance weighted formula: [equation missing]
    › Estimation has high variance if p(a|x) is small
  Doubly robust technique: [equation missing]
  Usually DR estimate decreases variance [DLL’11]
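A minimal sketch of the doubly robust idea in its usual form: start from a reward model's prediction for the arm the policy would choose, then add an importance-weighted correction on the events where that arm matches the logged arm. The `reward_model` callable and the log layout are hypothetical; the slide's own formula was not recoverable.

```python
def dr_value(policy, reward_model, logged):
    """Doubly robust estimate of a policy's average reward.

    policy:       function (context) -> arm.
    reward_model: function (context, arm) -> predicted reward (any model).
    logged:       list of (context, arm, reward, prob) with prob > 0,
                  the logging probability p(arm | context).
    """
    total = 0.0
    for context, arm, reward, prob in logged:
        target = policy(context)
        estimate = reward_model(context, target)  # model-based baseline
        if target == arm:
            # Correct the baseline using the observed reward; the residual
            # is what gets divided by the possibly small probability.
            estimate += (reward - reward_model(context, arm)) / prob
        total += estimate
    return total / len(logged)
```

If the reward model is identically zero, this reduces to plain importance weighting; the closer the model is to the true rewards, the smaller the residuals that get divided by p(a|x).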

57
Multiclass Classification
  K-class classification as a K-armed bandit
  Training data
    › In usual (non-bandit) setting, [equation missing]
    › In bandit setting, [equation missing]
  [Figure: loss matrices for the usual setting and the bandit setting, indexed 1, 2, 3, …, m and 1, 2, 3, …, K, with r_ij in the (i,j) entry; legend: observed loss, unobserved loss]
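The conversion from fully labeled data to bandit data can be sketched as follows. The assumptions here are loud ones: the logged class is drawn uniformly at random, and the loss is 0/1 (0 when the drawn class equals the true label). The slide does not state either choice, and the function name is hypothetical.

```python
import random


def to_bandit_data(examples, n_classes, seed=0):
    """Turn (features, label) pairs into partially labeled bandit data.

    For each example only one entry of its loss row is revealed: the loss
    of a randomly drawn class. Returns (features, arm, loss, prob) tuples,
    where prob is the probability with which the arm was drawn.
    """
    rng = random.Random(seed)
    logged = []
    for features, label in examples:
        arm = rng.randrange(n_classes)       # assumed uniform logging policy
        loss = 0.0 if arm == label else 1.0  # assumed 0/1 loss
        logged.append((features, arm, loss, 1.0 / n_classes))
    return logged
```

Data in this form can be fed to an importance-weighted or doubly robust estimator, treating the loss as a negated reward.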

58
Experimental Results on UCI Datasets
  Split data 50/50 for training (fully labeled) and testing (partially labeled)
  Train on training data, evaluate on test data
  Repeated 500 times

59
Outline: Introduction; Basic Solutions; Advanced algorithms; Advanced Offline Evaluation; Conclusions

60
Conclusions
  Contextual bandit as a principled formulation for
    › News article recommendation
    › Internet advertising
    › Web search
    › ...
  An offline evaluation method of bandit algorithms
    › unbiased
    › accurate compared to online bucket results
  Encouraging results in significant applications
    › strong performance of UCB/TS exploration

61
Future Work
  Offline evaluation
    › Better use of non-uniform data
    › Extension to full reinforcement learning
  Use of prior knowledge
  Variants of bandits
    › Bandits with budgets
    › Bandits with many arms
    › Bandits with multiple objectives
    › Bandits with submodular rewards
    › Bandits with delayed reward observations
    › …

62
References
  Offline policy evaluation
    › [LCLW] Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. WSDM, 2011
    › [SLLK] Learning from logged implicit exploration data. NIPS, 2010
    › [DLL] Doubly robust policy evaluation and learning. ICML, 2011
    › [DDLL] Sample-efficient nonstationary-policy evaluation for contextual bandits. Under review.
  Bandit algorithms
    › [LCLS] A contextual-bandit approach to personalized news article recommendation. WWW, 2010
    › [CLRS] Contextual bandits with linear payoff functions. AISTATS, 2011
    › [BLLRS] Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2011
    › [CL] An empirical evaluation of Thompson sampling. NIPS, 2011
    › [LCLMW] Unbiased offline evaluation of contextual bandit algorithms with generalized linear models. JMLR W&PS, 2012
