
1
Machine Learning in the Bandit Setting: Algorithms, Evaluation, and Case Studies
Lihong Li, Machine Learning, Yahoo! Research
SEWM, 2012-05-25

2
[Diagram: DATA → KNOWLEDGE (via statistics, ML, DM, …) → ACTION → UTILITY and MORE DATA; the closed loop of reinforcement learning]

3
Outline
Introduction
Basic Solutions
Advanced Algorithms
Advanced Offline Evaluation
Conclusions

4
Yahoo!-User Interaction
ACTION: ads, news, ranking, …
REWARD: click, conversion, revenue, …
CONTEXT: gender, age, …
POLICY: serving strategy
Goal: maximize total REWARD by optimizing the POLICY

5
Today Module @ Yahoo! Front Page
“Featured Article”: a small pool of articles chosen by editors

6
Objectives and Challenges

7
Challenge: Explore/Exploit
Observation: only displayed articles get user click feedback, so article CTR estimates remain uncertain.
How to trade off exploration and exploitation?
… with dynamic article pools
… while considering user interests

8
Insufficient Exploration Example
Arm 1 always pays $5/round; Arm 2 pays $100 a quarter of the time (so $25/round on average).
[Table: rounds 1–8 of play; after seeing $0 from Arm 2 early, the player settles on the safe $5 arm, and it turns out the $100 payoffs were being missed.]
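The example above can be simulated. A minimal sketch (the ε value and function names are illustrative, not from the slides) comparing a purely greedy player with an ε-greedy one on the two arms:

```python
import random

def pull(arm):
    # Arm 0 always pays $5; arm 1 pays $100 a quarter of the time ($25 on average).
    return 5.0 if arm == 0 else (100.0 if random.random() < 0.25 else 0.0)

def run(epsilon, rounds=10000, seed=0):
    random.seed(seed)
    totals, counts = [0.0, 0.0], [0, 0]
    reward = 0.0
    for t in range(rounds):
        if t < 2:                        # try each arm once to initialize
            arm = t
        elif random.random() < epsilon:  # explore uniformly at random
            arm = random.randrange(2)
        else:                            # exploit the current CTR-style estimates
            arm = 0 if totals[0] / counts[0] >= totals[1] / counts[1] else 1
        r = pull(arm)
        totals[arm] += r
        counts[arm] += 1
        reward += r
    return reward / rounds

print("greedy :", run(epsilon=0.0))  # often stuck near $5/round
print("eps=0.1:", run(epsilon=0.1))  # typically approaches $25/round
```

With no exploration, one unlucky $0 draw from Arm 2 can lock the player onto Arm 1 forever; a small ε fixes this.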

9
Contextual Bandit Formulation
Multi-armed contextual bandit [LZ’08]. At each round t:
› Observe K arms A_t and a “context” x_t
› Select an arm a_t from A_t
› Receive reward r_{t,a_t}
In Today Module: the arms are the candidate articles, the context is the user’s features, and the reward is a click (1) or no click (0).
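The protocol above can be written as a small interaction loop. A sketch under assumed names (the class, method names, and toy environment are illustrative):

```python
import random

class RandomPolicy:
    """Baseline policy: picks an arm uniformly at random, ignoring context."""
    def select(self, context, arms):
        return random.choice(arms)
    def update(self, context, arm, reward):
        pass  # a learning policy would update its model here

def run_bandit(policy, rounds, draw_round):
    """Generic contextual-bandit loop: observe (context, arms), act, get reward."""
    total = 0.0
    for t in range(rounds):
        context, arms, reward_fn = draw_round(t)  # environment reveals context and arm set
        arm = policy.select(context, arms)        # policy picks one arm
        reward = reward_fn(arm)                   # only the chosen arm's reward is observed
        policy.update(context, arm, reward)
        total += reward
    return total

# Toy environment: two arms; arm "b" pays off only for context 1, arm "a" only for context 0.
def draw_round(t):
    context = t % 2
    reward_fn = lambda arm: 1.0 if (arm == "b") == (context == 1) else 0.0
    return context, ["a", "b"], reward_fn

random.seed(0)
print(run_bandit(RandomPolicy(), 1000, draw_round))  # ~500 for a random policy
```

Note the key constraint baked into the loop: the reward of the unchosen arm is never revealed, which is exactly what makes offline evaluation hard later in the talk.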

10
Another Example: Display Ads

11
Yet Another Example: Ranking

12
Related Work
Standard information retrieval and collaborative filtering: also concerned with (personalized) recommendation, but with (almost) static users/items; training is often done in batch/offline mode, with no need for online exploration.
Full reinforcement learning: more general, including bandit problems as special cases, but must also tackle “temporal credit assignment”.

13
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced Algorithms
Advanced Offline Evaluation
Conclusions

14
Prior Bandit Algorithms
Regret minimization (focus of this talk): Herbert Robbins, Tze Leung Lai
Bayesian optimal solution: John Gittins

15
Traditional K-armed Bandits
Assumption: CTR (click-through rate) is not affected by user features.
CTR estimate for article a: #clicks / #impressions.
The more a has been displayed, the less uncertainty in its CTR estimate.
No contexts ⇒ no personalization.
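The classic UCB1 rule (Auer et al., 2002) makes this display-count uncertainty explicit. A minimal sketch for the K-armed, context-free case (the toy CTR values are illustrative):

```python
import math
import random

def ucb1(pull, K, rounds, seed=0):
    """UCB1: play the arm maximizing empirical mean + sqrt(2 ln t / n_a)."""
    random.seed(seed)
    totals, counts = [0.0] * K, [0] * K
    reward = 0.0
    for t in range(rounds):
        if t < K:
            arm = t  # play each arm once to initialize its estimate
        else:
            arm = max(range(K), key=lambda a: totals[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull(arm)
        totals[arm] += r
        counts[arm] += 1
        reward += r
    return reward / rounds, counts

# Two articles with true CTRs 0.05 and 0.15; clicks are Bernoulli draws.
ctrs = [0.05, 0.15]
avg, counts = ucb1(lambda a: 1.0 if random.random() < ctrs[a] else 0.0, 2, 50000)
print(avg, counts)  # the higher-CTR arm receives most impressions
```

The confidence term shrinks as an arm accumulates impressions, which is exactly the “more displays, less uncertainty” observation on the slide.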

16
Contextual Bandit Algorithms
EXP4 [ACFS’02], EXP4.P [BLLRS’11], elimination [ADKLS’12]: strong theoretical guarantees, but computationally expensive.
Epoch-greedy [LZ’08]: similar to ε-greedy; simple, general, and less expensive, but not the most effective.
This talk: algorithms with compact, parametric models that are both efficient and effective:
› extension of UCB1 to linear models … and to generalized linear models;
› a randomized algorithm based on Thompson sampling.

17
LinUCB: UCB for Linear Models
Linear model assumption: E[r_{t,a} | x_t] = x_t^T θ_a.
Standard least-squares ridge regression: θ̂_a = (D_a^T D_a + I)^{-1} D_a^T c_a, where D_a stacks the contexts in which a was shown and c_a their observed rewards.
Reward prediction for a new user: x_t^T θ̂_a, with prediction error at most α sqrt(x_t^T (D_a^T D_a + I)^{-1} x_t) with high probability.
Whether to explore requires quantifying this parameter uncertainty.

18
LinUCB: UCB for Linear Models (II)
Recall A_a = D_a^T D_a + I. LinUCB always selects an arm with the highest UCB:
a_t = argmax_a [ x_t^T θ̂_a (to exploit) + α sqrt(x_t^T A_a^{-1} x_t) (to explore) ]
LinRel [Auer 2002] works similarly, but in a more complicated way.
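The disjoint-model form of this rule fits in a few lines of numpy. A sketch (the toy environment, α value, and class name are illustrative):

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm; alpha tunes exploration."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # D_a^T D_a + I
        self.b = [np.zeros(d) for _ in range(n_arms)]  # D_a^T c_a
    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))
    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy run: arm 0 is better when feature 0 is high, arm 1 when feature 1 is high.
rng = np.random.default_rng(0)
theta_true = [np.array([0.8, 0.1]), np.array([0.1, 0.8])]
bandit = LinUCB(n_arms=2, d=2, alpha=0.5)
correct = 0
for t in range(2000):
    x = rng.random(2)
    arm = bandit.select(x)
    reward = theta_true[arm] @ x + 0.1 * rng.normal()
    bandit.update(arm, x, reward)
    correct += int(arm == int(np.argmax([th @ x for th in theta_true])))
print(correct / 2000)  # fraction of rounds the better arm was chosen
```

For clarity this inverts A_a on every call; a production version would maintain A_a^{-1} incrementally (e.g., via Sherman–Morrison updates).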

19
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced Algorithms
Advanced Offline Evaluation
Conclusions

20
Evaluation of Bandit Algorithms
Goal: estimate the average reward of running a policy (static or adaptive) on i.i.d. contexts x.
Gold standard: run it in the real system and see how well it works … but that is expensive and risky.

21
Offline Evaluation
Benefits: cheap and risk-free; avoids frequent bucket tests; replicable, fair comparisons.
Offline evaluation is common in non-interactive learning problems (e.g., classification), where benchmark data are organized as (input, label) pairs.
… but it is not straightforward for interactive learning problems: bandit data usually consist of (context, arm, reward) triples, with no reward signal for any arm′ ≠ arm.

22
Common/Prior Evaluation Approaches
Build an explicit user model (by classification, regression, or density estimation), then evaluate the bandit algorithm against it; this (difficult) modeling step is often biased, which makes the evaluation unreliable.
In contrast, our approach avoids explicit user modeling: it is simple and gives unbiased, reliable evaluation results.

23
Our Evaluation Method: “Replay”
Replay the logged events to the bandit algorithm: for each logged (context, arm, reward) triple, ask the algorithm which arm it would choose; if its choice matches the logged arm, reveal the reward and count the event, otherwise skip the event.
Key requirement for data collection: the logged arms must be chosen uniformly at random.
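A minimal sketch of the replay idea (the log format and policy signature are illustrative assumptions; a static policy is evaluated here, so no model update is needed):

```python
import random

def replay_evaluate(select, log):
    """Replay estimator: keep only events where the policy's choice matches the
    logged (uniformly random) arm; average the rewards of those events."""
    total, matched = 0.0, 0
    for context, arms, logged_arm, reward in log:
        if select(context, arms) == logged_arm:
            total += reward
            matched += 1
    return total / matched if matched else 0.0

# Synthetic log: arms chosen uniformly at random; arm 1 clicks 30%, arm 0 clicks 10%.
random.seed(0)
log = []
for _ in range(20000):
    arm = random.randrange(2)
    ctr = 0.3 if arm == 1 else 0.1
    log.append((None, [0, 1], arm, 1.0 if random.random() < ctr else 0.0))

always_1 = lambda context, arms: 1
print(replay_evaluate(always_1, log))  # close to 0.3, the true CTR of arm 1
```

Because the logger was uniform, the matched subset is an unbiased sample of what the target policy would have faced online; an adaptive policy would additionally call its update step on each matched event.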

24
Theoretical Guarantees
Thm 1: Our estimator is unbiased: mathematically, its expectation equals the policy's true value, so on average it reflects real, online performance.
Thm 2: The estimation error goes to 0 with more data, so accuracy is guaranteed with a large volume of data.
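In symbols, a plausible reconstruction of the two statements (the slide's formulas did not survive extraction; the notation V̂ for the replay estimate, V(π) for the true policy value, and L for the number of matched events is assumed here):

```latex
\text{Thm 1 (unbiasedness):}\qquad
\mathbb{E}\big[\hat{V}(\pi)\big] \;=\; V(\pi) \;:=\; \mathbb{E}_{x}\big[r_{\pi(x)}\big]

\text{Thm 2 (convergence):}\qquad
\big|\hat{V}(\pi) - V(\pi)\big| \;=\; O\!\big(1/\sqrt{L}\big)\ \text{with high probability}
```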

25
Case Study in Today Module [LCLW’11]
Data:
› large volume of real user traffic in Today Module
Policies being evaluated:
› EMP [ACE’09]
› SEMP/CEMP: personalized EMP variants
› use the policies’ online bucket CTR as “truth”
Random-bucket data for evaluation:
› 40M visits, K ≈ 20 on average
› use it to offline-evaluate the policies’ CTR
Are the offline estimates close to the online truth?

26
Unbiasedness (Article nCTR)
[Scatter plot: estimated nCTR vs. recorded online nCTR, per article]
The offline estimate is indeed unbiased!

27
Unbiasedness (Daily nCTR)
[Plot: recorded online nCTR vs. estimated nCTR over ten days in November 2009]
The offline estimate is indeed unbiased!

28
Estimation Error
[Plot: nCTR estimation error decreasing with the number of data L, consistent with the theoretical error bound of Thm 2]

29
Unbiased Offline Evaluation: Recap
What we have shown:
› a principled method for benchmark data collection
› which allows reliable, unbiased evaluation
› of any bandit algorithm
Analogue: the UCI, Caltech101, … datasets for supervised learning.
The first such benchmark was released by Yahoo!: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
2nd and 3rd versions are available for the PASCAL2 Challenge (ICML 2012 workshop).

30
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced Algorithms
Advanced Offline Evaluation
Conclusions

31
Experiment Setup: Architecture
Traffic is split into a 5% “Learning Bucket” (where explore/exploit happens) and a 95% “Deployment Bucket” (exploitation only).
Model updated every 5 minutes.
Main metric: overall normalized CTR in the deployment bucket, nCTR = CTR × secretNumber (to protect sensitive business information).

32
Experiment Setup: Data
May 1, 2009 data for parameter tuning; May 3–9, 2009 data for performance evaluation (33M visits).
About 20 candidate articles per user visit.
Dimension reduction on user features [CBP+’09] ⇒ 6 features.
Data available from Yahoo! Research’s Webscope program: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

33
CTR in Deployment Bucket [LCLS’10]
[Bar chart, including a “cheating” (no-feature) policy as a reference]
UCB-type algorithms do better than their ε-greedy counterparts.
CTR improved significantly when features/contexts are considered.

34
Article CTR Lift
[Plot: per-article CTR lift; no-context vs. linear-model algorithms, “+” marking ε-greedy and “o” marking UCB]

35
Outline
Introduction
Basic Solutions
Advanced Algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions

36
LinUCB for Hybrid Linear Models
Hybrid model: information shared by all articles (e.g., teens like articles about Harry Potter) plus article-specific information (e.g., Californian males like this particular article).
Advantage: learns faster when there is little data.
Challenge: seems to require unbounded computational complexity.
Good news: an efficient implementation is made possible by block-matrix manipulations.

37
Overall CTR in Deployment Bucket
[Bar chart highlighting the advantage of the hybrid model]
UCB-type algorithms do better than their ε-greedy counterparts.
CTR improved significantly when features/contexts are considered.
The hybrid model is better when data are scarce.

38
Outline
Introduction
Basic Solutions
Advanced Algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions

39
Extensions to GLMs
Linear models are unnatural for binary events (click / no click).
Generalized linear models (GLMs): E[r | x] = g(x^T θ), where g is the “inverse link function”.
Logistic regression: g(z) = 1 / (1 + e^{-z}), the logistic function.
Probit regression: g(z) = Φ(z), where Φ is the CDF of the standard Gaussian.
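The two inverse links can be written out directly. A small sketch (function names are illustrative; symbols follow the slide):

```python
import math

def logistic(z):
    """Logistic inverse link: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """Probit inverse link: CDF of the standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both squash the linear score x^T theta into a valid click probability.
for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic(z), 3), round(probit(z), 3))
```

Both links agree at z = 0 (probability 0.5) but the probit saturates faster in the tails, which is why the two models can behave differently on extreme scores.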

40
Model Fitting in GLMs
Maintain a Gaussian (Bayesian) posterior over the parameter θ_a: N(μ_a, Σ_a).
On new data (x, r), apply Bayes’ formula: new posterior ∝ likelihood × current posterior.
The GLM likelihood makes the exact posterior intractable, so a Laplace approximation keeps it Gaussian.
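The fitting step can be sketched as online Bayesian logistic regression with a diagonal Gaussian posterior. This simplified one-Newton-step-per-example version is an illustration of the Laplace-approximation idea, not the exact algorithm from the slides:

```python
import math
import random

class OnlineBayesLogit:
    """Online Bayesian logistic regression with a diagonal Gaussian posterior;
    each example triggers one Newton step plus a Laplace precision update."""
    def __init__(self, d, prior_var=1.0):
        self.m = [0.0] * d               # posterior means
        self.q = [1.0 / prior_var] * d   # posterior precisions (1 / variance)
    def predict(self, x):
        z = sum(mi * xi for mi, xi in zip(self.m, x))
        return 1.0 / (1.0 + math.exp(-z))
    def update(self, x, r):              # r in {0, 1}
        p = self.predict(x)
        for i, xi in enumerate(x):
            grad = (p - r) * xi                      # gradient of the log-loss
            hess = self.q[i] + p * (1.0 - p) * xi * xi
            self.m[i] -= grad / hess                 # one Newton step toward the MAP
            self.q[i] += p * (1.0 - p) * xi * xi     # Laplace update of the precision

# Fit on synthetic clicks generated from a known weight vector.
random.seed(0)
true_w = [1.5, -1.0]
model = OnlineBayesLogit(d=2)
for _ in range(5000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    z = sum(w * xi for w, xi in zip(true_w, x))
    r = 1 if random.random() < 1.0 / (1.0 + math.exp(-z)) else 0
    model.update(x, r)
print(model.m)  # moves toward true_w = [1.5, -1.0]
```

The growing precisions q_i play the role of Σ_a^{-1}: step sizes shrink as evidence accumulates, and the (μ, Σ) pair is exactly what the UCB and Thompson-sampling heuristics on the following slides consume.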

41
UCB Heuristics for GLMs
Use the posterior N(μ_a, Σ_a) to derive (approximate) upper confidence bounds [LCLMW’12].

42
Experiment Setup
One week of data from June 2009 (34M user visits); about 20 candidate articles per user visit.
Features: 20 features by PCA on raw binary user features.
Model updated every 5 minutes.
Main metric: overall (normalized) CTR in the deployment bucket.
Traffic split as before: a 5% “Learning Bucket” (where explore/exploit happens) and a 95% “Deployment Bucket” (exploitation only).

43
GLM Comparisons
[Bar chart: linear / logistic / probit models under ε-greedy vs. UCB exploration]
Obs #1: active exploration is necessary.
Obs #2: logistic/probit > linear.
Obs #3: UCB > ε-greedy.

44
Outline
Introduction
Basic Solutions
Advanced Algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions

45
Limitations of UCB Exploration
Exploration can be too much: it may explore the whole space exhaustively, and it is difficult to use prior knowledge.
Exploration is deterministic: poor performance when rewards are delayed.
Deriving an (approximate) UCB is not always easy.

46
Thompson Sampling (1933)
Algorithmic idea: “probability matching”: Pr(a | x) = Pr(a is optimal for x).
Randomized action selection (by definition), hence more robust to reward delay.
Straightforward to implement [CL’12]:
› maintain the parameter posterior;
› draw random models from it;
› act greedily according to the drawn models.
Easily combined with other (non-)parametric models.
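The slides use logistic regression with Gaussian posteriors; a minimal context-free Beta-Bernoulli variant (toy CTRs and function names are illustrative) shows the maintain / draw / act loop:

```python
import random

def thompson_bernoulli(pull, K, rounds, seed=0):
    """Thompson sampling for Bernoulli rewards with Beta(1, 1) priors:
    sample a CTR for each arm from its posterior, then play the argmax."""
    random.seed(seed)
    wins, losses = [1] * K, [1] * K   # Beta posterior parameters per arm
    counts = [0] * K
    for _ in range(rounds):
        samples = [random.betavariate(wins[a], losses[a]) for a in range(K)]
        arm = max(range(K), key=lambda a: samples[a])
        if pull(arm):
            wins[arm] += 1
        else:
            losses[arm] += 1
        counts[arm] += 1
    return counts

ctrs = [0.05, 0.15]
counts = thompson_bernoulli(lambda a: random.random() < ctrs[a], 2, 20000)
print(counts)  # the higher-CTR arm dominates
```

The action choice is random by construction (it depends on the posterior draws), which is exactly the property that makes TS robust to delayed reward feedback.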

47
Thompson Sampling
One week of data from the Today Module on Yahoo!’s front page; logistic regression with Gaussian posteriors.
Obs #1: TS is uniformly competitive.
Obs #2: TS is more robust to reward delay.

48
Outline
Introduction
Basic Solutions
Advanced Algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions

49
Regret-based Competitive Analysis
Regret after T rounds = (total reward of the best policy, i.e., the best we could do if we knew everything in advance) − (total reward achieved by the algorithm).
An algorithm “learns” if its per-round regret vanishes as T grows; it “learns fast” if the regret is small.
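A standard way to write this definition (the slide's formulas did not survive extraction; π* denoting the optimal policy is an assumed notation):

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} r_{t,\pi^*(x_t)} \;-\; \sum_{t=1}^{T} r_{t,a_t},
\qquad
\text{``learns''} \;\iff\; \lim_{T\to\infty} \frac{\mathrm{Regret}_T}{T} = 0
```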

50
Regret Bounds
LinUCB [CLRS’11]: a variant achieves near-optimal regret, with a matching lower bound.
Generalized LinUCB: still open in general; a variant [FCGSz’11] has a proven regret bound.
Thompson sampling: a variant [L’12] has a proven regret bound.

51
Outline
Introduction
Basic Solutions
Advanced Algorithms
Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique
Conclusions

52
Extensions
Uniformly random data are sometimes a luxury … (system/cost constraints, user-experience considerations, …).
A randomized log suffices, via importance weighting [SLLK’11].
Variance reduction with the “doubly robust” technique [DLL’11].
Better bias/variance tradeoff by soft rejection sampling [DDLL’12].

53
Offline Evaluation with Non-Uniform Data
Key idea: importance reweighting: weight each logged event by 1 / p(a | x), the probability that the logging policy chose arm a in context x.
Can use a weighted empirical average with an estimated p(a | x); a threshold on the estimated propensity controls the bias/variance trade-off [SLLK’11].
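A minimal sketch of the importance-weighted (inverse-propensity) estimator. The log format, with a recorded propensity per event, and the function names are illustrative assumptions:

```python
import random

def ips_estimate(policy_prob, log):
    """Importance-weighted estimate of a target policy's value from a
    non-uniform log; each event carries the logging propensity p(a|x)."""
    total = 0.0
    for x, a, reward, logging_prob in log:
        # Weight = target policy's probability of a / logging probability of a.
        total += reward * policy_prob(x, a) / logging_prob
    return total / len(log)

# Log from a biased logger: arm 1 shown 80% of the time; true CTRs 0.1 / 0.3.
random.seed(0)
ctrs = [0.1, 0.3]
log = []
for _ in range(50000):
    a = 1 if random.random() < 0.8 else 0
    r = 1.0 if random.random() < ctrs[a] else 0.0
    log.append((None, a, r, 0.8 if a == 1 else 0.2))

# Evaluate the deterministic policy "always play arm 0".
est = ips_estimate(lambda x, a: 1.0 if a == 0 else 0.0, log)
print(est)  # close to 0.1, arm 0's true CTR, even though arm 0 was rarely logged
```

The reweighting removes the logger's bias, but the 1/0.2 weights hint at the variance problem the next slides address: the smaller p(a | x), the larger the weight.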

54
Results in Today Module Data [SLLK’11]

55
Outline
Introduction
Basic Solutions
Advanced Algorithms
Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique
Conclusions

56
Doubly Robust Estimation
The importance-weighted estimate has high variance when p(a | x) is small.
Doubly robust technique: combine a reward model with the importance-weighted correction; the estimate stays unbiased if either the reward model or the propensity p(a | x) is correct.
The DR estimate usually decreases variance [DLL’11].
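A minimal sketch of the DR combination: a (possibly wrong) reward model plus the reweighted residual on the logged arm. The log format and names are illustrative assumptions:

```python
import random

def dr_estimate(policy_prob, reward_model, log):
    """Doubly robust estimate of a target policy's value from a biased log."""
    total = 0.0
    for x, arms, a, reward, logging_prob in log:
        # Model-based value of the target policy over all arms...
        direct = sum(policy_prob(x, b) * reward_model(x, b) for b in arms)
        # ...plus the importance-weighted residual on the logged arm.
        correction = policy_prob(x, a) * (reward - reward_model(x, a)) / logging_prob
        total += direct + correction
    return total / len(log)

random.seed(0)
ctrs = {0: 0.1, 1: 0.3}
log = []
for _ in range(50000):
    a = 1 if random.random() < 0.8 else 0        # biased logger: arm 1 shown 80%
    r = 1.0 if random.random() < ctrs[a] else 0.0
    log.append((None, [0, 1], a, r, 0.8 if a == 1 else 0.2))

target = lambda x, a: 1.0 if a == 0 else 0.0     # policy "always arm 0"
crude_model = lambda x, a: 0.15                  # deliberately rough reward model
est = dr_estimate(target, crude_model, log)
print(est)  # close to 0.1 despite the rough model, thanks to the correction
```

When the reward model is accurate the residuals are small, so the high-variance importance weights multiply near-zero terms; that is the variance-reduction mechanism.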

57
Multiclass Classification
K-class classification can be cast as a K-armed bandit, with a loss matrix whose (i, j) entry is r_ij.
Training data:
› in the usual (non-bandit) setting, the full loss row is observed for every example;
› in the bandit setting, only the loss of the chosen class (arm) is observed.

58
Experimental Results on UCI Datasets
Split data 50/50 into training (fully labeled) and test (partially labeled) sets.
Train on the training data, evaluate on the test data; repeated 500 times.

59
Outline
Introduction
Basic Solutions
Advanced Algorithms
Advanced Offline Evaluation
Conclusions

60
Conclusions
Contextual bandits are a principled formulation for news article recommendation, Internet advertising, web search, …
An offline evaluation method for bandit algorithms: unbiased and accurate when compared against online bucket results.
Encouraging results in significant applications: strong performance of UCB/TS exploration.

61
Future Work
Offline evaluation:
› better use of non-uniform data
› extension to full reinforcement learning
› use of prior knowledge
Variants of bandits:
› bandits with budgets
› bandits with many arms
› bandits with multiple objectives
› bandits with submodular rewards
› bandits with delayed reward observations
› …

62
References
Offline policy evaluation:
[LCLW] Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. WSDM, 2011.
[SLLK] Learning from logged implicit exploration data. NIPS, 2010.
[DLL] Doubly robust policy evaluation and learning. ICML, 2011.
[DDLL] Sample-efficient nonstationary-policy evaluation for contextual bandits. Under review.
Bandit algorithms:
[LCLS] A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
[CLRS] Contextual bandits with linear payoff functions. AISTATS, 2011.
[BLLRS] Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2011.
[CL] An empirical evaluation of Thompson sampling. NIPS, 2011.
[LCLMW] Unbiased offline evaluation of contextual bandit algorithms with generalized linear models. JMLR W&CP, 2012.

