Online learning, minimizing regret, and combining expert advice

Slides:



Advertisements
Similar presentations
Ulams Game and Universal Communications Using Feedback Ofer Shayevitz June 2006.
Advertisements

Primal Dual Combinatorial Algorithms Qihui Zhu May 11, 2009.
New Horizons in Machine Learning Avrim Blum CMU This is mostly a survey, but last part is joint work with Nina Balcan and Santosh Vempala [Workshop on.
Online Learning for Online Pricing Problems Maria Florina Balcan.
How to Schedule a Cascade in an Arbitrary Graph F. Chierchetti, J. Kleinberg, A. Panconesi February 2012 Presented by Emrah Cem 7301 – Advances in Social.
MS 101: Algorithms Instructor Neelima Gupta
Lecturer: Moni Naor Algorithmic Game Theory Uri Feige Robi Krauthgamer Moni Naor Lecture 8: Regret Minimization.
Study Group Randomized Algorithms 21 st June 03. Topics Covered Game Tree Evaluation –its expected run time is better than the worst- case complexity.
Regret Minimizing Audits: A Learning-theoretic Basis for Privacy Protection Jeremiah Blocki, Nicolas Christin, Anupam Datta, Arunesh Sinha Carnegie Mellon.
Making Decisions and Predictions
Follow the regularized leader
Online Learning Avrim Blum Carnegie Mellon University Your guide: [Machine Learning Summer School 2012]
Machine Learning Theory Machine Learning Theory Maria Florina Balcan 04/29/10 Plan for today: - problem of “combining expert advice” - course retrospective.
Boosting Approach to ML
Complexity ©D Moshkovitz 1 Approximation Algorithms Is Close Enough Good Enough?
Games of Prediction or Things get simpler as Yoav Freund Banter Inc.
Quick Sort, Shell Sort, Counting Sort, Radix Sort AND Bucket Sort
Machine Learning Week 2 Lecture 1.
A New Understanding of Prediction Markets via No-Regret Learning.
Maria-Florina Balcan Approximation Algorithms and Online Mechanisms for Item Pricing Maria-Florina Balcan & Avrim Blum CMU, CSD.
Randomized Algorithms and Randomized Rounding Lecture 21: April 13 G n 2 leaves
AWESOME: A General Multiagent Learning Algorithm that Converges in Self- Play and Learns a Best Response Against Stationary Opponents Vincent Conitzer.
Randomness in Computation and Communication Part 1: Randomized algorithms Lap Chi Lau CSE CUHK.
Lecture 20: April 12 Introduction to Randomized Algorithms and the Probabilistic Method.
DAST 2005 Week 4 – Some Helpful Material Randomized Quick Sort & Lower bound & General remarks…
Skip Lists1 Skip Lists William Pugh: ” Skip Lists: A Probabilistic Alternative to Balanced Trees ”, 1990  S0S0 S1S1 S2S2 S3S3 
Machine Learning Rob Schapire Princeton Avrim Blum Carnegie Mellon Tommi Jaakkola MIT.
Experts and Boosting Algorithms. Experts: Motivation Given a set of experts –No prior information –No consistent behavior –Goal: Predict as the best expert.
An Intro to Game Theory Avrim Blum 12/07/04.
Experts Learning and The Minimax Theorem for Zero-Sum Games Maria Florina Balcan December 8th 2011.
Online Learning Algorithms
CPS Learning in games Vincent Conitzer
Estimation and Hypothesis Testing. The Investment Decision What would you like to know? What will be the return on my investment? Not possible PDF for.
Online Oblivious Routing Nikhil Bansal, Avrim Blum, Shuchi Chawla & Adam Meyerson Carnegie Mellon University 6/7/2003.
The Multiplicative Weights Update Method Based on Arora, Hazan & Kale (2005) Mashor Housh Oded Cats Advanced simulation methods Prof. Rubinstein.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
Nash equilibrium Nash equilibrium is defined in terms of strategies, not payoffs Every player is best responding simultaneously (everyone optimizes) This.
Evaluating What’s Been Learned. Cross-Validation Foundation is a simple idea – “ holdout ” – holds out a certain amount for testing and uses rest for.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of.
Connections between Learning Theory, Game Theory, and Optimization Maria Florina (Nina) Balcan Lecture 14, October 7 th 2010.
Games. Adversaries Consider the process of reasoning when an adversary is trying to defeat our efforts In game playing situations one searches down the.
Online Learning Yiling Chen. Machine Learning Use past observations to automatically learn to make better predictions or decisions in the future A large.
Diversification of Parameter Uncertainty Eric R. Ulm Georgia State University.
Issues on the border of economics and computation נושאים בגבול כלכלה וחישוב Speaker: Dr. Michael Schapira Topic: Dynamics in Games (Part III) (Some slides.
Maria-Florina Balcan Active Learning Maria Florina Balcan Lecture 26th.
Decision theory under uncertainty
Part 3 Linear Programming
Online Learning Rong Jin. Batch Learning Given a collection of training examples D Learning a classification model from D What if training examples are.
Connections between Learning Theory, Game Theory, and Optimization Maria Florina (Nina) Balcan Lecture 2, August 26 th 2010.
Shall we play a game? Game Theory and Computer Science Game Theory /06/05 - Zero-sum games - General-sum games.
Improved Equilibria via Public Service Advertising Maria-Florina Balcan TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.:
Experts and Multiplicative Weights slides from Avrim Blum.
/0604 © Business & Legal Reports, Inc. BLR’s Training Presentations Effective Decision-Making Strategies.
CS 8751 ML & KDDComputational Learning Theory1 Notions of interest: efficiency, accuracy, complexity Probably, Approximately Correct (PAC) Learning Agnostic.
1 Ch. 2: Getting Started. 2 About this lecture Study a few simple algorithms for sorting – Insertion Sort – Selection Sort (Exercise) – Merge Sort Show.
Instructor: Shengyu Zhang 1. Offline algorithms Almost all algorithms we encountered in this course assume that the entire input is given all at once.
Chapter 1 Algorithms with Numbers. Bases and Logs How many digits does it take to represent the number N >= 0 in base 2? With k digits the largest number.
On-Line Algorithms in Machine Learning By: WALEED ABDULWAHAB YAHYA AL-GOBI MUHAMMAD BURHAN HAFEZ KIM HYEONGCHEOL HE RUIDAN SHANG XINDI.
Online Learning Model. Motivation Many situations involve online repeated decision making in an uncertain environment. Deciding how to invest your money.
PROBABILITY AND COMPUTING RANDOMIZED ALGORITHMS AND PROBABILISTIC ANALYSIS CHAPTER 1 IWAMA and ITO Lab. M1 Sakaidani Hikaru 1.
Carnegie Mellon University
Introduction to Randomized Algorithms and the Probabilistic Method
Better Algorithms for Better Computers
CS 3343: Analysis of Algorithms
The Boosting Approach to Machine Learning
Static Optimality and Dynamic Search Optimality in Lists and Trees
The Boosting Approach to Machine Learning
The
The Weighted Majority Algorithm
Presentation transcript:

Online learning, minimizing regret, and combining expert advice December 6th 2011 Webpage of the class, schedule Maria Florina Balcan

Online learning, minimizing regret, and combining expert advice. “The weighted majority algorithm” N. Littlestone & M. Warmuth “Online Algorithms in Machine Learning” (survey) A. Blum 2

Motivation Many situations involve online repeated decision making in an uncertain environment. Deciding how to invest your money (buy or sell stocks) What route to drive to work each day Playing repeatedly a game against an opponent with unknown strategy We will study: Learning algos for such settings with connections to game theoretic notions of equilibria

Online learning, minimizing regret, and combining expert advice. 4

Using “expert” advice Assume we want to predict the stock market. We solicit n “experts” for their advice. Will the market go up or down? We then want to use their advice somehow to make our prediction. E.g., Can we do nearly as well as best in hindsight? Note: “expert” ´ someone with an opinion. [Not necessairly someone who knows anything.]

Formal model Can we do nearly as well as best in hindsight? There are n experts. For each round t=1,2, …, T Each expert makes a prediction in {0,1} The learner (using experts’ predictions) makes a prediction in {0,1} The learner observes the actual outcome. There is a mistake if the predicted outcome is different form the actual outcome. The learner gets to update his hypothesis. Can we do nearly as well as best in hindsight?

Formal model There are n experts. For each round t=1,2, …, T Each expert makes a prediction in {0,1} The learner (using experts’ predictions) makes a prediction in {0,1} The learner observes the actual outcome. There is a mistake if the predicted outcome is different form the actual outcome. Can we do nearly as well as best in hindsight? We are not given any other info besides the yes/no bits produced by the experts. We make no assumptions about the quality or independence of the experts. We cannot hope to achieve an absolute level of quality in our predictions.

Simpler question We have n “experts”. One of these is perfect (never makes a mistake). We don’t know which one. A strategy that makes no more than lg(n) mistakes?

Halving algorithm Take majority vote over all experts that have been correct so far. I.e., if # surviving experts predicting 1 > # surviving experts predicting 0, then predict 1; else predict 0. Claim: If one of the experts is perfect, then at most lg(n) mistakes. Proof: Each mistake cuts # surviving by factor of 2, so we make · lg(n) mistakes. Note: this means ok for n to be very large.

Using “expert” advice If one expert is perfect, get · lg(n) mistakes with halving algorithm. But what if none is perfect? Can we do nearly as well as the best one in hindsight?

Using “expert” advice Strategy #1: Iterated halving algorithm. Same as before, but once we've crossed off all the experts, restart from the beginning. Makes at most log(n)*[OPT+1] mistakes, where OPT is #mistakes of the best expert in hindsight. Divide the whole history into epochs. Beginning of an epoch is when we restart Halving; end of an epoch is when we have crossed off all the available experts. At the end of an epoch we have crossed all the experts, so every single expert must make a mistake. So, the best expert must have made a mistake. We make at most log n mistakes per epoch. If OPT=0 we get the previous guarantee.

Using “expert” advice Strategy #1: Iterated halving algorithm. Same as before, but once we've crossed off all the experts, restart from the beginning. Makes at most log(n)*[OPT+1] mistakes, where OPT is #mistakes of the best expert in hindsight. Wasteful. Constantly forgetting what we've “learned”. Can we do better?

Weighted Majority Algorithm Key Point: A mistake doesn't completely disqualify an expert. Instead of crossing off, just lower its weight. Weighted Majority Algorithm Start with all experts having weight 1. Predict based on weighted majority vote. If then predict 1 else predict 0

Weighted Majority Algorithm Key Point: A mistake doesn't completely disqualify an expert. Instead of crossing off, just lower its weight. Weighted Majority Algorithm Start with all experts having weight 1. Predict based on weighted majority vote. Penalize mistakes by cutting weight in half.

Analysis: do nearly as well as best expert in hindsight Theorem: If M = # mistakes we've made so far and OPT = # mistakes best expert has made so far, then:

Analysis: do nearly as well as best expert in hindsight Theorem: If M = # mistakes we've made so far and OPT = # mistakes best expert has made so far, then: Proof: Analyze W = total weight (starts at n). After each mistake, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)M. Weight of best expert is (1/2)OPT. So, constant ratio

Randomized Weighted Majority 2.4(OPT + lg n) not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes. Instead of taking majority vote, use weights as probabilities & predict each outcome with prob. ~ to its weight. (e.g., if 70% on up, 30% on down, then pick 70:30) Key Point: smooth out the worst case.

Randomized Weighted Majority 2.4(OPT + lg n) not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes. Instead of taking majority vote, use weights as probabilities. (e.g., if 70% on up, 30% on down, then pick 70:30) Also, generalize ½ to 1- e. Equivalent to select an expert with probability proportional with its weight.

Randomized Weighted Majority

Formal Guarantee for Randomized Weighted Majority If M = expected # mistakes we've made so far and OPT = # mistakes best expert has made so far, then: Theorem: M · (1+e)OPT + (1/e) log(n)

Analysis Key idea: if the algo has significant expected loss, then the total weight must drop substantially. Say at time t we have fraction Ft of weight on experts that made mistake. i.e., For all t, Ft is our expected loss at time t; probability we make a mistake at time t.

Analysis Say at time t we have fraction Ft of weight on experts that made mistake. So, we have probability Ft of making a mistake, and we remove an eFt fraction of the total weight. Wfinal = n(1-e F1)(1 - e F2)… ln(Wfinal) = ln(n) + åt [ln(1 - e Ft)] · ln(n) - e åt Ft (using ln(1-x) < -x) = ln(n) - e M. (å Ft = E[# mistakes]) If best expert makes OPT mistakes, ln(Wfinal) > ln((1-e)OPT). Now solve: ln(n) - e M > OPT ln(1-e).

Randomized Weighted Majority Solves to:

Summarizing E[# mistakes] · (1+e)OPT + e-1log(n). If set =(log(n)/OPT)1/2 to balance the two terms out (or use guess-and-double), get bound of E[mistakes]·OPT+2(OPT¢log n)1/2 Note: Of course we might not know OPT, so if running T time steps, since OPT · T, set ² to get additive loss (2T log n)1/2 E[mistakes]·OPT+2(T¢log n)1/2 regret So, regret/T ! 0. [no regret algorithm]

What if have n options, not n predictors? We’re not combining n experts, we’re choosing one. Can we still do it? Nice feature of RWM: can be applied when experts are n different options E.g., n different ways to drive to work each day, n different ways to invest our money.

Randomized Weighted Majority Note: Did not see the predictions to select an expert (only needed to see their losses to update our weights)

What if have n options, not n predictors? We’re not combining n experts, we’re choosing one. Can we still do it? Nice feature of RWM: can be applied when experts are n different options E.g., n different ways to drive to work each day, n different ways to invest our money. We did not see the predictions in order to select an expert (only needed to see their losses to update our weights)

Decision Theoretic Version; Formal model There are n experts. For each round t=1,2, …, T No predictions. The learner produces a prob distr. on experts based on their past performance pt. The learner is given a loss vector lt and incurs expected loss lt ¢ pt. The learner updates the weights. The guarantee also applies to this model!!! [Interesting for connections between GT and Learning.]

Can generalize to losses in [0,1] If expert i has loss li, do: wi à wi(1-li). [before if an expert had a loss of 1, we multiplied by (1-epsilon), if it had loss of 0 we left it alone, now we do linearly in between] Same analysis as before.

Summarizing E[# mistakes] · (1+e)OPT + e-1log(n). If set =(log(n)/OPT)1/2 to balance the two terms out (or use guess-and-double), get bound of E[mistakes]·OPT+2(OPT¢log n)1/2 Note: Of course we might not know OPT, so if running T time steps, since OPT · T, set ² to get additive loss (2T log n)1/2 E[mistakes]·OPT+2(T¢log n)1/2 regret So, regret/T ! 0. [no regret algorithm]

Lower Bounds Consider T< log n. 9 a stochastic series of expert predictions and correct answers such that for any alg, we have E[#mistakes]=T/2 and yet the best expert makes 0 mistakes. At t=1, half the experts predict 0, half predict 1. Correct answer given by fair coin flip. At t=2, 3,… : of experts correct so far, half predict 0 and half predict 1. Correct answer given by fair coin flip. Any online algorithm incurs an expected loss of 1/2 at each time step. Yet, for T < log n there will always be some expert with 0 mistakes.

Lower Bounds Our second lower bound shows the dependence on sqrt(T) is needed. Consider n=2. Expert 1 always predicts 0. Expert 2 always predicts 1. Correct answer given by fair coin flip. Any online algorithm has 50% chance of predicting correctly at each time step, so expected loss of T/2. E(loss of best expert) = E(min(#heads, #tails)) = T/2 - \Theta(\sqrt(T))

Summary Can use to combine multiple algorithms to do nearly as well as best in hindsight. Can apply RWM in situations where experts are making choices that cannot be combined. E.g., repeated game-playing. E.g., online shortest path problem Extensions: “bandit” problem. efficient algs for some cases with many experts. Sleeping experts / “specialists” setting.