Machine Learning Theory
Maria Florina Balcan
04/29/10

Plan for today:
- problem of "combining expert advice"
- course retrospective and open questions

Using "expert" advice
Say we want to predict the stock market. We solicit n "experts" for their advice (will the market go up or down?) and then want to use their advice somehow to make our own prediction. Can we do nearly as well as the best expert in hindsight? ["expert" = someone with an opinion, not necessarily someone who knows anything.]

Simpler question
We have n "experts". One of these is perfect (never makes a mistake); we just don't know which one. Can we find a strategy that makes no more than lg(n) mistakes?
Answer: sure. Just take majority vote over all experts that have been correct so far (the "halving algorithm").
- Each mistake cuts the number of still-consistent experts by at least a factor of 2.
- Note: this means it is ok for n to be very large.
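A minimal Python sketch of the halving algorithm, just to make the bookkeeping concrete (the function and variable names are illustrative, not from the lecture):

```python
def run_halving(expert_predictions, truth):
    """Halving algorithm for binary prediction (labels 0/1).

    expert_predictions[t][i] = expert i's prediction at round t; truth[t] = true label.
    Assumes at least one expert is perfect, so the consistent set never empties.
    """
    n = len(expert_predictions[0])
    consistent = set(range(n))                    # experts with no mistakes so far
    mistakes = 0
    for preds, y in zip(expert_predictions, truth):
        votes = sum(preds[i] for i in consistent)
        guess = int(2 * votes >= len(consistent))  # majority vote of consistent experts
        if guess != y:
            mistakes += 1                          # such a mistake removes >= half of `consistent`
        consistent = {i for i in consistent if preds[i] == y}
    return mistakes
```

Each of our mistakes means at least half the consistent experts voted for the wrong label and get removed, so with a perfect expert present this can happen at most lg(n) times.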

Using "expert" advice
If one expert is perfect, we can get ≤ lg(n) mistakes with the halving algorithm. But what if none is perfect? Can we do nearly as well as the best one in hindsight?
Strategy #1: Iterated halving algorithm. Same as before, but once we've crossed off all the experts, restart from the beginning. Makes at most log(n)*(OPT+1) mistakes, where OPT is the number of mistakes of the best expert in hindsight.
This seems wasteful: we are constantly forgetting what we've "learned". Can we do better?
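The same sketch, adjusted for Strategy #1 (iterated halving): when every expert has been crossed off, restore the full pool and continue (again, names are illustrative):

```python
def run_iterated_halving(expert_predictions, truth):
    """Halving, but restart with all n experts whenever the pool empties."""
    n = len(expert_predictions[0])
    consistent = set(range(n))
    mistakes = 0
    for preds, y in zip(expert_predictions, truth):
        votes = sum(preds[i] for i in consistent)
        guess = int(2 * votes >= len(consistent))
        if guess != y:
            mistakes += 1
        consistent = {i for i in consistent if preds[i] == y}
        if not consistent:                         # everyone has erred: forget and restart
            consistent = set(range(n))
    return mistakes
```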

Weighted Majority Algorithm
Intuition: making a mistake doesn't completely disqualify an expert. So, instead of crossing off, just lower its weight.
Weighted Majority Algorithm:
– Start with all experts having weight 1.
– Predict based on weighted majority vote.
– Penalize mistakes by cutting the expert's weight in half.
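A short sketch of this update (illustrative names; `beta=0.5` corresponds to the "cut weight in half" penalty):

```python
def weighted_majority(expert_predictions, truth, beta=0.5):
    """Weighted Majority for binary prediction (labels 0/1)."""
    n = len(expert_predictions[0])
    w = [1.0] * n                                   # all experts start with weight 1
    mistakes = 0
    for preds, y in zip(expert_predictions, truth):
        vote_for_1 = sum(w[i] for i in range(n) if preds[i] == 1)
        vote_for_0 = sum(w[i] for i in range(n) if preds[i] == 0)
        guess = int(vote_for_1 >= vote_for_0)       # weighted majority vote
        if guess != y:
            mistakes += 1
        for i in range(n):
            if preds[i] != y:
                w[i] *= beta                        # penalize, but never fully disqualify
    return mistakes
```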

Analysis: do nearly as well as the best expert in hindsight
M = # mistakes we've made so far. m = # mistakes the best expert has made so far. W = total weight (starts at n).
After each of our mistakes, W drops by at least 25%, so after M mistakes W is at most n(3/4)^M. The weight of the best expert is (1/2)^m. So
(1/2)^m ≤ W ≤ n(3/4)^M,
and taking logs gives M ≤ (m + lg n)/lg(4/3) ≈ 2.4(m + lg n): a constant ratio relative to the best expert, plus an additive lg(n) term.

Randomized Weighted Majority
The bound 2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.
Instead of taking a majority vote, use the weights as probabilities (e.g., if 70% of the weight is on up and 30% on down, then predict up with probability 0.7 and down with probability 0.3). Idea: smooth out the worst case.
Also, generalize the penalty factor from 1/2 to (1-ε).
Unlike most worst-case bounds, the resulting numbers are pretty good.
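A hedged sketch of the randomized variant: the weights are normalized into a probability distribution, an expert is sampled from it, and erring experts are penalized by a factor (1-eps). Names and the default eps are illustrative:

```python
import random

def randomized_weighted_majority(expert_predictions, truth, eps=0.25, seed=0):
    """Randomized Weighted Majority for binary prediction (labels 0/1)."""
    rng = random.Random(seed)
    n = len(expert_predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_predictions, truth):
        i = rng.choices(range(n), weights=w)[0]     # use the weights as probabilities
        if preds[i] != y:
            mistakes += 1                           # one realized mistake this round
        for j in range(n):
            if preds[j] != y:
                w[j] *= (1.0 - eps)                 # generalize 1/2 to (1 - eps)
    return mistakes
```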

Analysis
Say at time t we have fraction F_t of the weight on experts that made a mistake. So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight.
– W_final = n(1-εF_1)(1-εF_2)...
– ln(W_final) = ln(n) + Σ_t ln(1-εF_t) ≤ ln(n) - ε Σ_t F_t (using ln(1-x) < -x) = ln(n) - εM, where M = Σ_t F_t = E[# mistakes].
If the best expert makes m mistakes, then ln(W_final) > ln((1-ε)^m). Now solve: ln(n) - εM > m·ln(1-ε).
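Carrying the solve through (a standard algebraic step, using ln(1/(1-ε)) ≤ ε + ε² for ε ≤ 1/2):

ln(n) - εM ≥ m·ln(1-ε)  ⟹  M ≤ [m·ln(1/(1-ε)) + ln(n)] / ε ≤ (1+ε)·m + ε^{-1}·ln(n),

which is the bound summarized on the next slide.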

Summarizing
E[# mistakes] is at most (1+ε) times the best expert's mistakes in hindsight, plus an additive ε^{-1}·log(n) term.
If we have a prior p over the experts, we can replace the additive term with ε^{-1}·log(1/p_i) [roughly the number of bits needed to describe expert i].
Often written in terms of additive loss: if running for T time steps, set epsilon to get additive loss (2T log n)^{1/2}.

What can we use this for?
Can use it to combine multiple algorithms and do nearly as well as the best in hindsight.
Can apply RWM in situations where the experts are making choices that cannot be combined:
– E.g., repeated game-playing.
– E.g., the online shortest path problem.
[OK if losses are in [0,1]: replace F_t with p_t · ℓ_t, and penalize expert i by (1-ε)^{loss(i)}.]
Extensions:
– the "bandit" problem.
– efficient algorithms for some cases with many experts.
– the sleeping experts / "specialists" setting.
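A minimal sketch of the bracketed extension to losses in [0,1]: at each round the expected loss is p_t · ℓ_t (the weighted average of the experts' losses), and expert i's weight is multiplied by (1-ε)^{loss(i)}. Function and variable names are illustrative:

```python
def rwm_general_losses(loss_rows, eps=0.1):
    """Multiplicative-weights update for per-round expert losses in [0, 1].

    loss_rows[t][i] = loss of expert i at round t.
    Returns the algorithm's total expected loss.
    """
    n = len(loss_rows[0])
    w = [1.0] * n
    expected_loss = 0.0
    for losses in loss_rows:
        total = sum(w)
        p = [wi / total for wi in w]                                 # weights as probabilities p_t
        expected_loss += sum(pi * li for pi, li in zip(p, losses))   # F_t -> p_t . l_t
        w = [wi * (1.0 - eps) ** li for wi, li in zip(w, losses)]    # penalize by (1-eps)^loss(i)
    return expected_loss
```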

Summary and Open Questions

Machine Learning: incredibly useful in many domains across computer science, engineering, and science. Examples: image classification, document categorization, speech recognition, branch prediction, protein classification, spam detection, fraud detection, computational advertising, etc.

Goals of Machine Learning Theory
Develop and analyze models to understand:
- what kinds of tasks we can hope to learn, and from what kind of data;
- what types of guarantees we might hope to achieve.
Prove guarantees for practically successful algorithms (when will they succeed, how long will they take?); develop new algorithms that provably meet desired criteria.

Example: Supervised Classification
Decide which emails are spam and which are important (spam vs. not spam).
Goal: use the emails seen so far to produce a good prediction rule for future data.

Two Main Aspects of Supervised Learning
Algorithm Design. How to optimize? Automatically generate rules that do well on observed data.
Confidence Bounds, Generalization Guarantees, Sample Complexity. Confidence for rule effectiveness on future data.
Well understood for passive supervised learning.

Other Protocols for Supervised Learning
- Semi-Supervised Learning: using cheap unlabeled data in addition to labeled data.
- Active Learning: the algorithm interactively asks for labels of informative examples.
- Learning with Membership Queries.
- Statistical Query Learning.
Theoretical understanding was severely lacking until a couple of years ago; lots of progress recently. We will cover some of these.

Topics we covered
- Simple algorithms and hardness results for supervised learning.
- Classic, state-of-the-art algorithms: AdaBoost and SVM.
- Basic models for supervised learning: PAC and SLT.
- Standard sample complexity results (VC dimension).
- Weak-learning vs. strong-learning.

Structure of the Class
- Classification noise and the Statistical-Query model.
- Incorporating unlabeled data in the learning process.
- Learning real-valued functions.
- Modern sample complexity results: Rademacher complexity.
- Margin analysis of Boosting and SVM.
- Incorporating interaction in the learning process: active learning, learning with membership queries.

Open Questions
- In the classic PAC model: learning decision trees, learning DNF, learning functions with a few relevant variables (the junta problem).
- Active learning and SSL: the right sample complexity quantities; interesting positive algorithmic results.
- Models and algorithms for exciting new paradigms: e.g., transfer learning, multi-agent learning, never-ending learning.