Machine Learning Algorithms in Computational Learning Theory


Machine Learning Algorithms in Computational Learning Theory TIAN HE JI GUAN WANG Shangxuan Xiangnan Kun Peiyong Hancheng http://en.wikipedia.org/wiki/AIBO 25th Jan 2013

Outline: Introduction; Probably Approximately Correct Framework (PAC): PAC Framework, Weak PAC-Learnability, Error Reduction; Mistake Bound Model of Learning: Mistake Bound Model, Predicting from Expert Advice, The Weighted Majority Algorithm, Online Learning from Examples, The Winnow Algorithm; PAC versus Mistake Bound Model; Conclusion; Q & A.

Machine Learning. A machine cannot learn on its own, but it can be trained.

Definition of Machine Learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." ---- Tom M. Mitchell. Algorithm types: Supervised learning: regression, labeling (classification). Unsupervised learning: clustering, data mining. Reinforcement learning: learning to act better from observations.

Machine Learning Other Examples Medical diagnosis Handwritten character recognition Customer segmentation (marketing) Document segmentation (classifying news) Spam filtering Weather prediction and climate tracking Gene prediction Face recognition

Computational Learning Theory: why does learning work? Under what conditions is successful learning possible or impossible? Under what conditions is a particular learning algorithm assured of learning successfully? We need particular settings (models): probably approximately correct (PAC) and mistake bound models.

Probably Approximately Correct Framework (PAC) PAC Learnability Weak PAC-Learnability Error Reduction Occam’s Razor

PAC Learning. Any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be wrong. Stationarity: the future is like the past. Concept: an efficiently computable function on a domain, e.g. a function {0,1}^n -> {0,1}. A concept class is a collection of concepts.

Learnability. Requirements for a PAC-learning algorithm ALG: ALG must, with arbitrarily high probability (1 - δ), output a hypothesis having arbitrarily low error (ε). ALG must do so efficiently, in time that grows at most polynomially with 1/δ and 1/ε.
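
A compact way to state these two requirements (a standard formulation, with n the instance size and size(c) the representation size of the target concept; this notation is supplied here, not copied from the slide):

```latex
\Pr_{S \sim D^m}\!\big[\, \mathrm{err}_D\!\big(\mathrm{ALG}(S)\big) \le \epsilon \,\big] \ge 1 - \delta,
\qquad
\text{runtime } \mathrm{poly}\!\left(\tfrac{1}{\epsilon},\ \tfrac{1}{\delta},\ n,\ \mathrm{size}(c)\right),
\quad\text{where } \mathrm{err}_D(h) = \Pr_{x \sim D}\big[h(x) \ne c(x)\big].
```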

PAC Learning for Decision Lists. A Decision List (DL) is a way of representing a certain class of functions over n-tuples. Example: if x4 = 1 then f(x) = 0, else if x2 = 1 then f(x) = 1, else f(x) = 0. (The slide also shows a table of example assignments to x1..x5 with the resulting f(x).) An upper bound on the number of all possible boolean decision lists on n variables is n! * 4^n, which is n^O(n).

PAC Learning for Decision Lists. Algorithm: a greedy approach (Rivest, 1987). 1. If the example set S is empty, halt. 2. Examine each term of length at most k until a term t is found such that all examples in S which make t true have the same label v. 3. Add (t, v) to the decision list and remove those examples from S. 4. Repeat steps 1-3. Clearly, the algorithm runs in polynomial time.
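
A minimal sketch of this greedy procedure for 1-decision lists (terms are single literals; the data format and helper names below are my own, not from the slides):

```python
def learn_decision_list(S):
    """Greedy decision-list learner (Rivest-style sketch).

    S: list of (x, y) pairs, where x is a tuple of 0/1 attribute values and
    y is the 0/1 label. Terms are single literals of the form x_i == b.
    Returns a list of ((i, b), v) rules, or None if no literal is ever "pure".
    """
    S = list(S)
    rules = []
    n = len(S[0][0]) if S else 0
    while S:
        found = False
        # Try every literal: attribute i tested against value b.
        for i in range(n):
            for b in (0, 1):
                covered = [(x, y) for (x, y) in S if x[i] == b]
                labels = {y for (_, y) in covered}
                if covered and len(labels) == 1:
                    v = labels.pop()
                    rules.append(((i, b), v))                    # rule: if x_i == b then v
                    S = [(x, y) for (x, y) in S if x[i] != b]    # remove covered examples
                    found = True
                    break
            if found:
                break
        if not found:
            return None   # no single literal is consistent on the remaining examples
    return rules

def predict(rules, x, default=0):
    for (i, b), v in rules:
        if x[i] == b:
            return v
    return default
```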

What does PAC do? It provides a supervised learning framework to classify data.

How can we use PAC? Use PAC as a general framework to guide efficient sampling for machine learning. Use PAC as a theoretical tool to distinguish hard problems from easy problems. Use PAC to evaluate the performance of some algorithms. Use PAC to solve some real problems.

What are we going to cover? Explore what PAC can learn. Apply PAC to real data with noise. Give a probabilistic analysis of the performance of PAC.

PAC Learning for Decision Lists. Algorithm: the greedy approach, illustrated on the slide with a table of examples over x1..x5 and their labels f(x).

Analysis of the Greedy Algorithm: the output and its performance guarantee.

PAC Learning for Decision Lists. 1. For a given sample S, partition the set of all concepts that agree with f on S into a "bad" set (true error greater than ε) and a "good" set (true error at most ε); we want the probability of outputting a bad concept to be at most δ. 2. Consider any bad hypothesis h: the probability that we pick S such that h agrees with f on all of S (so that h could be output) is at most (1 - ε)^|S|. 3. A union bound over all concepts bounds the failure probability by |C| (1 - ε)^|S|. 4. Putting it together: requiring |C| (1 - ε)^|S| ≤ δ is satisfied once |S| ≥ (1/ε)(ln |C| + ln(1/δ)).
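
A small numeric illustration of the resulting sample-size bound m ≥ (1/ε)(ln|C| + ln(1/δ)), using the slide's count of 1-decision lists over n variables (the function name and the chosen parameter values are mine):

```python
import math

def sample_bound(ln_C, eps, delta):
    """Sufficient sample size m >= (1/eps) * (ln|C| + ln(1/delta))."""
    return math.ceil((ln_C + math.log(1.0 / delta)) / eps)

# 1-decision lists over n boolean variables: roughly n! * 4^n concepts.
n = 20
ln_C = math.lgamma(n + 1) + n * math.log(4.0)   # ln(n!) + n*ln(4)
print(sample_bound(ln_C, eps=0.1, delta=0.05))   # roughly 730 examples for these settings
```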

The Limitation of PAC for DLs. What if the examples are like those in the table below, a labelling of x1, x2 by f(x) that the approach above cannot handle? (The table itself is a figure on the slide.)

Other Concept Classes. Decision trees: DTs of restricted size are not PAC-learnable, although those of arbitrary size are. AND-formulas: PAC-learnable. 3-CNF formulas: PAC-learnable. 3-term DNF formulas: it turns out that it is an NP-hard problem, given S, to come up with a 3-term DNF formula that is consistent with S. Therefore this concept class is not PAC-learnable, but only for now, as we shall soon revisit this class with a modified definition of PAC-learning.

Revised Definition for PAC Learning. Motivation: it is hard to design an algorithm that generates an arbitrarily accurate concept; achieving a probably correct, highly accurate hypothesis requires a large set of examples, while the running time of the algorithm must remain polynomial in the example set size |S|.

Weak PAC-Learnability. Benefits: it loosens the requirement for a highly accurate algorithm; it reduces the running time, since |S| can be smaller; it lets us find a "good enough" concept using a simple algorithm A.

Confidence Boosting Algorithm (given as a formula on the slide).

Boosting the Confidence (derivation given as formulas on the slide).

Boosting the Confidence: there is at least one h_i that does not satisfy the criterion above with probability ... (bound given as a formula on the slide).

Error Reduction by Boosting. The basic idea exploits the fact that you can learn a little on every distribution, and with more iterations we can reach a much lower error rate.

Error Reduction by Boosting. Detailed steps: 1. Some algorithm A produces a hypothesis that has an error probability of no more than p = 1/2 − γ (γ > 0). We would like to decrease this error probability to 1/2 − γ′ with γ′ > γ. 2. We invoke A three times, each time with a slightly different distribution, and get hypotheses h1, h2 and h3, respectively. 3. The final hypothesis then becomes h = Maj(h1, h2, h3).

Error Reduction by Boosting. Learn h1 from D1 with error p. Modify D1 so that the total weight of incorrectly classified examples is 1/2, giving D2; pick sample S2 from this distribution and use A to learn h2. Modify D2 so that it puts weight only on examples where h1 and h2 disagree, giving D3; pick sample S3 from this distribution and use A to learn h3.

Error Reduction by Boosting. The total error probability of h is at most 3p^2 − 2p^3, which is less than p when p ∈ (0, 1/2). The proof of how to get this probability is shown in [1]. Thus there exists γ′ > γ such that the error probability of our new hypothesis is at most 1/2 − γ′. [1] http://courses.csail.mit.edu/6.858/lecture-12.ps
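
A quick numeric check of this bound (a throwaway script, not from the slides):

```python
def boosted_error(p):
    """Error of Maj(h1, h2, h3) when each stage errs with probability p."""
    return 3 * p**2 - 2 * p**3

for p in (0.4, 0.3, 0.1):
    print(p, "->", round(boosted_error(p), 4))
# 0.4 -> 0.352, 0.3 -> 0.216, 0.1 -> 0.028: always below p for p in (0, 1/2)
```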

Error Reduction by Boosting

AdaBoost. Defines a classifier using an additive model: F(x) = α1 h1(x) + α2 h2(x) + ... + αT hT(x), with the final prediction given by the sign of F(x).
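
A compact AdaBoost sketch with decision stumps as the weak learner, for binary labels in {-1, +1} (this is the standard formulation, not code from the presentation):

```python
import numpy as np

def adaboost(X, y, T=20):
    """AdaBoost with decision stumps. X: (m, n) feature array, y in {-1, +1}."""
    m, n = X.shape
    D = np.full(m, 1.0 / m)                     # example weights (the distribution D_t)
    stumps, alphas = [], []
    for _ in range(T):
        best = None
        # Weak learner: pick the stump (feature, threshold, polarity) with lowest weighted error.
        for j in range(n):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] >= thr, 1, -1)
                    err = np.sum(D[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-12), 1 - 1e-12)   # avoid division by zero / log(0)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak hypothesis
        pred = pol * np.where(X[:, j] >= thr, 1, -1)
        D *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight correct examples
        D /= D.sum()
        stumps.append((j, thr, pol))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    F = sum(a * p * np.where(X[:, j] >= t, 1, -1) for (j, t, p), a in zip(stumps, alphas))
    return np.sign(F)
```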

Adaboost

Adaboost Example (a worked toy example, stepped through across several figure slides).

Error Reduction by Boosting. Fig.: error curves for boosting C4.5 on the letter dataset, as reported by Schapire et al. []. The training and test error curves are the lower and upper curves, respectively.

PAC learning conclusion Strong PAC learning Weak PAC learning Error reduction and boosting

Mistake Bound Model of Learning Predicting from Expert Advice The Weighted Majority Algorithm Online Learning from Examples The Winnow Algorithm

Mistake Bound Model of Learning | Basic Settings. x: examples. c: the target function, c ∈ C. x1, x2, ..., xt: the input sequence. At the t-th stage: the algorithm receives xt; the algorithm predicts a classification bt for xt; the algorithm receives the true classification c(xt); a mistake occurs if c(xt) ≠ bt. http://en.wikipedia.org/wiki/Online_machine_learning

Mistake Bound Model of Learning | Basic Settings. A hypothesis class C has an algorithm A with mistake bound M if, for any concept c ∈ C and for any ordering of the examples, the total number of mistakes ever made by A is bounded by M. http://en.wikipedia.org/wiki/Online_machine_learning

Mistake Bound Model of Learning | Basic Settings Predicting from Expert Advice The Weighted Majority Algorithm Online Learning from Examples The Winnow Algorithm http://en.wikipedia.org/wiki/Online_machine_learning

Predicting from Expert Advice. The Weighted Majority Algorithm: deterministic and randomized.

Predicting from Expert Advice | Basic Flow. Combine the experts' advice into a prediction, then observe the truth. Assumption: each prediction ∈ {0, 1}. (Illustration source: http://www.metaphonica.com/tag/expert/)

Predicting from Expert Advice | Trial. In each trial the algorithm: (1) receives predictions from the experts; (2) makes its own prediction; (3) is told the correct answer.

Predicting from Expert Advice | An Example. Task: predicting whether it will rain today. Input: advice of n experts, each ∈ {1 (yes), 0 (no)}. Output: 1 or 0. Goal: make the least number of mistakes. (The slide tabulates the advice of Experts 1-3 and the truth for 21-25 Jan 2013.)

The Weighted Majority Algorithm | Deterministic
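
The slide presents the deterministic algorithm itself as a figure; a minimal sketch of the standard procedure it refers to (the function and variable names are mine) looks like this:

```python
def weighted_majority(expert_advice, truths):
    """Deterministic Weighted Majority.

    expert_advice: list of rounds, each a list of n predictions in {0, 1}.
    truths: the correct answer for each round.
    Returns (predictions, final_weights, num_mistakes).
    """
    n = len(expert_advice[0])
    w = [1.0] * n                            # one weight per expert, initialised to 1
    preds, mistakes = [], 0
    for advice, truth in zip(expert_advice, truths):
        vote1 = sum(wi for wi, a in zip(w, advice) if a == 1)
        vote0 = sum(wi for wi, a in zip(w, advice) if a == 0)
        pred = 1 if vote1 >= vote0 else 0    # predict with the weighted majority
        preds.append(pred)
        if pred != truth:
            mistakes += 1
        # Penalise every expert that was wrong by halving its weight.
        w = [wi / 2 if a != truth else wi for wi, a in zip(w, advice)]
    return preds, w, mistakes
```

Halving the weight of each wrong expert is what yields the M ≤ 2.41(m + lg n) bound proved two slides below.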

The Weighted Majority Algorithm | Deterministic Date Expert Advice Weight ∑wi Prediction Correct Answer x1 x2 x3 w1 w2 w3 (xi=0) (xi=1) 21 Jan 2013 22 Jan 2013 23 Jan 2013 24 Jan 2013 25 Jan 2013 1 1 1 1 1 1 2 1 1 1 1 0.50 1 2 0.50 1 1 1 0.50 0.50 0.50 0.50 1 1 1 1 1 0.50 0.25 0.50 0.50 0.75 1 1 1 1 0.25 0.25 0.50 0.25 0.75 1 1

The Weighted Majority Algorithm | Deterministic. Proof: let M be the number of mistakes made by the Weighted Majority algorithm and W the total weight of all experts (initially W = n). On a mistaken prediction, at least half of the total weight predicted incorrectly, and in step 3 that half is halved, so W is reduced by a factor of at least 1/4. Hence W ≤ n(3/4)^M. If the best expert made m mistakes, its weight is (1/2)^m, so W ≥ (1/2)^m. Combining, (1/2)^m ≤ n(3/4)^M, which gives M ≤ 2.41(m + lg n).
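
Spelled out, the last step is the following standard manipulation (filling in the algebra the slide skips):

```latex
\left(\tfrac{1}{2}\right)^{m} \le n \left(\tfrac{3}{4}\right)^{M}
\;\Longrightarrow\;
M \log_2\!\tfrac{4}{3} \le m + \log_2 n
\;\Longrightarrow\;
M \le \frac{m + \log_2 n}{\log_2 (4/3)} \approx 2.41\,(m + \log_2 n).
```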

The Weighted Majority Algorithm | Randomized. View the (normalized) weights as probabilities, and multiply the weight of each mistaken expert by β instead of 1/2.

The Weighted Majority Algorithm | Randomized. Advantages. It dilutes the worst case: in the worst case slightly more than half of the weight predicts incorrectly; the simple version then makes a mistake and reduces the total weight by only 1/4, while the randomized version still has roughly a 50/50 chance of predicting correctly. Selecting an expert with probability proportional to its weight also helps feasibility (e.g., when the experts' predictions cannot easily be combined) and efficiency (e.g., when the experts are programs to be run or evaluated).

The Weighted Majority Algorithm | Randomized
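
A sketch of the randomized variant (again, the algorithm itself is a figure on the slide; this follows the usual formulation with penalty parameter β, and the names are mine):

```python
import random

def randomized_weighted_majority(expert_advice, truths, beta=0.5, rng=random.Random(0)):
    """Randomized Weighted Majority: follow one expert chosen with probability
    proportional to its weight, then multiply the weights of all mistaken
    experts by beta."""
    n = len(expert_advice[0])
    w = [1.0] * n
    mistakes = 0
    for advice, truth in zip(expert_advice, truths):
        # Pick an expert at random, proportionally to the current weights.
        chosen = rng.choices(range(n), weights=w, k=1)[0]
        if advice[chosen] != truth:
            mistakes += 1
        w = [wi * beta if a != truth else wi for wi, a in zip(w, advice)]
    return mistakes
```

With β = 1/2 this is the setting for the expected-mistake bound quoted on the next slide.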

The Weighted Majority Algorithm | Randomized β = ½ then M < 1.39m + 2ln n β = ¾ then M < 1.15m + 4ln n Adjusting β to make the “competitive ratio” as close to 1 as desired.

Mistake Bound Model of Learning Predicting from Expert Advice The Weighted Majority Algorithm Online Learning from Examples The Winnow Algorithm

Online Learning from Examples. The Weighted Majority Algorithm is "learning from expert advice"; now we consider the more general scenario of online learning from examples. Recall offline learning vs. online learning. Offline learning: a training dataset is used to learn the model parameters. Online learning: each "training" instance arrives as a stream, and the model parameters are updated after receiving each instance. Recall the Mistake Bound Model (online learning). At each iteration: receive a feature vector x; predict x's label b; receive x's true label c; update the model parameters. Goal: minimize the number of mistakes made.

Simple Winnow Algorithm. Each input vector is x = (x1, x2, ..., xn), xi ∈ {0, 1}. Assume the target function is the disjunction of r relevant variables, i.e. f(x) = x_{t1} ∨ x_{t2} ∨ ... ∨ x_{tr}. The Winnow algorithm provides a linear separator.

Winnow Algorithm. Initialize: weights w1 = w2 = ... = wn = 1. Iterate: receive an example vector x = (x1, x2, ..., xn); predict: output 1 if w · x ≥ n, output 0 otherwise; get the true label; update only if a mistake was made: on a false negative (predicted 0, true label 1), for each xi = 1 set wi = 2·wi; on a false positive (predicted 1, true label 0), for each xi = 1 set wi = wi/2. Difference with the Weighted Majority Algorithm?
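
A minimal sketch of this update rule (threshold n, doubling and halving), written from the description above rather than taken from the slides:

```python
def winnow(examples, n):
    """Simple Winnow for learning a monotone disjunction over n boolean attributes.

    examples: iterable of (x, y) pairs with x a length-n tuple of 0/1 values
    and y the 0/1 label. Returns (weights, num_mistakes).
    """
    w = [1.0] * n
    mistakes = 0
    for x, y in examples:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
        if pred == y:
            continue
        mistakes += 1
        if y == 1:        # false negative: promote the active attributes
            w = [wi * 2 if xi == 1 else wi for wi, xi in zip(w, x)]
        else:             # false positive: demote the active attributes
            w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, mistakes
```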

Analysis of Simple Winnow. Assumption: the target function is the disjunction of r relevant variables and remains unchanged. Bound: the total number of mistakes is at most 2 + 3r(1 + log n). Analysis and proof. False negative errors (predicting 0 on a true 1): the true label is 1 because at least one relevant variable xt is 1; the error means w · x < n, so wt < n; the update rule sets wt = 2·wt, so the number of such errors with xt = 1 is at most 1 + log n, and the false negative bound is r(1 + log n). False positive errors (predicting 1 on a true 0): a similar analysis gives the bound 2 + 2r(1 + log n).

Extensions of Winnow. 1. Examples do not exactly match the target function: the error bound is O(r·mc + r·lg n), where mc is the number of errors made by the target function; this means Winnow has an O(r)-competitive error bound. 2. The target function may change with time: imagine an adversary changes the target function by adding or removing variables; the error bound is O(cA · log n), where cA is the number of variables changed.

Extensions of Winnow. 3. The feature variables are continuous rather than boolean. Theorem: if the target class is embedding-closed, then it can also be learned by Winnow in the infinite-attribute model. By adding some randomness to the algorithm, the bound can be further improved.

PAC versus MBM (Mistake Bound Model). Intuitively, MBM is stronger than PAC: MBM gives a deterministic upper bound on the number of errors, while PAC only guarantees low error with a given (constant) probability. A natural question: if we know A learns some concept class C in the MBM, can A learn the same C in the PAC model? Answer: of course! We can construct A_PAC in a principled way [1].
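
One standard way to build A_PAC from an online algorithm with mistake bound M is the "longest-survivor" conversion: run the online learner on i.i.d. examples and output the first hypothesis that survives a long enough streak without a mistake. The sketch below illustrates this generic construction under my own interface assumptions; it is not necessarily the exact construction in [1].

```python
import math

def mbm_to_pac(online_learner, draw_example, mistake_bound, eps, delta):
    """Convert a mistake-bound learner into a PAC learner (longest-survivor sketch).

    online_learner: object with .predict(x) and .update(x, y), changing its
        hypothesis only on a mistake, with at most `mistake_bound` mistakes.
    draw_example: callable returning one i.i.d. labelled example (x, y).
    Returns a hypothesis whose error is <= eps with probability >= 1 - delta.
    """
    # A hypothesis with error > eps survives k consecutive examples with
    # probability < (1 - eps)^k; a union bound over at most mistake_bound + 1
    # hypotheses gives the required streak length.
    k = math.ceil(math.log((mistake_bound + 1) / delta) / eps)
    streak = 0
    while True:
        x, y = draw_example()
        if online_learner.predict(x) == y:
            streak += 1
            if streak >= k:                      # current hypothesis looks good enough
                return online_learner.predict    # freeze and return it
        else:
            online_learner.update(x, y)          # mistake: hypothesis changes
            streak = 0
```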

Conclusion. PAC (Probably Approximately Correct): easier than the MBM model, since examples are restricted to come from a fixed distribution; strong PAC and weak PAC; error reduction and boosting. MBM (Mistake Bound Model): a stronger guarantee than PAC, namely an exact upper bound on the number of mistakes; two representative algorithms, Weighted Majority for online expert learning and Winnow for online linear classifier learning. Relationship between PAC and MBM.

Q & A

References
[1] Tel Aviv University machine learning lecture notes: http://www.cs.tau.ac.il/~mansour/ml-course-10/scribe4.pdf
Machine learning theory: http://www.cs.ucla.edu/~jenn/courses/F11.html ; http://www.staff.science.uu.nl/~leeuw112/soiaML.pdf
PAC: www.cs.cmu.edu/~avrim/Talks/FOCS03/tutorial.ppt ; http://www.autonlab.org/tutorials/pac05.pdf ; http://www.cis.temple.edu/~giorgio/cis587/readings/pac.html
Occam's Razor: http://www.cs.iastate.edu/~honavar/occam.pdf
The Weighted Majority Algorithm: http://www.mit.edu/~9.520/spring08/Classes/online_learning_2008.pdf ; http://users.soe.ucsc.edu/~manfred/pubs/C50.pdf ; http://users.soe.ucsc.edu/~manfred/pubs/J24.pdf
The Winnow Algorithm: http://www.cc.gatech.edu/~ninamf/ML11/lect0906.pdf ; http://stat.wharton.upenn.edu/~skakade/courses/stat928/lectures/lecture19.pdf ; http://www.cc.gatech.edu/~ninamf/ML10/lect0121.pdf

Supplementary Slides

Chernoff's Bound. Chernoff bounds are another kind of tail bound. Like the Markov and Chebyshev inequalities, they bound the total probability mass of a random variable Y that lies in the "tail", i.e. far from the mean. For detailed derivations of Chernoff bounds, see http://www.cs.cmu.edu/afs/cs/academic/class/15859-f04/www/scribes/lec9.pdf and http://www.cs.berkeley.edu/~jfc/cs174/lecs/lec10/lec10.pdf
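
For reference, one standard multiplicative form (for Y a sum of independent {0,1}-valued random variables with mean μ; this particular statement is supplied here, not copied from the slide):

```latex
\Pr\big[\,Y \ge (1+\beta)\mu\,\big] \le e^{-\mu \beta^{2}/3},
\qquad
\Pr\big[\,Y \le (1-\beta)\mu\,\big] \le e^{-\mu \beta^{2}/2},
\qquad 0 < \beta < 1.
```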

Proof of Theorem 2 (1/2)

Proof of Theorem 2 (2/2)

Corollary 3: