Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
Today: Computational Learning Theory
– Probably Approximately Correct (PAC) learning theorem
– Vapnik-Chervonenkis (VC) dimension
Recommended reading: Mitchell, Ch. 7
Slides largely pilfered from:

Computational Learning Theory What general laws constrain learning? What can we derive by analysis? * See annual Conference on Computational Learning Theory

Computational Learning Theory
What general laws constrain learning? What can we derive by analysis?
A successful analysis for computation: the time complexity of a program.
– Predict running time as a function of input size (actually, we upper bound it)
– Can we predict error rate as a function of sample size? (or upper bound it?) This is called sample complexity.
– What sort of assumptions do we need to make to do this?
* See the annual Conference on Computational Learning Theory

Sample Complexity
How many training examples suffice to learn the target concept (e.g., a linear classifier, or a decision tree)?
What exactly do we mean by this? Usually:
– There is a space of possible hypotheses H (e.g., binary decision trees).
– In each learning task, there is a target concept c (usually c is in H).
– There is some defined protocol by which the learner gets information.

Sample Complexity
How many training examples suffice to learn the target concept?
1. If the learner proposes instances as queries to the teacher?
   – learner proposes x, teacher provides c(x)
2. If the teacher (who knows c) proposes training examples?
   – teacher proposes the sequence {⟨x1, c(x1)⟩, …, ⟨xm, c(xm)⟩}
3. If some random process (e.g., nature) proposes instances, and the teacher labels them?
   – instances x drawn according to P(X), teacher provides label c(x)

A simple setting to analyze
Problem setting:
– Set of instances X
– Set of hypotheses H
– Set of possible target functions C
– Set of m training instances drawn from P(X); the teacher provides the noise-free label c(x)
Learner outputs any h in H consistent with the training data.
Question: how big should m be (as a function of H) for the learner to succeed?
But first: how do we define "success"? Find c exactly? Find an h that approximates c?

The true error of h is the probability that it will misclassify an example drawn at random from P(X):
error_true(h) ≡ Pr_{x ~ P(X)} [ h(x) ≠ c(x) ]
Usually we can't compute this, but we might be able to prove things about it.
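Though the true error can't be computed from a finite training set, it can be approximated by Monte Carlo sampling whenever P(X) and c can be simulated. A minimal sketch (the sampler, h, and c below are illustrative stand-ins, not from the lecture):

import random

def estimate_true_error(h, c, sample_from_P, n_samples=100_000):
    # Draw i.i.d. instances from P(X) and count disagreements between h and the target c.
    mistakes = 0
    for _ in range(n_samples):
        x = sample_from_P()
        if h(x) != c(x):
            mistakes += 1
    return mistakes / n_samples

# Toy usage: X = {0,1}^3 under the uniform distribution, c = "x[0] and not x[2]".
sample = lambda: tuple(random.randint(0, 1) for _ in range(3))
c = lambda x: int(x[0] == 1 and x[2] == 0)
h = lambda x: int(x[0] == 1)              # an imperfect hypothesis
print(estimate_true_error(h, c, sample))  # ~0.25 for this toy setup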

A simple setting to analyze
Problem setting:
– Set of instances X
– Set of hypotheses H
– Set of possible target functions C
– Set of m training instances drawn from P(X); the teacher provides the noise-free label c(x)
Learner outputs any h in H consistent with the training data.
Question: how big should m be (as a function of H) to guarantee that error_true(h) < ε?
But what if we happen to get a really bad set of examples?

A simple setting to analyze
Problem setting:
– Set of instances X
– Set of hypotheses H
– Set of possible target functions C
– Sequence of m training instances drawn from P(X); the teacher provides the noise-free label c(x)
Learner outputs any h in H consistent with the training data.
Question: how big should m be to ensure that Prob_D[ error_true(h) < ε ] ≥ 1 − δ?
This criterion is called "probably approximately correct" (PAC learnability).

Valiant’s proof

Assume H is finite, and let h* be the learner's output.
Fact: for ε in [0,1], (1 − ε) ≤ e^(−ε)
Consider a "bad" h, where error_true(h) > ε:
– Prob[ h consistent with x_1 ] < (1 − ε) ≤ e^(−ε)
– Prob[ h consistent with all of x_1, …, x_m ] < (1 − ε)^m ≤ e^(−εm)
– Prob[ h* is "bad" ] = Prob[ error_true(h*) > ε ] < |H| e^(−εm)   (union bound)
How big does m have to be?

Valiant's proof
Assume H is finite, and let h* be the learner's output.
How big does m have to be? Let's solve:
Prob[ error_true(h*) > ε ] < |H| e^(−εm) < δ
Taking logs: ln|H| − εm < ln δ, i.e.
m > (1/ε) ( ln|H| + ln(1/δ) )

Pac-learnability result
Problem setting:
– Set of instances X
– Set of hypotheses H
– Set of possible target functions C
– Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and
m ≥ (1/ε) ( ln|H| + ln(1/δ) )
then Prob_D[ error_true(h) ≤ ε ] ≥ 1 − δ.
The sample complexity is logarithmic in |H| and linear in 1/ε.
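A quick arithmetic sketch of this bound (the example |H|, ε, δ values are arbitrary, chosen only for illustration):

import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    # m >= (1/epsilon) * (ln|H| + ln(1/delta))  -- finite-H, consistent-learner bound
    return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g. a hypothesis space with |H| = 2**20
print(pac_sample_bound(2**20, epsilon=0.1, delta=0.05))   # -> 169
print(pac_sample_bound(2**20, epsilon=0.05, delta=0.05))  # -> 338

Note how halving ε doubles the required m, while squaring |H| would only double the ln|H| term.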

Stopped Tues 10/6

E.g., X = {0,1}^n. Each h ∈ H constrains each Xi to be 1, 0, or "don't care".
In other words, each h is a rule such as: If X2=0 and X5=1 then Y=1, else Y=0.
Q: How do we find a consistent hypothesis?
A: Conjoin all constraints that are true of all positive examples.
Notation: VS_{H,D} = { h | h consistent with D } (the version space)
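A minimal sketch of that consistent-hypothesis learner (my own illustration; the function names and toy data are not from the slides). It keeps a constraint Xi = v only when every positive example satisfies it; if the target really is a conjunction and labels are noise-free, the result is also consistent with the negatives.

def learn_conjunction(examples):
    # examples: list of (x, y) with x a tuple of 0/1 and y in {0, 1}.
    # Returns one entry per feature: 0, 1, or None ("don't care").
    positives = [x for x, y in examples if y == 1]
    if not positives:
        return None  # no positive examples: no constraints can be inferred
    n = len(positives[0])
    return [positives[0][i] if all(x[i] == positives[0][i] for x in positives) else None
            for i in range(n)]

def predict(h, x):
    return int(all(v is None or x[i] == v for i, v in enumerate(h)))

# Toy data consistent with "X2=0 and X5=1" (0-indexed), n = 6 features.
data = [((1,0,0,1,0,1), 1), ((0,1,0,0,1,1), 1), ((1,0,1,1,0,1), 0)]
h = learn_conjunction(data)
print(h)                              # -> [None, None, 0, None, None, 1]
print([predict(h, x) for x, _ in data])  # -> [1, 1, 0], consistent with the labels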

Pac-learnability result
Problem setting: Let X = {0,1}^n, H = C = {conjunctions of features}.
Here |H| = 3^n, so this is pac-learnable in polynomial time, with a sample complexity of
m ≥ (1/ε) ( n ln 3 + ln(1/δ) )

Comments on pac-learnability
This is a worst-case theory:
– We assumed the training and test distribution were the same.
– We didn't assume anything about P(X) at all, except that it doesn't change from train to test.
The bounds are not very tight empirically:
– you usually don't need as many examples as the bounds suggest.
We did assume clean training data:
– this can be relaxed.

Another pac-learnability result
Problem setting: Let X = {0,1}^n, H = C = k-CNF = {conjunctions of disjunctions of up to k features}.
– Example (k=3): (X1 ∨ X3) ∧ (X2 ∨ X5 ∨ X7)
This is pac-learnable in polynomial time, with sample complexity polynomial in n^k, 1/ε, and ln(1/δ).
Proof: it's equivalent to learning a conjunction in a new space X′ with O(n^k) features, each corresponding to a disjunction of at most k of the original features.
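A sketch of that reduction (my own illustration): build one new binary feature per non-empty disjunction of at most k original features, then learn a plain conjunction in the expanded space, e.g. with the learner sketched earlier.

from itertools import combinations

def expand_k_disjunctions(x, k):
    # Map x in {0,1}^n to one feature per non-empty disjunction of <= k original features.
    n = len(x)
    feats = []
    for size in range(1, k + 1):
        for idx in combinations(range(n), size):
            feats.append(int(any(x[i] for i in idx)))  # value of that disjunction on x
    return tuple(feats)

print(expand_k_disjunctions((1, 0, 0), k=2))  # -> (1, 0, 0, 1, 1, 0)

# Learning k-CNF over X then reduces to learning a conjunction over X':
#   expanded = [(expand_k_disjunctions(x, k), y) for x, y in data]
#   h = learn_conjunction(expanded)   # learner from the previous sketch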

Pac-learnability result
Problem setting: Let X = {0,1}^n, H = C = k-term DNF = {disjunctions of up to k conjunctions ("terms") of any number of features}.
– Example (k=2): (X1 ∧ X2 ∧ X5) ∨ (X3 ∧ X7)
The sample complexity is polynomial, but it's NP-hard to find a consistent hypothesis!
Proof of sample complexity: for any h, the negation of h is a conjunction of (up to) k disjunctions, so H is not larger than our last H (k-CNF). In fact, k-term DNF is a subset of k-CNF…

Pac-learnability result
Problem setting: Let X = {0,1}^n, H = k-CNF, C = k-term DNF.
This is pac-learnable in polynomial time, with sample complexity polynomial in n^k.
I.e., k-term DNF is pac-learnable by k-CNF.
An increase in sample complexity might be worth suffering if it decreases the time complexity.

Pac-learnability result
Problem setting: Let X = {0,1}^n, H = {boolean functions that can be written down in a subset of Python}.
– E.g.: (x[1] and not x[3]) or (x[7]==x[9])
– H_b = { h in H | h can be written in b bits or less } (writing b for the description length, to avoid clashing with the number of features n)
Learning algorithm: return the simplest consistent hypothesis h.
– I.e.: if h is in H_b, then for all j < b, no h′ in H_j is consistent.
This is pac-learnable (not in polytime), with a sample complexity of about
m ≥ (1/ε) ( b ln 2 + ln(1/δ) ), since |H_b| ≤ 2^b.
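A minimal sketch of "return the simplest consistent hypothesis" for this representation (illustrative only; the candidate list, its ordering by length, and the eval-based evaluation are my assumptions, not part of the lecture):

def simplest_consistent(candidates, data):
    # candidates: hypothesis source strings, assumed pre-sorted by description length.
    # data: list of (x, y) pairs with x a tuple of 0/1 and y in {0, 1}.
    # Returns the shortest candidate that labels every example correctly, else None.
    for src in candidates:
        h = eval("lambda x: int(" + src + ")")  # turn the Python expression into a classifier
        if all(h(x) == y for x, y in data):
            return src
    return None

# Toy usage with a hand-written candidate list (shortest first):
candidates = ["x[0]", "not x[1]", "x[0] and not x[1]", "x[0] or x[2]"]
data = [((1, 0, 0), 1), ((1, 1, 0), 0), ((0, 0, 1), 0)]
print(simplest_consistent(candidates, data))  # -> "x[0] and not x[1]"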

Pac-learnability result (Occam's razor)
Problem setting: Let X = {0,1}^n, H = C = F = {boolean functions that can be written down in a subset of Python}.
– F_b = { f in F | f can be written in b bits or less }
Learning algorithm: return the simplest consistent hypothesis h.
This is pac-learnable (not in polytime), with a sample complexity of about m ≥ (1/ε) ( b ln 2 + ln(1/δ) ).
[Figure: nested hypothesis spaces H1 ⊂ H2 ⊂ H3 ⊂ H4]

VC DIMENSION AND INFINITE HYPOTHESIS SPACES

Pac-learnability result: what if H is not finite?
Problem setting:
– Set of instances X
– Set of hypotheses H
– Set of possible target functions C
– Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and m ≥ (1/ε) ( ln|H| + ln(1/δ) ), then Prob_D[ error_true(h) ≤ ε ] ≥ 1 − δ.
Can I find some property like ln|H| for perceptrons, neural nets, … ?

Definitions:
– A dichotomy of a set S is a labeling of each member of S as positive or negative.
– A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some h in H consistent with that dichotomy.
– The VC dimension of H over instance space X, VC(H), is the size of the largest finite subset of X shattered by H (∞ if arbitrarily large finite subsets can be shattered).

VC(H) ≥ 3
[Figure: a three-point set shattered by H — every dichotomy of the three points is realized by some h in H]

Pac-learnability result: what if H is not finite?
Problem setting:
– Set of instances X
– Set of hypotheses/target functions H = C, where VC(H) = d
– Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and
m ≥ (1/ε) ( 4 log2(2/δ) + 8 d log2(13/ε) )
then Prob_D[ error_true(h) ≤ ε ] ≥ 1 − δ.
Compare: m ≥ (1/ε) ( ln|H| + ln(1/δ) ) for finite H.

VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1, x2, x3} below. Is this set shattered?
[Figure: three points x1, x2, x3 in the plane]

VC dimension: An example
Consider H = { rectangles in 2-D space }. Is this set shattered?
[Figure: the three points x1, x2, x3]

VC dimension: An example
Consider H = { rectangles in 2-D space }. Is this set shattered?
[Figure: the three points x1, x2, x3]
Yes: every dichotomy can be expressed by some rectangle in H.

VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1, x2, x3, x4} below. Is this set shattered?
[Figure: four points where x4 lies inside the bounding box of x1, x2, x3]
No: I can't include {x1, x2, x3} in a rectangle without including x4.

VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1, x2, x3, x4} below. Is this set shattered?
[Figure: a different configuration of four points x1, x2, x3, x4]

VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1, x2, x3, x4} below. Is this set shattered?
[Figure: the four points x1, x2, x3, x4]
Yes: every dichotomy can be expressed by some rectangle in H.

VC dimension: An example
Consider H = { rectangles in 2-D space }. Can you shatter any set of 5 points?
No: consider x1 = min(horizontal), x2 = max(horizontal), x3 = min(vertical), x4 = max(vertical), and let x5 be the remaining point. Any rectangle that includes x1…x4 must include x5.
So VC(H) = 4.
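These shattering arguments can be checked by brute force for small point sets. A sketch (my own illustration), using the fact that an axis-aligned rectangle consistent with a dichotomy exists iff the bounding box of the positively-labeled points contains no negatively-labeled point:

from itertools import product

def rect_consistent(pos, neg):
    # Is there an axis-aligned rectangle containing every point in pos and no point in neg?
    if not pos:
        return True  # a rectangle placed away from all the points labels them all negative
    xs = [p[0] for p in pos]; ys = [p[1] for p in pos]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)  # bounding box of the positives
    return not any(x0 <= q[0] <= x1 and y0 <= q[1] <= y1 for q in neg)

def shattered_by_rectangles(points):
    # Check every dichotomy of the point set.
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not rect_consistent(pos, neg):
            return False
    return True

print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2)]))          # diamond of 4: True
print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2), (1, 1)]))  # add the centre: False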

VC dimension: An example
Consider H = { rectangles in 3-D space }. What's the VC dimension?

Pac-learnability result: what if H is not finite?
Problem setting:
– Set of instances X
– Set of hypotheses/target functions H = C, where VC(H) = d
– Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and
m ≥ (1/ε) ( 4 log2(2/δ) + 8 d log2(13/ε) )
then Prob_D[ error_true(h) ≤ ε ] ≥ 1 − δ.

Intuition: why large VC dimension hurts
Suppose you have labels for all but one of the instances in X, you know the labels are consistent with something in H, and X is shattered by H. Can you guess the label of the final element of X?
No: you need at least VC(H) examples before you can start learning.

Intuition: why small VC dimension helps
[Figure: a sample S of m points x1 … x16 in the plane]
How many rectangles are there that are different on the sample S, if |S| = m?
Claim: choose(m, 4) — every 4 positive points defines a distinct-on-the-data h.
This can be generalized: if VC(H) = d, then the number of distinct-on-the-data hypotheses grows as O(m^d).
We don't need to worry about h's that classify the data exactly the same.

How tight is this bound?
How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1 − δ) approximately (ε) correct?
Tightness of Bounds on Sample Complexity
Lower bound on sample complexity (Ehrenfeucht et al., 1989): Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 1/100. Then there exists a distribution and a target concept in C such that if L observes fewer examples than
max[ (1/ε) log(1/δ), (VC(C) − 1) / (32ε) ]
then with probability at least δ, L outputs a hypothesis h with error_true(h) > ε.
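To get a feel for the gap between the upper and lower bounds, here is a small calculator (arithmetic sketch only; it just evaluates the two formulas quoted above, base-2 logs assumed, with made-up example values):

import math

def vc_upper_bound(d, eps, delta):
    # m >= (1/eps) * (4*log2(2/delta) + 8*d*log2(13/eps))
    return math.ceil((4 * math.log2(2 / delta) + 8 * d * math.log2(13 / eps)) / eps)

def vc_lower_bound(d, eps, delta):
    # max[(1/eps)*log2(1/delta), (d - 1)/(32*eps)]   (Ehrenfeucht et al., 1989)
    return math.ceil(max(math.log2(1 / delta) / eps, (d - 1) / (32 * eps)))

d, eps, delta = 5, 0.1, 0.01
print(vc_lower_bound(d, eps, delta), "<= m_needed <=", vc_upper_bound(d, eps, delta))
# -> 67 <= m_needed <= 3115, i.e. the upper bound is quite loose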

AGNOSTIC LEARNING AND NOISE

Here ε is the difference between the training error and true error of the output hypothesis (the one with lowest training error).

Additive Hoeffding Bounds – Agnostic Learning
Given m independent flips of a coin with true Pr(heads) = θ, we can bound the error in the maximum likelihood estimate θ̂:
Pr[ θ̂ > θ + ε ] ≤ e^(−2mε^2)
Relevance to agnostic learning: for any single hypothesis h,
Pr[ error_true(h) > error_train(h) + ε ] ≤ e^(−2mε^2)
But we must consider all hypotheses in H (union bound again!):
Pr[ (∃ h ∈ H) error_true(h) > error_train(h) + ε ] ≤ |H| e^(−2mε^2)
So, with probability at least (1 − δ), every h satisfies
error_true(h) ≤ error_train(h) + sqrt( ( ln|H| + ln(1/δ) ) / (2m) )
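A one-liner to evaluate that last bound numerically (a sketch; the example |H|, m, and δ are arbitrary):

import math

def agnostic_gap(hypothesis_space_size, m, delta):
    # epsilon such that, w.p. >= 1 - delta, error_true(h) <= error_train(h) + epsilon for all h in H
    return math.sqrt((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / (2 * m))

print(round(agnostic_gap(3**10, m=1000, delta=0.05), 3))  # -> 0.084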

Agnostic Learning: VC Bounds [Schölkopf and Smola, 2002]
With probability at least (1 − δ), every h ∈ H satisfies
error_true(h) ≤ error_train(h) + sqrt( ( VC(H) ( ln(2m / VC(H)) + 1 ) + ln(4/δ) ) / m )

Structural Risk Minimization [Vapnik]
Which hypothesis space should we choose? Bias/variance tradeoff.
[Figure: nested hypothesis spaces H1 ⊂ H2 ⊂ H3 ⊂ H4]
SRM: choose H to minimize the bound on expected true error!*
* this is a very loose bound...

MORE VC DIMENSION EXAMPLES

VC dimension: examples
Consider X = ℝ; we want to learn c: X → {0,1}.
What is the VC dimension of:
Open intervals:
– H1: if x > a then y = 1 else y = 0
– H2: if x > a then y = 1 else y = 0, or if x > a then y = 0 else y = 1
Closed intervals:
– H3: if a < x < b then y = 1 else y = 0
– H4: if a < x < b then y = 1 else y = 0, or if a < x < b then y = 0 else y = 1

VC dimension: examples
Consider X = ℝ; we want to learn c: X → {0,1}.
What is the VC dimension of:
Open intervals:
– H1: if x > a then y = 1 else y = 0 — VC(H1) = 1
– H2: if x > a then y = 1 else y = 0, or if x > a then y = 0 else y = 1 — VC(H2) = 2
Closed intervals:
– H3: if a < x < b then y = 1 else y = 0 — VC(H3) = 2
– H4: if a < x < b then y = 1 else y = 0, or if a < x < b then y = 0 else y = 1 — VC(H4) = 3

VC dimension: examples
What is the VC dimension of lines in a plane?
H2 = { (w0 + w1·x1 + w2·x2 > 0) → y = 1 }

VC dimension: examples
What is the VC dimension of H2 = { (w0 + w1·x1 + w2·x2 > 0) → y = 1 }?
– VC(H2) = 3
For Hn = linear separating hyperplanes in n dimensions, VC(Hn) = n + 1.
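The VC(H2) ≥ 3 half of that claim can be checked by brute force: take three non-collinear points and verify that every one of the 2^3 labelings is linearly separable. A sketch (my own illustration) using the perceptron rule, which finds a separating line whenever one exists, given enough passes (for an inseparable labeling it would simply give up after the epoch limit):

from itertools import product

def perceptron_separates(points, labels, epochs=1000):
    # Try to find (w0, w1, w2) with sign(w0 + w1*x + w2*y) matching labels in {-1, +1}.
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        updated = False
        for (x, y), t in zip(points, labels):
            if t * (w[0] + w[1] * x + w[2] * y) <= 0:    # misclassified (or on the boundary)
                w[0] += t; w[1] += t * x; w[2] += t * y  # perceptron update
                updated = True
        if not updated:
            return True  # all points strictly on the correct side
    return False

points = [(0, 0), (1, 0), (0, 1)]  # three non-collinear points
print(all(perceptron_separates(points, labels)
          for labels in product([-1, 1], repeat=3)))  # True -> VC(H2) >= 3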

For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H| ? (hint: yes)

More VC Dimension Examples to Think About
– Logistic regression over n continuous features. Over n boolean features?
– Linear SVM over n continuous features
– Decision trees defined over n boolean features, F: ⟨X1, …, Xn⟩ → Y
– Decision trees of depth 2 defined over n features
– How about 1-nearest neighbor?