1 CS 391L: Machine Learning: Computational Learning Theory Raymond J. Mooney University of Texas at Austin

2 Learning Theory
Theorems that characterize classes of learning problems or specific algorithms in terms of computational complexity or sample complexity, i.e., the number of training examples necessary or sufficient to learn hypotheses of a given accuracy.
Complexity of a learning problem depends on:
– Size or expressiveness of the hypothesis space.
– Accuracy to which the target concept must be approximated.
– Probability with which the learner must produce a successful hypothesis.
– Manner in which training examples are presented, e.g., randomly or by query to an oracle.

3 Formal Definition of PAC-Learnable
Consider a concept class C defined over an instance space X containing instances of length n, and a learner L using a hypothesis space H.
C is said to be PAC-learnable by L using H iff for all c ∈ C, all distributions D over X, and all 0 < ε < 0.5 and 0 < δ < 0.5, learner L, by sampling random examples from distribution D, will with probability at least 1 − δ output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n, and size(c).

4 Sample Complexity Result
Therefore, any consistent learner, given at least
    m ≥ (1/ε)(ln|H| + ln(1/δ))
examples will produce a result that is PAC.
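To make the bound concrete, here is a minimal Python sketch (not part of the original slides); the function name and the example hypothesis space, pure conjunctions over 10 Boolean features with |H| = 3^10, are our own illustrative choices.

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Examples sufficient for any consistent learner over a finite H:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative example: pure conjunctions over 10 Boolean features (|H| = 3**10),
# 90% accuracy (epsilon = 0.1) with 95% confidence (delta = 0.05).
print(pac_sample_bound(3 ** 10, epsilon=0.1, delta=0.05))  # -> 140
```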

5 PAC for decision trees of depth k
Assume m attributes.
H_k = number of decision trees of depth k
H_0 = 2
H_{k+1} = m · H_k · H_k
Let L_k = log₂ H_k
L_0 = 1
L_{k+1} = log₂ m + 2 L_k
So L_k = (2^k − 1)(1 + log₂ m) + 1
So to PAC-learn depth-k trees we need at least (1/ε)(L_k ln 2 + ln(1/δ)) examples.
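A small Python sketch (ours, not from the slides) that computes L_k both by the recurrence and by the closed form, and plugs ln|H_k| = L_k · ln 2 into the consistent-learner bound from the previous slide; all names and example values are illustrative.

```python
import math

def log2_num_trees(k, num_attrs):
    """L_k = log2(H_k) via the recurrence L_0 = 1, L_{k+1} = log2(m) + 2*L_k,
    where m = num_attrs and H_k counts decision trees of depth k."""
    L = 1.0
    for _ in range(k):
        L = math.log2(num_attrs) + 2 * L
    return L

def log2_num_trees_closed(k, num_attrs):
    """Closed form from the slide: L_k = (2^k - 1)(1 + log2 m) + 1."""
    return (2 ** k - 1) * (1 + math.log2(num_attrs)) + 1

def tree_sample_bound(k, num_attrs, epsilon, delta):
    """Plug ln|H_k| = L_k * ln 2 into the consistent-learner bound of slide 4."""
    ln_H = log2_num_trees(k, num_attrs) * math.log(2)
    return math.ceil((ln_H + math.log(1 / delta)) / epsilon)

print(log2_num_trees(3, 10), log2_num_trees_closed(3, 10))   # both ~31.25
print(tree_sample_bound(3, 10, epsilon=0.1, delta=0.05))     # ~247 examples
```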

6 Infinite Hypothesis Spaces
The preceding analysis was restricted to finite hypothesis spaces.
Some infinite hypothesis spaces (such as those including real-valued thresholds or parameters) are more expressive than others.
– Compare a rule allowing one threshold on a continuous feature (length < 3cm) vs. one allowing two thresholds (1cm < length < 3cm).
We need some measure of the expressiveness of infinite hypothesis spaces.
– More power: can model more complex classifiers, but might overfit.
– Less power: restricted in what it can model.
How do we characterize the amount of power?

7 VC Dimension
Given some machine (hypothesis class) f, let h be its VC dimension; h is a measure of f's power.
Vapnik showed that, with probability 1 − η,
    TestError ≤ TrainError + √( (h(ln(2N/h) + 1) − ln(η/4)) / N )
where N is the number of training examples.
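As a rough illustration, a Python sketch of this bound; the exact constants in the confidence term vary across sources, so this particular form (and all names and example values) should be read as an assumption rather than the slide's original formula.

```python
import math

def vc_confidence_term(h, n, eta):
    """One commonly quoted form of the VC confidence term (constants vary by
    source): sqrt( (h*(ln(2n/h) + 1) - ln(eta/4)) / n )."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def test_error_bound(train_error, h, n, eta):
    """With probability at least 1 - eta: test error <= train error + term."""
    return train_error + vc_confidence_term(h, n, eta)

# Illustrative values: VC dimension 10, 1000 training examples, eta = 0.05.
print(test_error_bound(train_error=0.05, h=10, n=1000, eta=0.05))  # ~0.31
```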

8 Infinite Hypothesis Spaces
The Vapnik-Chervonenkis (VC) dimension provides just such a measure, denoted VC(H).
Analogous to ln|H|, there are bounds for sample complexity using VC(H).

9 Shattering a Set of Instances

10 Three Instances Shattered

11 Shattering Instances
A hypothesis space is said to shatter a set of instances iff for every partition of the instances into positive and negative, there is a hypothesis that produces that partition.
For example, consider 2 instances, x and y, described using a single real-valued feature, being shattered by intervals. Every partition can be realized by some interval:
+: ∅      –: x, y
+: x      –: y
+: y      –: x
+: x, y   –: ∅

12 Shattering Instances (cont.)
But 3 instances, x < y < z on the real line, cannot be shattered by a single interval. Of the 8 partitions:
+: ∅        –: x, y, z
+: x        –: y, z
+: y        –: x, z
+: x, y     –: z
+: x, y, z  –: ∅
+: y, z     –: x
+: z        –: x, y
+: x, z     –: y    ← cannot be realized: any interval containing x and z also contains y
Since there are 2^m partitions of m instances, in order for H to shatter m instances: |H| ≥ 2^m.
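The shattering argument can be checked by brute force. The sketch below (ours, not from the slides) enumerates every labeling of a point set and tests whether some single closed interval realizes it, confirming that 2 points can be shattered but 3 cannot; the helper name and the candidate-endpoint trick are our own.

```python
from itertools import product

def interval_shatters(points):
    """True iff single closed intervals [a, b] shatter the given real-valued
    points, i.e. every +/- labeling is produced by some interval. Candidate
    endpoints drawn from the points plus two sentinels suffice for this class."""
    pts = sorted(points)
    candidates = pts + [pts[0] - 1, pts[-1] + 1]
    for labeling in product([True, False], repeat=len(pts)):
        realizable = any(
            all((a <= p <= b) == lab for p, lab in zip(pts, labeling))
            for a in candidates for b in candidates if a <= b
        )
        if not realizable:
            return False
    return True

print(interval_shatters([1.0, 2.0]))       # True:  2 points can be shattered
print(interval_shatters([1.0, 2.0, 3.0]))  # False: no interval makes x, z + and y -
```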

13 VC Dimension
An unbiased hypothesis space shatters the entire instance space.
The larger the subset of X that can be shattered, the more expressive the hypothesis space is, i.e., the less biased.
The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H.
– If arbitrarily large finite subsets of X can be shattered, then VC(H) = ∞.
– If there exists at least one subset of X of size d that can be shattered, then VC(H) ≥ d.
– If no subset of size d can be shattered, then VC(H) < d.
For single intervals on the real line, all sets of 2 instances can be shattered, but no set of 3 instances can, so VC(H) = 2.
Since shattering m instances requires |H| ≥ 2^m, we have VC(H) ≤ log₂|H|.

14 VC Dimension Example
Consider axis-parallel rectangles in the real plane, i.e., conjunctions of intervals on two real-valued features.
Some sets of 4 instances can be shattered.
Some sets of 4 instances cannot be shattered (for example, 4 collinear points).

15 VC Dimension Example (cont.)
No set of five instances can be shattered, since there can be at most 4 distinct extreme points (the min and max along each of the 2 dimensions), and any rectangle that includes these 4 extreme points must also include the 5th point, so that point can never be labeled negative while the others are positive.
Therefore VC(H) = 4.
This generalizes to axis-parallel hyper-rectangles (conjunctions of intervals in n dimensions): VC(H) = 2n.
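This argument can also be verified computationally. The sketch below (ours) uses the fact that a labeling is realizable by an axis-parallel rectangle iff the bounding box of the positive points excludes every negative point; the "diamond" coordinates are an illustrative choice, and adding its center point produces a 5-point set that cannot be shattered, matching the extreme-point argument above.

```python
from itertools import product

def rect_shatters(points):
    """True iff axis-parallel rectangles shatter `points`: a labeling is
    realizable iff the bounding box of its positive points excludes every
    negative point (the bounding box is the minimal consistent rectangle)."""
    for labeling in product([True, False], repeat=len(points)):
        pos = [p for p, lab in zip(points, labeling) if lab]
        neg = [p for p, lab in zip(points, labeling) if not lab]
        if not pos:
            continue  # an empty rectangle handles the all-negative labeling
        x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
        y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
        if any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in neg):
            return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # illustrative 4-point set
print(rect_shatters(diamond))                  # True:  so VC(H) >= 4
print(rect_shatters(diamond + [(0, 0)]))       # False: this 5-point set fails
```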

16 Upper Bound on Sample Complexity with VC
Using VC dimension as a measure of expressiveness, the following number of examples has been shown to be sufficient for PAC learning (Blumer et al., 1989):
    m ≥ (1/ε)(4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
Compared to the previous result using ln|H|, this bound has some extra constants and an extra log₂(1/ε) factor. Since VC(H) ≤ log₂|H|, it can provide a tighter upper bound on the number of examples needed for PAC learning.
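A minimal sketch of this bound as quoted above; the constants follow the version given in Mitchell (1997), and the function name and example values are ours.

```python
import math

def vc_upper_bound(vc_dim, epsilon, delta):
    """Sufficient sample size (Blumer et al., 1989), in the form quoted by
    Mitchell (1997): (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon)

# Illustrative example: axis-parallel rectangles in the plane, VC(H) = 4.
print(vc_upper_bound(vc_dim=4, epsilon=0.1, delta=0.05))  # ~2460 examples
```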

17 Conjunctive Learning with Continuous Features
Consider learning axis-parallel hyper-rectangles, i.e., conjunctions of intervals on n continuous features.
– e.g., 1.2 ≤ length ≤ 10.5 ∧ 2.4 ≤ weight ≤ 5.7
Since VC(H) = 2n, the sample complexity is
    m ≥ (1/ε)(4 log₂(2/δ) + 16n log₂(13/ε))
Since the most-specific conjunctive algorithm can easily find the tightest interval along each dimension that covers all of the positive instances (f_min ≤ f ≤ f_max) and runs in linear time, O(|D|·n), axis-parallel hyper-rectangles are PAC learnable.
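Below is a sketch of the most-specific conjunctive algorithm described above, assuming instances are tuples of n real-valued features; all names and the toy data are illustrative.

```python
def most_specific_rectangle(positives):
    """Tightest axis-parallel hyper-rectangle covering the positive instances:
    for each of the n features, keep [min, max] over the positives.
    Runs in O(|D| * n) time."""
    n = len(positives[0])
    lo = [min(x[i] for x in positives) for i in range(n)]
    hi = [max(x[i] for x in positives) for i in range(n)]
    return lo, hi

def classify(x, bounds):
    lo, hi = bounds
    return all(l <= xi <= h for xi, l, h in zip(x, lo, hi))

# Illustrative toy data with (length, weight) features.
positives = [(1.2, 2.4), (10.5, 5.7), (4.0, 3.1)]
h = most_specific_rectangle(positives)
print(h)                          # ([1.2, 2.4], [10.5, 5.7])
print(classify((5.0, 4.0), h))    # True
print(classify((0.5, 4.0), h))    # False
```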

18 Sample Complexity Lower Bound with VC
There is also a general lower bound on the minimum number of examples necessary for PAC learning (Ehrenfeucht et al., 1989):
Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8, 0 < δ < 1/100. Then there exists a distribution D and a target concept in C such that if L observes fewer than
    max[ (1/ε) log(1/δ), (VC(C) − 1)/(32ε) ]
examples, then with probability at least δ, L outputs a hypothesis having error greater than ε.
Ignoring constant factors, this lower bound is the same as the upper bound except for the extra log₂(1/ε) factor in the upper bound.
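For comparison, a sketch of this lower bound with the constants as quoted above (the exact form, including the base of the logarithm, is an assumption on our part); names and example values are ours.

```python
import math

def vc_lower_bound(vc_dim, epsilon, delta):
    """Necessary sample size (Ehrenfeucht et al., 1989):
    max( (1/eps)*log(1/delta), (VC(C) - 1) / (32*eps) ).
    Base-2 log assumed here; the base only changes constant factors."""
    return math.ceil(max(math.log2(1 / delta) / epsilon,
                         (vc_dim - 1) / (32 * epsilon)))

# Illustrative values satisfying epsilon < 1/8 and delta < 1/100.
print(vc_lower_bound(vc_dim=4, epsilon=0.05, delta=0.005))    # ~153 examples
print(vc_lower_bound(vc_dim=100, epsilon=0.05, delta=0.005))  # ~153 examples
```

The result can be contrasted with the vc_upper_bound sketch given after slide 16 to see the gap described above.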

19 Analyzing a Preference Bias
It is unclear how to apply the previous results to an algorithm with a preference bias, such as the simplest decision tree or the simplest DNF.
If the size of the correct concept is n, and the algorithm is guaranteed to return the minimum-sized hypothesis consistent with the training data, then the algorithm will always return a hypothesis of size at most n, and the effective hypothesis space is all hypotheses of size at most n.
Calculate |H| or VC(H) of hypotheses of size at most n to determine sample complexity.
[Figure: target concept c inside the set of hypotheses of size at most n, nested inside the set of all hypotheses.]

20 Computational Complexity and Preference Bias
However, finding a minimum-size hypothesis for most languages is computationally intractable.
If one has an approximation algorithm that can bound the size of the constructed hypothesis to some polynomial function, f(n), of the minimum size n, then this can be used to define the effective hypothesis space.
However, no worst-case approximation bounds are known for practical learning algorithms (e.g., ID3).
[Figure: target concept c inside hypotheses of size at most n, nested inside hypotheses of size at most f(n), nested inside all hypotheses.]

21 “Occam’s Razor” Result (Blumer et al., 1987)
Assume that a concept can be represented using at most n bits in some representation language.
Given a training set, assume the learner returns the consistent hypothesis representable with the least number of bits in this language.
Therefore the effective hypothesis space is all concepts representable with at most n bits.
Since n bits can code for at most 2^n hypotheses, |H| = 2^n, so the sample complexity is bounded by
    m ≥ (1/ε)(n ln 2 + ln(1/δ))
This result can be extended to approximation algorithms that can bound the size of the constructed hypothesis to at most n^k for some fixed constant k (just replace n with n^k).
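A minimal sketch of this bound, obtained by substituting ln|H| = n ln 2 into the consistent-learner result from slide 4; the function name and example values are ours.

```python
import math

def occam_sample_bound(n_bits, epsilon, delta):
    """Sample size sufficient when the learner returns a consistent hypothesis
    representable in at most n bits: |H| = 2^n, so
    m >= (1/eps) * (n*ln 2 + ln(1/delta))."""
    return math.ceil((n_bits * math.log(2) + math.log(1 / delta)) / epsilon)

print(occam_sample_bound(n_bits=100, epsilon=0.1, delta=0.05))  # -> 724
```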

22 Interpretation of “Occam’s Razor” Result
Since the encoding is unconstrained, it fails to provide any meaningful definition of “simplicity.”
The hypothesis space could be any sufficiently small space, such as “the 2^n most complex Boolean functions, where the complexity of a function is the size of its smallest DNF representation.”
It assumes that the correct concept (or a close approximation) is actually in the hypothesis space, so it assumes a priori that the concept is simple.
It does not provide a theoretical justification of Occam’s Razor as it is normally interpreted.

23 COLT Conclusions
The PAC framework provides a theoretical basis for analyzing the effectiveness of learning algorithms.
The sample complexity for any consistent learner using some hypothesis space H can be determined from a measure of its expressiveness, |H| or VC(H), quantifying bias and relating it to generalization.
If sample complexity is tractable, then the computational complexity of finding a consistent hypothesis in H governs its PAC learnability.
Constant factors are more important in sample complexity than in computational complexity, since our ability to gather data is generally not growing exponentially.
Experimental results suggest that theoretical sample complexity bounds over-estimate the number of training instances needed in practice, since they are worst-case upper bounds.

24 COLT Conclusions (cont.)
Additional results have been produced for analyzing:
– Learning with queries
– Learning with noisy data
– Average-case sample complexity given assumptions about the data distribution
– Learning finite automata
– Learning neural networks
Analyzing practical algorithms that use a preference bias is difficult.
Some effective practical algorithms have been motivated by theoretical results:
– Boosting
– Support Vector Machines (SVMs)

25 Mistake Bounds
The mistake-bound model asks: when examples arrive one at a time and the learner must predict each label before seeing it, how many mistakes will the learner make before it exactly learns the target concept?

26 Mistake Bounds: Find-S
Find-S never makes a mistake on a negative example, since its hypothesis is always at least as specific as the target concept.
In the worst case, it makes n + 1 mistakes (for conjunctions over n Boolean literals).
– When do we have the worst case?
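To illustrate where the n + 1 bound comes from, here is a sketch (ours, not from the slides) of Find-S run online over conjunctions of Boolean literals, counting mistakes; the hypothesis is represented as the set of required literals, starting from all 2n of them, and the toy target and example stream are illustrative.

```python
def find_s_mistakes(stream, n):
    """Online Find-S for conjunctions of Boolean literals over n variables.
    The hypothesis is the set of literals (index, value) it still requires,
    initialized to all 2n literals (most specific: predicts False everywhere).
    Returns the number of prediction mistakes made on the stream."""
    hyp = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in stream:
        predict = all(x[i] == v for i, v in hyp)
        if predict != label:
            mistakes += 1
            if label:  # Find-S only generalizes on positive examples;
                       # negatives never cause mistakes on consistent data
                hyp = {(i, v) for (i, v) in hyp if x[i] == v}
    return mistakes

# Illustrative target: x0 AND NOT x2, over n = 3 variables.
stream = [
    ((True, True, False), True),    # first positive: removes n contradictory literals
    ((False, True, False), False),  # negative: never a mistake
    ((True, False, False), True),   # later positives remove at least one literal each
]
print(find_s_mistakes(stream, n=3))  # 2 here; at most n + 1 = 4 in the worst case
```

The worst case occurs when the first mistake (on a positive example) removes only the n contradictory literals and each later positive mistake removes exactly one more literal, giving at most 1 + n mistakes in total.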

27 Mistake Bounds: Halving Algorithm

28 Optimal Mistake Bounds

29 Optimal Mistake Bounds