Probably Approximately Correct (PAC) Learning. Leslie G. Valiant. A Theory of the Learnable. Comm. ACM 27(11) (1984), 1134-1142.

Recall: Bayesian learning. Create a model based on some parameters. Assume some prior distribution on those parameters. Learning problem: – Adjust the model parameters so as to maximize the likelihood of the model given the data. – Use Bayes' formula for that.

PAC Learning. Given a distribution D over the observable space X. Given a family of functions (concepts) F. For each x ∈ X and f ∈ F, f(x) provides the label for x. Given a family of hypotheses H, seek a hypothesis h such that Error(h) = Pr_{x~D}[f(x) ≠ h(x)] is minimal.
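As an illustration of this error measure (not part of the original slides), here is a minimal Python sketch that estimates Error(h) by sampling; sample_from_D, f and h are assumed to be user-supplied functions.

```python
def estimate_error(h, f, sample_from_D, m=10000):
    """Monte Carlo estimate of Error(h) = Pr_{x~D}[f(x) != h(x)].

    h, f          -- functions mapping an example x to a {0, 1} label
    sample_from_D -- draws one example x according to the distribution D
    m             -- number of i.i.d. samples used for the estimate
    """
    mistakes = 0
    for _ in range(m):
        x = sample_from_D()
        if f(x) != h(x):
            mistakes += 1
    return mistakes / m
```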

PAC: New Concepts. Large family of distributions D. Large family of concepts F. Family of hypotheses H. Main questions: – Is there a hypothesis h ∈ H that can be learned? – How fast can it be learned? – What error can be expected?

Estimation vs. approximation. Note: – The distribution D is fixed. – There is no noise in the system (currently). – F is a space of binary functions (concepts). This is thus an approximation problem, since the function value is given exactly for each x ∈ X. Estimation problem: f is not given exactly but estimated from noisy data.

Example (PAC) Concept: Average body-size person Inputs: for each person: –height –weight Sample: labeled examples of persons –label + : average body-size –label - : not average body-size Two dimensional inputs

Observable space X with concept f

Example (PAC) Assumption: target concept is a rectangle. Goal: –Find a rectangle h that “approximates” the target –Hypothesis family H of rectangles Formally: –With high probability –output a rectangle such that –its error is low.

Example (Modeling) Assume: – Fixed distribution over persons. Goal: – Low error with respect to THIS distribution!!! What does the distribution look like? – Highly complex. – Each parameter is not uniform. – Highly correlated.

Model Based approach First try to model the distribution. Given a model of the distribution: –find an optimal decision rule. Bayesian Learning

PAC approach Assume that the distribution is fixed. Samples are drawn i.i.d.: – independent – identically distributed. Concentrate on the decision rule rather than on the distribution.

PAC Learning Task: learn a rectangle from examples. Input: points x with their classification f(x) (+ or -), where f classifies by a rectangle R. Goal: – using the fewest examples – compute h – such that h is a good approximation of f.

PAC Learning: Accuracy Testing the accuracy of a hypothesis: – using the distribution D of examples. Error = h Δ f (symmetric difference). Pr[Error] = D(Error) = D(h Δ f). We would like Pr[Error] to be controllable. Given a parameter ε: – Find h such that Pr[Error] < ε.

PAC Learning: Hypothesis Which Rectangle should we choose? –Similar to parametric modeling?

Setting up the Analysis: Choose the smallest rectangle. Need to show: – For any distribution D and target rectangle f – with input parameters ε and δ – select m(ε, δ) examples – let h be the smallest consistent rectangle. – Then with probability 1-δ (over the sample): D(f Δ h) < ε.
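A minimal sketch of this tightest-fit learner (my own illustration, with an assumed input encoding): it returns the smallest axis-aligned rectangle containing all positive examples, which is always consistent when the target is a rectangle and there is no noise.

```python
def tightest_rectangle(examples):
    """examples: list of ((x, y), label) pairs with label in {0, 1}.

    Returns the smallest axis-aligned rectangle (x_min, x_max, y_min, y_max)
    that contains all positive examples, or None if there are none.
    """
    positives = [point for point, label in examples if label == 1]
    if not positives:
        return None
    xs, ys = zip(*positives)
    return (min(xs), max(xs), min(ys), max(ys))


def rect_predict(rect, point):
    """Classify a point with the learned rectangle: 1 inside, 0 outside."""
    if rect is None:
        return 0
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return int(x_min <= x <= x_max and y_min <= y <= y_max)
```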

More general case (no rectangle). A distribution: D (unknown). Target function: c_t from C – c_t: X → {0,1}. Hypothesis: h from H – h: X → {0,1}. Error probability: – error(h) = Pr_D[h(x) ≠ c_t(x)]. Oracle: EX(c_t, D).

PAC Learning: Definition. C and H are concept classes over X. C is PAC learnable by H if there exists an algorithm A such that: – for any distribution D over X and any c_t in C, – for every input ε and δ, – A outputs a hypothesis h in H, – while having access to EX(c_t, D), – such that with probability 1-δ we have error(h) < ε. Complexities: sample size and running time.

Finite Concept Class. Assume C = H and finite. h is ε-bad if error(h) > ε. Algorithm: – Sample a set S of m(ε, δ) examples. – Find h in H which is consistent with S. The algorithm fails if h is ε-bad.

PAC learning: formalization (1). X is the set of all possible examples. D is the distribution from which the examples are drawn. H is the set of all possible hypotheses, c ∈ H. m is the number of training examples. error(h) = Pr[h(x) ≠ c(x) | x is drawn from X according to D]. h is approximately correct if error(h) ≤ ε.

PAC learning: formalization (2). To show: after m examples, with high probability, all consistent hypotheses are approximately correct. All consistent hypotheses lie in an ε-ball around c. (Figure: hypothesis space H with the ε-ball around c and the bad region H_bad ⊆ H outside it.)

Complexity analysis (1). The probability that a hypothesis h_bad ∈ H_bad is consistent with the first m examples: error(h_bad) > ε by definition. The probability that it agrees with a single example is thus at most (1-ε), and with m examples at most (1-ε)^m.

Complexity analysis (2). For H_bad to contain a consistent hypothesis, at least one hypothesis in it must be consistent. Pr(H_bad contains a consistent hypothesis) ≤ |H_bad|(1-ε)^m ≤ |H|(1-ε)^m.

Complexity analysis (3). To reduce the probability of error below δ: |H|(1-ε)^m ≤ δ. This is possible when at least m ≥ (1/ε)(ln(1/δ) + ln|H|) examples are seen. This is the sample complexity.
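A small sketch (my own, not from the slides) that evaluates this bound: given ε, δ and |H|, it returns a number of examples sufficient for a consistent learner over a finite hypothesis class.

```python
import math

def pac_sample_size(epsilon, delta, h_size):
    """Sample complexity for a finite hypothesis class:
    m >= (1/epsilon) * (ln(1/delta) + ln|H|)."""
    return math.ceil((math.log(1.0 / delta) + math.log(h_size)) / epsilon)

# Example: OR functions over n = 20 variables, so |H| = 3**20.
print(pac_sample_size(epsilon=0.1, delta=0.05, h_size=3**20))
```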

Complexity analysis (4). "After at least m examples, a consistent hypothesis h has error at most ε with probability 1-δ." Since |H| = 2^(2^n) for arbitrary boolean functions over n attributes, the sample complexity grows exponentially with n. Conclusion: learning an arbitrary boolean function is, in the worst case, no better than table lookup!

PAC learning -- observations. "Hypothesis h(X) is consistent with m examples and has an error of at most ε with probability 1-δ." This is a worst-case analysis. Note that the result is independent of the distribution D! Growth rate analysis: – as ε → 0, m grows proportionally to 1/ε – as δ → 0, m grows logarithmically in 1/δ – as |H| → ∞, m grows logarithmically in |H|.

PAC: comments. We only assumed that examples are i.i.d. We have two independent parameters: – Accuracy ε – Confidence δ. No assumption about the likelihood of a hypothesis. The hypothesis is tested on the same distribution as the sample.

PAC: non-feasible case. What happens if c_t is not in H? We need to redefine the goal. Let h* in H minimize the error error(h*). Goal: find h in H such that error(h) ≤ error(h*) + ε.

Analysis*. For each h in H: – let obs-error(h) be the average error on the sample S. Bound the probability of a large deviation: Pr{|obs-error(h) - error(h)| ≥ ε/2}. Chernoff bound: Pr < exp(-(ε/2)^2 m). Union bound over the entire H: Pr < |H| exp(-(ε/2)^2 m). Sample size: m > (4/ε^2) ln(|H|/δ).

Correctness. Assume that for all h in H: – |obs-error(h) - error(h)| < ε/2. In particular: – obs-error(h*) < error(h*) + ε/2 – error(h) - ε/2 < obs-error(h). For the output h: – obs-error(h) ≤ obs-error(h*). Conclusion: error(h) < error(h*) + ε.

Sample size issue. Due to the use of the Chernoff bound on Pr{|obs-error(h) - error(h)| ≥ ε/2}: Pr < exp(-(ε/2)^2 m), and over the entire H: Pr < |H| exp(-(ε/2)^2 m). It follows that the sample size is m > (4/ε^2) ln(|H|/δ), not (1/ε) ln(|H|/δ) as before.
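A quick sketch (my own, using the slide's constants) that contrasts the realizable bound with this agnostic bound:

```python
import math

def realizable_sample_size(epsilon, delta, h_size):
    """m > (1/epsilon) * ln(|H|/delta): consistent-hypothesis (feasible) case."""
    return math.ceil(math.log(h_size / delta) / epsilon)

def agnostic_sample_size(epsilon, delta, h_size):
    """m > (4/epsilon**2) * ln(|H|/delta): non-feasible case,
    via the Chernoff bound used on the slide."""
    return math.ceil(4.0 * math.log(h_size / delta) / epsilon ** 2)

print(realizable_sample_size(0.1, 0.05, 3 ** 20))   # grows like 1/epsilon
print(agnostic_sample_size(0.1, 0.05, 3 ** 20))     # grows like 1/epsilon**2
```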

Example: Learning OR of Literals. Inputs: x_1, …, x_n. Literals: x_i and its negation. OR functions: for each variable, the target disjunction may contain x_i, contain its negation, or omit the variable; thus the number of disjunctions is 3^n.

ELIM: Algorithm for learning OR. Keep a list of all literals. For every example whose classification is 0: – Erase all the literals that are 1. Example: c(00110) = 0 results in deleting the literals that are 1 on it (the negations of x_1, x_2, x_5 and the literals x_3, x_4). Correctness: – Our hypothesis h: an OR of our set of literals. – Our set of literals includes the target OR literals. – Every time h predicts zero, we are correct. Sample size: m > (1/ε) ln(3^n/δ).
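A minimal sketch of ELIM (the encoding of literals as (index, polarity) pairs is my own assumption): each negative example erases every literal it satisfies, and the hypothesis is the OR of the surviving literals.

```python
def elim(examples, n):
    """ELIM for learning an OR of literals over n boolean variables.

    examples -- list of (x, label) pairs, x a tuple of n bits, label in {0, 1}
    Returns the surviving literals as (index, polarity) pairs, where polarity 1
    stands for x_i and polarity 0 for its negation.
    """
    literals = {(i, b) for i in range(n) for b in (0, 1)}
    for x, label in examples:
        if label == 0:
            # Erase every literal that evaluates to 1 on a negative example.
            literals -= {(i, b) for (i, b) in literals if x[i] == b}
    return literals


def or_predict(literals, x):
    """Hypothesis h: the OR of the surviving literals."""
    return int(any(x[i] == b for (i, b) in literals))
```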

Learning parity. Functions: e.g. x_1 ⊕ x_7 ⊕ x_9. Number of functions: 2^n. Algorithm: – Sample a set of examples. – Solve the resulting linear equations over GF(2). Sample size: m > (1/ε) ln(2^n/δ).
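A sketch of the "solve linear equations" step (my own illustration): each example (x, y) gives one linear equation over GF(2) in the unknown indicator vector of the parity, and Gaussian elimination mod 2 recovers a consistent parity.

```python
def learn_parity(examples, n):
    """Recover a parity function consistent with the examples.

    examples -- list of (x, y): x a tuple of n bits, y the XOR of an unknown
                subset of the bits of x.
    Returns a set S of indices such that XOR_{i in S} x_i matches every
    example, assuming a consistent parity exists.
    """
    # Augmented rows [x | y] over GF(2).
    rows = [list(x) + [y] for x, y in examples]
    pivot_of_col = {}
    for row in rows:
        for col in range(n):
            if row[col] == 0:
                continue
            if col in pivot_of_col:
                # Eliminate this column using the stored pivot row.
                pivot = pivot_of_col[col]
                for j in range(n + 1):
                    row[j] ^= pivot[j]
            else:
                pivot_of_col[col] = row
                break
    # Back-substitute; free variables are set to 0.
    solution = [0] * n
    for col in sorted(pivot_of_col, reverse=True):
        row = pivot_of_col[col]
        val = row[n]
        for j in range(col + 1, n):
            val ^= row[j] & solution[j]
        solution[col] = val
    return {i for i in range(n) if solution[i] == 1}
```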

Infinite Concept Class. X = [0,1] and H = {c_θ | θ in [0,1]}. c_θ(x) = 0 iff x < θ (label 1 iff x ≥ θ). Assume C = H: which c_θ should we choose in [min, max]? (Figure: labeled points on [0,1] with the uncertainty interval [min, max].)

Proof I. Define max = min{x | c(x)=1} and min = max{x | c(x)=0} over the sample. Show that Pr[D([min, max]) > ε] < δ. Proof (by contradiction): – Suppose the probability that x lies in [min, max] is at least ε. – Then the probability that we do not sample from [min, max] is (1-ε)^m. – This needs m > (1/ε) ln(1/δ). There is something wrong: the interval [min, max] is itself defined by the sample, so the argument is circular.

Proof II (correct): Let max' be such that D([θ, max']) = ε/2. Let min' be such that D([min', θ]) = ε/2. Goal: show that with high probability – x_+, the smallest positive sample, lies in [θ, max'] and – x_-, the largest negative sample, lies in [min', θ]. In such a case any value in [x_-, x_+] is good. Compute the sample size!

Proof II (correct): Pr[none of x_1, x_2, .., x_m falls in [min', θ]] = (1 - ε/2)^m < exp(-mε/2). Similarly for the other side. We require 2 exp(-mε/2) < δ. Thus, m > (2/ε) ln(2/δ).
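A tiny sketch of the resulting learner (my own illustration): with labels 1 for x ≥ θ, any threshold between the largest negative sample and the smallest positive sample is consistent; the midpoint is one natural choice.

```python
def learn_threshold(samples):
    """samples: list of (x, label) with x in [0, 1] and label = 1 iff x >= theta.

    Returns a threshold consistent with the sample: the midpoint of the
    uncertainty interval [largest negative sample, smallest positive sample].
    """
    negatives = [x for x, y in samples if y == 0]
    positives = [x for x, y in samples if y == 1]
    lo = max(negatives) if negatives else 0.0   # x_-
    hi = min(positives) if positives else 1.0   # x_+
    return (lo + hi) / 2.0
```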

Comments. The hypothesis space was very simple: H = {c_θ | θ in [0,1]}. There was no noise in the data or labels. So learning was trivial in some sense (analytic solution).

Non-Feasible case: Label Noise. Suppose we sample examples whose labels are noisy. Algorithm: – Find the function h with the lowest observed error!

Analysis. Define {z_i} as an ε/4-net (w.r.t. D). For the optimal h* and our h there are: – z_j: |error(h[z_j]) - error(h*)| < ε/4 – z_k: |error(h[z_k]) - error(h)| < ε/4. Show that with high probability: – |obs-error(h[z_i]) - error(h[z_i])| < ε/4.

Exercise (Submission Mar 29, 04) 1. Assume there is Gaussian(0, σ) noise on the x_i. Apply the same analysis to compute the required sample size for PAC learning. Note: class labels are determined by the non-noisy observations.

General ε-net approach. Given a class H, define a class G such that: – for every h in H – there exists a g in G such that – D(g Δ h) < ε. Algorithm: find the best h in H. Compute the confidence and sample size.

Occam's Razor. W. of Occam (c. 1320): "Entities should not be multiplied unnecessarily." A. Einstein: "Simplify a problem as much as possible, but no simpler." Information-theoretic grounds?

Occam Razor. Finding the shortest consistent hypothesis. Definition: (α, β)-Occam algorithm – α > 0 and β < 1 – Input: a sample S of size m – Output: hypothesis h – for every (x, b) in S: h(x) = b (consistency) – size(h) < size(c_t)^α · m^β – Efficiency.

Occam algorithm and compression. (Figure: party A holds the labeled sample S = {(x_i, b_i)}; party B holds only the points x_1, …, x_m; A must communicate the labels to B.)

Compression. Option 1: – A sends B the values b_1, …, b_m – m bits of information. Option 2: – A sends B the hypothesis h – Occam: for large enough m, size(h) < m. Option 3 (MDL): – A sends B a hypothesis h and "corrections" – complexity: size(h) + size(errors).

Occam Razor Theorem. A: an (α, β)-Occam algorithm for C using H. D: a distribution over the inputs X. c_t in C the target function, n = size(c_t). Sample size: m = O((1/ε) ln(1/δ) + ((n^α)/ε)^(1/(1-β))). Then with probability 1-δ, A(S) = h has error(h) < ε.

Occam Razor Theorem (proof idea). Use the bound for a finite hypothesis class. Effective hypothesis class size: 2^size(h), with size(h) < n^α m^β. Sample size: m ≥ (1/ε)(ln(1/δ) + n^α m^β ln 2), which is then solved for m. The VC dimension will later replace 2^size(h).

Exercise (Submission Mar 29, 04) 2. For an (α, β)-Occam algorithm, given noisy data with noise ~ Gaussian(0, σ^2), find the limitations on m. Hint: ε-net and Chernoff bound.

Learning OR with few attributes. Target function: OR of k literals. Goal: learn in time – polynomial in k and log n – with ε and δ constant. ELIM makes "slow" progress: – it may disqualify only one literal per round – and may remain with O(n) literals.

Set Cover - Definition. Input: S_1, …, S_t with S_i ⊆ U. Output: S_{i_1}, …, S_{i_k} with ∪_j S_{i_j} = U. Question: are there k sets that cover U? NP-complete.

Set Cover: Greedy algorithm. j = 0; U_j = U; C = ∅. While U_j ≠ ∅: – let S_i be argmax_i |S_i ∩ U_j| – add S_i to C – let U_{j+1} = U_j − S_i – j = j + 1.
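A direct Python rendering of this greedy procedure (a sketch; function and variable names are my own):

```python
def greedy_set_cover(sets, universe):
    """Greedy set cover: repeatedly pick the set covering the most
    still-uncovered elements.

    sets     -- list of Python sets S_1, ..., S_t
    universe -- the ground set U, assumed covered by the union of the sets
    Returns the indices of the chosen sets, in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # arg max_i |S_i ∩ U_j|
        i = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(i)
        uncovered -= sets[i]
    return chosen
```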

Set Cover: Greedy Analysis. At termination, C is a cover. Assume there is a cover C' of size k. C' is a cover for every U_j. Some S in C' covers at least |U_j|/k elements of U_j. Analysis of U_j: |U_{j+1}| ≤ |U_j| - |U_j|/k. Solving the recursion: the number of sets is j < k ln|U|. (Exercise: solve the recursion.)

Learning decision lists. Decision lists of arbitrary size can represent any boolean function. Lists with tests of at most k < n literals define the k-DL boolean language; for n attributes, the language is k-DL(n). The language of tests is Conj(n, k), the conjunctions of at most k literals. Each test's component can be Yes, No, or absent, so there are at most 3^|Conj(n,k)| distinct component sets. |k-DL(n)| ≤ 3^|Conj(n,k)| · |Conj(n,k)|! (any order). |Conj(n,k)| = Σ_{i=0..k} C(2n, i) = O(n^k).
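A small check of the counting step above (my own illustration): |Conj(n, k)| is a sum of binomial coefficients over the 2n literals.

```python
from math import comb

def num_conjunctions(n, k):
    """|Conj(n, k)|: conjunctions of at most k literals over n boolean
    attributes (each attribute contributes 2 literals)."""
    return sum(comb(2 * n, i) for i in range(k + 1))

print(num_conjunctions(10, 2))   # 1 + 20 + 190 = 211, i.e. O(n**k)
```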

Building an Occam algorithm. Given a sample S of size m: – Run ELIM on S. – Let LIT be the resulting set of literals. – There exist k literals in LIT that classify all of S correctly. Negative examples: – any subset of LIT classifies them correctly.

Building an Occam algorithm. Positive examples: – Search for a small subset of LIT – which classifies S_+ correctly. – For a literal z build T_z = {x in S_+ | z satisfies x}. – There are k sets that cover S_+. – Find k ln m sets that cover S_+. Output h = the OR of the k ln m literals. size(h) < k ln m · log(2n). Sample size: m = O(k log n · log(k log n)).
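A sketch combining the two earlier sketches (elim and greedy_set_cover above) into the Occam algorithm described here; the (index, polarity) literal encoding is the same assumption as before.

```python
def occam_or_learner(examples, n):
    """Learn a small OR of literals: run ELIM, then greedily pick surviving
    literals whose satisfied-positive sets cover all positive examples.

    Relies on elim() and greedy_set_cover() defined in the earlier sketches.
    """
    literals = sorted(elim(examples, n))               # surviving literals
    positives = [x for x, label in examples if label == 1]
    universe = set(range(len(positives)))
    # T_z: indices of the positive examples satisfied by literal z = (i, b).
    cover_sets = [{j for j, x in enumerate(positives) if x[i] == b}
                  for (i, b) in literals]
    chosen = greedy_set_cover(cover_sets, universe)
    return [literals[c] for c in chosen]
```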

Criticism of the PAC model. "The worst-case emphasis makes it unusable." – It is useful for the analysis of computational complexity. – It gives methods to estimate the cardinality of the space of concepts (VC dimension); unfortunately not sufficiently practical. "The notions of target concepts and noise-free training are too restrictive." – True. The switch to concept approximation is only a weak remedy. – There are some extensions to label noise, and fewer to noise in the variables.

Summary PAC model –Confidence and accuracy –Sample size Finite (and infinite) concept class Occam Razor

References
L. G. Valiant. A theory of the learnable. Comm. ACM 27(11): 1134-1142, 1984. (Original work)
D. Haussler. Probably approximately correct learning. (Review)
M. Kearns. Efficient noise-tolerant learning from statistical queries. (Review of noise methods)
F. Denis et al. PAC learning with simple examples. (Simple examples)

Learning algorithms: OR function, parity function, OR of a few literals. Open problems: – OR in the non-feasible case – parity of a few literals.