
Probably Approximately Correct (PAC) Learning. Leslie G. Valiant, "A Theory of the Learnable," Comm. ACM 27(11):1134-1142, 1984.


1 Probably Approximately Correct Learning (PAC) Leslie G. Valiant. A Theory of the Learnable. Comm. ACM (1984) 1134-1142

2 Recall: Bayesian learning. Create a model based on some parameters and assume a prior distribution on those parameters. Learning problem: – adjust the model parameters so as to maximize the likelihood of the model given the data; – use Bayes' formula for that.

3 PAC Learning. Given a distribution D over the observables X and a family of functions (concepts) F. For each x ∈ X and f ∈ F, f(x) provides the label of x. Given a family of hypotheses H, seek a hypothesis h such that Error(h) = Pr_{x~D}[f(x) ≠ h(x)] is minimal.
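As a concrete illustration of this error definition, here is a minimal Monte Carlo sketch (not from the slides); the names estimate_error and sample_D are hypothetical, and D is assumed to be available only through sampling.

```python
import random

def estimate_error(h, f, sample_D, m=10000):
    """Monte Carlo estimate of Error(h) = Pr_{x~D}[f(x) != h(x)].

    h, f      -- functions mapping an example x to a {0, 1} label
    sample_D  -- draws one example x according to the distribution D
    m         -- number of i.i.d. samples used for the estimate
    """
    mistakes = 0
    for _ in range(m):
        x = sample_D()
        if h(x) != f(x):
            mistakes += 1
    return mistakes / m

# Toy usage: D is uniform on [0, 1], the target is a threshold at 0.5,
# and the hypothesis uses a slightly wrong threshold at 0.55.
if __name__ == "__main__":
    f = lambda x: 1 if x < 0.5 else 0
    h = lambda x: 1 if x < 0.55 else 0
    print(estimate_error(h, f, lambda: random.random()))  # roughly 0.05
```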

4 PAC: new concepts. A large family of distributions D, a large family of concepts F, and a family of hypotheses H. Main questions: – Is there a hypothesis h ∈ H that can be learned? – How fast can it be learned? – What error can be expected?

5 Estimation vs. approximation. Note: – the distribution D is fixed; – there is no noise in the system (for now); – F is a space of binary functions (concepts). This is thus an approximation problem, since the function value is given exactly for each x ∈ X. Estimation problem: f is not given exactly but must be estimated from noisy data.

6 Example (PAC). Concept: average body-size person. Inputs for each person: – height – weight. Sample: labeled examples of persons – label +: average body-size – label −: not average body-size. The inputs are two-dimensional.

7 Observable space X with concept f

8 Example (PAC). Assumption: the target concept is a rectangle. Goal: – find a rectangle h that "approximates" the target; – the hypothesis family H is the set of rectangles. Formally: with high probability, output a rectangle whose error is low.

9 Example (modeling). Assume a fixed distribution over persons. Goal: low error with respect to THIS distribution! What does the distribution look like? – Highly complex. – Each parameter is non-uniform. – The parameters are highly correlated.

10 Model-based approach. First try to model the distribution. Given a model of the distribution, find an optimal decision rule. This is Bayesian learning.

11 PAC approach. Assume that the distribution is fixed. Samples are drawn i.i.d.: – independent – identically distributed. Concentrate on the decision rule rather than on the distribution.

12 PAC Learning. Task: learn a rectangle from examples. Input: points x together with their classification f(x) (+ or −), where f classifies by a rectangle R. Goal: using the fewest examples, compute an h that is a good approximation of f.

13 PAC Learning: accuracy. Test the accuracy of a hypothesis using the distribution D of examples. The error region is h Δ f (the symmetric difference), and Pr[Error] = D(Error) = D(h Δ f). We would like Pr[Error] to be controllable. Given a parameter ε: find h such that Pr[Error] < ε.

14 PAC Learning: Hypothesis Which Rectangle should we choose? –Similar to parametric modeling?

15 Setting up the analysis. Choose the smallest consistent rectangle. Need to show: for any distribution D and any target rectangle f, given input parameters ε and δ, select m(ε,δ) examples and let h be the smallest consistent rectangle; then with probability at least 1−δ (over the sample), D(f Δ h) < ε.
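Below is a minimal Python sketch (not from the slides) of the "smallest consistent rectangle" learner; it assumes two-dimensional points with labels 1 (+) and 0 (−), and the function names are hypothetical.

```python
def smallest_consistent_rectangle(samples):
    """Return the tightest axis-aligned rectangle containing all positive points.

    samples -- iterable of ((x, y), label) pairs, label 1 for '+' and 0 for '-'
    Returns (x_min, x_max, y_min, y_max), or None if there are no positive points.
    """
    positives = [p for p, label in samples if label == 1]
    if not positives:
        return None
    xs, ys = zip(*positives)
    return (min(xs), max(xs), min(ys), max(ys))

def rect_predict(rect, point):
    """Classify a point as 1 iff it falls inside the learned rectangle."""
    if rect is None:
        return 0
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return int(x_min <= x <= x_max and y_min <= y <= y_max)
```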

16 More general case (not just rectangles). A distribution D (unknown). Target function: c_t from C, with c_t: X → {0,1}. Hypothesis: h from H, with h: X → {0,1}. Error probability: error(h) = Pr_D[h(x) ≠ c_t(x)]. Oracle: EX(c_t, D), which returns a labeled example (x, c_t(x)) with x drawn from D.

17 PAC Learning: definition. C and H are concept classes over X. C is PAC learnable by H if there exists an algorithm A such that: – for any distribution D over X and any c_t in C, – for every input ε and δ, – A, having access to EX(c_t, D), – outputs a hypothesis h in H – such that with probability at least 1−δ, error(h) < ε. (There are also sample- and time-complexity requirements.)

18 Finite concept class. Assume C = H and H is finite. h is ε-bad if error(h) > ε. Algorithm: – sample a set S of m(ε,δ) examples; – find an h in H that is consistent with S. The algorithm fails if the returned h is ε-bad.
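A minimal sketch of this generic learner for a finite class: draw a sample and return any consistent hypothesis. Hypotheses are assumed to be callables mapping an example to {0, 1}; the names are hypothetical.

```python
def consistent_learner(hypotheses, sample):
    """Return any hypothesis that agrees with every labeled example, else None.

    hypotheses -- iterable of callables h(x) -> {0, 1} (the finite class H)
    sample     -- list of (x, label) pairs drawn i.i.d. from EX(c_t, D)
    """
    for h in hypotheses:
        if all(h(x) == label for x, label in sample):
            return h
    return None  # no consistent hypothesis (cannot happen when c_t is in H)
```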

19 PAC learning: formalization (1). X is the set of all possible examples. D is the distribution from which the examples are drawn. H is the set of all possible hypotheses, with c ∈ H. m is the number of training examples. error(h) = Pr[h(x) ≠ c(x)], where x is drawn from X according to D. h is approximately correct if error(h) ≤ ε.

20 PAC learning: formalization (2). To show: after m examples, with high probability, all consistent hypotheses are approximately correct, i.e., all consistent hypotheses lie in an ε-ball around c. (Diagram: the hypothesis space H with the ε-ball around c; the bad region H_bad ⊆ H lies outside the ball.)

21 Complexity analysis (1). The probability that a hypothesis h_bad ∈ H_bad is consistent with the first m examples: error(h_bad) > ε by definition, so the probability that it agrees with a single example is at most (1−ε), and with m independent examples at most (1−ε)^m.

22 Complexity analysis (2). For H_bad to contain a consistent hypothesis, at least one hypothesis in it must be consistent. By the union bound, Pr[H_bad contains a consistent hypothesis] ≤ |H_bad|(1−ε)^m ≤ |H|(1−ε)^m.

23 Complexity analysis (3). To reduce the probability of failure below δ, require |H|(1−ε)^m ≤ δ. This holds when at least m ≥ (1/ε)(ln(1/δ) + ln|H|) examples are seen. This is the sample complexity.
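As a worked step (not on the original slide), the bound follows by rearranging |H|(1−ε)^m ≤ δ with the inequality 1−ε ≤ e^{−ε}:

```latex
% Solving |H|(1-eps)^m <= delta for m:
\[
\begin{aligned}
|H|(1-\epsilon)^m \;\le\; |H|\,e^{-\epsilon m} &\;\le\; \delta \\
\ln|H| - \epsilon m &\;\le\; \ln\delta \\
m &\;\ge\; \frac{1}{\epsilon}\Big(\ln\frac{1}{\delta} + \ln|H|\Big).
\end{aligned}
\]
```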

24 Complexity analysis (4). "At least m examples are necessary to build a consistent hypothesis h that errs with probability at most ε, with confidence 1−δ." If H is the class of all Boolean functions on n attributes, then |H| = 2^(2^n), so the sample complexity grows exponentially with the number of attributes n. Conclusion: learning an arbitrary Boolean function is, in the worst case, no better than table lookup!

25 PAC learning: observations. "A hypothesis h that is consistent with m examples has error at most ε with probability 1−δ." This is a worst-case analysis, and the result is independent of the distribution D! Growth rate of the sample size: – as ε → 0, m grows proportionally to 1/ε; – as δ → 0, m grows logarithmically in 1/δ; – as |H| → ∞, m grows logarithmically in |H|.

26 PAC: comments. We only assumed that the examples are i.i.d. We have two independent parameters: – accuracy ε – confidence δ. No assumption is made about the prior likelihood of a hypothesis. The hypothesis is tested on the same distribution from which the sample was drawn.

27 PAC: non-feasible case. What happens if c_t is not in H? We need to redefine the goal. Let h* in H be the hypothesis that minimizes the error, error(h*) = min over h in H of error(h). Goal: find h in H such that error(h) ≤ error(h*) + ε.

28 Analysis*. For each h in H, let obs-error(h) be the observed (average) error on the sample S. Bound the probability of a large deviation, Pr[|obs-error(h) − error(h)| ≥ ε/2]. Chernoff bound: Pr < exp(−(ε/2)²m). Union bound over the entire class H: Pr < |H|·exp(−(ε/2)²m). Requiring this to be at most δ gives the sample size m > (4/ε²) ln(|H|/δ).
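A short rearrangement (not on the original slide) showing how the sample size follows from requiring |H|·exp(−(ε/2)²m) ≤ δ:

```latex
% Requiring |H| exp(-(eps/2)^2 m) <= delta and solving for m:
\[
\begin{aligned}
\ln|H| - \frac{\epsilon^2}{4}\,m &\;\le\; \ln\delta \\
m &\;\ge\; \frac{4}{\epsilon^2}\,\ln\frac{|H|}{\delta}.
\end{aligned}
\]
```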

29 Correctness. Assume that for all h in H: |obs-error(h) − error(h)| < ε/2. In particular: – obs-error(h*) < error(h*) + ε/2, and – error(h) − ε/2 < obs-error(h). For the output h (the empirical minimizer): obs-error(h) ≤ obs-error(h*). Conclusion: error(h) < error(h*) + ε.

30 Sample size issue. Because the Chernoff bound is applied to the deviation Pr[|obs-error(h) − error(h)| ≥ ε/2] < exp(−(ε/2)²m), and over the entire H we get Pr < |H|·exp(−(ε/2)²m), it follows that the sample size is m > (4/ε²) ln(|H|/δ), not (1/ε) ln(|H|/δ) as in the consistent case before.

31 Example: learning an OR of literals. Inputs: x_1, …, x_n. Literals: each x_i and its negation. OR functions: disjunctions of literals. For each variable, the target disjunction may contain x_i, contain its negation, or omit the variable, so the number of disjunctions is 3^n.

32 ELIM: an algorithm for learning an OR. Keep a list of all 2n literals. For every example whose classification is 0, erase all the literals that evaluate to 1 on it. Example: the negative example c(00110) = 0 results in deleting x_3, x_4 and the negations of x_1, x_2, x_5. Correctness: – our hypothesis h is the OR of the remaining literals; – this set always includes the literals of the target OR; – so every time h predicts zero, we are correct. Sample size: m > (1/ε) ln(3^n/δ). (A runnable sketch follows.)
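A minimal Python sketch of ELIM as described above; the function names and the (index, polarity) representation of literals are my own choices, not from the slides.

```python
def elim(samples, n):
    """ELIM: learn an OR (disjunction) of literals from labeled boolean examples.

    samples -- list of (x, label) pairs, x a tuple of n bits, label in {0, 1}
    n       -- number of boolean attributes
    Returns the surviving literals as (index, polarity) pairs, where polarity
    True stands for x_i and False for its negation.
    """
    # Start with all 2n literals: x_i and its negation.
    literals = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for x, label in samples:
        if label == 0:
            # A literal that is 1 on a negative example cannot be in the target OR.
            literals -= {(i, pol) for (i, pol) in literals if bool(x[i]) == pol}
    return literals

def or_predict(literals, x):
    """The hypothesis: the OR of the surviving literals."""
    return int(any(bool(x[i]) == pol for i, pol in literals))

# Toy usage: target is x_1 OR (not x_3), with 0-based indices, on 5 attributes.
target = lambda x: int(x[1] == 1 or x[3] == 0)
data = [(x, target(x)) for x in [(0,0,1,1,0), (1,0,0,1,1), (0,1,1,1,0), (0,0,0,0,1)]]
h = elim(data, 5)
assert all(or_predict(h, x) == y for x, y in data)
```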

33 Learning parity. Functions: parities of subsets of the variables, e.g. x_1 ⊕ x_7 ⊕ x_9. Number of functions: 2^n. Algorithm: – sample a set of examples; – solve the resulting system of linear equations over GF(2). Sample size: m > (1/ε) ln(2^n/δ).
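A sketch of the "solve linear equations" step via Gaussian elimination over GF(2), which I assume is the intended method; the function name and representation are hypothetical.

```python
def learn_parity(samples, n):
    """Find a parity function consistent with the samples (Gaussian elimination mod 2).

    samples -- list of (x, label) pairs, x a tuple of n bits, label the parity of an
               unknown subset S of the bits
    Returns a 0/1 vector a with a[i] = 1 iff variable i is deduced to be in S,
    or None if the system is inconsistent.
    """
    # Each sample gives one equation: sum_i a_i * x_i = label (mod 2).
    rows = [list(x) + [label] for x, label in samples]
    pivots = []   # (row_index, column) of each pivot found
    r = 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] == 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate this column from every other row (row XOR).
        for i in range(len(rows)):
            if i != r and rows[i][col] == 1:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append((r, col))
        r += 1
    # Inconsistent if an all-zero equation has label 1.
    if any(all(v == 0 for v in row[:n]) and row[n] == 1 for row in rows):
        return None
    a = [0] * n
    for row_idx, col in pivots:   # free variables default to 0
        a[col] = rows[row_idx][n]
    return a
```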

34 Infinite concept class. X = [0,1] and H = {c_θ | θ ∈ [0,1]}, where c_θ(x) = 1 iff x < θ. Assume C = H. Which c_θ should we choose within the consistent interval [min, max]? (A small sketch follows.)
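To make the choice concrete, a tiny sketch (not from the slides) that returns the midpoint of the consistent interval, assuming noise-free samples in [0,1] labeled by some c_θ(x) = 1 iff x < θ.

```python
def learn_threshold(samples):
    """Pick a threshold consistent with samples labeled by c_theta(x) = 1 iff x < theta.

    samples -- list of (x, label) pairs with x in [0, 1] and label in {0, 1}
    Any value between the largest positive and the smallest negative example is
    consistent; here we return the midpoint of that interval.
    """
    largest_pos = max((x for x, y in samples if y == 1), default=0.0)
    smallest_neg = min((x for x, y in samples if y == 0), default=1.0)
    return (largest_pos + smallest_neg) / 2.0
```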

35 Proof I. Define max = min{x | c(x) = 1} and min = max{x | c(x) = 0}. Show that Pr[D([min, max]) > ε] < δ. Proof, by contradiction: – suppose the probability that x lies in [min, max] is at least ε; – then the probability that we never sample from [min, max] is at most (1−ε)^m; – making this below δ needs m > (1/ε) ln(1/δ). But there is something wrong with this argument.

36 Proof II (correct): Let max′ be such that D([max′, θ]) = ε/2 and let min′ be such that D([θ, min′]) = ε/2. Goal: show that with high probability – some positive example x⁺ lies in [max′, θ] and – some negative example x⁻ lies in [θ, min′]. In such a case, any threshold value between x⁺ and x⁻ is good. Compute the sample size!

37 Proof II (correct), continued: Pr[none of x_1, x_2, …, x_m lies in [θ, min′]] = (1 − ε/2)^m < exp(−mε/2). Similarly for the other side. We require 2·exp(−mε/2) < δ, thus m > (2/ε) ln(2/δ).

38 Comments. The hypothesis space was very simple: H = {c_θ | θ ∈ [0,1]}. There was no noise in the data or in the labels, so learning was trivial in some sense (an analytic solution exists).

39 Non-feasible case: label noise. Suppose the sampled labels are noisy (figure omitted). Algorithm: find the function h with the lowest observed error on the sample!

40 Analysis. Define the points z_i to form an ε/4-net (w.r.t. D). For the optimal h* and for our h there are – a z_j with |error(h[z_j]) − error(h*)| < ε/4 and – a z_k with |error(h[z_k]) − error(h)| < ε/4. Show that with high probability, for every i, |obs-error(h[z_i]) − error(h[z_i])| < ε/4.

41 Exercise (submission Mar 29, 2004). 1. Assume there is Gaussian (0, σ) noise on x_i. Apply the same analysis to compute the required sample size for PAC learning. Note: class labels are determined by the non-noisy observations.

42 General ε-net approach. Given a class H, define a class G such that for every h in H there exists a g in G with D(g Δ h) < ε. Algorithm: find the best h in H. Then compute the confidence and the sample size.

43 Occam's Razor. W. of Occam (c. 1320): "Entities should not be multiplied unnecessarily." A. Einstein: "Simplify a problem as much as possible, but no simpler." Is there an information-theoretic ground for this?

44 Occam's Razor. Goal: find a short consistent hypothesis. Definition: an (α,β)-Occam algorithm, with α > 0 and β < 1, – takes as input a sample S of size m, – outputs a hypothesis h with h(x) = b for every (x,b) in S (consistency), and – guarantees size(h) ≤ size(c_t)^α · m^β. It should also be efficient.

45 Occam algorithm and compression. (Diagram: party A holds the labeled sample S = {(x_i, b_i)}; party B knows only the points x_1, …, x_m; A must communicate the labels to B.)

46 Compression. Option 1: A sends B the values b_1, …, b_m, i.e., m bits of information. Option 2: A sends B the hypothesis h; by the Occam property, for large enough m, size(h) < m. Option 3 (MDL): A sends B a hypothesis h plus "corrections"; the complexity is size(h) + size(errors).

47 Occam Razor theorem. Let A be an (α,β)-Occam algorithm for C using H, D a distribution over the inputs X, and c_t in C the target function, with n = size(c_t). Then, for a sufficiently large sample size, with probability 1−δ the output A(S) = h has error(h) < ε.

48 Occam Razor theorem (proof idea). Use the bound for a finite hypothesis class. The effective hypothesis class has size at most 2^size(h), with size(h) ≤ n^α m^β. Plug this into the sample-size bound and solve for m (see the sketch below). Later, the VC dimension will replace the 2^size(h) term.
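A hedged sketch of the resulting sample-size bound (constants omitted; this is the standard form of the argument, not taken verbatim from the slide):

```latex
% Plugging ln|H| <= size(h) ln 2 <= n^alpha m^beta ln 2 into
% m >= (1/eps)(ln(1/delta) + ln|H|) gives the implicit condition
\[
m \;\ge\; \frac{1}{\epsilon}\Big(\ln\frac{1}{\delta} + n^{\alpha} m^{\beta} \ln 2\Big),
\]
% which is satisfied, up to constant factors, whenever
\[
m \;=\; O\!\left(\frac{1}{\epsilon}\,\ln\frac{1}{\delta}
      \;+\; \Big(\frac{n^{\alpha}}{\epsilon}\Big)^{\tfrac{1}{1-\beta}}\right).
\]
```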

49 Exercise (submission Mar 29, 2004). 2. For an (α,β)-Occam algorithm, given noisy data with noise ~ (0, σ²), find the limitations on m. Hint: use an ε-net and the Chernoff bound.

50 Learning an OR with few attributes. Target function: an OR of k literals. Goal: learn in time – polynomial in k and log n, – with ε and δ constant. ELIM makes "slow" progress: – it disqualifies one literal per round, – and may remain with O(n) literals.

51 Set Cover: definition. Input: sets S_1, …, S_t with S_i ⊆ U. Output: S_{i_1}, …, S_{i_k} with ∪_j S_{i_j} = U. Question: are there k sets that cover U? The problem is NP-complete.

52 Set Cover: greedy algorithm. j = 0; U_j = U; C = ∅. While U_j ≠ ∅: – let S_i be the set maximizing |S_i ∩ U_j|; – add S_i to C; – let U_{j+1} = U_j − S_i; – j = j + 1. (A runnable sketch follows.)
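A minimal Python sketch of the greedy algorithm above; the function name is hypothetical.

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: repeatedly pick the set covering the most uncovered elements.

    universe -- set of elements U
    sets     -- list of sets S_1, ..., S_t (subsets of U)
    Returns the indices of the chosen sets; at most k*ln|U| of them whenever an
    optimal cover of size k exists.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set with the largest intersection with the uncovered elements.
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Toy usage
cover = greedy_set_cover({1, 2, 3, 4, 5}, [{1, 2, 3}, {2, 4}, {4, 5}, {3, 5}])
```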

53 Set Cover: greedy analysis. At termination, C is a cover. Assume there is a cover C′ of size k. C′ is a cover of every U_j, so some S in C′ covers at least |U_j|/k elements of U_j. Hence |U_{j+1}| ≤ |U_j| − |U_j|/k. Solving the recursion (see below) shows that the number of chosen sets is j < k ln|U|. (Exercise: solve the recursion.)
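Solving the recursion (a worked step, not on the original slide):

```latex
% Unrolling |U_{j+1}| <= (1 - 1/k)|U_j| and using 1 - 1/k <= e^{-1/k}:
\[
|U_j| \;\le\; \Big(1-\frac{1}{k}\Big)^{j}\,|U|
       \;\le\; e^{-j/k}\,|U| \;<\; 1
       \quad\text{once } j > k\ln|U|,
\]
% so the greedy algorithm stops after fewer than k ln|U| + 1 iterations.
```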

54 Learning decision lists. Decision lists of arbitrary size can represent any Boolean function. Lists whose tests are conjunctions of at most k < n literals define the k-DL Boolean language; for n attributes, the language is k-DL(n). With the language of tests Conj(n,k), each test can be given one of three roles (Yes, No, or absent), and the chosen tests can appear in any order, so |k-DL(n)| ≤ 3^|Conj(n,k)| · |Conj(n,k)|!, where |Conj(n,k)| = Σ_{i=0}^{k} C(2n, i) = O(n^k).

55 Building an Occam algorithm. Given a sample S of size m: – run ELIM on S; – let LIT be the set of surviving literals; – there exist k literals in LIT that classify all of S correctly. Negative examples: any subset of LIT (taken as an OR) classifies them correctly.

56 Building an Occam algorithm (continued). Positive examples: search for a small subset of LIT that classifies S⁺ correctly. For a literal z, build T_z = {x ∈ S⁺ | z evaluates to 1 on x}. There exist k such sets that cover S⁺, so greedy set cover finds at most k ln m sets that cover S⁺. Output h = the OR of the corresponding (at most k ln m) literals. Then size(h) ≤ k ln m · log(2n), and the resulting sample size is m = O(k log n · log(k log n)). (A sketch of the cover step follows.)
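A sketch (not from the slides) tying the pieces together; it reuses the hypothetical elim and greedy_set_cover helpers from the earlier sketches.

```python
def occam_or_learner(samples, n):
    """Occam-style learner for an OR of k literals: ELIM followed by greedy set cover.

    samples -- list of (x, label) pairs, x a tuple of n bits, label in {0, 1}
    Returns a small set of (index, polarity) literals whose OR is consistent with S.
    """
    lit = sorted(elim(samples, n))                  # literals consistent with the negatives
    positives = [x for x, label in samples if label == 1]
    # T_z = set of positive examples (by index) on which literal z evaluates to 1.
    t_sets = [{j for j, x in enumerate(positives) if bool(x[i]) == pol}
              for (i, pol) in lit]
    chosen = greedy_set_cover(set(range(len(positives))), t_sets)
    return {lit[c] for c in chosen}
```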

57 Criticism of the PAC model. "The worst-case emphasis makes it unusable": – it is still useful for the analysis of computational complexity; – it gives methods to estimate the cardinality of the space of concepts (the VC dimension); – but it is unfortunately not sufficiently practical. "The notions of target concepts and noise-free training are too restrictive": – true; the switch to concept approximation is a weak remedy; – there are some extensions to label noise, and fewer to attribute (variable) noise.

58 Summary. The PAC model: – confidence and accuracy; – sample size. Finite (and infinite) concept classes. Occam's Razor.

59 References. – L. G. Valiant. A theory of the learnable. Comm. ACM 27(11):1134-1142, 1984. (Original work.) – D. Haussler. Probably approximately correct learning. (Review.) – M. Kearns. Efficient noise-tolerant learning from statistical queries. (Review of noise methods.) – F. Denis et al. PAC learning with simple examples. (Simple examples.)

60 Learning algorithms covered: the OR function, the parity function, and an OR of a few literals. Open problems: – OR in the non-feasible case; – parity of a few literals.

