Probably Approximately Correct Model (PAC)

Example (PAC)
- Concept: "average body-size" person.
- Inputs: for each person, height and weight (two-dimensional inputs).
- Sample: labeled examples of persons.
  - label +: average body-size
  - label -: not average body-size

Example (PAC)
- Assumption: the target concept is a rectangle.
- Goal: find a rectangle that "approximates" the target.
- Formally: with high probability, output a rectangle whose error is low.

Example (Modeling)
- Assume: a fixed distribution over persons.
- Goal: low error with respect to THIS distribution!
- What does the distribution look like?
  - Highly complex.
  - Each parameter is not uniform.
  - The parameters are highly correlated.

Model-Based Approach
- First try to model the distribution.
- Given a model of the distribution, find an optimal decision rule.
- This is the Bayesian learning approach.

PAC Approach
- Assume that the distribution is fixed.
- Samples are drawn i.i.d. (independent and identically distributed).
- Concentrate on the decision rule rather than on the distribution.

PAC Learning
- Task: learn a rectangle from examples.
- Input: points (x,y) with classification + or -, classified by a target rectangle R.
- Goal: from as few examples as possible, compute R' such that R' is a good approximation of R.

PAC Learning: Accuracy
- Test the accuracy of a hypothesis using the distribution D of the examples.
- Error region: R Δ R' (the symmetric difference).
- Pr[Error] = D(Error) = D(R Δ R').
- We would like Pr[Error] to be controllable: given a parameter ε, find R' such that Pr[Error] < ε.

PAC Learning: Hypothesis
- Which rectangle should we choose?

Setting Up the Analysis
- Choose the smallest consistent rectangle.
- Need to show: for any distribution D and target rectangle R, given input parameters ε and δ:
  - select m(ε,δ) examples,
  - let R' be the smallest consistent rectangle,
  - then with probability 1-δ: D(R Δ R') < ε.
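
A minimal sketch, not from the slides, of the learner analyzed here: take the smallest axis-aligned rectangle containing all positive examples. The point/label encoding and the function names are assumptions for illustration.

```python
def smallest_consistent_rectangle(sample):
    """Smallest axis-aligned rectangle containing all positive points.

    `sample` is a list of ((x, y), label) pairs with label in {0, 1}.
    Returns (x_min, x_max, y_min, y_max), or None if there are no positives.
    """
    positives = [point for point, label in sample if label == 1]
    if not positives:
        return None  # predict negative everywhere
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def predict(rect, point):
    """Classify a point with the learned rectangle (None means always negative)."""
    if rect is None:
        return 0
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return int(x_min <= x <= x_max and y_min <= y <= y_max)
```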

Analysis
- Note that R' ⊆ R, and therefore R Δ R' = R - R'.
- R - R' is covered by four strips along the sides of R; consider the top one, T'_u, and compare it with a fixed strip T_u at the top of R (defined on the next slide).
  [Figure: target rectangle R, hypothesis R' ⊆ R, the uncovered top strip T'_u and the fixed strip T_u.]

Analysis (cont.)
- By definition, T_u is the strip at the top of R with D(T_u) = ε/4 (and similarly for the other three sides).
- Bad event: T'_u ⊇ T_u, which happens exactly when no example falls in T_u; Pr_D[(x,y) ∈ T_u] = ε/4.
- Probability that none of the m examples falls in T_u: (1 - ε/4)^m < e^(-εm/4).
- Failure probability (union bound over the four strips): 4·e^(-εm/4) < δ.
- Sample bound: m > (4/ε) ln(4/δ).
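
For completeness, the last step solved for m (same quantities as on the slide):

```latex
4e^{-\varepsilon m/4} < \delta
\;\Longleftrightarrow\;
e^{\varepsilon m/4} > \frac{4}{\delta}
\;\Longleftrightarrow\;
m > \frac{4}{\varepsilon}\ln\frac{4}{\delta}.
```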

PAC: Comments
- We only assumed that the examples are i.i.d.
- We have two independent parameters: accuracy ε and confidence δ.
- No assumption is made about the likelihood of rectangles.
- The hypothesis is tested on the same distribution as the sample.

PAC Model: Setting
- A distribution D over X (unknown).
- Target function: c_t from C, c_t : X → {0,1}.
- Hypothesis: h from H, h : X → {0,1}.
- Error probability: error(h) = Pr_D[h(x) ≠ c_t(x)].
- Oracle: EX(c_t, D) returns a labeled example (x, c_t(x)) with x drawn from D.

PAC Learning: Definition
- C and H are concept classes over X.
- C is PAC learnable by H if there exists an algorithm A such that:
  - for any distribution D over X and any c_t in C,
  - for every input ε and δ,
  - A, with access to EX(c_t, D), outputs a hypothesis h in H such that with probability 1-δ: error(h) < ε.
- Efficiency: the running time and sample size should be polynomial in 1/ε, 1/δ and the problem size.

Finite Concept Class
- Assume C = H and H is finite.
- h is ε-bad if error(h) > ε.
- Algorithm: sample a set S of m(ε,δ) examples and find an h in H which is consistent with S.
- The algorithm fails if the returned h is ε-bad.

Analysis
- Assume hypothesis g is ε-bad. The probability that g is consistent with m examples:
  Pr[g consistent] ≤ (1-ε)^m < e^(-εm).
- The probability that there exists some g that is ε-bad and consistent (union bound):
  |H| · Pr[g consistent and ε-bad] ≤ |H| e^(-εm).
- Requiring this to be at most δ gives the sample size: m > (1/ε) ln(|H|/δ).
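
A small illustrative helper, not from the slides, that evaluates this bound; the function name and the plugged-in numbers are arbitrary choices.

```python
import math

def consistent_learner_bound(eps, delta, hypothesis_count):
    """Sample size m > (1/eps) * ln(|H| / delta) for a consistent learner (realizable case)."""
    return math.ceil((1.0 / eps) * math.log(hypothesis_count / delta))

# Classes that appear later in the lecture:
n = 20
print(consistent_learner_bound(0.1, 0.05, 3 ** n))  # OR of literals over n variables, |H| = 3^n
print(consistent_learner_bound(0.1, 0.05, 2 ** n))  # parity functions over n variables, |H| = 2^n
```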

PAC: Non-Feasible Case
- What happens if c_t is not in H?
- We need to redefine the goal.
- Let h* in H minimize the error: β = error(h*).
- Goal: find h in H such that error(h) ≤ error(h*) + ε = β + ε.

Analysis
- For each h in H, let obs-error(h) be the error of h on the sample S.
- Compute the probability that |obs-error(h) - error(h)| ≥ ε/2.
- Chernoff bound: at most exp(-(ε/2)² m).
- Union bound over all of H: |H| exp(-(ε/2)² m).
- Sample size: m > (4/ε²) ln(|H|/δ).
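
Solving the union bound for m, with the same constants the slide uses:

```latex
|H|\,e^{-(\varepsilon/2)^2 m} < \delta
\;\Longleftrightarrow\;
m > \frac{4}{\varepsilon^{2}}\ln\frac{|H|}{\delta}.
```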

Correctness
- Assume that for all h in H: |obs-error(h) - error(h)| < ε/2.
- In particular: obs-error(h*) < error(h*) + ε/2 and error(h) - ε/2 < obs-error(h).
- For the output h (the empirical minimizer): obs-error(h) ≤ obs-error(h*).
- Chaining the inequalities: error(h) < error(h*) + ε.

Example: Learning OR of Literals
- Inputs: x_1, ..., x_n.
- Literals: each x_i and its negation.
- Hypotheses: OR functions over subsets of the literals.
- Number of such functions: 3^n (each variable appears positively, negatively, or not at all).

ELIM: Algorithm for Learning OR
- Keep a list of all 2n literals.
- For every example whose classification is 0: erase all the literals that are 1 in it.
- Correctness:
  - Our hypothesis h is the OR of the remaining literals.
  - The remaining set of literals includes all the literals of the target OR.
  - Therefore, every time h predicts zero we are correct.
- Sample size: m > (1/ε) ln(3^n/δ).
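
A minimal sketch, not from the slides, of ELIM; the encoding of literals as (index, value) pairs is an assumption for illustration.

```python
def elim(sample, n):
    """ELIM: learn an OR of literals from labeled boolean examples.

    `sample` is a list of (x, label) pairs, where x is a tuple of n bits.
    Literal (i, 1) stands for x_i and literal (i, 0) stands for NOT x_i.
    """
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in sample:
        if label == 0:
            # a literal satisfied by a negative example cannot be in the target OR
            literals = {(i, v) for (i, v) in literals if x[i] != v}
    return literals


def predict_or(literals, x):
    """Evaluate the OR of the surviving literals on input x."""
    return int(any(x[i] == v for (i, v) in literals))
```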

Learning Parity
- Functions: parities of subsets of the variables, e.g., x_1 ⊕ x_7 ⊕ x_9.
- Number of functions: 2^n.
- Algorithm: sample a set of examples and solve the resulting system of linear equations over GF(2).
- Sample size: m > (1/ε) ln(2^n/δ).
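
A sketch, not from the slides, of the "solve linear equations" step: Gaussian elimination over GF(2) to find some parity consistent with the sample. The example encoding is assumed.

```python
def learn_parity(sample, n):
    """Find a parity (subset of variable indices) consistent with the sample, if one exists.

    `sample` is a list of (x, label) pairs, x a tuple of n bits, label in {0, 1}.
    Returns a set of indices whose XOR matches every label, or None if inconsistent.
    """
    rows = [list(x) + [label] for x, label in sample]
    pivot_cols = []
    r = 0
    for col in range(n):
        # find a row at or below r with a 1 in this column
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] == 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] == 1:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(col)
        r += 1
    # inconsistency: an all-zero coefficient row with label 1
    if any(all(v == 0 for v in row[:n]) and row[n] == 1 for row in rows):
        return None
    # set free variables to 0; each pivot variable takes the reduced label
    solution = [0] * n
    for i, col in enumerate(pivot_cols):
        solution[col] = rows[i][n]
    return {i for i, bit in enumerate(solution) if bit == 1}
```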

Infinite Concept Class
- X = [0,1] and H = {c_θ | θ in [0,1]}, where c_θ(x) = 0 iff x < θ.
- Assume C = H.
- The thresholds consistent with the sample form an interval [min, max]: which c_θ should we choose?
  [Figure: the interval [min, max] of consistent thresholds on the line.]

Proof I
- Claim to show: Pr[ D([min,max]) > ε ] < δ.
- Proof attempt (by contradiction): suppose the probability that x falls in [min,max] is at least ε.
- Then the probability that we do not sample from [min,max] is (1-ε)^m.
- This would need only m > (1/ε) ln(1/δ).
- What's WRONG?!

Proof II (correct)
- Let max' be such that D([θ, max']) = ε/2, and let min' be such that D([min', θ]) = ε/2.
- Goal: show that with high probability the extreme sample points x+ and x- fall inside these two ε/2-strips around θ.
- In such a case, any threshold value in [x-, x+] is good.
- Compute the sample size!
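
One way to finish the "compute the sample size" exercise (my reconstruction, not stated on the slide): each ε/2-strip is missed by all m examples with probability at most (1-ε/2)^m, and a union bound over the two strips gives

```latex
2\left(1-\tfrac{\varepsilon}{2}\right)^{m} \le 2e^{-\varepsilon m/2} < \delta
\;\Longleftrightarrow\;
m > \frac{2}{\varepsilon}\ln\frac{2}{\delta}.
```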

Non-Feasible Case
- Suppose we sample a set of labeled examples with which no threshold is fully consistent.
- Algorithm: find the threshold function h with the lowest observed error on the sample!

Analysis
- Define thresholds z_1, z_2, ... that form an ε/4-net with respect to D.
- For the optimal h* and for our output h there are net points such that:
  - some z_j with |error(h[z_j]) - error(h*)| < ε/4,
  - some z_k with |error(h[z_k]) - error(h)| < ε/4.
- Show that with high probability, for every net point z_i: |obs-error(h[z_i]) - error(h[z_i])| < ε/4.
- Combining these inequalities completes the proof; then compute the sample size.

General ε-Net Approach
- Given a class H, define a (finite) class G such that for every h in H there exists a g in G with D(g Δ h) < ε/4.
- Algorithm: find the best h in H (lowest observed error).
- Compute the confidence and sample size over the finite class G.

Occam Razor
- Goal: find the shortest hypothesis consistent with the sample.
- Definition: an (a,b)-Occam algorithm, with a > 0 and b < 1, is an algorithm with:
  - Input: a sample S of size m.
  - Output: a hypothesis h such that h(x) = b for every (x,b) in S (consistency), and size(h) < size(c_t)^a · m^b.
- The algorithm should also be efficient.

Occam Algorithm and Compression
[Figure: the labeled sample S = {(x_i, b_i)}; the points x_1, ..., x_m are known to both parties A and B in the compression argument on the next slide.]

Compression
- Option 1: A sends B the values b_1, ..., b_m directly: m bits of information.
- Option 2: A sends B the hypothesis h. Occam: for large enough m, size(h) < m, so this compresses the labels.
- Option 3 (MDL): A sends B a hypothesis h plus "corrections" for the examples it gets wrong; complexity: size(h) + size(errors).

Occam Razor Theorem
- Let A be an (a,b)-Occam algorithm for C using H.
- Let D be a distribution over the inputs X, and c_t in C the target function with n = size(c_t).
- Theorem: for a sufficiently large sample size (next slide), with probability 1-δ the output A(S) = h has error(h) < ε.

Occam Razor Theorem (proof idea)
- Use the bound for a finite hypothesis class.
- The effective hypothesis class has size at most 2^size(h), with size(h) < n^a m^b.
- Plug this into the finite-class sample bound and solve for m (see below).
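
The sample-size expression is left blank on the slide; a hedged reconstruction, obtained by plugging |H_eff| ≤ 2^(n^a m^b) into m > (1/ε) ln(|H|/δ) and solving for m, is the standard Occam bound:

```latex
m \;\ge\; \frac{1}{\varepsilon}\left(\ln\frac{1}{\delta} + n^{a} m^{b}\ln 2\right),
\qquad\text{which holds for }\;
m \;=\; O\!\left(\frac{1}{\varepsilon}\ln\frac{1}{\delta} + \left(\frac{n^{a}}{\varepsilon}\right)^{\frac{1}{1-b}}\right).
```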

Learning OR with Few Attributes
- Target function: an OR of k literals.
- Goal: learn in time polynomial in k and log n (with ε and δ constant).
- ELIM makes "slow" progress: it may disqualify only one literal per round and can remain with O(n) literals.

Set Cover: Definition
- Input: sets S_1, ..., S_t with S_i ⊆ U.
- Output: S_{i_1}, ..., S_{i_k} with ∪_j S_{i_j} = U.
- Decision question: are there k sets that cover U? (NP-complete.)

Set Cover: Greedy Algorithm
- j = 0; U_j = U; C = ∅.
- While U_j ≠ ∅:
  - Let S_i be the set maximizing |S_i ∩ U_j|.
  - Add S_i to C.
  - Let U_{j+1} = U_j - S_i; j = j + 1.
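
A minimal sketch, not from the slides, of the greedy algorithm; the set representation and function name are assumptions.

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: repeatedly pick the set covering the most uncovered elements.

    `universe` is a set of elements; `sets` is a list of sets.
    Returns the list of indices of the chosen sets.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the set with the largest intersection with the uncovered elements
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen
```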

Set Cover: Greedy Analysis
- At termination, C is a cover.
- Assume there is a cover C' of size k. C' covers every U_j, so some S in C' covers at least |U_j|/k elements of U_j.
- Hence |U_{j+1}| ≤ |U_j| - |U_j|/k = (1 - 1/k)|U_j|.
- Solving the recursion: the number of sets chosen is j < k ln |U|.
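
Solving the recursion explicitly (a small added step):

```latex
|U_j| \le \left(1-\tfrac{1}{k}\right)^{j}|U| \le e^{-j/k}\,|U| < 1
\quad\text{once}\quad j > k\ln|U|,
```

and since |U_j| is an integer it must then be 0, so the greedy algorithm stops.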

Building an Occam Algorithm
- Given a sample S of size m, run ELIM on S and let LIT be the set of surviving literals.
- There exist k literals in LIT that classify all of S correctly (the target's literals survive ELIM).
- Negative examples: any subset of LIT classifies them correctly.

Building an Occam Algorithm (cont.)
- Positive examples: search for a small subset of LIT which classifies S+ correctly.
- For a literal z in LIT, build T_z = {x in S+ | z satisfies x}.
- There are k sets T_z that cover S+, so greedy set cover finds at most k ln m sets that cover S+.
- Output h = the OR of the chosen (at most k ln m) literals.
- size(h) < k ln m · log 2n, giving sample size m = O(k log n log(k log n)).
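
Putting the pieces together, a sketch that reuses the hypothetical elim and greedy_set_cover helpers from the earlier sketches:

```python
def occam_or_learner(sample, n):
    """Sketch: learn an OR of few literals via ELIM followed by greedy set cover.

    Reuses elim() and greedy_set_cover() defined in the sketches above.
    """
    surviving = sorted(elim(sample, n))            # literals consistent with all negatives
    positives = [x for x, label in sample if label == 1]
    universe = set(range(len(positives)))
    # T_z = indices of positive examples satisfied by literal z = (i, v)
    cover_sets = [{j for j, x in enumerate(positives) if x[i] == v}
                  for (i, v) in surviving]
    chosen = greedy_set_cover(universe, cover_sets)
    return {surviving[c] for c in chosen}          # the OR of the chosen literals
```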

Summary
- The PAC model: confidence and accuracy, sample size.
- Finite (and infinite) concept classes.
- Occam Razor.

Learning Algorithms and Open Problems
- Learning algorithms: the OR function, the parity function, OR of a few literals.
- Open problems: OR in the non-feasible case, parity of a few literals.