Slide 1: Computational Learning Theory (CS 8751 ML & KDD)
Notions of interest:
– Efficiency, accuracy, complexity
– Probably Approximately Correct (PAC) learning
– Agnostic learning
– VC dimension and shattering
– Mistake bounds

Slide 2: What general laws constrain inductive learning?
Some potential areas of interest:
– Probability of successful learning
– Number of training examples
– Complexity of the hypothesis space
– Accuracy to which the target concept is approximated
– Efficiency of the learning process
– Manner in which training examples are presented

Slide 3: The Concept Learning Task
Given:
– Instance space X (e.g., possible faces described by attributes Hair, Nose, Eyes, etc.)
– An unknown target function c (e.g., Smiling : X → {yes, no})
– A hypothesis space H = { h : X → {yes, no} }
– An unknown, likely unobservable probability distribution D over the instance space X
Determine:
– A hypothesis h in H such that h(x) = c(x) for all x in D?
– A hypothesis h in H such that h(x) = c(x) for all x in X?

Slide 4: Variations on the Task – Data Sample
How many training examples are sufficient to learn the target concept?
1. A random process (e.g., nature) produces instances
   – Instances x are generated randomly; the teacher provides c(x)
2. A teacher (who knows c) provides training examples
   – The teacher provides a sequence of examples of the form <x, c(x)>
3. The learner proposes instances as queries to the teacher
   – The learner proposes an instance x; the teacher provides c(x)

Slide 5: True Error of a Hypothesis
The true error of a hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
[Figure: instance space X with the regions labeled positive by c and by h; the true error corresponds to the region where they disagree.]
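In symbols, restating the prose definition above in the notation used on the later slides:
error_D(h) ≡ Pr_x~D[ c(x) ≠ h(x) ], the probability over x drawn from D that c and h disagree.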

Slide 6: Notions of Error
Training error of hypothesis h with respect to target concept c:
– How often h(x) ≠ c(x) over the training instances
True error of hypothesis h with respect to c:
– How often h(x) ≠ c(x) over future instances drawn at random
Our concern:
– Can we bound the true error of h given the training error of h?
– Start by assuming the training error of h is 0 (i.e., h ∈ VS_H,D)

Slide 7: Exhausting the Version Space
Definition: the version space VS_H,D is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_H,D has error less than ε with respect to c and D.
[Figure: hypotheses in H annotated with their training error error_S and true error error_D; the version space VS_H,D contains those consistent with the training data (error_S = 0). Values shown: (error_S = .1, error_D = .3), (error_S = .2, error_D = .1), (error_S = .0, error_D = .1), (error_S = .0, error_D = .2), (error_S = .4, error_D = .3), (error_S = .3, error_D = .2).]

Slide 8: How many examples to ε-exhaust the VS?
Theorem: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than |H|e^(-εm).
This bounds the probability that any consistent learner will output a hypothesis h with error_D(h) ≥ ε.
If we want this probability to be below δ:
|H|e^(-εm) ≤ δ
then
m ≥ (1/ε)(ln|H| + ln(1/δ))
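As a quick sanity check on this bound, a small Python helper (not from the slides; the function name is mine) that computes the smallest m satisfying the inequality:

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Smallest m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta)):
    enough examples so that, with probability at least 1 - delta, the
    version space is epsilon-exhausted (finite H, consistent learner)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)
```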

Slide 9: Learning conjunctions of boolean literals
How many examples are sufficient to assure, with probability at least (1 – δ), that every h in VS_H,D satisfies error_D(h) ≤ ε?
Use our theorem: m ≥ (1/ε)(ln|H| + ln(1/δ))
Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n, and
m ≥ (1/ε)(ln 3^n + ln(1/δ))
or
m ≥ (1/ε)(n ln 3 + ln(1/δ))

Slide 10: For the concept Smiling Face
Concept features:
– Eyes {round, square} → RndEyes, ¬RndEyes
– Nose {triangle, square} → TriNose, ¬TriNose
– Head {round, square} → RndHead, ¬RndHead
– FaceColor {yellow, green, purple} → YelFace, ¬YelFace, GrnFace, ¬GrnFace, PurFace, ¬PurFace
– Hair {yes, no} → Hair, ¬Hair
Size of H: |H| = 3^7 = 2187
If we want to assure that, with probability 95%, the VS contains only hypotheses with error_D(h) ≤ .1, then it is sufficient to have m examples, where
m ≥ (1/.1)(ln(2187) + ln(1/.05))
m ≥ 10(ln(2187) + ln(20))
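Evaluating the slide's expression numerically (the final number is my computation, equivalent to the helper defined after Slide 8):

```python
import math

# |H| = 2187, epsilon = 0.1, delta = 0.05
m = math.ceil((math.log(2187) + math.log(1 / 0.05)) / 0.1)
print(m)   # -> 107: about 107 training examples suffice
```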

Slide 11: PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.
Definition: C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε such that 0 < ε < ½, and all δ such that 0 < δ < ½, learner L will, with probability at least (1 – δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).

Slide 12: Agnostic Learning
So far, we have assumed c ∈ H.
Agnostic learning setting: don't assume c ∈ H.
What do we want then?
– The hypothesis h that makes the fewest errors on the training data
What is the sample complexity in this case?
m ≥ (1/(2ε^2))(ln|H| + ln(1/δ))
Derived from Hoeffding bounds (here error_D is the error on the training sample D):
Pr[error_true(h) > error_D(h) + ε] ≤ e^(-2mε^2)
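The same kind of helper works here. This sketch (names are mine) applies the union bound over H to the Hoeffding inequality, |H|e^(-2mε^2) ≤ δ, and solves for m:

```python
import math

def agnostic_sample_bound(h_size, epsilon, delta):
    """Smallest m with m >= (1/(2*epsilon**2)) * (ln|H| + ln(1/delta)), so that
    with probability at least 1 - delta every h in H has
    true error <= training error + epsilon (union bound + Hoeffding)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2))

print(agnostic_sample_bound(2187, 0.1, 0.05))   # -> 535 for the Smiling Face space
```

Note how dropping the assumption c ∈ H raises the sample complexity from roughly 1/ε to roughly 1/ε² in its dependence on the accuracy parameter.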

Slide 13: But what if the hypothesis space is not finite?
What if |H| cannot be determined?
– It is still possible to derive bounds based not on counting hypotheses, but on how many instances can be completely discriminated by H
– Use the notion of shattering a set of instances to measure the complexity of a hypothesis space
– The VC dimension measures this notion and can be used as a stand-in for |H|

Slide 14: Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.
Definition: a set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with that dichotomy.
[Figure: instance space X with an example of 3 instances shattered by H.]
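To make the definition concrete, here is a small Python sketch (the interval hypothesis class, the endpoint grid, and all names are illustrative choices of mine, not from the slides) that brute-force checks whether a point set on the real line is shattered:

```python
from itertools import product

def interval_hypothesis(a, b):
    """Hypothesis: label x positive iff a <= x <= b."""
    return lambda x: a <= x <= b

def shatters(points, hypotheses):
    """True iff for every dichotomy of `points` (every choice of which points
    are positive) some hypothesis realizes exactly that labeling."""
    for target in product([False, True], repeat=len(points)):
        if not any(tuple(h(x) for x in points) == target for h in hypotheses):
            return False
    return True

# Candidate hypotheses: all intervals with endpoints on a small grid.
grid = [i / 2 for i in range(-2, 13)]
H = [interval_hypothesis(a, b) for a in grid for b in grid if a <= b]

print(shatters([1.0, 3.0], H))        # True: 2 points are shattered by intervals
print(shatters([1.0, 3.0, 5.0], H))   # False: {1,5} positive / {3} negative is impossible
```

Two points can be shattered by intervals but three cannot, which is exactly the statement that the VC dimension of intervals on the line is 2 (see the next slide).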

Slide 15: The Vapnik-Chervonenkis Dimension
Definition: the Vapnik-Chervonenkis (VC) dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞.
Example: the VC dimension of linear decision surfaces in the plane is 3.

Slide 16: Sample Complexity with the VC Dimension
How many randomly drawn examples suffice to ε-exhaust VS_H,D with probability at least (1 – δ)?
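One standard answer to this question (the bound of Blumer et al., as given in Mitchell, Machine Learning, Chapter 7) is:
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
so VC(H) plays the role that ln|H| played in the finite-hypothesis-space bound.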

Slide 17: Mistake Bounds
So far: how many examples are needed to learn?
What about: how many mistakes are made before convergence?
Consider a setting similar to PAC learning:
– Instances drawn at random from X according to distribution D
– The learner must classify each instance before receiving the correct classification from the teacher
Can we bound the number of mistakes the learner makes before converging?

Slide 18: Mistake Bounds: Find-S
Consider Find-S when H = conjunctions of boolean literals.
Find-S:
– Initialize h to the most specific hypothesis: l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ l3 ∧ ¬l3 ∧ … ∧ ln ∧ ¬ln
– For each positive training instance x: remove from h any literal that is not satisfied by x
– Output hypothesis h
How many mistakes before converging to the correct h?
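A minimal sketch of this version of Find-S (the representation of hypotheses as sets of literals and all names are mine; instances are tuples of booleans):

```python
def find_s(positive_examples, n):
    """Find-S for conjunctions over n boolean attributes.
    h is a set of literals (i, value) meaning 'attribute i must equal value'.
    Start with every literal and its negation (most specific hypothesis),
    then drop any literal violated by a positive example."""
    h = {(i, v) for i in range(n) for v in (True, False)}
    for x in positive_examples:          # x is a tuple of n booleans
        h = {(i, v) for (i, v) in h if x[i] == v}
    return h

# Example: 3 attributes, positives (T,T,F) and (T,F,F)
print(sorted(find_s([(True, True, False), (True, False, False)], 3)))
# -> [(0, True), (2, False)], i.e. the conjunction l1 AND NOT l3
```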

Slide 19: Mistakes in Find-S
Assuming c ∈ H:
– Negative examples can never be mislabeled as positive, since the current hypothesis h is always at least as specific as the target concept c
– Positive examples can be mislabeled as negative (the hypothesis is not yet general enough; consider the initial hypothesis)
– On the first positive example, the initial hypothesis has 2n terms (the positive and negative literal of each attribute), and n of them are eliminated
– Each subsequent mislabeled positive example eliminates at least one more term
– Thus at most n + 1 mistakes

Slide 20: Mistake Bounds: Halving Algorithm
Consider the Halving Algorithm:
– Learn the concept using the version space Candidate-Elimination algorithm
– Classify new instances by a majority vote of the version space members
How many mistakes before converging to the correct h?
– … in the worst case?
– … in the best case?
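A minimal sketch of the Halving predict-then-update loop (hypotheses are plain callables here; all names are illustrative):

```python
def halving_run(hypotheses, labeled_stream):
    """Predict by majority vote of the surviving hypotheses, then discard
    every hypothesis that disagrees with the revealed label.
    Returns the number of mistakes made."""
    alive = list(hypotheses)
    mistakes = 0
    for x, y in labeled_stream:
        votes = sum(1 for h in alive if h(x))
        prediction = votes * 2 >= len(alive)       # majority vote (ties -> positive)
        if prediction != y:
            mistakes += 1
        alive = [h for h in alive if h(x) == y]    # candidate elimination step
    return mistakes
```

Whenever the majority prediction is wrong, at least half of the surviving hypotheses must be wrong and are eliminated, which is why the worst-case mistake count is logarithmic in |H| (next slide).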

Slide 21: Mistakes in Halving
– At each point, the prediction is the majority vote of the remaining hypotheses
– A mistake can be made only when at least half of those hypotheses are wrong
– So the number of surviving hypotheses at least halves with each mistake
– Thus the worst-case mistake bound is ⌊log2 |H|⌋
How about the best case?
– The majority prediction can be correct while the number of remaining hypotheses still decreases
– So it is possible for the number of hypotheses to reach one with no mistakes at all