# Computational Learning Theory


Content
- Introduction
- Probably Learning an Approximately Correct Hypothesis
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for the Infinite Hypothesis Space
- The Mistake Bound Model of Learning
- Summary

Introduction
- Goal: a theoretical characterisation of
  - the difficulty of several types of ML problems
  - the capabilities of several types of ML algorithms
- Questions to answer:
  - Under what conditions is successful learning possible or impossible?
  - Under what conditions is a particular ML algorithm assured of learning successfully?
- PAC framework:
  - Identify classes of hypotheses that can or cannot be learned from a polynomial number of training examples
  - Define a natural complexity measure for hypothesis spaces that bounds the number of training examples required for inductive learning

Introduction 2
- Task
  - Given: some training examples $\langle x, c(x) \rangle$ and a space of candidate hypotheses $H$
  - Goal: inductive learning of the target concept $c$
- Questions
  - Sample complexity: how many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
  - Computational complexity: how much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis?
  - Mistake bound: how many training examples will the learner misclassify before converging to a successful hypothesis?

Introduction 3
- Quantitative bounds can be set on these measures, depending on attributes of the learning problem such as:
  - the size or complexity of the hypothesis space considered by the learner
  - the accuracy to which the target concept must be approximated
  - the probability that the learner will output a successful hypothesis
  - the manner in which training examples are presented to the learner

Content
- Introduction
- Probably Learning an Approximately Correct Hypothesis
  - The Problem Setting
  - Error of the Hypothesis
  - PAC Learnability
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for the Infinite Hypothesis Space
- The Mistake Bound Model of Learning
- Summary

Probably Learning an Approximately Correct Hypothesis
- PAC (probably approximately correct): with high probability, learn an approximately correct hypothesis
- Restriction: we only consider learning boolean-valued concepts from noise-free training data
- The results can be extended to the more general scenario of learning real-valued target functions (Natarajan 1991)
- The results can be extended to learning from certain types of noisy data (Laird 1988, Kearns and Vazirani 1994)

The Problem Setting
- Names
  - $X$: set of all possible instances over which target functions may be defined
  - $C$: set of target concepts that the learner might be called upon to learn
  - $D$: probability distribution over $X$, generally not known to the learner and assumed stationary (the distribution does not change over time)
  - $T$: set of training examples
  - $H$: space of candidate hypotheses
  - $L$: the learner
- Each target concept $c$ in $C$ corresponds to some subset of $X$ or, equivalently, to some boolean-valued function $c : X \to \{0, 1\}$
- Sought: after observing a sequence of training examples of $c$, $L$ must output some $h$ from $H$ that estimates $c$
- Evaluation of the success of $L$: performance of $h$ over new instances drawn randomly from $X$ according to $D$

Error of the Hypothesis
- True error: $\mathit{error}_D(h) \equiv \Pr_{x \sim D}[c(x) \neq h(x)]$
- The error of $h$ with respect to $c$ is not directly observable: $L$ can only observe the performance of $h$ over the training examples
- Training error $\mathit{error}_T(h)$: fraction of training examples misclassified by $h$
- Analysis: how probable is it that the observed training error for $h$ gives a misleading estimate of the true error?

PAC Learnability
- Goal: characterise classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation
- Possible definition of training success: search for an $h \in H$ with $\mathit{error}_D(h) = 0$
- Problems with this definition:
  - multiple hypotheses may be consistent with the training examples
  - the training set may be non-representative
- Definition of PAC learning: consider a concept class $C$ defined over a set of instances $X$ of length $n$ and a learner $L$ using a hypothesis space $H$. $C$ is PAC-learnable by $L$ using $H$ if for all $c \in C$, all distributions $D$ over $X$, all $\varepsilon$ such that $0 < \varepsilon < 1/2$, and all $\delta$ such that $0 < \delta < 1/2$, the learner $L$ will, with probability at least $(1 - \delta)$, output a hypothesis $h \in H$ such that $\mathit{error}_D(h) \le \varepsilon$, in time that is polynomial in $1/\varepsilon$, $1/\delta$, $n$ and $\mathit{size}(c)$

Content
- Introduction
- Probably Learning an Approximately Correct Hypothesis
- Sample Complexity for Finite Hypothesis Spaces
  - Agnostic Learning and Inconsistent Hypotheses
  - Conjunctions of Boolean Literals Are PAC-Learnable
- Sample Complexity for the Infinite Hypothesis Space
- The Mistake Bound Model of Learning
- Summary

Sample Complexity for Finite Hypothesis Spaces
- Definition: the sample complexity of a learning problem is the number of training examples required for successful learning; it depends on the constraints of the learning problem
- Consistent learner: outputs a hypothesis that perfectly fits the training data whenever possible
- Question: can a bound be derived on the number of training examples required by any consistent learner, independent of the specific algorithm it uses to derive a consistent hypothesis? Yes.
- Significance of the version space $VS_{H,T} = \{h \in H \mid (\forall \langle x, c(x) \rangle \in T)\; h(x) = c(x)\}$: every consistent learner outputs a hypothesis belonging to the version space
- Therefore, to bound the number of examples needed by any consistent learner, we only need to bound the number of examples needed to ensure that the version space contains no unacceptable hypotheses

Sample Complexity for Finite Hypothesis Spaces 2
- Definition of $\varepsilon$-exhausted (Haussler 1988): consider a hypothesis space $H$, target concept $c$, instance distribution $D$ and a set of training examples $T$ of $c$. The version space $VS_{H,T}$ is said to be $\varepsilon$-exhausted with respect to $c$ and $D$ if every hypothesis $h$ in $VS_{H,T}$ has error less than $\varepsilon$ with respect to $c$ and $D$: $(\forall h \in VS_{H,T})\; \mathit{error}_D(h) < \varepsilon$
- Figure: a version space $VS_{H,T}$ that is 0.3-exhausted but not 0.1-exhausted

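Sample Complexity for Finite Hypothesis Spaces 3
- Theorem, $\varepsilon$-exhausting the version space (Haussler 1988): if $H$ is finite and $T$ is a sequence of $m \ge 1$ independent, randomly drawn examples of some target concept $c$, then for any $0 \le \varepsilon \le 1$ the probability that $VS_{H,T}$ is not $\varepsilon$-exhausted (with respect to $c$) is less than or equal to $|H| e^{-\varepsilon m}$
- Key consequence: given an upper limit $\delta$ on the probability of failure, requiring $|H| e^{-\varepsilon m} \le \delta$ gives $m \ge \frac{1}{\varepsilon} \left( \ln |H| + \ln \frac{1}{\delta} \right)$
- Hint 1: $m$ grows linearly in $1/\varepsilon$, logarithmically in $1/\delta$, and logarithmically in the size of $H$
- Hint 2: the bound can be a substantial overestimate; for example, $|H| e^{-\varepsilon m}$ can exceed 1 when $|H|$ is large, although it bounds a probability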
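The bound on $m$ above is easy to evaluate numerically. The following is a minimal sketch; the function name `pac_sample_bound` and the example values are illustrative choices, not part of the slides.

```python
import math

def pac_sample_bound(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta)).

    This is the bound for a consistent learner over a finite hypothesis
    space H; h_size stands for |H|.
    """
    if h_size < 1 or not (0 < epsilon <= 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1, 0 < epsilon <= 1, 0 < delta < 1")
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative values (assumed, not from the slides):
m = pac_sample_bound(h_size=1000, epsilon=0.1, delta=0.05)
```

Halving $\varepsilon$ roughly doubles the result, while squaring $|H|$ only doubles the $\ln |H|$ term, which matches Hint 1.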

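Agnostic Learning and Inconsistent Hypotheses
- Problem: a consistent hypothesis is not always possible ($H$ may not contain $c$)
- Agnostic learning: choose the hypothesis $h_{best} \in H$ with the smallest training error $\mathit{error}_T(h)$
- Sought: a number of examples $m_0$ such that $m \ge m_0$ implies $\mathit{error}_D(h_{best}) \le \mathit{error}_T(h_{best}) + \varepsilon$ with high probability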

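Agnostic Learning and Inconsistent Hypotheses 2
- Analogy: $m$ independent coin flips, each showing "heads" with some fixed probability ($m$ independent trials of a Bernoulli experiment)
- Hoeffding bound: characterises the deviation between the true probability of some event and its observed frequency over $m$ independent trials: $\Pr[\mathit{error}_D(h) > \mathit{error}_T(h) + \varepsilon] \le e^{-2 m \varepsilon^2}$
- Requirement: the error must be bounded for every hypothesis in $H$, including $h_{best}$: $\Pr[(\exists h \in H)\; \mathit{error}_D(h) > \mathit{error}_T(h) + \varepsilon] \le |H| e^{-2 m \varepsilon^2}$
- Interpretation: given $\delta$, choose $m \ge \frac{1}{2 \varepsilon^2} \left( \ln |H| + \ln \frac{1}{\delta} \right)$
- $m$ still depends logarithmically on $|H|$ and on $1/\delta$, but now grows as $1/\varepsilon^2$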
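To see how much the $1/\varepsilon^2$ dependence costs, the agnostic bound can be evaluated the same way as the consistent-learner bound. This is a small sketch with a hypothetical function name and assumed example values.

```python
import math

def agnostic_sample_bound(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (1 / (2 * epsilon**2)) * (ln|H| + ln(1/delta)).

    Follows from requiring |H| * exp(-2 * m * epsilon**2) <= delta
    (Hoeffding bound plus a union bound over the finite space H).
    """
    if h_size < 1 or not (0 < epsilon <= 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1, 0 < epsilon <= 1, 0 < delta < 1")
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2))

# Illustrative values (assumed, not from the slides):
m_agnostic = agnostic_sample_bound(h_size=1000, epsilon=0.1, delta=0.05)
```

With these values the agnostic bound is roughly five times the consistent-learner bound for the same $|H|$, $\varepsilon$ and $\delta$, since the factor $1/\varepsilon$ is replaced by $1/(2\varepsilon^2)$.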

Conjunctions of Boolean Literals Are PAC-Learnable
- Example: $C$ is the class of target concepts described by a conjunction of boolean literals (a literal is any boolean variable or its negation)
- Is $C$ PAC-learnable? Yes:
  - any consistent learner requires only a polynomial number of training examples to learn any $c$ in $C$
  - there is a specific algorithm that uses polynomial time per training example
- Assumption: $H = C$, so $|H| = 3^n$ (each variable appears positively, negatively, or not at all). From Haussler's theorem it follows that $m \ge \frac{1}{\varepsilon} \left( n \ln 3 + \ln \frac{1}{\delta} \right)$
- $m$ grows linearly in the number of variables $n$, linearly in $1/\varepsilon$ and logarithmically in $1/\delta$

Conjunctions of Boolean Literals Are PAC-Learnable 2
- Example with numbers: 10 boolean variables. Wanted: 95% confidence ($\delta = 0.05$) that the error of the hypothesis is at most $\varepsilon$: $m \ge \frac{1}{\varepsilon} \left( 10 \ln 3 + \ln \frac{1}{0.05} \right)$
- This calls for an algorithm with polynomial computing time: FIND-S
- For each new positive training example, FIND-S computes the intersection of the literals shared by the current hypothesis and the new training example, using time linear in $n$
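As a worked instance of the bound for $n = 10$ and $\delta = 0.05$, assume an accuracy of $\varepsilon = 0.1$ (a value picked here for illustration; the transcript does not preserve the one used on the slide):

$$
m \;\ge\; \frac{1}{0.1}\left(10 \ln 3 + \ln \frac{1}{0.05}\right)
\;\approx\; 10 \,(10.986 + 2.996)
\;\approx\; 139.8
$$

Under that assumption, 140 training examples suffice.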

Find-S: Finding a Maximally Specific Hypothesis
- Use the more_general_than partial ordering:
  - begin with the most specific possible hypothesis in $H$
  - generalise this hypothesis each time it fails to cover an observed positive example
- Algorithm:
  1. Initialise $h$ to the most specific hypothesis in $H$
  2. For each positive training instance $x$:
     - For each attribute constraint $a_i$ in $h$:
       - if the constraint $a_i$ is satisfied by $x$, do nothing
       - else replace $a_i$ in $h$ by the next more general constraint that is satisfied by $x$
  3. Output hypothesis $h$
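The steps above can be sketched in a few lines of Python. This sketch assumes hypotheses are tuples of attribute constraints in which `"?"` means "any value", and it represents the most specific hypothesis (which covers nothing) by `None`; the attribute values in the toy data are made up for illustration.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

def find_s(examples: Iterable[Tuple[Sequence[str], bool]]) -> Optional[List[str]]:
    """Return the maximally specific hypothesis covering all positive examples.

    None stands for the most specific hypothesis (covers no instance).
    """
    h: Optional[List[str]] = None
    for x, is_positive in examples:
        if not is_positive:
            continue  # FIND-S ignores negative examples
        if h is None:
            h = list(x)  # first positive example: adopt its attribute values
        else:
            # keep constraints that x satisfies, generalise the others to "?"
            h = [hv if hv == xv else "?" for hv, xv in zip(h, x)]
    return h

# Hypothetical toy data: (sky, wind) attributes.
training = [
    (("sunny", "strong"), True),
    (("sunny", "weak"), True),
    (("rainy", "strong"), False),
]
hypothesis = find_s(training)
```

Each positive example is processed in time linear in the number of attributes, in line with the per-example cost stated for conjunctions of literals.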

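Find-S: Finding a Maximally Specific Hypothesis (Example)
- Step 1: initialise $h$ to the most specific hypothesis
- Step 2 (first example, positive): $h$ is generalised to cover the example
- Step 3 (next positive example): substitute a '?' in place of any attribute value in $h$ that is not satisfied by the new example
- Third example (negative): the FIND-S algorithm simply ignores every negative example
- Step 4: (resulting hypothesis not preserved in the transcript)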

Content
- Introduction
- Probably Learning an Approximately Correct Hypothesis
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for the Infinite Hypothesis Space
  - Sample Complexity and the VC Dimension
  - The Vapnik-Chervonenkis Dimension
- The Mistake Bound Model of Learning
- Summary

Sample Complexity for Infinite Hypothesis Spaces
- Disadvantages of the previous estimate:
  - the bound is weak
  - it cannot be used for an infinite hypothesis space
- Definition, shattering a set of instances: a set of instances $S$ is shattered by a hypothesis space $H$ if and only if for every dichotomy of $S$ there exists some hypothesis in $H$ consistent with this dichotomy
- Here the measure is not based on the number of distinct hypotheses $|H|$ but on the number of distinct instances from $X$ that can be completely discriminated using $H$

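Shattering a Set of Instances
- Follows from the definition: $S$ is not shattered by $H$ $\iff$ there exists a dichotomy of $S$ such that no $h \in H$ is consistent with it (the condition ranges over all hypotheses in $H$)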

The Vapnik-Chervonenkis Dimension
- A set $S$ of $|S|$ instances has $2^{|S|}$ different dichotomies
- Definition: the Vapnik-Chervonenkis dimension, $VC(H)$, of a hypothesis space $H$ defined over the instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite subsets of $X$ can be shattered by $H$, then $VC(H) \equiv \infty$
- Example: let $X = \mathbb{R}$ and $H$ be the set of intervals on the real numbers. $VC(H) = ?$
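The interval example can be checked by brute force. The sketch below assumes closed intervals $[a, b]$ as hypotheses and uses the fact that a labelling is realisable exactly when no negative point lies between the smallest and largest positive point; the function names and sample points are illustrative.

```python
from itertools import product
from typing import Sequence

def interval_realises(points: Sequence[float], labels: Sequence[bool]) -> bool:
    """True if some closed interval contains exactly the positively labelled points."""
    positives = [x for x, y in zip(points, labels) if y]
    if not positives:
        return True  # an interval placed away from all points labels everything negative
    lo, hi = min(positives), max(positives)
    # the tightest candidate interval [lo, hi] must not capture a negative point
    return all(y or not (lo <= x <= hi) for x, y in zip(points, labels))

def shattered_by_intervals(points: Sequence[float]) -> bool:
    """True if every one of the 2^|S| dichotomies of the points is realisable."""
    return all(
        interval_realises(points, labels)
        for labels in product([False, True], repeat=len(points))
    )

two_points = shattered_by_intervals([3.1, 5.7])
three_points = shattered_by_intervals([1.0, 2.0, 3.0])
```

Any two distinct points are shattered, but for three points the labelling (positive, negative, positive) is never realisable, so $VC(H) = 2$ for intervals on the real line.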

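The Vapnik-Chervonenkis Dimension 2
- Example: let $X = \mathbb{R}^2$ and $H$ be the set of linear decision surfaces in the $x, y$ plane; $VC(H) = 3$
- Figure panels:
  - two points: shattering is obviously possible
  - three points, general case: can be shattered
  - three points, special case (collinear): cannot be shattered
  - four points: no shattering possible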

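Sample Complexity and the VC Dimension
- Upper bound: $m \ge \frac{1}{\varepsilon} \left( 4 \log_2 \frac{2}{\delta} + 8\, VC(H) \log_2 \frac{13}{\varepsilon} \right)$ randomly drawn examples suffice to probably approximately learn any $c$ in $C$
- Theorem, lower bound on sample complexity: consider any concept class $C$ such that $VC(C) \ge 2$, any learner $L$, any $0 < \varepsilon < 1/8$ and any $0 < \delta < 1/100$. Then there exists a distribution $D$ and a target concept in $C$ such that if $L$ observes fewer examples than $\max \left[ \frac{1}{\varepsilon} \log \frac{1}{\delta},\; \frac{VC(C) - 1}{32 \varepsilon} \right]$, then with probability at least $\delta$, $L$ outputs a hypothesis $h$ having $\mathit{error}_D(h) > \varepsilon$
- Hint: both bounds are logarithmic in $1/\delta$ and linear in $VC(H)$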
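For a feel of the magnitudes involved, the upper bound can be evaluated directly. This is a sketch only; the function name is hypothetical and the example uses $VC(H) = 3$ (linear decision surfaces in the plane) with assumed values for $\varepsilon$ and $\delta$.

```python
import math

def vc_sample_upper_bound(vc_dim: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with
    m >= (1/epsilon) * (4 * log2(2/delta) + 8 * VC(H) * log2(13/epsilon)).
    """
    if vc_dim < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need VC(H) >= 1, 0 < epsilon < 1, 0 < delta < 1")
    total = 4.0 * math.log2(2.0 / delta) + 8.0 * vc_dim * math.log2(13.0 / epsilon)
    return math.ceil(total / epsilon)

# Illustrative values (assumed): VC(H) = 3, epsilon = 0.1, delta = 0.05.
m_vc = vc_sample_upper_bound(vc_dim=3, epsilon=0.1, delta=0.05)
```

The result is in the low thousands for these values, which illustrates that the bound is sufficient rather than tight.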

Content
- Introduction
- Probably Learning an Approximately Correct Hypothesis
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for the Infinite Hypothesis Space
- The Mistake Bound Model of Learning
  - The Mistake Bound for the FIND-S Algorithm
  - The Mistake Bound for the HALVING Algorithm
  - Optimal Mistake Bounds
  - WEIGHTED-MAJORITY Algorithm
- Summary

The Mistake Bound Model of Learning
- Mistake bound model: the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis
- Problem setting: inductive learning. The learner receives a sequence of training examples, but for each instance $x$ it must predict the target value $c(x)$ before it is shown the correct target value by the trainer
- Success criterion: exact learning or PAC learning
- Question: how many mistakes will the learner make in its predictions before it learns the target concept?
- This matters in practical applications where learning must be done while the system is in actual use
- Exact learning: $(\forall x \in X)\; h(x) = c(x)$

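The Mistake Bound for the Find-S Algorithm
- Assumption: $c \in H$, where $H$ is the set of conjunctions of up to $n$ boolean literals $l_1, \dots, l_n$ and their negations; learning is noise-free
- Find-S algorithm:
  - initialise $h$ to the most specific hypothesis $l_1 \wedge \neg l_1 \wedge l_2 \wedge \neg l_2 \wedge \dots \wedge l_n \wedge \neg l_n$
  - for each positive training instance $x$, remove from $h$ any literal that is not satisfied by $x$
  - output hypothesis $h$
- Can we bound the total number of mistakes that Find-S will make before exactly learning $c$? Yes.
- Note: no errors occur on negative instances
- Step 1 (first mistake): $n$ of the $2n$ literals are removed
- Any additional error removes at least one more literal, so there are at most $n + 1$ errors (worst case: $(\forall x)\; c(x) = 1$)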
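The $n + 1$ worst case can be reproduced with a short online simulation. The sketch below assumes instances are tuples of 0/1 values and encodes a literal as a pair `(i, v)` meaning "variable $i$ has value $v$"; this encoding and the example sequence are illustrative choices.

```python
from typing import Iterable, Sequence, Set, Tuple

Literal = Tuple[int, int]  # (variable index, required value)

def find_s_online(examples: Iterable[Tuple[Sequence[int], int]], n: int):
    """Run Find-S online over (x, c(x)) pairs and count prediction mistakes."""
    # most specific hypothesis: every literal and its negation (satisfied by nothing)
    h: Set[Literal] = {(i, v) for i in range(n) for v in (0, 1)}
    mistakes = 0
    for x, label in examples:
        prediction = int(all(x[i] == v for i, v in h))
        if prediction != label:
            mistakes += 1
        if label == 1:
            # drop every literal the positive instance does not satisfy
            h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes

# Worst case for n = 3: the target concept is always true (empty conjunction),
# and each positive instance eliminates as few literals as possible.
sequence = [((0, 0, 0), 1), ((1, 0, 0), 1), ((0, 1, 0), 1), ((0, 0, 1), 1)]
final_h, n_mistakes = find_s_online(sequence, n=3)
```

The first instance removes three of the six literals and each later instance removes exactly one more, so the learner errs on all four instances, which is $n + 1$ for $n = 3$.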

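The Mistake Bound for the HALVING Algorithm
- Maintain and refine the version space as examples arrive
- Predict by majority vote over the hypotheses in the current version space
- Halving algorithm = CANDIDATE-ELIMINATION + majority vote
- Every error at least halves the version space, so there are at most $\log_2 |H|$ mistakes
- Note: the version space is also reduced when the prediction is correct
- Extension: WEIGHTED-MAJORITY algorithm (weighted vote)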
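A minimal sketch of the Halving algorithm over an explicitly enumerated finite hypothesis space is shown below. The hypothesis class (integer thresholds) and the example sequence are assumptions made for illustration, and ties in the vote are resolved in favour of predicting 0.

```python
from typing import Callable, Iterable, List, Tuple

Hypothesis = Callable[[int], int]

def halving(hypotheses: Iterable[Hypothesis], examples: Iterable[Tuple[int, int]]):
    """Predict by majority vote of the version space, then eliminate
    every hypothesis inconsistent with the revealed label."""
    version_space: List[Hypothesis] = list(hypotheses)
    mistakes = 0
    for x, label in examples:
        votes_for_one = sum(h(x) for h in version_space)
        prediction = int(2 * votes_for_one > len(version_space))
        if prediction != label:
            mistakes += 1
        # the version space shrinks on correct predictions too
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes

# Hypothetical finite class: h_t(x) = 1 iff x >= t, for t = 0..7, so |H| = 8.
pool = [lambda x, t=t: int(x >= t) for t in range(8)]
target = lambda x: int(x >= 5)
stream = [(x, target(x)) for x in (3, 6, 4, 5, 0, 7)]
remaining, n_mistakes = halving(pool, stream)
```

With $|H| = 8$ the bound allows at most $\log_2 8 = 3$ mistakes; on this particular sequence the learner makes fewer, because correct predictions also prune the version space.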

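Optimal Mistake Bounds
- Question: what is the optimal mistake bound for an arbitrary concept class $C$, that is, the lowest worst-case mistake bound over all possible learning algorithms?
- Let $H = C$. For an algorithm $A$: $M_A(C) \equiv \max_{c \in C} M_A(c)$
- For example: $M_{\text{Find-S}}(C) = n + 1$ and $M_{\text{Halving}}(C) \le \log_2 |C|$
- Optimal mistake bound: $Opt(C) \equiv \min_{A} M_A(C)$
- Littlestone (1987): $VC(C) \le Opt(C) \le M_{\text{Halving}}(C) \le \log_2 |C|$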

WEIGHTED-MAJORITY Algorithm
- Generalisation of the Halving algorithm
- Takes a weighted vote among a pool of prediction algorithms
- Learns by altering the weight associated with each prediction algorithm
- Advantage: accommodates inconsistent training data
- Note: $\beta = 0$ gives the Halving algorithm
- Theorem, relative mistake bound for WEIGHTED-MAJORITY: let $T$ be any sequence of training examples, let $A$ be any set of $n$ prediction algorithms, and let $k$ be the minimum number of mistakes made by any algorithm in $A$ for the training sequence $T$. Then the number of mistakes over $T$ made by the WEIGHTED-MAJORITY algorithm using $\beta = \frac{1}{2}$ is at most $2.4\,(k + \log_2 n)$

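WEIGHTED-MAJORITY Algorithm 2
- $a_i$ denotes the $i$-th prediction algorithm in the pool $A$ of algorithms; $w_i$ denotes the weight associated with $a_i$
- For each $i$, initialise $w_i \leftarrow 1$
- For each training example $\langle x, c(x) \rangle$:
  - Initialise $q_0$ and $q_1$ to 0
  - For each prediction algorithm $a_i$:
    - if $a_i(x) = 0$ then $q_0 \leftarrow q_0 + w_i$
    - if $a_i(x) = 1$ then $q_1 \leftarrow q_1 + w_i$
  - If $q_1 > q_0$ then predict $c(x) = 1$
  - If $q_0 > q_1$ then predict $c(x) = 0$
  - If $q_1 = q_0$ then predict 0 or 1 at random for $c(x)$
  - For each prediction algorithm $a_i$ in $A$ do:
    - if $a_i(x) \neq c(x)$ then $w_i \leftarrow \beta w_i$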
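The procedure above translates almost line by line into Python. In this sketch the pool of predictors, the target and the seeded random generator used for tie-breaking are all illustrative assumptions.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[int], int]

def weighted_majority(
    pool: Sequence[Predictor],
    examples: Sequence[Tuple[int, int]],
    beta: float = 0.5,
    rng: random.Random = random.Random(0),
) -> Tuple[List[float], int]:
    """Run WEIGHTED-MAJORITY and return the final weights and the mistake count."""
    weights = [1.0] * len(pool)
    mistakes = 0
    for x, label in examples:
        q = [0.0, 0.0]  # q[0], q[1]: total weight voting for 0 and for 1
        for a, w in zip(pool, weights):
            q[a(x)] += w
        if q[1] > q[0]:
            prediction = 1
        elif q[0] > q[1]:
            prediction = 0
        else:
            prediction = rng.randint(0, 1)  # tie: predict at random
        if prediction != label:
            mistakes += 1
        # multiply the weight of every predictor that was wrong by beta
        weights = [w * beta if a(x) != label else w for a, w in zip(pool, weights)]
    return weights, mistakes

# Hypothetical pool: always-0, always-1, and a parity predictor; target is parity.
pool = [lambda x: 0, lambda x: 1, lambda x: x % 2]
stream = [(x, x % 2) for x in range(10)]
final_weights, n_mistakes = weighted_majority(pool, stream)
bound = 2.4 * (0 + math.log2(len(pool)))  # k = 0: the parity predictor never errs
```

Setting `beta=0.0` removes a predictor from the vote after its first mistake, which corresponds to the Halving algorithm as noted on the previous slide.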


Summary
- PAC learning versus exact learning
- Consistent and inconsistent hypotheses, agnostic learning
- VC dimension: complexity of the hypothesis space, measured by the largest subset of instances that can be shattered
- Bounds on the number of training examples sufficient for successful learning under the PAC model
- Mistake bound model: analyses the number of training examples a learner will misclassify before it exactly learns the target concept
- WEIGHTED-MAJORITY algorithm: combines the weighted votes of multiple prediction algorithms to classify new instances
