
1 Computational Learning Theory: PAC, IID, VC Dimension, SVM
Kunstmatige Intelligentie / RuG, KI2 - 5
Marius Bulacu & prof. dr. Lambert Schomaker

2 Learning
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down.
- Learning modifies the agent's decision mechanisms to improve performance.

3 Learning Agents

4 Learning Element
- Design of a learning element is affected by:
  - which components of the performance element are to be learned
  - what feedback is available to learn these components
  - what representation is used for the components
- Type of feedback:
  - Supervised learning: correct answers for each example
  - Unsupervised learning: correct answers not given
  - Reinforcement learning: occasional rewards

5 Inductive Learning
- Simplest form: learn a function from examples.
  - f is the target function
  - an example is a pair (x, f(x))
- Problem: find a hypothesis h such that h ≈ f, given a training set of examples.
- This is a highly simplified model of real learning:
  - it ignores prior knowledge
  - it assumes examples are given

6-11 Inductive Learning Method
- Construct/adjust h to agree with f on the training set.
- (h is consistent if it agrees with f on all examples.)
- E.g., curve fitting (figures: a sequence of fits of increasing complexity through the same data points, one per slide).
- Occam's razor: prefer the simplest hypothesis consistent with the data.
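To make the curve-fitting picture concrete, here is a minimal sketch in Python (NumPy assumed available; the target function, noise level, and polynomial degrees are invented for illustration). A high-degree polynomial can be made consistent with the training set, yet the simpler fit typically generalizes better, which is exactly the Occam's razor point:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical target function f and noisy training examples (x, f(x) + noise).
    def f(x):
        return np.sin(2 * np.pi * x)

    x_train = rng.uniform(0, 1, 10)
    y_train = f(x_train) + rng.normal(0, 0.1, 10)
    x_test = np.linspace(0, 1, 100)
    y_test = f(x_test)

    # Construct/adjust hypotheses h of increasing complexity (polynomial degree).
    for degree in (1, 3, 9):
        h = np.poly1d(np.polyfit(x_train, y_train, degree))
        train_mse = np.mean((h(x_train) - y_train) ** 2)
        test_mse = np.mean((h(x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

    # The degree-9 polynomial interpolates all 10 training points (consistent),
    # but usually at the cost of a much larger test error than the simpler fits.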

12 Occam's Razor
William of Occam (1285-1349, England): "If two theories explain the facts equally well, then the simpler theory is to be preferred."
Rationale:
- There are fewer short hypotheses than long hypotheses.
- A short hypothesis that fits the data is unlikely to be a coincidence.
- A long hypothesis that fits the data may be a coincidence.
- Formal treatment in computational learning theory.

13 The Problem
Why does learning work? How do we know that the learned hypothesis h is close to the target function f if we do not know what f is?
The answer is provided by computational learning theory.

14 The Answer
Any hypothesis h that is consistent with a sufficiently large number of training examples is unlikely to be seriously wrong. Therefore it must be Probably Approximately Correct (PAC).

15 The Stationarity Assumption
The training and test sets are drawn randomly from the same population of examples, using the same probability distribution. Therefore training and test data are Independently and Identically Distributed (IID): "the future is like the past".

16 How many examples are needed? (Sample complexity)
N ≥ (1/ε) (ln(1/δ) + ln |H|)
- N: number of examples
- ε: probability that h and f disagree on an example
- δ: probability that a wrong hypothesis consistent with all examples exists
- |H|: size of the hypothesis space
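A quick numeric check of this bound, as a sketch (the example values of ε, δ, and |H| are invented):

    import math

    def pac_sample_bound(epsilon: float, delta: float, h_size: int) -> int:
        """Number of examples sufficient for any consistent hypothesis to be
        probably (prob. >= 1 - delta) approximately (error <= epsilon) correct."""
        return math.ceil((math.log(1 / delta) + math.log(h_size)) / epsilon)

    # E.g., all boolean functions of 4 attributes: |H| = 2**(2**4) = 65536.
    print(pac_sample_bound(epsilon=0.1, delta=0.05, h_size=2 ** 16))  # -> 141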

17 Formal Derivation
- H: the set of all possible hypotheses, with f ∈ H.
- H_BAD: the set of "wrong" hypotheses, those whose error exceeds ε.
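The argument this slide sketches, written out (the standard PAC derivation, with N, ε, δ, |H| as on the previous slide):

    \[
    h_b \in H_{\mathrm{BAD}} \;\Rightarrow\; P(h_b \text{ agrees with } f \text{ on one example}) \le 1 - \epsilon,
    \]
    \[
    P(h_b \text{ is consistent with } N \text{ examples}) \le (1 - \epsilon)^N,
    \]
    \[
    P(H_{\mathrm{BAD}} \text{ contains a consistent hypothesis}) \le |H_{\mathrm{BAD}}|\,(1 - \epsilon)^N \le |H|\,(1 - \epsilon)^N.
    \]
    Bounding this probability by \(\delta\) and using \(1 - \epsilon \le e^{-\epsilon}\) yields
    \[
    N \ge \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right).
    \]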

18 What if the hypothesis space is infinite?
- We can't use our result for finite H.
- We need some other measure of complexity for H: the Vapnik-Chervonenkis (VC) dimension.
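Slides 19-21 of the original contain only figures, with no transcript text. For reference, the standard definition of the VC dimension that they illustrate:

    \[
    \mathrm{VC}(H) = \max\,\{\, m : \text{some set of } m \text{ points is shattered by } H \,\}
    \]
    A set of points is shattered by \(H\) if, for every possible two-class labeling of
    the points, some \(h \in H\) realizes that labeling. For example, linear separators
    in the plane have \(\mathrm{VC}(H) = 3\): any 3 points in general position can be
    shattered, but no 4 points can (the XOR labeling below fails).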


22 Shattering two binary dimensions over a number of classes
To understand the principle of shattering sample points into classes, we will look at the simple case of two binary-valued dimensions.

23-39 (figures: the four corner points of the binary feature space, axes f1 and f2 with values {0, 1}, labeled into two classes in one configuration per slide; only the captions survive)
23 2-D feature space
24 2-D feature space, 2 classes
25 the other class…
26 2 left vs 2 right
27 top vs bottom
28 right vs left
29 bottom vs top
30 lower-right outlier
31 lower-left outlier
32 upper-left outlier
33 upper-right outlier
34 etc.
35 2-D feature space
36 2-D feature space
37 2-D feature space
38 XOR configuration A
39 XOR configuration B

40 2-D feature space, two classes: 16 hypotheses
(figure: the 16 possible labelings, numbered 0-15, of the four cells f1 ∈ {0, 1} × f2 ∈ {0, 1})
"Hypothesis" = a possible class partitioning of all data samples.

41 2-D feature space, two classes, 16 hypotheses
(figure: the same 16 labelings, numbered 0-15, with the two XOR configurations marked)
Two XOR class configurations: 2/16 of the hypotheses require a non-linear separatrix.
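This count can be checked by brute force, as in the following sketch (pure Python): enumerate all 16 labelings of the four corner points and test each against a small family of candidate linear separators; only the two XOR labelings fail.

    from itertools import product

    points = [(0, 0), (0, 1), (1, 0), (1, 1)]

    # Candidate separators sign(w1*f1 + w2*f2 + b); this small grid suffices
    # to realize every linearly separable labeling of the four corners.
    separators = [(w1, w2, b)
                  for w1 in (-1, 0, 1)
                  for w2 in (-1, 0, 1)
                  for b in (-1.5, -0.5, 0.5, 1.5)]

    def linearly_separable(labels):
        return any(all((w1 * f1 + w2 * f2 + b > 0) == lab
                       for (f1, f2), lab in zip(points, labels))
                   for (w1, w2, b) in separators)

    xor_like = [labels for labels in product((False, True), repeat=4)
                if not linearly_separable(labels)]
    print(len(xor_like))  # -> 2: only XOR and its mirror need a non-linear separatrix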

42-43 XOR, a possible non-linear separation (figures: two curved separatrices for the XOR configuration, axes f1 and f2)

44 2-D feature space, three classes, # hypotheses?
(figure: the labelings of the four cells f1 ∈ {0, 1} × f2 ∈ {0, 1} over three classes, numbered 0, 1, 2, …)

45 2-D feature space, three classes, # hypotheses?
3^4 = 81 possible hypotheses.

46 Maximum, discrete space
- Four classes: 4^4 = 256 hypotheses.
- Assume that there are no more classes than discrete cells.
- Maximum number of hypotheses: N_hyp_max = n_classes ^ n_cells (consistent with 2^4 = 16, 3^4 = 81, 4^4 = 256 above).

47 2-D feature space, three classes…
(figure: the four corner points labeled with three classes)
In this example, two of the classes are each linearly separable from the rest, but the third is not linearly separable from the remaining classes.

48 2-D feature space, four classes…
(figure: the four corner points labeled with four classes)
Minsky & Papert: simple table lookup or logic will do nicely.

49 2-D feature space, four classes…
(figure: the four classes encapsulated by circles)
Spheres or radial-basis functions may offer a compact class encapsulation in case of limited noise and limited overlap (but in the end the data will tell: experimentation required!).

50 SVM (1): Kernels
(figures: a complicated separation boundary in the original space (f1, f2) becomes a simple separation boundary, a hyperplane, in a higher-dimensional space (f1, f2, f3))
Kernels: polynomial, radial basis, sigmoid.
A kernel is an implicit mapping to a higher-dimensional space where linear separation is possible.
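Connecting this to the XOR configuration above, a minimal sketch (assuming scikit-learn is installed; the kernel parameters are illustrative): an RBF-kernel SVM separates the XOR labeling that no hyperplane in the original two dimensions can.

    from sklearn.svm import SVC

    # The XOR configuration: no separating line exists in (f1, f2).
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 0]

    # The RBF kernel implicitly maps the points into a higher-dimensional
    # space, where a separating hyperplane does exist.
    clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
    print(clf.predict(X))  # -> [0 1 1 0]: the XOR labeling is realized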

51 SVM (2): Max Margin
(figure: the "best" separating hyperplane in (f1, f2), with the max margin bounded by the support vectors)
- From all the possible separating hyperplanes, select the one that gives the maximum margin.
- The solution is found by quadratic optimization; this is the "learning".
- The max-margin hyperplane gives good generalization.
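A minimal sketch of the max-margin selection (scikit-learn again assumed; the toy data is invented): fit a linear SVM and inspect the support vectors and margin width.

    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable clusters in (f1, f2).
    X = np.array([[0.0, 0.0], [0.2, 0.4], [0.4, 0.1],
                  [1.0, 1.0], [0.8, 0.7], [0.9, 1.2]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # A very large C approximates the hard-margin SVM: among all separating
    # hyperplanes, quadratic optimization selects the max-margin one.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w = clf.coef_[0]
    print("support vectors:\n", clf.support_vectors_)
    print("margin width:", 2.0 / np.linalg.norm(w))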

