
1 Università di Milano-Bicocca, Laurea Magistrale in Informatica, Corso di APPRENDIMENTO E APPROSSIMAZIONE, Prof. Giancarlo Mauri. Lecture 4: Computational Learning Theory

2 Computational models of cognitive phenomena
- Computing capabilities: computability theory
- Reasoning/deduction: formal logic
- Learning/induction: ?

3 A theory of the learnable (Valiant '84)
[…] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn […]
Learning machines must have all three of the following properties:
- the machines can provably learn whole classes of concepts, and these classes can be characterized
- the classes of concepts are appropriate and nontrivial for general-purpose knowledge
- the computational process by which the machine builds the desired programs requires a "feasible" (i.e. polynomial) number of steps

4 A theory of the learnable
We seek general laws that constrain inductive learning, relating:
- probability of successful learning
- number of training examples
- complexity of the hypothesis space
- accuracy to which the target concept is approximated
- manner in which training examples are presented

5 Probably approximately correct learning
A formal computational model intended to shed light on the limits of what can be learned by a machine, by analysing the computational cost of learning algorithms.

6 What we want to learn
To determine uniformly good approximations of an unknown function from its values at some sample points:
- interpolation
- pattern matching
- concept learning
CONCEPT = recognizing algorithm
LEARNING = computational description of recognizing algorithms starting from:
- examples
- incomplete specifications

7 What's new in p.a.c. learning?
Accuracy of results and running time for learning algorithms are explicitly quantified and related.
A general problem: use of resources (time, space, …) by computations → COMPLEXITY THEORY
Example:
- sorting: n·log n time (polynomial, feasible)
- boolean satisfiability: 2ⁿ time (exponential, intractable)

8 Learning from examples
The LEARNER receives EXAMPLES of a concept drawn from a DOMAIN and outputs A REPRESENTATION OF A CONCEPT.
- CONCEPT: subset of the domain
- EXAMPLES: elements of the concept (positive)
- REPRESENTATION: domain → expressions
What is a GOOD learner? What is an EFFICIENT learner?

9 The P.A.C. model
- A domain X (e.g. {0,1}ⁿ, Rⁿ)
- A concept: a subset of X, f ⊆ X, or equivalently f: X → {0,1}
- A class of concepts F ⊆ 2^X
- A probability distribution P on X
Example 1: X ≡ a square, F ≡ the triangles contained in the square

10 The P.A.C. model
Example 2: X ≡ {0,1}ⁿ, F ≡ a family of boolean functions, e.g.
f_r(x_1, …, x_n) = 1 if there are at least r ones in (x_1, …, x_n), 0 otherwise
P a probability distribution on X: uniform or non-uniform
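The threshold concept f_r of Example 2 is easy to render directly; a minimal Python sketch (the function name and the example calls are illustrative, not from the slides):

def f_r(x, r):
    # Threshold concept from the slide: 1 iff the bit vector x has at least r ones.
    return 1 if sum(x) >= r else 0

print(f_r((1, 0, 1, 0), 2))  # -> 1: two ones reach the threshold r = 2
print(f_r((0, 0, 1, 0), 2))  # -> 0: only one bit is set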

11 The P.A.C. model
The learning process:
- Labeled sample: ((x_0, f(x_0)), (x_1, f(x_1)), …, (x_n, f(x_n)))
- Hypothesis: a function h consistent with the sample (i.e., h(x_i) = f(x_i) ∀i)
- Error probability: P_err = P(h(x) ≠ f(x), x ∈ X)

12 The P.A.C. model
TEACHER: an examples generator that draws x ∈ X according to the probability distribution P and labels it with the target f ∈ F.
LEARNER: knows X and F, receives the t examples (x_1, f(x_1)), …, (x_t, f(x_t)), and runs an inference procedure A that outputs a hypothesis h (an implicit representation of a concept).
The learning algorithm A is good if the hypothesis h is "ALMOST ALWAYS" "CLOSE TO" the target concept f.

13 The P.A.C. model
"CLOSE TO" — METRIC: given P, d_P(f,h) = P_err = P{x | f(x) ≠ h(x)}
Given an approximation parameter ε (0 < ε ≤ 1), h is an ε-approximation of f if d_P(f,h) ≤ ε.
"ALMOST ALWAYS" — Confidence parameter δ (0 < δ ≤ 1): the "measure" of the sequences of examples, randomly chosen according to P, such that h is an ε-approximation of f is at least 1-δ.

14 Learning algorithm
A generator of examples produces labeled samples from a concept in the class F; the learner outputs a hypothesis h.
F concept class, S a set of labeled samples from a concept in F.
A learning algorithm is a map A: S → F such that:
I) A(S) is consistent with S
II) P(P_err ≤ ε) ≥ 1-δ
where, for all 0 < ε, δ < 1 and for all f ∈ F, there exists m ∈ N such that II) holds for every sample S with |S| ≥ m.
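Conditions I–II can be restated compactly with the metric d_P of slide 13; a hedged LaTeX sketch (the notation P^m for the product distribution over m-samples is an assumption of this rendering):

\forall f \in F,\ \forall P \text{ on } X,\ \forall\, 0 < \varepsilon, \delta < 1\ \ \exists m \in \mathbb{N}:
\quad \Pr_{S \sim P^{m}}\big[\, d_P\big(f, A(S)\big) \le \varepsilon \,\big] \;\ge\; 1 - \delta,
\qquad \text{with } A(S) \text{ consistent with } S.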

15 The efficiency issue
Look for algorithms which use a "reasonable" amount of computational resources:
- SAMPLE SIZE (statistical PAC learning)
- COMPUTATION TIME (polynomial PAC learning)
DEF 1: a concept class F = ∪_{n≥1} F_n is statistically PAC learnable if there is a learning algorithm with sample size t = t(n, 1/ε, 1/δ) bounded by some polynomial function in n, 1/ε, 1/δ

16 The efficiency issue
POLYNOMIAL PAC ⊆ STATISTICAL PAC
DEF 2: a concept class F = ∪_{n≥1} F_n is polynomially PAC learnable if there is a learning algorithm with running time bounded by some polynomial function in n, 1/ε, 1/δ

17 Learning boolean functions
B_n = {f: {0,1}ⁿ → {0,1}}, the set of boolean functions in n variables.
A class of concepts: F_n ⊆ B_n.
- Example 1: F_n = clauses with literals in {x_1, ¬x_1, …, x_n, ¬x_n}
- Example 2: F_n = linearly separable functions in n variables
REPRESENTATION:
- truth table (explicit)
- boolean circuits (implicit)
BOOLEAN CIRCUITS ↔ BOOLEAN FUNCTIONS

18 Boolean functions and circuits
BASIC OPERATIONS and COMPOSITION: given f in m variables and g_1, …, g_m in n variables,
[f ∘ (g_1, …, g_m)](x) = f(g_1(x), …, g_m(x))
CIRCUIT: a finite acyclic directed graph with input nodes, internal nodes labeled by basic operations, and an output node.
Given an assignment {x_1, …, x_n} → {0,1} to the input variables, the output node computes the corresponding value.
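To make the circuit representation concrete, a minimal Python sketch of evaluating such a DAG; the dict-based encoding, node names, gate set and example circuit are assumptions for illustration, not part of the lecture:

# Each internal node is (gate, [predecessor names]); inputs come from the assignment.
GATES = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b, "NOT": lambda a: 1 - a}

def eval_circuit(circuit, output, assignment):
    # Evaluate node `output` of the acyclic `circuit` under an input assignment {x_i: 0/1}.
    cache = dict(assignment)
    def value(node):
        if node not in cache:
            gate, preds = circuit[node]
            cache[node] = GATES[gate](*(value(p) for p in preds))
        return cache[node]
    return value(output)

# Illustrative circuit computing (x1 AND x2) OR (NOT x3)
circuit = {"g1": ("AND", ["x1", "x2"]), "g2": ("NOT", ["x3"]), "out": ("OR", ["g1", "g2"])}
print(eval_circuit(circuit, "out", {"x1": 1, "x2": 0, "x3": 0}))  # -> 1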

19 Boolean functions and circuits
F_n ⊆ B_n; C_n: class of circuits which compute all and only the functions in F_n.
Algorithm A to learn F by C:
- INPUT: (n, ε, δ)
- The learner computes t = t(n, 1/ε, 1/δ) (t = number of examples sufficient to learn with accuracy ε and confidence δ)
- The learner asks the teacher for a labelled t-sample
- The learner receives the t-sample S and computes C = A_n(S)
- OUTPUT: C (a representation of the hypothesis)
Note that the inference procedure A_n receives as input the integer n and a t-sample on {0,1}ⁿ and outputs A_n(S) = A(n, S).

20 Boolean functions and circuits
An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for the concept class F, using the class of representations C, if for all n ≥ 1, for all f ∈ F_n, for all 0 < ε, δ < 1 and for every probability distribution p over {0,1}ⁿ the following holds:
if the inference procedure A_n receives as input a t-sample, it outputs a representation c ∈ C_n of a function g that is probably approximately correct, that is, with probability at least 1-δ a t-sample is chosen such that the inferred function g satisfies P{x | f(x) ≠ g(x)} ≤ ε.
- g is ε-good: g is an ε-approximation of f
- g is ε-bad: g is not an ε-approximation of f
NOTE: the definition is distribution free.

21 Statistical P.A.C. learning
DEF: An inference procedure A_n for the class F_n is consistent if, given the target function f ∈ F_n, for every t-sample S = ((x_1, b_1), …, (x_t, b_t)), A_n(S) is a representation of a function g "consistent" with S, i.e. g(x_1) = b_1, …, g(x_t) = b_t.
DEF: A learning algorithm A is consistent if its inference procedure is consistent.
PROBLEM: estimate upper and lower bounds on the sample size t = t(n, 1/ε, 1/δ):
- upper bounds will be given for consistent algorithms
- lower bounds will be given for arbitrary algorithms

22 A simple upper bound
THEOREM: t(n, 1/ε, 1/δ) ≤ ε⁻¹ (ln #F_n + ln(1/δ))
PROOF:
Prob{(x_1, …, x_t) | ∃g ε-bad with g(x_1) = f(x_1), …, g(x_t) = f(x_t)}
≤ Σ_{g ε-bad} Prob(g(x_1) = f(x_1), …, g(x_t) = f(x_t))    [P(A∪B) ≤ P(A)+P(B)]
≤ Σ_{g ε-bad} Π_{i=1,…,t} Prob(g(x_i) = f(x_i))             [independent events]
≤ Σ_{g ε-bad} (1-ε)^t                                        [g is ε-bad]
≤ #F_n (1-ε)^t ≤ #F_n e^{-εt}
Imposing #F_n e^{-εt} ≤ δ gives the bound.
NOTE: #F_n must be finite.
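As a worked numeric check of this bound, a small Python sketch. It assumes the class of monomials over n variables, which has at most 3ⁿ formulas (each variable appears positive, negated, or not at all); the choice of n, ε and δ below is illustrative:

import math

def sample_size(card_F, eps, delta):
    # Simple upper bound of the slide: smallest t with card_F * exp(-eps*t) <= delta.
    return math.ceil((math.log(card_F) + math.log(1 / delta)) / eps)

n, eps, delta = 20, 0.1, 0.05
print(sample_size(3 ** n, eps, delta))  # -> 250: examples sufficient for monomials on 20 variables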

23 Vapnik-Chervonenkis approach (1971)
Problem: uniform convergence of relative frequencies to their probabilities.
X domain, F ⊆ 2^X class of concepts, S = (x_1, …, x_t) a t-sample.
f ≡_S g iff f(x_i) = g(x_i) ∀ x_i ∈ S (f and g are indistinguishable by S)
π_F(S) = #(F/≡_S), the index of F with respect to S
Growth function: m_F(t) = max{π_F(S) | S is a t-sample}

24 A general upper bound
FACT:
- m_F(t) ≤ 2^t
- m_F(t) ≤ #F (this condition gives immediately the simple upper bound)
- if m_F(t) = 2^t and j < t, then m_F(j) = 2^j
THEOREM: Prob{(x_1, …, x_t) | ∃g ε-bad with g(x_1) = f(x_1), …, g(x_t) = f(x_t)} ≤ 2 m_F(2t) e^{-εt/2}

25 Graph of the growth function
[Figure: m_F(t) as a function of t, with the points d and #F marked.]
DEFINITION: d = VCdim(F) = max{t | m_F(t) = 2^t}
FUNDAMENTAL PROPERTY: for t > d, m_F(t) is bounded by a polynomial in t!

26 Upper and lower bounds
THEOREM: if d_n = VCdim(F_n), then t(n, 1/ε, 1/δ) ≤ max(4/ε · log(2/δ), 8d_n/ε · log(13/ε))
PROOF: impose 2 m_{F_n}(2t) e^{-εt/2} ≤ δ.
A lower bound on t(n, 1/ε, 1/δ): the number of examples which are necessary for arbitrary algorithms.
THEOREM: for 0 < ε ≤ 1/8 and δ ≤ 1/100,
t(n, 1/ε, 1/δ) ≥ max((1-ε)/ε · ln(1/δ), (d_n - 1)/(32ε))

27 An equivalent definition of VCdim
π_F(S) = #{f⁻¹(1) ∩ {x_1, …, x_t} | f ∈ F}
i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F.
If π_F(S) = 2^{|S|} we say that S is shattered by F.
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F.
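A brute-force Python sketch of these definitions for small finite cases: π_F(S) as the number of distinct intersections f⁻¹(1) ∩ S, and shattering as π_F(S) = 2^|S|. The example class (the threshold concepts f_r of slide 10) and the chosen points are assumptions used only for illustration:

def index_of_F(concepts, S):
    # pi_F(S): number of distinct subsets of S cut out by concepts in F.
    return len({frozenset(x for x in S if f(x)) for f in concepts})

def is_shattered(concepts, S):
    return index_of_F(concepts, S) == 2 ** len(S)

# Illustrative class over X = {0,1}^3: the nested threshold concepts f_r, r = 0..4
concepts = [lambda x, r=r: sum(x) >= r for r in range(5)]
S = [(0, 0, 0), (1, 1, 0)]
print(index_of_F(concepts, S), is_shattered(concepts, S))
# -> 3 False: no 2-point set is shattered by nested threshold concepts (their VC dimension is 1)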

28 Example 1
Learn the family F of circles contained in the square.
For the values of ε and δ chosen on the slide, about 24,000 examples are sufficient (upper bound) and about 690 are necessary (lower bound).

29 Example 2
Learn the family L_n of linearly separable boolean functions in n variables, i.e. half-space functions HS(x) = 1 if Σ_i w_i x_i ≥ threshold, 0 otherwise.
The simple upper bound uses #L_n; the upper bound based on the VC dimension grows linearly with n!

30 Example 2
Consider the class L_2 of linearly separable functions in two variables. In the four-point configuration of the figure, the green point cannot be separated from the other three: no straight line separates the green from the red points, so this set of four points is not shattered by L_2.

31 Classes of boolean formulas
- Monomials: x_1 ∧ x_2 ∧ … ∧ x_k
- DNF: m_1 ∨ m_2 ∨ … ∨ m_j (the m_i are monomials)
- Clauses: x_1 ∨ x_2 ∨ … ∨ x_k
- CNF: c_1 ∧ c_2 ∧ … ∧ c_j (the c_i are clauses)
- k-DNF: at most k literals in each monomial
- k-term-DNF: at most k monomials
- k-CNF: at most k literals in each clause
- k-clause-CNF: at most k clauses
- Monotone formulas: contain no negated literals
- μ-formulas: each variable appears at most once

32 The results
Th. (Valiant): monomials are learnable from positive examples with 2ε⁻¹(n + log ε⁻¹) examples (ε = tolerated error), setting … in all the examples.
N.B. Learnability is not monotone: A ⊆ B with B learnable does not imply that A is learnable.
Th.: monomials are not learnable from negative examples.

33 Positive results
1) k-CNF are learnable from positive examples only
1b) k-DNF are learnable from negative examples only
2) (k-DNF ∪ k-CNF) are learnable from positive and negative examples
3) the class of k-decision lists is learnable
Th.: every k-DNF (or k-CNF) formula can be represented by a small k-DL.
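To make the k-DL representation concrete, a small Python sketch of decision-list evaluation; the encoding of terms as dictionaries and the example list are assumptions for illustration, not Rivest's original notation:

# A k-DL is a sequence of (term, bit) pairs plus a default bit; a term is a
# conjunction of at most k literals, encoded here as {variable_index: required_value}.
def eval_decision_list(dl, default, x):
    for term, bit in dl:
        if all(x[i] == v for i, v in term.items()):
            return bit            # the first satisfied term decides the output
    return default

# Illustrative 2-DL over 3 variables: if x0=1 and x2=0 -> 1; else if x1=0 -> 0; else -> 1
dl = [({0: 1, 2: 0}, 1), ({1: 0}, 0)]
print(eval_decision_list(dl, 1, (1, 1, 0)))  # -> 1 (first term fires)
print(eval_decision_list(dl, 1, (0, 0, 1)))  # -> 0 (second term fires)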

34 Negative results
1) μ-formulas are not learnable
2) Boolean threshold functions are not learnable
3) For k ≥ 2, k-term-DNF formulas are not learnable

35 Mistake bound model
So far: how many examples are needed to learn?
What about: how many mistakes before convergence?
Consider a setting similar to PAC learning:
- instances are drawn at random from X according to distribution D
- the learner must classify each instance before receiving the correct classification from the teacher
Can we bound the number of mistakes the learner makes before converging?

36 Mistake bound model
The learner:
- receives a sequence of training examples x
- predicts the target value f(x)
- receives the correct target value from the trainer
- is evaluated by the total number of mistakes it makes before converging to the correct hypothesis
I.e.: learning takes place during the use of the system, not off-line.
Ex.: prediction of fraudulent use of credit cards.

37 Mistake bound for Find-S
Consider Find-S when H = conjunctions of boolean literals.
FIND-S:
- Initialize h to the most specific hypothesis in H: x_1 ∧ ¬x_1 ∧ x_2 ∧ ¬x_2 ∧ … ∧ x_n ∧ ¬x_n
- For each positive training instance x: remove from h any literal not satisfied by x
- Output h
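A minimal Python sketch of this version of FIND-S. Hypotheses are conjunctions of boolean literals, encoded here as a set of (index, value) pairs; the encoding and names are assumptions for illustration:

def find_s(n, positive_examples):
    # Start from the most specific hypothesis x1 & ~x1 & ... & xn & ~xn
    # and drop every literal contradicted by a positive example.
    h = {(i, 0) for i in range(n)} | {(i, 1) for i in range(n)}
    for x in positive_examples:          # x is a tuple of n bits labelled positive
        h = {(i, v) for (i, v) in h if x[i] == v}
    return h                             # remaining literals: (i, 1) = x_i, (i, 0) = ~x_i

# Illustrative run: target concept x0 & ~x2 over n = 3 variables
print(sorted(find_s(3, [(1, 0, 0), (1, 1, 0)])))  # -> [(0, 1), (2, 0)]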

38 Mistake bound for Find-S
If c ∈ H and the training data are noise free, Find-S converges to an exact hypothesis.
How many errors to learn c ∈ H (only positive examples can be misclassified)?
- The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated.
- Each subsequent error eliminates at least one literal.
#mistakes ≤ n+1 (worst case, for the "total" concept ∀x c(x)=1)

39 Mistake bound for Halving
- A version space is maintained and refined (e.g., Candidate-elimination)
- Prediction is based on a majority vote among the hypotheses in the current version space
- "Wrong" hypotheses are removed (even if x was correctly classified by the majority)
How many errors to exactly learn c ∈ H (H finite)?
- A mistake occurs when the majority of hypotheses misclassifies x; these hypotheses are removed
- For each mistake, the version space is at least halved
- At most log_2(|H|) mistakes before exact learning (e.g., a single hypothesis remaining)
Note: learning without mistakes is possible!
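A Python sketch of the Halving idea over an explicit finite hypothesis class, with majority-vote prediction and removal of misclassifying hypotheses; the representation of H as a list of callables, the tie-breaking toward 0, and the example stream are assumptions for illustration:

def halving_run(H, stream):
    # H: list of hypotheses (callables x -> 0/1); stream: iterable of (x, c(x)).
    version_space = list(H)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes > len(version_space) else 0   # majority vote
        if prediction != label:
            mistakes += 1             # each mistake at least halves the version space
        version_space = [h for h in version_space if h(x) == label]
    return mistakes, version_space

# Illustrative run with the nested threshold class f_r over {0,1}^3, target r = 2
H = [lambda x, r=r: int(sum(x) >= r) for r in range(5)]
stream = [((1, 1, 1), 1), ((0, 0, 0), 0), ((1, 0, 0), 0), ((1, 1, 0), 1)]
print(halving_run(H, stream)[0])  # -> 1 mistake, within the log2(|H|) ~ 2.3 bound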

40 Optimal mistake bound
Question: what is the optimal mistake bound (i.e., the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?
Formally, for any learning algorithm A and any target concept c:
- M_A(c) = max number of mistakes made by A to exactly learn c, over all possible training sequences
- M_A(C) = max_{c ∈ C} M_A(c)
Note: M_Find-S(C) = n+1, M_Halving(C) ≤ log_2(|C|)
- Opt(C) = min_A M_A(C), i.e., the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm

41 Optimal mistake bound
Theorem (Littlestone 1987): VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log_2(|C|)
There exist concept classes for which VC(C) = Opt(C) = M_Halving(C) = log_2(|C|), e.g. the power set 2^X of X, for which it holds: VC(2^X) = |X| = log_2(|2^X|)
There exist concept classes for which VC(C) < Opt(C) < M_Halving(C)

42 Weighted majority algorithm
- Generalizes Halving
- Makes predictions by taking a weighted vote among a pool of prediction algorithms
- Learns by altering the weight associated with each prediction algorithm
- It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data

43 Weighted majority algorithm
∀i: w_i := 1
∀ training example (x, c(x)):
  q_0 := q_1 := 0
  ∀ prediction algorithm a_i:
    if a_i(x) = 0 then q_0 := q_0 + w_i
    if a_i(x) = 1 then q_1 := q_1 + w_i
  if q_1 > q_0 then predict c(x) = 1
  if q_1 < q_0 then predict c(x) = 0
  if q_1 = q_0 then predict c(x) = 0 or 1 at random
  ∀ prediction algorithm a_i:
    if a_i(x) ≠ c(x) then w_i := β·w_i   (0 ≤ β < 1)
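A Python sketch of the weighted-majority update following the pseudocode above; the pool of predictors as callables and the random tie-breaking match the slide, everything else (names, return values) is an assumption for illustration:

import random

def weighted_majority(pool, stream, beta=0.5):
    # pool: list of prediction algorithms (callables x -> 0/1);
    # stream: iterable of (x, c(x)); beta in [0,1) is the penalty factor.
    w = [1.0] * len(pool)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for wi, a in zip(w, pool) if a(x) == 0)
        q1 = sum(wi for wi, a in zip(w, pool) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q1 < q0 else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        # penalise every algorithm that got this example wrong
        w = [wi * beta if a(x) != label else wi for wi, a in zip(w, pool)]
    return mistakes, w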

44 Weighted majority algorithm (WM)
Coincides with Halving for β = 0.
Theorem: let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any a_j ∈ A on D, and β = 1/2. Then WM makes at most 2.4(k + log_2 n) mistakes over D.

45 Weighted majority algorithm (WM)
Proof
Since a_j makes k mistakes (the best in A), its final weight w_j will be (1/2)^k.
The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2.
The final total weight W is therefore at most n(3/4)^M, where M is the total number of mistakes made by WM over D.

46 Weighted majority algorithm (WM)
But the final weight w_j cannot be greater than the final total weight W, hence:
(1/2)^k ≤ n(3/4)^M
from which
M ≤ (k + log_2 n) / (-log_2(3/4)) ≤ 2.4(k + log_2 n)
I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.
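The constant 2.4 comes out of the logarithm of 3/4; a short LaTeX rendering of the last step, using the same quantities as in the proof:

\left(\tfrac{1}{2}\right)^{k} \le n\left(\tfrac{3}{4}\right)^{M}
\;\Longrightarrow\;
-k \le \log_2 n + M \log_2\tfrac{3}{4}
\;\Longrightarrow\;
M \le \frac{k + \log_2 n}{-\log_2\frac{3}{4}} \approx 2.41\,(k + \log_2 n).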

