
1 Experts and Boosting Algorithms

2 Experts: Motivation
Given a set of experts:
– No prior information
– No consistent behavior
– Goal: predict as well as the best expert
Model:
– Online model
– Input: historical results

3 Experts: Model
N strategies (experts). At time t:
– Learner A chooses a distribution over the N experts.
– Let p_t(i) be the probability of the i-th expert; clearly Σ_i p_t(i) = 1.
– A loss vector l_t is received.
– Loss at time t: Σ_i p_t(i) l_t(i).
Assume bounded losses: l_t(i) in [0,1].
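
The protocol above is easy to state concretely. A minimal sketch in Python, where learner is an abstract callable mapping the history of loss vectors to a distribution over the N experts (the names here are illustrative, not from the slides):

def run_experts_game(learner, loss_vectors, n_experts):
    # loss_vectors: one length-N list per round, entries in [0, 1]
    total_loss = 0.0
    history = []
    for l_t in loss_vectors:
        p_t = learner(history, n_experts)      # distribution over the N experts
        assert abs(sum(p_t) - 1.0) < 1e-9      # the p_t(i) must sum to 1
        total_loss += sum(p * l for p, l in zip(p_t, l_t))   # sum_i p_t(i) * l_t(i)
        history.append(l_t)
    return total_loss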

4 Experts: Goal
Match the loss of the best expert.
Loss:
– L_A: the cumulative loss of algorithm A
– L_i: the cumulative loss of expert i
Can we hope to do better?

5 Example: Guessing letters
Setting:
– Alphabet Σ of k letters
Loss:
– 1 for an incorrect guess
– 0 for a correct guess
Experts:
– Each expert always guesses one particular letter.
Game: guess the most popular letter online.

6 Example 2: Rock-Paper-Scissors
Two-player game. Each player chooses Rock, Paper, or Scissors.
Loss matrix (row = our choice, column = opponent's choice):
          Rock  Paper  Scissors
Rock      1/2   1      0
Paper     0     1/2    1
Scissors  1     0      1/2
Goal: play as well as we can against the given opponent.

7 Example 3: Placing a point
Action: choose a point d.
Loss (given the true location y): ||d − y||.
Experts: one for each point.
Important: the loss is convex.
Goal: find a "center".

8 Experts Algorithm: Greedy
For each expert define its cumulative loss: L_i^t = Σ_{s≤t} l_s(i).
Greedy: at time t, choose an expert with minimum loss so far, namely arg min_i L_i^t.
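
A sketch of Greedy as a learner for the protocol above (Python, illustrative names); ties are broken by the lowest index:

def greedy_learner(history, n_experts):
    # cumulative loss L_i^t of each expert over the rounds seen so far
    cumulative = [0.0] * n_experts
    for l_s in history:
        for i, loss in enumerate(l_s):
            cumulative[i] += loss
    best = min(range(n_experts), key=lambda i: cumulative[i])   # arg min_i L_i^t
    p_t = [0.0] * n_experts
    p_t[best] = 1.0            # Greedy puts all its probability on the current best expert
    return p_t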

9 Greedy Analysis
Theorem: Let L_G^T be the loss of Greedy at time T; then L_G^T can be bounded in terms of the loss of the best expert.
Proof!

10 Better Expert Algorithms
Would like to bound the regret: L_A − min_i L_i.

11 Expert Algorithm: Hedge(b)
Maintains a weight vector w_t.
Probabilities: p_t(k) = w_t(k) / Σ_j w_t(j).
Initialization: w_1(i) = 1/N.
Updates:
– w_{t+1}(k) = w_t(k) · U_b(l_t(k))
– where b in [0,1] and
– b^r ≤ U_b(r) ≤ 1 − (1−b)r
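
A sketch of Hedge(b) in Python, using the standard choice U_b(r) = b^r, which satisfies b^r ≤ U_b(r) ≤ 1 − (1−b)r for losses in [0,1]; names are illustrative:

def hedge(loss_vectors, n_experts, b=0.5):
    w = [1.0 / n_experts] * n_experts                 # w_1(i) = 1/N
    total_loss = 0.0
    for l_t in loss_vectors:
        z = sum(w)
        p_t = [wi / z for wi in w]                    # p_t(k) = w_t(k) / sum_j w_t(j)
        total_loss += sum(p * l for p, l in zip(p_t, l_t))
        w = [wi * (b ** l) for wi, l in zip(w, l_t)]  # w_{t+1}(k) = w_t(k) * b^{l_t(k)}
    return total_loss, w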

12 Hedge Analysis
Lemma: for any sequence of loss vectors, the final total weight Σ_i w_{T+1}(i) is bounded in terms of Hedge's cumulative loss.
Proof!
Corollary: Hedge's cumulative loss is bounded via the final total weight.
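
The formulas are missing from the transcript; in the standard Freund–Schapire analysis (presumably what the slide states), with L_{Hedge(b)} = Σ_t p_t · l_t:

\[
\sum_{i=1}^{N} w_{T+1}(i) \;\le\; \prod_{t=1}^{T}\bigl(1-(1-b)\,p_t\cdot l_t\bigr)
\;\le\; \exp\bigl(-(1-b)\,L_{Hedge(b)}\bigr),
\]
\[
\text{and hence}\qquad
L_{Hedge(b)} \;\le\; \frac{-\ln\bigl(\sum_{i} w_{T+1}(i)\bigr)}{1-b}.
\]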

13 Hedge: Properties
Bounding the weights: since U_b(r) ≥ b^r, each weight satisfies w_{T+1}(k) ≥ w_1(k) · b^{L_k} = (1/N) b^{L_k}.
Similarly for the total weight of any subset of experts.

14 Hedge: Performance
Let k be the expert with minimal loss.
Therefore, combining the corollary with the weight bound gives a bound on Hedge's loss (see below).
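
Combining the corollary with the weight lower bound w_{T+1}(k) ≥ (1/N) b^{L_k} from the previous slide gives the standard Hedge guarantee (presumably the slide's formula):

\[
\frac{1}{N}\,b^{\,L_k} \;\le\; \sum_i w_{T+1}(i) \;\le\; e^{-(1-b)\,L_{Hedge(b)}}
\quad\Longrightarrow\quad
L_{Hedge(b)} \;\le\; \frac{L_k \ln(1/b) + \ln N}{1-b}.
\]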

15 Hedge: Optimizing b
For b = 1/2 the bound has explicit constants; a better, loss-dependent selection of b removes the multiplicative constant in front of the best expert's loss (see below).
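
Specializing the bound above: b = 1/2 gives explicit constants, and the Freund–Schapire tuning of b (stated here up to its exact constants) trades the multiplicative factor for an additive square-root term:

\[
b=\tfrac12:\qquad L_{Hedge(1/2)} \;\le\; 2\ln 2\cdot \min_i L_i + 2\ln N \;\approx\; 1.39\,\min_i L_i + 2\ln N,
\]
\[
b=\frac{1}{1+\sqrt{2\ln N/T}}:\qquad
L_{Hedge(b)} \;\le\; \min_i L_i + \sqrt{2T\ln N} + \ln N .
\]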

16 Occam Razor

17 Finding the shortest consistent hypothesis
Definition: (α,β)-Occam algorithm
– α > 0 and β < 1
– Input: a sample S of size m
– Output: a hypothesis h such that
– for every (x,b) in S: h(x) = b
– size(h) < size(c_t)^α · m^β
– Efficiency (runs in polynomial time).

18 Occam algorithm and compression
[Diagram: party A holds the labeled sample S = {(x_i, b_i)}; both A and B know the points x_1, …, x_m; A must communicate the labels to B.]

19 Compression
Option 1:
– A sends B the values b_1, …, b_m
– m bits of information
Option 2:
– A sends B the hypothesis h
– Occam: for large enough m, size(h) < m
Option 3 (MDL):
– A sends B a hypothesis h and "corrections"
– complexity: size(h) + size(errors)

20 Occam Razor Theorem
A: an (α,β)-Occam algorithm for C using H.
D: a distribution over the inputs X.
c_t in C: the target function.
For a sample of sufficient size (derived on the next slide), with probability 1−δ, A(S) = h has error(h) < ε.

21 Occam Razor Theorem
Use the bound for a finite hypothesis class.
Effective hypothesis class size: 2^size(h).
size(h) < n^α · m^β.
Sample size: see the derivation below.
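
A sketch of the derivation (the standard Occam razor argument, constants omitted): the finite-class bound Pr[some consistent h has error > ε] ≤ |H| e^{−εm}, together with |H_eff| ≤ 2^{size(h)} ≤ 2^{n^α m^β}, requires

\[
m \;\ge\; \frac{1}{\varepsilon}\Bigl(n^{\alpha} m^{\beta}\ln 2 + \ln\tfrac{1}{\delta}\Bigr)
\quad\Longrightarrow\quad
m \;=\; O\!\left(\frac{1}{\varepsilon}\ln\frac{1}{\delta} \;+\; \Bigl(\frac{n^{\alpha}}{\varepsilon}\Bigr)^{\!1/(1-\beta)}\right).
\]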

22 Weak and Strong Learning

23 PAC Learning model
There exists a distribution D over the domain X.
Examples:
– use c for the target function (rather than c_t)
Goal:
– With high probability (1−δ)
– find h in H such that
– error(h,c) < ε
– ε arbitrarily small.

24 Weak Learning Model
Goal: error(h,c) < 1/2 − γ
The parameter γ is small:
– a constant, or
– 1/poly
Intuitively: a much easier task.
Question:
– Assume C is weakly learnable;
– is C then PAC (strongly) learnable?

25 Majority Algorithm
Hypothesis: h_M(x) = MAJ[ h_1(x), ..., h_T(x) ]
size(h_M) < T · size(h_t)
Using Occam Razor.

26 Majority: outline
Sample m examples.
Start with a distribution of 1/m per example.
Each round, modify the distribution and obtain h_t.
The final hypothesis is the majority vote.
Terminate when the sample is classified perfectly.

27 Majority: Algorithm
Use the Hedge algorithm.
The "experts" are associated with the sample points.
The loss of a point is 1 when it is classified correctly:
– l_t(i) = 1 − | h_t(x_i) − c(x_i) |
Set b as a function of the advantage γ.
h_M(x) = MAJORITY( h_1(x), ..., h_T(x) )
Q: How do we set T?
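
A sketch of this boosting-by-majority scheme in Python. weak_learner is an assumed callable that takes the sample, the labels (in {0,1}), and a distribution, and returns a hypothesis h with h(x) in {0,1}; b is left as a parameter since the slide sets it from the advantage γ:

def boost_by_majority(sample, labels, weak_learner, b, T):
    m = len(sample)
    w = [1.0 / m] * m                        # one weight ("expert") per sample point
    hyps = []
    for _ in range(T):
        z = sum(w)
        p = [wi / z for wi in w]             # distribution handed to the weak learner
        h = weak_learner(sample, labels, p)
        hyps.append(h)
        # loss of point i is 1 exactly when h classifies it correctly:
        # l_t(i) = 1 - |h_t(x_i) - c(x_i)|, so Hedge shrinks the weight of "easy" points
        losses = [1 - abs(h(x) - y) for x, y in zip(sample, labels)]
        w = [wi * (b ** l) for wi, l in zip(w, losses)]
    def h_M(x):                              # plain majority vote over h_1, ..., h_T
        votes = sum(h(x) for h in hyps)
        return 1 if 2 * votes > len(hyps) else 0
    return h_M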

28 Majority: Analysis
Consider the set of errors S:
– S = { i | h_M(x_i) ≠ c(x_i) }
For every i in S:
– L_i / T ≤ 1/2 (Proof!)
From the Hedge properties, this bounds the total weight of S.

29 Majority: Correctness
Error probability: follows from the bound on the weight of S.
Number of rounds: terminate once the error bound is less than 1/m; at that point no sample point can be misclassified.

30 AdaBoost: Dynamic Boosting
Better bounds on the error.
No need to "know" γ.
Each round uses a different b_t:
– as a function of the error ε_t.

31 AdaBoost: Input
Sample of size m.
A distribution D over the examples:
– we will use D(x_i) = 1/m.
A weak learning algorithm.
A constant T (number of iterations).

32 AdaBoost: Algorithm
Initialization: w_1(i) = D(x_i)
For t = 1 to T do:
– p_t(i) = w_t(i) / Σ_j w_t(j)
– Call the Weak Learner with p_t
– Receive h_t
– Compute the error ε_t of h_t on p_t
– Set b_t = ε_t / (1 − ε_t)
– w_{t+1}(i) = w_t(i) · (b_t)^e, where e = 1 − |h_t(x_i) − c(x_i)|
Output: a weighted majority vote over h_1, ..., h_T.
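
A sketch of the algorithm in Python. weak_learner is an assumed callable as before, labels are in {0,1}, 0 < ε_t < 1/2 is assumed so that b_t is well defined, and the output is the usual ln(1/b_t)-weighted majority vote:

import math

def adaboost(sample, labels, weak_learner, T):
    m = len(sample)
    w = [1.0 / m] * m                                   # w_1(i) = D(x_i) = 1/m
    hyps, betas = [], []
    for _ in range(T):
        z = sum(w)
        p = [wi / z for wi in w]                        # p_t(i) = w_t(i) / sum_j w_t(j)
        h = weak_learner(sample, labels, p)
        eps = sum(pi * abs(h(x) - y)                    # error of h_t on p_t
                  for pi, x, y in zip(p, sample, labels))
        beta = eps / (1.0 - eps)                        # b_t = eps_t / (1 - eps_t)
        hyps.append(h)
        betas.append(beta)
        # w_{t+1}(i) = w_t(i) * b_t^e  with  e = 1 - |h_t(x_i) - c(x_i)|
        w = [wi * beta ** (1 - abs(h(x) - y)) for wi, x, y in zip(w, sample, labels)]
    def h_A(x):                                         # weighted majority, weights ln(1/b_t)
        score = sum(math.log(1.0 / b) * h(x) for h, b in zip(hyps, betas))
        return 1 if score >= 0.5 * sum(math.log(1.0 / b) for b in betas) else 0
    return h_A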

33 AdaBoost: Analysis
Theorem:
– Given ε_1, ..., ε_T,
– the error ε of h_A is bounded as below.
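
The bound itself is missing from the transcript; the standard AdaBoost training-error bound (presumably what appears on the slide) is:

\[
\varepsilon \;\le\; 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_t\,(1-\varepsilon_t)}
\;=\;\prod_{t=1}^{T} 2\sqrt{\varepsilon_t\,(1-\varepsilon_t)} .
\]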

34 AdaBoost: Proof
Let l_t(i) = 1 − |h_t(x_i) − c(x_i)|.
By definition: p_t · l_t = 1 − ε_t.
Upper bound the sum of weights:
– from the Hedge analysis.
An error on x_i can occur only if the weighted vote is mostly wrong on x_i, which keeps the weight of x_i large.

35 AdaBoost: Analysis (cont.)
Bound the weight of a point.
Bound the sum of weights.
Final bound as a function of the b_t.
Optimizing b_t:
– b_t = ε_t / (1 − ε_t)

36 AdaBoost: Fixed bias
Assume ε_t = 1/2 − γ for every t.
We bound the error as follows.
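
With ε_t = 1/2 − γ, each factor in the bound above equals 2·sqrt((1/2−γ)(1/2+γ)) = sqrt(1−4γ²), so:

\[
\varepsilon \;\le\; \bigl(1-4\gamma^{2}\bigr)^{T/2} \;\le\; e^{-2\gamma^{2}T},
\qquad\text{so } T = O\!\Bigl(\tfrac{1}{\gamma^{2}}\ln\tfrac{1}{\varepsilon}\Bigr)\text{ rounds suffice.}
\]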

37 Learning OR with few attributes
Target function: an OR of k literals.
Goal: learn in time
– polynomial in k and log n
– with ε and δ constant.
ELIM makes "slow" progress:
– it disqualifies one literal per round
– and may remain with O(n) literals.

38 Set Cover - Definition
Input: S_1, …, S_t with S_i ⊆ U.
Output: S_{i_1}, …, S_{i_k} with ∪_j S_{i_j} = U.
Question: are there k sets that cover U?
NP-complete.

39 Set Cover: Greedy algorithm
j = 0; U_0 = U; C = ∅
While U_j ≠ ∅:
– Let S_i be arg max_i |S_i ∩ U_j|
– Add S_i to C
– Let U_{j+1} = U_j − S_i
– j = j + 1
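
A sketch of the greedy cover in Python (assuming the input sets jointly cover U, so the loop terminates):

def greedy_set_cover(universe, sets):
    # sets: a list of Python sets, each a subset of universe
    uncovered = set(universe)                               # U_j
    cover = []                                              # C
    while uncovered:
        best = max(sets, key=lambda s: len(s & uncovered))  # arg max_i |S_i ∩ U_j|
        cover.append(best)
        uncovered -= best                                   # U_{j+1} = U_j - S_i
    return cover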

40 Set Cover: Greedy Analysis
At termination, C is a cover.
Assume there is a cover C' of size k.
C' is a cover for every U_j, so some S in C' covers at least |U_j|/k elements of U_j.
Analysis of U_j: |U_{j+1}| ≤ |U_j| − |U_j|/k = (1 − 1/k)|U_j|.
Solving the recursion: |U_j| ≤ |U|(1 − 1/k)^j ≤ |U| e^{−j/k}, which drops below 1 once j ≥ k ln |U|.
Number of sets: j < k ln |U|.

41 Building an Occam algorithm
Given a sample S of size m:
– Run ELIM on S.
– Let LIT be the set of surviving literals.
– There exist k literals in LIT that classify all of S correctly.
Negative examples:
– any subset of LIT classifies them correctly.

42 Building an Occam algorithm
Positive examples:
– Search for a small subset of LIT
– which classifies S+ correctly.
– For a literal z build T_z = { x | z is satisfied by x }.
– There are k sets that cover S+,
– so greedy finds at most k ln m sets that cover S+.
Output h = the OR of the chosen k ln m literals.
size(h) < k ln m · log(2n).
Sample size: m = O( k log n · log(k log n) ).
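
A sketch of the whole construction in Python for ORs of literals over n Boolean variables (ELIM followed by the greedy cover). The encoding of a literal as a pair (i, v), satisfied when x[i] == v, is illustrative, and the sample is assumed consistent with some OR of literals so that every positive example can be covered:

def learn_or_occam(sample, labels, n):
    # sample: list of 0/1 vectors of length n; labels: 0/1 target values
    literals = [(i, v) for i in range(n) for v in (0, 1)]
    # ELIM: a literal satisfied by any negative example cannot appear in the target OR
    for x, y in zip(sample, labels):
        if y == 0:
            literals = [(i, v) for (i, v) in literals if x[i] != v]
    # Set cover over the positive examples: T_z = positives that satisfy literal z
    positives = [idx for idx, y in enumerate(labels) if y == 1]
    t_sets = {z: {idx for idx in positives if sample[idx][z[0]] == z[1]} for z in literals}
    uncovered, chosen = set(positives), []
    while uncovered:
        z = max(literals, key=lambda lit: len(t_sets[lit] & uncovered))
        chosen.append(z)
        uncovered -= t_sets[z]
    # Hypothesis: the OR of the chosen literals
    return lambda x: int(any(x[i] == v for (i, v) in chosen))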

