Experts and Boosting Algorithms
Experts: Motivation
Given a set of experts:
– No prior information
– No consistent behavior
– Goal: predict as well as the best expert
Model:
– Online model
– Input: historical results
Experts: Model
N strategies (experts). At time t:
– Learner A chooses a distribution over the N experts.
– Let p_t(i) be the probability of the i-th expert; clearly Σ_i p_t(i) = 1.
– A loss vector l_t is received.
– Loss at time t: Σ_i p_t(i) l_t(i)
Assume bounded losses: l_t(i) ∈ [0,1].
Experts: Goal
Match the loss of the best expert.
Loss:
– L_A: cumulative loss of learner A
– L_i: cumulative loss of expert i
Can we hope to do better?
Example: Guessing letters
Setting:
– Alphabet of k letters
Loss:
– 1 for an incorrect guess
– 0 for a correct guess
Experts:
– Each expert always guesses one fixed letter.
Game: guess the most popular letter online.
Example 2: Rock-Paper-Scissors
Two-player game. Each player chooses: Rock, Paper, or Scissors.
Loss matrix (rows: our play; columns: opponent's play):

            Rock   Paper   Scissors
  Rock      1/2    1       0
  Paper     0      1/2     1
  Scissors  1      0       1/2

Goal: play as well as we can given the opponent.
Example 3: Placing a point
Action: choosing a point d.
Loss (given the true location y): ||d − y||.
Experts: one for each point.
Important: the loss is convex.
Goal: find a "center".
Experts Algorithm: Greedy
For each expert define its cumulative loss: L_i^t = Σ_{s=1..t} l_s(i).
Greedy: at time t choose the expert with minimum cumulative loss so far, namely arg min_i L_i^{t−1}.
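The Greedy rule can be sketched in a few lines of Python; the function name and the list-of-loss-vectors interface are illustrative, not from the slides:

```python
def greedy_expert(loss_vectors):
    """Follow-the-leader: each round play the expert with the
    smallest cumulative loss observed so far (ties: lowest index)."""
    n = len(loss_vectors[0])
    cum = [0.0] * n          # L_i^{t-1}, cumulative loss per expert
    total = 0.0              # loss incurred by Greedy
    for losses in loss_vectors:
        best = min(range(n), key=lambda i: cum[i])   # arg min_i L_i^{t-1}
        total += losses[best]
        for i in range(n):
            cum[i] += losses[i]
    return total
```

On the alternating sequence [1,0], [0,1], [1,0], [0,1] Greedy loses every round (total 4) while each of the N = 2 experts loses only 2, illustrating why a deterministic leader-follower can be a factor of N worse than the best expert.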
Greedy Analysis
Theorem: Let L_G^T be the loss of Greedy at time T. Then L_G^T ≤ N (min_i L_i^T + 1).
Proof!
Better Expert Algorithms
Would like to bound the regret: L_A − min_i L_i.
Expert Algorithm: Hedge(β)
Maintains a weight vector w_t.
Probabilities: p_t(k) = w_t(k) / Σ_j w_t(j)
Initialization: w_1(i) = 1/N
Update:
– w_{t+1}(k) = w_t(k) · U_β(l_t(k))
– where β ∈ [0,1] and
– β^r ≤ U_β(r) ≤ 1 − (1−β) r
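A minimal sketch of Hedge(β) in Python, using the extreme update U_β(r) = β^r (which satisfies the stated bounds); the function name and interface are illustrative, not from the slides:

```python
import math

def hedge(loss_vectors, beta):
    """Hedge(beta) with multiplicative update U_beta(r) = beta**r.
    loss_vectors[t][i] is the loss of expert i at time t, in [0, 1].
    Returns the algorithm's total expected loss and the final weights."""
    n = len(loss_vectors[0])
    w = [1.0 / n] * n                                   # w_1(i) = 1/N
    total_loss = 0.0
    for losses in loss_vectors:
        s = sum(w)
        p = [wi / s for wi in w]                        # p_t(i) = w_t(i) / sum_j w_t(j)
        total_loss += sum(pi * li for pi, li in zip(p, losses))   # p_t . l_t
        w = [wi * beta ** li for wi, li in zip(w, losses)]        # w_{t+1}(i)
    return total_loss, w
```

With two experts, three rounds of loss vector (0, 1), and β = 1/2, the weight of the bad expert halves each round, so Hedge's total loss 1/2 + 1/3 + 1/5 = 31/30 stays well under the bound (ln N + L_min ln(1/β)) / (1−β) = 2 ln 2.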
Hedge Analysis
Lemma: For any sequence of losses, Σ_i w_{T+1}(i) ≤ exp(−(1−β) L_Hedge).
Proof!
Corollary: L_Hedge ≤ −ln( Σ_i w_{T+1}(i) ) / (1−β).
Hedge: Properties
Bounding the weights: w_{T+1}(i) = w_1(i) Π_t U_β(l_t(i)) ≥ w_1(i) β^{L_i}, where L_i = Σ_t l_t(i).
Similarly for a subset of experts.
Hedge: Performance
Let k be the expert with minimal loss: Σ_i w_{T+1}(i) ≥ w_{T+1}(k) ≥ (1/N) β^{L_k}.
Therefore L_Hedge ≤ ( ln N + L_k ln(1/β) ) / (1−β).
Hedge: Optimizing β
For β = 1/2 we have L_Hedge ≤ 2 ln N + (2 ln 2) L_k.
Better selection of β, given a bound L̃ ≥ L_k: take β = 1 / (1 + √(2 ln N / L̃)), which gives L_Hedge ≤ L_k + √(2 L̃ ln N) + ln N.
Occam Razor
Finding the shortest consistent hypothesis.
Definition: (α,β)-Occam algorithm
– α > 0 and β < 1
– Input: a sample S of size m
– Output: hypothesis h
– for every (x,b) in S: h(x) = b (h is consistent with S)
– size(h) ≤ size(c_t)^α · m^β
Efficiency.
Occam algorithm and compression
[Figure: two parties A and B; both know the points x_1, …, x_m, and A must communicate the labeled sample S = {(x_i, b_i)} to B.]
Compression
Option 1:
– A sends B the values b_1, …, b_m
– m bits of information
Option 2:
– A sends B the hypothesis h
– Occam: for large enough m, size(h) < m
Option 3 (MDL):
– A sends B a hypothesis h and "corrections"
– complexity: size(h) + size(errors)
Occam Razor Theorem
A: an (α,β)-Occam algorithm for C using H.
D: a distribution over inputs X; c_t ∈ C the target function.
Sample size: m = O( (1/ε) ln(1/δ) + (n^α / ε)^{1/(1−β)} )
Then with probability 1−δ, A(S) = h has error(h) < ε.
Occam Razor Theorem
Use the bound for a finite hypothesis class: m ≥ (1/ε)( ln |H| + ln(1/δ) ).
Effective hypothesis class size: 2^{size(h)}, with size(h) ≤ n^α m^β.
Sample size: solve m ≥ (1/ε)( n^α m^β ln 2 + ln(1/δ) ) for m.
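The "solve for m" step can be written out; a sketch of the standard calculation, assuming the finite-class bound and the size bound size(h) ≤ n^α m^β as on the slide:

```latex
% Finite hypothesis class bound: it suffices that
%   m \ge (1/\varepsilon)(\ln|H| + \ln(1/\delta)).
% With |H| \le 2^{\mathrm{size}(h)} and \mathrm{size}(h) \le n^{\alpha} m^{\beta}:
m \;\ge\; \frac{1}{\varepsilon}\Bigl( n^{\alpha} m^{\beta} \ln 2 \;+\; \ln\frac{1}{\delta} \Bigr)
% Since \beta < 1, the term n^{\alpha} m^{\beta}/\varepsilon grows sublinearly in m,
% so the inequality is satisfied once
m \;=\; O\!\left( \frac{1}{\varepsilon}\ln\frac{1}{\delta}
        \;+\; \left( \frac{n^{\alpha}}{\varepsilon} \right)^{\frac{1}{1-\beta}} \right)
```

The key point is that β < 1 makes the m^β term beatable: a polynomial sample size suffices whenever the output hypothesis is compressed relative to the sample.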
Weak and Strong Learning
PAC Learning model
There exists a distribution D over domain X.
Examples:
– use c for the target function (rather than c_t)
Goal:
– With high probability (1−δ)
– find h in H such that
– error(h,c) < ε
– for ε arbitrarily small.
Weak Learning Model
Goal: error(h,c) < 1/2 − γ
The parameter γ is small:
– a constant
– or 1/poly
Intuitively: a much easier task.
Question:
– Assume C is weakly learnable.
– Is C PAC (strongly) learnable?
Majority Algorithm
Hypothesis: h_M(x) = MAJ[ h_1(x), …, h_T(x) ]
size(h_M) ≤ T · size(h_t)
Using Occam Razor to bound the error.
Majority: outline
Sample m examples.
Start with a distribution of 1/m per example.
Modify the distribution and get h_t.
The hypothesis is the majority vote.
Terminate when the classification is perfect
– on the sample.
Majority: Algorithm
Use the Hedge algorithm.
The "experts" are associated with the sample points.
The loss is 1 when a point is classified correctly:
– l_t(i) = 1 − | h_t(x_i) − c(x_i) |
Setting β = 1 − γ.
h_M(x) = MAJORITY( h_i(x) )
Q: How do we set T?
Majority: Analysis
Consider the set of errors S:
– S = { i | h_M(x_i) ≠ c(x_i) }
For every i in S:
– L_i / T ≤ 1/2 (Proof!)
From the Hedge properties we bound |S|/m.
Majority: Correctness
Error probability: drops exponentially with the number of rounds.
Number of rounds: T = O( (1/γ²) ln m ) suffices.
Terminate when the error is less than 1/m (then the sample is classified perfectly).
AdaBoost: Dynamic Boosting
Better bounds on the error.
No need to "know" γ.
Each round uses a different β_t:
– as a function of the error ε_t.
AdaBoost: Input
Sample of size m.
A distribution D over the examples:
– we will use D(x_i) = 1/m.
Weak learning algorithm.
A constant T (number of iterations).
AdaBoost: Algorithm
Initialization: w_1(i) = D(x_i)
For t = 1 to T do:
– p_t(i) = w_t(i) / Σ_j w_t(j)
– Call the weak learner with p_t
– Receive h_t
– Compute the error ε_t of h_t on p_t
– Set β_t = ε_t / (1 − ε_t)
– w_{t+1}(i) = w_t(i) · (β_t)^e, where e = 1 − |h_t(x_i) − c(x_i)|
Output: h_A(x) = 1 iff Σ_t ln(1/β_t) h_t(x) ≥ (1/2) Σ_t ln(1/β_t)
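A runnable sketch of the loop above, assuming labels in {0,1} and a toy "weak learner" that just picks the best hypothesis from a fixed pool of decision stumps; the pool-based weak learner, the ε-clipping (to avoid β_t = 0 on a perfect round), and all names are illustrative additions, not part of the slides:

```python
import math

def adaboost(xs, ys, stumps, T):
    """AdaBoost with a pool of weak hypotheses.
    xs: examples; ys: labels in {0, 1}; stumps: functions x -> {0, 1}."""
    m = len(xs)
    w = [1.0 / m] * m                       # w_1(i) = D(x_i) = 1/m
    committee = []                          # pairs (h_t, beta_t)
    for _ in range(T):
        s = sum(w)
        p = [wi / s for wi in w]            # p_t(i) = w_t(i) / sum_j w_t(j)
        def werr(h):                        # weighted error of h under p_t
            return sum(pi for pi, x, y in zip(p, xs, ys) if h(x) != y)
        h = min(stumps, key=werr)           # "weak learner": best stump in the pool
        eps = min(max(werr(h), 1e-12), 1 - 1e-12)   # clip so beta_t is well defined
        beta = eps / (1 - eps)              # beta_t = eps_t / (1 - eps_t)
        committee.append((h, beta))
        # w_{t+1}(i) = w_t(i) * beta^e, e = 1 on correct points, 0 on mistakes
        w = [wi * (beta if h(x) == y else 1.0) for wi, x, y in zip(w, xs, ys)]

    def predict(x):
        # weighted majority vote with weights ln(1/beta_t)
        vote = sum(math.log(1 / b) * h(x) for h, b in committee)
        return 1 if vote >= 0.5 * sum(math.log(1 / b) for _, b in committee) else 0
    return predict
```

Correctly classified points have their weight multiplied by β_t < 1, so the next distribution p_{t+1} concentrates on the points h_t got wrong, exactly as in the Hedge-based majority scheme but with a per-round β_t.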
AdaBoost: Analysis
Theorem:
– Given ε_1, …, ε_T,
– the error of h_A is bounded by 2^T Π_t √( ε_t (1 − ε_t) ).
AdaBoost: Proof
Let l_t(i) = 1 − |h_t(x_i) − c(x_i)|.
By definition: p_t · l_t = 1 − ε_t.
Upper bound the sum of weights:
– from the Hedge analysis, Σ_i w_{T+1}(i) ≤ Π_t ( 1 − (1−β_t)(1−ε_t) ).
An error occurs on x_i only if Π_t β_t^{l_t(i)} ≤ ( Π_t β_t )^{1/2}.
AdaBoost: Analysis (cont.)
Bounding the weight of a point: w_{T+1}(i) = D(x_i) Π_t β_t^{l_t(i)}.
Bounding the sum of weights: Σ_i w_{T+1}(i) ≤ Π_t ( 1 − (1−β_t)(1−ε_t) ).
Final bound as a function of β_t.
Optimizing β_t:
– β_t = ε_t / (1 − ε_t)
AdaBoost: Fixed bias
Assume ε_t = 1/2 − γ for all t.
We bound the error of h_A.
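Under the fixed-bias assumption ε_t = 1/2 − γ for every round, the per-round factors 2√(ε_t(1−ε_t)) from the theorem multiply out directly (a standard computation, using 1 − x ≤ e^{−x}):

```latex
\prod_{t=1}^{T} 2\sqrt{\epsilon_t\,(1-\epsilon_t)}
  \;=\; \prod_{t=1}^{T} 2\sqrt{\Bigl(\tfrac12-\gamma\Bigr)\Bigl(\tfrac12+\gamma\Bigr)}
  \;=\; \bigl(1-4\gamma^2\bigr)^{T/2}
  \;\le\; e^{-2\gamma^2 T}
```

So T > ln(1/ε) / (2γ²) rounds drive the bound below ε, matching the O((1/γ²) ln m) rounds used by the majority algorithm to reach error below 1/m.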
Learning OR with few attributes
Target function: an OR of k literals.
Goal: learn in time
– polynomial in k and log n,
– with ε and δ constant.
ELIM makes "slow" progress:
– it disqualifies one literal per round,
– and may remain with O(n) literals.
Set Cover: Definition
Input: S_1, …, S_t, with S_i ⊆ U.
Output: S_{i_1}, …, S_{i_k} such that ∪_j S_{i_j} = U.
Question: Are there k sets that cover U?
NP-complete.
Set Cover: Greedy algorithm
j = 0; U_j = U; C = ∅
While U_j ≠ ∅:
– Let S_i be arg max_i |S_i ∩ U_j|
– Add S_i to C
– Let U_{j+1} = U_j − S_i
– j = j + 1
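The loop above translates almost line for line into Python; the function name and the list-of-sets interface are illustrative:

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: repeatedly take the set covering the most
    still-uncovered elements, until everything is covered."""
    uncovered = set(universe)       # U_j
    cover = []                      # C
    while uncovered:
        best = max(sets, key=lambda s: len(s & uncovered))   # arg max |S_i ∩ U_j|
        if not best & uncovered:
            raise ValueError("the sets do not cover the universe")
        cover.append(best)
        uncovered -= best           # U_{j+1} = U_j − S_i
    return cover
```

On U = {1,…,5} with sets {1,2,3}, {2,4}, {4,5}, {3,5}, greedy first takes {1,2,3} (3 new elements), then {4,5}, producing a cover of size 2, which here is optimal.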
Set Cover: Greedy Analysis
At termination, C is a cover.
Assume there is a cover C' of size k.
C' is a cover for every U_j, so some S in C' covers at least |U_j|/k elements of U_j.
Analysis of U_j: |U_{j+1}| ≤ |U_j| − |U_j|/k = |U_j| (1 − 1/k).
Solving the recursion: |U_j| ≤ |U| e^{−j/k}.
Number of sets: j ≤ k ln |U|.
Building an Occam algorithm
Given a sample S of size m:
– Run ELIM on S.
– Let LIT be the set of surviving literals.
– There exist k literals in LIT that classify all of S correctly.
Negative examples:
– any subset of LIT classifies them correctly.
Building an Occam algorithm
Positive examples:
– Search for a small subset of LIT
– which classifies S⁺ correctly.
– For a literal z build T_z = { x | z satisfies x }.
– There are k sets that cover S⁺.
– Greedy finds k ln m sets that cover S⁺.
Output h = the OR of the k ln m literals.
size(h) ≤ k ln m · log 2n
Sample size: m = O( k log n · log(k log n) )
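The two phases, ELIM on the negatives followed by greedy set cover on the positives, can be sketched end to end; the encoding of literals as (index, sign) pairs and the function name are illustrative choices, not from the slides:

```python
from itertools import product

def learn_or(samples, n):
    """ELIM + greedy set cover for learning an OR of literals.
    samples: list of (x, b) with x a 0/1 tuple of length n, b in {0, 1}.
    Literal (i, True) stands for x_i; (i, False) stands for NOT x_i."""
    lits = {(i, pos) for i in range(n) for pos in (True, False)}
    # ELIM: a literal satisfied by a negative example cannot be in the target OR
    for x, b in samples:
        if b == 0:
            lits -= {(i, pos) for (i, pos) in lits if x[i] == (1 if pos else 0)}
    positives = [x for x, b in samples if b == 1]
    # T_z = the positive examples that literal z satisfies
    T = {z: {x for x in positives if x[z[0]] == (1 if z[1] else 0)} for z in lits}
    uncovered = set(positives)
    chosen = []
    while uncovered:                # greedy set cover over S+
        z = max(lits, key=lambda z: len(T[z] & uncovered))
        chosen.append(z)
        uncovered -= T[z]
    return chosen                   # hypothesis h = OR of the chosen literals
```

Every chosen literal survives ELIM, so h rejects all negative examples, and the cover guarantees h accepts all positive ones; greedy keeps the number of chosen literals within a ln m factor of the k literals of the target.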