
Slide 1: Smooth Boosting By Using An Information-Based Criterion
Kohei Hatano, Kyushu University, Japan

Slide 2: Organization of this talk
1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary

Slide 3: Boosting
Methodology to combine prediction rules into a more accurate one.
E.g., learning a rule to classify web pages about "Drew Barrymore". The set of prediction rules = words; the labeled training data = web pages.
A single word rule such as "Barrymore?" (yes/no) is barely better than random (accuracy 51%!), since it also matches pages about "the Barrymore family" of Hollywood: John Drew Barrymore (her father), Jaid Barrymore (her mother), John Barrymore (her grandfather), Lionel Barrymore (her granduncle), Diana Barrymore (her aunt).
A combination of prediction rules (say, a majority vote over "Barrymore?", "Drew?", and "Charlie's Angels?") reaches 80% accuracy.

Slide 4: Boosting by filtering [Schapire 90], [Freund 95]
A boosting scheme that samples randomly from a (huge) data source: the boosting algorithm draws examples at random and accepts or rejects each one.
Advantage 1: the sample size can be determined adaptively.
Advantage 2: smaller space complexity for the sample: batch learning needs O(1/ε) examples, while boosting by filtering needs only polylog(1/ε) (ε: desired error).

Slide 5: Some known results
Boosting algorithms by filtering: Schapire's first boosting algorithm [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo & Watanabe 00], AdaFlat [Gavinsky 03].
- Criterion for choosing prediction rules: accuracy.
Are there any better criteria? A candidate: information-based criteria.
- Real AdaBoost [Schapire & Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost).
- Criterion for choosing prediction rules: mutual information.
- Sometimes faster than boosters using an accuracy-based criterion (experimental: [Schapire & Singer 99]; theoretical: [Hatano & Warmuth 03], [Hatano & Watanabe 04]).
- However, no boosting algorithm by filtering was known with an information-based criterion.

Slide 6: Our work
Our work combines the two threads: an efficient boosting-by-filtering algorithm that uses an information-based criterion, giving both the lower space complexity of filtering and faster convergence.

Slide 7: 1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary

Slide 8: Illustration of general boosting
Training data: (x_1, +1), (x_2, +1), (x_3, -1), (x_4, -1), (x_5, +1); distribution D_1.
1. Choose a prediction rule h_1 maximizing some criterion w.r.t. D_1.
2. Assign a coefficient to h_1 based on its quality.
3. Update the distribution: lower weight on examples h_1 predicts correctly, higher weight on those it gets wrong.

Slide 9: Illustration of general boosting (2)
Training data: (x_1, +1), (x_2, +1), (x_3, -1), (x_4, -1), (x_5, +1); distribution D_2.
1. Choose a prediction rule h_2 maximizing some criterion w.r.t. D_2.
2. Assign a coefficient to h_2 based on its weighted error.
3. Update the distribution (lower weight where h_2 is correct, higher where it is wrong).
Repeat this procedure for T rounds.

Slide 10: Illustration of general boosting (3)
The final prediction rule is a weighted majority vote of the chosen rules h_1, h_2, h_3, …: for an instance x, predict +1 if H(x) > 0, and predict -1 otherwise.
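The three-step loop of slides 8-10 can be written out as a short sketch (Python; the `choose` and `coefficient` arguments stand in for whatever criterion and coefficient rule a concrete booster plugs in, and all names here are illustrative, not from the talk):

```python
import math

def boost(examples, rules, T, choose, coefficient):
    """Generic boosting loop: maintain a distribution over the examples,
    pick a rule by some criterion, give it a coefficient, reweight."""
    m = len(examples)
    D = [1.0 / m] * m                      # D_1: uniform distribution
    chosen = []
    for _ in range(T):
        h = choose(rules, examples, D)     # 1. maximize criterion w.r.t. D_t
        a = coefficient(h, examples, D)    # 2. coefficient from h's quality
        chosen.append((a, h))
        # 3. lower weight where h is correct, higher where it is wrong
        D = [d * math.exp(-a * y * h(x)) for d, (x, y) in zip(D, examples)]
        Z = sum(D)
        D = [d / Z for d in D]
    def H(x):                              # final rule: weighted majority vote
        return 1 if sum(a * h(x) for a, h in chosen) > 0 else -1
    return H
```

Instantiating `choose` with edge maximization and `coefficient` with AdaBoost's formula recovers AdaBoost; swapping in other criteria gives the other boosters discussed below.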

Slide 11: Example: AdaBoost [Freund & Schapire 97]
Criterion for choosing prediction rules: the edge γ_t = Σ_i D_t(i) y_i h_t(x_i).
Coefficient: α_t = (1/2) ln((1 + γ_t)/(1 − γ_t)).
Update: D_{t+1}(i) ∝ exp(−y_i H_t(x_i)); the weight shrinks on correctly classified examples and grows on misclassified ones.
Problem: difficult examples (possibly noisy) may get too much weight.
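These two standard AdaBoost formulas can be sketched directly (illustrative Python; `ada_coefficient` and `ada_weight` are hypothetical helper names, and `margin` denotes y_i H_t(x_i)):

```python
import math

def ada_coefficient(edge):
    """AdaBoost coefficient for a rule with edge g = sum_i D_t(i) y_i h_t(x_i)."""
    return 0.5 * math.log((1 + edge) / (1 - edge))

def ada_weight(margin):
    """Unnormalized AdaBoost weight exp(-y_i H_t(x_i)); D_{t+1} is this
    weight renormalized over all examples. Note it is unbounded: a
    consistently misclassified (possibly noisy) example with a large
    negative margin gets exponentially large weight."""
    return math.exp(-margin)
```

The unboundedness of `ada_weight` on negative margins is exactly the problem the smooth boosters of the next slides address.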

Slide 12: Smooth boosting
D_1: the original distribution (e.g., uniform). D_t: the distribution constructed by the booster. "Smooth" means sup_x D_t(x)/D_1(x) is polynomially bounded.
Keeping the distribution smooth makes boosting algorithms:
- noise-tolerant: MadaBoost [Domingo & Watanabe 00] (statistical query model), SmoothBoost [Servedio 01] (malicious noise model), AdaFlat [Gavinsky 03] (agnostic boosting model);
- applicable in the boosting-by-filtering framework, since sampling from D_t can be simulated efficiently via sampling from D_1 (e.g., by rejection sampling).
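The rejection-sampling simulation mentioned on the slide might look like this sketch (Python; `filter_draw` and its arguments are illustrative names, assuming the smoothness bound B on sup_x D_t(x)/D_1(x) is known):

```python
import random

def filter_draw(draw_from_D1, ratio, B):
    """Simulate one draw from D_t using draws from D_1.
    ratio(x) must return D_t(x)/D_1(x); valid when ratio(x) <= B for
    all x. Each candidate is accepted with probability ratio(x)/B, so
    accepted examples are distributed exactly as D_t. The expected
    number of draws per accepted example is B, which is why the
    poly-bounded smoothness condition matters."""
    while True:
        x = draw_from_D1()
        if random.random() * B < ratio(x):   # accept with prob ratio(x)/B
            return x
```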

Slide 13: Example: MadaBoost [Domingo & Watanabe 00]
Criterion for choosing prediction rules: the edge, as in AdaBoost.
Update: the weight is ℓ(−y_i H_t(x_i)), AdaBoost's exponential weight truncated at 1.
As a result, D_t is 1/ε-bounded (ε: error of H_t).
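The truncated weight that keeps MadaBoost's distribution bounded can be sketched as follows (illustrative Python; `mada_weight` is a hypothetical name for the slide's ℓ(−y_i H_t(x_i))):

```python
import math

def mada_weight(margin):
    """MadaBoost weight: AdaBoost's exp(-y_i H_t(x_i)) capped at 1.
    Examples with nonpositive margin (misclassified) all get weight 1
    instead of an exploding exponential weight, which is what keeps
    D_t within a 1/eps factor of the original distribution."""
    return min(1.0, math.exp(-margin))
```

Compare with AdaBoost: for positive margins the two weights agree, and only the runaway growth on negative margins is cut off.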

Slide 14: Examples of other smooth boosters
LogitBoost [Friedman et al. 00]: logistic function. AdaFlat [Gavinsky 03]: stepwise linear function.

Slide 15: 1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary

Slide 16: Our new booster
Criterion for choosing prediction rules: the pseudo gain (instead of the edge).
Update: uses the truncated weight ℓ(−y_i H_t(x_i)), as in MadaBoost.
Still, D_t is 1/ε-bounded (ε: error of H_t).

Slide 17: Pseudo gain
Relation to the edge. Property: the pseudo gain of h is at least the square of its edge (by convexity of the square function).

Slide 18: Interpretation of pseudo gain
Maximizing the pseudo gain over h is equivalent to minimizing the conditional entropy of the labels given h, i.e., maximizing the mutual information between h and the labels.
But, … the entropy function here is NOT Shannon's entropy; it is defined with the Gini index.

Slide 19: Information-based criteria
Cf. Real AdaBoost and InfoBoost choose a prediction rule that maximizes the mutual information defined with the KM entropy [Kearns & Mansour 98].
Our booster (GiniBoost) chooses a prediction rule maximizing the mutual information defined with the Gini index.
Good news: the Gini index can be estimated efficiently via sampling!
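A Gini-index gain of the kind GiniBoost maximizes can be estimated from weighted data. The following is a sketch assuming the common binary convention G(p) = 2p(1−p) for the Gini impurity; the function names and this exact normalization are illustrative, not taken from the paper:

```python
def gini(p_pos):
    """Gini impurity of a binary label distribution (one common convention)."""
    return 2.0 * p_pos * (1.0 - p_pos)

def gini_gain(h, examples, D):
    """Mutual information between h and the labels, with entropy replaced
    by Gini impurity: G(Y) - sum_b Pr[h = b] * G(Y | h = b), where all
    probabilities are taken w.r.t. the distribution D over examples."""
    paired = list(zip(D, examples))
    def stats(items):
        w = sum(d for d, _ in items)          # total probability mass
        if w == 0.0:
            return 0.0, 0.0
        pos = sum(d for d, (_, y) in items if y == 1) / w
        return w, pos
    _, p = stats(paired)
    gain = gini(p)                            # unconditional impurity G(Y)
    for b in (+1, -1):
        w_b, p_b = stats([(d, e) for d, e in paired if h(e[0]) == b])
        gain -= w_b * gini(p_b)               # subtract conditional impurity
    return gain
```

A perfect rule drives both conditional impurities to zero, so its gain equals the unconditional impurity, while an uninformative rule has gain zero; such empirical averages are exactly the kind of quantity that can be estimated from a random sample.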

Slide 20: Convergence of the training error (GiniBoost)
Thm. Suppose that the training error of H_t is greater than ε for t = 1, …, T. Then the accumulated pseudo gain Σ_{t=1}^{T} γ_t(h_t) is O(1/ε).
Coro. Further, if γ_t(h_t) ≥ δ for every t, then train.err(H_T) ≤ ε within T = O(1/(εδ)) steps.

Slide 21: Comparison on convergence speed
(booster / # of iterations to get a final rule with error ε / comments)
MadaBoost [Domingo & Watanabe 00]: O(1/(εγ²)) — boosting by filtering; adaptive (no need to know γ); but needs technical assumptions.
SmoothBoost [Servedio 01]: O(1/(εγ²)) — boosting by filtering; not adaptive.
AdaFlat [Gavinsky 03]: O(1/(ε²γ²)) — boosting by filtering; adaptive.
GiniBoost (our result): O(1/(εδ)) ≤ O(1/(εγ²)) — boosting by filtering; adaptive.
AdaBoost [Freund & Schapire 97]: O(log(1/ε)/γ²) — adaptive; not boosting by filtering.
(δ: minimum pseudo gain; γ: minimum edge.)

Slide 22: Boosting-by-filtering version of GiniBoost (outline)
- Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation).
- An adaptive prediction-rule selector.
- A boosting algorithm in the PAC-learning sense.

Slide 23: 1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary

Slide 24: Experiments
Topic classification of Reuters news (Reuters-21578).
Binary classification for each of 5 topics (results are averaged).
10,000 examples; 30,000 words used as base prediction rules.
Algorithms run until they sample 1,000,000 examples in total; 10-fold cross-validation.

Slide 25: Test error over Reuters
Note: GiniBoost2 doubles the coefficients α_t[+1], α_t[-1] used in GiniBoost.

Slide 26: Execution time
Test error (%) and time (sec.) compared for AdaBoost (without sampling, run for 100 steps), MadaBoost, and GiniBoost.
GiniBoost is faster by about 4 times!
(Cf. a similar result without sampling for Real AdaBoost [Schapire & Singer 99].)

Slide 27: 1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary

Slide 28: Summary / Open problem
Summary: GiniBoost uses the pseudo gain (based on the Gini index) to choose base prediction rules, and shows faster convergence in the filtering scheme.
Open problem: theoretical analysis of noise tolerance.

Slide 29: Comparison on sample size
(# of sampled examples / # of accepted examples / time (sec.))
AdaBoost (w/o sampling, run for 100 steps): N/A / N/A / 1349
MadaBoost: 1,032,219 / 157,… / …
GiniBoost1: 1,039,943 / 156,… / …
GiniBoost2: 1,027,874 / 140,… / …
Observation: fewer accepted examples → faster selection of prediction rules.

Slide 30: Extension to boosting-by-filtering
Batch learning: learn over a training sample. The booster maintains a distribution D_t over the sample and chooses a prediction rule that maximizes the pseudo gain w.r.t. D_t.
Boosting by filtering: learn over the instance space directly via random sampling. Random examples drawn from the distribution D pass through a filter, so that accepted examples are distributed as D_t; the booster chooses a prediction rule that approximately maximizes the pseudo gain w.r.t. D_t.

Slide 31: Chernoff-type bound for pseudo gain

Slide 32: Complexity of GiniBoost_filt: query complexity and space complexity

Slide 33: Derivation of GiniBoost (1)

Slide 34: Derivation of GiniBoost (2)
