
1 Smooth Boosting by Using an Information-Based Criterion. Kohei Hatano, Kyushu University, Japan

2 Organization of this talk: 1. Introduction 2. Preliminaries 3. Our booster 4. Experiments 5. Summary

3 Boosting: a methodology to combine prediction rules into a more accurate one. E.g., learning a rule to classify web pages about "Drew Barrymore". The set of prediction rules is words, and the labeled training data are web pages. A single word rule such as "Barrymore?" (predict Yes/No by whether the word appears) can be only slightly better than random guessing, about 51%, because pages about "The Barrymore family" of Hollywood also match: John Drew Barrymore (her father), Jaid Barrymore (her mother), John Barrymore (her grandpa), Lionel Barrymore (her granduncle), Diana Barrymore (her aunt). A combination of prediction rules such as "Barrymore?", "Drew?", and "Charlie's Angels?" (say, by majority vote) reaches accuracy 80%.
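A toy sketch of this slide's idea in Python: word-presence rules combined by a plain majority vote. The word list and the example pages below are made-up illustrations, not data from the talk; only the mechanism is the point.

def word_rule(word):
    # predict +1 if the word occurs in the page text, else -1
    return lambda page: 1 if word in page.lower() else -1

rules = [word_rule(w) for w in ("barrymore", "drew", "charlie's angels")]

def majority_vote(page):
    # unweighted majority vote over the word rules
    votes = sum(rule(page) for rule in rules)
    return 1 if votes > 0 else -1

print(majority_vote("Drew Barrymore starred in Charlie's Angels"))   # +1
print(majority_vote("Lionel Barrymore, of the Barrymore family"))    # -1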

4 Boosting by filtering [Schapire 90], [Freund 95]: a boosting scheme in which the boosting algorithm samples randomly from (huge) data, accepting or rejecting each drawn example. Advantage 1: the sample size can be determined adaptively. Advantage 2: smaller space complexity for the sample; batch learning needs O(1/ε) examples, whereas boosting by filtering needs only polylog(1/ε) (ε: desired error).

5 Some known results. Boosting algorithms by filtering: Schapire's first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03]; their criterion for choosing prediction rules is accuracy. Are there any better criteria? A candidate: an information-based criterion, as in Real AdaBoost [Schapire&Singer 99] and InfoBoost [Aslam 00] (a simple version of Real AdaBoost); their criterion for choosing prediction rules is mutual information, and they are sometimes faster than boosters using an accuracy-based criterion (experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]). However, no boosting-by-filtering algorithm with an information-based criterion is known.

6 Our work: an efficient boosting-by-filtering algorithm that uses an information-based criterion, combining the lower space complexity of boosting by filtering with the faster convergence of information-based criteria.

7 1. Introduction 2. Preliminaries 3. Our booster 4. Experiments 5. Summary

8 Illustration of general boosting. Training data: (x_1,+1), (x_2,+1), (x_3,-1), (x_4,-1), (x_5,+1); the initial distribution D_1 puts weight 0.2 on each example. 1. Choose a prediction rule h_1 maximizing some criterion w.r.t. D_1. 2. Assign a coefficient to h_1 based on its quality (0.25 in the figure). 3. Update the distribution: examples that h_1 predicts correctly get lower weight, wrongly predicted ones get higher weight.

9 Illustration of general boosting (2). The updated distribution D_2 now puts unequal weights on the examples (0.16, 0.21, 0.26, ... in the figure). 1. Choose a prediction rule h_2 maximizing some criterion w.r.t. D_2. 2. Assign a coefficient to h_2 based on its weighted error (0.28 in the figure). 3. Update the distribution again, lowering the weight of correctly predicted examples and raising that of wrongly predicted ones. Repeat this procedure for T rounds.

10 Illustration of general boosting (3). Final prediction rule = weighted majority vote of the chosen prediction rules, e.g., H(x) = 0.25 h_1(x) + 0.28 h_2(x) + 0.05 h_3(x); for an instance x, predict +1 if H(x) > 0 and predict -1 otherwise.
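The three steps of slides 8-10 amount to the generic loop sketched below. The choose_rule, coefficient, and update_weights arguments are placeholders for whatever criterion a concrete booster (AdaBoost, MadaBoost, GiniBoost, ...) plugs in; none of these names come from the talk.

import numpy as np

def boost(X, y, candidate_rules, choose_rule, coefficient, update_weights, T):
    """Generic boosting loop: labels y in {-1,+1}, candidate rules map x -> {-1,+1}."""
    n = len(y)
    d = np.full(n, 1.0 / n)            # D_1: uniform distribution over the training examples
    combined = []                      # list of (alpha_t, h_t)
    for t in range(T):
        h = choose_rule(candidate_rules, X, y, d)     # 1. pick h_t maximizing the criterion
        alpha = coefficient(h, X, y, d)               # 2. coefficient based on h_t's quality
        combined.append((alpha, h))
        d = update_weights(d, combined, X, y)         # 3. reweight: correct down, wrong up
        d = d / d.sum()                               # keep it a probability distribution
    def H(x):                                         # final rule: weighted majority vote
        return 1 if sum(a * h(x) for a, h in combined) > 0 else -1
    return H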

11 Example: AdaBoost [Freund&Schapire 97]. Criterion for choosing prediction rules: the edge. Coefficient and distribution update: based on an exponential weight in -y_i H_t(x_i), which grows on wrongly classified examples and shrinks on correct ones. Drawback: difficult examples (possibly noisy) may get too much weight.
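The formulas on this slide are images and not in the transcript; the quantities below are the textbook AdaBoost round, shown as a sketch for concreteness.

import numpy as np

def adaboost_round(d, h_pred, y):
    """One AdaBoost round over a weighted sample.
    d: current distribution over examples, h_pred: h_t(x_i) in {-1,+1}, y: labels in {-1,+1}."""
    edge = np.sum(d * y * h_pred)                 # criterion: the edge of h_t under d
    eps = (1.0 - edge) / 2.0                      # weighted error of h_t
    alpha = 0.5 * np.log((1.0 - eps) / eps)       # coefficient alpha_t
    d_new = d * np.exp(-alpha * y * h_pred)       # exponential update in -y_i H_t(x_i)
    return alpha, d_new / d_new.sum()             # renormalized distribution D_{t+1}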

12 Smooth boosting. Keeping the distribution "smooth", i.e., sup_x D_t(x)/D_1(x) is poly-bounded (D_1: original distribution, e.g. uniform; D_t: distribution constructed by the booster), makes boosting algorithms noise-tolerant: in the statistical query model (MadaBoost [Domingo&Watanabe 00]), the malicious noise model (SmoothBoost [Servedio 01]), and the agnostic boosting model (AdaFlat [Gavinsky 03]). Moreover, sampling from D_t can be simulated efficiently via sampling from D_1 (e.g., by rejection sampling), so smooth boosters are applicable in the boosting-by-filtering framework.
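A minimal sketch of the rejection-sampling simulation mentioned here, assuming smoothness in the form D_t(x) <= B * D_1(x) for a known bound B. The draw_from_D1 and weight callables are placeholders of mine, not code from the talk.

import random

def sample_from_Dt(draw_from_D1, weight, B):
    """Simulate one draw from D_t using draws from D_1, assuming D_t(x) <= B * D_1(x).
    weight(x) should return D_t(x)/D_1(x); a draw is accepted with probability weight(x)/B."""
    while True:
        x = draw_from_D1()                       # random example from the original source
        if random.random() < weight(x) / B:      # accept with prob. D_t(x) / (B * D_1(x))
            return x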

13 Example: MadaBoost [Domingo & Watanabe 00]. Criterion for choosing prediction rules: the edge. Coefficient and update: based on a weight function l(-y_i H_t(x_i)) that, unlike AdaBoost's exponential, is capped; as a result, D_t is 1/ε-bounded (ε: error of H_t).
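The weight function plotted on this slide is an image; MadaBoost is usually stated with the capped exponential below, so treat this as a sketch under that assumption rather than a formula copied from the talk.

import numpy as np

def madaboost_weight(margin):
    """MadaBoost-style weight as commonly cited: exp(-margin) capped at 1,
    where margin = y_i * H_t(x_i). The cap is what keeps the distribution smooth."""
    return np.minimum(1.0, np.exp(-margin))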

14 Examples of other smooth boosters: LogitBoost [Friedman et al. 00], whose weight is a logistic function, and AdaFlat [Gavinsky 03], whose weight is a stepwise linear function.
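For contrast with MadaBoost's cap, here are sketches of the two weight shapes named on this slide. The logistic form is the standard LogitBoost weight; the stepwise linear form (and its threshold theta) is a schematic stand-in of mine, since AdaFlat's exact function is not in the transcript.

import numpy as np

def logistic_weight(margin):
    # LogitBoost-style weight: a logistic function of the margin y_i * H_t(x_i).
    return 1.0 / (1.0 + np.exp(margin))

def stepwise_linear_weight(margin, theta=0.5):
    # Schematic stepwise linear weight: 1 on badly classified points,
    # 0 beyond margin theta (a made-up parameter), linear in between.
    return np.clip((theta - margin) / theta, 0.0, 1.0)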

15 1. Introduction 2. Preliminaries 3. Our booster 4. Experiments 5. Summary

16 Our new booster. Criterion for choosing prediction rules: the pseudo gain. Coefficient and update: based on a weight function l(-y_i H_t(x_i)); still, D_t is 1/ε-bounded (ε: error of H_t).

17 Pseudo gain. Relation to the edge γ: the pseudo gain Γ satisfies γ² ≤ Γ ≤ γ (by convexity of the square function).

18 Interpretation of pseudo gain: maximizing the pseudo gain over h corresponds to minimizing the conditional entropy of the labels given h, i.e., maximizing the mutual information between h and the labels; but the entropy function here is NOT Shannon's entropy, it is defined with the Gini index.

19 Information-based criteria. Our booster chooses a prediction rule maximizing the mutual information defined by the Gini index (GiniBoost). Cf. Real AdaBoost and InfoBoost choose a prediction rule that maximizes the mutual information defined with the KM entropy [Kearns & Mansour 98]. Good news: the Gini index can be estimated efficiently via sampling!
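A sketch of the Gini-based mutual information of a candidate rule, computed from a weighted set of examples. The binary Gini index 2p(1-p) is standard; the exact normalization GiniBoost uses is not in the transcript, so this is illustrative rather than the paper's definition. Because it is a sum of simple weighted averages, it can also be estimated from a random (filtered) sample, which is the point of the slide.

import numpy as np

def gini(p_pos):
    # Binary Gini index (impurity) of a label distribution with P(y = +1) = p_pos.
    return 2.0 * p_pos * (1.0 - p_pos)

def gini_gain(d, h_pred, y):
    """Gini-based gain of a rule h on a weighted sample: label impurity minus
    the impurity remaining after splitting on h(x).
    d: weights summing to 1 (numpy array), h_pred: h(x_i) in {-1,+1}, y: labels in {-1,+1}."""
    p_pos = np.sum(d[y == 1])
    total = gini(p_pos)
    remaining = 0.0
    for b in (-1, 1):                        # the two branches h(x) = b
        mask = (h_pred == b)
        w = np.sum(d[mask])
        if w > 0:
            p_b = np.sum(d[mask & (y == 1)]) / w
            remaining += w * gini(p_b)       # conditional Gini index given h(x) = b
    return total - remaining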

20 Convergence of training error (GiniBoost). Theorem: suppose that the training error of H_t is greater than ε for t = 1,…,T; then the training error of H_T is bounded in terms of the pseudo gains Γ_1,…,Γ_T (the bound is displayed as a formula on the slide). Corollary: further, if Γ_t(h_t) ≥ Γ for all t, then train.err(H_T) ≤ ε within T = O(1/(Γε)) steps.

21 Comparison on convergence speed: number of iterations to get a final rule with error ε (γ: minimum edge, Γ: minimum pseudo gain).
– MadaBoost [Domingo&Watanabe 00]: O(1/(εγ²)); boosting by filtering, adaptive (no need to know γ), but needs technical assumptions.
– SmoothBoost [Servedio 01]: O(1/(εγ²)); boosting by filtering, but not adaptive.
– AdaFlat [Gavinsky 03]: O(1/(ε²γ²)); boosting by filtering, adaptive.
– GiniBoost (our result): O(1/(εΓ)) ≤ O(1/(εγ²)); boosting by filtering, adaptive.
– AdaBoost [Freund&Schapire 97]: O(log(1/ε)/γ²); adaptive, but not boosting by filtering.

22 Boosting-by-filtering version of GiniBoost (outline): multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation); an adaptive prediction rule selector; a boosting algorithm in the PAC learning sense.

23 1. Introduction 2. Preliminaries 3. Our booster 4. Experiments 5. Summary

24 Experiments. Topic classification of Reuters news (Reuters-21578): binary classification for each of 5 topics (results are averaged); 10,000 examples; 30,000 words used as base prediction rules; algorithms run until they sample 1,000,000 examples in total; 10-fold CV.

25 Test error over Reuters (plot). Note: GiniBoost2 doubles the coefficients α_t[+1], α_t[-1] used in GiniBoost.

26 Execution time (test error %, time in sec.):
– AdaBoost (w/o sampling, run for 100 steps): 5.6%, 1349 sec.
– MadaBoost: 6.7%, 493 sec.
– GiniBoost: 5.8%, 408 sec.
– GiniBoost2: 5.5%, 359 sec.
About 4 times faster than AdaBoost! (Cf. a similar result without sampling by Real AdaBoost [Schapire & Singer 99].)

27 1. Introduction 2. Preliminaries 3. Our booster 4. Experiments 5. Summary

28 Summary / Open problem. Summary: GiniBoost uses the pseudo gain (Gini index) to choose base prediction rules and shows faster convergence in the filtering scheme. Open problem: theoretical analysis of noise tolerance.

29 Comparison on sample size (# of sampled examples / # of accepted examples / time in sec.):
– AdaBoost (w/o sampling, run for 100 steps): N/A / N/A / 1349
– MadaBoost: 1,032,219 / 157,320 / 493
– GiniBoost1: 1,039,943 / 156,856 / 408
– GiniBoost2: 1,027,874 / 140,916 / 359
Observation: fewer accepted examples → faster selection of prediction rules.

30 Extension to boosting-by-filtering. Batch learning: learn over a training sample drawn from the instance space under distribution D; the boosting algorithm chooses a prediction rule that maximizes the pseudo gain w.r.t. D_t on the sample. Boosting-by-filtering: learn over the instance space directly via random sampling; random examples from D are filtered to simulate D_t, and the boosting algorithm chooses a prediction rule that approximately maximizes the pseudo gain w.r.t. D_t.
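A minimal sketch of the "approximately maximizes" step in the filtering view: estimate each candidate's gain from examples drawn through the filter and keep the best estimate. The fixed sample size m and the reuse of one filtered batch for all candidates are simplifications of mine, not the adaptive selector described in the talk.

def approx_best_rule(candidates, draw_filtered_example, gain_estimate, m=10000):
    """Pick the candidate rule with the largest estimated gain on m filtered examples.
    draw_filtered_example() should return (x, y) distributed (approximately) as D_t;
    gain_estimate(h, batch) returns an empirical gain of h on the batch."""
    batch = [draw_filtered_example() for _ in range(m)]   # one batch reused for all candidates
    best_h, best_gain = None, float("-inf")
    for h in candidates:
        g = gain_estimate(h, batch)
        if g > best_gain:
            best_h, best_gain = h, g
    return best_h, best_gain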

31 Chernoff-type bound for pseudo gain

32 Complexity of GiniBoost_filt: query complexity and space complexity.

33 Derivation of GiniBoost (1)

34 Derivation of GiniBoost (2)

