1 Web Taxonomy Integration through Co-Bootstrapping Dell Zhang National University of Singapore Wee Sun Lee National University of Singapore SIGIR’04

2 Introduction

3 Problem Statement
Master taxonomy:
Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
Games > Strategy: Shogun: Total War
Source taxonomy:
Games > Online: EverQuest Addict, Warcraft III Clan
Games > Single-Player: Warcraft III Clan
Desired result (source sites integrated into the master taxonomy):
Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan
Games > Strategy: Shogun: Total War, Warcraft III Clan

4 Possible Approach
Train on the master categories: Games > Roleplaying (Final Fantasy Fan, Dragon Quest Home), Games > Strategy (Shogun: Total War).
Classify the source sites: EverQuest Addict, Warcraft III Clan.
Drawback: this ignores the original Yahoo! categories.

5 Another Approach (1/2)
Use the Yahoo! categories as well.
Advantage: similar categories in the two taxonomies carry useful information.
Potential problem: the taxonomies have different structures, so categories do not match exactly.

6 Another Approach (2/2)
Example: Crayon Shin-chan
Yahoo!: Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan
Google: Arts > Animation > Anime > Titles > C > Crayon Shin-chan

7 This Paper’s Approach
1. Weak learner (as opposed to Naïve Bayes)
2. Boosting to combine weak hypotheses
3. New idea: co-bootstrapping to exploit source categories

8 Assumptions
Multi-category data are reduced to binary data:
"Totoro Fan: Cartoon > My Neighbor Totoro, Toys > My Neighbor Totoro" is converted into two examples, (Totoro Fan, Cartoon > My Neighbor Totoro) and (Totoro Fan, Toys > My Neighbor Totoro).
Hierarchies are ignored:
Console > Sega and Console > Sega > Dreamcast are treated as unrelated categories.
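The reduction above can be sketched in a few lines (a minimal illustration; the helper name is ours, not the paper's):

```python
def to_binary_pairs(doc, categories):
    """One (document, category) pair per assigned category; category paths
    are kept as atomic strings, so the hierarchy plays no role."""
    return [(doc, c) for c in categories]

pairs = to_binary_pairs("Totoro Fan",
                        ["Cartoon > My Neighbor Totoro",
                         "Toys > My Neighbor Totoro"])
# Hierarchies are ignored: these are two unrelated atomic labels.
parent, child = "Console > Sega", "Console > Sega > Dreamcast"
```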

9 Weak Learner
1. Weak Learner
2. Boosting
3. Co-Bootstrapping

10 Weak Learner
A type of classifier, similar to Naïve Bayes.
After training, the weak learner outputs a weak hypothesis: a term-based classifier that accepts (+) or rejects (-) a document for each category.
A term may be a word, an n-gram, or another feature.

11 Weak Hypothesis Example
Document contains "Crayon Shin-chan" → in "Comics > Crayon Shin-chan", not in "Education > Early Childhood".
Document does not contain "Crayon Shin-chan" → not in "Comics > Crayon Shin-chan", in "Education > Early Childhood".

12 Weak Learner Inputs (1/2)
Training data are in the form (x1, y1), (x2, y2), …, (xm, ym):
xi is a document; yi is a category; (xi, yi) means document xi is in category yi.
D(x, y) is a distribution over all combinations of xi and yj; D(xi, yj) indicates the "importance" of (xi, yj).
w is the term on which the weak hypothesis is based (found automatically).

13 Weak Learner Algorithm
For each possible category y, partition the training documents by whether they contain the term w, and compute four weights:
W+1 = Σ D(xi, y) over documents that contain w and are in y
W−1 = Σ D(xi, y) over documents that contain w and are not in y
W+0 = Σ D(xi, y) over documents that do not contain w and are in y
W−0 = Σ D(xi, y) over documents that do not contain w and are not in y
Note: a pair (xi, y) with greater D(xi, y) has more influence.

14 Weak Hypothesis h(x, y)
Given an unclassified document x and a category y:
If x contains w, then h(x, y) = ½ ln(W+1 / W−1).
Else, h(x, y) = ½ ln(W+0 / W−0).
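A minimal runnable sketch of this weak hypothesis for a single category (the eps smoothing of empty weights and the toy data are our assumptions; the paper's exact smoothing may differ):

```python
import math

def weak_hypothesis(term, docs, labels, D, eps=1e-6):
    """Term-based weak hypothesis for one category y.

    docs:   list of token sets
    labels: +1 (in y) / -1 (not in y)
    D:      weights D(x_i, y), one per document
    Returns h with h(x) = 0.5 * ln(W_+^j / W_-^j), where j = 1 if x
    contains the term and j = 0 otherwise.
    """
    W = {(j, s): eps for j in (0, 1) for s in (1, -1)}
    for x, y, d in zip(docs, labels, D):
        W[(1 if term in x else 0, y)] += d
    c = {j: 0.5 * math.log(W[(j, 1)] / W[(j, -1)]) for j in (0, 1)}
    return lambda x: c[1] if term in x else c[0]

docs = [{"crayon", "anime"}, {"school", "children"}, {"crayon", "comics"}]
labels = [+1, -1, +1]            # in "Comics > Crayon Shin-chan"?
D = [1 / 3] * 3                  # uniform importance
h = weak_hypothesis("crayon", docs, labels, D)
```

The sign of h(x) is the accept/reject decision and its magnitude the confidence, matching the slide above.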

15 Weak Learner Comments
If sign[h(x, y)] = +, then x is predicted to be in y; |h(x, y)| is the confidence.
The term w is found as follows: run the weak learner for every candidate w, and choose the run with the smallest normalizer value as the model.
Boosting then minimizes the probability of h(x, y) having the wrong sign.

16 Boosting (AdaBoost.MH)
1. Weak Learner
2. Boosting
3. Co-Bootstrapping

17 Boosting Idea
1. Train the weak learner on different distributions Dt(x, y).
2. After each run, adjust Dt(x, y) by putting more weight on the most often misclassified training data.
3. Output the final hypothesis as a linear combination of the weak hypotheses.

18 Boosting Algorithm
Given: (x1, y1), (x2, y2), …, (xm, ym), where xi ∈ X and yi ∈ Y
Initialize D1(x, y) = 1/(mk)
for t = 1, …, T do
  Pass distribution Dt to the weak learner
  Get weak hypothesis ht(x, y)
  Choose αt ∈ R
  Update Dt+1(x, y) = Dt(x, y) · exp(−αt · Y[x, y] · ht(x, y)) / Zt, where Y[x, y] = +1 if x is in y and −1 otherwise, and Zt is a normalization factor
end for
Output the final hypothesis f(x, y) = Σt αt ht(x, y)

19 Boosting Algorithm: Initialization
Given: (x1, y1), (x2, y2), …, (xm, ym)
Initialize D1(x, y) = 1/(mk), where k = total number of categories (a uniform distribution).

20 Boosting Algorithm: Loop
for t = 1, …, T do
  Run the weak learner using distribution Dt
  Get weak hypothesis ht(x, y)
  For each pair (x, y) in the training data: if ht(x, y) guesses incorrectly, increase D(x, y)
end for
Return the final hypothesis (the weighted combination of the ht).
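Putting the loop and the term-based weak learner together, here is a compact runnable sketch of AdaBoost.MH with real-valued weak hypotheses in the style of Schapire and Singer; αt is folded into the confidences, and the eps smoothing and toy data are our assumptions:

```python
import math

def adaboost_mh(docs, Y, cats, T=5, eps=1e-6):
    """Sketch of AdaBoost.MH over (document, category) pairs.

    docs: list of token sets; Y[i][c] = +1/-1; cats: list of categories.
    Each round picks the term whose decision stump minimizes the
    normalizer Z_t, with confidences c_j(y) = 0.5 ln(W_+^j(y)/W_-^j(y)).
    """
    m = len(docs)
    D = {(i, c): 1.0 / (m * len(cats)) for i in range(m) for c in cats}
    vocab = sorted(set().union(*docs))
    H = []  # chosen (term, confidences) per round
    for _ in range(T):
        best = None
        for w in vocab:
            # W[(j, c, s)]: total weight of docs with term-presence j,
            # category c, and true label s; eps-smoothed.
            W = {(j, c, s): eps for j in (0, 1) for c in cats for s in (1, -1)}
            for i, x in enumerate(docs):
                j = 1 if w in x else 0
                for c in cats:
                    W[(j, c, Y[i][c])] += D[(i, c)]
            Z = 2 * sum(math.sqrt(W[(j, c, 1)] * W[(j, c, -1)])
                        for j in (0, 1) for c in cats)
            if best is None or Z < best[0]:
                conf = {(j, c): 0.5 * math.log(W[(j, c, 1)] / W[(j, c, -1)])
                        for j in (0, 1) for c in cats}
                best = (Z, w, conf)
        _, w, conf = best
        H.append((w, conf))
        # Reweight: misclassified pairs gain weight, correct ones lose it.
        for i, x in enumerate(docs):
            j = 1 if w in x else 0
            for c in cats:
                D[(i, c)] *= math.exp(-Y[i][c] * conf[(j, c)])
        total = sum(D.values())
        D = {key: v / total for key, v in D.items()}
    # Final hypothesis: sum of the weak hypotheses' confidences.
    return lambda x, c: sum(cf[(1 if w in x else 0, c)] for w, cf in H)

docs = [{"final", "fantasy", "roleplaying"}, {"dragon", "quest", "roleplaying"},
        {"shogun", "total", "war", "strategy"}, {"warcraft", "war", "strategy"}]
cats = ["Games > Roleplaying", "Games > Strategy"]
Y = [{cats[0]: +1, cats[1]: -1}, {cats[0]: +1, cats[1]: -1},
     {cats[0]: -1, cats[1]: +1}, {cats[0]: -1, cats[1]: +1}]
h = adaboost_mh(docs, Y, cats, T=3)
```

The sign of h(x, c) is the final accept/reject decision for category c.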

21 Co-Bootstrapping
1. Weak Learner
2. Boosting
3. Co-Bootstrapping

22 Co-Bootstrapping Idea We want to use Yahoo! categories to increase classification accuracy

23 Recall Example Problem
Games > Online: EverQuest Addict, Warcraft III Clan
Games > Single-Player: Warcraft III Clan
Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
Games > Strategy: Shogun: Total War

24 Co-Bootstrapping Algorithm (1/4)
1. Run AdaBoost on Yahoo! sites; get classifier Y1.
2. Run AdaBoost on Google sites; get classifier G1.
3. Run Y1 on Google sites; get predicted Yahoo! categories for Google sites.
4. Run G1 on Yahoo! sites; get predicted Google categories for Yahoo! sites.

25 Co-Bootstrapping Algorithm (2/4)
5. Run AdaBoost on Yahoo! sites, including the predicted Google category as a feature; get classifier Y2.
6. Run AdaBoost on Google sites, including the predicted Yahoo! category as a feature; get classifier G2.
7. Run Y2 on the original Google sites; get more accurate Yahoo! categories for Google sites.
8. Run G2 on the original Yahoo! sites; get more accurate Google categories for Yahoo! sites.

26 Co-Bootstrapping Algorithm (3/4)
9. Run AdaBoost on Yahoo! sites, including the predicted Google category as a feature; get classifier Y3.
10. Run AdaBoost on Google sites, including the predicted Yahoo! category as a feature; get classifier G3.
11. Run Y3 on the original Google sites; get even more accurate Yahoo! categories for Google sites.
12. Run G3 on the original Yahoo! sites; get even more accurate Google categories for Yahoo! sites.

27 Co-Bootstrapping Algorithm (4/4) Repeat, repeat, and repeat… Hopefully, the classification will become more accurate after each iteration…
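The alternating procedure of slides 24-27 can be sketched as follows. The paper uses AdaBoost on each side; to keep the sketch self-contained, a trivial term-frequency scorer stands in for it, and the "SRC="/"DST=" indicator pseudo-terms, function names, and toy data are our assumptions:

```python
from collections import defaultdict

def train(examples):
    """Stand-in learner: term-frequency scorer over (token_set, category)
    pairs. The paper uses AdaBoost here; this toy keeps things runnable."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, c in examples:
        for w in x:
            counts[c][w] += 1
    cats = list(counts)
    return lambda x: max(cats, key=lambda c: sum(counts[c][w] for w in x))

def co_bootstrap(master, source, rounds=3):
    """Alternate between the taxonomies, each round feeding one side's
    predicted categories to the other side as indicator pseudo-terms."""
    m_feats = [x for x, _ in master]
    s_feats = [x for x, _ in source]
    M = S = None
    for _ in range(rounds):
        M = train(list(zip(m_feats, (c for _, c in master))))  # -> master cats
        S = train(list(zip(s_feats, (c for _, c in source))))  # -> source cats
        m_feats = [x | {"SRC=" + S(x)} for x in m_feats]
        s_feats = [x | {"DST=" + M(x)} for x in s_feats]
    def classify(x):
        # Augment a source document with its predicted source category,
        # then apply the master-side classifier.
        return M(x | {"SRC=" + S(x)})
    return classify

master = [({"final", "fantasy"}, "Games > Roleplaying"),
          ({"shogun", "war"}, "Games > Strategy")]
source = [({"everquest", "fantasy"}, "Games > Online"),
          ({"warcraft", "war"}, "Games > Online")]
clf = co_bootstrap(master, source, rounds=2)
```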

28 Enhanced Naïve Bayes (Benchmark)

29 Enhanced Naïve Bayes (1/2)
Given: a document x and the source category S of x. Predict the master category C.
In NB: Pr[C | x] ∝ Pr[C] · Πw∈x Pr[w | C]^n(x,w)
where w is a word and n(x, w) is the number of occurrences of w in x.
Enhanced NB replaces the prior: Pr[C | x, S] ∝ Pr[C | S] · Πw∈x Pr[w | C]^n(x,w)

30 Enhanced Naïve Bayes (2/2)
Pr[C] is estimated from the master training data.
Pr[C | S] is estimated using |C ∩ S|, the number of documents in S that are classified into C by the plain NB classifier.
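A hedged sketch of the Enhanced NB score in log space. The exact use of the ENB weighting exponent, the +1 smoothing of |C ∩ S|, the floor probability for unseen words, and the toy numbers are all our assumptions:

```python
import math
from collections import Counter

def enb_log_score(doc, prior_c, word_prob_c, n_cs, omega=1.0):
    """log-score of master category C for a document with source category S
    (Enhanced Naive Bayes sketch).

    prior_c     : Pr[C] from the master taxonomy
    word_prob_c : smoothed Pr[w | C] for each word w
    n_cs        : |C ∩ S|, docs of S that plain NB classified into C
    omega       : ENB weighting exponent (assumption about its exact use);
                  the +1 below avoids log(0) and is also an assumption.
    """
    score = math.log(prior_c) + omega * math.log(n_cs + 1)
    for w, n in Counter(doc).items():
        # unseen words get a tiny floor probability (assumption)
        score += n * math.log(word_prob_c.get(w, 1e-9))
    return score

prior = {"Comics": 0.5, "Education": 0.5}
word_probs = {"Comics": {"crayon": 0.4, "anime": 0.4},
              "Education": {"school": 0.4, "children": 0.4}}
doc = ["crayon", "anime"]
# Toy counts: 9 docs of the source category landed in Comics, 1 in Education.
s_comics = enb_log_score(doc, prior["Comics"], word_probs["Comics"], 9)
s_edu = enb_log_score(doc, prior["Education"], word_probs["Education"], 1)
```

The shifted prior pulls the prediction toward master categories that already absorbed many documents of the same source category.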

31 Experiment

32 Datasets
Dataset  Google                                Yahoo!
Book     /Top/Shopping/Publications/Books      /Business and Economy/Shopping and Services/Books/Bookstores
Disease  /Top/Health/Conditions and Diseases   /Health/Diseases and Conditions
Movie    /Top/Arts/Movies/Genres               /Entertainment/Movies and Film/Genres
Music    /Top/Arts/Music/Styles                /Entertainment/Music/Genres
News     /Top/News/By Subject                  /News and Media

33 Number of Categories*/Dataset (1/2)
Dataset  Google  Yahoo!
Book     49      41
Disease  30      51
Movie    34      25
Music    47      24
News     27      34
*Top-level categories only

34 Number of Categories*/Dataset (2/2)
Example (Book): categories such as Horror, Science Fiction, and Non-fiction are kept at the top level; subcategories such as Biography and History are merged into Non-fiction.

35 Number of Websites
Dataset  G       Y       G∪Y     G∩Y
Book     10,842  11,268  21,111  999
Disease  34,047  9,785   41,439  2,393
Movie    36,787  14,366  49,744  1,409
Music    76,420  24,518  95,971  4,967
News     31,504  19,419  49,303  1,620

36 Method (1/2)
Classify Yahoo! Book websites into Google Book categories (G ← Y):
1. Find G ∩ Y for Book.
2. Hide the Google categories of the sites in G ∩ Y.
3. Use G ∩ Y, with only its Yahoo! categories, as the Yahoo! Book data.
4. Randomly take |G ∩ Y| sites from G − Y as the Google Book data.

37 Method (2/2)
For each dataset, do G ← Y five times and Y ← G five times.
Macro F-score: calculate the F-score for each category, then average over all categories.
Micro F-score: calculate the F-score on the entire dataset.
(Recall = 100%? The paper doesn't say anything about multi-category ENB.)
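The two averaging schemes can be sketched as follows, in the single-label setting (which is why micro recall can look trivially high); the helper names and toy data are ours:

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(gold, pred, cats):
    """gold/pred: dict doc_id -> category (single-label simplification)."""
    counts = {c: [0, 0, 0] for c in cats}  # per-category [tp, fp, fn]
    for d in gold:
        g, p = gold[d], pred[d]
        if g == p:
            counts[g][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted category
            counts[g][2] += 1  # false negative for the true category
    macro = sum(f1(*counts[c]) for c in cats) / len(cats)
    micro = f1(sum(v[0] for v in counts.values()),
               sum(v[1] for v in counts.values()),
               sum(v[2] for v in counts.values()))
    return macro, micro

gold = {1: "A", 2: "A", 3: "B", 4: "B"}
pred = {1: "A", 2: "B", 3: "B", 4: "B"}
macro, micro = macro_micro_f(gold, pred, ["A", "B"])
```

Macro averaging weights every category equally, so small categories matter as much as large ones; micro averaging weights every document equally.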

38 Results (1/3)
Co-Bootstrapping-AdaBoost > AdaBoost
(charts of macro-averaged and micro-averaged F scores)

39 Results (2/3)
Co-Bootstrapping-AdaBoost iteratively improves AdaBoost.
(chart: F score vs. iteration on the Book dataset)

40 Results (3/3)
Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes
(charts of macro-averaged and micro-averaged F scores)

41 Contribution
Co-Bootstrapping improves Boosting performance.
Does not require tuning a parameter (the ω exponent in ENB).

