Presenter: Lung-Hao Lee (李龍豪), January 7, 2010 @ Room 309

1
Presenter: Lung-Hao Lee (李龍豪), January 7, 2010 @ Room 309

2
• Introduction
• Calculating Page Similarity
• Finding Similar Pages
  ◦ Click Data Model (CDM)
  ◦ Query Constraint (QC) algorithm
• Experimental Results
• Discussion
• Conclusion

3
• Large labor cost of annotating the data
• The aggregated click data across many users over time provides valuable information
• Leveraging click logs to augment training data by propagating class labels to unlabeled similar documents

4
• “Two pages that tend to be clicked by the same user queries tend to be topically similar”
[Figure: pages A and B are connected through the queries “How to tie a tie”, “How to tie a neck tie knots”, and “Tying a tie”. One page is labeled “Positive” (class “How-to”); the other has an unknown label: “Positive”?]

5
• A page is represented as a node in the similarity graph
• Normalize all the URLs, e.g. the following 4 URLs are treated as the same: (1) “” (2) “” (3) “” (4) “”
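The four example URLs did not survive in this transcript, so the exact normalization rules are unknown. As a minimal sketch, assuming that normalization means dropping the scheme, a leading "www.", a default index file, and a trailing slash, a hypothetical `normalize_url` helper might look like this:

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Map variants of one page to a single key (assumed rules, not from the slides)."""
    if "://" not in url:
        url = "http://" + url  # add a scheme so urlsplit can find the host
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    # Treat a default index file as the directory itself (assumption).
    for index_name in ("index.html", "index.htm"):
        if path.endswith("/" + index_name):
            path = path[: -len(index_name)]
    # Query strings and fragments are ignored in this sketch.
    return host + path.rstrip("/")


variants = [
    "http://www.example.com/",
    "http://example.com",
    "example.com/index.html",
    "HTTP://Example.com/",
]
keys = {normalize_url(u) for u in variants}
```

With these assumed rules, all four variants collapse to the same node key, which is the behavior the slide describes.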

6
• Each URL is represented as a vector of queries that users issued and clicked through to the page
Pantel & Lin (2002)
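A minimal sketch of building such query vectors from a click log. The `(query, url, clicks)` tuple format and the use of raw click counts as vector weights are assumptions; the slide does not say how the vector entries are weighted.

```python
from collections import Counter, defaultdict


def build_query_vectors(click_log):
    """Return {url: Counter({query: clicks})} from (query, url, clicks) tuples.

    The tuple layout and the click-count weighting are hypothetical.
    """
    vectors = defaultdict(Counter)
    for query, url, clicks in click_log:
        vectors[url][query] += clicks
    return vectors


log = [
    ("how to tie a tie", "a.com", 3),
    ("tying a tie", "a.com", 2),
    ("how to tie a tie", "b.com", 1),
    ("how to tie a tie", "a.com", 1),
]
vectors = build_query_vectors(log)
```

Each URL then owns a sparse vector keyed by query string, which is a convenient input for a similarity computation between pages.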

7
• Compute the similarity between two pages using the cosine similarity of their respective feature vectors
• sim(p1, p2) > sim(p1, p3)
• sim(p1, p2) > sim(p2, p3)
Because p1 and p2 share more common queries with each other than either shares with p3
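The comparison above can be sketched with sparse query vectors stored as dictionaries. The three example pages and their click counts are made up for illustration; only the cosine formula itself comes from the slide.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {query: weight} dicts."""
    dot = sum(w * v[q] for q, w in u.items() if q in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pages: p1 and p2 share two queries, p3 shares only one.
p1 = {"how to tie a tie": 3, "tying a tie": 2}
p2 = {"how to tie a tie": 1, "tying a tie": 1}
p3 = {"tying a tie": 1, "bow tie store": 4}

sim_12 = cosine_similarity(p1, p2)
sim_13 = cosine_similarity(p1, p3)
sim_23 = cosine_similarity(p2, p3)
```

In this toy data `sim_12` is the largest of the three values, matching the ordering stated on the slide.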

8
• What’s a “seed set”? A set of labeled data
• Two algorithms for seed set expansion
  ◦ Click Data Model (CDM)
  ◦ Query Constraints (QC) algorithm

9
• Two phases
  ◦ Updating score phase
  ◦ Filtering phase
• Input
  ◦ S1 (positive set)
  ◦ S2 (negative set)
  ◦ G (click graph)
• Output
  ◦ E1 (positive)
  ◦ E2 (negative)
• Thresholds
  ◦ 0.1 < T1 < 0.6
  ◦ 0.6 < T2 < 1.2
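The slide lists the inputs, outputs, and the two phases, but not the update rule or how T1 and T2 are applied. The sketch below is therefore only one plausible reading, not the authors' algorithm. It assumes that a node's score for a class is the sum of its similarities to that class's seeds, and that a node is kept when its winning score reaches a threshold. The names `expand_seed_set`, `t_pos`, and `t_neg` are hypothetical.

```python
def expand_seed_set(graph, positive_seeds, negative_seeds, t_pos=0.5, t_neg=0.5):
    """Expand labeled seeds over a click graph (illustrative assumptions only).

    graph: {node: {neighbor: similarity}} (hypothetical representation of G)
    Returns (expanded_positive, expanded_negative), standing in for E1 and E2.
    """
    labeled = set(positive_seeds) | set(negative_seeds)
    pos_score, neg_score = {}, {}

    # Updating score phase: each seed passes similarity mass to unlabeled neighbors.
    for seeds, scores in ((positive_seeds, pos_score), (negative_seeds, neg_score)):
        for seed in seeds:
            for neighbor, sim in graph.get(seed, {}).items():
                if neighbor in labeled:
                    continue
                scores[neighbor] = scores.get(neighbor, 0.0) + sim

    # Filtering phase: keep nodes whose winning score reaches its threshold.
    expanded_pos, expanded_neg = set(), set()
    for node in set(pos_score) | set(neg_score):
        p = pos_score.get(node, 0.0)
        n = neg_score.get(node, 0.0)
        if p >= t_pos and p > n:
            expanded_pos.add(node)
        elif n >= t_neg and n > p:
            expanded_neg.add(node)
    return expanded_pos, expanded_neg


toy_graph = {"a": {"b": 0.9, "c": 0.2}, "x": {"c": 0.7, "d": 0.3}}
e1, e2 = expand_seed_set(toy_graph, {"a"}, {"x"})
```

In the toy graph, node "b" is close to the positive seed and node "c" is closer to the negative seed, while "d" falls below the assumed threshold and stays unlabeled.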

10
• Additional module that checks whether the common queries between two nodes have certain term patterns
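A minimal sketch of such a check, assuming the "term pattern" is a simple case-insensitive substring such as "review" or "how to". The function name `passes_query_constraint` and the substring rule are assumptions; the slides do not define the pattern language.

```python
def passes_query_constraint(queries_a, queries_b, pattern):
    """True if at least one query shared by both pages contains the pattern."""
    common = set(queries_a) & set(queries_b)
    needle = pattern.lower()
    return any(needle in query.lower() for query in common)


page_a = {"digital camera reviews", "buy camera"}
page_b = {"digital camera reviews", "camera store"}
page_c = {"buy camera", "baby swing reviews"}

keep_ab = passes_query_constraint(page_a, page_b, "review")
keep_ac = passes_query_constraint(page_a, page_c, "review")
```

Here the pair (A, B) passes because the shared query mentions reviews, while (A, C) is rejected: their only shared query, "buy camera", does not match the pattern.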

11
• Reduce the amount of human annotation effort by leveraging the click data
• Build an expansion model with labeled training data and use it to select the next round of training data

12
• Click data
  ◦ Collected during December 2008 from the Yahoo! search engine
  ◦ Only the top 10 URLs are considered
  ◦ URLs with fewer than 10 clicks are excluded
• Three classification tasks
  ◦ How-to
  ◦ Adult
  ◦ Review
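The two click-data filters can be sketched as follows. The `(query, url, rank, clicks)` record layout is hypothetical. The sketch also assumes that "top 10" refers to the result rank and that the 10-click minimum applies to a URL's total clicks across queries, neither of which the slide spells out.

```python
from collections import Counter


def filter_click_log(click_log, max_rank=10, min_clicks=10):
    """Keep rows ranked in the top `max_rank` whose URL has >= `min_clicks` clicks.

    click_log: iterable of (query, url, rank, clicks) tuples (assumed layout).
    """
    top_rows = [row for row in click_log if row[2] <= max_rank]
    total_clicks = Counter()
    for _query, url, _rank, clicks in top_rows:
        total_clicks[url] += clicks
    return [row for row in top_rows if total_clicks[row[1]] >= min_clicks]


raw_log = [
    ("q1", "a.com", 1, 8),
    ("q2", "a.com", 3, 4),
    ("q1", "b.com", 2, 5),
    ("q3", "c.com", 12, 50),
]
kept = filter_click_log(raw_log)
```

In this toy log, "c.com" is dropped for its rank and "b.com" for having too few clicks, so only the two "a.com" rows remain.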

13
• Training sets
  ◦ 10,000 manually labeled positive and negative examples
  ◦ For the “review” classifier, queries such as “digital camera reviews” or “baby swing reviews”
  ◦ For the “How-to” classifier, queries such as “how to clean uggs” or “best way to loose weight”
• Testing sets

14
• Classifier
  ◦ Gradient Boosting Decision Tree (GBDT)
• Features
  ◦ Textual, Link, URL, HTML, Other features
• Metrics
  ◦ Area Under the ROC Curve (AUC) (Fawcett, 2003)
  ◦ F score
  ◦ Accuracy
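For reference, the three metrics can be computed from labels and classifier scores as sketched below. This assumes binary labels (1 = positive, 0 = negative) and takes "F score" to mean the balanced F1. The AUC uses the pairwise-ranking definition, which is fine for small examples but quadratic in the number of samples.

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (F1, accuracy) for binary labels; F1 is assumed for 'F score'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, correct / len(y_true)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is arbitrary
f1, accuracy = f1_and_accuracy(labels, predictions)
area = auc(labels, scores)
```

AUC depends only on the ranking of scores, whereas F score and accuracy depend on the chosen decision cutoff.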

15
• CDM yields a big improvement with a model that uses 5,000 labeled examples as the seed set (+1.07% in F-score, +0.81 in Accuracy, and +0.25% in AUC)

16
• Reduce the manual labor by 50%
• QC (exclude pages that do not have “review” in query terms) is useful when the amount of labeled data is small

17
• With 1,000 and 2,000 human-labeled examples, CDM performs worse than the baseline
• QC (exclude pages that do not have “How-to” in query terms)

18
• Baseline: Type A
• CDM: Type C

19
• From the “How-to” classifier
• Seed 1
• Seed 2 (human labels from Expand1)
• Expand2

20
• A random sample of 50 positive and 50 negative examples from the “how-to” classifier
• The positive class has 82.3% precision, whereas the negative class has 83.6% precision

21
• Is the proposed method always useful for web page classification?
• How can we improve the quality of automatically labeled data from unlabeled data?

22
• Present a method for improving web page classification by leveraging click data to augment training data
• Augment manually labeled data by modeling the similarity between pages in a click graph

23
• Thank you very much
• Questions & Answers
