Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen Microsoft Research Asia SIGKDD-2007, San Jose, California, USA.


1 Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen Microsoft Research Asia SIGKDD-2007, San Jose, California, USA

2 Outline: Introduction; Our approach; Experiments; Demo; Conclusion

3 Motivations

4 Motivations [Diagram labels: Page Generation Script (e.g., ASP, PHP, JSP); Database; Encoding; Wrapper; Decoding]

5 Related Work. Several automatic or semi-automatic wrapper learning methods have been proposed, e.g., WIEN[12], SoftMealy[11], Stalker[17], RoadRunner[6], EXALG[2], TTAG[4], the work in [18], and ViNTs[21]. Page clustering for wrapper induction has been treated as a trivial task. Manual: most previous work. Automatic but isolated from wrapper generation: RoadRunner[6,7] and [18].

6 Problems (cont.) Dynamic URLs: with the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it once was.

7 (a): www.amazon.com/gp/product/B000BNLGJA/ (b): www.amazon.com/gp/product/B00007J8SC/ (c): www.amazon.com/gp/product/B0000DD95R/ (d): www.amazon.com/gp/product/B0000A1AT9/

8 Problems. Dynamic URLs: with the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it once was. Complex templates: even when URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal.
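
To make the URL problem concrete, here is a minimal sketch of a URL-pattern grouping heuristic. The functions `url_pattern` and `group_by_pattern` and the "8+ uppercase letters or digits means an ID" rule are illustrative assumptions, not something described in the talk. Applied to the four Amazon URLs shown on the slides, it collapses them all into a single group, so the URL alone gives no signal about whether the underlying pages would be better served by one wrapper or several.

```python
import re
from collections import defaultdict

# Hypothetical rule: a path segment made only of 8+ uppercase letters/digits is an ID.
ID_SEGMENT = re.compile(r"[A-Z0-9]{8,}")

def url_pattern(url: str) -> str:
    """Replace ID-like path segments with '*' to obtain a coarse URL pattern."""
    segments = url.strip("/").split("/")
    return "/".join("*" if ID_SEGMENT.fullmatch(s) else s for s in segments)

def group_by_pattern(urls):
    """Group URLs that share the same coarse pattern."""
    groups = defaultdict(list)
    for u in urls:
        groups[url_pattern(u)].append(u)
    return dict(groups)

urls = [
    "www.amazon.com/gp/product/B000BNLGJA/",
    "www.amazon.com/gp/product/B00007J8SC/",
    "www.amazon.com/gp/product/B0000DD95R/",
    "www.amazon.com/gp/product/B0000A1AT9/",
]
groups = group_by_pattern(urls)
print(list(groups))  # a single pattern: ['www.amazon.com/gp/product/*']
```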

9 (c): www.amazon.com/gp/product/B0000DD95R/ (d): www.amazon.com/gp/product/B0000A1AT9/

10 Our Proposed Approach. Main idea: similarity-based templates instead of ground-truth templates. Advantages: more stable; optimizes the number of wrappers.

11 Outline: Introduction; Our approach; Experiments; Demo; Conclusion

12 Problem Definition

13 System Overview

14 Wrapper Generation [6, 4, 18]

15 Wrapper-DOM Distance: the distance between a wrapper and a DOM tree. Steps: tree alignment; cost calculation.
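
The slide names two steps, tree alignment and cost calculation, without giving the cost model. The following is a minimal sketch of one generic way to combine them, not the authors' definition. It assumes both the wrapper and the page are plain labeled trees written as `(tag, children)` tuples, that an unmatched subtree costs its node count, and that the result is normalized by the total size of both trees. The names `size`, `align_cost`, and `distance` are hypothetical.

```python
def size(node):
    """Number of nodes in a (tag, children) tree."""
    _tag, children = node
    return 1 + sum(size(c) for c in children)

def align_cost(a, b):
    """Top-down alignment cost: nodes left unmatched are charged one unit each."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        # Roots cannot be aligned: both subtrees are entirely unmatched.
        return size(a) + size(b)
    n, m = len(kids_a), len(kids_b)
    # dp[i][j] = cheapest alignment of the first i children of a with the first j of b.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),      # child of a unmatched
                dp[i][j - 1] + size(kids_b[j - 1]),      # child of b unmatched
                dp[i - 1][j - 1] + align_cost(kids_a[i - 1], kids_b[j - 1]),
            )
    return dp[n][m]

def distance(a, b):
    """Alignment cost normalized to [0, 1] by the total node count."""
    return align_cost(a, b) / (size(a) + size(b))

wrapper = ("div", (("h1", ()), ("img", ()), ("span", ())))
page = ("div", (("h1", ()), ("span", ())))
print(distance(wrapper, page))  # 1 unmatched node out of 7
```

A distance of 0 means the two trees align perfectly, and 1 means nothing aligns, which makes the value usable with a fixed threshold.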

16 Wrapper-Oriented Page Clustering (WPC) [Figure panels: (a) Level-1 Wrapper; (b) Level-2 Wrapper; (c) Level-3 Wrapper; (d) Level-4 Wrapper]

17 Outline: Introduction; Our approach; Experiments; Demo; Conclusion

18 Experiments. Data: 1,700 product pages from Amazon.com (Amazon); 1,000 mixed pages from 10 shopping sites (M10). Target product records: (name, image, price). Settings: 2-fold cross-validation. Evaluation measures: precision, recall, and F1.
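
For reference, the three evaluation measures can be computed over extracted records as sketched below. The exact-match criterion on whole (name, image, price) tuples, the function name `precision_recall_f1`, and the sample records are assumptions for illustration. The slides do not say how a correct extraction was judged.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall, and F1 of extracted records against gold records.

    A record counts as correct only if the whole tuple matches exactly
    (an assumed matching criterion).
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up records in (name, image, price) form.
gold = [
    ("Camera A", "a.jpg", "$199"),
    ("Camera B", "b.jpg", "$249"),
    ("Lens C", "c.jpg", "$99"),
    ("Tripod D", "d.jpg", "$35"),
]
extracted = [
    ("Camera A", "a.jpg", "$199"),
    ("Camera B", "b.jpg", "$249"),
    ("Lens C", "c.jpg", "$89"),  # wrong price, so the record does not match
]
p, r, f1 = precision_recall_f1(extracted, gold)
print(p, r, f1)  # precision 2/3, recall 1/2, F1 4/7
```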

19 Effectiveness Test Amazon: 44 wrappers, F1: 94.88% vs. 78% M10:

20 WPC with Different Thresholds

21 Stability Test. Objective: evaluate how the choice of the initial training page affects the performance of WPC.

22 Outline: Introduction; Our approach; Experiments; Demo; Conclusion

23 Demo! The Microsoft Office Excel 2007 Web Data Add-In is coming soon! Please give it a try in two weeks! http://blogs.msdn.com/xaw

24 Outline: Introduction; Our approach; Experiments; Demo; Conclusion

25 Conclusion. Our system: takes a miscellaneous training set as input; conducts template detection and wrapper generation in a single step; can achieve a joint optimization under the criterion of extraction accuracy. In the near future, we will extend the approach to handle templates that contain content strings.

26 Contacts: Ruihua Song (rsong@microsoft.com), Shuyi Zheng (shzheng@cse.psu.edu)

27 Poster No. 11. Looking forward to talking with you at Poster Reception II this evening!

29 Labeling Cost. Objective: show how many training pages are required to learn wrappers that achieve an F1 above 95%.

