Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma.

1 Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma Web Search & Mining Group Microsoft Research Asia 2009-04

2 Web Forum Data An important information resource containing a great deal of human knowledge. Topics include recreation, sports, games, computers, art, society, science, home, and health. 20% of the pages in search results come from forums.

3 Understanding Forum Crawling: WWW08 iRobot: An Intelligent Crawler for Web Forums; SIGIR08 Exploring Traversal Strategy; KDD09 Incremental Crawling. Data Extraction: WWW09, Automation Data Extraction. Quality Assessment: SIGIR09 Quality Assessment.

4 Challenge Leverage more site-level knowledge

5

6

7 Forum Sitemap A sitemap is a directed graph consisting of a set of vertices and the corresponding links. Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
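
The sitemap definition above can be sketched as a small directed-graph structure. This is only an illustration, assuming each vertex stands for a group of pages sharing a layout; the class name `Sitemap` and the vertex names (`board_list`, `thread_list`, `post_page`) are hypothetical and do not come from the slides.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are page groups, edges are links between them."""

    def __init__(self):
        self.vertices = set()
        self.links = defaultdict(set)  # source vertex -> set of target vertices

    def add_link(self, src, dst):
        # Register both endpoints and the directed edge src -> dst.
        self.vertices.update((src, dst))
        self.links[src].add(dst)

    def successors(self, vertex):
        # .get avoids creating an empty entry for unknown vertices.
        return sorted(self.links.get(vertex, ()))


# Hypothetical forum: board list -> thread list -> post page.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list")
sitemap.add_link("thread_list", "post_page")
sitemap.add_link("thread_list", "thread_list")  # pagination link back to the same vertex
```

A self-loop such as the pagination link is kept as an ordinary edge, which is why a general directed graph is used here rather than a tree.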

8 Page Clustering Forum pages are generated from a database and templates. Layout is a robust way to describe a template: it can be characterized by the HTML elements in different DOM paths.
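
As a rough sketch of how layout might be characterized by DOM paths, the snippet below counts how often each tag path occurs in a page and compares two pages by the overlap of those counts. The overlap measure and the names `DomPathCounter`, `layout_features`, and `layout_similarity` are assumptions made for illustration; the slides do not specify the actual features or clustering method.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomPathCounter(HTMLParser):
    """Counts occurrences of each root-to-element tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack and self.stack.pop() != tag:
                pass


def layout_features(html):
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(a, b):
    """Multiset overlap of DOM paths: shared count / combined count."""
    shared = sum((a & b).values())
    combined = sum((a | b).values())
    return shared / combined if combined else 1.0


list_a = "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr></table></body></html>"
list_b = "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table></body></html>"
other = "<html><body><div><p>x</p></div></body></html>"

same_template = layout_similarity(layout_features(list_a), layout_features(list_b))
different_template = layout_similarity(layout_features(list_a), layout_features(other))
```

Two pages rendered from the same template with a different number of rows still share most of their DOM paths, so `same_template` comes out higher than `different_template`.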

9 Page Clustering DOM path; feature discovery; clustering by virtual tables

10 Link Analysis A Link = URL Pattern + Location
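
One way to read "Link = URL Pattern + Location" is as a pair: a generalized form of the URL plus the DOM path where the anchor appears. The sketch below assumes a simple generalization rule (digits in the path and all query values become `*`); that rule, the function names, and the example URLs are hypothetical and not taken from the slides.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def url_pattern(url):
    """Generalize a URL: digits in the path and query values become '*'."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "*", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{k}=*" for k in keys)
    return path + ("?" + query if query else "")


def link_signature(url, dom_path):
    """A link is described by its URL pattern and its location in the page."""
    return (url_pattern(url), dom_path)


# Two thread links at the same location collapse to one signature.
sig_1 = link_signature("http://example.com/forum/showthread.php?t=123&page=2",
                       "html/body/table/tr/td/a")
sig_2 = link_signature("http://example.com/forum/showthread.php?t=456&page=7",
                       "html/body/table/tr/td/a")
```

Including the location keeps apart links that share a URL pattern but sit in different parts of the page, such as a thread title versus a "last post" shortcut.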

11

12 Inner-Page Features The inclusion relation: data records usually have inclusion relations. The alignment relation: since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page. Time order: since post records are generated sequentially along the timeline, post times should be sorted in ascending or descending order.
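
The time-order feature can be illustrated with a small check. This is a minimal sketch, assuming the candidate post times have already been parsed into `datetime` objects; the function name `is_time_ordered` is hypothetical.

```python
from datetime import datetime


def is_time_ordered(timestamps):
    """True if the sequence is entirely non-decreasing or non-increasing."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Hypothetical post times extracted from one page, in document order.
post_times = [
    datetime(2009, 4, 1, 10, 0),
    datetime(2009, 4, 1, 10, 5),
    datetime(2009, 4, 1, 11, 0),
]
ordered = is_time_ordered(post_times)
```

A set of candidate elements whose values fail this check is less likely to be the real post-time field, since dates in signatures or quoted text tend not to follow the page order.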

13 Inner-vertex Features

14 Inter-vertex Features

15

16 Problem Setting Author, Title, Content

17 Formulas of list page Formulas for identifying list record Formulas for identifying list title

18 Formulas of post page Formulas for identifying post record Formulas for identifying post author

19 Formulas of post page Formulas for identifying post time Formulas for identifying post content

20

21 Markov Logic Networks An MLN can be viewed as a template for constructing Markov Random Fields. With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
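
For reference, the standard form of this distribution in the MLN literature is given below. The symbols are the usual ones and are assumed here rather than taken from the slide: $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in state $x$, and $Z$ is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\left( \sum_{i} w_i \, n_i(x) \right),
\qquad
Z = \sum_{x'} \exp\left( \sum_{i} w_i \, n_i(x') \right)
```

A larger weight $w_i$ makes states that satisfy more groundings of formula $i$ exponentially more probable, so formulas act as soft constraints rather than hard ones.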

22 Markov Logic Networks Divide DOM tree elements into three categories: text element; hyperlink element; inner element. Benefits: reduce the number of possible groundings in inference; reduce the ambiguity and achieve better performance.
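
The three-way split of DOM elements might look like the sketch below. The classification rule is an assumption (text nodes are text elements, `<a>` tags are hyperlink elements, everything else is an inner element), as are the `Node` type and function names; the slides do not give the exact criteria.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str  # "#text" marks a text node
    children: list = field(default_factory=list)


def categorize(node):
    """Assumed rule: text node -> text, <a> -> hyperlink, otherwise inner."""
    if node.tag == "#text":
        return "text"
    if node.tag == "a":
        return "hyperlink"
    return "inner"


def count_categories(root):
    counts = {"text": 0, "hyperlink": 0, "inner": 0}
    stack = [root]
    while stack:  # iterative depth-first walk over the tree
        node = stack.pop()
        counts[categorize(node)] += 1
        stack.extend(node.children)
    return counts


# Hypothetical table cell: <td><a>title</a><span>author</span></td>
cell = Node("td", [Node("a", [Node("#text")]), Node("span", [Node("#text")])])
counts = count_categories(cell)
```

With such a split, a predicate that only makes sense for one category needs to be grounded over that category alone instead of over every node in the tree.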

23 Experiments List Pages, Post Pages

24 Experiments

25

26

27 Future work http://discussions.apple.com/

28 Conclusion A template-independent approach to extracting structured data from web forum sites. We can leverage the power of site-level information, such as the mutual information among pages and the inner-vertex and inter-vertex features of the sitemap. http://research.microsoft.com/people/jmyang/

