Published by Brandon McFarland. Modified over 10 years ago.
1
Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma Web Search & Mining Group Microsoft Research Asia 2009-04
2
Web Forum Data An important information resource containing a large amount of human knowledge. This information covers recreation, sports, games, computers, art, society, science, home, and health; 20% of the pages in search results are from forums
3
Understanding Forum – Crawling: WWW08 iRobot: An Intelligent Crawler for Web Forums; SIGIR08 Exploring Traversal Strategy; KDD09 Incremental Crawling – Data Extraction: WWW09, Automation Data Extraction – Quality Assessment: SIGIR09 Quality Assessment
4
Challenge Leverage more site-level knowledge
7
Forum Sitemap A sitemap is a directed graph consisting of a set of vertices and the links between them. Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
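The directed-graph definition above can be sketched as a small data structure. This is only an illustration: it assumes each vertex stands for a group of pages sharing a template, and the class name `Sitemap` and the vertex names (`board_list`, `thread_list`, `post_page`) are hypothetical, not taken from the slides.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are page groups, edges are links between them."""

    def __init__(self):
        self.vertices = set()
        self.edges = defaultdict(set)  # source vertex -> set of target vertices

    def add_link(self, source, target):
        # Register both endpoints and the directed edge source -> target.
        self.vertices.update((source, target))
        self.edges[source].add(target)

    def successors(self, vertex):
        # Vertices reachable from `vertex` in one step, in a stable order.
        return sorted(self.edges.get(vertex, ()))


# Hypothetical forum layout, for illustration only.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list")
sitemap.add_link("thread_list", "post_page")
sitemap.add_link("thread_list", "thread_list")  # pagination link back to the same vertex
```

A self-loop such as the pagination link is allowed here because the graph is directed and nothing in the definition rules it out.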
8
Page Clustering Forum pages are generated from a database and templates. Layout is a robust way to describe a template – Layout can be characterized by the HTML elements in different DOM paths
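One way to picture "layout characterized by HTML elements in different DOM paths" is to count how many elements occur at each root-to-element tag path, then compare pages by those counts. The sketch below is an assumption-laden illustration, not the authors' feature set: the cosine similarity, the helper names, and the sample pages are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomPathCounter(HTMLParser):
    """Counts how many elements appear at each DOM path (e.g. html/body/table/tr)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_path_features(html):
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(a, b):
    """Cosine similarity between two DOM-path count vectors (an assumed measure)."""
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


# Two pages from the same hypothetical template, and one from a different template.
page_a = "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr></table></body></html>"
page_b = "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table></body></html>"
page_c = "<html><body><div><p>x</p></div></body></html>"
```

Pages rendered from the same template share most DOM paths even when the number of records differs, so `page_a` and `page_b` score much closer to each other than either does to `page_c`.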
9
Page Clustering DOM Path – Feature Discovery – Clustering by Virtual Tables
10
Link Analysis A Link = URL Pattern + Location
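The "URL pattern + location" pairing can be sketched as a small record type. The generalization rule used here (wildcarding digit runs in the path and all query values) and the use of a DOM path as the location are assumptions made for illustration; the slide does not say how either part is computed, and the URLs are made up.

```python
import re
from typing import NamedTuple
from urllib.parse import urlsplit


class Link(NamedTuple):
    url_pattern: str
    location: str  # assumed: DOM path of the anchor element on the source page


def url_pattern(url):
    """Generalize a URL by wildcarding digits and query values (an assumed heuristic)."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "*", parts.path)
    query = re.sub(r"=[^&]*", "=*", parts.query)
    return f"{parts.netloc}{path}?{query}" if query else f"{parts.netloc}{path}"


def make_link(url, dom_path):
    return Link(url_pattern(url), dom_path)


# Hypothetical URLs: two thread links found at the same place in a list page.
first = make_link("http://forum.example.com/showthread.php?t=123&page=2", "html/body/table/tr/td/a")
second = make_link("http://forum.example.com/showthread.php?t=456&page=1", "html/body/table/tr/td/a")
footer = make_link("http://forum.example.com/showthread.php?t=789&page=1", "html/body/div/a")
```

Under these assumptions `first` and `second` collapse into the same link, while `footer` stays distinct because it shares the URL pattern but sits at a different location.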
12
Inner-Page Features – The inclusion relation: data records usually have inclusion relations. – The alignment relation: since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page. – Time order: since post records are generated sequentially along a timeline, post times should be sorted in ascending or descending order.
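The time-order feature lends itself to a direct check: given the candidate post times extracted from one page, test whether they are monotonic in either direction. This is a minimal sketch; the function names and the timestamp format are assumptions, and a real forum would need per-site date parsing.

```python
from datetime import datetime


def is_time_ordered(timestamps):
    """Return True if timestamps are sorted ascending or descending (ties allowed)."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


def parse_times(strings, fmt="%Y-%m-%d %H:%M"):
    # The format string is an assumed example; forums vary widely.
    return [datetime.strptime(s, fmt) for s in strings]


oldest_first = parse_times(["2009-04-01 09:00", "2009-04-01 09:30", "2009-04-02 08:15"])
newest_first = list(reversed(oldest_first))
shuffled = [oldest_first[1], oldest_first[0], oldest_first[2]]
```

A candidate set of elements whose parsed times fail this check (like `shuffled`) is less likely to be the true post-time field.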
13
Inner-vertex Features
14
Inter-vertex Features
16
Problem Setting Author, Title, Content
17
Formulas of list page Formulas for identifying list record Formulas for identifying list title
18
Formulas of post page Formulas for identifying post record Formulas for identifying post author
19
Formulas of post page Formulas for identifying post time Formulas for identifying post content
21
Markov Logic Networks An MLN can be viewed as a template for constructing Markov Random Fields. With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
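The equation itself did not survive in this transcript. For reference, the standard Markov logic network distribution (as defined by Richardson and Domingos) is given below; the symbols $w_i$, $n_i(x)$, and $Z$ are the usual ones from that definition and are assumed to match what the slide showed.

```latex
P(X = x) = \frac{1}{Z} \exp\!\left( \sum_{i} w_i \, n_i(x) \right),
\qquad
Z = \sum_{x'} \exp\!\left( \sum_{i} w_i \, n_i(x') \right)
```

Here $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in state $x$, and $Z$ normalizes over all possible states.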
22
Markov Logic Networks Divide DOM tree elements into three categories: – Text element – Hyperlink element – Inner element. Benefits: – Reduces the number of possible groundings in inference. – Reduces ambiguity and achieves better performance.
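The three-way split of DOM elements can be sketched as follows. The slide names the categories but not their definitions, so the rules here are assumptions: an `a` tag is a hyperlink element, a leaf with non-empty text is a text element, and everything else is an inner element. The sample markup is hypothetical and must be well-formed, since the sketch uses an XML parser.

```python
import xml.etree.ElementTree as ET


def categorize(element):
    """Assign one of three assumed categories to a DOM element."""
    if element.tag == "a":
        return "hyperlink"
    # A leaf (no child elements) carrying visible text.
    if len(element) == 0 and (element.text or "").strip():
        return "text"
    # Containers, and leaves without text, fall through to "inner".
    return "inner"


def categorize_tree(markup):
    root = ET.fromstring(markup)
    # iter() walks the tree in document order, starting with the root itself.
    return [(el.tag, categorize(el)) for el in root.iter()]


# Hypothetical post fragment, for illustration only.
snippet = "<div><span>alice</span><a href='/t/1'>Re: hello</a><p>body text</p></div>"
categories = categorize_tree(snippet)
```

Typing elements this way means a formula about, say, post authors only has to be grounded over the elements of the relevant category rather than over every node in the tree.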
23
Experiments List Pages, Post Pages
24
Experiments
27
Future work http://discussions.apple.com/
28
Conclusion A template-independent approach to extracting structured data from web forum sites. We can leverage the power of site-level information, such as the mutual information among pages, both within a vertex (inner-vertex) and across vertices (inter-vertex) of the sitemap. http://research.microsoft.com/people/jmyang/