Presentation on theme: "Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma."— Presentation transcript:
Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma Web Search & Mining Group Microsoft Research Asia
Web Forum Data An important information resource with a lot of human knowledge. This information covers recreation, sports, games, computers, art, society, science, home, health; 20% of the pages in search results are from forums
Understanding Forum: Crawling, Data Extraction, Quality Assessment. WWW08 iRobot: An Intelligent Crawler for Web Forums; SIGIR08 Exploring Traversal Strategy; KDD09 Incremental Crawling; WWW09, Automation Data Extraction; SIGIR09 Quality Assessment
Challenge Leverage more site-level knowledge
Forum Sitemap A sitemap is a directed graph consisting of a set of vertices and the corresponding links. Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
Page Clustering Forum pages are generated from a database and templates. Layout is a robust way to describe a template – layout can be characterized by the HTML elements in different DOM paths
Page Clustering DOM Path Feature Discovery Clustering by Virtual Tables
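To make the idea of characterizing layout by DOM paths concrete, here is a minimal sketch. It counts how many elements occur at each root-to-element tag path and compares two pages by cosine similarity. The names `DomPathCounter`, `dom_path_features`, and `layout_similarity` are hypothetical, and cosine similarity is an assumption chosen for illustration; the slides do not state which features or distance the authors actually use.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomPathCounter(HTMLParser):
    """Counts how many elements occur at each root-to-element tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []          # tags of the currently open elements
        self.paths = Counter()   # DOM path -> number of elements seen there

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def dom_path_features(html):
    """Return a Counter mapping each DOM path to its element count."""
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(a, b):
    """Cosine similarity between two DOM-path count vectors (illustrative choice)."""
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Under this sketch, two list pages rendered from the same template share most DOM paths and score close to 1, while a list page and a post page score much lower, which is the property a clustering step could exploit.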
Link Analysis A Link = URL Pattern + Location
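As an illustration of treating a link as a URL pattern plus a location, the sketch below generalizes a URL by replacing digit runs in the path and all query values with a wildcard, then pairs the result with the DOM path where the link was found. The functions `url_pattern` and `link_signature`, the wildcard rule, and the example URLs are assumptions for illustration, not the authors' procedure.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Generalize a URL: digit runs in the path and all query values become '*'."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "*", parts.path)
    # Keep only the parameter names, sorted so parameter order does not matter.
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{k}=*" for k in keys)
    return path + ("?" + query if query else "")


def link_signature(url, dom_path):
    """A link is described by its URL pattern and its location (DOM path) in the page."""
    return (url_pattern(url), dom_path)
```

With this signature, two thread links at the same position in a list page map to the same key, while a link with the same URL pattern in a different page region stays distinct.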
Inner-Page Features The inclusion relation. Data records usually have inclusion relations (for example, a post record contains its author, time, and content). The alignment relation. Since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page. Time order. Since post records are generated sequentially along a timeline, post times should appear in ascending or descending order.
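The time-order feature can be checked mechanically. The sketch below tests whether a sequence of candidate post times is monotonic in either direction. The function names and the timestamp format are assumptions for illustration; real forum pages use many date formats.

```python
from datetime import datetime


def parse_times(texts, fmt="%Y-%m-%d %H:%M"):
    """Parse timestamp strings; the default format is an assumed example."""
    return [datetime.strptime(t, fmt) for t in texts]


def is_time_ordered(timestamps):
    """True if the sequence is entirely non-decreasing or entirely non-increasing."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending
```

A set of aligned elements whose parsed values fail this check is unlikely to be the post-time field, so the check can serve as supporting evidence alongside the inclusion and alignment relations.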
Problem Setting Author, Title, Content
Formulas of list page Formulas for identifying list record Formulas for identifying list title
Formulas of post page Formulas for identifying post record Formulas for identifying post author
Formulas of post page Formulas for identifying post time Formulas for identifying post content
Markov Logic Networks An MLN can be viewed as a template for constructing Markov Random Fields. With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
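For reference, the standard MLN formulation gives this probability as follows. That the slide used exactly this notation is an assumption; the symbols $w_i$, $n_i$, and $Z$ are the conventional ones.

```latex
P(X = x) = \frac{1}{Z} \exp\!\left( \sum_{i} w_i \, n_i(x) \right),
\qquad
Z = \sum_{x'} \exp\!\left( \sum_{i} w_i \, n_i(x') \right)
```

Here $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in state $x$, and $Z$ is the normalizing partition function summed over all possible states.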
Markov Logic Networks Divide DOM tree elements into three categories: – Text element – Hyperlink element – Inner element Benefits: – Reduces the number of possible groundings in inference. – Reduces ambiguity and achieves better performance.
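A minimal sketch of the three-way split follows. The slide does not define the categories, so the rules here are assumptions: a hyperlink element is an `a` tag with an `href`, a text element is a leaf with non-empty text, and everything else is an inner element. The function names are hypothetical, and the input is assumed to be well-formed markup so the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET


def categorize(element):
    """Assign one of three assumed categories to a DOM element."""
    if element.tag == "a" and element.get("href") is not None:
        return "hyperlink"
    if len(element) == 0 and (element.text or "").strip():
        return "text"       # leaf element carrying visible text
    return "inner"          # container (or empty) element


def categorize_tree(root):
    """Return (tag, category) pairs for every element in document order."""
    return [(el.tag, categorize(el)) for el in root.iter()]
```

One way such a split could reduce groundings is to evaluate each predicate only on the category where it can hold, for example a title predicate only on hyperlink elements. Which predicates the authors restrict to which category is not stated in the slides.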
Experiments List Pages Post Pages
Conclusion A template-independent approach to extracting structured data from web forum sites. We can leverage the power of site-level information, such as the mutual information among pages, within a vertex or across vertices of the sitemap.