Presentation is loading. Please wait.

Presentation is loading. Please wait.

Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR 08 2009. 1. 14.

Similar presentations


Presentation on theme: "Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR 08 2009. 1. 14."— Presentation transcript:

1 Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR 08 2009. 1. 14 Summarized and Presented by Gihyun Gong, IDS Lab.

2 Copyright  2008 by CEBT Outline  Motivation  Web Forum Structure  Our System Crawling Process Traversal Strategy – Skeleton link identification – Page-flipping link detection  Evaluation 2

3 Copyright  2008 by CEBT What is the Web Forum? 3

4 Copyright  2008 by CEBT What is the Web Forum?  Forum structures User Groups – Posters – Moderators – Admin Post Thread 4

5 Copyright  2008 by CEBT Characteristics  Structure ‘Thread’ base (not page base) Many related/shortcut links Pre-defined templates Links with permission control  Contents User-Created Content Frequently changes Weight and quality of ‘Reply’ Suitable for discuss about a topic 5

6 Copyright  2008 by CEBT Why we crawl the Web Forum?  Web forum is a huge resource of human knowledge Over 20% search results are from web forums Could get related outer-link(site) Topic based structure 6 Topic Posts …. Topic Posts …. Topic Posts …. Topic Posts …. Web Forum

7 Copyright  2008 by CEBT The Limitation of Generic Crawlers  In general crawling, each page is treated independently, and each link is treated indiscriminately Lead to more than 50% useless pages Ignore the relationships between pages from a same thread  Forum crawling needs a site-level perspective and a careful selection of links 7

8 Copyright  2008 by CEBT Related Work and limitation  Basic strategy Breadth-first – Hard to select a proper crawling depth for each site Shallow/Deep crawling strategy – Cannot ensure to access all valuable content / May result in too many duplicated and invalid pages  Enhanced strategy On-line page importance computation – Calculate the partial PageRank to estimate the importance of a page – Not suitable for forum sites Deep web crawling – Forums are a kind of deep Web, but this crawler focused on how to prepare appropriate queries to probe Focused Crawling – Semantic topics in forums are too diverse to be simply characterized with a list of terms 8

9 Copyright  2008 by CEBT Outline  Motivation  Web Forum Structure  Our System Crawling Process Traversal Strategy – Skeleton link identification – Page-flipping link detection  Evaluation 9

10 Copyright  2008 by CEBT Web Forum Structure - Sitemap  Sitemap is a directed graph Set of vertices and links 10

11 Copyright  2008 by CEBT Web Forum Structure – Skeleton Link  Most important links supporting the structure 11

12 Copyright  2008 by CEBT Web Forum Structure – Page Flipping  A kind of loop-back links in the sitemap 12

13 Copyright  2008 by CEBT Site-Level Perspective  Understand the organization structure  Find an optimal Traversal strategy The site-level perspective of "forums.asp.net" 13

14 Copyright  2008 by CEBT Outline  Motivation  Web Forum Structure  Our System Crawling Process Traversal Strategy – Skeleton link identification – Page-flipping link detection  Evaluation 14

15 Copyright  2008 by CEBT Process 15

16 Copyright  2008 by CEBT Random Sampling 16

17 Copyright  2008 by CEBT Random Sampling  Randomly sample some pages from a given site  Try to push as many as possible unseen URLs to queue, randomly pop a URL from the front or the end of the queue To cover as many as possible unseen URL patterns  1,000 pages will be enough 17

18 Copyright  2008 by CEBT Sitemap Construction 18

19 Copyright  2008 by CEBT Traversal Strategy Exploring 19

20 Copyright  2008 by CEBT Skeleton Links  Skeleton links are the most important links supporting the structure of a forum site  Crawlers crawl as many as possible unique pages in a given forum site by following skeleton links  Skeleton links point to all valuable pages without introducing redundant and val ueless  How to Identify? Aim at all unique pages without duplicates An optimal set of skeleton links leads to most unique pages and few duplicates Search skeleton links for each valuable vertex – Level by level: Inspired by user browsing behavior – Find an optimal combination of links 20

21 Copyright  2008 by CEBT How to Identify Skeleton Links  Coverage  Informativeness 21

22 Copyright  2008 by CEBT Page-Flipping Links  Crawlers can completely download a long discussion thread divided into several pages by following page-flipping links  Page-flipping links are a kind of loop-back links in the sitemap. However, not all loop-back links are page-flipping ones  How to detect? For page-flipping links, if there is a path from page A to B, there must be a path follow the same type of links from B to A Page-flipping links have larger connectivity score 22 Page A Hyperlink Page B Hyperlink

23 Copyright  2008 by CEBT 23 An illustration of the characteristics of page-flipping links Connectivity = 722 / 890 = 0.81 Connectivity = 108 / 1153 = 0.09

24 Copyright  2008 by CEBT 24

25 Copyright  2008 by CEBT Crawling  From the given entry page  Map a new page to an existing layout vertex  Follow the explored traversal strategy for out-links from that page 25

26 Copyright  2008 by CEBT Outline  Motivation  Web Forum Structure  Our System Crawling Process Traversal Strategy – Skeleton link identification – Page-flipping link detection  Evaluation 26

27 Copyright  2008 by CEBT Experimental Setup  Contract experiments in eight forums from diverse categories Mirror pages: Crawled by a real commerce crawler Structure-driven: Crawled by structure-driven crawler proposed in SIGI R’06 Our method: Crawled by crawler using our traversal strategy 27

28 Copyright  2008 by CEBT Evaluation Criteria 28 Coverage Informativeness

29 Copyright  2008 by CEBT Effectiveness and Efficiency  Effectiveness 29

30 Copyright  2008 by CEBT Effectiveness and Efficiency  Efficiency 30

31 Copyright  2008 by CEBT Evaluation of Page-Flipping Detection 31

32 Copyright  2008 by CEBT Conclusions  A complete solution to automatically explore an appropriate trave rsal strategy to a given target forum site is proposed Skeleton link identification Page-flipping link detection  More future work directions Incremental crawling Forum page segmentation 32


Download ppt "Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR 08 2009. 1. 14."

Similar presentations


Ads by Google