Presentation on theme: "iRobot: An Intelligent Crawler for Web Forums"— Presentation transcript:
1 iRobot: An Intelligent Crawler for Web Forums Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei ZhangMicrosoft Research, AsiaMarch 22, 2017Hello, everyone. Thanks for coming this presentation.I’m Rui Cai from Microsoft Research Asia.In this presentation, I’ll introduce you our recent work on forum crawling.
2 Outline Motivation & Challenge iRobot – Our Solution Evaluation System OverviewModule DetailsEvaluationThis is the outline of this presentation.First, I’ll explain the motivations and challenges in forum crawling.Then, I’ll introduce iRobot, our current solution of forum crawling, for both the system design and module implementation.Final is some preliminary evaluation, to demonstrate the efficiency of our system.
3 Outline Motivation & Challenge iRobot – Our Solution Evaluation System OverviewModule DetailsEvaluationFirst, the motivation and challenge.
4 Why Web Forum is Important Forum is a huge resource of human knowledgePopular all over the worldContain any conceivable topics and issuesForum data can benefit many applicationsImprove quality of search resultVarious data mining on forum dataCollecting forum dataIs the basis of all forum related researchIs not a trivial task
5 Why Forum Crawling is Difficult Duplicate PagesForum is with complex in-site structureMany shortcuts for browsingInvalid PagesMost forums are with access controlSome pages can only be visited after registrationPage-flippingLong thread is shown in multiple pagesDeep navigation levels
6 The Limitation of Generic Crawlers In general crawling, each page is treated independentlyFixed crawling depthCannot avoid duplicates before downloadingFetch lots of invalid pages, such as login promptIgnore the relationships between pages from a same threadForum crawling needs a site-level perspective!
7 Statistics on Some Forums Around 50% crawled pages are uselessWaste of both bandwidth and storage
11 Traversal Path Selection OutlineMotivation & ChallengeOur Solution – iRobotSystem OverviewModule DetailsHow many kinds of pages?How do these pages link with each other?Which pages are valuable?Which links should be followed?EvaluationSitemap ConstructionTraversal Path Selection
12 Page Clustering Forum pages are based on database & template Layout is robust to describe templateRepetitive regions are everywhere on forum pagesLayout can be characterized by repetitive regions
17 Informativeness Evaluation Which kind of pages (nodes) are valuable?Some heuristic criteriaA larger node is more like to be valuablePage with large size are more like to be valuableA diverse node is more like to be valuableBased on content de-dup
25 Preserved Discussion Threads ForumsMirroredCrawled by iRobotCorrectly RecoveredBiketo158413131293Asp600536Baidu−Douban626037CQZG139313841311Tripadvisor326272Hoopchina29352829259387.6%94.5%
26 ConclusionsAn intelligent forum crawler based on site-level structure analysisIdentify page templates / valuable pages / link analysis / traversal path selectionSome modules can still be improvedMore automated & mature algorithms in SIGIR’08More future work directionsQueue managementRefresh strategies