iRobot: An Intelligent Crawler for Web Forums

Presentation transcript:

iRobot: An Intelligent Crawler for Web Forums
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang
Microsoft Research, Asia
March 22, 2017
Hello, everyone. Thanks for coming to this presentation. I’m Rui Cai from Microsoft Research Asia. In this presentation, I’ll introduce our recent work on forum crawling.

Outline
Motivation & Challenge
iRobot – Our Solution
  System Overview
  Module Details
Evaluation
This is the outline of this presentation. First, I’ll explain the motivations and challenges in forum crawling. Then I’ll introduce iRobot, our current solution for forum crawling, covering both the system design and the module implementation. Finally, I’ll present some preliminary evaluation results to demonstrate the efficiency of our system.

Outline
Motivation & Challenge
iRobot – Our Solution
  System Overview
  Module Details
Evaluation
First, the motivation and challenge.

Why Web Forum is Important
Forums are a huge resource of human knowledge
  Popular all over the world
  Contain every conceivable topic and issue
Forum data can benefit many applications
  Improve the quality of search results
  Various data mining on forum data
Collecting forum data
  Is the basis of all forum-related research
  Is not a trivial task

Why Forum Crawling is Difficult
Duplicate Pages
  Forums have a complex in-site structure
  Many shortcuts for browsing
Invalid Pages
  Most forums have access control
  Some pages can only be visited after registration
Page-flipping
  A long thread is shown across multiple pages
  Deep navigation levels

The Limitation of Generic Crawlers
In general crawling, each page is treated independently
  Fixed crawling depth
  Cannot avoid duplicates before downloading
  Fetches lots of invalid pages, such as login prompts
  Ignores the relationships between pages from the same thread
Forum crawling needs a site-level perspective!

Statistics on Some Forums
Around 50% of crawled pages are useless
A waste of both bandwidth and storage

Outline
Motivation & Challenge
Our Solution – iRobot
  System Overview
  Module Details
Evaluation

What is Site-Level Perspective?
Understand the organization structure
Find out an optimal crawling strategy
The site-level perspective of "forums.asp.net"

iRobot: An Intelligent Forum Crawler

Outline
Motivation & Challenge
Our Solution – iRobot
  System Overview
  Module Details
    Sitemap Construction
      How many kinds of pages?
      How do these pages link with each other?
    Traversal Path Selection
      Which pages are valuable?
      Which links should be followed?
Evaluation

Page Clustering
Forum pages are generated from a database and templates
Layout is a robust way to describe a template
  Repetitive regions are everywhere on forum pages
  Layout can be characterized by repetitive regions
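A minimal sketch of how pages might be grouped by their repetitive regions. It assumes each page has already been reduced to a list of tag paths, which is a hypothetical preprocessing step not described on the slide. The names `layout_signature` and `cluster_pages`, the greedy clustering, and the 0.5 similarity threshold are all illustrative assumptions, not the iRobot implementation.

```python
from collections import Counter


def layout_signature(tag_paths, min_repeat=2):
    """Return the set of tag paths that occur repeatedly on a page.

    These repeated paths stand in for the page's 'repetitive regions'.
    """
    counts = Counter(tag_paths)
    return frozenset(path for path, n in counts.items() if n >= min_repeat)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass clustering (an assumed, simple strategy).

    pages maps a page id to its list of tag paths. A page joins the first
    cluster whose representative signature is similar enough, otherwise
    it starts a new cluster.
    """
    clusters = []  # list of (representative signature, [page ids])
    for page_id, tag_paths in pages.items():
        sig = layout_signature(tag_paths)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((sig, [page_id]))
    return [members for _, members in clusters]
```

In this sketch two list-of-threads pages with different numbers of rows still share a signature, because only the identity of the repeated paths matters, not how often they repeat.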

Page Clustering

Link Analysis
URL pattern can distinguish links, but is not reliable on all sites
Location can also distinguish links
A Link = URL Pattern + Location
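The "URL Pattern + Location" idea can be sketched as a composite key. This is only an illustration: the digit-replacement rule used to generalize a URL, the representation of location as a tag-path string, and the function names `url_pattern` and `link_key` are assumptions, not taken from the slides.

```python
import re
from urllib.parse import urlparse


def url_pattern(url):
    """Generalize a URL: digit runs in the path become a placeholder,
    and only the names of query parameters are kept (assumed rule)."""
    parsed = urlparse(url)
    path = re.sub(r"\d+", "<num>", parsed.path)
    keys = sorted(part.split("=", 1)[0] for part in parsed.query.split("&") if part)
    return path + ("?" + "&".join(keys) if keys else "")


def link_key(url, location):
    """Identify a kind of link by both its URL pattern and where it
    appears on the page (here a hypothetical tag-path string)."""
    return (url_pattern(url), location)
```

With such a key, two links whose URLs look alike but which sit in different regions of the page are still treated as different kinds of link.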

Informativeness Evaluation
Which kinds of pages (nodes) are valuable?
Some heuristic criteria
  A larger node is more likely to be valuable
  Pages with a large size are more likely to be valuable
  A diverse node is more likely to be valuable
    Based on content de-dup
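The three heuristics can be combined into a single score in many ways. The sketch below is one hypothetical combination: the slide does not say how the criteria are weighted, so the product of log-scaled node size, log-scaled average page size, and the fraction of distinct pages is purely an assumption, as is the function name `informativeness`.

```python
import hashlib
from math import log1p


def informativeness(pages):
    """Score one sitemap node from the texts of its sampled pages.

    Assumed scoring: more pages, larger pages, and a higher share of
    distinct content (exact-hash de-duplication) all raise the score.
    """
    if not pages:
        return 0.0
    node_size = len(pages)
    avg_page_size = sum(len(p) for p in pages) / node_size
    distinct = len({hashlib.sha256(p.encode("utf-8")).hexdigest() for p in pages})
    diversity = distinct / node_size
    return log1p(node_size) * log1p(avg_page_size) * diversity
```

Under these assumptions a node of identical login prompts scores low because its diversity is small, while a node of long, distinct thread pages scores high.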

Traversal Path Selection
Clean sitemap
  Remove valueless nodes
  Remove duplicate nodes
  Remove links to valueless / duplicate nodes
Find an optimal path
  Construct a spanning tree
  Use depth as cost
  User browsing behaviors
Identify page-flipping links
  Number, Pre/Next
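The steps above can be sketched as follows, under several assumptions: the sitemap is a plain adjacency dictionary, "depth as cost" is read as choosing the shallowest route to each node (which a breadth-first search yields), and page-flipping links are detected from anchor text alone. The function names and the regular expression are illustrative, not the iRobot implementation.

```python
import re
from collections import deque


def clean_sitemap(graph, drop):
    """Remove valueless or duplicate nodes and every link pointing to them."""
    return {
        node: [n for n in neighbors if n not in drop]
        for node, neighbors in graph.items()
        if node not in drop
    }


def bfs_spanning_tree(graph, root):
    """Build a spanning tree that reaches each node at minimal depth."""
    parent = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return parent, depth


# Assumed anchor-text rule for "Number, Pre/Next" links.
_FLIP = re.compile(r"^\s*(\d+|pre|prev|previous|next)\s*$", re.IGNORECASE)


def is_page_flipping(anchor_text):
    """True if the anchor text looks like a page number or Pre/Next."""
    return bool(_FLIP.match(anchor_text))
```

A breadth-first tree is only one way to use depth as the cost; the slide also mentions user browsing behaviors, which this sketch does not model.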

Outline
Motivation & Challenge
iRobot – Our Solution
  System Overview
  Module Details
Evaluation

Evaluation Criteria
Duplicate ratio
Invalid ratio
Coverage ratio
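The slide names the three criteria without defining them. The sketch below uses one plausible reading, stated here as an assumption: duplicate and invalid ratios are taken over all crawled pages, and coverage is the share of valuable pages in a reference mirror that the crawler actually fetched. The function name `crawl_ratios` and its parameters are hypothetical.

```python
def crawl_ratios(crawled, duplicates, invalid, valuable_crawled, valuable_total):
    """Compute the three ratios under the assumed definitions.

    crawled          -- total pages fetched by the crawler
    duplicates       -- fetched pages that duplicate another fetched page
    invalid          -- fetched pages with no content (e.g. login prompts)
    valuable_crawled -- valuable pages the crawler fetched
    valuable_total   -- valuable pages present in the reference mirror
    """
    if crawled <= 0 or valuable_total <= 0:
        raise ValueError("crawled and valuable_total must be positive")
    return {
        "duplicate_ratio": duplicates / crawled,
        "invalid_ratio": invalid / crawled,
        "coverage_ratio": valuable_crawled / valuable_total,
    }
```

Under these definitions a good forum crawler keeps the first two ratios low while keeping coverage high.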

Effectiveness and Efficiency

Performance vs. Sampled Page#

Preserved Discussion Threads
Forums  Mirrored  Crawled by iRobot  Correctly Recovered
Biketo 1584 1313 1293
Asp 600 536
Baidu −
Douban 62 60 37
CQZG 1393 1384 1311
Tripadvisor 326 272
Hoopchina 2935 2829 2593
87.6% 94.5%

Conclusions
An intelligent forum crawler based on site-level structure analysis
  Identify page templates / valuable pages / link analysis / traversal path selection
Some modules can still be improved
  More automated & mature algorithms in SIGIR’08
More future work directions
  Queue management
  Refresh strategies

Thanks!