Block-based Web Search. Deng Cai 1, Shipeng Yu 2, Ji-Rong Wen * and Wei-Ying Ma * (* Microsoft Research Asia, 1 Tsinghua University, 2 University of Munich)


Paper: Block-based Web Search, Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma, SIGIR '04, Microsoft Research Asia, Beijing, China.


Presentation transcript:

1 Block-based Web Search
Deng Cai *1, Shipeng Yu *2, Ji-Rong Wen * and Wei-Ying Ma *
* Microsoft Research Asia, 1 Tsinghua University, 2 University of Munich
Presented by Hong Cheng

2 Problems in Traditional IR
Term-Document Irrelevance Problem
  – Noisy terms
  – Multiple topics
Variant Document Length Problem
  – Length normalization is important
Passage Retrieval in traditional IR
  – Partition the document into several passages
  – Solves these problems to some extent
  – Three types of passages: discourse, semantic, window
  – Fixed-window passages have been shown to be robust

3 Problems in Web IR
Noisy information
  – Navigation
  – Decoration
  – Interaction
  – …
Multiple topics
  – May contain text as well as images or links
(Figure labels: Noisy Information, Multiple Topics)

4 Problems in Web IR (Cont.)
Variant Document Length Problem
Collection statistics:
  Number of docs: TREC-2&4 524,929; TREC-4&5 556,077; WT10g 1,692,096; .GOV 1,247,753
  Text size (Mb): TREC-2&4 2,059; TREC-4&5 2,134; WT10g 10,190; .GOV 18,100
  Median length (Kb): (values not preserved)
  Average length (Kb): (values not preserved)
Conclusion: in web IR, all the problems of traditional IR remain and are more severe!

5 Challenges in Web IR
New characteristics of web pages
  – Two-dimensional logical structure
  – Visual layout presentation
Page segmentation can exploit these characteristics
  – Obtain blocks from web pages
  – Block-based web search becomes possible
Visual cues (figure labels): space, color, font style, font size, separator

6 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

7 Web Page Segmentation Approaches
Fixed-length approach (FixedPS)
  – Traditional window-based passage retrieval
DOM-based approach (DomPS)
  – Like the natural paragraph in traditional passage retrieval
Vision-based Web Page Segmentation (VIPS)
  – Achieves a semantic partition to some extent
Combined approach (CombPS)
  – Combines VIPS and fixed-length segmentation
Corresponding passage type in passage retrieval:
  FixedPS: window; DomPS: discourse; VIPS: semantic; CombPS: semantic window

8 Fixed-length Page Segmentation (FixedPS)
A block contains a fixed number of words
Traditional window-based methods can be applied
Approaches
  – Overlapped windows (e.g. Callan, SIGIR94)
  – Arbitrary passages of varying length (e.g. Kaszkiel et al, SIGIR97)
Results
  – A simple but robust approach
  – Does not consider semantic information
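The overlapped-window idea on this slide can be sketched as follows. This is only an illustration, not the authors' implementation: the function name fixed_windows and the default overlap of half a window are assumptions, and the window size is left as a parameter.

```python
def fixed_windows(words, size=200, overlap=100):
    """Split a list of words into fixed-length windows that overlap.

    `size` and `overlap` are illustrative defaults (assumptions),
    not values taken from the slide.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return windows
```

Every word ends up in at least one window, and consecutive windows share `overlap` words, so a passage that straddles a window boundary still appears intact in one of them.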

9 DOM-based Page Segmentation (DomPS)
Relies on the DOM structure to partition the page
  – DOM: Document Object Model
Current approaches
  – Based only on tags (e.g. Crivellari et al, TREC 9)
  – Combine tags with contents and links (e.g. Chakrabarti et al, SIGIR01)
Results
  – Similar to discourse passages in passage retrieval
  – DOM represents only part of the semantic structure
  – Imprecise content structure
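A tag-only variant of DOM-based segmentation might look like the sketch below. It uses Python's standard html.parser; the set BLOCK_TAGS and the names DomBlocks and dom_blocks are hypothetical choices made for this illustration, not the tag list used in the cited work.

```python
from html.parser import HTMLParser

# Hypothetical set of structural tags that open or close a block.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "table", "tr", "ul", "ol"}


class DomBlocks(HTMLParser):
    """Collect the text between structural tags as separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Normalize whitespace and keep only non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)


def dom_blocks(html):
    parser = DomBlocks()
    parser.feed(html)
    parser.close()
    parser._flush()  # text after the last structural tag
    return parser.blocks
```

Free text that sits between two structural tags becomes a block of its own. The sketch does not filter script or style content, which a real segmenter would need to do.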

10 VIPS Algorithm
Motivation
  – Topics can be distinguished by visual cues in many cases
  – Utilize the two-dimensional structure of web pages
Goal
  – Extract the semantic structure of a web page to some extent, based on its visual presentation
Procedure
  – Partition the web page top-down, based on the separators
Result
  – A tree structure; each node in the tree corresponds to a block in the page
  – Each node is assigned a value (Degree of Coherence) indicating how coherent the content of the block is, based on visual perception
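Given a block tree of the kind described on this slide, selecting the blocks to index can be sketched as a simple traversal. The VisualBlock type, the function extract_blocks and the threshold parameter pdoc are assumptions for illustration; the sketch covers only the selection step on an already-built tree, not the visual separator detection that VIPS itself performs.

```python
from dataclasses import dataclass, field


@dataclass
class VisualBlock:
    """Hypothetical node of a block tree."""
    text: str
    doc: float                      # Degree of Coherence of this block
    children: list = field(default_factory=list)


def extract_blocks(node, pdoc=0.6):
    """Return the blocks at which top-down partitioning stops.

    A node is kept as a block when it is coherent enough
    (doc >= pdoc) or cannot be divided further.
    """
    if node.doc >= pdoc or not node.children:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(extract_blocks(child, pdoc))
    return blocks
```

Raising pdoc yields smaller, more coherent blocks; lowering it keeps larger regions together.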

11 VIPS: An Example Microsoft Technical Report MSR-TR

12 Combined Approach (CombPS)
VIPS solves the problems of noisy information and multiple topics
FixedPS can deal with the variant document length problem
Combine the two:
  – Partition the web page using VIPS
  – Divide the blocks that contain more words than the pre-defined window length
(Figure: block length after segmenting 50,000 pages, chosen from the WT10g data set, using VIPS)
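The two-step combination can be sketched as below. This is an assumption-laden illustration: comb_blocks is a hypothetical name, the input is taken to be the word lists of blocks already produced by a segmenter such as VIPS, and long blocks are cut into non-overlapping windows for simplicity (the slide does not say whether the windows overlap).

```python
def comb_blocks(blocks, window=200):
    """Split any block longer than `window` words into fixed-length pieces.

    `blocks` is a list of word lists, one per visual block.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    result = []
    for words in blocks:
        if not words:
            continue                      # skip empty blocks
        if len(words) <= window:
            result.append(words)          # short block: keep as is
        else:
            for start in range(0, len(words), window):
                result.append(words[start:start + window])
    return result
```

Short blocks keep their semantic boundaries, while long blocks are bounded in length, which is the property the slide attributes to the combined approach.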

13 Summary of Web Page Segmentation Approaches
Fixed-length approach (FixedPS)
  – Traditional passage retrieval
DOM-based approach (DomPS)
  – Like the natural paragraph in traditional passage retrieval
Vision-based Web Page Segmentation (VIPS)
  – Achieves a semantic partition to some extent
Combined approach (CombPS)
  – Combines VIPS and fixed-length segmentation
Corresponding passage type in passage retrieval:
  FixedPS: window; DomPS: discourse; VIPS: semantic; CombPS: semantic window

14 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

15 Block Retrieval
Similar to traditional passage retrieval
Retrieve blocks instead of full documents
Combine the relevance of blocks with the relevance of documents
Goal:
  – Verify whether page segmentation can deal with both the length normalization and the multiple-topic problems

16 Block-level Query Expansion
Similar to passage-level pseudo-relevance feedback
Expansion terms are selected from top blocks instead of top documents
Goal:
  – Test whether page segmentation can improve the selection of query terms by increasing term correlations within a block, and thus improve the final performance

17 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

18 Experiments
Methodology
  – Fixed-length window approach (FixedPS)
      Overlapped windows with a size of 200 words
  – DOM-based approach (DomPS)
      Iterate over the DOM tree for some structural tags
      A block is constructed and identified by such a leaf tag
      Free text between two tags is treated as a special block
  – Vision-based approach (VIPS)
      The permitted degree of coherence is set to 0.6
      All the leaf nodes are extracted as visual blocks
  – The combined approach (CombPS)
      VIPS, then FixedPS
  – Full document approach (FullDoc)
      No segmentation is performed

19 Experiments (Cont.)
Dataset
  – TREC 2001 Web Track: WT10g corpus (1.69 million pages), crawled at [missing]; [missing] queries (topics [missing])
  – TREC 2002 Web Track: .GOV corpus (1.25 million pages), crawled at [missing]; [missing] queries (topics [missing])
Retrieval System
  – Okapi, with weighting function BM2500
Preprocessing
  – Standard stop-word list
  – No stemming or phrase information is used
Tune parameters in BM2500 to achieve the best baselines
Evaluation criteria:

20 Experiments on Block Retrieval
Steps:
1. Do original document retrieval
  – Obtain a document rank DR
2. Analyze the top N (1000 here) documents to get a block set
3. Do block retrieval on the block set (same as Step 1, but with blocks in place of documents)
  – Obtain a block rank BR
  – Documents are re-ranked by the single best block in each document
4. Combine BR and DR to get a new document rank
  – [combination formula missing]
  – [symbol missing] is the tuning parameter
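The re-ranking steps above can be sketched as follows. The combination formula is not preserved in this transcript, so the sketch assumes a linear interpolation of rank positions with a weight alpha; that form, the name rerank and the input layout are all assumptions made for illustration.

```python
def rerank(doc_scores, block_scores, alpha=0.5):
    """Combine a document rank (DR) with a block rank (BR).

    doc_scores:   {doc_id: document retrieval score}
    block_scores: {doc_id: [scores of the blocks in that document]}
    alpha:        assumed interpolation weight between DR and BR
    Returns doc ids ordered from best to worst.
    """
    def ranks(scores):
        # Rank 1 is the highest score; ties are broken by doc id.
        ordered = sorted(scores, key=lambda d: (-scores[d], d))
        return {d: i + 1 for i, d in enumerate(ordered)}

    dr = ranks(doc_scores)
    # BR: each document is represented by its single best block.
    best_block = {
        d: max(block_scores.get(d) or [float("-inf")]) for d in doc_scores
    }
    br = ranks(best_block)
    combined = {d: alpha * dr[d] + (1 - alpha) * br[d] for d in doc_scores}
    return sorted(combined, key=lambda d: (combined[d], dr[d], d))
```

With alpha = 1 the result is the original document rank, and with alpha = 0 it is the block rank alone, which corresponds to the "BR only" setting named in the result tables.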

21 Block Retrieval on TREC 2001 and TREC 2002
(Two tables, "Result on TREC 2001" and "Result on TREC 2002"; columns: Page Segmentation, Baseline, BR only, BR + DR (best); rows: DomPS, FixedPS, VIPS, CombPS; values not preserved)

22 Experiments on Block-level Query Expansion
Steps:
1. Same steps as block retrieval
  – Do original document retrieval to get DR
  – Analyze the top N (1000 here) documents to get a block set
  – Do block retrieval on the block set to get BR
2. Select some expansion terms based on the top blocks
  – 10 expansion terms in our experiments
  – The number of top blocks is a tuning parameter
3. Document retrieval with the expanded query
  – Modify the term weights before final retrieval
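Step 2 can be sketched as below. The slide does not say how candidate terms are scored, so this sketch simply counts in how many of the top blocks a term occurs; that scoring rule, the name expansion_terms and its parameters are assumptions, not the selection function used in the experiments.

```python
from collections import Counter


def expansion_terms(query_terms, ranked_blocks, top_blocks=10, n_terms=10,
                    stopwords=frozenset()):
    """Pick expansion terms from the highest-ranked blocks.

    ranked_blocks: list of word lists, best block first.
    Terms are scored by the number of top blocks containing them
    (a stand-in for a real term selection function).
    """
    query = set(query_terms)
    counts = Counter()
    for words in ranked_blocks[:top_blocks]:
        for term in set(words):           # count each block once per term
            if term not in query and term not in stopwords:
                counts[term] += 1
    # Highest block frequency first; alphabetical order breaks ties.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return ordered[:n_terms]
```

Because terms are drawn from blocks rather than whole pages, text from navigation bars or unrelated page regions only contributes when those regions themselves rank highly.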

23 Query Expansion on TREC 2001 and TREC 2002
(Two tables, "Result on TREC 2001" and "Result on TREC 2002"; columns: Page Segmentation, Baseline, Query Expansion (best), with improvement in %; rows: FullDoc, DomPS, FixedPS, VIPS, CombPS; values not preserved)

24 Discussions
FullDoc obtains only a low and insignificant improvement
  – The baseline is low, so many top-ranked documents are actually irrelevant
DomPS is not good and very unstable
  – The segmentation is too detailed
  – Semantic blocks can hardly be detected, and the expansion terms are not good
FixedPS is stable and good
  – Similar to the results in traditional IR
  – A window may miss the real semantic blocks
VIPS is very good
  – Top blocks usually have very good quality
  – Length normalization is still a problem
CombPS is almost the best method in all experiments
  – More than just a tradeoff

25 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

26 Conclusion
Page segmentation is effective for improving web search
  – Block Retrieval
  – Block-level Query Expansion
(Diagram: plain-text retrieval uses a fixed-window partition; web information retrieval uses a semantic partition (VIPS))
Integrating both semantic and fixed-length properties (CombPS) could deal with all problems and achieve the best performance
We believe that block-based web search can be very useful in real search engines, and can also be very easily combined with block-level link analysis

27 Thanks!

28 Block Retrieval on TREC 2001 (Average Precision)
(Table "Result on TREC 2001 (Average Precision)"; columns: Page Segmentation, Baseline, BR only, BR + DR (best); rows: DomPS, FixedPS, VIPS, CombPS; values not preserved)

29 Query Expansion on TREC 2001 (Average Precision)
(Table "Result on TREC 2001 (Average Precision)"; columns: Page Segmentation, Baseline, Query Expansion (best), with improvement in %; rows: FullDoc, DomPS, FixedPS, VIPS, CombPS; values not preserved)

30 Summary of Block Retrieval
DomPS seems to be the worst and most unstable method
  – The produced blocks are too detailed
  – Blocks cannot be mapped to a single semantic part within pages
FixedPS is stable but not very good
  – Similar to the results in traditional IR
  – It lacks a semantic partition and fails to find the best semantic blocks
VIPS is very good and stable
  – Semantic partition is important in the web context, especially for newly crawled web pages (e.g., TREC 2002)
  – Its inability to deal with the varying length problem results in poor performance on the somewhat older data set
CombPS is a very good tradeoff between VIPS and FixedPS

31 Summary of Query Expansion
FullDoc obtains only a relatively low and insignificant improvement
  – The baseline is low, so many top-ranked documents are actually irrelevant
DomPS fails to obtain a significant improvement over the baseline
  – The segmentation is too detailed, so the expansion terms are not very good
VIPS is very good when a small number of blocks is used
  – Top blocks usually have very good quality
  – VIPS can provide a semantic partition and good expansion terms
FixedPS is very stable and good
  – Very stable when the number of blocks increases
  – A window may cover content from different semantic regions, so noisy terms are likely to be introduced
CombPS is the best method on both data sets
  – More than just a tradeoff