1 Block-based Web Search Deng Cai *1, Shipeng Yu *2, Ji-Rong Wen * and Wei-Ying Ma * * Microsoft Research Asia 1 Tsinghua University 2 University of Munich.

Presentation transcript:

1 Block-based Web Search
Deng Cai *1, Shipeng Yu *2, Ji-Rong Wen * and Wei-Ying Ma *
* Microsoft Research Asia
1 Tsinghua University
2 University of Munich

2 Problems in Traditional IR
Term-Document Irrelevance Problem
  – Noisy terms
  – Multiple topics
Variant Document Length Problem
  – Length normalization is important
Passage Retrieval in traditional IR
  – Partitions the document into several passages
  – Solves these problems to some extent
  – Has three types of passages: discourse, semantic, window
  – Fixed-window passages are shown to be robust

3 Problems in Web IR
Noisy information
  – Navigation
  – Decoration
  – Interaction
  – …
Multiple topics
  – May contain text as well as images or links
[Figure labels: Noisy Information; Multiple Topics]

4 Problems in Web IR (Cont.)
Variant Document Length Problem

                      TREC-2&4   TREC-4&5   WT10g       .GOV
Number of docs        524,929    556,077    1,692,096   1,247,753
Text size (MB)        2,059      2,134      10,190      18,100
Median length (KB)
Average length (KB)

Conclusion: in web IR all the problems of traditional IR remain and are more severe!

5 Challenges in Web IR
New characteristics of web pages
  – Two-dimensional logical structure
  – Visual layout presentation
These characteristics make page segmentation methods achievable
  – Obtain blocks from web pages
  – Block-based web search becomes possible
[Figure labels: Space, Color, Font Style, Font Size, Separator]

6 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

7 Web Page Segmentation Approaches
Fixed-length approach (FixedPS)
  – Traditional window-based passage retrieval
DOM-based approach (DomPS)
  – Like the natural paragraph in traditional passage retrieval
Vision-based Web Page Segmentation (VIPS)
  – Achieves a semantic partition to some extent
Combined Approach (CombPS)
  – Combines VIPS and fixed-length segmentation

Web Page Segmentation   FixedPS   DomPS       VIPS       CombPS
Passage Retrieval       Window    Discourse   Semantic   Semantic Window

8 Fixed-length Page Segmentation (FixedPS)
A block contains a fixed number of words
Traditional window-based methods can be applied
Approaches
  – Overlapped windows (e.g. Callan, SIGIR’94)
  – Arbitrary passages of varying length (e.g. Kaszkiel et al, SIGIR’97)
Results
  – A simple but robust approach
  – Does not consider semantic information
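The overlapped-window idea can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `fixed_length_blocks` and the half-window overlap are assumptions, and the slides give only the window size used in the experiments (200 words).

```python
def fixed_length_blocks(words, window=200, overlap=100):
    """Split a list of words into overlapping fixed-length windows.

    `overlap` (assumed here to be half the window) is the number of words
    shared by two consecutive windows.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    blocks = []
    for start in range(0, len(words), step):
        blocks.append(words[start:start + window])
        # Stop once a window has reached the end of the document.
        if start + window >= len(words):
            break
    return blocks


if __name__ == "__main__":
    tokens = "a b c d e f g h i j".split()
    for block in fixed_length_blocks(tokens, window=4, overlap=2):
        print(block)
```

Because windows are cut purely by word count, a window boundary can fall in the middle of a topic, which is the "no semantic information" limitation noted on the slide.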

9 DOM-based Page Segmentation (DomPS)
Relies on the DOM structure to partition the page
  – DOM: Document Object Model
Current approaches
  – Based only on tags (e.g. Crivellari et al, TREC 9)
  – Combine tags with contents and links (e.g. Chakrabarti et al, SIGIR’01)
Results
  – Similar to discourse passages in passage retrieval
  – DOM represents only part of the semantic structure
  – Imprecise content structure
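A tag-only DOM segmentation can be approximated with Python's standard `html.parser`. The sketch below is a simplification under stated assumptions: the set `BLOCK_TAGS` and the names `DomBlockSplitter` and `dom_blocks` are hypothetical, and the slides do not say which structural tags were used.

```python
from html.parser import HTMLParser

# Assumed set of structural tags; the slides do not list the actual ones.
BLOCK_TAGS = {"p", "div", "td", "li", "table", "tr", "ul", "ol",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class DomBlockSplitter(HTMLParser):
    """Collect text into blocks, cutting at every structural tag boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Normalize whitespace and keep only non-empty blocks.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def dom_blocks(html):
    parser = DomBlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last tag
    return parser.blocks


if __name__ == "__main__":
    print(dom_blocks("<div><p>Hello world</p>free text<p>Second</p></div>"))
```

Free text sitting between two tags ends up as its own block, and a page with many small tags produces many tiny blocks, which matches the "imprecise content structure" result above.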

10 VIPS Algorithm
Motivation
  – Topics can be distinguished with visual cues in many cases
  – Utilize the two-dimensional structure of web pages
Goal
  – Extract the semantic structure of a web page to some extent, based on its visual presentation
Procedure
  – Partition the web page top-down based on the separators
Result
  – A tree structure; each node in the tree corresponds to a block in the page
  – Each node is assigned a value (Degree of Coherence) indicating how coherent the content of the block is, based on visual perception
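The role of the Degree of Coherence can be illustrated with a small tree walk. This is not VIPS itself, which works on the rendered page and its visual separators. The sketch assumes a block tree with coherence values is already available, and the names `VisualBlock`, `extract_blocks` and `pdoc` (permitted degree of coherence) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VisualBlock:
    text: str
    doc: float  # degree of coherence, assumed to lie in [0, 1]
    children: list = field(default_factory=list)


def extract_blocks(node, pdoc=0.6):
    """Descend top-down until a block is coherent enough or cannot be split."""
    if node.doc >= pdoc or not node.children:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(extract_blocks(child, pdoc))
    return blocks


if __name__ == "__main__":
    page = VisualBlock("page", 0.2, [
        VisualBlock("navigation bar", 0.8),
        VisualBlock("main column", 0.4, [
            VisualBlock("news story", 0.9),
            VisualBlock("sidebar", 0.7),
        ]),
    ])
    print([b.text for b in extract_blocks(page)])
```

Raising the threshold yields smaller, more coherent blocks; lowering it keeps larger regions together.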

11 VIPS: An Example Microsoft Technical Report MSR-TR

12 Combined Approach (CombPS)
VIPS solves the problems of noisy information and multiple topics
FixedPS can deal with the variant document length problem
Combine these two:
  – Partition the web page using VIPS
  – Divide the blocks containing more words than the pre-defined window length
[Figure: block length after segmenting 50,000 pages, chosen from the WT10g data set, using VIPS]
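The second step of the combined approach, dividing oversized blocks, might look like the following. It is a sketch under assumptions: the visual blocks are taken as already-computed word lists, the function name `comb_blocks` is hypothetical, and the long blocks are cut into non-overlapping chunks for simplicity, which the slides do not specify.

```python
def comb_blocks(visual_blocks, window=200):
    """Keep short visual blocks as they are; cut long ones into windows.

    `visual_blocks` is a list of word lists, one per block produced by a
    vision-based segmenter.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    result = []
    for words in visual_blocks:
        if not words:
            continue  # skip empty blocks
        if len(words) <= window:
            result.append(words)
        else:
            for start in range(0, len(words), window):
                result.append(words[start:start + window])
    return result


if __name__ == "__main__":
    blocks = [["short", "block"], [str(i) for i in range(450)]]
    print([len(b) for b in comb_blocks(blocks, window=200)])
```

Short blocks keep their semantic boundaries, while no block exceeds the window length, so both the multiple-topic and the length problems are addressed at once.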

13 Web Page Segmentation Summary
Fixed-length approach (FixedPS)
  – Traditional passage retrieval
DOM-based approach (DomPS)
  – Like the natural paragraph in traditional passage retrieval
Vision-based Web Page Segmentation (VIPS)
  – Achieves a semantic partition to some extent
Combined Approach (CombPS)
  – Combines VIPS and fixed-length segmentation

Web Page Segmentation   FixedPS   DomPS       VIPS       CombPS
Passage Retrieval       Window    Discourse   Semantic   Semantic Window

14 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

15 Block Retrieval
Similar to traditional passage retrieval
Retrieve blocks instead of full documents
Combine the relevance of blocks with the relevance of documents
Goal:
  – Verify whether page segmentation can deal with both the length normalization and multiple-topic problems

16 Block-level Query Expansion
Similar to passage-level pseudo-relevance feedback
Expansion terms are selected from top blocks instead of top documents
Goal:
  – Test whether page segmentation can benefit the selection of query terms by increasing term correlations within a block, and thus improve the final performance

17 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

18 Experiments
Methodology
  – Fixed-length window approach (FixedPS)
      Overlapped windows with a size of 200 words
  – DOM-based approach (DomPS)
      Traverse the DOM tree for some structural tags
      A block is constructed and identified by each such leaf tag
      Free text between two tags is treated as a special block
  – Vision-based approach (VIPS)
      The permitted degree of coherence is set to 0.6
      All the leaf nodes are extracted as visual blocks
  – The combined approach (CombPS)
      VIPS then FixedPS
  – Full document approach (FullDoc)
      No segmentation is performed

19 Experiments (Cont.)
Dataset
  – TREC 2001 Web Track
      WT10g corpus (1.69 million pages), crawled at queries (topics )
  – TREC 2002 Web Track
      .GOV corpus (1.25 million pages), crawled at queries (topics )
Retrieval System
  – Okapi, with weighting function BM2500
Preprocessing
  – Standard stop-word list
  – No stemming or phrase information is used
Tune parameters in BM2500 to achieve the best baselines
Evaluation criteria:
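For readers unfamiliar with Okapi weighting, here is a minimal sketch of the standard BM25 scoring formula, on which BM2500 is based. The parameter defaults (`k1`, `b`, `k3`) are common textbook values and are assumptions; the slides only say that the parameters were tuned. The function name `bm25_score` and its argument layout are hypothetical.

```python
import math


def bm25_score(query_tf, doc_tf, doc_len, avg_len, df, n_docs,
               k1=1.2, b=0.75, k3=1000.0):
    """Score one document (or block) against a query with Okapi BM25.

    query_tf / doc_tf: term -> frequency in the query / document
    df: term -> number of documents containing the term
    """
    # Length normalization: long documents need more occurrences to score well.
    K = k1 * ((1 - b) + b * doc_len / avg_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        n = df.get(term, 0)
        if tf == 0 or n == 0:
            continue
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        score += idf * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


if __name__ == "__main__":
    s = bm25_score({"block": 1}, {"block": 2, "web": 1},
                   doc_len=100, avg_len=100, df={"block": 10}, n_docs=1000)
    print(round(s, 3))
```

The same function can score a block instead of a document by passing the block's term frequencies and length, with the average taken over blocks.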

20 Experiments on Block Retrieval
Steps:
1. Do original document retrieval
  – Obtain a document rank DR
2. Analyze top N (1000 here) documents to get a block set
3. Do block retrieval on the block set (same as Step 1, but with blocks in place of documents)
  – Obtain a block rank BR
  – Documents are re-ranked by the single best block in each document
4. Combine BR and DR to get a new document rank
  – [combination formula missing]
  – [symbol missing] is the tuning parameter
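Steps 3 and 4 can be sketched as follows. The combination formula did not survive in the transcript, so the linear interpolation of rank positions used here, with a weight called `alpha`, is an assumption and not necessarily the authors' formula. All function names are hypothetical.

```python
def best_block_scores(block_scores):
    """Represent each document by its single best-scoring block."""
    return {doc: max(scores) for doc, scores in block_scores.items() if scores}


def to_ranks(scores):
    """Map doc -> rank position (1 = best); ties broken by document id."""
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return {doc: i + 1 for i, doc in enumerate(ordered)}


def combine_ranks(doc_scores, block_scores, alpha=0.5):
    """Re-rank documents by alpha * DR rank + (1 - alpha) * BR rank (assumed)."""
    dr = to_ranks(doc_scores)
    br = to_ranks(best_block_scores(block_scores))
    worst = len(dr) + 1  # rank given to documents that have no block
    combined = {d: alpha * dr[d] + (1 - alpha) * br.get(d, worst) for d in dr}
    return sorted(combined, key=lambda d: (combined[d], d))


if __name__ == "__main__":
    doc_scores = {"a": 3.0, "b": 2.0, "c": 1.0}
    block_scores = {"a": [0.1, 0.2], "b": [0.9], "c": [0.5, 0.3]}
    print(combine_ranks(doc_scores, block_scores, alpha=0.5))
```

With `alpha=1` the result is the original document ranking and with `alpha=0` it is the block-only ranking, which corresponds to the "BR only" and "BR + DR" settings compared in the experiments.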

21 Block Retrieval on TREC 2001 and TREC 2002

Result on TREC 2001
Page Segmentation   Baseline   BR only   BR + DR (best)
DomPS
FixedPS
VIPS
CombPS

Result on TREC 2002
Page Segmentation   Baseline   BR only   BR + DR (best)
DomPS
FixedPS
VIPS
CombPS

22 Experiments on Block-level Query Expansion
Steps:
1. Same steps as block retrieval
  – Do original document retrieval to get DR
  – Analyze top N (1000 here) documents to get a block set
  – Do block retrieval on the block set to get BR
2. Select some expansion terms based on top blocks
  – 10 expansion terms in our experiments
  – Number of top blocks is a tuning parameter
3. Document retrieval with the expanded query
  – Modify the term weights before final retrieval
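Step 2 can be illustrated with a very simple term selector. The scoring used here, counting how many of the top blocks contain a term, is a stand-in chosen for brevity and is an assumption; the slides do not state the actual term selection function. The name `expansion_terms` and its parameters are hypothetical.

```python
from collections import Counter


def expansion_terms(query_terms, ranked_blocks, top_blocks=10, n_terms=10,
                    stop_words=frozenset()):
    """Pick expansion terms from the highest-ranked blocks.

    ranked_blocks: list of word lists, best block first.
    A term's score is the number of top blocks that contain it (assumed).
    """
    query = set(query_terms)
    counts = Counter()
    for block in ranked_blocks[:top_blocks]:
        for term in set(block):
            if term not in query and term not in stop_words:
                counts[term] += 1
    # Highest count first; alphabetical order breaks ties deterministically.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return ordered[:n_terms]


if __name__ == "__main__":
    blocks = [
        ["web", "search", "block"],
        ["block", "page", "vips"],
        ["page", "block", "layout"],
    ]
    print(expansion_terms(["web", "search"], blocks, top_blocks=3, n_terms=2))
```

Because the candidate terms come from blocks rather than whole pages, terms from navigation bars or unrelated topics on the same page are less likely to be selected.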

23 Query Expansion on TREC 2001 and TREC 2002

Result on TREC 2001
Page Segmentation   Baseline   Query Expansion (best)
FullDoc                        %
DomPS                          %
FixedPS                        %
VIPS                           %
CombPS                         %

Result on TREC 2002
Page Segmentation   Baseline   Query Expansion (best)
FullDoc                        %
DomPS                          %
FixedPS                        %
VIPS                           %
CombPS                         %

24 Discussions
FullDoc obtains only a low and insignificant improvement
  – The baseline is low, so many top-ranked documents are actually irrelevant
DomPS is not good and very unstable
  – The segmentation is too detailed
  – Semantic blocks can hardly be detected, and the expansion terms are poor
FixedPS is stable and good
  – Similar to the results in traditional IR
  – A window may miss the real semantic blocks
VIPS is very good
  – Top blocks usually have very good quality
  – Length normalization is still a problem
CombPS is almost the best method in all experiments
  – More than just a tradeoff

25 Outline
Motivation
Page segmentation approaches
Web search using page segmentation
  – Block Retrieval
  – Block-level Query Expansion
Experiments and Discussions
Conclusion

26 Conclusion
Page segmentation is effective for improving web search
  – Block Retrieval
  – Block-level Query Expansion
Plain-text retrieval → fixed-window partition
Web information retrieval → semantic partition (VIPS)
Integrating both semantic and fixed-length properties (CombPS) can deal with all of these problems and achieves the best performance
We believe that block-based web search can be very useful in real search engines, and that it can also be easily combined with block-level link analysis

27 Thanks!