Gleb Skobeltsyn Flavio Junqueira Vassilis Plachouras

Presentation transcript:

ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines
Gleb Skobeltsyn, Flavio Junqueira, Vassilis Plachouras, Ricardo Baeza-Yates
The 31st Annual International ACM SIGIR Conference, Singapore, 21 July 2008

Motivation
Caching is crucial for a Web search engine (WSE) to save resources.
Results caching is efficient with real queries, but its hit rate is limited by singletons (queries that occur only once in the stream).
How to increase the hit rate further? Index pruning.

Contents
ResIn architecture
Original query stream vs. query stream after the results cache (misses)
Static pruned index: term pruning, document pruning, a combination of both
Conclusion

ResIn architecture
We study results caching and index pruning together to reduce latency and load on back-end servers.
Query processing, case 1: from the main index.
[Diagram: front end, broker, and back-end servers with term cache and main index; the query is sent to the back ends, which return their top results.]

ResIn architecture
Query processing, case 2: from the results cache.
[Diagram: the results cache sits at the front end; a hit returns the result directly, a miss sends the query on to the broker and the back-end servers (term cache, main index).]

ResIn architecture
Query processing, case 3: from the pruned index.
[Diagram: a results-cache miss is tried on the pruned index; a hit there returns the result, a miss sends the query on to the broker and the back-end servers (term cache, main index).]
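The three query-processing cases on the architecture slides can be read as a lookup chain. The following is only a minimal sketch under stated assumptions: `process_query`, the dictionary used as results cache, and the two callables standing in for the pruned and main index are hypothetical names, not part of the presentation, and the sketch caches every answer without eviction.

```python
from typing import Callable, Dict, List, Optional, Tuple

Results = List[str]  # hypothetical: a result page is a list of document ids


def process_query(
    query: str,
    results_cache: Dict[str, Results],
    pruned_index: Callable[[str], Optional[Results]],
    main_index: Callable[[str], Results],
) -> Tuple[Results, str]:
    """Return (results, source) for one query."""
    # Case 2 on the slides: a results-cache hit is answered immediately.
    if query in results_cache:
        return results_cache[query], "results cache"

    # Case 3: a results-cache miss is tried on the pruned index, which
    # returns either the correct top-k results or None (a miss).
    results = pruned_index(query)
    source = "pruned index"

    if results is None:
        # Case 1: fall back to the back-end servers with the main index.
        results = main_index(query)
        source = "main index"

    # Assumption: every computed answer is stored, with no eviction.
    results_cache[query] = results
    return results, source
```

Only the last branch touches the back-end servers, which is where the latency and load savings mentioned on the slides would come from.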

Original query stream (all queries) vs. query stream after the results cache (misses)

All queries vs. Misses: Experimental setup
A real query log is used to test the results cache and to generate a “miss-log”.
Original query log (all queries): 185M queries from yahoo.co.uk.
The log is replayed through a results cache (LRU); every query that misses is written to the “miss-log”.
Example, original query log: Q1: britney spears, Q2: sigir 2007, Q3: britney spears, Q4: sigir 2008, …, Q185’000’000: last query.
Example, “miss-log”: Q1: britney spears, Q2: sigir 2007, Q4: sigir 2008, … (Q3: britney spears is a hit).
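The miss-log generation described on this slide can be sketched as a replay of the query log through an LRU cache. This is an illustrative sketch, not the authors' tooling: `replay_through_lru` is a hypothetical name, the cache is keyed on the exact query string, and the capacity of 2 is chosen only to make the toy example readable.

```python
from collections import OrderedDict
from typing import Iterable, List


def replay_through_lru(queries: Iterable[str], capacity: int) -> List[str]:
    """Replay a query log through an LRU results cache and return the misses."""
    cache: "OrderedDict[str, None]" = OrderedDict()
    miss_log: List[str] = []
    for q in queries:
        if q in cache:
            cache.move_to_end(q)       # hit: mark as most recently used
            continue
        miss_log.append(q)             # miss: the query goes to the miss-log
        cache[q] = None
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used query
    return miss_log


log = ["britney spears", "sigir 2007", "britney spears", "sigir 2008"]
print(replay_through_lru(log, capacity=2))
# prints ['britney spears', 'sigir 2007', 'sigir 2008']
```

The repeated "britney spears" query is the only hit, so it is the only query missing from the miss-log.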

All queries vs. Misses: Number of terms in a query
Average number of terms: 2.4 for all queries, 3.2 for misses.
Most single-term queries are hits in the results cache.
Queries with many terms are unlikely to be hits.

All queries vs. Misses: Query result size distribution
Sample: 2000 randomly selected queries from all queries and misses.
The average result size for misses is ~100 times smaller than for all queries.
Approximately half of the misses return fewer than 5000 results: small!
Similar results hold with a “small” UK document collection (78M documents).

All queries vs. Misses: Term popularity distribution
Each point is the average popularity of 1000 consecutive terms.
The order of terms for misses is the same as for all queries.
Terms that were popular before the results cache remain popular after it.
Log sizes: 185M queries (all queries), 41M queries (misses).

Static index pruning

Static pruned index
A smaller version of the main index that returns either the same top-k response as the main index, or a miss.
Assumes Boolean query processing.
Types of pruning:
Term pruning: full posting lists for selected terms.
Document pruning: truncated posting lists.
Term+Document pruning: a combination of both.
[Diagram: posting lists of terms t1 to t4 in the full index and under term pruning, document pruning, and T+D pruning.]

Term Pruning: Performance
Term pruning is based on profit(t) = popularity(t) / df(t), where df(t) is the document frequency of term t.
The pruned index answers a query if all query terms are in the pruned index.
It performs well for all queries, and for misses as well: e.g., it can process almost 50% of the queries with 25% of the index.
Measured on the UK document collection, 78M documents.
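A small sketch of how the profit function could drive term selection. The slide gives only the profit formula and the "all query terms present" rule; the greedy selection, the budget measured in postings, the function names, and all numbers below are assumptions made for illustration.

```python
from typing import Dict, Set


def select_terms(popularity: Dict[str, int], df: Dict[str, int], budget: int) -> Set[str]:
    """Greedily keep full posting lists of the highest-profit terms.

    Assumptions: popularity[t] counts how often t occurs in the query log,
    df[t] > 0 is the posting-list length, and budget is the maximum total
    number of postings allowed in the pruned index.
    """
    by_profit = sorted(df, key=lambda t: popularity.get(t, 0) / df[t], reverse=True)
    kept: Set[str] = set()
    used = 0
    for t in by_profit:
        if used + df[t] <= budget:  # the whole list must fit: lists are never cut
            kept.add(t)
            used += df[t]
    return kept


def can_answer(query: str, kept: Set[str]) -> bool:
    """The term-pruned index answers a query only if it holds every query term."""
    return all(t in kept for t in query.split())


# Hypothetical toy statistics.
popularity = {"britney": 50, "spears": 40, "the": 100, "sigir": 5}
df = {"britney": 10, "spears": 8, "the": 1000, "sigir": 2}
kept = select_terms(popularity, df, budget=25)
print(sorted(kept))                        # prints ['britney', 'sigir', 'spears']
print(can_answer("britney spears", kept))  # prints True
print(can_answer("the sigir", kept))       # prints False
```

In this toy data the very popular term "the" is left out because its posting list is long relative to its popularity, which is the trade-off the profit ratio encodes.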

Result Caching + Term Pruning
Results caching performance is independent of the collection size.
Results cache capacity is up to 10% of the full index size.

Term pruning: Frequent terms in misses
MinDF (df of the least frequent query term) correlates with the result size.
MaxDF (df of the most frequent query term) is high for most of the misses.
Many misses contain at least one frequent term, so the term-pruned index has to include large posting lists.
[Diagram: posting lists of different lengths for the example query terms Gleb, Flavio, Vassilis, Ricardo, with MinDF and MaxDF marked.]
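MinDF and MaxDF are straightforward to compute from a document-frequency table. The sketch below uses a hypothetical helper `min_max_df` and made-up frequencies. Under conjunctive Boolean processing (an assumption about the Boolean model named earlier), the result set is an intersection of posting lists and so cannot be larger than MinDF.

```python
from typing import Dict, Tuple


def min_max_df(query: str, df: Dict[str, int]) -> Tuple[int, int]:
    """Return (MinDF, MaxDF) over the terms of a query.

    Terms missing from df are treated as having document frequency 0.
    """
    dfs = [df.get(term, 0) for term in query.split()]
    return min(dfs), max(dfs)


# Hypothetical document frequencies for the example query on the slide.
df = {"gleb": 120, "flavio": 900, "vassilis": 300, "ricardo": 50_000}
print(min_max_df("gleb flavio vassilis ricardo", df))  # prints (120, 50000)
```

A query like this one has a small MinDF, hence few results, yet still needs the long "ricardo" list in the term-pruned index.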

Document pruning
Based on Fagin’s top-k intersection algorithm.
Keeps postings with high scores only: sufficient to compute top-k results for some queries.
Determining the correctness of the result requires computing a scoring threshold: latency!
Example, posting lists sorted by score:
t1: D1 D5 D3 D2 D4 …
t2: D2 D1 D5 …
t3: D4 D1 D2 D3 …
Top-2 results: D1, D2
Score threshold: s(D2,t1)+s(D1,t2)+s(D2,t3)
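The correctness check can be sketched with threshold-style bounds. This is a sketch in the spirit of the slide, not the authors' implementation: the posting-list order follows the slide's example, but the scores are made up, every list is assumed to have been truncated, and the upper bound for documents seen in only some lists is an addition that the slide does not spell out.

```python
from typing import Dict, List, Optional, Tuple

Posting = Tuple[str, float]  # (document id, score); lists sorted by score, descending


def top_k_or_miss(pruned_lists: List[List[Posting]], k: int) -> Optional[List[str]]:
    """Return the top-k documents if the pruned lists prove them correct, else None.

    Assumes each list is non-empty and was truncated, so any pruned posting
    scores at most as much as the last posting that was kept.
    """
    cutoffs = [plist[-1][1] for plist in pruned_lists]
    threshold = sum(cutoffs)  # bound for a document seen in none of the lists

    seen: Dict[str, Dict[int, float]] = {}
    for i, plist in enumerate(pruned_lists):
        for doc, score in plist:
            seen.setdefault(doc, {})[i] = score

    n = len(pruned_lists)
    exact: Dict[str, float] = {}  # documents found in every list: exact score
    best_other = threshold        # highest score any other document could reach
    for doc, scores in seen.items():
        if len(scores) == n:
            exact[doc] = sum(scores.values())
        else:
            upper = sum(scores.get(i, cutoffs[i]) for i in range(n))
            best_other = max(best_other, upper)

    ranked = sorted(exact, key=lambda d: exact[d], reverse=True)
    if len(ranked) >= k and exact[ranked[k - 1]] >= best_other:
        return ranked[:k]
    return None  # a miss: the query has to go to the main index


# Slide example with hypothetical scores; the threshold is 3 + 8 + 4 = 15.
t1 = [("D1", 9), ("D5", 4), ("D3", 3), ("D2", 3)]
t2 = [("D2", 10), ("D1", 8)]
t3 = [("D4", 5), ("D1", 5), ("D2", 4)]
print(top_k_or_miss([t1, t2, t3], k=2))      # prints ['D1', 'D2']
print(top_k_or_miss([t1[:3], t2, t3], k=2))  # prints None
```

Truncating t1 one posting earlier drops D2 from that list, so only one document is fully scored and the pruned index must report a miss.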

Document pruning: Experimental setup
Scoring function (formula not preserved in this transcript):
pr(d): query-independent score of the document d (pagerank)
ω, k: normalization constants, ω=[0,10,20], k=1
We try different values of PLLmax (maximum posting list length) and choose the one that maximizes the hit rate.
We only look at the upper bound for the hit rate: are the original top-10 results found in the top portions of all posting lists (PLs)?
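The upper-bound test described on this slide amounts to a containment check. The sketch below assumes posting lists are stored as document ids sorted by score; `upper_bound_hit` and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def upper_bound_hit(
    top_results: Sequence[str],
    posting_lists: Dict[str, List[str]],
    query_terms: Sequence[str],
    pll_max: int,
) -> bool:
    """True if every original top result lies in the first pll_max postings
    of every query term's posting list (lists sorted by score, descending)."""
    wanted = set(top_results)
    for term in query_terms:
        head = set(posting_lists[term][:pll_max])
        if not wanted <= head:
            return False
    return True


# Hypothetical toy index: document ids ordered by descending score.
posting_lists = {
    "a": ["d1", "d2", "d3", "d4"],
    "b": ["d2", "d1", "d4", "d3"],
}
print(upper_bound_hit(["d1", "d2"], posting_lists, ["a", "b"], pll_max=2))  # prints True
print(upper_bound_hit(["d1", "d3"], posting_lists, ["a", "b"], pll_max=2))  # prints False
```

This counts a query as a hit whenever the answer is present in the truncated lists, without the threshold computation needed to prove it, which is why it only bounds the achievable hit rate from above.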

Document pruning: performance
Document pruning needs high pagerank weights.
It performs better for all queries than for misses.

Term+Document pruning: performance
T+D pruning is the best but expensive (high latency).
profit2 is better than profit1.
The improvement is marginal for misses unless the pagerank weight is very high.

Conclusions
Results caching: delivers good hit rates with a constant capacity, but the hit rate is limited because of singletons.
Index pruning: has no limit on hit rate, but the pruned index size grows with the document collection, so it is more expensive.
Static index pruning: an addition to results caching, not a replacement.
Term pruning performs well for misses also, so it is “compatible” with the results cache.
Document pruning: OK for all queries; for misses only with high pagerank weights.
Term+Document pruning slightly improves over document pruning.
Lesson learned: it is important to consider the interaction between the components.

Thank you
Questions?