« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee », Proceedings of the 30th Annual International ACM SIGIR, Amsterdam, 2007. A. Ntoulas, J. Cho

Presentation transcript:

« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee », Proceedings of the 30th Annual International ACM SIGIR, Amsterdam, 2007. A. Ntoulas, J. Cho. Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

Paper Outline Introduction Architecture Optimal size of pruned index Pruning policies Experimental evaluation Conclusions

Introduction Observation: approximately 80% of users examine at most the first 3 batches of results; that is, 80% of users typically view at most 30 to 60 results for every query they issue to a search engine. Contribution: a new answer-computation algorithm that guarantees that the top-matching pages (according to the search engine's ranking metric) are always placed at the top of the search results, even though the first batch of answers is computed from the pruned index most of the time.

Two-tier Index Architecture

Architecture Definition 1 (Correctness indicator function): Given a query q, the p-index IP returns the answer A together with a correctness indicator function C. C is set to 1 if A is guaranteed to be identical (i.e. same results in the same order) to the result computed from the full index IF. If it is possible that A is different, C is set to 0 Question 1: How can we compute the correctness indicator function C? Question 2: How should we prune the index so as to realize maximum cost saving?
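The fallback logic implied by Definition 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and method names are invented, and the "list holds at least k entries" rule is only a stand-in for the paper's actual correctness test.

```python
# Minimal sketch of the two-tier flow in Definition 1: try the pruned
# index first; its correctness indicator C says whether the answer A is
# guaranteed identical to the full index's answer. All names are
# illustrative, not from the paper.

class Index:
    def __init__(self, postings):
        # postings: term -> list of (doc_id, score), sorted by score desc
        self.postings = postings

    def top_k(self, term, k):
        docs = self.postings.get(term, [])
        answer = [d for d, _ in docs[:k]]
        # Stand-in correctness test: here C = 1 simply when the list
        # holds at least k entries (the paper uses a threshold-based test).
        correct = 1 if len(docs) >= k else 0
        return answer, correct

def answer_query(term, p_index, f_index, k):
    answer, c = p_index.top_k(term, k)
    if c:                                  # C == 1: pruned answer guaranteed
        return answer
    return f_index.top_k(term, k)[0]       # C == 0: recompute from full index

full = Index({"web": [(1, 0.9), (2, 0.8), (3, 0.7), (4, 0.1)]})
pruned = Index({"web": [(1, 0.9), (2, 0.8)]})

print(answer_query("web", pruned, full, k=2))  # served by the pruned tier
print(answer_query("web", pruned, full, k=3))  # falls back to the full tier
```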

Pruned Index Size Observation: the cost of the two-tier architecture depends on two important parameters: the size of the p-index and the fraction of the queries that can be handled by the 1st-tier index alone. Theorem 1: The cost of handling the query load Q is minimal when the size s of the p-index satisfies df(s)/ds = 1, where s is the size of the p-index as a fraction of the full index and f(s) is the fraction of queries it can answer.
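Theorem 1's condition can be checked numerically under an assumed cost model. The concave coverage curve f(s) = sqrt(s) and the normalized cost c(s) = s + (1 − f(s)) below are made up for illustration; they are not the paper's measured curves, but minimizing c does recover df/ds = 1.

```python
# Numeric sketch of Theorem 1 under an assumed (illustrative) cost model:
# c(s) = s + (1 - f(s)), where s is the p-index size fraction and f(s)
# the fraction of queries it answers. The concave f below is made up.

import math

def f(s):
    return math.sqrt(s)          # assumed coverage curve, f(0)=0, f(1)=1

def cost(s):
    return s + (1.0 - f(s))      # pay for index size s, or fall through

# Brute-force the minimizer, then check that df/ds is ~1 there.
best_s = min((i / 10000 for i in range(1, 10000)), key=cost)
h = 1e-6
slope = (f(best_s + h) - f(best_s - h)) / (2 * h)
print(round(best_s, 3), round(slope, 3))   # 0.25 1.0
```

For f(s) = sqrt(s) the analytic optimum is s = 0.25, exactly where the slope of f equals 1.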

Pruning Policies 1) Term (keyword) locality: the search engine can answer a significant fraction of user queries even if it handles only a few popular keywords (possibly those that constitute the majority of the query load). 2) Document locality: as long as the search engine computes the first few top-k answers correctly, users will often not notice that it has not computed the correct answers for the remaining results (unless they explicitly request them). This is what actual commercial search engines do!

Pruning Policies: ranking function assumptions The exact ranking function that search engines employ is a closely guarded secret. Query-dependent relevance: captures how relevant each document is to the query (e.g. the cosine distance metric). Query-independent document quality: measures the overall “quality” of a document D independent of the particular query issued by the user (e.g. PageRank, HITS). The paper adopts as ranking function a linear combination of the two factors above (which should be monotonic).
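The shape of that assumed ranking function can be sketched directly. The weight alpha and the toy scores are illustrative (the paper's own experiments use an equal-weight sum, pr_norm + tr_norm); the key property is monotonicity: raising either component never lowers a document's score.

```python
# Sketch of the assumed ranking-function shape: a monotonic linear
# combination of query-independent quality (e.g. normalized PageRank)
# and query-dependent relevance (e.g. cosine similarity), both in [0, 1].
# The weight alpha and the toy scores are made up for illustration.

def rank(pr_norm, tr_norm, alpha=0.5):
    # Monotonic: increasing either component never lowers the score.
    return alpha * pr_norm + (1 - alpha) * tr_norm

# doc -> (pr_norm, tr_norm)
docs = {"d1": (0.9, 0.2), "d2": (0.4, 0.8), "d3": (0.1, 0.1)}
ranked = sorted(docs, key=lambda d: rank(*docs[d]), reverse=True)
print(ranked)
```

Monotonicity is what later lets the pruning policy bound, from per-term thresholds alone, how well any pruned document could possibly score.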

Pruning Policies: term (horizontal) pruning Problem 2 (optimal keyword pruning): Given the query load Q and a goal index size s · |I_F| for the pruned index, select the inverted lists I_P = {I(t_1), ..., I(t_h)} such that |I_P| ≤ s · |I_F| and the fraction of queries that I_P can answer (expressed by f(s)) is maximized. Theorem 2: Computing the optimal keyword pruning is NP-hard (shown by reduction from the knapsack problem). The paper therefore implements a greedy policy that keeps the items with the maximum benefit per unit cost.
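The greedy heuristic amounts to ranking inverted lists by benefit-per-unit-cost and keeping them until the size budget is exhausted. A minimal sketch, with made-up benefit (query-load mass answered) and cost (list size) numbers:

```python
# Knapsack-style greedy for term pruning: keep the inverted lists with
# the best benefit/cost ratio until the p-index size budget is used up.
# The toy terms, benefits, and sizes below are invented for illustration.

def greedy_keyword_prune(lists, budget):
    # lists: term -> (benefit, size); budget: max total size of p-index
    order = sorted(lists, key=lambda t: lists[t][0] / lists[t][1], reverse=True)
    kept, used = [], 0
    for term in order:
        size = lists[term][1]
        if used + size <= budget:
            kept.append(term)
            used += size
    return kept

toy = {"news": (50, 10), "the": (60, 40), "weather": (30, 5), "zyzzy": (1, 1)}
print(greedy_keyword_prune(toy, budget=20))
```

Note how a high-benefit but very large list ("the") loses to smaller lists with better ratios, which is exactly the behavior the benefit-per-unit-cost rule is after.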

Pruning Policies: document (vertical) pruning Both global and local document-pruning algorithms are considered. Neither guarantees the paper's basic correctness property (Theorem 3).

Combined policy: extended keyword-specific document pruning For every inverted list, the paper picks two threshold values. This policy, combined with a correctly chosen monotonic ranking function, guarantees the paper's correctness property (Theorem 4).
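The role of the two per-list thresholds can be sketched roughly: one threshold decides which postings a list keeps, the other bounds what any pruned posting could have scored, so a top-k answer that clears the bound is provably correct. This is only an illustration of the idea; the threshold choice, the scores, and the exact correctness test here are invented, not the paper's formulas.

```python
# Rough sketch of two-threshold, keyword-specific document pruning:
# keep_tau decides which postings survive pruning; verify_tau is the
# bound used at query time. If every returned score clears verify_tau,
# no pruned document can displace it under a monotonic ranking, so C = 1.
# Thresholds and scores below are illustrative only.

def prune_list(postings, keep_tau):
    # postings: list of (doc_id, score); keep entries scoring above keep_tau
    return [(d, s) for d, s in postings if s > keep_tau]

def query_with_guarantee(pruned, verify_tau, k):
    top = sorted(pruned, key=lambda x: x[1], reverse=True)[:k]
    c = 1 if len(top) == k and all(s >= verify_tau for _, s in top) else 0
    return [d for d, _ in top], c

postings = [(1, 0.95), (2, 0.80), (3, 0.40), (4, 0.10)]
pruned = prune_list(postings, keep_tau=0.30)
print(query_with_guarantee(pruned, verify_tau=0.50, k=2))  # C = 1
print(query_with_guarantee(pruned, verify_tau=0.50, k=3))  # C = 0
```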

Experimental setup Dataset: 130M pages crawled from the Web (March 2004); the crawl seed is the ODP homepage. Total uncompressed size of web pages: ~2 TB. Full inverted index size: ~1.2 TB. Query set: ~450M queries issued to the web site, of which only a fraction (~5%) is processed (the average number of terms per query is 2). Selected ranking function: r(D, q) = pr_norm(D) + tr_norm(D, q)

Term vs. document pruning performance 1) With keyword pruning alone, ~73% of the queries can be answered using 30% of the original index. 2) With document pruning, the correct answer is guaranteed for about 70% of the queries at all index sizes larger than 40%; the optimal index size here is s = 0.20. 3) For small p-index sizes (around 20%), keyword pruning performs much better.

Combination 1) First apply term pruning, then document pruning. 2) For p-index sizes smaller than 50%, the combination does relatively well.

Conclusions The authors provide a framework for new pruning techniques and answer-computation algorithms that guarantee that the top-matching pages are always placed at the top of the search results, in the correct order. A term-pruned index can guarantee 73% of the queries with a size of 30% of the full index; a document-pruned index can guarantee 68% of the queries at the same size; the combination of the two pruning algorithms guarantees 60% of the queries with an index size of 16%.

Any questions? Thank you for your attention!