
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Chris Manning and Pandu Nayak
Efficient scoring

Today's focus
- Retrieval: get docs matching the query from the inverted index
- Scoring + ranking
  - Assign a score to each doc
  - Pick the K highest-scoring docs
- Our emphasis today is on doing this efficiently, rather than on the quality of the ranking

Background
- Score computation is a large fraction (tens of percent) of the CPU work on a query
- Generally, we have a tight latency budget (say, 250 ms)
- CPU provisioning doesn't permit exhaustively scoring every document on every query
- Today we'll look at ways of cutting CPU usage for scoring, without (much) compromising the quality of results
- Basic idea: avoid scoring docs that won't make it into the top K

Recap: Queries as vectors (Ch. 6)
- Vector space scoring
  - We have a weight for each term in each doc
  - Represent queries as vectors in the same space
  - Rank documents by their cosine similarity to the query in this space
  - Or something more complex: BM25, proximity, ...
- Vector space scoring is
  - Entirely query dependent
  - Additive on term contributions (no conditionals etc.)
  - Context insensitive (no interactions between query terms)

TAAT vs. DAAT techniques
- TAAT = "Term At A Time"
  - Scores for all docs computed concurrently, one query term at a time (see the sketch below)
- DAAT = "Document At A Time"
  - The total score for each doc (over all query terms) is computed before proceeding to the next doc
- Each has implications for how the retrieval index is structured and stored
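To make the TAAT idea concrete, here is a minimal term-at-a-time scoring sketch in Python. It assumes a hypothetical toy in-memory index: a `postings` dict mapping each term to a list of (docID, document term weight) pairs and an `idf` dict of query-term weights; cosine length normalization is omitted.

```python
from collections import defaultdict

def taat_score(query_terms, postings, idf):
    """Term-at-a-time scoring: walk one posting list at a time,
    accumulating each term's contribution into per-doc accumulators."""
    scores = defaultdict(float)                     # docID -> partial score
    for t in query_terms:
        for doc_id, w_td in postings.get(t, []):    # (docID, doc term weight)
            scores[doc_id] += idf[t] * w_td         # additive term contribution
    return scores                                   # unnormalized scores

# Toy example (hypothetical data):
postings = {"catcher": [(1, 2.0), (7, 1.0)], "rye": [(1, 1.5), (3, 0.5)]}
idf = {"catcher": 1.2, "rye": 2.0}
print(taat_score(["catcher", "rye"], postings, idf))   # doc 1 scores highest
```

A DAAT scorer would instead merge the posting lists in docID order and finish each document's full score before moving on, which is the organization WAND (below) builds on.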

Efficient cosine ranking (Sec. 7.1)
- Find the K docs in the collection "nearest" to the query, i.e., the K largest query-doc cosines
- Efficient ranking:
  - Choosing the K largest cosine values efficiently
  - Can we do this without computing all N cosines?

Safe vs. non-safe ranking
- The term "safe ranking" is used for methods that guarantee that the K docs returned are the K absolute highest-scoring documents
  - (Not necessarily just under cosine similarity)
- Is it OK to be non-safe?
  - If it is, how do we ensure we don't stray too far from the safe solution?
  - How do we measure how far we are?

Non-safe ranking
- Covered in depth in Coursera video (number 7)
- Non-safe ranking may be okay
  - The ranking function is only a proxy for user happiness
  - Documents close to the top K may be just fine
- Index elimination (sketched below)
  - Only consider high-idf query terms
  - Only consider docs containing many query terms
- Champion lists
- High/low lists, tiered indexes
- Order postings by g(d) (a query-independent quality score)
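As an illustration of the two index-elimination heuristics, here is a small sketch reusing the hypothetical `postings`/`idf` structures from the TAAT sketch above; the cutoff values are arbitrary assumptions.

```python
from collections import Counter

def high_idf_terms(query_terms, idf, idf_cutoff=1.0):
    """Index elimination, step 1: keep only high-idf query terms, so the
    long postings of stop-word-like terms are never traversed."""
    return [t for t in query_terms if idf.get(t, 0.0) >= idf_cutoff]

def candidate_docs(query_terms, postings, min_terms=2):
    """Index elimination, step 2: only score docs containing at least
    min_terms of the (surviving) query terms."""
    counts = Counter(d for t in query_terms for d, _ in postings.get(t, []))
    return {d for d, c in counts.items() if c >= min_terms}
```

Both steps are non-safe: a document dropped here might in fact belong in the true top K.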

SAFE RANKING

Safe ranking
- When we output the top K docs, we have a proof that these are indeed the top K
- Does this imply we always have to compute all N cosines?
  - We'll look at pruning methods, so we only fully score some J documents
- Do we have to sort the J cosine scores?

Computing the K largest cosines: selection vs. sorting (Sec. 7.1)
- Typically we want to retrieve the top K docs (in the cosine ranking for the query)
  - not to totally order all docs in the collection
- Can we pick off the docs with the K highest cosines?
  - Let J = number of docs with nonzero cosines; we seek the K best of these J

Use a heap for selecting the top K
- A binary tree in which each node's value > the values of its children
- Takes 2J operations to construct; then each of the K "winners" is read off in O(log J) steps
- For J = 1M and K = 100, this is about 10% of the cost of sorting
- (A min-heap variant is sketched below)
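The slide describes building a J-element max-heap and reading off K winners. An equivalent, commonly used variant keeps a size-K min-heap of the best scores seen so far, which Python's heapq supports directly; this sketch assumes the `scores` dict produced by the TAAT sketch above.

```python
import heapq

def top_k(scores, k):
    """Select the k highest-scoring docs without sorting all J nonzero
    scores: maintain a size-k min-heap of the best (score, docID) pairs."""
    heap = []
    for doc_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:                    # beats the current k-th best
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)               # highest score first

# One-liner equivalent: heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```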

WAND scoring
- An instance of DAAT scoring
- Basic idea reminiscent of branch and bound
- We maintain a running threshold score, e.g., the K-th highest score computed so far
- We prune away all docs whose cosine scores are guaranteed to be below the threshold
- We compute exact cosine scores only for the un-pruned docs
(Broder et al., "Efficient Query Evaluation using a Two-Level Retrieval Process")

Index structure for WAND
- Postings ordered by docID
- Assume a special iterator on the postings of the form "go to the first docID greater than or equal to X" (a minimal cursor sketch follows below)
- Typical state: we have a "finger" at some docID in the postings of each query term
  - Each finger moves only to the right, to larger docIDs
- Invariant: all docIDs lower than any finger have already been processed, meaning
  - these docIDs are either pruned away, or
  - their cosine scores have been computed
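Here is a minimal sketch of such a "finger": a cursor over a docID-sorted postings list supporting the seek operation the slide assumes. The class name and representation are hypothetical.

```python
import bisect

class PostingsCursor:
    """A finger into a docID-sorted postings list that only moves right."""
    def __init__(self, doc_ids):
        self.doc_ids = doc_ids          # sorted list of docIDs for one term
        self.pos = 0                    # current finger position

    def current(self):
        """The docID under the finger, or None if the list is exhausted."""
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def seek(self, target):
        """Advance the finger to the first docID >= target (never moves left)."""
        self.pos = bisect.bisect_left(self.doc_ids, target, self.pos)
        return self.current()
```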

Upper bounds
- At all times, for each query term t, we maintain an upper bound UB_t on the score contribution of any doc to the right of the finger
  - UB_t = max (over docs remaining in t's postings) of w_t(doc)
- [Figure: a postings list for term t with the finger at docID 38, so UB_t = w_t(38); as the finger moves right, UB_t drops]

Pivoting
- Query: catcher in the rye
- Say the current finger positions are as in the figure below
- [Figure: one postings row per term (catcher, rye, in, the) with fingers at their current docIDs; UB_catcher = 2.3, UB_rye = 1.8, UB_in = 3.3, UB_the = 4.3; threshold = 6.8; the pivot document is marked]

Prune docs that have no hope (see the pivoting sketch below)
- Terms sorted in order of finger positions
- Move fingers to docID 589 (the pivot) or to its right
- [Figure: the same postings rows and UBs as before (UB_catcher = 2.3, UB_rye = 1.8, UB_in = 3.3, UB_the = 4.3, threshold = 6.8); docs to the left of the pivot are marked as hopeless; UBs are updated as the fingers move]
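The following sketch shows one pivoting/pruning step, reusing the hypothetical PostingsCursor above. `cursors` maps each query term to its cursor, `upper_bounds` maps each term to its current UB_t, and `threshold` is the K-th best score so far; exact scoring and UB/threshold updates are omitted.

```python
def wand_pivot(cursors, upper_bounds, threshold):
    """One WAND step: find the pivot docID (the smallest docID that could
    still beat the threshold) and skip past earlier, hopeless docs."""
    # Terms sorted in order of their finger positions (current docIDs).
    active = sorted((t for t in cursors if cursors[t].current() is not None),
                    key=lambda t: cursors[t].current())
    acc = 0.0
    for i, t in enumerate(active):
        acc += upper_bounds[t]
        if acc > threshold:                         # t is the pivot term
            pivot_doc = cursors[t].current()
            for u in active[:i]:                    # docs before the pivot are
                cursors[u].seek(pivot_doc)          # hopeless: move fingers right
            return pivot_doc
    return None                                     # no remaining doc can qualify
```

If all the fingers now sit on the pivot document, its exact cosine score is computed (and may raise the threshold); otherwise we pivot again, as the next slide describes.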

Compute 589's score if need be
- If 589 is present in enough postings, compute its full cosine score; else move some fingers to the right of 589
- Pivot again ...
- [Figure: the same four postings rows (catcher, rye, in, the) after the fingers have advanced]

WAND summary
- In tests, WAND leads to a 90+% reduction in score computation
  - Better gains on longer queries
- Nothing we did was specific to cosine ranking
  - We only need scoring to be additive by term
- WAND and variants give us safe ranking
- It is possible to devise "careless" variants that are a bit faster but not safe (see the summary in Ding & Suel 2011)
  - These ideas combine some of the non-safe scoring we considered