Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Scoring and results assembly.
Advertisements

Chapter 5: Introduction to Information Retrieval
Faster TF-IDF David Kauchak cs160 Fall 2009 adapted from:
Information Retrieval Lecture 6bis. Recap: tf-idf weighting The tf-idf weight of a term is the product of its tf weight and its idf weight. Best known.
| 1 › Gertjan van Noord2014 Zoekmachines Lecture 4.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
TF-IDF David Kauchak cs160 Fall 2009 adapted from:
From last time What’s the real point of using vector spaces?: A user’s query can be viewed as a (very) short document. Query becomes a vector in the same.
CS276 Information Retrieval and Web Mining
Hinrich Schütze and Christina Lioma
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 7: Scores in a Complete Search.
Algoritmi per IR Ranking. The big fight: find the best ranking...
Information Retrieval IR 6. Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support.
CpSc 881: Information Retrieval. 2 Why is ranking so important? Problems with unranked retrieval Users want to look at a few results – not thousands.
Vector Space Model : TF - IDF
CES 514 Data Mining March 11, 2010 Lecture 5: scoring, term weighting, vector space model (Ch 6)
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
Information Retrieval Basic Document Scoring. Similarity between binary vectors Document is binary vector X,Y in {0,1} v Score: overlap measure What’s.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
Document ranking Paolo Ferragina Dipartimento di Informatica Università di Pisa.
Scoring, Term Weighting, and Vector Space Model Lecture 7: Scoring, Term Weighting and the Vector Space Model Web Search and Mining 1.
Faster TF-IDF David Kauchak cs458 Fall 2012 adapted from:
PrasadL09VSM-Ranking1 Vector Space Model : Ranking Revisited Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford)
Computing Scores in a Complete Search System Lecture 8: Scoring and results assembly Web Search and Mining 1.
Information Retrieval and Web Search Lecture 7 1.
Term Weighting and Ranking Models Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
1 ITCS 6265 Information Retrieval and Web Mining Lecture 7.
Introduction to Information Retrieval Introduction to Information Retrieval COMP4210: Information Retrieval and Search Engines Lecture 5: Scoring, Term.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 7 Computing scores in a complete search system.
Vector Space Models.
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 7: Scoring and results assembly.
Lecture 6: Scoring, Term Weighting and the Vector Space Model
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Efficient.
PrasadL09VSM-Ranking1 Vector Space Model : Ranking Revisited Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Term weighting and Vector space retrieval
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 9: Scoring, Term Weighting and the Vector Space Model.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
IR 6 Scoring, term weighting and the vector space model.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
The Vector Space Models (VSM)
Locality-sensitive hashing and its applications
Information Retrieval and Web Search
Ch 6 Term Weighting and Vector Space Model
CSE 538 MRS BOOK – CHAPTER VII
Information Retrieval and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Basic Information Retrieval
Representation of documents and queries
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
From frequency to meaning: vector space models of semantics
8. Efficient Scoring Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
Query processing: phrase queries and positional indexes
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS276 Lecture 7: Scoring and results assembly
Information Retrieval and Web Design
CS276 Information Retrieval and Web Search
CS276: Information Retrieval and Web Search
Term Frequency–Inverse Document Frequency
VECTOR SPACE MODEL Its Applications and implementations
Presentation transcript:

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Document ranking Text-based Ranking (1° generation)

Doc is a binary vector Binary vector X,Y in {0,1}D Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Doc is a binary vector Binary vector X,Y in {0,1}D Score: overlap measure What’s wrong ?

Normalization Dice coefficient (wrt avg #terms): Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Normalization Dice coefficient (wrt avg #terms): Jaccard coefficient (wrt possible terms): NO, triangular OK, triangular

What’s wrong in binary vect? Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" What’s wrong in binary vect? Overlap matching doesn’t consider: Term frequency in a document Talks more of t ? Then t should be weighted more. Term scarcity in collection of commoner than baby bed Length of documents score should be normalized

A famous “weight”: tf-idf Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" A famous “weight”: tf-idf tf = Number of occurrences of term t in doc d t,d where nt = #docs containing term t n = #docs in the indexed collection log n idf t ø ö ç è æ = Vector Space model

Why distance is a bad idea Sec. 6.3 Why distance is a bad idea

Easy to Spam An example cos(a) = v  w / ||v|| * ||w|| Computational Problem #pages .it ≈ a few billions # terms ≈ some mln #ops ≈ 1015 1 op/ns ≈ 1015 ns ≈ 1 week !!!! t1 t2 document v w term 1 2 4 term 2 term 3 3 1 cos(a) = 2*4 + 0*0 + 3*1 / sqrt{ 22 + 32 } * sqrt{ 42 + 12 }  0,75  40° Paolo Ferragina, Web Algorithmics

cosine(query,document) Sec. 6.3 cosine(query,document) Dot product qi is the tf-idf weight of term i in the query di is the tf-idf weight of term i in the document cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d. See Law of Cosines (Cosine Rule) wikipedia page

Cos for length-normalized vectors For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): for q, d length-normalized.

Sec. 7.1.2 Storage For every term t, we have in memory the length nt of its posting list, so the IDF is implicitly available. For every docID d in the posting list of term t, we store its frequency tft,d which is tipically small and thus stored with unary/gamma.

Computing cosine score We could restrict to docs in the intersection

Vector spaces and other operators Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Vector spaces and other operators Vector space OK for bag-of-words queries Clean metaphor for similar-document queries Not a good combination with operators: Boolean, wild-card, positional, proximity First generation of search engines Invented before “spamming” web search

Approximate retrieval Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Top-K documents Approximate retrieval

Speed-up top-k retrieval Sec. 7.1.1 Speed-up top-k retrieval Costly is the computation of the cos() Find a set A of contenders, with k < |A| << N Set A does not necessarily contain all top-k, but has many docs from among the top-k Return the top-k docs in A, according to the score The same approach is also used for other (non-cosine) scoring functions Will look at several schemes following this approach

Sec. 7.1.2 How to select A’s docs Consider docs containing at least one query term (obvious… as done before!). Take this further: Only consider docs containing many query terms Only consider high-idf query terms Champion lists: top scores Fancy hits: for complex ranking functions Clustering

Approach #1: Docs containing many query terms For multi-term queries, compute scores for docs containing several of the query terms Say, at least q-1 out of q terms of the query Imposes a “soft conjunction” on queries seen on web search engines (early Google) Easy to implement in postings traversal

Many query terms Scores only computed for docs 8, 16 and 32. Antony 3 4 8 16 32 64 128 Brutus 2 4 8 16 32 64 128 Caesar 1 2 3 5 8 13 21 34 Calpurnia 13 16 32 Scores only computed for docs 8, 16 and 32.

Approach #2: High-idf query terms only Sec. 7.1.2 Approach #2: High-idf query terms only High-IDF means short posting lists = rare term Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much Only accumulate ranking for the documents in those posting lists

Approach #3: Champion Lists Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Approach #3: Champion Lists Preprocess: Assign to each term, its m best documents Search: If |Q| = q terms, merge their preferred lists ( mq answers). Compute COS between Q and these docs, and choose the top k. Need to pick m>k to work well empirically.

Approach #4: Fancy-hits heuristic Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Approach #4: Fancy-hits heuristic Preprocess: Assign docID by decreasing PR weight, sort by docID = decr PR weight Define FH(t) = m docs for t with highest tf-idf weight Define IL(t) = the rest Idea: a document that scores high should be in FH or in the front of IL Search for a t-term query: First FH: Take the common docs of their FH compute the score of these docs and keep the top-k docs. Then IL: scan ILs and check the common docs Compute the score and possibly insert them into the top-k. Stop when M docs have been checked or the PR score becomes smaller than some threshold.

Sec. 7.1.4 Modeling authority Assign to each document a query-independent quality score in [0,1] to each document d Denote this by g(d) Thus, a quantity like the number of citations (?) is scaled into [0,1]

Champion lists in g(d)-ordering Sec. 7.1.4 Champion lists in g(d)-ordering Can combine champion lists with g(d)-ordering Or, maintain for each term a champion list of the r>k docs with highest g(d) + tf-idftd G(d) may be the PageRank Seek top-k results from only the docs in these champion lists

Approach #5: Clustering Sec. 7.1.6 Approach #5: Clustering Query Leader Follower

Cluster pruning: preprocessing Sec. 7.1.6 Cluster pruning: preprocessing Pick N docs at random: call these leaders For every other doc, pre-compute nearest leader Docs attached to a leader: its followers; Likely: each leader has ~ N followers.

Cluster pruning: query processing Sec. 7.1.6 Cluster pruning: query processing Process a query as follows: Given query Q, find its nearest leader L. Seek K nearest docs from among L’s followers.

Why use random sampling Sec. 7.1.6 Why use random sampling Fast Leaders reflect data distribution

Sec. 7.1.6 General variants Have each follower attached to b1=3 (say) nearest leaders. From query, find b2=4 (say) nearest leaders and their followers. Can recur on leader/follower construction.