CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.

Today 9/8
- Review the basic vector space model: TF*IDF weighting, cosine scoring, ranked retrieval
- More efficient scoring/retrieval

Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d
- The weight increases with the number of occurrences within a doc
- It also increases with the rarity of the term across the whole corpus
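The standard formula behind this summary, written out in the usual notation (tf is the term frequency, df the document frequency, N the number of documents):

```latex
w_{i,d} = \mathrm{tf}_{i,d} \times \log\frac{N}{\mathrm{df}_i}
```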

Real-valued Term Vectors
- Still a bag-of-words model
- Each document is a vector in ℝ^M
- Here the weights are log-scaled tf.idf

Documents as Vectors
- Each doc j can now be viewed as a vector of wf values, one component for each term
- So we have a vector space: terms are axes, and docs live in this space
- The number of dimensions is the size of the dictionary
- And for later: terms (rows) are also vectors, with docs as the dimensions

Intuition
Documents that are “close together” in the vector space talk about the same things.
[Figure: documents d1–d5 plotted against term axes t1, t2, t3, with angles θ and φ between document vectors]

Cosine Similarity
The closeness of vectors d1 and d2 is captured by the cosine of the angle θ between them.
[Figure: vectors d1 and d2 in the space spanned by t1, t2, t3, separated by angle θ]

The Vector Space Model
- Queries are just short documents: take the free-text query as a (very) short document
- Return the documents ranked by the closeness of their vectors to the query vector

Cosine Similarity
The similarity between vectors d1 and d2 is captured by the cosine of the angle θ between them.
Why not Euclidean distance? Euclidean distance is thrown off by vector length: two documents with very similar content can be far apart simply because one is much longer. The angle between the vectors ignores length.

Cosine Similarity
The cosine of the angle between two vectors; the denominator involves the lengths of the vectors, which is what normalizes the score.
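The formula the slide alludes to, in standard notation:

```latex
\cos(\vec{q},\vec{d})
\;=\; \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|}
\;=\; \frac{\sum_{i=1}^{M} q_i\, d_i}{\sqrt{\sum_{i=1}^{M} q_i^2}\;\sqrt{\sum_{i=1}^{M} d_i^2}}
```

Dividing by the two vector lengths is the normalization the slide mentions.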

Normalized Vectors
For length-normalized vectors, the cosine is simply the dot product:
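With both vectors normalized to unit length (|q| = |d| = 1), the cosine reduces to:

```latex
\cos(\vec{q},\vec{d}) \;=\; \vec{q}\cdot\vec{d} \;=\; \sum_{i=1}^{M} q_i\, d_i
```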

So...
The basic ranked retrieval scheme is to:
- Treat queries as vectors
- Compute the dot product of the query with all the docs
- Return the ranked list of docs for that query

But...
- What do we know about the document vectors?
- What do we know about query vectors?

Scoring
(1) N documents; each gets a score
(2) Get the doc lengths for later use
(3) Iterate over the query terms
(6) Accumulate the scores for each doc, a term at a time
(9) Normalize by doc vector length
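A minimal Python sketch of the algorithm those numbered steps describe; the dict-based postings format and precomputed lengths are illustrative assumptions, not the course's actual code:

```python
import heapq

def cosine_score(query_weights, postings, doc_lengths, k=10):
    """Term-at-a-time cosine scoring.
    query_weights: {term: weight} for the query
    postings:      {term: [(doc_id, weight), ...]} inverted index
    doc_lengths:   {doc_id: precomputed vector length}
    Returns the top-k (score, doc_id) pairs."""
    scores = {}                                      # (1) one accumulator per doc
    for term, q_w in query_weights.items():          # (3) iterate over query terms
        for doc_id, d_w in postings.get(term, []):   # walk the term's postings
            scores[doc_id] = scores.get(doc_id, 0.0) + q_w * d_w  # (6) accumulate
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]        # (9) normalize by doc length
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```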

Scoring
[Figure: the CosineScore pseudocode that the numbered steps above refer to]

Note
That approach is known as term-at-a-time scoring, for obvious reasons.
An alternative is document-at-a-time scoring, where:
- First you do a Boolean AND or OR to derive a candidate set of docs
- Then you loop over those docs, scoring each in turn by looping over the query terms
There are pros and cons to each.
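A sketch of the document-at-a-time alternative, under the same assumed postings format (OR-mode candidate generation):

```python
import heapq

def daat_score(query_weights, postings, doc_lengths, k=10):
    """Document-at-a-time scoring: derive a candidate set first (Boolean OR),
    then score each candidate doc by looping over the query terms."""
    # Dict view of each term's postings for O(1) weight lookup per doc.
    by_term = {t: dict(postings.get(t, [])) for t in query_weights}
    candidates = set()
    for weights in by_term.values():
        candidates.update(weights)            # docs with at least one query term
    scored = []
    for doc_id in candidates:                 # finish each doc before moving on
        s = sum(q_w * by_term[t].get(doc_id, 0.0)
                for t, q_w in query_weights.items())
        scored.append((s / doc_lengths[doc_id], doc_id))
    return heapq.nlargest(k, scored)
```

Term-at-a-time keeps one accumulator per matching doc; document-at-a-time completes each doc's score before moving on, which makes per-document early-exit tricks possible.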

Speeding That Up
Two basic approaches:
- Optimize the basic approach by exploiting the fact that queries are short and simple: make the cosines faster, and make getting the top K efficient
- Give up on the notion of finding the best K results from the total N: that is, approximate the top K and don't worry if we miss some docs that should be in the top K

More Efficient Scoring
- Computing a single cosine efficiently
- Choosing the K largest cosine values efficiently
Can we do this without computing all N cosines, or doing a sort of N things?

Efficient Cosine Ranking
- What we’re doing in effect: solving the K-nearest-neighbor problem for a query vector
- In general, we do not know how to do this efficiently for high-dimensional spaces
- But it is solvable for short queries, and standard indexes support this well

Special Case: Unweighted Queries
- No weighting on query terms: assume each query term occurs only once, so its tf is 1
- Ignore idf in the query weight

Faster Cosine: Unweighted Query
[Figure: pseudocode for fast cosine scoring with an unweighted query]
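A sketch of what the unweighted-query shortcut buys: with every query weight equal to 1, the inner loop just sums the stored document weights (same assumed postings format as before):

```python
import heapq

def fast_cosine_unweighted(query_terms, postings, doc_lengths, k=10):
    """Unweighted query: tf is 1 and query idf is ignored, so scoring
    reduces to summing each matching document's stored term weights."""
    scores = {}
    for term in query_terms:
        for doc_id, d_w in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + d_w  # no query-weight multiply
    return heapq.nlargest(
        k, ((s / doc_lengths[d], d) for d, s in scores.items()))
```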

Computing the K Largest Cosines: Selection vs. Sorting
- Typically we want to retrieve the top K docs (in the cosine ranking for the query), not totally order all docs in the collection
- K << N
- Use a heap
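The selection-vs-sorting point in one line: a size-K heap finds the top K in O(N log K) rather than the O(N log N) of a full sort. A small sketch:

```python
import heapq

scores = [(i * 37) % 101 / 100.0 for i in range(1000)]  # stand-in cosine scores
top_10 = heapq.nlargest(10, scores)            # heap selection: O(N log K), K << N
full_sort = sorted(scores, reverse=True)[:10]  # O(N log N), same answer
```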

Quiz
- Quiz 1 is September 27; Chapters 1-4, 6-9, and 12 will be covered
- As of today you should have read Chapters 1-4, 6, and 7
- I’ll provide relevant page ranges
- Material in the book not covered in class will be on the quiz
- Old quizzes are posted; try to work through them, and mail me if you get stuck

Approximation
Cosine(-ish) scoring is still too expensive. Can we avoid all this computation?
Yes, but we may sometimes get it wrong:
- A doc not in the true top K may creep into the list of K output docs
- And a doc that should be there isn’t
Not such a bad thing.

Cosine Similarity Is a Convenient Fiction
- The user has a task and a query formulation; cosine matches docs to the query
- Thus cosine is just a proxy for user happiness
- If we get a list of K docs “close” to the top K by the cosine measure, we should be OK

Generic Approach
- Find a set A of contenders, with K < |A| << N
- A does not necessarily contain the top K, but has many docs from among the top K
- Return the top K docs from set A
- Think of A as eliminating likely non-contenders

Candidate Elimination
- Basic cosine algorithms consider all docs containing at least one query term, because of the way we loop: for each query term, for each doc in that term’s postings
- To cut down on this, we could short-circuit the outer loop, the inner loop, or both

High-idf Query Terms Only
- For a query such as "catcher in the rye", only accumulate scores from catcher and rye
- in and the contribute little to the scores and don’t alter the rank ordering much
- Benefit: postings of low-idf terms contain many docs, so these (many) docs get eliminated from A
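A sketch of this pruning step, with an assumed idf threshold as the tuning knob:

```python
from math import log

def high_idf_terms(query_terms, df, n_docs, idf_threshold=1.0):
    """Keep only query terms whose idf = log(N/df) exceeds a threshold;
    low-idf terms like 'in' and 'the' are dropped before scoring.
    df maps each term to its document frequency."""
    return [t for t in query_terms
            if t in df and log(n_docs / df[t]) > idf_threshold]
```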

Docs Containing Many Query Terms
- For multi-term queries, only compute scores for docs containing several of the query terms: say, at least 3 out of 4
- This imposes a “soft conjunction” on queries, as seen in web search engines (early Google)
- Easy to implement in a postings traversal

3 of 4 Query Terms
[Figure: postings lists for Antony, Brutus, Caesar, and Calpurnia]
Scores are only computed for docs 8, 16, and 32.
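A sketch of the soft conjunction; the postings in the test are constructed so that docs 8, 16, and 32 are exactly the ones holding at least 3 of the 4 terms, matching the example above:

```python
def soft_conjunction_candidates(query_terms, postings, min_match=3):
    """Keep only docs containing at least min_match of the query terms."""
    counts = {}
    for term in query_terms:
        for doc_id, _w in postings.get(term, []):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return {d for d, c in counts.items() if c >= min_match}
```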

Champion Lists
- Precompute, for each dictionary term t, the r docs of highest weight in t’s postings
- Call this the champion list for t (aka fancy list or top docs for t)
- At query time, only compute scores for docs in the champion list of some query term
- Pick the K top-scoring docs from among these
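A sketch of the precomputation, with r as the assumed list size:

```python
import heapq

def build_champion_lists(postings, r=20):
    """For each term, precompute the r docs of highest weight in its
    postings: the term's champion list (aka fancy list / top docs)."""
    return {term: heapq.nlargest(r, plist, key=lambda p: p[1])
            for term, plist in postings.items()}
```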

Early Termination
- When processing query terms, look at them in order of decreasing idf; high-idf terms are likely to contribute most to the score
- As we update the score contribution from each query term, stop if doc scores are relatively unchanged
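One way to realize this, sketched under assumptions: terms are processed in decreasing idf order, and we stop once a term's largest score contribution falls below a small epsilon. The stopping rule here is illustrative; the slide leaves it open.

```python
def early_termination_score(query_terms, idf, postings, epsilon=1e-3):
    """Process query terms in decreasing idf order; quit early once the
    current term barely moves any document's score."""
    scores = {}
    terms = sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True)
    for term in terms:
        max_delta = 0.0
        for doc_id, d_w in postings.get(term, []):
            delta = idf.get(term, 0.0) * d_w
            scores[doc_id] = scores.get(doc_id, 0.0) + delta
            max_delta = max(max_delta, delta)
        if max_delta < epsilon:     # remaining terms have even lower idf
            break
    return scores
```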

Next Time
Start on evaluation.