Information Retrieval Lecture 2: Introduction to Information Retrieval (Manning et al. 2007), Chapters 6 & 7. For the MSc Computer Science Programme. Dell Zhang, Birkbeck, University of London.


Boolean Search
Strength
- Docs either match or don't.
- Good for expert users with a precise understanding of their needs and of the corpus.
Weakness
- Not good for the majority of users, who are poor at formulating their needs as Boolean queries.
- Applications may consume 1000s of results, but most users don't want to wade through 1000s of results (cf. the use of web search engines).

Beyond Boolean Search
Solution: Ranking
- We wish to return, in order, the documents most likely to be useful to the searcher.
- How can we rank/order the docs in the corpus with respect to a query?
- Assign a score, say in [0, 1], to each doc for each query.

Document Scoring
Idea: More is Better
- If a document talks about a topic more, then it is a better match.
- That is to say, a document is more relevant if it contains more occurrences of the relevant terms.
- This leads to the problem of term weighting.

Bag-Of-Words (BOW) Model
Term-Document Count Matrix
- Each document corresponds to a vector in ℕ^V (V = vocabulary size), i.e., a column of the matrix. The matrix element A(i, j) is the number of occurrences of the i-th term in the j-th doc.

Bag-Of-Words (BOW) Model
Simplification
- In the BOW model, the doc "John is quicker than Mary." is indistinguishable from the doc "Mary is quicker than John."
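The count matrix and the BOW indistinguishability above can be sketched in a few lines of Python (the two toy documents come from the slide; everything else is illustrative):

```python
from collections import Counter

# The two docs from the slide, tokenised and lower-cased for simplicity.
docs = {
    "d1": "john is quicker than mary".split(),
    "d2": "mary is quicker than john".split(),
}

# The vocabulary gives the rows (terms) of the matrix.
vocab = sorted({t for toks in docs.values() for t in toks})

# A(i, j) = number of occurrences of term i in doc j; one column per doc.
counts = {d: Counter(toks) for d, toks in docs.items()}
matrix = {d: [counts[d][t] for t in vocab] for d in docs}
```

Because the model ignores word order, the two columns come out identical: under BOW the two documents really are indistinguishable.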

Term Frequency (TF) Digression: Terminology  WARNING: In a lot of IR literature, “frequency” is used to mean “count”. Thus term frequency in IR literature is used to mean the number of occurrences of a term in a document not divided by document length (which would actually make it a frequency). We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document.

Term Frequency (TF)
What is the relative importance of
- 0 vs. 1 occurrence of a term in a doc,
- 1 vs. 2 occurrences,
- 2 vs. 3 occurrences, ...?
We can just use the raw tf. But while more seems better, a lot isn't proportionally better than a few. So another option commonly used in practice is sublinear scaling:

wf(t, d) = 1 + log tf(t, d) if tf(t, d) > 0, and 0 otherwise.

Term Frequency (TF)
The score of a document d for a query q:

score(q, d) = Σ_{t ∈ q} tf(t, d)

- The score is 0 if no query term occurs in the document.
- wf can be used instead of tf in the above.
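A minimal sketch of this scoring scheme in Python, using the natural logarithm for wf (the choice of base only rescales scores):

```python
import math
from collections import Counter

def wf(tf: int) -> float:
    """Sublinearly scaled term frequency: 1 + log(tf) if tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def score(query_terms, doc_tokens, use_wf=True):
    """Sum the (scaled) tf of each query term in the doc.

    Returns 0 if no query term occurs in the document.
    """
    tf = Counter(doc_tokens)
    weight = wf if use_wf else float
    return sum(weight(tf[t]) for t in query_terms)
```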

Term Frequency (TF)
Is TF good enough for weighting?
- Ignorance of document length: long docs are favored because they're more likely to contain query terms. This can be fixed to some extent by normalizing for document length. [more on this later]

Term Frequency (TF)
Is TF good enough for weighting?
- Ignorance of term rarity in the corpus: consider the query ides of march. Julius Caesar has 5 occurrences of ides, while no other play has ides. march occurs in over a dozen plays. All the plays contain of. By this weighting scheme, the top-scoring play is likely to be the one with the most ofs.

Document/Collection Frequency
Which of these tells you more about a doc?
- 5 occurrences of of?
- 5 occurrences of march?
- 5 occurrences of ides?
We'd like to attenuate the weight of a common term. But what is "common"?
- Collection Frequency (CF): the number of occurrences of the term in the whole corpus.
- Document Frequency (DF): the number of docs in the corpus containing the term.

Document/Collection Frequency
DF may be better than CF. The book's example (from the Reuters-RCV1 collection):

Word        CF     DF
try         10422  8760
insurance   10440  3997

The two words have almost the same CF, but insurance is concentrated in far fewer docs. So how do we make use of DF?

Inverse Document Frequency (IDF)
IDF could just be the reciprocal of DF (idf_i = 1/df_i). But by far the most commonly used version is:

idf_i = log(N / df_i)

where N is the total number of docs in the corpus.

Inverse Document Frequency (IDF)
IDF was proposed by Prof Karen Spärck Jones.

TFxIDF
The TFxIDF weighting scheme combines:
- Term Frequency (TF): a measure of the term's density in a doc.
- Inverse Document Frequency (IDF): a measure of the informativeness of a term, i.e., its rarity across the whole corpus.

TFxIDF
Each term i in each document d is assigned a TFxIDF weight:

w(i, d) = tf(i, d) × log(N / df_i)

- Increases with the number of occurrences within a doc.
- Increases with the rarity of the term across the whole corpus.
- What is the weight of a term that occurs in all of the docs? (Zero: df_i = N gives log 1 = 0.)
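The whole weighting scheme fits in a short function; a sketch in Python using base-10 logs (any base works, it only scales all weights uniformly):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w(i, d) = tf(i, d) * log10(N / df(i)) for every term and doc.

    A term that occurs in all N docs gets idf = log10(1) = 0,
    and hence weight 0 everywhere.
    """
    N = len(docs)
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))          # count each doc at most once per term
    idf = {t: math.log10(N / df[t]) for t in df}
    return {d: {t: tf * idf[t] for t, tf in Counter(toks).items()}
            for d, toks in docs.items()}
```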

Term-Document Matrix (Real-Valued)
The matrix element A(i, j) is now the log-scaled TFxIDF weight of term i in doc j. Note: values can be > 1.

Vector Space Model
Docs → Vectors
- Each doc j can now be viewed as a vector of TFxIDF values, one component for each term.
- So we have a vector space: terms are the axes, and docs live in this space.
- The space may have 20,000+ dimensions, even with stemming.

Vector Space Model
The vector space model was pioneered by Prof Gerard Salton, with the SMART information retrieval system.

Vector Space Model
First application: Query-By-Example (QBE)
- Given a doc d, find others "like" it.
- Now that d is a vector, find the vectors (docs) "near" it.
Postulate: documents that are "close together" in the vector space talk about the same things.
[Figure: doc vectors d1..d5 in the space spanned by terms t1..t3, with angles θ and φ between them.]

Vector Space Model
Queries → Vectors
- Regard a query as a (very short) document.
- Return the docs ranked by the closeness of their vectors to the query, also represented as a vector.

Desiderata for Proximity
- If d1 is near d2, then d2 is near d1.
- If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
- No doc is closer to d than d itself.

Euclidean Distance
The distance between d_j and d_k is

dist(d_j, d_k) = √( Σ_i (A(i, j) − A(i, k))² )

Why is this not a great idea?
- We still haven't dealt with the issue of length normalization: long documents would be more similar to each other by virtue of length, not topic.
- However, we can implicitly normalize by looking at angles instead.

Cosine Similarity
Vector Normalization
- A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm: ||x||₂ = √( Σ_i x_i² ).
- This maps vectors onto the unit sphere.
- Then longer documents don't get more weight.

Cosine Similarity
The cosine of the angle between two vectors:

sim(d_j, d_k) = (d_j · d_k) / (||d_j||₂ ||d_k||₂) = Σ_i w(i, j) w(i, k) / ( √(Σ_i w(i, j)²) √(Σ_i w(i, k)²) )

The denominator involves the lengths of the vectors; this is what performs the normalization.
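Both operations, L2 normalization and the cosine itself, are one-liners over plain Python lists (a sketch; real systems work on sparse vectors):

```python
import math

def l2_normalize(v):
    """Divide each component by the L2 norm, mapping v onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(v, w):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw)
```

For normalized vectors the denominators are 1, so the cosine reduces to a plain dot product.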

Cosine Similarity
The similarity between d_j and d_k is captured by the cosine of the angle θ between their vectors. Note that there is no triangle inequality for similarity.
[Figure: two doc vectors d1, d2 in the space spanned by terms t1, t2, t3, separated by angle θ.]

Cosine Similarity - Exercise
Rank the following by decreasing cosine similarity:
- Two docs that have only frequent words (the, a, an, of) in common.
- Two docs that have no words in common.
- Two docs that have many rare words in common (wingspan, tailfin).

Cosine Similarity - Exercise
Show that, for normalized vectors, the Euclidean distance measure gives the same proximity ordering as the cosine similarity measure.
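A sketch of the argument for the exercise above: for unit vectors, expand the squared distance,

```latex
\|\vec{d}_j - \vec{d}_k\|^2
  = \|\vec{d}_j\|^2 + \|\vec{d}_k\|^2 - 2\,\vec{d}_j \cdot \vec{d}_k
  = 2 - 2\cos\theta .
```

Since 2 − 2 cos θ decreases monotonically as cos θ grows, sorting docs by increasing Euclidean distance gives exactly the ordering of decreasing cosine similarity.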

Cosine Similarity - Example
Docs:
- Austen's Sense and Sensibility (SaS)
- Austen's Pride and Prejudice (PaP)
- Brontë's Wuthering Heights (WH)
Computing cosines between the novels' term-weight vectors gives cos(SaS, PaP) > cos(SaS, WH) = 0.889: the two Austen novels are closer to each other than either is to Brontë's.

Vector Space Model - Summary
What's the real point of using vector space?
- Every query can be viewed as a (very short) doc.
- Every query thus becomes a vector in the same space as the docs.
- We can measure each doc's proximity to the query.
- It provides a natural measure of scores/ranking; retrieval is no longer Boolean.
- Docs (and queries) are expressed as bags of words.

Vector Space Model - Exercise
- How would you augment the inverted index built in previous lectures to support cosine ranking computations?
- Walk through the steps of serving a query using the Vector Space Model.

Efficient Cosine Ranking
Computing a single cosine
- For every term t, add tf(t, d) to the postings list entry of each doc d containing t. There are tradeoffs on whether to store the term count, the term weight, or the IDF-weighted value.
- At query time, accumulate the component-wise sum.
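The query-time accumulation can be sketched as term-at-a-time scoring over a toy postings index (the index contents and idf values below are purely illustrative):

```python
from collections import defaultdict

# Hypothetical in-memory index: term -> list of (doc_id, tf) postings,
# plus a precomputed idf per term. Real indexes store these on disk.
postings = {
    "ides":  [("caesar", 5)],
    "march": [("caesar", 2), ("henry5", 1)],
}
idf = {"ides": 1.3, "march": 0.7}  # illustrative values only

def accumulate_scores(query_terms):
    """Walk each query term's postings list and accumulate the
    component-wise sum of tf * idf into a per-doc accumulator."""
    acc = defaultdict(float)
    for t in query_terms:
        for doc, tf in postings.get(t, []):
            acc[doc] += tf * idf[t]
    return acc  # still to be divided by doc lengths for true cosines
```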

Efficient Cosine Ranking
Computing the k largest cosines
- Treat search as a k-nearest-neighbour (kNN) problem: find the k docs "nearest" to the query (with the largest query-doc cosines) in the vector space. We do not need to totally order all docs in the corpus.
- Use a heap for selecting the top k docs: a binary tree in which each node's value > the values of its children. It takes 2n operations to construct; then each of the k "winners" is read off in 2 log n steps. For n = 1M and k = 100, this is about 10% of the cost of a complete sort.
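Selecting the top k without a full sort is exactly what a heap gives you; in Python the standard library does it directly (a sketch over a dict of doc scores):

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (doc, score) pairs, best first,
    without totally ordering all n docs."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

heapq.nlargest maintains a small heap of size k while streaming over the scores, an alternative with cost comparable to the construct-then-pop scheme on the slide.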

Efficient Cosine Ranking
Heuristics
- Avoid computing cosines from the query to each of the n docs, at the risk of occasionally getting an answer wrong.
- For example, cluster pruning.

Take Home Messages
- TFxIDF
- Vector Space Model: docs and queries as vectors; cosine similarity; efficient cosine ranking.