Documents as vectors. Each doc j can be viewed as a vector of tf.idf values, one component for each term. So we have a vector space: terms are the axes, and docs live in this space. Even with stemming, it may have 20,000+ dimensions. The corpus of documents gives us a term-document matrix, which we could also view, transposed, as a vector space in which words live.
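As an illustration, here is a minimal sketch (my own, not from the slides) of building tf.idf vectors over a toy corpus in plain Python, using the common weighting tf × log10(N/df):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a tf.idf vector (dict: term -> weight) for each document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of docs containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight each term by tf x idf, where idf = log10(N / df).
        vectors.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return vectors

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
for vec in tfidf_vectors(docs):
    print(vec)
```

Note that a term occurring in every doc (here "the") gets idf = log10(1) = 0 and drops out, which is exactly the behavior tf.idf is designed to have.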

Why turn docs into vectors? Query-by-example: given a doc D, find others "like" it. Now that D is a vector, this becomes: given a doc, find vectors (docs) "near" it. [Figure: docs d1..d5 drawn as vectors over term axes t1, t2, t3, with angles θ and φ between them.] Postulate: documents that are "close together" in vector space talk about the same things.

The vector space model. Query as vector: we regard the query as a short document. We return the documents ranked by the closeness of their vectors to the query, also represented as a vector. Developed in the SMART system (Salton, c. 1970). Standard used by TREC participants and web IR systems.

Desiderata for proximity. Symmetry: if d1 is near d2, then d2 is near d1. Triangle inequality: if d1 is near d2, and d2 is near d3, then d1 is not far from d3. Identity: no doc is closer to d than d itself.
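As a quick sanity check (my addition, not in the slides), these properties can be verified empirically for Euclidean distance on random vectors:

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

random.seed(0)
vecs = [[random.random() for _ in range(5)] for _ in range(15)]
for a in vecs:
    assert euclidean(a, a) == 0.0                               # identity
    for b in vecs:
        assert abs(euclidean(a, b) - euclidean(b, a)) < 1e-12   # symmetry
        for c in vecs:
            # Triangle inequality: d(a, c) <= d(a, b) + d(b, c)
            assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c) + 1e-12
print("identity, symmetry, and triangle inequality hold on this sample")
```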

First cut. Distance between d1 and d2 is the length of the difference vector |d1 - d2|: Euclidean distance. Why is this not a great idea? We still haven't dealt with the issue of length normalization: long documents would be more similar to each other by virtue of length, not topic. However, we can implicitly normalize by looking at angles instead.
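The length bias is easy to demonstrate: concatenating a document with itself doubles its vector, which moves it far away in Euclidean distance even though its direction (and hence topic) is unchanged. A small sketch, using my own toy vectors:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

d = [3.0, 1.0, 0.0]          # a short document's term vector
dd = [2 * x for x in d]      # the same document concatenated with itself

print(euclidean(d, dd))      # nonzero: Euclidean distance penalizes length
print(cosine(d, dd))         # 1.0: the angle between d and 2d is zero
```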

Cosine similarity. The distance between vectors d1 and d2 is captured by the cosine of the angle θ between them. Note: this is a similarity, not a distance. You do not get the triangle inequality, but it is good enough.

Cosine similarity. A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm, |d| = sqrt(Σ_i d_i²). This maps vectors onto the unit sphere: each normalized doc vector then satisfies Σ_i w_i,j² = 1, so longer documents don't get more weight.

Cosine similarity. Cosine of the angle between two vectors d1 and d2: sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = (Σ_i w_i,1 w_i,2) / (sqrt(Σ_i w_i,1²) sqrt(Σ_i w_i,2²)). The denominator involves the lengths of the vectors; this is the normalization.

Normalized vectors. For normalized vectors, the cosine is simply the dot product: cos(d1, d2) = d1 · d2.
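Putting the two preceding slides together, a minimal sketch (my own, not from the slides) of cosine similarity in Python, both the general form and the dot-product shortcut for pre-normalized vectors:

```python
import math

def normalize(v):
    """Scale v to unit L2 length, mapping it onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """General form: dot product over the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dot(a, b):
    """For already-normalized vectors, the cosine is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

d1, d2 = [1.0, 3.0, 0.0], [2.0, 1.0, 1.0]
print(cosine(d1, d2))                        # general form
print(dot(normalize(d1), normalize(d2)))     # same value via normalization
```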

Example. Docs: Austen's Sense and Sensibility (SAS) and Pride and Prejudice (PAP); Brontë's Wuthering Heights (WH). Length-normalized tf weights:

term       SAS    PAP    WH
affection  .996   .993   .847
jealous    .087   .120   .466
gossip     .017   .000   .254

cos(SAS, PAP) = .996 × .993 + .087 × .120 + .017 × 0.0 ≈ 0.999
cos(SAS, WH) = .996 × .847 + .087 × .466 + .017 × .254 ≈ 0.889
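These numbers are easy to reproduce. A short check, assuming the raw term frequencies behind the table are affection/jealous/gossip counts of 115/10/2, 58/7/0, and 20/11/6 (the standard version of this example):

```python
import math

tf = {
    "SAS": [115, 10, 2],   # affection, jealous, gossip
    "PAP": [58, 7, 0],
    "WH":  [20, 11, 6],
}

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

norm = {name: normalize(v) for name, v in tf.items()}
print([round(x, 3) for x in norm["SAS"]])    # [0.996, 0.087, 0.017]

def cos(a, b):
    # Vectors are pre-normalized, so the cosine is just the dot product.
    return sum(x * y for x, y in zip(norm[a], norm[b]))

print(round(cos("SAS", "PAP"), 3))           # ~0.999
print(round(cos("SAS", "WH"), 3))            # ~0.889
```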

Cosine similarity exercises. Rank the following by decreasing cosine similarity: (a) two docs that have only frequent words (the, a, an, of) in common; (b) two docs that have no words in common; (c) two docs that have many rare words in common (wingspan, tailfin).

Queries in the vector space model. Central idea: the query as a vector. We regard the query as a short document and return the documents ranked by the closeness of their vectors to the query, also represented as a vector. Note that the query vector d_q is very sparse!
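Because the query vector is so sparse, only the query's own terms contribute to the dot product. A minimal sketch of query scoring (my own, with vectors as term → weight dicts):

```python
import math

def norm(vec):
    return math.sqrt(sum(w * w for w in vec.values()))

def cosine_score(query, doc):
    """Dot product over the query's (few) terms, divided by both lengths."""
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    return dot / (norm(query) * norm(doc))

docs = {
    "d1": {"cat": 2.1, "sat": 0.7, "mat": 1.3},
    "d2": {"dog": 1.8, "barked": 1.1},
}
query = {"cat": 1.0, "dog": 1.0}    # sparse: only two nonzero components
ranked = sorted(docs, key=lambda d: cosine_score(query, docs[d]), reverse=True)
print(ranked)
```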

Summary. What's the real point of using vector spaces? A user's query can be viewed as a (very) short document. The query becomes a vector in the same space as the docs, and we can measure each doc's proximity to it. This gives a natural measure of scores/ranking, no longer Boolean. Is it the only thing that people use? No; if you know that some words have special meaning in your own corpus, not justified by tf.idf, then manual editing of weights may take place. E.g., "benefits" might be re-weighted by hand in the corpus of an HR department.
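One hypothetical way to implement such hand edits (my own illustration; the slides don't prescribe a mechanism) is a per-term multiplier layered on top of the automatic tf.idf weights:

```python
# Hypothetical per-term boosts for an HR department's corpus.
manual_boost = {"benefits": 3.0}

def adjusted_weight(term, tfidf_weight):
    """Scale the automatic weight by any hand-assigned boost."""
    return tfidf_weight * manual_boost.get(term, 1.0)

print(adjusted_weight("benefits", 0.4))   # 1.2 after the manual boost
print(adjusted_weight("salary", 0.4))     # 0.4, unchanged
```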

Digression: spamming indices. This was all invented before the days when people were in the business of spamming web search engines. Consider indexing a sensible, passive document collection vs. an active document collection, where people (and indeed, service companies) are shaping documents in order to maximize scores. Vector space similarity may not be as useful in this context.

Issues to consider. How would you augment the inverted index to support cosine ranking computations? Walk through the steps of serving a query. The math of the vector space model is quite straightforward, but being able to do cosine ranking efficiently at runtime is nontrivial.
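One standard answer (a sketch of term-at-a-time scoring, not taken from these slides) is to store the tf.idf weight in each posting and precompute each document's vector length at indexing time; a query is then served by walking only the query terms' postings lists and accumulating partial dot products:

```python
import math
import heapq
from collections import defaultdict

# Inverted index augmented with weights: term -> list of (doc_id, tf.idf weight).
index = defaultdict(list)
index["cat"] = [("d1", 2.1), ("d3", 0.9)]
index["dog"] = [("d2", 1.8), ("d3", 1.1)]

# Precomputed document vector lengths (stored at indexing time).
doc_length = {"d1": 2.6, "d2": 2.1, "d3": 1.6}

def serve_query(query_weights, k=10):
    """Term-at-a-time cosine scoring: accumulate dot products, then normalize."""
    scores = defaultdict(float)
    for term, qw in query_weights.items():
        for doc_id, dw in index.get(term, []):   # walk only this term's postings
            scores[doc_id] += qw * dw
    q_len = math.sqrt(sum(w * w for w in query_weights.values()))
    for doc_id in scores:
        scores[doc_id] /= q_len * doc_length[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(serve_query({"cat": 1.0, "dog": 1.0}))
```

Only documents containing at least one query term ever get an accumulator, which is what makes cosine ranking feasible at web scale.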