Representation of documents and queries


Representation of documents and queries
Why do this?
- We want to compare documents
- We want to compare documents with queries
- We want to retrieve and rank documents with respect to a specific query
A document representation permits this in a consistent way (a type of conceptualization)

Boolean queries
- A document is relevant to a query if the query is in the document
- A document is either relevant or not relevant to the query
- What about relevance ranking, i.e., partial relevance? The vector model deals with this

Matching - similarity
- Define methods of similarity and matching for documents and queries
- Use similarity and matching for ranking

Measures of similarity
- Retrieve the documents most similar to a query
- Equate similarity to relevance: the most similar documents are the most relevant
- This measure is one of "lexical similarity": the matching of text or words

Document space
- Documents are organized in some manner - they exist as points in a document space
- Documents are treated as text, etc.
- Match the query with documents:
  - the query is similar to the document space, or
  - the query is not similar to the document space and becomes a characteristic function on it
- The documents most similar to the query are the ones we retrieve
- Reduce this to a computable measure of similarity

Query similar to document space
- The query is a point in document space
- Documents "near" the query are the ones we want
- "Near" can mean:
  - small distance
  - lying in a similar direction as other documents
  - others

Document Clustering
[Figure: documents plotted as points in a two-dimensional term space (axes Term 1 and Term 2), forming clusters]

Documents in 3D Space
[Figure: documents as points in a three-dimensional term space]
Assumption: documents that are "close together" in space are similar in meaning.

Representation of Documents
- Consider now only text documents
- Words are tokens (primitives)
  - Why not letters?
  - Stop words?
- How do we represent words?
- Even for video, audio, etc. documents, we often use words as part of the representation

Documents as Vectors
- Documents are represented as "bags of words" (example?)
- Represented as vectors when used computationally
  - A vector is like an array of floating-point values
  - It has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
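As a minimal sketch (not from the original deck), a bag of words keeps only term counts and discards word order, which is why it is naturally sparse:

```python
# A minimal bag-of-words sketch: tokenize on whitespace, count terms.
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())

print(bag_of_words("to be or not to be"))
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```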

Vector Space Model
- Documents and queries are represented as vectors in term space
  - Terms are usually stems
  - In the simplest case, documents are represented by binary vectors of terms
- Queries are represented the same way as documents
- Query and document weights are based on the length and direction of their vectors
- A vector distance measure between the query and documents is used to rank retrieved documents

The Vector-Space Model
- Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary
- These "orthogonal" terms form a vector space: dimension = t = |vocabulary|
- Each term i in a document or query j is given a real-valued weight wij
- Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, ..., wtj)

The Vector-Space Model (continued)
- Example: 3 terms t1, t2, t3 for all documents
- Vectors can be written in different ways:
  - d1 = (weight of t1, weight of t2, weight of t3)
  - d1 = (w1, w2, w3)
  - d1 = w1, w2, w3
  - or d1 = w1 t1 + w2 t2 + w3 t3

Example - documents and queries
D1: to be or not to be
D2: to be here or to be there
D3: not to be
D4: to be forever here
Q1: here
Q2: to be
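A short sketch (my own illustration, not from the slides) that builds the doc-term count matrix for this toy collection, with the vocabulary sorted alphabetically as in the slides that follow:

```python
# Build the doc-term (tf) matrix for the toy collection above.
docs = {
    "D1": "to be or not to be",
    "D2": "to be here or to be there",
    "D3": "not to be",
    "D4": "to be forever here",
    "Q1": "here",
    "Q2": "to be",
}
vocab = sorted({w for text in docs.values() for w in text.split()})
# ['be', 'forever', 'here', 'not', 'or', 'there', 'to']

def tf_vector(text: str) -> list[int]:
    """Count how often each vocabulary term occurs in the text."""
    words = text.split()
    return [words.count(t) for t in vocab]

for name, text in docs.items():
    print(name, tf_vector(text))
```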

Definitions
- Documents vs. terms: treat documents and queries the same way
- 4 docs and 2 queries => 6 rows
- Vocabulary in alphabetical order, dimension 7: be, forever, here, not, or, there, to => 7 columns
- 6 x 7 doc-term matrix
- 4 x 4 doc-doc matrix (excluding queries)
- 7 x 7 term-term matrix (excluding queries)
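As a sketch of how these derived matrices relate (assuming numpy, which the slides do not use): given the binary doc-term matrix D, the doc-doc matrix is D @ D.T and the term-term matrix is D.T @ D.

```python
import numpy as np

# Rows: D1..D4; columns: be, forever, here, not, or, there, to (binary).
D = np.array([
    [1, 0, 0, 1, 1, 0, 1],   # D1: to be or not to be
    [1, 0, 1, 0, 1, 1, 1],   # D2: to be here or to be there
    [1, 0, 0, 1, 0, 0, 1],   # D3: not to be
    [1, 1, 1, 0, 0, 0, 1],   # D4: to be forever here
])
print(D @ D.T)   # 4 x 4 doc-doc matrix: terms shared by each document pair
print(D.T @ D)   # 7 x 7 term-term matrix: documents shared by each term pair
```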

Document Collection
A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

         T1    T2   ...  Tt
    D1   w11   w21  ...  wt1
    D2   w12   w22  ...  wt2
    :     :     :         :
    Dn   w1n   w2n  ...  wtn

Queries are treated just like documents!

Assigning Weights to Terms
- wij is the weight of term j in document i
- Binary weights
- Raw term frequency
- tf x idf: deals with the Zipf distribution
- We want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole

doc-term matrix: binary wts

            be  forever  here  not  or  there  to
    Doc 1    1     0       0    1    1    0     1
    Doc 2    1     0       1    0    1    1     1
    Doc 3    1     0       0    1    0    0     1
    Doc 4    1     1       1    0    0    0     1
    Q 1      0     0       1    0    0    0     0
    Q 2      1     0       0    0    0    0     1

doc-doc matrix: binary wts (number of terms that overlap)

            Doc 1  Doc 2  Doc 3  Doc 4
    Doc 1     4      3      3      2
    Doc 2     3      5      2      3
    Doc 3     3      2      3      2
    Doc 4     2      3      2      4

term-term matrix: binary wts (number of documents shared by each term pair)

             be  forever  here  not  or  there  to
    be        4     1      2     2    2    1     4
    forever   1     1      1     0    0    0     1
    here      2     1      2     0    1    1     2
    not       2     0      0     2    1    0     2
    or        2     0      1     1    2    1     2
    there     1     0      1     0    1    1     1
    to        4     1      2     2    2    1     4

doc-term matrix: tf wts

            be  forever  here  not  or  there  to
    Doc 1    2     0       0    1    1    0     2
    Doc 2    2     0       1    0    1    1     2
    Doc 3    1     0       0    1    0    0     1
    Doc 4    1     1       1    0    0    0     1
    Q 1      0     0       1    0    0    0     0
    Q 2      1     0       0    0    0    0     1

doc-doc matrix: tf wts (inner products of tf vectors)

            Doc 1  Doc 2  Doc 3  Doc 4
    Doc 1    10      9      5      4
    Doc 2     9     11      4      5
    Doc 3     5      4      3      2
    Doc 4     4      5      2      4

Term Weights: Term Frequency
- More frequent terms in a document are more important, i.e., more indicative of its topic:
  fij = frequency of term i in document j
- We may want to normalize term frequency (tf):
  tfij = fij / max{fij}
  (dividing by the largest term frequency, commonly taken over the terms of the same document)
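A minimal sketch of this normalization, assuming the max is taken over the terms of the same document:

```python
# Max-normalized term frequency: tf_ij = f_ij / max{f_ij}.
def normalized_tf(counts: dict[str, int]) -> dict[str, float]:
    """Divide each raw count by the document's largest term count."""
    peak = max(counts.values())
    return {term: f / peak for term, f in counts.items()}

print(normalized_tf({"to": 2, "be": 2, "or": 1, "not": 1}))
# {'to': 1.0, 'be': 1.0, 'or': 0.5, 'not': 0.5}
```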

Assigning Weights
- tf x idf measure: term frequency (tf) x inverse document frequency (idf), a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document
- A term occurring frequently in the document but rarely in the rest of the collection is given high weight
- Many other ways of determining term weights have been proposed; experimentally, tf-idf has been found to work well

TF x IDF (term frequency - inverse document frequency)
wij = tfij · [log2(N / nj) + 1]
where:
- wij = weight of term Tj in document Di
- tfij = frequency of term Tj in document Di
- N = number of documents in the collection
- nj = number of documents where term Tj occurs at least once
The bracketed factor, log2(N / nj) + 1, is the inverse document frequency measure idfj.

IDF (inverse document frequency)
- idfj = log2(N / nj) + 1
- log2(N / nj) + 1 = log2 N - log2 nj + 1
- Recall nj is 1 or greater, so idfj is always defined
- The +1 keeps a term's weight nonzero even when it occurs in every document (nj = N); idf moderates the effect of terms that occur in many documents
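A sketch of the weight formula above in Python; `tfidf_weight` is an illustrative name of my own, not from the slides:

```python
# w_ij = tf_ij * (log2(N / n_j) + 1)
import math

def tfidf_weight(tf: int, N: int, n: int) -> float:
    """tf: frequency of the term in this document;
    N: documents in the collection; n: documents containing the term."""
    return tf * (math.log2(N / n) + 1)

# 'forever' in Doc 4: tf = 1, and it occurs in 1 of the 4 documents.
print(tfidf_weight(tf=1, N=4, n=1))   # 3.0
```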

Inverse Document Frequency (idf)
- idf provides high values for rare words and low values for common words
- For a collection of 10,000 documents: a term occurring in 1 document gets idf = log2(10000) + 1 ≈ 14.3; in 100 documents, log2(100) + 1 ≈ 7.6; in all 10,000 documents, log2(1) + 1 = 1
- Double the number of documents: what happens? Every idf value increases by exactly 1, since log2(2N / nj) = log2(N / nj) + 1
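A small sketch that tabulates idf for a 10,000-document collection and shows the doubling effect numerically:

```python
import math

def idf(N: int, n: int) -> float:
    """idf_j = log2(N / n_j) + 1"""
    return math.log2(N / n) + 1

# Doubling N adds exactly 1 to every idf value, since log2(2) = 1.
for n in (1, 10, 100, 1000, 10000):
    print(f"n = {n:>5}:  idf(N=10000) = {idf(10000, n):5.2f}   "
          f"idf(N=20000) = {idf(20000, n):5.2f}")
```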

Inverse Document Frequency
- idfj modifies only the columns, not the rows!
- log2(N / nj) + 1 = log2 N - log2 nj + 1
- Consider only the documents, not the queries, when counting nj
- Here N = 4

tf-idf wt calculation (N = 4): count document frequencies nj

            be  forever  here  not  or  there  to
    Doc 1    2     0       0    1    1    0     2
    Doc 2    2     0       1    0    1    1     2
    Doc 3    1     0       0    1    0    0     1
    Doc 4    1     1       1    0    0    0     1
    nj       4     1       2    2    2    1     4

(entries are tf weights; nj = number of documents containing each term)

tf-idf wt calculation (N = 4): compute N / nj

            be  forever  here  not  or  there  to
    nj       4     1       2    2    2    1     4
    N/nj     1     4       2    2    2    4     1

tf-idf wt calculation (N = 4): compute idfj = log2(N / nj) + 1

            be  forever  here  not  or  there  to
    nj       4     1       2    2    2    1     4
    N/nj     1     4       2    2    2    4     1
    idfj     1     3       2    2    2    3     1

tf-idf wt calculation: wij = tfij · idfj

            be  forever  here  not  or  there  to
    Doc 1    2     0       0    2    2    0     2
    Doc 2    2     0       2    0    2    3     2
    Doc 3    1     0       0    2    0    0     1
    Doc 4    1     3       2    0    0    0     1

(Doc + queries) terms tf-idf wts

            be  forever  here  not  or  there  to
    Doc 1    2     0       0    2    2    0     2
    Doc 2    2     0       2    0    2    3     2
    Doc 3    1     0       0    2    0    0     1
    Doc 4    1     3       2    0    0    0     1
    Q 1      0     0       2    0    0    0     0
    Q 2      1     0       0    0    0    0     1
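To check the tables above, here is a sketch that recomputes the tf-idf weights for the toy collection (N = 4; document frequencies nj are counted over the documents only, not the queries):

```python
import math

docs = ["to be or not to be", "to be here or to be there",
        "not to be", "to be forever here"]
queries = ["here", "to be"]
vocab = sorted({w for d in docs for w in d.split()})

N = len(docs)
n = {t: sum(t in d.split() for d in docs) for t in vocab}   # doc frequency
idf = {t: math.log2(N / n[t]) + 1 for t in vocab}

def tfidf_row(text: str) -> list[float]:
    """tf-idf weight for every vocabulary term of one document/query."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

for name, text in zip(["D1", "D2", "D3", "D4", "Q1", "Q2"], docs + queries):
    print(name, tfidf_row(text))
```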

Document Similarity
- Given a query, what do we want to retrieve?
  - Relevant documents
  - Similar documents
- The query should be similar to the document
- This is an intuitive notion: would you want a document that contains none of your query terms?

Similarity Measures
- Queries are treated like documents
- Documents are ranked by some measure of closeness to the query
- Closeness is determined by a similarity measure s
- Ranking is usually by decreasing score: s(d1) >= s(d2) >= s(d3) >= ...

Document Similarity
Types of similarity:
- Text
- Content
- Authors
- Date of creation
- Images
- Etc.

Similarity Measure - Inner Product
- Similarity between the vectors for document dj and query q can be computed as the vector inner product:
  sim(dj, q) = dj · q = Σi (wij · wiq)
  where wij is the weight of term i in document j and wiq is the weight of term i in the query
- For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection)
- For weighted term vectors, it is the sum of the products of the weights of the matched terms
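A minimal sketch of the inner product over aligned term-weight vectors:

```python
# sim(d_j, q) = sum_i w_ij * w_iq
def inner_product(d: list[float], q: list[float]) -> float:
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary case: Q2 = "to be" against D3 = "not to be" shares 2 terms.
d3 = [1, 0, 0, 1, 0, 0, 1]   # be, forever, here, not, or, there, to
q2 = [1, 0, 0, 0, 0, 0, 1]
print(inner_product(d3, q2))  # 2
```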

Inner Product: binary wts

           doc1  doc2  doc3  doc4
    Q1      0     1     0     1
    Q2      2     2     2     2

(The inner product is computed below for the binary, tf, and tf-idf models.)

Binary wts
(The binary doc-term matrix from above, repeated for reference.)

Inner Product: tf wts

           doc1  doc2  doc3  doc4
    Q1      0     1     0     1
    Q2      4     4     2     2

doc-terms tf wts
(The tf doc-term matrix from above, repeated for reference.)

Inner Product: tf-idf wts

           doc1  doc2  doc3  doc4
    Q1      0     4     0     4
    Q2      4     4     2     2

(Doc + queries) terms tf-idf wts
(The tf-idf matrix from above, repeated for reference.)

Inner Product: summary for the binary, tf, and tf-idf models

    binary   doc1  doc2  doc3  doc4
    Q1        0     1     0     1
    Q2        2     2     2     2

    tf       doc1  doc2  doc3  doc4
    Q1        0     1     0     1
    Q2        4     4     2     2

    tfidf    doc1  doc2  doc3  doc4
    Q1        0     4     0     4
    Q2        4     4     2     2

    D1: to be or not to be
    D2: to be here or to be there
    D3: not to be
    D4: to be forever here
    Q1: here
    Q2: to be

Properties of Inner Product
- The inner product is unbounded
- It favors long documents with a large number of unique terms
- It measures how many terms matched, but not how many terms are not matched

Cosine Similarity Measure 2 t3 t1 t2 D1 D2 Q 1 Cosine similarity measures the cosine of the angle between two vectors. Inner product normalized by the vector lengths. CosSim(dj, q) =

Binary wts normalized (each row divided by its vector length)

            be   forever  here   not   or   there   to
    Doc 1   .5      0       0    .5    .5     0     .5
    Doc 2   .45     0      .45    0    .45   .45    .45
    Doc 3   .58     0       0    .58    0     0     .58
    Doc 4   .5     .5      .5     0     0     0     .5
    Q 1      0      0      1      0     0     0      0
    Q 2     .71     0       0     0     0     0     .71

Cosine Measure: binary wts

           doc1  doc2  doc3  doc4
    Q1      0    .45    0    .5
    Q2     .71   .63   .82   .71

doc-terms tf wts normalized (each row divided by its vector length)

            be   forever  here   not   or   there   to
    Doc 1   .63     0       0    .32   .32    0     .63
    Doc 2   .60     0      .30    0    .30   .30    .60
    Doc 3   .58     0       0    .58    0     0     .58
    Doc 4   .5     .5      .5     0     0     0     .5
    Q 1      0      0      1      0     0     0      0
    Q 2     .71     0       0     0     0     0     .71

Cosine Measure: tf wts

           doc1  doc2  doc3  doc4
    Q1      0    .30    0    .5
    Q2     .89   .85   .82   .71

(Doc + queries) terms tf-idf wts normalized (each row divided by its vector length)

            be   forever  here   not   or   there   to
    Doc 1   .5      0       0    .5    .5     0     .5
    Doc 2   .4      0      .4     0    .4    .6     .4
    Doc 3   .41     0       0    .82    0     0     .41
    Doc 4   .26    .77     .52    0     0     0     .26
    Q 1      0      0      1      0     0     0      0
    Q 2     .71     0       0     0     0     0     .71

Cosine Measure: tf-idf wts

           doc1  doc2  doc3  doc4
    Q1      0    .40    0    .52
    Q2     .71   .57   .58   .37

Cosine Measure: summary for the binary, tf, and tf-idf models

    binary   doc1  doc2  doc3  doc4
    Q1        0    .45    0    .5
    Q2       .71   .63   .82   .71

    tf       doc1  doc2  doc3  doc4
    Q1        0    .30    0    .5
    Q2       .89   .85   .82   .71

    tfidf    doc1  doc2  doc3  doc4
    Q1        0    .40    0    .52
    Q2       .71   .57   .58   .37

    D1: to be or not to be
    D2: to be here or to be there
    D3: not to be
    D4: to be forever here
    Q1: here
    Q2: to be

Similarity Measures
- Simple matching (coordination-level match)
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
(A sketch of these coefficients in set form follows below.)
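The slides list these coefficients without formulas; here is a sketch using their common set-based (binary) definitions, where X and Y are the term sets of the two items being compared:

```python
def simple_match(X: set, Y: set) -> int:
    """Coordination-level match: number of shared terms."""
    return len(X & Y)

def dice(X: set, Y: set) -> float:
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X: set, Y: set) -> float:
    return len(X & Y) / len(X | Y)

def cosine_coeff(X: set, Y: set) -> float:
    return len(X & Y) / (len(X) * len(Y)) ** 0.5

def overlap(X: set, Y: set) -> float:
    return len(X & Y) / min(len(X), len(Y))

d2 = set("to be here or to be there".split())   # {be, here, or, there, to}
q1 = {"here"}
print(simple_match(d2, q1), dice(d2, q1), jaccard(d2, q1),
      cosine_coeff(d2, q1), overlap(d2, q1))
# 1  0.333...  0.2  0.447...  1.0
```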

Properties of similarity or matching metrics
- s is the similarity measure
- Symmetric: s(Di, Dk) = s(Dk, Di)
- s is close to 1 if the items are similar
- s is close to 0 if the items are different
- Others?

Similarity Measures
- A similarity measure is a function that computes the degree of similarity between a pair of vectors or documents
  - Since queries and documents are both vectors, a similarity measure can represent the similarity between two documents, two queries, or one document and one query
- A large number of similarity measures have been proposed in the literature, because the best similarity measure doesn't exist (yet!)
- With a similarity measure between the query and the documents:
  - it is possible to rank the retrieved documents in order of presumed importance
  - it is possible to enforce a threshold so that the size of the retrieved set can be controlled
  - the results can be used to reformulate the original query in relevance feedback (e.g., combining a document vector with the query vector)