1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1

2 Course Administration Wednesday evening sessions run from 7:30 to 8:30, except that the Midterm Examination may run 1.5 hours.

3 2. Similarity Ranking Methods [Diagram: a query is matched against an index database of documents; a mechanism determines the similarity of the query to each document, producing a set of documents ranked by how similar they are to the query.]

4 Similarity Ranking Methods Methods that look for matches (e.g., Boolean) assume that a document is either relevant to a query or not relevant. Similarity ranking methods measure the degree of similarity between a query and a document: how similar is a document to a request?

5 Evaluation: Precision and Recall Precision and recall measure the results of a single query using a specific search system applied to a specific set of documents. Matching methods: Precision and recall are single numbers. Ranking methods: Precision and recall are functions of the rank order.

6 Evaluating Ranking: Recall and Precision If information retrieval were perfect, every document relevant to the original information need would be ranked above every other document. With ranking, precision and recall are functions of the rank order. Precision(n): the fraction (or percentage) of the n most highly ranked documents that are relevant. Recall(n): the fraction (or percentage) of the relevant items that are in the n most highly ranked documents.
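A minimal sketch in Python of these two measures; the list of relevance judgments and the total number of relevant documents below are hypothetical, chosen only to illustrate the definitions:

```python
# A minimal sketch of Precision(n) and Recall(n) for a ranked result list.
# `relevance` is a hypothetical list of booleans, one per ranked document
# (True = relevant); `total_relevant` is the number of relevant documents
# that exist for the query in the whole collection.

def precision_at(n, relevance):
    """Fraction of the n most highly ranked documents that are relevant."""
    return sum(relevance[:n]) / n

def recall_at(n, relevance, total_relevant):
    """Fraction of all relevant documents that appear in the top n."""
    return sum(relevance[:n]) / total_relevant

# The ranking from the next slide: 5 of the first 8 results are relevant.
relevance = [True, False, True, True, False, True, False, True]
print(precision_at(8, relevance))       # 0.625  (5/8)
print(recall_at(8, relevance, 20))      # 0.25   (assuming 20 relevant documents exist)
```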

7 Precision and Recall with Ranking Example "Your query found 349,871 possibly relevant documents. Here are the first eight." Examination of the first 8 finds that 5 of them are relevant.

8 Graph of Precision with Ranking: P(r)

Rank r   Relevant?   Precision P(r)
1        Y           1/1
2        N           1/2
3        Y           2/3
4        Y           3/4
5        N           3/5
6        Y           4/6
7        N           4/7
8        Y           5/8

9 Term Similarity: Example Problem: Given two text documents, how similar are they? [Methods that measure similarity do not assume exact matches.] Example: here are three documents. How similar are they?

d1: ant ant bee
d2: dog bee dog hog dog ant dog
d3: cat gnu dog eel fox

Documents can be any length from one word to thousands. A query is a special type of document.

10 Term Similarity: Basic Concept Two documents are similar if they contain some of the same terms. Possible measures of similarity might take into consideration: (a) the lengths of the documents, (b) the number of terms in common, (c) whether the terms are common or unusual, and (d) how many times each term appears.

11 Term Vector Space A term vector space is an n-dimensional space, where n is the number of different terms used to index a set of documents. Document i is represented by a vector; its magnitude in dimension j is t_ij, where:
t_ij > 0 if term j occurs in document i
t_ij = 0 otherwise
t_ij is the weight of term j in document i.

12 A Document Represented in a 3-Dimensional Term Vector Space [Figure: document d1 drawn as a vector in a space with axes t1, t2, t3 and components t11, t12, t13.]

13 Basic Method: Incidence Matrix (No Weighting)

document   text                          terms
d1         ant ant bee                   ant bee
d2         dog bee dog hog dog ant dog   ant bee dog hog
d3         cat gnu dog eel fox           cat dog eel fox gnu

     ant bee cat dog eel fox gnu hog
d1    1   1   0   0   0   0   0   0
d2    1   1   0   1   0   0   0   1
d3    0   0   1   1   1   1   1   0

Weights: t_ij = 1 if document i contains term j, and zero otherwise. This gives 3 vectors in an 8-dimensional term vector space.
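As a sketch, the incidence matrix can be built directly from the rule t_ij = 1 if document i contains term j; the document names and text below are taken from the example above, and the variable names are mine:

```python
# A minimal sketch: binary incidence matrix for the three example documents.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Vocabulary: the 8 distinct terms, in alphabetical order.
terms = sorted({t for text in docs.values() for t in text.split()})

# t_ij = 1 if document i contains term j, 0 otherwise.
incidence = {
    name: [1 if term in text.split() else 0 for term in terms]
    for name, text in docs.items()
}

print(terms)            # ['ant', 'bee', 'cat', 'dog', 'eel', 'fox', 'gnu', 'hog']
print(incidence["d2"])  # [1, 1, 0, 1, 0, 0, 0, 1]
```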

14 Basic Vector Space Methods: Similarity The similarity between two documents is a function of the angle between their vectors in the term vector space.

15 Two Documents Represented in 3-Dimensional Term Vector Space [Figure: documents d1 and d2 drawn as vectors in a space with axes t1, t2, t3; the angle θ between the two vectors measures their similarity.]

16 Vector Space Revision x = (x_1, x_2, x_3, ..., x_n) is a vector in an n-dimensional vector space. The length of x is given by (an extension of Pythagoras's theorem): |x|^2 = x_1^2 + x_2^2 + x_3^2 + ... + x_n^2. If x_1 and x_2 are vectors, their inner product (or dot product) is given by: x_1 . x_2 = x_11 x_21 + x_12 x_22 + x_13 x_23 + ... + x_1n x_2n. The cosine of the angle θ between the vectors x_1 and x_2 is: cos(θ) = (x_1 . x_2) / (|x_1| |x_2|).
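These operations translate directly into a short Python sketch; the function names are mine, and the example vectors are the binary vectors for d1 and d2 from the incidence matrix above:

```python
import math

def length(x):
    """|x| = sqrt(x_1^2 + x_2^2 + ... + x_n^2)"""
    return math.sqrt(sum(v * v for v in x))

def dot(x1, x2):
    """Inner product: x1 . x2 = x_11*x_21 + x_12*x_22 + ... + x_1n*x_2n"""
    return sum(a * b for a, b in zip(x1, x2))

def cosine(x1, x2):
    """cos(theta) = (x1 . x2) / (|x1| |x2|)"""
    return dot(x1, x2) / (length(x1) * length(x2))

# Binary vectors for d1 and d2 from the incidence matrix:
d1 = [1, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 1, 0, 0, 0, 1]
print(round(cosine(d1, d2), 2))   # 0.71
```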

17 Example 1: No Weighting

     ant bee cat dog eel fox gnu hog   length
d1    1   1   0   0   0   0   0   0     √2
d2    1   1   0   1   0   0   0   1     √4
d3    0   0   1   1   1   1   1   0     √5

18 Example 1 (continued) Similarity of documents in the example:

      d1     d2     d3
d1    1      0.71   0
d2    0.71   1      0.22
d3    0      0.22   1

19 Weighting Methods: tf and idf Term frequency (tf): a term that appears several times in a document is weighted more heavily than a term that appears only once. Inverse document frequency (idf): a term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.
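A hedged sketch of one common tf.idf variant follows; the exact weighting formulas are the topic of Lecture 4, and the log-based idf used here is only one of several standard choices:

```python
import math

def idf(document_frequency, collection_size):
    """Inverse document frequency: terms that occur in few documents get higher weight."""
    return math.log(collection_size / document_frequency)

def tf_idf(term_frequency, document_frequency, collection_size):
    """Weight grows with the count inside the document, shrinks with collection-wide use."""
    return term_frequency * idf(document_frequency, collection_size)

# 'dog' occurs 4 times in d2 and appears in 2 of the 3 example documents:
print(round(tf_idf(4, 2, 3), 2))   # 4 * ln(3/2) ≈ 1.62
```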

20 Example 2: Weighting by Term Frequency (tf)

document   text                          terms
d1         ant ant bee                   ant bee
d2         dog bee dog hog dog ant dog   ant bee dog hog
d3         cat gnu dog eel fox           cat dog eel fox gnu

     ant bee cat dog eel fox gnu hog   length
d1    2   1   0   0   0   0   0   0     √5
d2    1   1   0   4   0   0   0   1     √19
d3    0   0   1   1   1   1   1   0     √5

Weights: t_ij = frequency with which term j occurs in document i.

21 Example 2 (continued) Similarity of documents in the example:

      d1     d2     d3
d1    1      0.31   0
d2    0.31   1      0.41
d3    0      0.41   1

Similarity depends upon the weights given to the terms. [Note the differences in results from Example 1.]
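The numbers above can be reproduced with a short self-contained sketch (the cosine function is re-defined locally so the snippet runs on its own):

```python
from collections import Counter
import math

def cosine(x1, x2):
    dot = sum(a * b for a, b in zip(x1, x2))
    norms = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return dot / norms

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
terms = sorted({t for text in docs.values() for t in text.split()})

# t_ij = number of times term j occurs in document i (term-frequency weighting).
vectors = {name: [Counter(text.split())[t] for t in terms] for name, text in docs.items()}

for a, b in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 2))
# d1 d2 0.31,  d1 d3 0.0,  d2 d3 0.41
```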

22 Summary: Vector Similarity Computation with Weights Documents in a collection are assigned terms from a set of n terms. The term vector space W is defined as follows: if term k does not occur in document d_i, w_ik = 0; if term k occurs in document d_i, w_ik is greater than zero (w_ik is called the weight of term k in document d_i). The similarity between d_i and d_j is defined as:

cos(d_i, d_j) = ( Σ_{k=1..n} w_ik w_jk ) / ( |d_i| |d_j| )

where d_i and d_j are the corresponding weighted term vectors.

23 Approaches to Weighting Boolean information retrieval: the weight of term k in document d_i is w(i, k) = 1 if term k occurs in document d_i, and w(i, k) = 0 otherwise. General weighting methods: the weight of term k in document d_i satisfies 0 < w(i, k) <= 1 if term k occurs in document d_i, and w(i, k) = 0 otherwise. (The choice of weights for ranking is the topic of Lecture 4.)

24 Simple Uses of Vector Similarity in Information Retrieval Threshold: for query q, retrieve all documents with similarity above a chosen threshold value. Ranking: for query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]

25 Simple Example of Ranking (Weighting by Term Frequency)

query q: ant dog

document   text                          terms
d1         ant ant bee                   ant bee
d2         dog bee dog hog dog ant dog   ant bee dog hog
d3         cat gnu dog eel fox           cat dog eel fox gnu

     ant bee cat dog eel fox gnu hog   length
q     1   0   0   1   0   0   0   0     √2
d1    2   1   0   0   0   0   0   0     √5
d2    1   1   0   4   0   0   0   1     √19
d3    0   0   1   1   1   1   1   0     √5

26 Calculate Ranking Similarity of the query to the documents in the example:

      d1      d2      d3
q     2/√10   5/√38   1/√10
      0.63    0.81    0.32

If the query q is searched against this document set, the ranked results are: d2, d1, d3.
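As a sketch, treating the query as a short document and sorting by cosine similarity reproduces this ordering (helper names are mine):

```python
from collections import Counter
import math

def tf_vector(text, terms):
    counts = Counter(text.split())
    return [counts[t] for t in terms]

def cosine(x1, x2):
    dot = sum(a * b for a, b in zip(x1, x2))
    norms = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return dot / norms

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
query = "ant dog"
terms = sorted({t for text in docs.values() for t in text.split()})
q = tf_vector(query, terms)

ranking = sorted(docs, key=lambda name: cosine(q, tf_vector(docs[name], terms)), reverse=True)
print(ranking)   # ['d2', 'd1', 'd3']  (similarities ≈ 0.81, 0.63, 0.32)
```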

27 Contrast of Ranking with Matching With matching, a document either matches a query exactly or not at all; matching encourages short queries, requires precise choice of index terms, and requires precise formulation of queries (professional training). With retrieval using similarity measures, similarities range from 0 to 1 for all documents; this encourages long queries with as many dimensions as possible, benefits from large numbers of index terms, and benefits from queries with many terms, not all of which need match the document.

28 Document Vectors as Points on a Surface Normalize all document vectors to be of length 1. Then the ends of the vectors all lie on a surface with unit radius. For similar documents, we can represent parts of this surface as a flat region. Similar documents are represented as points that are close together on this surface.
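A minimal sketch of this normalization; the example vector is the tf vector for d1 from the earlier example:

```python
import math

def normalize(x):
    """Scale a vector so that its length is 1; its end then lies on the unit surface."""
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]

d1 = [2, 1, 0, 0, 0, 0, 0, 0]   # tf vector for "ant ant bee"
u1 = normalize(d1)
print(round(math.sqrt(sum(v * v for v in u1)), 6))   # 1.0
```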

29 Results of a Search [Figure: hits from a search shown as points on this surface; x marks the documents found by the search and a separate marker shows the query.]

30 Relevance Feedback (Concept) [Figure: hits from the original search shown as points; x marks documents identified as non-relevant, o marks documents identified as relevant; the original query and the reformulated query are both marked.]

31 Document Clustering (Concept) [Figure: documents shown as points that group into several clusters.] Document clusters are a form of automatic classification. A document may be in several clusters.