CS 430: Information Discovery


CS 430: Information Discovery Lecture 4 Vector Methods

Course Administration • Assignment 1 should be posted tomorrow. Submission instructions will be added on Monday.

Vector Space Methods
Problem: Given two text documents, how similar are they?
Vector space methods that measure similarity do not assume exact matches.
Example: Here are three documents. How similar are they?
  d1: ant ant bee
  d2: dog bee dog hog dog ant dog
  d3: cat gnu dog eel fox
Documents can be any length from one word to thousands. One document may be a query.

Vector Space Methods: Concept
Documents are represented in an n-dimensional space, where n is the total number of different terms used to index a set of documents.
Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document.
Similarity between two documents is measured by the angle between their vectors: the smaller the angle, the more similar the documents.

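Three Terms Represented in 3 Dimensions
[Figure not reproduced: a three-dimensional plot with one axis per term; the only surviving axis label is t1.]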

Vector Space Revision
$x = (x_1, x_2, x_3, \dots, x_n)$ is a vector in an n-dimensional vector space.
Length of x is given by (extension of Pythagoras's theorem):
  $|x|^2 = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2$
If $x_1 = (x_{11}, \dots, x_{1n})$ and $x_2 = (x_{21}, \dots, x_{2n})$ are vectors:
Inner product (or dot product) is given by:
  $x_1 \cdot x_2 = x_{11}x_{21} + x_{12}x_{22} + x_{13}x_{23} + \dots + x_{1n}x_{2n}$
Cosine of the angle $\theta$ between the vectors $x_1$ and $x_2$:
  $\cos(\theta) = \frac{x_1 \cdot x_2}{|x_1| \, |x_2|}$
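The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function names `length`, `dot`, and `cosine` are chosen here for convenience.

```python
import math

def length(x):
    """Euclidean length of vector x (extension of Pythagoras's theorem)."""
    return math.sqrt(sum(xi * xi for xi in x))

def dot(x1, x2):
    """Inner (dot) product of two vectors with the same number of dimensions."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(a * b for a, b in zip(x1, x2))

def cosine(x1, x2):
    """Cosine of the angle between x1 and x2 (undefined for a zero vector)."""
    denominator = length(x1) * length(x2)
    if denominator == 0:
        raise ValueError("cosine is undefined for a zero-length vector")
    return dot(x1, x2) / denominator
```

Two vectors pointing in the same direction give a cosine of 1, and two vectors with no dimension in common give a cosine of 0.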

Basic Method: Incidence Array (No Weighting)
terms in d1 -> ant ant bee
terms in d2 -> dog bee dog hog dog ant dog
terms in d3 -> cat gnu dog eel fox
terms   ant  bee  cat  dog  eel  fox  gnu  hog  length
d1       1    1    0    0    0    0    0    0    √2
d2       1    1    0    1    0    0    0    1    √4
d3       0    0    1    1    1    1    1    0    √5
Weights: $t_{ij} = 1$ if document i contains term j and zero otherwise

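Example 1 (continued)
Similarity of documents in example (cosine of the angle between the incidence vectors, to two decimal places):
      d1    d2    d3
d1    1     0.71  0
d2    0.71  1     0.22
d3    0     0.22  1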
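The Example 1 similarities can be recomputed from the three documents. The sketch below is one possible way to do it, assuming whitespace tokenization and alphabetical term order; the names `incidence_vector`, `vectors`, and `similarity` are illustrative and not from the lecture.

```python
import math

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Index terms: every distinct word in the collection, in alphabetical order.
terms = sorted({word for text in docs.values() for word in text.split()})

def incidence_vector(text):
    """1 if the document contains the term, 0 otherwise (no weighting)."""
    present = set(text.split())
    return [1 if term in present else 0 for term in terms]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = {name: incidence_vector(text) for name, text in docs.items()}
similarity = {(a, b): cosine(vectors[a], vectors[b]) for a in docs for b in docs}

# Print the similarity matrix, one row per document.
for a in docs:
    print(a, " ".join(f"{similarity[a, b]:.2f}" for b in docs))
```

Documents d1 and d3 share no terms, so their similarity is exactly 0, while d1 and d2 share both of d1's terms and score highest.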

Vector Similarity Computation: Summary
Documents in a collection are assigned terms from a set of n terms.
The term assignment array T is defined as:
  if term j does not occur in document i, $t_{ij} = 0$
  if term j occurs in document i, $t_{ij}$ is greater than zero (the value of $t_{ij}$ is called the weight of term j in document i)
Similarity between $d_i$ and $d_j$ is defined as:
  $\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik} t_{jk}}{|d_i| \, |d_j|}$

Simple Uses of Vector Similarity in Information Retrieval
Threshold: For query q, retrieve all documents with similarity more than a threshold, t, e.g., t = 0.50.
Ranking: For query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]
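Both uses can be sketched as small functions over a mapping from document name to its similarity with the query. The function names and the sample scores below are hypothetical, chosen only for illustration.

```python
import heapq

def retrieve_by_threshold(similarities, t=0.50):
    """Return the documents whose similarity to the query is more than t."""
    return [doc for doc, score in similarities.items() if score > t]

def retrieve_top_n(similarities, n):
    """Return the n most similar (document, similarity) pairs, best first."""
    return heapq.nlargest(n, similarities.items(), key=lambda item: item[1])

# Hypothetical similarities between a query q and four documents.
scores = {"d1": 0.71, "d2": 0.22, "d3": 0.05, "d4": 0.64}

above_threshold = retrieve_by_threshold(scores, t=0.50)
top_two = retrieve_top_n(scores, n=2)
```

Thresholding returns a set of unpredictable size, whereas ranking always returns at most n documents in a fixed order, which may be one reason ranking is the usual choice.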

Contrast with Boolean Searching
With Boolean retrieval, a document either matches a query exactly or not at all:
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents:
• Encourages long queries to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need match the document

Document Vectors as Points on a Surface
• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
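A short sketch of the normalization step follows. It also checks a standard identity that explains why similar documents end up close together: for unit vectors u and v, the squared distance between their end points is 2 - 2 cos(u, v), so a larger cosine means a smaller distance. The function names are illustrative, not from the lecture.

```python
import math

def normalize(d):
    """Scale a non-zero document vector to length 1."""
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0:
        raise ValueError("cannot normalize a zero-length vector")
    return [x / norm for x in d]

def dot(u, v):
    """Inner product; for unit vectors this equals the cosine of the angle."""
    return sum(a * b for a, b in zip(u, v))

def distance(u, v):
    """Straight-line distance between the end points of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Incidence vectors for d1 and d2 over the terms ant..hog in alphabetical order.
u = normalize([1, 1, 0, 0, 0, 0, 0, 0])
v = normalize([1, 1, 0, 1, 0, 0, 0, 1])

cosine_uv = dot(u, v)
squared_distance = distance(u, v) ** 2
```

After normalization no division is needed at query time: the similarity of two documents is just the dot product of their unit vectors.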

Results of a Search
[Figure not reproduced. Legend: x = documents found by search; a separate marker shows the query; one group of points is labeled "hits from search".]

Relevance Feedback (Concept)
[Figure not reproduced. The points plotted are the hits from the original search. Legend: x = documents identified as non-relevant; o = documents identified as relevant; separate markers show the original query and the reformulated query.]
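The slide gives only the concept and does not say how the reformulated query is computed. One widely used approach, assumed here purely for illustration, is a Rocchio-style update that moves the query toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. The function name and the weights `alpha`, `beta`, and `gamma` are hypothetical defaults, not values from the lecture.

```python
def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style query update (an assumed method, not taken from the slides).

    query: list of term weights
    relevant, non_relevant: lists of document vectors of the same dimension
    """
    def centroid(vectors):
        # Mean vector; an empty set contributes nothing to the update.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    # Negative weights are clipped to zero, a common convention.
    return [max(0.0, alpha * q + beta * r - gamma * s)
            for q, r, s in zip(query, rel, non)]
```

With equal weights, a query on the first term only, one relevant document on the second term, and one non-relevant document on the first term, the reformulated query moves entirely onto the second term.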

Document Clustering (Concept)
[Figure not reproduced: documents plotted as x marks, grouped into clusters.]
Document clusters are a form of automatic classification. A document may be in several clusters.

Term Weighting
Zipf's Law: If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:
  $r(w) \cdot f(w) = c$
where c is a constant for the collection.
This suggests that some terms are more effective than others in retrieval. In particular, relative frequency is a useful measure that identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency.
Term weights are functions that are used to quantify these concepts.
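The rank-frequency relation can be checked on any tokenized collection. The sketch below ranks words by frequency and reports the product r(w) * f(w) for each; the function name and the toy word list are hypothetical, and a list this small only demonstrates the calculation, not the law itself.

```python
from collections import Counter

def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency), most frequent first."""
    counts = Counter(words)
    ranked = counts.most_common()  # sorted by descending frequency
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy collection for illustration only.
sample = "a a a a b b c".split()
table = rank_frequency_products(sample)
```

On a large corpus, Zipf's Law predicts that the last column stays roughly constant across ranks.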

Categories of Weighting
Term Frequency: A term that appears many times within a document is likely to be a better discriminator than a term that appears only once.
Document Frequency: A term that appears in many documents is likely to be a less good discriminator than one that appears in few documents.
Document Length: A term that appears in a short document is likely to be a better discriminator than one that appears in a long document.

Term Weighting: Term Frequency
Similarity calculated from an incidence matrix, without weighting, measures the occurrences of terms, but no other characteristics of the documents.
Definition: The term frequency of a term is the number of times that it occurs in a document.
Notation: tf
A frequency matrix weights each term by the number of times that it occurs in a document. Similarity calculated from a frequency matrix is likely to provide more information about a document than similarity calculated without weights.

Frequency Matrix (Weighting by Term Frequency)
terms in d1 -> ant ant bee
terms in d2 -> dog bee dog hog dog ant dog
terms in d3 -> cat gnu dog eel fox
terms   ant  bee  cat  dog  eel  fox  gnu  hog  length
d1       2    1    0    0    0    0    0    0    √5
d2       1    1    0    4    0    0    0    1    √19
d3       0    0    1    1    1    1    1    0    √5
Weights: $t_{ij}$ = number of times that term j occurs in document i

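Example 2 (continued)
Similarity of documents in example (cosine of the angle between the frequency vectors, to two decimal places):
      d1    d2    d3
d1    1     0.31  0
d2    0.31  1     0.41
d3    0     0.41  1
Similarity depends upon the weights given to the terms. [Note differences in results from Example 1 with no weighting.]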
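The Example 2 similarities can be recomputed by replacing the 0/1 incidence weights with term counts. As before, this is a sketch that assumes whitespace tokenization and alphabetical term order; the names `frequency_vector`, `vectors`, and `similarity` are illustrative.

```python
import math
from collections import Counter

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Index terms: every distinct word in the collection, in alphabetical order.
terms = sorted({word for text in docs.values() for word in text.split()})

def frequency_vector(text):
    """Weight each term by the number of times it occurs in the document (tf)."""
    counts = Counter(text.split())
    return [counts[term] for term in terms]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = {name: frequency_vector(text) for name, text in docs.items()}
similarity = {(a, b): cosine(vectors[a], vectors[b]) for a in docs for b in docs}

# Print the similarity matrix, one row per document.
for a in docs:
    print(a, " ".join(f"{similarity[a, b]:.2f}" for b in docs))
```

With term-frequency weights, the four occurrences of "dog" dominate d2, which lowers its similarity to d1 and raises its similarity to d3 compared with the unweighted case.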