1 CS 430 / INFO 430 Information Retrieval Lecture 6 Vector Methods 2.

Similar presentations
Chapter 5: Introduction to Information Retrieval
INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
Gertjan van Noord, 2014. Zoekmachines (Search Engines), Lecture 4.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
CS 430 / INFO 430 Information Retrieval
IR Models: Overview, Boolean, and Vector
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Searching Full Text 2.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
Ch 4: Information Retrieval and Text Mining
IS 240 – Spring 2007, Prof. Ray Larson, University of California, Berkeley School of Information. Tuesday and Thursday, 10:30 am - 12:00.
Hinrich Schütze and Christina Lioma
CS/Info 430: Information Retrieval
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
The Vector Space Model …and applications in Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Vector Methods Classical IR Thanks to: SIMS W. Arms.
1 CS 430: Information Discovery Lecture 15 Library Catalogs 3.
1 Introduction to Lucene Rong Jin. What is Lucene ?  Lucene is a high performance, scalable Information Retrieval (IR) library Free, open-source project.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
CSE3201/CSE4500 Term Weighting.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Boolean Model Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
1 CS 430: Information Discovery Sample Midterm Examination Notes on the Solutions.
Vector Space Models.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
CpSc 881: Information Retrieval. 2 Using language models (LMs) for IR ❶ LM = language model ❷ We view the document as a generative model that generates.
Sudhanshu Khemka.  Treats each document as a vector with one component corresponding to each term in the dictionary  Weight of a component is calculated.
Ranked Retrieval INST 734 Module 3 Doug Oard. Agenda Ranked retrieval  Similarity-based ranking Probability-based ranking.
Information Retrieval and Web Search IR models: Vector Space Model Instructor: Rada Mihalcea [Note: Some slides in this set were adapted from an IR course.
1 CS 430: Information Discovery Lecture 5 Ranking.
Information Retrieval Lecture 6 Vector Methods 2.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture Probabilistic Information Retrieval.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
IR 6 Scoring, term weighting and the vector space model.
1 Midterm Examination. 2 General Observations Examination was too long! Most people submitted by .
Automated Information Retrieval
Plan for Today’s Lecture(s)
Vector Methods Classical IR
CS 430: Information Discovery
Representation of documents and queries
From frequency to meaning: vector space models of semantics
Presentation transcript:

1 CS 430 / INFO 430 Information Retrieval Lecture 6 Vector Methods 2

2 Course Administration
New Teaching Assistant: Veselin (Ves) Stoyanov
Wednesday Classes: The grading for participation allows you to miss a couple of classes. The material in the discussion classes will be part of the examinations.

3 Course Administration
Assignment 1
Programs must run from the command prompt in the CSUGLAB.
Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.
Since the volume of data is small, data structures that are implemented entirely within memory are expected.

4 Choice of Weights

query      q: ant dog

document   text                           terms
d1         ant ant bee                    ant bee
d2         dog bee dog hog dog ant dog    ant bee dog hog
d3         cat gnu dog eel fox            cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
q      ?              ?
d1     ?    ?
d2     ?    ?         ?                   ?
d3               ?    ?    ?    ?    ?

What weights lead to the best information retrieval?
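
As a side note (not part of the original slide), the raw term counts f_ij behind a table like this can be computed directly from the document texts above; a minimal Java sketch:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch: raw term counts f_ij for the example query and documents above.
    public class TermCounts {

        static Map<String, Integer> counts(String text) {
            Map<String, Integer> f = new LinkedHashMap<>();
            for (String term : text.split("\\s+")) {
                f.merge(term, 1, Integer::sum);   // count occurrences of each term
            }
            return f;
        }

        public static void main(String[] args) {
            System.out.println("q:  " + counts("ant dog"));
            System.out.println("d1: " + counts("ant ant bee"));
            System.out.println("d2: " + counts("dog bee dog hog dog ant dog"));
            System.out.println("d3: " + counts("cat gnu dog eel fox"));
        }
    }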

5 Methods for Selecting Weights
Empirical: Test a large number of possible weighting schemes with actual data. (This lecture, based on the work of Salton, et al.)
Model based: Develop a mathematical model of word distribution and derive the weighting scheme theoretically. (Probabilistic model of information retrieval.)

6 Weighting: Term Frequency (tf)
Suppose term j appears f_ij times in document i. What weighting should be given to a term j?
Term Frequency: Concept
A term that appears many times within a document is likely to be more important than a term that appears only once.

7 Term Frequency: Free-text Document
Length of document: A simple method (as illustrated in Lecture 3) is to use f_ij as the term frequency.
...but, in free-text documents, terms are likely to appear more often in long documents. Therefore f_ij should be scaled by some variable related to document length.

8 Term Frequency: Free-text Document
A standard method for free-text documents: Scale f_ij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.
Let m_i = max_j (f_ij), i.e., m_i is the maximum frequency of any term in document i.
Term frequency (tf): tf_ij = f_ij / m_i   when f_ij > 0
Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.
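
A minimal sketch of this tf definition (added for illustration; the method and variable names are not from the lecture):

    // Sketch: tf_ij = f_ij / m_i, defined only when the term occurs (f_ij > 0).
    public class TermFrequency {

        static double tf(int rawFreq, int maxFreq) {
            if (rawFreq <= 0) return 0.0;         // term does not occur in the document
            return (double) rawFreq / maxFreq;    // scale by the most frequent term, m_i
        }

        public static void main(String[] args) {
            // In document d2 above, "dog" occurs 4 times and is the most frequent term (m_i = 4).
            System.out.println(tf(4, 4));   // dog -> 1.0
            System.out.println(tf(1, 4));   // ant -> 0.25
        }
    }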

9 Weighting: Inverse Document Frequency (idf)
Suppose term j appears f_ij times in document i. What weighting should be given to a term j?
Inverse Document Frequency: Concept
A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

10 Inverse Document Frequency
Suppose there are n documents and that the number of documents in which term j occurs is n_j.
A possible method might be to use n/n_j as the inverse document frequency.
A standard method: The simple method over-emphasizes small differences. Therefore use a logarithm.
Inverse document frequency (idf): idf_j = log2(n/n_j) + 1   when n_j > 0
Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.
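
A minimal sketch of this idf definition (illustrative code, not from the lecture; the sample document frequencies are made up):

    // Sketch: idf_j = log2(n / n_j) + 1, where n is the number of documents
    // and n_j the number of documents containing term j.
    public class InverseDocumentFrequency {

        static double idf(int n, int nj) {
            if (nj <= 0) throw new IllegalArgumentException("n_j must be positive");
            return Math.log((double) n / nj) / Math.log(2) + 1;   // log2 via natural log
        }

        public static void main(String[] args) {
            int n = 1000;
            System.out.println(idf(n, 100));    // rare term               -> ~4.32
            System.out.println(idf(n, 1000));   // term in every document  -> 1.0
        }
    }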

11 Example of Inverse Document Frequency
Example: n = 1,000 documents. idf_j = log2(n/n_j) + 1 is tabulated for four terms A-D with increasing document frequency n_j; term D, which occurs in all 1,000 documents, has the minimum weight idf_j = 1.00.
From: Salton and McGill

12 Full Weighting: A Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
(weight of term j in document i) = (term frequency) * (inverse document frequency)
A standard tf.idf weighting scheme, for free-text documents, is:
t_ij = tf_ij * idf_j = (f_ij / m_i) * (log2(n/n_j) + 1)   when n_j > 0
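
Putting the two factors together, a minimal sketch of the standard tf.idf weight above (illustrative names and values, not from the lecture):

    // Sketch: t_ij = (f_ij / m_i) * (log2(n / n_j) + 1) when f_ij > 0 and n_j > 0.
    public class TfIdf {

        static double weight(int rawFreq, int maxFreq, int numDocs, int docFreq) {
            if (rawFreq <= 0 || docFreq <= 0) return 0.0;
            double tf = (double) rawFreq / maxFreq;
            double idf = Math.log((double) numDocs / docFreq) / Math.log(2) + 1;
            return tf * idf;
        }

        public static void main(String[] args) {
            // A term occurring twice in a document whose most frequent term occurs 4 times,
            // and appearing in 100 of 1,000 documents: 0.5 * 4.32 = ~2.16.
            System.out.println(weight(2, 4, 1000, 100));
        }
    }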

13 Structured Text
Structured texts, e.g., queries, catalog records or abstracts, have a different distribution of terms from free text. A modified expression for the term frequency is:
tf_ij = K + (1 - K) * f_ij / m_i   when f_ij > 0
K is a parameter between 0 and 1 that can be tuned for a specific collection.
Query: To weight terms in the query, Salton and Buckley recommend K equal to 0.5.
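
A minimal sketch of this modified term frequency for structured text and queries (illustrative, not from the lecture):

    // Sketch: tf_ij = K + (1 - K) * f_ij / m_i when f_ij > 0, with 0 <= K <= 1.
    public class StructuredTf {

        static double tf(double k, int rawFreq, int maxFreq) {
            if (rawFreq <= 0) return 0.0;
            return k + (1 - k) * (double) rawFreq / maxFreq;
        }

        public static void main(String[] args) {
            // Query weighting with K = 0.5, as recommended by Salton and Buckley.
            System.out.println(tf(0.5, 1, 1));   // most frequent query term -> 1.0
            System.out.println(tf(0.5, 1, 2));   // half as frequent         -> 0.75
        }
    }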

14 Summary: Similarity Calculation
The similarity between query q and document i is given by:

    cos(d_q, d_i) = ( Σ_{k=1..n} t_qk * t_ik ) / ( |d_q| |d_i| )

where d_q and d_i are the corresponding weighted term vectors, with components in the k dimension (corresponding to term k) given by:

    t_qk = (0.5 + 0.5 * f_qk / m_q) * (log2(n/n_k) + 1)   when f_qk > 0
    t_ik = (f_ik / m_i) * (log2(n/n_k) + 1)               when f_ik > 0
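
A minimal sketch of the cosine similarity between two weighted term vectors (illustrative; the sample weights are made up, not taken from the lecture):

    // Sketch: cos(d_q, d_i) = (sum_k t_qk * t_ik) / (|d_q| * |d_i|).
    public class CosineSimilarity {

        static double cosine(double[] q, double[] d) {
            double dot = 0, qNorm = 0, dNorm = 0;
            for (int k = 0; k < q.length; k++) {
                dot   += q[k] * d[k];
                qNorm += q[k] * q[k];
                dNorm += d[k] * d[k];
            }
            if (qNorm == 0 || dNorm == 0) return 0.0;   // an empty vector has no direction
            return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
        }

        public static void main(String[] args) {
            // Illustrative tf.idf weights over the terms (ant, bee, dog).
            double[] q = {2.16, 0.0, 1.59};
            double[] d = {0.54, 0.43, 1.59};
            System.out.println(cosine(q, d));
        }
    }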

15 Discussion of Similarity
This choice of similarity measure is widely used and works well on a wide range of documents, but it has no theoretical basis.
1. There are several possible measures other than the angle between vectors.
2. There is a choice of possible definitions of tf and idf.
3. With fielded searching, there are various ways to adjust the weight given to each field.
The definitions on Slide 12 can be considered the standard.

16

17 Jakarta Lucene
Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java. The technology is suitable for nearly any application that requires full-text search, especially cross-platform.
Jakarta Lucene is an open-source project available for free download from Apache Jakarta. Versions are also available in several other languages, including C++.
The original author was Doug Cutting.

18

19 Similarity and DefaultSimilarity
public abstract class Similarity
The score of query q for document d is defined in terms of these methods as follows:

    score(q, d) = Σ_{t in q} tf(t in d) * idf(t) * getBoost(t.field in d)
                  * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q)

public class DefaultSimilarity extends Similarity

20 Class DefaultSimilarity
tf
public float tf(float freq)
Implemented as: sqrt(freq)

lengthNorm
public float lengthNorm(String fieldName, int numTerms)
Implemented as: 1/sqrt(numTerms)
Parameters: numTokens - the total number of tokens contained in fields named fieldName of the document

idf
public float idf(int docFreq, int numDocs)
Implemented as: log(numDocs/(docFreq+1)) + 1

21 Class DefaultSimilarity
coord
public float coord(int overlap, int maxOverlap)
Implemented as: overlap / maxOverlap
Parameters: overlap - the number of query terms matched in the document; maxOverlap - the total number of terms in the query

getBoost
Returns the boost factor for hits on any field of this document (set elsewhere).

queryNorm
Does not affect ranking, but rather just attempts to make scores from different queries comparable.
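
For concreteness, a sketch of a Similarity subclass that reproduces the DefaultSimilarity formulas listed on the last two slides. The method signatures follow the slides and correspond to older Lucene releases; package names and signatures differ in later versions, so treat this as an assumption-laden illustration rather than a drop-in implementation.

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: a custom similarity mirroring the DefaultSimilarity formulas above.
    public class LectureSimilarity extends DefaultSimilarity {

        @Override
        public float tf(float freq) {
            return (float) Math.sqrt(freq);                  // sqrt(freq)
        }

        @Override
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1);   // log(numDocs/(docFreq+1)) + 1
        }

        @Override
        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));      // 1/sqrt(numTerms)
        }

        @Override
        public float coord(int overlap, int maxOverlap) {
            return (float) overlap / maxOverlap;             // fraction of query terms matched
        }
    }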