Modeling (Chap. 2) Modern Information Retrieval Spring 2000.

Introduction
- Traditional IR systems adopt index terms to index and retrieve documents.
- An index term is simply any word that appears in the text of the documents.
- Retrieval based on index terms is simple.
  - The premise is that the semantics of the documents and of the user's information need can be expressed through a set of index terms.

Key Question
- The semantics of a document (or user request) are lost when its text is replaced with a set of words.
- Matching between documents and the user request is done in the very imprecise space of index terms, which leads to low-quality retrieval.
- The problem is worse for users with no training in properly forming queries, a cause of the frequent dissatisfaction of Web users with the answers they obtain.

Taxonomy of IR Models
- Three classic models:
  - Boolean: documents and queries are represented as sets of index terms
  - Vector: documents and queries are represented as vectors in a t-dimensional space
  - Probabilistic: document and query representations are based on probability theory

Basic Concepts
- Classic models consider that each document is described by a set of index terms.
- An index term is a (document) word that helps in remembering the document's main themes.
  - Index terms are used to index and summarize the document content.
  - In general, index terms are nouns, because nouns carry meaning by themselves.
  - The index terms may also be taken to be all distinct words in a document collection.

- Distinct index terms have varying relevance when describing document contents.
- Thus a numerical weight is assigned to each index term of a document.
- Let $k_i$ be an index term, $d_j$ a document, and $w_{i,j} \ge 0$ the weight associated with the pair $(k_i, d_j)$.
- The weight quantifies the importance of the index term for describing the document's semantic contents.

Definition (p. 25)
- Let $t$ be the number of index terms in the system and $k_i$ a generic index term.
- $K = \{k_1, \ldots, k_t\}$ is the set of all index terms.
- A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$.
- For an index term that does not appear in the document text, $w_{i,j} = 0$.
- Document $d_j$ is associated with an index term vector $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.

Boolean Model
- Simple retrieval model based on set theory and Boolean algebra
- The framework is easy for users to grasp (the concept of a set is intuitive)
- Queries are specified as Boolean expressions, which have precise semantics

Drawbacks
- The retrieval strategy is a binary decision (a document is either relevant or non-relevant), which prevents good retrieval performance.
- It is not simple to translate an information need into a Boolean expression (queries are difficult and awkward to express).
- Nevertheless, it is the dominant model with commercial DB systems.

Boolean Model (Cont.)
- Considers that index terms are either present or absent in a document
- Index term weights are binary, i.e. $w_{i,j} \in \{0,1\}$
- A query $q$ is composed of index terms linked by not, and, or
- The query is a Boolean expression, which can be represented in disjunctive normal form (DNF)

Boolean Model (Cont.)
- The query $q = k_a \wedge (k_b \vee \neg k_c)$ can be written in DNF as $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$.
  - Each component is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$.
  - These binary weighted vectors are called the conjunctive components of $\vec{q}_{dnf}$.
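
The DNF above can be checked by enumerating the truth table of the query. The following is only a minimal sketch: the function name `conjunctive_components` and the predicate `q` are hypothetical names introduced for illustration, not part of the slides.

```python
from itertools import product

def conjunctive_components(query, num_terms):
    """Return every binary tuple over the query terms for which `query` holds."""
    return [bits for bits in product((1, 0), repeat=num_terms) if query(*bits)]

def q(ka, kb, kc):
    # q = ka AND (kb OR NOT kc), written as a plain Python predicate
    return bool(ka and (kb or not kc))

components = conjunctive_components(q, 3)
print(components)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

Each tuple lists the weights for (ka, kb, kc) in that order, so the output matches the three conjunctive components given on the slide.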

Boolean Model (Cont.)
- Index term weight variables are all binary, i.e. $w_{i,j} \in \{0,1\}$.
- A query $q$ is a Boolean expression.
- Let $\vec{q}_{dnf}$ be the DNF for query $q$.
- Let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$.
- The similarity of document $d_j$ to query $q$ is
  - $sim(d_j, q) = 1$ if $\exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc}))$, where $g_i(\vec{d}_j) = w_{i,j}$
  - $sim(d_j, q) = 0$ otherwise
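
The similarity rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the slides' implementation: documents are binary weight tuples over the whole vocabulary, the comparison is restricted to the positions of the query terms, and the names `boolean_sim`, `query_positions`, and the sample documents are hypothetical.

```python
def boolean_sim(doc_weights, dnf, query_positions):
    """Return 1 if the document matches some conjunctive component, else 0.

    doc_weights: tuple of binary weights w_ij, one per vocabulary term.
    dnf: set of conjunctive components over the query terms.
    query_positions: vocabulary positions of the query terms (assumption).
    """
    projected = tuple(doc_weights[i] for i in query_positions)
    return 1 if projected in dnf else 0

# DNF of q = ka AND (kb OR NOT kc); ka, kb, kc sit at positions 0, 1, 2.
dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
positions = (0, 1, 2)

d1 = (1, 0, 0, 1)  # contains ka only (plus an unrelated fourth term)
d2 = (1, 0, 1, 0)  # contains ka and kc but not kb
print(boolean_sim(d1, dnf, positions), boolean_sim(d2, dnf, positions))
```

Document d1 matches the component (1, 0, 0), while d2 matches none, so it is predicted non-relevant even though it shares two terms with the query.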

Boolean Model (Cont.)
- If $sim(d_j, q) = 1$, the Boolean model predicts that document $d_j$ is relevant to query $q$ (it might not be).
- Otherwise, the prediction is that the document is not relevant.
- The Boolean model predicts that each document is either relevant or non-relevant.
- There is no notion of a partial match.

- Main advantages
  - clean formalism
  - simplicity
- Main disadvantage
  - exact matching may lead to retrieval of too few or too many documents
- Index term weighting can lead to improved retrieval performance

Vector Model
- Assigns non-binary weights to index terms in queries and documents
- Term weights are used to compute the degree of similarity between a document and the user query
- By sorting retrieved documents in decreasing order of degree of similarity, the vector model takes partially matching documents into account
  - The ranked document answer set is a lot more precise than the answer set retrieved by the Boolean model

Vector Model (Cont.)
- The weight $w_{i,j}$ for the pair $(k_i, d_j)$ is positive and non-binary.
- Index terms in the query are also weighted.
- Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \ge 0$.
- The query vector is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $t$ is the total number of index terms in the system.
- The vector for document $d_j$ is $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.

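Vector Model (Cont.)
- Document $d_j$ and user query $q$ are represented as t-dimensional vectors.
- The degree of similarity of $d_j$ with regard to $q$ is evaluated as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$.
- The correlation can be quantified by the cosine of the angle between these two vectors:
  - $sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$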

Vector Model (Cont.)
- Since all weights are non-negative, $sim(d_j, q)$ varies from 0 to +1.
- The model ranks documents according to their degree of similarity to the query.
- A document may be retrieved even if it only partially matches the query.
  - Establish a threshold on $sim(d_j, q)$ and retrieve the documents with a degree of similarity above that threshold.
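
Cosine ranking with a threshold can be sketched as follows. The function names `cosine_sim` and `rank`, the toy weight vectors, and the threshold value 0.1 are all assumptions made for illustration.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

def rank(docs, q, threshold=0.0):
    """Return (name, score) pairs above the threshold, best match first."""
    scored = [(name, cosine_sim(vec, q)) for name, vec in docs.items()]
    kept = [(name, s) for name, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": [1.0, 0.0, 1.0],  # matches both query terms
    "d2": [0.0, 1.0, 0.0],  # matches neither
    "d3": [1.0, 1.0, 0.0],  # matches one of the two
}
query = [1.0, 0.0, 1.0]
ranking = rank(docs, query, threshold=0.1)
print(ranking)
```

Here d3 is still retrieved, ranked below d1, although it matches the query only partially; d2 falls below the threshold and is dropped.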

Index term weights
- The documents form a collection $C$ of objects.
- The user query specifies a set $A$ of objects.
- The IR problem is to determine which documents are in set $A$ and which are not (i.e. a clustering problem).
- In a clustering problem, two issues must be resolved:
  - intra-cluster similarity: which features better describe the objects in set $A$
  - inter-cluster dissimilarity: which features better distinguish the objects in set $A$ from the remaining objects in collection $C$

- In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of term $k_i$ inside document $d_j$ (the tf factor).
  - It measures how well the term describes the document contents.
- Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of term $k_i$ among the documents in the collection (the idf factor).
  - Terms that appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one.

Definition (p. 29)
- Let $N$ be the total number of documents in the system.
- Let $n_i$ be the number of documents in which index term $k_i$ appears.
- Let $freq_{i,j}$ be the raw frequency of term $k_i$ in document $d_j$,
  - i.e. the number of times term $k_i$ is mentioned in the text of document $d_j$.
- The normalized frequency $f_{i,j}$ of term $k_i$ in $d_j$ is
  - $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$

- The maximum is computed over all terms mentioned in the text of document $d_j$.
- If term $k_i$ does not appear in document $d_j$, then $f_{i,j} = 0$.
- Let $idf_i$, the inverse document frequency for $k_i$, be
  - $idf_i = \log \frac{N}{n_i}$
- The best-known term weighting scheme is
  - $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
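
A tf-idf weighting of this kind can be sketched as below, assuming the weight is the max-normalized term frequency multiplied by log(N / n_i). The function name `tf_idf`, the toy token lists, and the choice of the natural logarithm are assumptions; the slides do not fix the base of the logarithm.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map each document name to a {term: weight} dictionary.

    docs: dict of name -> list of tokens (hypothetical toy input).
    """
    N = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    n = Counter()  # n[t] = number of documents containing term t
    for c in counts.values():
        n.update(c.keys())
    weights = {}
    for name, c in counts.items():
        max_freq = max(c.values()) if c else 0
        weights[name] = {
            term: (freq / max_freq) * math.log(N / n[term])
            for term, freq in c.items()
        }
    return weights

docs = {
    "d1": ["ir", "ir", "model"],
    "d2": ["ir", "vector"],
    "d3": ["vector", "vector", "space"],
}
w = tf_idf(docs)
print(w["d1"])
```

Terms absent from a document simply have no entry, which corresponds to a weight of 0. A term that appears in every document would receive weight log(1) = 0, reflecting that it does not help distinguish documents.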

- Advantages of the vector model
  - its term weighting scheme improves retrieval performance
  - it retrieves documents that approximate the query conditions
  - it sorts documents according to their degree of similarity to the query
- Disadvantage
  - index terms are assumed to be mutually independent

Probabilistic Model
- Given a user query, there is a set of documents that contains exactly the relevant documents and no others.
  - This is the ideal answer set.
- Given a description of the ideal answer set, there would be no problem in retrieving its documents.
- The querying process is a process of specifying the properties of the ideal answer set.
  - The properties are not exactly known.
  - There are index terms whose semantics are used to characterize these properties.

Probabilistic Model (Cont.)
- These properties are not known at query time.
- An effort has to be made to initially guess what they (i.e. the properties) are.
- The initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- User interaction is then initiated to improve the probabilistic description of the ideal answer set.

- The user examines the retrieved documents and decides which ones are relevant.
- This information is used to refine the description of the ideal answer set.
- By repeating this process, the description will evolve and get closer to the ideal answer set.

Fundamental Assumption
- Given a user query $q$ and a document $d_j$ in the collection, the probabilistic model estimates the probability that the user will find document $d_j$ relevant.
  - It assumes that the probability of relevance depends on the query and document representations only.
  - It assumes that there is a subset of all documents which the user prefers as the answer set for query $q$.
  - This ideal answer set is labeled $R$.
  - Documents in set $R$ are predicted to be relevant to the query.

- Given a query $q$, the probabilistic model assigns to each document $d_j$ the ratio $P(d_j \text{ relevant-to } q) / P(d_j \text{ non-relevant-to } q)$.
  - This ratio is the measure of similarity to the query.
  - It gives the odds of document $d_j$ being relevant to query $q$.

- Index term weight variables are all binary, i.e. $w_{i,j} \in \{0,1\}$, $w_{i,q} \in \{0,1\}$.
- A query $q$ is a subset of index terms.
- Let $R$ be the set of documents known (or initially guessed) to be relevant.
- Let $\bar{R}$ be the complement of $R$.
- Let $P(R \mid \vec{d}_j)$ be the probability that document $d_j$ is relevant to query $q$.
- Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that document $d_j$ is not relevant to query $q$.

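- The similarity $sim(d_j, q)$ of document $d_j$ to query $q$ is the ratio
  - $sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}$
  - $sim(d_j, q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}$
  - $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$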
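
Scoring under this model can be sketched in Python, assuming the final ranking formula has the usual binary-independence form: a sum, over the terms shared by query and document, of log(P(k_i|R) / (1 - P(k_i|R))) + log((1 - P(k_i|not R)) / P(k_i|not R)). The function name `prob_score` and the probability values are hypothetical, and every probability is assumed to lie strictly between 0 and 1.

```python
import math

def prob_score(doc_terms, query_terms, p_rel, p_nonrel):
    """Sum the log-odds contributions of terms present in both query and document.

    p_rel[k]    : estimate of P(k | R), strictly between 0 and 1 (assumption)
    p_nonrel[k] : estimate of P(k | not R), strictly between 0 and 1 (assumption)
    """
    score = 0.0
    # With binary weights, w_iq * w_ij is 1 only for terms in both sets.
    for k in query_terms & doc_terms:
        pr, pn = p_rel[k], p_nonrel[k]
        score += math.log(pr / (1 - pr)) + math.log((1 - pn) / pn)
    return score

p_rel = {"ir": 0.5, "model": 0.5}
p_nonrel = {"ir": 0.5, "model": 0.1}
query = {"ir", "model"}

score_a = prob_score({"ir", "model", "web"}, query, p_rel, p_nonrel)
score_b = prob_score({"web"}, query, p_rel, p_nonrel)
print(score_a, score_b)
```

In this toy setting only "model" contributes to the first score, because "ir" is equally likely in relevant and non-relevant documents and so carries no evidence either way.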

- How can $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$ be computed initially?
  - Assume $P(k_i \mid R)$ is constant for all index terms $k_i$ (typically 0.5):
  - $P(k_i \mid R) = 0.5$
  - Assume the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all documents in the collection:
  - $P(k_i \mid \bar{R}) = n_i / N$, where $n_i$ is the number of documents containing index term $k_i$ and $N$ is the total number of documents.
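
As a worked step, the initial estimates can be substituted into the ranking formula, assuming it has the usual log-odds form over the terms shared by query and document. With $P(k_i \mid R) = 0.5$ the first logarithm vanishes, and the second reduces to an idf-like factor:

```latex
\[
\log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} = \log\frac{0.5}{0.5} = 0,
\qquad
\log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}
  = \log\frac{1 - n_i/N}{n_i/N}
  = \log\frac{N - n_i}{n_i}
\]
\[
sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \log\frac{N - n_i}{n_i}
\]
```

Under these assumptions, the first retrieval pass therefore favors documents containing query terms that are rare in the collection.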

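- Let $V$ be the subset of documents initially retrieved and ranked by the model.
- Let $V_i$ be the subset of $V$ composed of the documents in $V$ that contain index term $k_i$.
- $P(k_i \mid R)$ is approximated by the distribution of index term $k_i$ among the documents retrieved so far ($V$ and $V_i$ also denote the number of elements in the respective sets):
  - $P(k_i \mid R) = V_i / V$
- $P(k_i \mid \bar{R})$ is approximated by considering that all non-retrieved documents are not relevant:
  - $P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}$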
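
One round of re-estimation can be sketched as follows. The estimate for non-relevant documents is taken to be (n_i - V_i) / (N - V), which follows from treating every non-retrieved document as non-relevant. The function name `reestimate`, the posting sets, and the document ids are hypothetical.

```python
def reestimate(retrieved, docs_with_term, N):
    """Re-estimate P(k|R) and P(k|not R) from the retrieved set V.

    retrieved      : set of ids of the documents retrieved so far (the set V)
    docs_with_term : dict of term -> set of ids of all documents containing it
    N              : total number of documents in the collection
    """
    V = len(retrieved)
    p_rel, p_nonrel = {}, {}
    for term, postings in docs_with_term.items():
        n_i = len(postings)              # documents containing the term
        V_i = len(postings & retrieved)  # retrieved documents containing it
        p_rel[term] = V_i / V
        p_nonrel[term] = (n_i - V_i) / (N - V)
    return p_rel, p_nonrel

retrieved = {1, 2, 3, 4}
docs_with_term = {"ir": {1, 2, 3, 7, 8}, "model": {4, 9}}
p_rel, p_nonrel = reestimate(retrieved, docs_with_term, N=10)
print(p_rel, p_nonrel)
```

The sketch assumes V is non-empty and smaller than N. These raw ratios can reach 0 or 1 for small sets, which would break a log-odds score, so in practice some smoothing adjustment is typically applied.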

- Advantages
  - documents are ranked in decreasing order of their probability of being relevant
- Disadvantages
  - need to guess the initial separation of documents into relevant and non-relevant sets
  - all index term weights are binary
  - index terms are assumed to be mutually independent