Basic IR: Modeling


Basic IR: Modeling Basic IR task: match a subset of documents to the user's query. Slightly more complex: also rank the resulting documents by predicted relevance. The derivation of relevance leads to different IR models.

Concepts: Term-Document Incidence Imagine a matrix of terms x documents, with a 1 when the term appears in the document and 0 otherwise. How are queries satisfied with such a matrix? What are its problems? [Figure in original slide: a sample incidence matrix with terms such as search, segment, select, semantic against documents such as MIR and AI.]
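To make the idea concrete, here is a minimal sketch (the documents, terms, and query below are invented for illustration, not taken from the slide's figure) of building an incidence matrix and answering a conjunctive query with it:

```python
# Minimal sketch: build a term-document incidence matrix and use it to
# answer a Boolean AND query. Documents and query terms are invented.
docs = {
    "d1": "search engines segment text into terms",
    "d2": "semantic models select relevant documents",
    "d3": "search and select documents by semantic similarity",
}

terms = sorted({t for text in docs.values() for t in text.split()})
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()}
             for t in terms}

def boolean_and(query_terms):
    # a document satisfies the query if its incidence bit is 1 for every term
    return [d for d in docs
            if all(incidence.get(t, {}).get(d, 0) for t in query_terms)]

print(boolean_and(["search", "semantic"]))   # -> ['d3']
```

One likely answer to "Problems?": for a real collection this matrix is enormous and almost entirely zeros, which motivates the sparse representations discussed later.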

Concepts: Term Frequency To support document ranking, we need more than term incidence. Term frequency records the number of times a given term appears in each document. Intuition: the more times a term appears in a document, the more central it is to the topic of that document.
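A quick illustrative snippet (the document text is hypothetical) showing raw term frequencies:

```python
from collections import Counter

# Illustrative only: raw term frequencies for one (hypothetical) document.
doc = "retrieval models rank documents and retrieval models rank them well"
tf = Counter(doc.split())
print(tf["retrieval"], tf["well"])   # 2 1 -- 'retrieval' is more central here
```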

Concept: Term Weight Weights represent the importance of a given term for characterizing a document. wij is a weight for term i in document j.

Mapping Task and Document Type to Model Document views: index terms, full text, full text + structure. Searching (Retrieval): classic models (index terms / full text); structured models (full text + structure). Surfing (Browsing): flat (index terms); hypertext (full text); structure guided (full text + structure).

IR Models (taxonomy from the MIR text)
User task: Retrieval (ad hoc, filtering) and Browsing.
Retrieval models:
Classic models: Boolean, Vector, Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured models: Non-Overlapping Lists, Proximal Nodes
Browsing models: Flat, Structure Guided, Hypertext

Classic Models: Basic Concepts
ki is an index term; dj is a document; t is the total number of index terms.
K = (k1, k2, ..., kt) is the set of all index terms.
wij >= 0 is a weight associated with the pair (ki, dj); wij = 0 indicates that the term does not belong to the document.
vec(dj) = (w1j, w2j, ..., wtj) is the weighted vector associated with document dj.
gi(vec(dj)) = wij is a function that returns the weight associated with the pair (ki, dj).

Classic: Boolean Model Based on set theory: map queries with Boolean operations to set operations. Select documents from the term-document incidence matrix. Pros: simple model with a clean formalism and precise semantics. Cons: exact matching only and no ranking (see the next slide).
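As a rough sketch of the idea (the postings sets below are invented for illustration), Boolean operators map directly onto set operations over per-term document sets:

```python
# Sketch: Boolean queries as set operations over per-term document sets.
# The postings below are invented for illustration.
postings = {
    "retrieval": {1, 2, 4, 5},
    "boolean":   {2, 3, 5},
    "vector":    {1, 5, 6},
}

# Query: retrieval AND (boolean OR vector)
hits = postings["retrieval"] & (postings["boolean"] | postings["vector"])
print(sorted(hits))   # -> [1, 2, 5]
```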

Exact Matching Ignores… term frequency in the document; term scarcity in the corpus; size of the document; ranking.

Vector Model A vector of term weights based on term frequency. Compute the similarity between query and document, where both are vectors: vec(dj) = (w1j, w2j, ..., wtj), vec(q) = (w1q, w2q, ..., wtq). Similarity is the cosine of the angle between the vectors.

Cosine Measure
sim(q, dj) = cos(theta) = (vec(dj) . vec(q)) / (|vec(dj)| * |vec(q)|) = sum_i(wij * wiq) / (sqrt(sum_i wij^2) * sqrt(sum_i wiq^2)),
where theta is the angle between the document vector dj and the query vector q.
Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1. (from MIR notes)
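A minimal implementation of this measure, assuming the document and query are already represented as weight vectors over the same term order:

```python
import math

# Cosine of the angle between two weight vectors, as in the formula above.
# With non-negative weights the value lies in [0, 1].
def cosine(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0
```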

How to Set wij Weights? TF-IDF. Within a document: term frequency (tf) measures term density within the document. Across documents: inverse document frequency (idf) measures the informativeness or rarity of a term across the corpus.

TF * IDF Computation What happens as the number of occurrences of a term in a document increases? What happens as a term becomes more rare across the corpus?

TF * IDF
TF may be normalized by the most frequent term in the document: tf(i,d) = freq(i,d) / max_l(freq(l,d)).
IDF is normalized to the size of the corpus and computed as a log so that TF and IDF values are comparable: idf(i) = log(N / n_i), where N is the number of documents in the corpus and n_i is the number of documents containing term i.
IDF requires a static corpus.
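A small sketch of these two factors (the function names and tokenized inputs are my own, not from the slides); the idf here uses log(N / n_i), which is the form implied by the worked example later:

```python
import math
from collections import Counter

# Sketch of tf-idf weighting as described above (names are illustrative).
# tf is normalized by the document's most frequent term; idf = log(N / n_i).
def tf(term, doc_tokens):
    freq = Counter(doc_tokens)
    return freq[term] / max(freq.values()) if doc_tokens else 0.0

def idf(term, corpus_tokens):
    # corpus_tokens: list of token lists, one per document (a static corpus)
    n_i = sum(term in tokens for tokens in corpus_tokens)
    return math.log(len(corpus_tokens) / n_i) if n_i else 0.0

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)
```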

How to Set wiq Weights? Create a vector directly from the query, or use a modified tf-idf: wiq = (0.5 + 0.5 * freq(i,q) / max_l(freq(l,q))) * log(N / n_i).

The Vector Model: Example Consider seven documents indexed by the terms k1, k2, k3 (the term-frequency table appears as a figure in the original slide); the query has term frequencies (1, 2, 3). Which document seems to best match the query? What would we expect the ranking to be? (from MIR notes)

The Vector Model: Example (cont.)
1. Compute the tf-idf vector for each document (logs are natural logarithms).
For the first document, where k1 occurs twice, k2 not at all, and k3 once:
k1: (2/2) * log(7/5) = .33
k2: 0 * log(7/4) = 0
k3: (1/2) * log(7/3) = .42
(Without tf normalization this would be [2*log(7/5), 0, 1*log(7/3)] = [.67 0 .84]; normalized by the maximum term frequency it is [.33 0 .42].)
For the remaining documents: [.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0]. (from MIR notes)

The Vector Model: Example (cont.)
2. Compute the tf-idf vector for the query, whose term frequencies are [1 2 3]:
k1: (.5 + (.5 * 1)/3) * log(7/5) = .22
k2: (.5 + (.5 * 2)/3) * log(7/4) = .47
k3: (.5 + (.5 * 3)/3) * log(7/3) = .85
giving the query vector [.22 .47 .85].

The Vector Model: Example (cont.)
3. Compute the similarity for each document:
D1: D1 . q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
|D1| = sqrt((.33^2) + (.42^2)) = .53
|q| = sqrt((.22^2) + (.47^2) + (.85^2)) = 1.0
sim(D1, q) = .43 / (.53 * 1.0) = .81
Similarly: D2: .22, D3: .93, D4: .23, D5: .97, D6: .51, D7: .47
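A short script (vectors copied from the example above) that recomputes the cosines; small differences from the hand-rounded values are rounding artifacts:

```python
import math

# Recompute the example's cosines from the tf-idf vectors above.
def cosine(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) *
                  math.sqrt(sum(w * w for w in q)))

docs = {"D1": [.33, 0, .42], "D2": [.34, 0, 0], "D3": [0, .19, .85],
        "D4": [.34, 0, 0], "D5": [.08, .28, .85], "D6": [.17, .56, 0],
        "D7": [0, .56, 0]}
q = [.22, .47, .85]

for name in sorted(docs, key=lambda n: -cosine(docs[n], q)):
    print(name, round(cosine(docs[name], q), 2))
# Ranking comes out D5, D3, D1, D6, D7, then D2/D4, as in the slide.
```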

Vector Model Implementation Issues Sparse term x document matrix. Store the raw term count, the term weight, or the count weighted by idf_i? What if the corpus is not fixed (e.g., the Web)? What happens to IDF? How do we efficiently compute the cosine for a large index?

Heuristics for Computing Cosine for a Large Index Consider only documents with non-zero cosines. Focus on non-zero cosines for rare (high-idf) words. Pre-compute document adjacency: for each term, pre-compute its k nearest docs; for a t-term query, compute cosines from the query to the union of the t pre-computed lists and choose the top k. (A sketch of this last heuristic appears below.)
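A hedged sketch of that last heuristic (all names and data structures here are my own, not from the slides): keep, for each term, a short list of its highest-weighted documents, and at query time score only the union of those lists:

```python
from heapq import nlargest

# Sketch of the "k nearest docs per term" heuristic: keep, per term, its k
# highest-weighted documents; at query time score only the union of the
# query terms' lists rather than the whole collection.
def build_champion_lists(doc_vectors, term_index, k=10):
    # doc_vectors: {doc_id: [w_1j, ..., w_tj]}; term_index: term -> position
    champions = {}
    for term, i in term_index.items():
        best = nlargest(k, doc_vectors.items(), key=lambda item: item[1][i])
        champions[term] = [doc_id for doc_id, _ in best]
    return champions

def candidate_docs(query_terms, champions):
    cands = set()
    for t in query_terms:
        cands.update(champions.get(t, []))
    return cands   # compute cosines only for these, then take the top answers
```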

The TF-IDF Vector Model: Pros/Cons Pros: term weighting improves retrieval quality; the cosine ranking formula sorts documents according to their degree of similarity to the query. Cons: assumes independence of index terms.