Information Retrieval. Patrick Lambrix, Department of Computer and Information Science, Linköpings universitet.


2 Data sources
[Figure: data source system. Labels: Information, Model, Queries, Answer, Physical data source, Data source management system (processing of queries/updates; access to stored data).]

3 Storing and accessing textual information
- How is the information stored? (high level)
- How is the information retrieved?

4 Storing textual information
- Text (IR)
- Semi-structured data
- Data models (DB)
- Rules + Facts (KB)
[Figure axis labels: structure, precision]

5 Storing textual information - Text - Information Retrieval
- search using words
- conceptual models: boolean, vector, probabilistic, …
- file model: flat file, inverted file, ...

6 IR - File model: inverted files
[Figure: three linked files. Inverted file (columns WORD, HITS, LINK; example words: adrenergic, cloning, receptor, …) --> Postings file (columns DOC#, LINK) --> Document file (DOCUMENTS: Doc1, Doc2, …).]

7 IR - File model: inverted files
- Controlled vocabulary
- Stop list
- Stemming
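The inverted file with a stop list and stemming can be sketched as follows. This is a minimal illustration, not the structure used in the lecture: the stop list, the toy suffix-stripping `stem` function, and the two document texts are all assumptions made up for the example, and a controlled vocabulary is not modelled.

```python
from collections import defaultdict

# Hypothetical stop list; a real system would use a much longer one.
STOP_LIST = {"the", "a", "of", "in", "and", "is"}

def stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_file(documents):
    """Map each stemmed, non-stop word to its postings {doc_id: hit count}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token in STOP_LIST:
                continue  # stop words are not indexed
            term = stem(token)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

# Made-up document texts, for illustration only.
documents = {
    "Doc1": "cloning of adrenergic genes",
    "Doc2": "the cloning of sheep",
}
index = build_inverted_file(documents)
for word in sorted(index):
    print(word, index[word])
```

Each key plays the role of a WORD entry, and its dictionary plays the role of the postings list that links to the documents.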

8 IR - formal characterization
Information retrieval model: (D, Q, F, R)
- D is a set of document representations
- Q is a set of queries
- F is a framework for modeling document representations, queries and their relationships
- R associates a real number with each document-query pair (ranking)

9 IR - conceptual models
Classic information retrieval
- Boolean model
- Vector model
- Probabilistic model

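10 Boolean model
Document representation: each document is a binary vector over the index terms (yes --> 1, no --> 0). The vector order is (adrenergic, cloning, receptor), as the query translations on the following slides imply.
       adrenergic  cloning  receptor
Doc1   yes         yes      no        --> (1 1 0)
Doc2   no          yes      no        --> (0 1 0)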

11 Boolean model
Q1: cloning and (adrenergic or receptor)
Queries: boolean (and, or, not)
Queries are translated to disjunctive normal form (DNF)
DNF: disjunction of conjunctions of terms with or without 'not'
Rules:
- not not A --> A
- not(A and B) --> not A or not B
- not(A or B) --> not A and not B
- (A or B) and C --> (A and C) or (B and C)
- A and (B or C) --> (A and B) or (A and C)
- (A and B) or C --> (A or C) and (B or C)
- A or (B and C) --> (A or B) and (A or C)

12 Boolean model
Q1: cloning and (adrenergic or receptor)
--> (cloning and adrenergic) or (cloning and receptor)
--> (cloning and adrenergic and receptor) or (cloning and adrenergic and not receptor) or (cloning and receptor and adrenergic) or (cloning and receptor and not adrenergic)
--> (1 1 1) or (1 1 0) or (1 1 1) or (0 1 1)
--> (1 1 1) or (1 1 0) or (0 1 1)
The DNF is completed and translated to the same representation as the documents.

13 Boolean model
Q1: cloning and (adrenergic or receptor) --> (1 1 0) or (1 1 1) or (0 1 1). Result: Doc1
Q2: cloning and not adrenergic --> (0 1 0) or (0 1 1). Result: Doc2
Documents: Doc1 (1 1 0), Doc2 (0 1 0)

14 Boolean model
Advantages
- based on an intuitive and simple formal model (set theory and boolean algebra)
Disadvantages
- binary decisions: words are relevant or not, a document is relevant or not
- no notion of partial match

15 Boolean model
Q3: adrenergic and receptor --> (1 0 1) or (1 1 1). Result: empty
Documents: Doc1 (1 1 0), Doc2 (0 1 0)
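The three example queries can be checked with a short sketch. Instead of applying the rewrite rules symbolically, it assumes a shortcut: the completed DNF of a query is obtained by enumerating every binary vector and keeping those that satisfy the query. The vector order (adrenergic, cloning, receptor) is the one implied by the query translations; the function names are hypothetical.

```python
from itertools import product

# Vector order implied by the slides' query translations.
TERMS = ("adrenergic", "cloning", "receptor")
DOCS = {"Doc1": (1, 1, 0), "Doc2": (0, 1, 0)}

def complete_dnf(query):
    """Return all binary vectors over TERMS that satisfy the query.

    `query` is a function taking a dict term -> bool; each returned
    vector corresponds to one conjunction of the completed DNF.
    """
    vectors = []
    for bits in product((0, 1), repeat=len(TERMS)):
        assignment = {term: bool(bit) for term, bit in zip(TERMS, bits)}
        if query(assignment):
            vectors.append(bits)
    return vectors

def retrieve(query):
    """A document matches if its vector equals one of the DNF vectors."""
    accepted = set(complete_dnf(query))
    return [name for name, vector in DOCS.items() if vector in accepted]

q1 = lambda t: t["cloning"] and (t["adrenergic"] or t["receptor"])
q2 = lambda t: t["cloning"] and not t["adrenergic"]
q3 = lambda t: t["adrenergic"] and t["receptor"]

for label, query in (("Q1", q1), ("Q2", q2), ("Q3", q3)):
    print(label, complete_dnf(query), retrieve(query))
```

Q3 returns nothing even though Doc1 contains one of its two terms, which is the lack of partial matching listed among the disadvantages.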

16 Vector model (simplified)
Doc1 (1,1,0), Doc2 (0,1,0), Q (1,1,1)
[Figure: the vectors drawn in a space with axes cloning, receptor, adrenergic]
$\mathrm{sim}(d,q) = \dfrac{d \cdot q}{|d| \times |q|}$

17 Vector model
- Introduce weights in document vectors (e.g. Doc3 (0, 0.5, 0))
- Weights represent the importance of the term for describing the document contents
- Weights are positive real numbers
- Term does not occur -> weight = 0

18 Vector model
Doc1 (1,1,0), Doc3 (0,0.5,0), Q4 (0.5,0.5,0.5)
[Figure: the vectors drawn in a space with axes cloning, receptor, adrenergic]
$\mathrm{sim}(d,q) = \dfrac{d \cdot q}{|d| \times |q|}$
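The similarity formula is the cosine of the angle between the document and query vectors. A minimal sketch, using the example vectors from the slides; returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(d, q):
    """sim(d, q) = (d . q) / (|d| * |q|)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention for empty vectors (assumption)
    return dot / (norm_d * norm_q)

docs = {"Doc1": (1, 1, 0), "Doc2": (0, 1, 0), "Doc3": (0, 0.5, 0)}
q4 = (0.5, 0.5, 0.5)

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(docs[name], q4), reverse=True)
for name in ranking:
    print(name, round(cosine_similarity(docs[name], q4), 4))
```

Doc2 and Doc3 get the same score because they point in the same direction; the cosine measure ignores vector length.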

19 Vector model
How to define weights? tf-idf
$d_j = (w_{1,j}, \ldots, w_{t,j})$
$w_{i,j}$ = weight for term $k_i$ in document $d_j$ = $f_{i,j} \times idf_i$

20 Vector model
How to define weights? tf-idf
term frequency $freq_{i,j}$: how often does term $k_i$ occur in document $d_j$?
normalized term frequency: $f_{i,j} = \dfrac{freq_{i,j}}{\max_l freq_{l,j}}$

21 Vector model
How to define weights? tf-idf
document frequency: in how many documents does term $k_i$ occur?
- $N$ = total number of documents
- $n_i$ = number of documents in which $k_i$ occurs
inverse document frequency: $idf_i = \log(N / n_i)$

22 Vector model
How to define weights for the query? Recommendation:
$q = (w_{1,q}, \ldots, w_{t,q})$
$w_{i,q}$ = weight for term $k_i$ in $q$ = $f_{i,q} \times idf_i$
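The tf-idf definitions for documents and queries can be put together in a short sketch. The three-document collection is made up for illustration, the natural logarithm is an assumption (the slides do not fix a base), and the query term frequency is normalized the same way as the document term frequency, which is also an assumption.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: dict name -> list of terms. Returns (weights, idf)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # n_i: number of documents containing k_i
    idf = {term: math.log(n_docs / n_i) for term, n_i in doc_freq.items()}
    weights = {}
    for name, terms in docs.items():
        freq = Counter(terms)
        max_freq = max(freq.values()) if freq else 1
        # w_ij = f_ij * idf_i with f_ij = freq_ij / max_l freq_lj
        weights[name] = {t: (f / max_freq) * idf[t] for t, f in freq.items()}
    return weights, idf

def query_weights(query_terms, idf):
    """w_iq = f_iq * idf_i; terms unknown to the collection get weight 0."""
    freq = Counter(query_terms)
    max_freq = max(freq.values()) if freq else 1
    return {t: (f / max_freq) * idf.get(t, 0.0) for t, f in freq.items()}

# Hypothetical toy collection.
docs = {
    "Doc1": ["cloning", "cloning", "adrenergic"],
    "Doc2": ["cloning"],
    "Doc3": ["receptor", "adrenergic"],
}
weights, idf = tf_idf_weights(docs)
q_weights = query_weights(["adrenergic", "receptor"], idf)
print(weights)
print(q_weights)
```

In this collection "receptor" occurs in only one document, so it receives a higher idf and therefore a higher weight than "adrenergic" or "cloning".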

23 Vector model
Advantages
- term weighting improves retrieval performance
- partial matching
- ranking according to similarity
Disadvantage
- assumption of mutually independent terms?

24 Probabilistic model
- weights are binary ($w_{i,j} = 0$ or $w_{i,j} = 1$)
- $R$: the set of relevant documents for query $q$
- $Rc$: the set of non-relevant documents for $q$
- $P(R|d_j)$: probability that $d_j$ is relevant to $q$
- $P(Rc|d_j)$: probability that $d_j$ is not relevant to $q$
$\mathrm{sim}(d_j,q) = \dfrac{P(R|d_j)}{P(Rc|d_j)}$

25 Probabilistic model
$\mathrm{sim}(d_j,q) = \dfrac{P(R|d_j)}{P(Rc|d_j)}$
(Bayes' rule, independence of index terms, take logarithms, $P(k_i|R) + P(\text{not } k_i|R) = 1$)
--> $\mathrm{sim}(d_j,q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log\dfrac{P(k_i|R)}{1 - P(k_i|R)} + \log\dfrac{1 - P(k_i|Rc)}{P(k_i|Rc)} \right)$

26 Probabilistic model
How to compute $P(k_i|R)$ and $P(k_i|Rc)$?
- initially: $P(k_i|R) = 0.5$ and $P(k_i|Rc) = n_i / N$
- repeat: retrieve documents and rank them
  - $V$: subset of documents (e.g. the r best ranked)
  - $V_i$: subset of $V$ whose elements contain $k_i$
  - $P(k_i|R) = |V_i| / |V|$ and $P(k_i|Rc) = (n_i - |V_i|) / (N - |V|)$
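The estimate-and-rerank loop can be sketched as below. Two things here are assumptions that are not on the slides: a 0.5 adjustment is added to the re-estimated probabilities so that the logarithms stay defined when $|V_i|$ is 0 or $|V|$, and the five-document collection is made up for the example. Weights are binary, so a term contributes only when it occurs in both the query and the document.

```python
import math

def term_weight(p_rel, p_nonrel):
    """log(P(k|R) / (1 - P(k|R))) + log((1 - P(k|Rc)) / P(k|Rc))."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def rank(docs, query, p_rel, p_nonrel):
    """Score each document by summing the weights of shared query terms."""
    scores = {
        name: sum(term_weight(p_rel[k], p_nonrel[k]) for k in query if k in terms)
        for name, terms in docs.items()
    }
    # Highest score first; ties broken by name for a deterministic order.
    return sorted(scores, key=lambda name: (-scores[name], name))

def probabilistic_ranking(docs, query, r=2, rounds=3):
    n_docs = len(docs)
    n = {k: sum(1 for terms in docs.values() if k in terms) for k in query}
    if any(n[k] == 0 or n[k] == n_docs for k in query):
        raise ValueError("each query term must occur in some but not all documents")
    # Initial guess: P(k|R) = 0.5 and P(k|Rc) = n_i / N.
    p_rel = {k: 0.5 for k in query}
    p_nonrel = {k: n[k] / n_docs for k in query}
    ranking = rank(docs, query, p_rel, p_nonrel)
    for _ in range(rounds):
        top = ranking[:r]  # V: the r best ranked documents
        for k in query:
            v_i = sum(1 for name in top if k in docs[name])  # |V_i|
            # Smoothed versions of |V_i|/|V| and (n_i-|V_i|)/(N-|V|) (assumption).
            p_rel[k] = (v_i + 0.5) / (len(top) + 1)
            p_nonrel[k] = (n[k] - v_i + 0.5) / (n_docs - len(top) + 1)
        ranking = rank(docs, query, p_rel, p_nonrel)
    return ranking

# Hypothetical collection: each document is the set of terms it contains.
docs = {
    "Doc1": {"adrenergic", "cloning"},
    "Doc2": {"cloning"},
    "Doc3": {"adrenergic", "receptor"},
    "Doc4": {"sheep"},
    "Doc5": {"genes"},
}
ranking = probabilistic_ranking(docs, ["adrenergic", "receptor"])
print(ranking)
```

The fixed number of rounds and the choice r=2 are arbitrary parameters of the sketch; the slides only say to repeat the retrieve-and-rank step.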

27 Probabilistic model
Advantages:
- ranking of documents with respect to probability of being relevant
Disadvantages:
- initial guess about relevance
- all weights are binary
- independence assumption?

28 IR - measures
$\text{Precision} = \dfrac{\text{number of found relevant documents}}{\text{total number of found documents}}$
$\text{Recall} = \dfrac{\text{number of found relevant documents}}{\text{total number of relevant documents}}$
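Both measures follow directly from set intersections. A minimal sketch; the document identifiers are hypothetical, and returning 0.0 when a denominator is empty is a convention chosen here.

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for a set of found and a set of relevant documents."""
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)  # number of found relevant documents
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgments.
found = {"Doc1", "Doc2", "Doc3", "Doc4"}
relevant = {"Doc1", "Doc3", "Doc5"}
precision, recall = precision_recall(found, relevant)
print(precision, recall)
```

Here two of the four found documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).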

29 Literature Baeza-Yates, R., Ribeiro-Neto, B., Modern Information Retrieval, Addison-Wesley, 1999.