1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2nd Semester 2008-2009.

2 Chapter 2– Part1 Information Retrieval Models

3 Introduction Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun). Advantages: index terms are simple, and the semantics of the documents and of the user's information need can be naturally expressed through sets of index terms.

4 IR Models Ranking algorithms are at the core of information retrieval systems: they predict which documents are relevant and which are not.

5 A taxonomy of information retrieval models [Figure] User task: Retrieval (Ad hoc, Filtering) and Browsing. Retrieval models: Classic Models (Boolean, Vector, Probabilistic); Set Theoretic (Fuzzy, Extended Boolean); Algebraic (Generalized Vector, Latent Semantic Indexing, Neural Networks); Probabilistic (Inference Network, Belief Network); Structured Models (Non-overlapping Lists, Proximal Nodes). Browsing models: Flat, Structure Guided, Hypertext.

6 Basic Concept Each document is described by a set of representative keywords called index terms. An index term is simply a word whose semantics help in remembering the document's main themes. Consider a collection of one hundred documents. A word that appears in every one of the hundred documents is useless as an index term, because it tells us nothing about which documents the user is interested in. A word that appears in just 5 documents is quite useful, because it considerably narrows down the space of documents that might be of interest to the user.

7 Retrieval Strategy An IR strategy is a technique by which a relevance measure is obtained between a query and a document. Manual systems: Boolean, Fuzzy Set. Automatic systems: Vector Space Model, Language Models, Latent Semantic Indexing.

8 Basic Concepts In the classic models, each document is described by a set of representative keywords called index terms. Index terms are mainly nouns. Index term weights are usually assumed to be mutually independent.

9 Boolean Model Binary decision criterion. Data retrieval model. A query is a Boolean expression, which can be represented as a disjunction of conjunctive vectors. Advantage: clean formalism, simplicity. Disadvantage: exact matching may lead to retrieval of too few or too many documents.

10 Boolean IR Documents are composed of terms (words, stems). Results are expressed in set-theoretic terms. [Figure: Venn diagrams over the sets of documents containing term A, term B, and term C, illustrating A AND B and (A AND B) OR C.] Pre-1970s. Dominant industrial model through 1994 (Lexis-Nexis, DIALOG).

11 Information Retrieval Models An information retrieval model is a formal framework that supports all the major phases of the information retrieval process, including: item (document) representation; user need representation; matching of needs to items; ranking of retrieved items. An information retrieval model is analogous to a database model (relational, object-oriented, semi-structured, etc.).

12 A Generic Model D: a set of document representations. Q: a set of user-need representations (queries). R: D × Q → real numbers, a function that assigns to each (document, query) pair a real number representing the ranking (relevance) of the document with respect to the query.

13 Common Preprocessing Steps Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.). Break into tokens (keywords) on whitespace. Stem tokens to "root" words (computational → compute). Remove common stopwords (e.g. a, the, it, etc.). Detect common phrases (possibly using a domain-specific dictionary). Build inverted index (keyword → list of docs containing it).
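The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, `toy_stem` is a hypothetical suffix stripper rather than a real stemmer such as Porter's, and phrase detection is omitted.

```python
import re
from collections import defaultdict

# Illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "the", "it", "of", "and", "to", "in", "is"}

def strip_markup(text):
    # Drop HTML tags, then replace anything that is not a letter or whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^A-Za-z\s]", " ", text)

def toy_stem(token):
    # Hypothetical suffix stripper; keeps at least 3 characters of the word.
    for suffix in ("ational", "ation", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Strip markup, lowercase, tokenize on whitespace, drop stopwords, stem.
    tokens = strip_markup(text).lower().split()
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    # Map each keyword to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return dict(index)

sample_docs = {
    "d1": "<p>The cats are playing.</p>",
    "d2": "A cat played 2 games",
}
sample_index = build_inverted_index(sample_docs)
for term in sorted(sample_index):
    print(term, sorted(sample_index[term]))
```

Using a set per keyword keeps each posting list free of duplicates and makes the Boolean set operations of the later slides direct to apply.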

14 Common Assumptions The three models are all based on document representations that are sets of index terms. Index terms (keywords): words (mostly nouns) extracted from a document to summarize its contents. Method: could extract all distinct words, only those that appear in a global lexicon, etc. (to be discussed in Topic 8). Weights: not all extracted index terms carry the same importance in summarizing the contents of a document. Let t be the number of terms in the entire system. Let k_i be a term and d_j a document. w_ij >= 0 is a weight associated with the pair (k_i, d_j); w_ij = 0 when k_i does not appear in d_j. Document d_j is associated with the index term vector d_j = (w_1j, w_2j, …, w_tj).

15 Common Assumptions (cont.) A naïve assumption is made that the terms in a document are mutually independent. Term independence: the appearance of one term in a document is unrelated to the appearance of another. This assumption is made for simplicity of calculations, but is often wrong. Example: the terms computer and network are not independent. If the term computer appears, the probability that the term network appears is higher than if the term computer does not appear, whereas independence requires P(network|computer) = P(network). Nonetheless, it has been demonstrated that performance is still good under this naïve assumption.

16 The Boolean Model Document: a document is a set of index terms without weights or, equivalently, with binary term weights: w_ij = 0 or w_ij = 1, i.e., a keyword is either present in or absent from a document. Query: a query is a Boolean expression of index terms, i.e., index terms connected with and, or and not, according to the usual conventions. Ranking: with each index term k_i we associate the set D_ki of the documents in which k_i appears: D_ki = {d_j | w_ij = 1}. The Boolean expression is then converted to a set-theoretic expression: each term k_i is substituted by the set D_ki, and the Boolean operators and (∧), or (∨) and not (¬) are substituted, respectively, by the set operators intersection, union and complement. The documents in the resulting set are relevant; all others are non-relevant.

17 The Boolean Model (cont.) Example: Terms: K1, …, K8. Documents: D1 = {K1, K2, K3, K4, K5}, D2 = {K1, K2, K3, K4}, D3 = {K2, K4, K6, K8}, D4 = {K1, K3, K5, K7}, D5 = {K4, K5, K6, K7, K8}, D6 = {K1, K2, K3, K4}. Query: K1 AND (K2 OR NOT K3). Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
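The example above can be checked with Python's built-in set operations. This is a minimal sketch: the helper name `containing` is an assumption made for illustration, and the complement for NOT is taken relative to the six-document collection.

```python
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)

def containing(term):
    # The set of documents in which the given term appears.
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# K1 AND (K2 OR NOT K3): AND -> intersection, OR -> union, NOT -> complement.
answer = containing("K1") & (containing("K2") | (all_docs - containing("K3")))
print(sorted(answer))  # ['D1', 'D2', 'D6']
```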

18 The Boolean Model (cont.) Popular retrieval model because: it is easy to understand for simple queries; it has a simple and clean formalism; reasonably efficient implementations are possible for normal queries. Adopted by many early commercial IR systems. A document is either relevant or non-relevant to a query (i.e., there is no "strong" or "weak" relevance). Hence, a query splits the collection into two distinct sets of documents, relevant and non-relevant, and there is no ranking from most relevant to least relevant. May lead to answers with too few or too many documents.

19 Drawbacks of the Boolean Model Retrieval is based on a binary decision criterion with no notion of partial matching. No ranking of the documents is provided (absence of a grading scale). Very rigid: AND means all; OR means any. Difficult to express complex user requests: the information need has to be translated into a Boolean expression, which most users find awkward. The Boolean queries formulated by users are most often too simplistic. As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query. Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?

20 Example Assume that we have 5 documents D1, D2, D3, D4, D5 and the following terms found in each document. D1: {play, sport, football, swimming, university, just} D2: {jordan, news, sport, university} D3: {football, yarmouk, university, sport, play} D4: {university, just, sport, jordan} D5: {play, jordan, university, football}

21 continue The following query will be used to search for relevant documents: "Yarmouk" AND ("University" OR NOT "Just"). Translate the query into disjunctive normal form (DNF), writing each conjunctive component as a binary vector over (Yarmouk, University, Just). DNF1: Yarmouk AND NOT University AND NOT Just = (100). DNF2: Yarmouk AND University AND NOT Just = (110). DNF3: Yarmouk AND University AND Just = (111). The set of vectors representing the query is Qdnf = {(111), (110), (100)}

22 continue
K1  K2  K3  K1 ∧ (K2 ∨ ¬K3)  Disjunct
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   0
1   0   0   1                K1 ∧ ¬K2 ∧ ¬K3
1   0   1   0
1   1   0   1                K1 ∧ K2 ∧ ¬K3
1   1   1   1                K1 ∧ K2 ∧ K3
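The three conjunctive components of K1 ∧ (K2 ∨ ¬K3) can also be derived mechanically by enumerating the truth table and keeping the rows where the query is true. The sketch below assumes a hand-written `query` function for this one expression; it is not a general Boolean query parser.

```python
from itertools import product

def query(k1, k2, k3):
    # K1 AND (K2 OR NOT K3), evaluated on 0/1 inputs.
    return bool(k1 and (k2 or not k3))

# Each satisfying assignment is one conjunctive component of the DNF.
dnf = [bits for bits in product((0, 1), repeat=3) if query(*bits)]
print(dnf)  # [(1, 0, 0), (1, 1, 0), (1, 1, 1)]
```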

23 continue
     play  sport  football  swimming  jordan  news  yarmouk  university  just
D1   1     1      1         1         0       0     0        1           1
D2   0     1      0         0         1       1     0        1           0
D3   1     1      1         0         0       0     1        1           0
D4   0     1      0         0         1       0     0        1           1
D5   1     0      1         0         1       0     0        1           0

24 finally To find the similarity between a document and the query, we search for a conjunctive component such that every key term in the query has the same value in one of the DNF components as it does in the document. If such a component exists, the document is relevant; otherwise it is not.

25 Query key term matrix
      Yarmouk  University  Just
DNF1  1        0           0
DNF2  1        1           0
DNF3  1        1           1

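26 The result The query returns D3 as the only relevant document: its vector over (Yarmouk, University, Just) is (110), which matches the second conjunctive component. No other document contains the term yarmouk, so none of them can match.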
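The matching step of the worked example can be reproduced as follows. This is a sketch under the assumption that each document is reduced to a 0/1 vector over the three query terms, in the order (yarmouk, university, just), and compared against the query's conjunctive components; the helper name `signature` is hypothetical.

```python
docs = {
    "D1": {"play", "sport", "football", "swimming", "university", "just"},
    "D2": {"jordan", "news", "sport", "university"},
    "D3": {"football", "yarmouk", "university", "sport", "play"},
    "D4": {"university", "just", "sport", "jordan"},
    "D5": {"play", "jordan", "university", "football"},
}
query_terms = ("yarmouk", "university", "just")
# Conjunctive components of: yarmouk AND (university OR NOT just).
q_dnf = {(1, 0, 0), (1, 1, 0), (1, 1, 1)}

def signature(terms):
    # Binary vector saying which query terms occur in the document.
    return tuple(int(term in terms) for term in query_terms)

# A document is relevant when its signature equals some component.
relevant = [doc_id for doc_id, terms in docs.items() if signature(terms) in q_dnf]
print(relevant)  # ['D3']
```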