INSTRUCTOR: DR.NICK EVANGELOPOULOS PRESENTED BY: QIUXIA WU CHAPTER 2 Information retrieval DSCI 5240.


1 Chapter 2: Information Retrieval (DSCI 5240). Instructor: Dr. Nick Evangelopoulos. Presented by: Qiuxia Wu.

2 Introduction Definition: Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

3 History of Modern IR
- For over 4000 years, humans have been designing tools to improve information storage and retrieval.
- Vannevar Bush's 1945 paper: "As We May Think"
- The first automated information retrieval systems appeared in the 1950s and 1960s.
- SMART (the System for the Mechanical Analysis and Retrieval of Text):
  - conceived at Harvard University and flourished at Cornell University
  - developed under the leadership of Gerard Salton
  - the first practical implementation of an IR system
- The basic theoretical foundations of SMART still play a major role in today's IR systems.

4 Modern Information Retrieval
- Document representation
  - using keywords
  - relative weight of keywords
- Query representation
  - keywords
  - relative importance of keywords

5 Retrieval Models
Retrieval models match a query with documents in order to:
- separate documents into relevant and non-relevant classes
- rank the documents according to their relevance

6 Retrieval Models
- Boolean model
- Vector space model
- Probabilistic models

7 Boolean Retrieval Model
- One of the simplest and most efficient retrieval mechanisms
- Based on set theory and Boolean algebra
- Uses the conventional numeric representation of false as 0 and true as 1
- Considers only the presence or absence of a term in a document
- To build it, replace every nonzero value in the term-document matrix with 1
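
The binarization step described on this slide can be sketched in a few lines of Python. This is only an illustration: the term counts and the helper name `to_boolean` are hypothetical and do not come from the slides.

```python
# Hypothetical term-document counts: each key is a term, each list position is a document.
term_doc_counts = {
    "boolean":   [2, 0, 1],
    "retrieval": [1, 3, 0],
    "vector":    [0, 4, 0],
}

def to_boolean(matrix):
    """Replace every nonzero count with 1, keeping zeros as 0."""
    return {
        term: [1 if count != 0 else 0 for count in row]
        for term, row in matrix.items()
    }

boolean_matrix = to_boolean(term_doc_counts)
print(boolean_matrix)
```

After this step the matrix records only whether a term occurs in a document, not how often.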

8 Boolean Model: Advantages
- Simplicity and efficiency of implementation
- Binary values can be stored using bits:
  - reduced storage requirements
  - retrieval using bitwise operations is efficient
- Boolean retrieval was adopted by many commercial bibliographic systems
- Boolean queries are akin to database queries
- Bibliographic systems are therefore closer to database systems than to information retrieval systems
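
To make the bit-storage and bitwise-retrieval points concrete, here is a minimal sketch that stores one integer bitmap per term and answers AND/OR queries with `&` and `|`. The toy documents and the helper names `build_bitmaps` and `matching_docs` are assumptions made for illustration, not part of the slides.

```python
# Hypothetical toy collection; none of these documents come from the slides.
docs = [
    "information retrieval with boolean queries",
    "vector space model for retrieval",
    "boolean algebra and set theory",
    "probabilistic retrieval model",
]

def build_bitmaps(docs):
    """Map each term to an int whose bit i is 1 when the term occurs in docs[i]."""
    bitmaps = {}
    for i, text in enumerate(docs):
        for term in set(text.split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << i)
    return bitmaps

def matching_docs(bits, n_docs):
    """Decode a bitmap back into a sorted list of document indices."""
    return [i for i in range(n_docs) if (bits >> i) & 1]

bitmaps = build_bitmaps(docs)
# "boolean AND retrieval": a single bitwise AND over the two term bitmaps.
both = bitmaps.get("boolean", 0) & bitmaps.get("retrieval", 0)
# "boolean OR retrieval": a single bitwise OR.
either = bitmaps.get("boolean", 0) | bitmaps.get("retrieval", 0)

print(matching_docs(both, len(docs)))
print(matching_docs(either, len(docs)))
```

In this toy collection the AND query matches a single document while the OR query matches all of them, which also illustrates how Boolean queries can return too few or too many results.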

9 Boolean Model: Disadvantages
- A document is either relevant or nonrelevant to the query; it is not possible to assign a degree of relevance
- Complicated Boolean queries are difficult for users to formulate
- Boolean queries tend to retrieve too few or too many documents:
  - K0 AND K4 retrieved only 1 out of 6 documents
  - K0 OR K4 retrieved 5 out of a possible 6 documents

10 Vector Space Model
- Both documents and queries are represented as vectors
- Each term receives a weight based on its frequency in the document
- More sophisticated weighting schemes will be studied later

11 VSM versus Boolean Model
- Queries are easier to express: users can attach relative weights to terms
- A descriptive query can be transformed into a query vector, just like a document
- Matching between a query and a document is not exact: each document is assigned a degree of similarity
- Documents are ranked by their similarity scores instead of being split into relevant/nonrelevant classes
- Users can go through the ranked list until their information needs are met
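
A minimal sketch of ranking in the vector space model follows. It assumes raw term frequency as the weight and cosine similarity as the similarity score; the slides do not name a specific similarity function, so cosine is an assumption here, as are the toy documents and the function names.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, doc_index) pairs sorted from most to least similar."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical documents and query, for illustration only.
docs = [
    "boolean retrieval model",
    "vector space model ranks documents",
    "cooking pasta at home",
]
ranking = rank("vector space retrieval", docs)
for score, idx in ranking:
    print(f"{score:.3f}  {docs[idx]}")
```

Unlike the Boolean model, every document receives a graded score, so a partially matching document still appears in the list, just lower down.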

12 Probabilistic Retrieval Model
- Spärck Jones (1976): the classical probabilistic retrieval model, also known as the binary independence retrieval model
- Formulates IR in a probabilistic framework
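
For orientation, the ranking score usually associated with the binary independence model can be written as below. The slide does not show a formula, so the notation is an assumption: $p_t$ is the probability that term $t$ occurs in a relevant document, $u_t$ the probability that it occurs in a nonrelevant document, and the sum runs over terms present in both the query $q$ and the document $d$.

```latex
\mathrm{RSV}(d, q) \;=\; \sum_{t \,\in\, q \cap d} \log \frac{p_t \,(1 - u_t)}{u_t \,(1 - p_t)}
```

Documents are then ranked by decreasing score. Treating each term's contribution as a separate summand is exactly where the term-independence assumption enters.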

13 Comments on Probabilistic Retrieval
- The probabilistic independence model is not realistic
- Two-stage retrieval is more complicated
- The performance gain over VSM is debatable

14 Evaluation of Retrieval Performance
- Precision vs. recall
- F-measure
- Average precision

15 Precision and Recall
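
The slide's formulas did not survive in this transcript. As a hedged sketch using the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved), with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs, for illustration only.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).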

16

17 F measure
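
The slide's formula is not in this transcript. A common form of the F measure combines precision and recall with a weighted harmonic mean; the sketch below assumes that form, with `beta=1` giving the balanced version. The parameter name `beta` is a conventional choice, not something taken from the slides.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (balanced when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

balanced = f_measure(0.5, 0.5)
skewed = f_measure(0.5, 1.0)
print(balanced, skewed)
```

Because it is a harmonic mean, the score stays close to the lower of the two inputs, so a system cannot score well by maximizing only precision or only recall.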

18 Average Precision
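
The slide's formula is not in this transcript. One widely used definition takes the precision at each rank where a relevant document appears and averages those values over the total number of relevant documents. The sketch below assumes that definition; some texts divide by the number of relevant documents actually retrieved instead. The document IDs are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

# Relevant documents appear at ranks 1 and 4 of the hypothetical ranking.
ap = average_precision(ranked=[3, 1, 7, 5], relevant=[3, 5])
print(ap)
```

With relevant documents at ranks 1 and 4, the precisions are 1/1 and 2/4, so the average precision is 0.75. Moving the second relevant document higher in the ranking would raise the score.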

