# Chapter 2: Information Retrieval (DSCI 5240)

Instructor: Dr. Nick Evangelopoulos. Presented by: Qiuxia Wu.


## Introduction

Definition: Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

## History of Modern IR

- For over 4000 years, humans have been designing tools to improve information storage and retrieval.
- Vannevar Bush's 1945 paper: “As We May Think”
- The first automated information retrieval systems appeared in the 1950s and 1960s.
- SMART (the System for the Mechanical Analysis and Retrieval of Text):
  - conceived at Harvard University and flourished at Cornell University
  - developed under the leadership of Gerard Salton
  - the first practical implementation of an IR system
- The basic theoretical foundations of SMART still play a major role in today’s IR systems.

## Modern Information Retrieval

- Document representation
  - Using keywords
  - Relative weight of keywords
- Query representation
  - Keywords
  - Relative importance of keywords

## Retrieval Models

Retrieval models match a query with documents in order to:

- separate documents into relevant and non-relevant classes
- rank the documents according to their relevance

## Retrieval Models

- Boolean model
- Vector space model
- Probabilistic models

## Boolean Retrieval Model

- One of the simplest and most efficient retrieval mechanisms
- Based on set theory and Boolean algebra
- Uses the conventional numeric representation of false as 0 and true as 1
- The Boolean model is interested only in the presence or absence of a term in a document
- In the term-document matrix, replace all nonzero values with 1
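The binarization step and a pair of Boolean queries can be sketched in a few lines of Python. The term counts, the term names, and the helper names (`binarize`, `boolean_and`, `boolean_or`) are all made up for illustration and do not come from the slides.

```python
# Hypothetical term-frequency matrix: each row is a term, each column a document.
# The counts are invented for illustration only.
term_doc_counts = {
    "retrieval": [3, 0, 1, 0],
    "boolean":   [0, 2, 0, 0],
    "vector":    [1, 1, 0, 4],
}

def binarize(counts):
    """Replace every nonzero count with 1 (term present); keep 0 (term absent)."""
    return {term: [1 if c else 0 for c in row] for term, row in counts.items()}

def boolean_and(matrix, a, b):
    """Indices of documents that contain both terms."""
    return [i for i, (x, y) in enumerate(zip(matrix[a], matrix[b])) if x and y]

def boolean_or(matrix, a, b):
    """Indices of documents that contain at least one of the two terms."""
    return [i for i, (x, y) in enumerate(zip(matrix[a], matrix[b])) if x or y]

incidence = binarize(term_doc_counts)
print(incidence["retrieval"])                          # [1, 0, 1, 0]
print(boolean_and(incidence, "retrieval", "vector"))   # [0]
print(boolean_or(incidence, "retrieval", "vector"))    # [0, 1, 2, 3]
```

Note that the counts 3 and 1 for "retrieval" both collapse to 1: the model keeps no information about how often a term occurs.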

## Boolean Model: Advantages

- Simplicity and efficiency of implementation
- Binary values can be stored using bits
  - reduced storage requirements
  - retrieval using bitwise operations is efficient
- Boolean retrieval was adopted by many commercial bibliographic systems
- Boolean queries are akin to database queries
- Bibliographic systems behave more like database systems than like information retrieval systems
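To illustrate the storage and bitwise-operation points, here is a minimal sketch that packs each term's document list into a single Python integer. The postings and the helper names (`to_bitmask`, `from_bitmask`) are hypothetical, and a production system would likely use a dedicated bitmap structure instead.

```python
# Hypothetical postings: which of 6 documents (numbered 0..5) contain each term.
postings = {
    "term_a": [0, 2, 3],
    "term_b": [2, 4, 5],
}

def to_bitmask(doc_ids):
    """Pack document ids into one integer; bit i is set if document i has the term."""
    mask = 0
    for doc_id in doc_ids:
        mask |= 1 << doc_id
    return mask

def from_bitmask(mask):
    """Unpack an integer bitmask into a sorted list of document ids."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

bits = {term: to_bitmask(ids) for term, ids in postings.items()}

both = bits["term_a"] & bits["term_b"]    # Boolean AND as one bitwise operation
either = bits["term_a"] | bits["term_b"]  # Boolean OR as one bitwise operation

print(from_bitmask(both))    # [2]
print(from_bitmask(either))  # [0, 2, 3, 4, 5]
```

Each term costs one bit per document, and a two-term query is a single integer operation regardless of how the original counts looked.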

## Boolean Model: Disadvantages

- A document is either relevant or non-relevant to the query
- It is not possible to assign a degree of relevance
- Complicated Boolean queries are difficult for users to formulate
- Boolean queries tend to retrieve too few or too many documents
  - K0 AND K4 retrieved only 1 out of 6 documents
  - K0 OR K4 retrieved 5 out of a possible 6 documents

## Vector Space Model

- Both documents and queries are represented as vectors
- Each term receives a weight based on its frequency in the document
- More sophisticated weighting schemes will be studied later
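A minimal sketch of frequency-based vectors, assuming raw term counts as the weight and whitespace tokenization. The vocabulary, the sample texts, and the `tf_vector` helper are hypothetical and chosen only for illustration.

```python
from collections import Counter

def tf_vector(text, vocabulary):
    """Return raw term counts for `text`, one entry per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and texts.
vocabulary = ["information", "retrieval", "boolean", "vector"]
document = "Information retrieval systems rank retrieval results"
query = "vector retrieval"

print(tf_vector(document, vocabulary))  # [1, 2, 0, 0]
print(tf_vector(query, vocabulary))     # [0, 1, 0, 1]
```

The query is turned into a vector by exactly the same function as the document, which is what makes the two directly comparable.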

## VSM versus Boolean Model

- Queries are easier to express: users can attach relative weights to terms
- A descriptive query can be transformed into a query vector, in the same way as a document
- Matching between a query and a document is not exact: each document is assigned a degree of similarity to the query
- Documents are ranked by their similarity scores instead of being split into relevant and non-relevant classes
- Users can go through the ranked list until their information needs are met
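Ranking by similarity can be sketched as follows. The slides do not name a similarity function, so cosine similarity is used here as an assumption (it is a common choice for the vector space model). The document vectors and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return (document name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical weighted vectors over a three-term vocabulary.
doc_vecs = {"d1": [2, 0, 1], "d2": [0, 3, 0], "d3": [1, 1, 1]}
query_vec = [1, 0, 1]

ranking = rank(query_vec, doc_vecs)
for name, score in ranking:
    print(name, round(score, 3))
```

Here `d2` shares no terms with the query and scores 0, yet it still appears at the bottom of the list instead of being excluded outright, which is the practical difference from a Boolean match.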

## Probabilistic Retrieval Model

- Sparck Jones (1976): the classical probabilistic retrieval model, also known as the binary independence retrieval model
- Formulates IR in a probabilistic framework
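The slides do not show a formula, so the following is only a sketch of one common formulation of binary independence scoring: each query term gets a smoothed log-odds weight estimated from relevance counts, and a document's score is the sum of the weights of the query terms it contains. The 0.5 smoothing constants, the counts, and the function names are assumptions made for illustration.

```python
import math

def term_weight(N, R, n, r):
    """Smoothed log-odds weight for one term.

    N: total documents, R: known relevant documents,
    n: documents containing the term, r: relevant documents containing the term.
    """
    p = (r + 0.5) / (R + 1.0)          # estimated P(term present | relevant)
    q = (n - r + 0.5) / (N - R + 1.0)  # estimated P(term present | non-relevant)
    return math.log(p * (1 - q) / (q * (1 - p)))

def bim_score(doc_terms, query_terms, weights):
    """Sum the weights of query terms that occur in the document (presence only)."""
    return sum(weights[t] for t in query_terms if t in doc_terms)

# Hypothetical collection statistics: 10 documents, 4 judged relevant.
weights = {
    "a": term_weight(N=10, R=4, n=5, r=3),  # mostly in relevant documents
    "b": term_weight(N=10, R=4, n=5, r=1),  # mostly in non-relevant documents
}

score_1 = bim_score({"a", "x"}, ["a", "b"], weights)
score_2 = bim_score({"b", "y"}, ["a", "b"], weights)
print(round(score_1, 3), round(score_2, 3))
```

Because terms are scored independently and only by presence, the document containing the term concentrated in relevant documents ranks higher.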

## Comments on Probabilistic Retrieval

- The probabilistic independence assumption is not realistic
- Two-stage retrieval is more complicated
- The performance gain over the VSM is debatable

## Evaluation of Retrieval Performance

- Precision vs. recall
- F-measure
- Average precision

## Precision and Recall
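The transcript does not preserve this slide's formulas, so the sketch below uses the usual textbook definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document ids and the `precision_recall` helper are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids against the relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(p, round(r, 3))  # 0.5 0.667
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.667).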

## F-measure
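The slide's formula is not preserved in the transcript. A minimal sketch, assuming the standard weighted harmonic mean of precision and recall, where the `beta` parameter (a name chosen here) sets how much more weight recall gets than precision and `beta = 1` gives the balanced F1 score:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; returns 0.0 if both are zero."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

f1 = f_measure(0.5, 1.0)            # balanced: 2 * 0.5 * 1.0 / (0.5 + 1.0)
f2 = f_measure(0.5, 1.0, beta=2.0)  # recall weighted more heavily
print(round(f1, 3), round(f2, 3))   # 0.667 0.833
```

Because it is a harmonic mean, the score stays low unless both precision and recall are reasonably high.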

## Average Precision
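The transcript does not include this slide's definition. The sketch below assumes the common one: walk down the ranked list, record the precision at each position where a relevant document appears, and divide the sum by the total number of relevant documents. The ranking, the relevance judgments, and the function name are hypothetical.

```python
def average_precision(ranked, relevant):
    """Mean of precision-at-rank over the positions of relevant documents.

    Relevant documents that never appear in `ranked` contribute zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position  # precision at this cut-off
    return total / len(relevant)

# Hypothetical ranked list with relevant documents at ranks 1, 3, and 5.
ap = average_precision(["d1", "d2", "d3", "d4", "d5"], ["d1", "d3", "d5"])
print(round(ap, 3))  # 0.756
```

Unlike a single precision or recall figure, this measure rewards placing relevant documents near the top of the ranking.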
