CS344: Introduction to Artificial Intelligence Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 32-33: Information Retrieval: Basic concepts and Model.


The elusive user satisfaction
Factors that determine it: ranking, correctness of query processing, coverage, NER (named entity recognition), stemming, MWE (multi-word expression handling), crawling, indexing.

What happens in IR
[Diagram: a query q_1 q_2 … q_n typed into the search box (q_i are query terms) is looked up in an index table I_1, I_2, …, I_k, which points into the document collection D_1, D_2, …, D_m; the system returns a ranked list.]
Note: a high-ranked relevant document means the user's information need is getting satisfied!

[Diagram: the user enters a query (e.g., "database, algorithm, data structure") into the search box; the system consults the index table / documents; relevance feedback flows back from the user.]

How to check quality of retrieval (P, R, F)
Three parameters, where A is the set of actually relevant documents and O is the set of obtained (retrieved) documents:
Precision P = |A ∩ O| / |O|
Recall R = |A ∩ O| / |A|
F-score = 2PR / (P + R), the harmonic mean of P and R
[Venn diagram: Actual (A), Obtained (O), and their intersection A ∩ O.]
The above formulas are very general. They do not yet account for the fact that the retrieved documents are ranked; for ranked retrieval the expressions need to be modified.
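The three measures can be computed directly from the two sets; a minimal Python sketch (function and variable names are my own, not from the lecture):

```python
def prf(actual, obtained):
    """Compute precision, recall, and F-score for one retrieval run.

    actual   -- set A of truly relevant document ids
    obtained -- set O of document ids returned by the system
    """
    hits = len(actual & obtained)                    # |A intersect O|
    p = hits / len(obtained) if obtained else 0.0    # P = |A ∩ O| / |O|
    r = hits / len(actual) if actual else 0.0        # R = |A ∩ O| / |A|
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0    # harmonic mean of P and R
    return p, r, f

# Toy example: 4 relevant documents, 3 retrieved, 2 of them relevant.
p, r, f = prf({1, 2, 3, 4}, {2, 3, 5})
# P = 2/3, R = 2/4, F = 2 * (2/3) * (1/2) / ((2/3) + (1/2)) = 4/7
```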

P, R, F (contd.)
Precision is easy to calculate; recall is not, because it requires knowing the complete set of relevant documents. In practice recall is computed against a known set of (query, document) pairs with relevance judgements in {0, 1}, obtained by human evaluation.

Relation between P & R
[Graph: precision P plotted against recall R, showing a downward-sloping trade-off curve.]
P is inversely related to R (unless additional knowledge is given).

Precision at rank k
Choose the top k documents in the ranked list D_1, D_2, …, D_k, …, D_m and see how many of them are relevant.
P_k = (# of relevant documents in the top k) / k
Average precision for a query averages P_k over the ranks k at which relevant documents appear; Mean Average Precision (MAP) is the mean of average precision over all queries.
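These two ranked-retrieval measures can be sketched in Python as follows (the names and the toy ranking are illustrative, not from the lecture):

```python
def precision_at_k(ranked, relevant, k):
    """P_k: fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of P_k over the ranks where relevant documents appear,
    divided by the total number of relevant documents (missed relevant
    documents therefore contribute zero)."""
    hits = [precision_at_k(ranked, relevant, k + 1)
            for k, d in enumerate(ranked) if d in relevant]
    return sum(hits) / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3", "d4"]       # system output, best first
relevant = {"d1", "d3"}                 # human relevance judgements
ap = average_precision(ranked, relevant)   # (P_1 + P_3) / 2 = (1 + 2/3) / 2 = 5/6
```

MAP is then simply the mean of `average_precision` over all queries in the test set.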

Sample Exercise
D1: Delhi is the capital of India. It is a large city.
D2: Mumbai, however, is the commercial capital with million dollars inflow & outflow.
D3: There is rivalry for supremacy between the two cities.
The words in red constitute the useful (content) words of each sentence. The other words (those in black) are very common and thus do not add to the information content of the sentence. Vocabulary: the unique red words, 11 in number; each document is represented by an 11-tuple vector, each component 1 or 0.
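A rough Python sketch of this binary representation. The red/black colouring is lost in the transcript, so the stop-word list below is an assumption and may yield a vocabulary size different from the slide's 11:

```python
# The documents from the exercise, lower-cased and stripped of punctuation.
docs = {
    "D1": "delhi is the capital of india it is a large city",
    "D2": "mumbai however is the commercial capital with million dollars inflow outflow",
    "D3": "there is rivalry for supremacy between the two cities",
}

# Hypothetical stop-word list standing in for the slide's "black" words.
stop = {"is", "the", "of", "it", "a", "however", "with",
        "there", "for", "between", "two"}

# Vocabulary = unique content words; each document becomes a binary tuple.
vocab = sorted({w for text in docs.values() for w in text.split()} - stop)
vectors = {d: tuple(1 if w in text.split() else 0 for w in vocab)
           for d, text in docs.items()}
```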

IR Basics
(mainly from R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Wokingham, UK, and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press)

Definition of IR Model
An IR model is a quadruple [D, Q, F, R(q_i, d_j)], where:
D: a set of documents
Q: a set of queries
F: a framework for modeling documents, queries, and their relationships
R(·,·): a ranking function returning a real number expressing the relevance of document d_j to query q_i

Index Terms
Keywords representing a document. The semantics of these words helps recall the main theme of the document. Index terms are generally nouns. Numerical weights are assigned to index terms to indicate their importance.

Introduction
[Diagram: documents are indexed into index terms, yielding a doc representation; the user's information need is expressed as a query; ranking matches the query representation against the document representations.]

Classic IR Models - Basic Concepts
The importance of the index terms is represented by weights associated with them. Let:
– t be the number of index terms in the system
– K = {k_1, k_2, ..., k_t} be the set of all index terms
– k_i be an index term
– d_j be a document
– w_ij be the weight associated with the pair (k_i, d_j)
– w_ij = 0 indicate that the term does not belong to the document
– vec(d_j) = (w_1j, w_2j, ..., w_tj) be the weighted vector associated with document d_j
– g_i(vec(d_j)) = w_ij be the function which returns the weight associated with the pair (k_i, d_j)

The Boolean Model
Simple model based on set theory. Only AND, OR and NOT are used. Queries are specified as Boolean expressions:
– precise semantics
– neat formalism
– q = k_a ∧ (k_b ∨ ¬k_c)
Terms are either present or absent; thus w_ij ∈ {0, 1}.
Consider q = k_a ∧ (k_b ∨ ¬k_c). In disjunctive normal form over (k_a, k_b, k_c):
– vec(q_dnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
– vec(q_cc) = (1,1,0) is a conjunctive component

The Boolean Model
q = k_a ∧ (k_b ∨ ¬k_c)
sim(q, d_j) = 1 if ∃ vec(q_cc) such that (vec(q_cc) ∈ vec(q_dnf)) ∧ (∀ k_i, g_i(vec(d_j)) = g_i(vec(q_cc)))
sim(q, d_j) = 0 otherwise
[Venn diagram over terms k_a, k_b, k_c marking the conjunctive components (1,1,1), (1,1,0), (1,0,0).]
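This matching rule can be sketched in Python by representing each document as a binary vector over (k_a, k_b, k_c) and the query by its DNF components; the documents below are illustrative, not from the lecture:

```python
# DNF of q = k_a AND (k_b OR NOT k_c), over the term order (ka, kb, kc).
DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def vec(doc_terms):
    """Binary vector of g_i(vec(d_j)) values for the three index terms."""
    return tuple(1 if k in doc_terms else 0 for k in ("ka", "kb", "kc"))

def sim(doc_terms):
    """1 iff the document's vector equals some conjunctive component of q."""
    return 1 if vec(doc_terms) in DNF else 0

docs = {"d1": {"ka", "kb"}, "d2": {"ka", "kc"}, "d3": {"kb", "kc"}}
matches = [d for d, terms in docs.items() if sim(terms) == 1]
# d1 -> (1,1,0) is in the DNF; d2 -> (1,0,1) and d3 -> (0,1,1) are not.
```

Note the all-or-nothing character: a document either satisfies the expression (sim = 1) or it does not (sim = 0), which is exactly the drawback discussed next.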

Drawbacks of the Boolean Model Retrieval based on binary decision criteria with no notion of partial matching No ranking of the documents is provided (absence of a grading scale) Information need has to be translated into a Boolean expression which most users find awkward The Boolean queries formulated by the users are most often too simplistic As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

The Vector Model Use of binary weights is too limiting Non-binary weights provide consideration for partial matches These term weights are used to compute a degree of similarity between a query and each document Ranked set of documents provides for better matching

The Vector Model
Define:
– w_ij > 0 whenever k_i ∈ d_j
– w_iq >= 0 associated with the pair (k_i, q)
– vec(d_j) = (w_1j, w_2j, ..., w_tj) and vec(q) = (w_1q, w_2q, ..., w_tq)
In this space, queries and documents are represented as weighted vectors.

The Vector Model
sim(q, d_j) = cos(θ) = [vec(d_j) · vec(q)] / (|d_j| * |q|) = [Σ_i w_ij * w_iq] / (|d_j| * |q|)
Since w_ij >= 0 and w_iq >= 0, we have 0 <= sim(q, d_j) <= 1.
A document is retrieved even if it matches the query terms only partially.
[Diagram: vectors d_j and q in term space with angle θ between them.]
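The cosine formula above can be sketched directly in Python (the vectors are illustrative):

```python
import math

def cosine(d, q):
    """sim(q, d) = (d . q) / (|d| * |q|); defined as 0 for a zero vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

# Query shares only the first of three index terms with the document,
# yet the similarity is nonzero: a partial match still scores.
s = cosine([1.0, 0.5, 0.0], [1.0, 0.0, 1.0])   # 1 / sqrt(1.25 * 2) ≈ 0.632
```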

The Vector Model
sim(q, d_j) = [Σ_i w_ij * w_iq] / (|d_j| * |q|)
How to compute the weights w_ij and w_iq? A good weight must take into account two effects:
– quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
– quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
– w_ij = tf(i,j) * idf(i)

The Vector Model
Let:
– N be the total number of docs in the collection
– n_i be the number of docs which contain k_i
– freq(i,j) be the raw frequency of k_i within d_j
A normalized tf factor is given by
– f(i,j) = freq(i,j) / max_l(freq(l,j))
where the maximum is computed over all terms which occur within the document d_j.
The idf factor is computed as
– idf(i) = log(N / n_i)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.

The Vector Model
The best term-weighting schemes use weights given by
– w_ij = f(i,j) * log(N / n_i)
This strategy is called a tf-idf weighting scheme.
For the query term weights, a suggestion is
– w_iq = (0.5 + [0.5 * freq(i,q) / max_l(freq(l,q))]) * log(N / n_i)
The vector model with tf-idf weights is a good ranking strategy for general collections. It is usually as good as the known ranking alternatives, and it is simple and fast to compute.
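The document weighting w_ij = f(i,j) * log(N / n_i) can be sketched in Python as follows (the toy collection is illustrative; note how a term occurring in every document gets idf = 0 and so carries no weight):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = f(i,j) * log(N / n_i), with the max-normalized tf factor
    f(i,j) = freq(i,j) / max_l freq(l,j) from the slides."""
    N = len(docs)
    df = Counter()                       # n_i: number of docs containing k_i
    for terms in docs:
        df.update(set(terms))
    weights = []
    for terms in docs:
        freq = Counter(terms)            # freq(i,j): raw term frequencies
        max_f = max(freq.values())
        weights.append({t: (f / max_f) * math.log(N / df[t])
                        for t, f in freq.items()})
    return weights

w = tfidf_weights([["a", "a", "b"], ["a", "c"], ["a", "c", "c"]])
# "a" occurs in all 3 docs, so idf = log(3/3) = 0 and its weight vanishes;
# "b" in doc 0 has tf = 1/2 and idf = log 3, giving w ≈ 0.549.
```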

The Vector Model
Advantages:
– term-weighting improves quality of the answer set
– partial matching allows retrieval of docs that approximate the query conditions
– the cosine ranking formula sorts documents according to degree of similarity to the query
Disadvantages:
– assumes independence of index terms; not clear that this is bad, though

The Vector Model: Example I
[Figure: documents d1–d7 plotted in the space of index terms k1, k2, k3, with their term weights.]

The Vector Model: Example II
[Figure: documents d1–d7 in the space of index terms k1, k2, k3, with a different weight assignment.]

The Vector Model: Example III
[Figure: documents d1–d7 in the space of index terms k1, k2, k3, with a third weight assignment.]