1 Lecture 1: Overview of IR Maya Ramanath

2 Who hasn’t used Google? Why did Google return these results first? Can we improve on it? Is this a good result for the query “maya ramanath”? OR: How good is Google?

3 Lectures – Overview (this lecture) – Retrieval Models – Retrieval Evaluation – Why DB and IR?

4 Information Retrieval “An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.” “Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).”

5 Basic Terms
Document: A sequence/set of terms, expressing ideas about one or more topics, usually in natural language
Corpus/Collection: A set of documents
Information need: Corresponds to an innate idea of information/knowledge that the user is currently looking for
Term/Keyword/Phrase: A semantic unit: a word, a phrase, or potentially the root of a word
Query: The expression of the information need by the user
Relevance: A measure of how well the retrieved documents satisfy the user’s information need

6 What is a retrieval system? Source: Hiemstra, D. (2009) Information Retrieval Models, in Information Retrieval: Searching in the 21st Century (eds A. Göker and J. Davies), John Wiley & Sons, Ltd, Chichester, UK.

7 Retrieval Models Source and Further Reading: Hiemstra, D. (2009) Information Retrieval Models, in Information Retrieval: Searching in the 21st Century (eds A. Göker and J. Davies), John Wiley & Sons, Ltd, Chichester, UK.

8 Two kinds of models – No ranking: Boolean models, Region models – Ranking: Vector space model, Probabilistic models, Language models

9 Boolean Model Based on set theory Simple query language Ex: information AND (retrieval OR management) (Venn diagram of the document sets for “information”, “retrieval”, and “management”)
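
A minimal sketch of how the example query above can be evaluated with plain set operations over a toy inverted index (the index contents below are made-up illustration data, not from the lecture):

from pprint import pprint

# Toy inverted index: term -> set of document ids (hypothetical data)
index = {
    "information": {1, 2, 4},
    "retrieval":   {2, 3, 4},
    "management":  {4, 5},
}

# Evaluate: information AND (retrieval OR management)
result = index["information"] & (index["retrieval"] | index["management"])
pprint(result)  # {2, 4}: documents containing "information" and at least one of the other two terms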

10 Vector Space Model (1/2) Based on the notion of “similarity” between query and document – Query is the representation of the document that you want to retrieve – Compare similarity between query and document Luhn’s formulation: The more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information.

11 Vector Space Model (2/2) (Slide shows the vector representations of a document and a query, and their similarity) We will study more in the next lecture
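
A standard way to write the vector space model, using tf-idf weights and cosine similarity (one common choice; the exact weighting used in the lecture may differ):

\[
\vec{d} = (w_{1,d},\dots,w_{t,d}), \qquad \vec{q} = (w_{1,q},\dots,w_{t,q}), \qquad w_{i,d} = \mathrm{tf}_{i,d}\cdot\log\frac{N}{\mathrm{df}_i}
\]
\[
\mathrm{sim}(\vec{d},\vec{q}) = \cos(\vec{d},\vec{q}) = \frac{\vec{d}\cdot\vec{q}}{\lVert\vec{d}\rVert\,\lVert\vec{q}\rVert} = \frac{\sum_{i=1}^{t} w_{i,d}\,w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,d}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
\]

Here N is the number of documents in the collection and df_i the number of documents containing term i.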

12 Probabilistic IR (1/2) Based on probability theory – Specifically, we would like to estimate the probability of relevance The Probability Ranking Principle If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

13 Probabilistic IR (2/2) Ranking of documents based on the odds of relevance We will study more in the next lecture
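
In the standard probabilistic formulation (one common way to write it; the lecture’s notation may differ), the odds of relevance for a document d and query q are

\[
O(R \mid d, q) \;=\; \frac{P(R = 1 \mid d, q)}{P(R = 0 \mid d, q)},
\]

and ranking documents by decreasing odds is rank-equivalent to ranking by the probability of relevance itself.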

14 Language Models (1/3) Based on generative models for documents and queries Documents, Query: Samples of an underlying probabilistic process Estimate the parameters of this process Measure how close the distributions are (KL-divergence) – “Closeness” gives a measure of relevance
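
For reference, the KL-divergence mentioned above, in a standard formulation with a query model θ_q and a document model θ_d over vocabulary V (details may differ from the lecture):

\[
D_{\mathrm{KL}}(\theta_q \,\|\, \theta_d) \;=\; \sum_{w \in V} P(w \mid \theta_q)\,\log\frac{P(w \mid \theta_q)}{P(w \mid \theta_d)}
\]

A smaller divergence means the document’s distribution is closer to the query’s, i.e. the document is considered more relevant.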

15 Language Models (2/3) (Diagram: documents d1 and d2 and a query q, each viewed as a sample from an underlying distribution)

16 Language Models (3/3) The Maximum Likelihood Estimator + smoothing We will study more in the next lecture
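
A minimal sketch of the maximum likelihood estimate combined with smoothing, here Jelinek-Mercer interpolation with a collection model (one common smoothing choice; the lecture may use another, and the documents, query, and λ value below are made-up examples):

from collections import Counter

# Hypothetical toy collection of two documents
docs = {
    "d1": "information retrieval models and ranking".split(),
    "d2": "database management and query processing".split(),
}
collection = [w for d in docs.values() for w in d]
coll_counts, coll_len = Counter(collection), len(collection)

def p_smoothed(word, doc, lam=0.5):
    # Maximum likelihood estimate from the document, mixed with the collection model
    p_ml_doc = Counter(doc)[word] / len(doc)
    p_coll = coll_counts[word] / coll_len
    return (1 - lam) * p_ml_doc + lam * p_coll

def query_likelihood(query, doc):
    # Probability that the document's (smoothed) language model generates the query
    score = 1.0
    for w in query.split():
        score *= p_smoothed(w, doc)
    return score

for name, d in docs.items():
    print(name, query_likelihood("information retrieval", d))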

17 Evaluation (Which system is best?)

18 Benchmarking IR Systems (1/2) Why do we need to benchmark? To benchmark an IR system – Efficiency – Quality: results, power of interface, ease of use, etc.

19 Benchmarking IR Systems (2/2) Result Quality Data Collection – Ex: Archives of the NYTimes Query set – Provided by experts, identified from real search logs, etc. Relevance judgements – For a given query, is the document relevant?

20 Precision, Recall, F-Measure Precision; Recall; F-Measure: weighted harmonic mean of Precision and Recall (see the definitions below)
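
The standard definitions, writing R for the set of relevant documents and A for the set of retrieved documents (β weights recall against precision; β = 1 gives the usual F1):

\[
\mathrm{Precision} = \frac{|R \cap A|}{|A|}, \qquad \mathrm{Recall} = \frac{|R \cap A|}{|R|}, \qquad F_{\beta} = \frac{(1+\beta^{2})\,\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^{2}\,\mathrm{Precision} + \mathrm{Recall}}
\]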

21 That’s it for today!

