Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)

Slides:



Advertisements
Similar presentations
Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
Advertisements

Basic IR: Modeling Basic IR Task: Slightly more complex:
Beyond Boolean Queries Ranked retrieval  Thus far, our queries have all been Boolean.  Documents either match or don’t.  Good for expert users with.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 1: Boolean Retrieval 1.
Adapted from Information Retrieval and Web Search
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Boolean Retrieval Lecture 2: Boolean Retrieval Web Search and Mining.
CpSc 881: Information Retrieval
Information Retrieval
Modern Information Retrieval
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Hinrich Schütze and Christina Lioma
CS/Info 430: Information Retrieval
Information Retrieval using the Boolean Model. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all.
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 1: Introduction to IR.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
WXGB6106 INFORMATION RETRIEVAL Week 3 RETRIEVAL EVALUATION.
Information Retrieval
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
| 1 › Gertjan van Noord – based on the sheets by Leonoor van der Beek2013 Information Retrieval Lecture 1: introduction.
LIS618 lecture 2 the Boolean model Thomas Krichel
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Modern Information Retrieval Lecture 3: Boolean Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval COMP4201 Information Retrieval and Search Engines Lecture 1 Boolean retrieval.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Introduction to Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval.
University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.
IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
CES 514 – Data Mining Lec 2, Feb 10 Spring 2010 Sonoma State University.
Information Retrieval and Web Search
Introduction to Information Retrieval CSE 538 MRS BOOK – CHAPTER I Boolean Model 1.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
1. L01: Corpuses, Terms and Search Basic terminology The need for unstructured text search Boolean Retrieval Model Algorithms for compressing data Algorithms.
1 Information Retrieval LECTURE 1 : Introduction.
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Introduction to Information Retrieval Boolean Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 1: Boolean retrieval.
Information Retrieval and Web Search Boolean retrieval Instructor: Rada Mihalcea (Note: some of the slides in this set have been adapted from a course.
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
Homework #1 J. H. Wang Oct. 24, 2011.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Module 2: Boolean retrieval. Introduction to Information Retrieval Information Retrieval  Information Retrieval (IR) is finding material (usually documents)
CS315 Introduction to Information Retrieval Boolean Search 1.
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
Web-based Information Architecture 01: Boolean Retrieval Hongfei Yan School of EECS, Peking University 2/27/2013.
Take-away Administrativa
Large Scale Search: Inverted Index, etc.
Lecture 1: Introduction and the Boolean Model Information Retrieval
7CCSMWAL Algorithmic Issues in the WWW
Slides from Book: Christopher D
Text Based Information Retrieval
Information Retrieval and Web Search
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Boolean Retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Basic Information Retrieval
Information Retrieval and Web Search Lecture 1: Boolean retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
PageRank GROUP 4.
Boolean Retrieval.
CS276 Information Retrieval and Web Search
Information Retrieval and Web Design
Presentation transcript:

Information retrieval 1 Boolean retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Term Incidence matrix

Incidence Matrix Antony Julius The Hamlet Othello Macbeth... and Caesar Tempest Cleopatra Antony Brutus Caesar Calpurnia Cleopatra mercy worser

Boolean Retrieval Model Query :Brutus AND Caesar AND NOT Calpurnia AND AND = The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words. By documents we mean whatever units we have decided to build a retrieval system over.

AD HOC Retrieval the group of documents over which we perform retrieval as the collection. Or CORPUS AD HOC RETRIEVAL :a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user- initiated query. information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need.

EFFECTIVENESS Relevance: A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Precision: What fraction of the returned results are relevant to the information need? Recall: What fraction of the relevant documents in the collection were returned by the system?

INVERTED INDEX an INVERTED INDEX: index always maps back from terms to the parts of a document where they occur. dictionary of terms (sometimes also referred to as a vocabulary or lexicon; we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. the positions in the document) – is conventionally called a posting The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings

Building an inverted index 1. Collect the documents to be indexed 2. Tokenize the text, turning each document into a list of tokens 3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms 4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list).

Exercise 1.1 Draw the inverted index that would be built for the following document collection.  Doc 1 new home sales top forecasts  Doc 2 home sales rise in july  Doc 3 increase in home sales in july  Doc 4 july new home sales rise Exercise 1.2 Consider these documents:  Doc 1 breakthrough drug for schizophrenia  Doc 2 new schizophrenia drug  Doc 3 new approach for treatment of schizophrenia  Doc 4 new hopes for schizophrenia patients

Processing Boolean queries Brutus AND Calpurnia 1. Locate Brutus in the Dictionary 2. Retrieve its postings 3. Locate Calpurnia in the Dictionary 4. Retrieve its postings 5. Intersect the two postings lists

Query optimization Query optimization: is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.

FREE TEXT QUERIES free text queries: that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Extended Boolean retrieval models proximity operator: is a way of specifying that two terms in a query close to each other

Ad hoc search over unstructured documents 1- retrieval that is tolerant to spelling mistakes and inconsistent choice of words. 2- search for compounds or phrases that denote a concept (index has to be augmented) 3- to accumulate evidence, giving more weight to documents that have a term several times. term frequency :the number of times a term occurs in a document. 4- to order (or “rank”) the returned results.