Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 1: Boolean Retrieval 1.
Advertisements

CS276A Text Retrieval and Mining Lecture 1. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all.
Adapted from Information Retrieval and Web Search
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Boolean Retrieval Lecture 2: Boolean Retrieval Web Search and Mining.
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
CpSc 881: Information Retrieval
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
Information Retrieval
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Information Retrieval using the Boolean Model. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Indexing and Complexity. Agenda Inverted indexes Computational complexity.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
LIS618 lecture 2 the Boolean model Thomas Krichel
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Modern Information Retrieval Lecture 3: Boolean Retrieval.
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
Introduction to Information Retrieval Introduction to Information Retrieval COMP4201 Information Retrieval and Search Engines Lecture 1 Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval.
IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall.
Text Retrieval and Text Databases Based on Christopher and Raghavan’s slides.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
CES 514 – Data Mining Lec 2, Feb 10 Spring 2010 Sonoma State University.
Information Retrieval and Web Search
1 CS276 Information Retrieval and Web Search Lecture 1: Introduction.
Introduction to Information Retrieval CSE 538 MRS BOOK – CHAPTER I Boolean Model 1.
Information Retrieval Lecture 1. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all of Shakespeare’s.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
1 Information Retrieval Tanveer J Siddiqui J K Institute of Applied Physics & Technology University of Allahabad.
1. L01: Corpuses, Terms and Search Basic terminology The need for unstructured text search Boolean Retrieval Model Algorithms for compressing data Algorithms.
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval cs160 Introduction David Kauchak adapted from:
Introduction to Information Retrieval Boolean Retrieval.
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 1: Boolean retrieval.
Information Retrieval and Web Search Boolean retrieval Instructor: Rada Mihalcea (Note: some of the slides in this set have been adapted from a course.
An Efficient Information Retrieval System Objectives: n Efficient Retrieval incorporating keyword’s position; and occurrences of keywords in heading or.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Module 2: Boolean retrieval. Introduction to Information Retrieval Information Retrieval  Information Retrieval (IR) is finding material (usually documents)
CS315 Introduction to Information Retrieval Boolean Search 1.
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
Web-based Information Architecture 01: Boolean Retrieval Hongfei Yan School of EECS, Peking University 2/27/2013.
Take-away Administrativa
Information Retrieval : Intro
Large Scale Search: Inverted Index, etc.
CS122B: Projects in Databases and Web Applications Winter 2017
Lecture 1: Introduction and the Boolean Model Information Retrieval
COIS 442 Foundations on IR Information Retrieval and Web Search
Slides from Book: Christopher D
정보 검색 특론 Information Retrieval and Web Search
Information Retrieval and Web Search
Boolean Retrieval.
CS122B: Projects in Databases and Web Applications Winter 2017
Boolean Retrieval.
Information Retrieval and Web Search Lecture 1: Boolean retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
6. Implementation of Vector-Space Retrieval
Boolean Retrieval.
Introduction to Information Retrieval
CS276 Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Inverted Index Construction
Presentation transcript:

Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014

Outlines We will discuss two more topics. –Boolean retrieval –Posting list 2

Posting Lists 3

4 Tokenizing And Preprocessing 4

5 Generate Posting 5

6 Sort Postings 6

7 Creating Postings Lists, Determine Document Frequency 7

8 Split the Result into Dictionary and Postings File 8 dictionary postings

Inverted Index system computer database science D 2, 4 D 5, 2 D 1, 3 D 7, 4 Index terms df D j, tf j Term List (a.k.a. Dictionary) Postings lists     

Creating an Inverted Index Create an empty index term list I; For each document, D, in the document set V For each (non-zero) token, T, in D: If T is not already in I Insert T into I; Find the location for T in I; If (T, D) is in the posting list for T increase its term frequency for T; Else Create (T, D); Add it to the posting list for T;

In-Class Work Creating an inverted index for the following documents –“The quick brown fox jumps over the lazy dog” –“The quick fox run” –“The lazy dog sleep” Assume: –“the” and “over” are stopwords that are not indexed –various forms of verbs are not indexed separately, only the original form (stem) is indexed 11

Boolean Retrieval Processing Boolean queries Query optimization 12

13 Simple Conjunctive Query (two terms) Consider the query: BRUTUS AND CALPURNIA To find all matching documents using inverted index: – Locate BRUTUS in the dictionary (term list) – Retrieve its postings list from the postings file – Locate CALPURNIA in the dictionary – Retrieve its postings list from the postings file – Intersect the two postings lists – Return intersection to user 13

14 Intersecting Two Posting Lists The complexity is linear in the length of the postings lists. Note: This only works if postings lists are sorted. 14

15 Algorithm of Intersecting Two Posting Lists 15 Assume posting lists are sorted by docID

16 Query Processing: In-Class Work Compute hit list for ((paris AND NOT france) OR lear) 16

17 Boolean Queries The Boolean retrieval model can answer any query that is a Boolean expression. –Boolean queries are queries that use AND, OR and NOT to join query terms. –Views each document as a set of terms. –Precise: either document matches condition or not, nothing in between. Primary commercial retrieval tool for three decades Many professional searchers (e.g., lawyers) still like Boolean queries. –You know exactly what you are getting. Many search systems you use are also Boolean: spotlight, , intranet etc. 17

Outline Processing Boolean queries Query optimization 18

19 Query Optimization Consider a query that is an and of n terms, n > 2 For each of the terms, get its postings list, then and them together Example query: BRUTUS AND CALPURNIA AND CAESAR What is the best order for processing this query? That is, should we process BRUTUS first? CALPURNIA first? Or CEASAR first? 19

20 Query Optimization Example query: BRUTUS AND CALPURNIA AND CAESAR Simple and effective optimization: Process in order of increasing frequency Start with the shortest postings list, then keep cutting further In this example, first CAESAR, then CALPURNIA, then BRUTUS 20

21 Optimized Intersection Algorithm for Conjunctive Queries 21

22 More General Optimization Example query: (MADDING OR CROWD) and (IGNOBLE OR STRIFE) Get frequencies for all terms Estimate the size of each or by the sum of its frequencies (conservative) Process in increasing order of or sizes 22