CS122B: Projects in Databases and Web Applications Winter 2017

Slides:



Advertisements
Similar presentations
Information Retrieval
Advertisements

Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 1: Boolean Retrieval 1.
CS276A Text Retrieval and Mining Lecture 1. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all.
Adapted from Information Retrieval and Web Search
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Boolean Retrieval Lecture 2: Boolean Retrieval Web Search and Mining.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Information Retrieval
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Introduction to IR Systems: Supporting Boolean Text Search 198:541.
Information Retrieval using the Boolean Model. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all.
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 1: Introduction to IR.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Introduction to Information Retrieval Introduction to Information Retrieval cs458 Introduction David Kauchak adapted from:
LIS618 lecture 2 the Boolean model Thomas Krichel
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Modern Information Retrieval Lecture 3: Boolean Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval COMP4201 Information Retrieval and Search Engines Lecture 1 Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval.
Introduction to Information Retrival Slides are adapted from stanford CS276.
IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall.
Text Retrieval and Text Databases Based on Christopher and Raghavan’s slides.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
CES 514 – Data Mining Lec 2, Feb 10 Spring 2010 Sonoma State University.
Information Retrieval and Web Search
1 CS276 Information Retrieval and Web Search Lecture 1: Introduction.
Introduction to Information Retrieval CSE 538 MRS BOOK – CHAPTER I Boolean Model 1.
Information Retrieval Lecture 1. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all of Shakespeare’s.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
1 Information Retrieval Tanveer J Siddiqui J K Institute of Applied Physics & Technology University of Allahabad.
1. L01: Corpuses, Terms and Search Basic terminology The need for unstructured text search Boolean Retrieval Model Algorithms for compressing data Algorithms.
1 Information Retrieval LECTURE 1 : Introduction.
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval cs160 Introduction David Kauchak adapted from:
Introduction to Information Retrieval Boolean Retrieval.
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 1: Boolean retrieval.
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Module 2: Boolean retrieval. Introduction to Information Retrieval Information Retrieval  Information Retrieval (IR) is finding material (usually documents)
CS315 Introduction to Information Retrieval Boolean Search 1.
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
CS 6633 資訊檢索 Information Retrieval and Web Search Lecture 1: Boolean retrieval Based on a ppt file by HinrichSchütze.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
Take-away Administrativa
Information Retrieval : Intro
Large Scale Search: Inverted Index, etc.
COIS 442 Foundations on IR Information Retrieval and Web Search
Slides from Book: Christopher D
정보 검색 특론 Information Retrieval and Web Search
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Boolean Retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Web-Based Information Retrieval System
CS122B: Projects in Databases and Web Applications Winter 2017
Boolean Retrieval.
Information Retrieval and Web Search Lecture 1: Boolean retrieval
Boolean Retrieval.
Introduction to Information Retrieval
Introducing Information Retrieval and Web Search
CS276 Information Retrieval and Web Search
Introducing Information Retrieval and Web Search
Inverted Index Construction
Introducing Information Retrieval and Web Search
Presentation transcript:

CS122B: Projects in Databases and Web Applications Winter 2017 Professor Chen Li Department of Computer Science UC Irvine Notes 06: Inverted Index Slides borrowed from Prof. Manning at Stanford

Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? Slow (for large corpora) NOT Calpurnia is non-trivial Other operations (e.g., find the word Romans near countrymen) not feasible

Inverted index For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? Brutus 2 4 8 16 32 64 128 Calpurnia 1 2 3 5 8 13 21 34 Caesar 13 16 What happens if the word Caesar is added to document 14?

Inverted index Linked lists generally preferred to arrays Dynamic space allocation Insertion of terms into documents easy Space overhead of pointers 2 4 8 16 32 64 128 Dictionary Brutus Calpurnia Caesar 1 2 3 5 8 13 21 34 13 16 Postings

Inverted index construction Documents to be indexed. Friends, Romans, countrymen. Tokenizer Token stream. Friends Romans Countrymen Linguistic modules Modified tokens. friend roman countryman Indexer Inverted index. friend roman countryman 2 4 13 16 1

Query processing Consider processing the query: Brutus AND Caesar Locate Brutus in the Dictionary; Retrieve its postings. Locate Caesar in the Dictionary; “Merge” the two postings: 2 4 8 16 32 64 128 Brutus 1 2 3 5 8 13 21 34 Caesar

The merge Walk through the two postings simultaneously, in time linear in the total number of postings entries 2 34 128 2 4 8 16 32 64 1 3 5 13 21 4 8 16 32 64 128 Brutus Caesar 2 8 1 2 3 5 8 13 21 34 If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.

Boolean queries: Exact match Boolean Queries are queries using AND, OR and NOT together with query terms Views each document as a set of words Is precise: document matches condition or not. Primary commercial retrieval tool for 3 decades. Professional searchers (e.g., lawyers) still like Boolean queries: You know exactly what you’re getting.

Beyond term search What about phrases? Proximity: Find Gates NEAR Microsoft. Need index to capture position information in docs. More later. Zones in documents: Find documents with (author = Ullman) AND (text contains automata).

Other Challenges Stemming Tokenization Stop words Synonyms Especially hard for non-Latin languages E.g., Chinese, Japanese Stop words Synonyms