Information Retrieval

Slides:



Advertisements
Similar presentations
Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Advertisements

Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 1: Boolean Retrieval 1.
CS276A Text Retrieval and Mining Lecture 1. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all.
Adapted from Information Retrieval and Web Search
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Ricerca dell’Informazione nel Web Prof. Stefano Leonardi.
Boolean Retrieval Lecture 2: Boolean Retrieval Web Search and Mining.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
CpSc 881: Information Retrieval
1 Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof Raymond Mooney (UTexas), Prof. Joydeep Ghosh (UT ECE) who.
Information Retrieval
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Introduction to IR Systems: Supporting Boolean Text Search 198:541.
Information Retrieval using the Boolean Model. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all.
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 1: Introduction to IR.
1 CSCI 5417 Information Retrieval Systems Lecture 1 8/23/2011 Introduction.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introducing Information Retrieval and Web Search
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Introduction to Information Retrieval Introduction to Information Retrieval cs458 Introduction David Kauchak adapted from:
LIS618 lecture 2 the Boolean model Thomas Krichel
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Modern Information Retrieval Lecture 3: Boolean Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval COMP4201 Information Retrieval and Search Engines Lecture 1 Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval.
Introduction to Information Retrival Slides are adapted from stanford CS276.
IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall.
Text Retrieval and Text Databases Based on Christopher and Raghavan’s slides.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
CES 514 – Data Mining Lec 2, Feb 10 Spring 2010 Sonoma State University.
1 CS276 Information Retrieval and Web Search Lecture 1: Introduction.
Introduction to Information Retrieval CSE 538 MRS BOOK – CHAPTER I Boolean Model 1.
Information Retrieval Lecture 1. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all of Shakespeare’s.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
1 Information Retrieval Tanveer J Siddiqui J K Institute of Applied Physics & Technology University of Allahabad.
1. L01: Corpuses, Terms and Search Basic terminology The need for unstructured text search Boolean Retrieval Model Algorithms for compressing data Algorithms.
1 Information Retrieval LECTURE 1 : Introduction.
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval cs160 Introduction David Kauchak adapted from:
Introduction to Information Retrieval Boolean Retrieval.
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 1: Boolean retrieval.
Information Retrieval and Web Search Boolean retrieval Instructor: Rada Mihalcea (Note: some of the slides in this set have been adapted from a course.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Module 2: Boolean retrieval. Introduction to Information Retrieval Information Retrieval  Information Retrieval (IR) is finding material (usually documents)
CS315 Introduction to Information Retrieval Boolean Search 1.
CS 6633 資訊檢索 Information Retrieval and Web Search Lecture 1: Boolean retrieval Based on a ppt file by HinrichSchütze.
Take-away Administrativa
Information Retrieval : Intro
Large Scale Search: Inverted Index, etc.
CS122B: Projects in Databases and Web Applications Winter 2017
Lecture 1: Introduction and the Boolean Model Information Retrieval
COIS 442 Foundations on IR Information Retrieval and Web Search
Slides from Book: Christopher D
정보 검색 특론 Information Retrieval and Web Search
Boolean Retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
CS122B: Projects in Databases and Web Applications Winter 2017
Boolean Retrieval.
Information Retrieval and Web Search Lecture 1: Boolean retrieval
Boolean Retrieval.
CS276 Information Retrieval and Web Search
Introducing Information Retrieval and Web Search
Inverted Index Construction
Presentation transcript:

Information Retrieval For the MSc Computer Science Programme Lecture 1 Introduction to Information Retrieval (Manning et al. 2007) Chapter 1 Dell Zhang Birkbeck, University of London

What is IR? IR is about search engines. Database System VS Structured Data (Tables) SQL Queries VS Search Engine Unstructured Data (Text) Keyword Queries Structured data has been the big commercial success (e.g., Oracle) but unstructured data is now becoming dominant in a large and increasing range of activities.

Search is Cool Text everywhere Much more than text books, documents (ms-word, pdf, etc.), articles (journal, magazine, newspaper, etc.), Web pages, emails, SMS, chat, … Much more than text music, photos, videos, … Much more than the Web personal enterprise, institutional and domain-specific

Search is Cool Not only search/finding, but also organization and mining clustering classification …… http://commons.wikimedia.org/wiki/Image:Bookshelf.jpg For example, given thousaunds of CVs, …

Video: How Big Can You Think? Search is Hot Most people’s means of information access In the 1990s: other people Nowadays: Web search “92% of Internet users say the Internet is a good place to go for getting everyday information” – (2004 Pew Internet Survey) The Search Engine War Google, Yahoo, Microsoft, … Video: How Big Can You Think?

Data: Unstructured vs. Structured in 1996

Data: Unstructured vs. Structured in 2006

Free Textbook Online Head of Yahoo! Research C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. http://www-csli.stanford.edu/~schuetze/ information-retrieval-book.html Don’t remember. Search! Another reason for you to come to this module.

A Simple Search Engine http://www.rhymezone.com/shakespeare/ “to be, or not to be” http://www.rhymezone.com/shakespeare/ Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? Slow (for large corpora) NOT Calpurnia is non-trivial Other operations (e.g., find the word Romans near countrymen) not feasible How do you search in a book?

Boolean Queries Boolean Queries are queries using AND, OR and NOT together with query terms. Each document is viewed as a set of words. Precise: a document matches condition or not. Primary commercial retrieval tool for 3 decades. Professional searchers still like Boolean queries - you know exactly what you’re getting. For example, http://www.westlaw.com/ .

Term-Document Incidence 1 if play contains word, 0 otherwise Brutus AND Caesar but NOT Calpurnia

Incidence Vectors So we have a 0/1 vector for each term. To answer the query take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND. 110100 & 110111 & 101111 = 100100

Query Results Antony and Cleopatra, Act III, Scene ii …… Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii ……. Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Bigger Corpora Consider n = 1M documents each with about 1K terms. Avg 6 bytes/term incl spaces/punctuation 6GB of data in the documents. Say there are m = 500K distinct terms among these.

Bigger Corpora The 500K x 1M matrix has half-a-trillion 0’s and 1’s, but no more than one billion 1’s. The matrix is extremely sparse. Can’t build the matrix straightforwardly. What’s a better representation? Only record the 1 positions. Why?

Inverted Index For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? Brutus 2 4 8 16 32 64 128 Calpurnia 1 2 3 5 8 13 21 34 Caesar 13 16 What happens if the word Caesar is added to document 14?

Inverted Index Linked lists generally preferred to arrays Dynamic space allocation Insertion of terms into documents easy Space overhead of pointers 2 4 8 16 32 64 128 Dictionary Brutus Calpurnia Caesar 1 2 3 5 8 13 21 34 13 16 Postings (sorted by docID)

Inverted Index - Construction Tokenization Token Stream Friends Romans Countrymen Linguistic Preprocessing Modified Tokens friend roman countryman Indexing Inverted Index 2 4 13 16 1 Documents (to be indexed) Friends, Romans, countrymen.

Linguistic Preprocessing Case-folding Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization. Removal of stopwords Very common words like the, of, to, etc. List of stopwords (stop lists) http://www.dcs.gla.ac.uk/idom/ir_resources/ linguistic_utils/stop_words http://www.ranks.nl/tools/stopwords.html

Linguistic Preprocessing Stemming Reduce terms to their “roots” before indexing, e.g., automate(s), automatic, automation all reduced to automat. Porter’s stemmer Implementations in C, Java, Perl, Python, etc. http://www.tartarus.org/~martin/PorterStemmer/ ……

Indexing Input a sequence of (term, docID) pairs Doc 1 Doc 2 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sort by terms. This is the core indexing step,

Merge Add multiple term entries in a single document. frequency information. Why? Will discuss later.

Split the result into a dictionary file and a postings file.

The index we just built Where do we pay in storage? How do we process a query?

Query Processing Consider processing the query: Brutus AND Caesar Locate Brutus in the dictionary; Retrieve its postings. Locate Caesar in the dictionary; “Merge” the two postings: 2 4 8 16 32 64 128 Brutus 1 2 3 5 8 13 21 34 Caesar

Query Processing - Merge Walk through the two postings simultaneously, in time linear in the total number of postings entries. In other words, if the posting list lengths are x and y, the merge takes O(x+y) operations. 2 34 128 2 4 8 16 32 64 1 3 5 13 21 4 8 16 32 64 128 Brutus Caesar 2 8 1 2 3 5 8 13 21 34 What’s crucial: postings must be sorted by docID.

Query Processing - Exercise Adapt the merge algorithm for the queries: Brutus AND NOT Caesar Brutus OR NOT Caesar Can we still run through the merge in time O(x+y)?

Query Processing - Exercise What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) Can we always merge in “linear” time? The time complexity is linear in what? Can we do better?

Query Processing - Exercise Extend the merge algorithm to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size? [Hint] Begin with the case of a Boolean formula query, then each query term appears only once in the query.

Take Home Messages IR is about search engines. Free textbook online! Search is hot! Search is cool! Free textbook online! http://www.dcs.bbk.ac.uk/~dell/teaching/ir/ Inverted Index dictionary + postings Boolean Search query processing