Information Retrieval Lecture 1 Introduction to Information Retrieval (Manning et al. 2007) Chapter 1 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London
What is IR? IR is about search engines. Search Engine: Unstructured Data (Text), Keyword Queries. VS. Database System: Structured Data (Tables), SQL Queries. Structured data has been the big commercial success (e.g., Oracle), but unstructured data is now becoming dominant in a large and increasing range of activities.
Search is Cool: Text everywhere: books, documents (MS Word, PDF, etc.), articles (journal, magazine, newspaper, etc.), Web pages, emails, SMS, chat, … Much more than text: music, photos, videos, … Much more than the Web: personal, enterprise, institutional and domain-specific.
Search is Cool: Not only search/finding, but also organization and mining: clustering, classification, …… For example, given thousands of CVs, … http://commons.wikimedia.org/wiki/Image:Bookshelf.jpg
Search is Hot: Most people's means of information access. In the 1990s: other people. Nowadays: Web search. 92% of Internet users say the Internet is a good place to go for getting everyday information (2004 Pew Internet Survey). The Search Engine War: Google, Yahoo, Microsoft, … Video: How Big Can You Think?
Data: Unstructured vs. Structured in 1996
Data: Unstructured vs. Structured in 2006
Free Textbook Online: C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html Head of Yahoo! Research. Don't remember. Search! Another reason for you to come to this module.
A Simple Search Engine http://www.rhymezone.com/shakespeare/ Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. This is slow (for large corpora); NOT Calpurnia is non-trivial; other operations (e.g., find the word Romans near countrymen) are not feasible. to be, or not to be. How do you search in a book?
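The grep-style approach can be sketched as a linear scan. This is only an illustration under assumptions: the corpus `plays` below is a hypothetical three-entry stand-in (not real play texts), `linear_scan` is a made-up helper name, and matching is done per document rather than per line.

```python
# Hypothetical toy corpus: title -> text. These are NOT real play texts.
plays = {
    "Play A": "Brutus spoke to Caesar",
    "Play B": "Brutus and Caesar met Calpurnia",
    "Play C": "Caesar alone",
}

def linear_scan(corpus, must_have, must_not_have):
    """Return titles whose text contains every word in must_have
    and no word in must_not_have."""
    hits = []
    for title, text in corpus.items():
        # Every query re-reads and re-tokenizes every document,
        # which is why this does not scale to large corpora.
        words = set(text.split())
        if must_have <= words and not (must_not_have & words):
            hits.append(title)
    return hits

result = linear_scan(plays, {"Brutus", "Caesar"}, {"Calpurnia"})
print(result)
```

The cost of each query grows with the total size of the corpus, which motivates building an index once and reusing it.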
Boolean Queries: Boolean queries are queries using AND, OR and NOT together with query terms. Each document is viewed as a set of words. Precise: a document either matches the condition or it does not. Primary commercial retrieval tool for 3 decades. Professional searchers still like Boolean queries: you know exactly what you're getting. For example, http://www.westlaw.com/
Term-Document Incidence: 1 if play contains word, 0 otherwise. Query: Brutus AND Caesar but NOT Calpurnia
Incidence Vectors: So we have a 0/1 vector for each term. To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them: 110100 & 110111 & 101111 = 100100
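A minimal sketch of the same computation, storing each incidence vector as an integer bit pattern. The column order in `plays` is an assumption: only Antony and Cleopatra and Hamlet are confirmed by the query results in this deck, and the other four titles are taken to follow the textbook's example.

```python
# Assumed column order of the incidence matrix (leftmost bit = first play).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000      # its complement is the 101111 used on the slide

# Mask keeps the complement to exactly six bits (Python ints are unbounded).
mask = (1 << len(plays)) - 1
answer = brutus & caesar & (~calpurnia & mask)

# Read the answer back as play titles, most significant bit first.
matches = [play for i, play in enumerate(plays)
           if answer >> (len(plays) - 1 - i) & 1]

print(format(answer, "06b"))
print(matches)
```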
Query Results Antony and Cleopatra, Act III, Scene ii …… Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. …… Hamlet, Act III, Scene ii ……. Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. ……
Bigger Corpora: Consider n = 1M documents, each with about 1K terms. At an average of 6 bytes/term (including spaces and punctuation), that is 6GB of data in the documents. Say there are m = 500K distinct terms among these.
Bigger Corpora: The 500K x 1M matrix has half a trillion 0s and 1s, but no more than one billion 1s. Why? Each of the 1M documents has only about 1K terms. The matrix is extremely sparse, so we can't build it straightforwardly. What's a better representation? Only record the 1 positions.
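The arithmetic behind these figures can be checked directly. The variable names are illustrative; the numbers are the ones given on the two Bigger Corpora slides.

```python
n_docs = 1_000_000        # n = 1M documents
terms_per_doc = 1_000     # about 1K terms per document
bytes_per_term = 6        # average, including spaces and punctuation
n_terms = 500_000         # m = 500K distinct terms

corpus_bytes = n_docs * terms_per_doc * bytes_per_term   # raw text size
matrix_cells = n_terms * n_docs                          # cells in the 0/1 matrix
max_ones = n_docs * terms_per_doc                        # upper bound on the 1s
density = max_ones / matrix_cells                        # fraction of cells that can be 1

print(corpus_bytes, matrix_cells, max_ones, density)
```

At most 0.2% of the cells can be 1, which is why storing only the 1 positions pays off.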
Inverted Index: For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? Brutus → 2, 4, 8, 16, 32, 64, 128; Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34; Caesar → 13, 16. What happens if the word Caesar is added to document 14?
Inverted Index: Linked lists are generally preferred to arrays. Pros: dynamic space allocation; insertion of terms into documents is easy. Con: space overhead of pointers. Dictionary: Brutus, Calpurnia, Caesar. Postings (sorted by docID): Brutus → 2, 4, 8, 16, 32, 64, 128; Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34; Caesar → 13, 16.
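A rough sketch of the "Caesar is added to document 14" case, assuming the postings on the slide read as in the `index` literal below. The helper `add_posting` is a hypothetical name. Note that a Python `list` is array-backed, so this shows the array side of the trade-off: the insert position is found quickly, but later elements must be shifted.

```python
import bisect

# Dictionary (keys) and postings (values, sorted by docID).
index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [13, 16],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the postings of term, keeping them sorted and unique."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)      # binary search for the slot
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)                # shifts the tail of the array

add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

With a linked list the shift disappears, at the price of one pointer per posting and no binary search.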
Inverted Index - Construction: Documents (to be indexed): Friends, Romans, countrymen. → Tokenization → Token Stream: Friends, Romans, Countrymen → Linguistic Preprocessing → Modified Tokens: friend, roman, countryman → Indexing → Inverted Index: friend → 2, 4; roman → 1, 2; countryman → 13, 16.
Linguistic Preprocessing: Case-folding: often best to lowercase everything, since users will use lowercase regardless of correct capitalization. Removal of stopwords: very common words like the, of, to, etc. Lists of stopwords (stop lists): http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words http://www.ranks.nl/tools/stopwords.html
Linguistic Preprocessing: Stemming: reduce terms to their roots before indexing, e.g., automate(s), automatic, automation all reduced to automat. Porter's stemmer: implementations in C, Java, Perl, Python, etc. http://www.tartarus.org/~martin/PorterStemmer/ ……
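The three preprocessing steps can be sketched as one small pipeline. Everything here is an assumption made for illustration: `STOPWORDS` is a tiny hand-picked list, and `toy_stem` is a crude suffix stripper that is NOT Porter's algorithm; it is only tuned to reproduce the automat example.

```python
import re

# Tiny illustrative stop list (assumption, not one of the linked lists).
STOPWORDS = {"the", "of", "to", "a", "and"}

def toy_stem(token):
    """Crude suffix stripping for illustration only; NOT Porter's stemmer."""
    for suffix in ("ion", "ic", "es", "e", "s"):
        # Keep at least three characters of stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())          # case-folding + tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]     # stopword removal
    return [toy_stem(t) for t in tokens]                   # stemming

print(preprocess("The automation of automatic Friends"))
```

A real system would use an actual Porter stemmer implementation rather than this suffix list.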
Indexing: Input is a sequence of (term, docID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sort by terms. This is the core indexing step.
Merge multiple term entries in a single document. Add frequency information. Why? Will discuss later.
Split the result into a dictionary file and a postings file.
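The indexing steps on the last four slides (pairs, sort, merge with frequencies, split) can be sketched on the two example documents. The `tokenize` helper is a simplistic assumption (lowercase, split on whitespace, strip a few punctuation marks), and in-memory dicts stand in for the dictionary file and postings file.

```python
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase and strip trailing punctuation.
    stripped = (w.strip(".,;:").lower() for w in text.split())
    return [w for w in stripped if w]

# Step 1: a sequence of (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort by term (and by docID within a term).
pairs.sort()

# Step 3: merge repeated entries within a document, keeping the term frequency.
# Step 4: split into a dictionary (term -> document frequency)
#         and postings (term -> [(docID, term frequency), ...]).
dictionary = {}
postings = {}
for term, group in groupby(pairs, key=lambda p: p[0]):
    doc_ids = [doc_id for _, doc_id in group]
    entries = [(doc_id, len(list(g))) for doc_id, g in groupby(doc_ids)]
    postings[term] = entries
    dictionary[term] = len(entries)

print(dictionary["caesar"], postings["caesar"])
```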
The index we just built: Where do we pay in storage? How do we process a query?
Query Processing: Consider processing the query: Brutus AND Caesar. Locate Brutus in the dictionary; retrieve its postings. Locate Caesar in the dictionary; retrieve its postings. Merge the two postings: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34.
Query Processing - Merge: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34; merged result → 2, 8. What's crucial: postings must be sorted by docID. Walk through the two postings simultaneously, in time linear in the total number of postings entries. In other words, if the posting list lengths are x and y, the merge takes O(x+y) operations.
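A minimal sketch of the two-pointer merge for an AND query, using the Brutus and Caesar postings from this slide. The function name `intersect` is illustrative; the code relies on both lists being sorted by docID.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
result = intersect(brutus, caesar)
print(result)
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times.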
Query Processing - Exercise: Adapt the merge algorithm for the queries (a) Brutus AND NOT Caesar and (b) Brutus OR NOT Caesar. Can we still run through the merge in time O(x+y)?
Query Processing - Exercise What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) Can we always merge in linear time? The time complexity is linear in what? Can we do better?
Query Processing - Exercise: Extend the merge algorithm to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size? [Hint] Begin with the case of a Boolean formula query, in which each query term appears only once in the query.
Take Home Messages: IR is about search engines. Search is hot! Search is cool! Free textbook online! http://www.dcs.bbk.ac.uk/~dell/teaching/ir/ Inverted Index: dictionary + postings. Boolean Search: query processing.