2 Information Retrieval. For the MSc Computer Science Programme. Lecture 1: Introduction to Information Retrieval (Manning et al. 2007), Chapter 1. Dell Zhang, Birkbeck, University of London.
3 What is IR? IR is about search engines. Database system (structured data in tables, SQL queries) vs. search engine (unstructured data as text, keyword queries). Structured data has been the big commercial success (e.g., Oracle), but unstructured data is now becoming dominant in a large and increasing range of activities.
4 Search is Cool. Text everywhere, much more than textbooks: documents (MS Word, PDF, etc.), articles (journal, magazine, newspaper, etc.), Web pages, emails, SMS, chat, … Much more than text: music, photos, videos, … Much more than the Web: personal, enterprise, institutional and domain-specific.
5 Search is Cool. Not only search/finding, but also organization and mining: clustering, classification, … For example, given thousands of CVs, …
6 Search is Hot. Most people’s means of information access. In the 1990s: other people. Nowadays: Web search. “92% of Internet users say the Internet is a good place to go for getting everyday information” (2004 Pew Internet Survey). The Search Engine War: Google, Yahoo, Microsoft, … Video: How Big Can You Think?
9 Free Textbook Online. C. D. Manning, P. Raghavan (Head of Yahoo! Research) and H. Schütze, Introduction to Information Retrieval, Cambridge University Press. information-retrieval-book.html Don’t remember. Search! Another reason for you to come to this module.
10 A Simple Search Engine. http://www.rhymezone.com/shakespeare/ “to be, or not to be”. Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? Drawbacks: slow (for large corpora); NOT Calpurnia is non-trivial; other operations (e.g., find the word Romans near countrymen) are not feasible. How do you search in a book?
11 Boolean Queries. Boolean queries are queries using AND, OR and NOT together with query terms. Each document is viewed as a set of words. Precise: a document either matches the condition or it does not. Primary commercial retrieval tool for 3 decades. Professional searchers still like Boolean queries: you know exactly what you’re getting. For example, .
12 Term-Document Incidence. An entry is 1 if the play contains the word, 0 otherwise. Query: Brutus AND Caesar but NOT Calpurnia.
13 Incidence Vectors. So we have a 0/1 vector for each term. To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
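The bitwise AND can be sketched in a few lines of Python. The incidence table itself did not survive in these notes, so the play order and the three bit strings below are an assumption taken from the corresponding figure in the Manning et al. textbook; the helper names are made up for illustration.

```python
# Column order of the incidence matrix (assumed, following the textbook figure).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence vector per term, written as a bit string (assumed values).
INCIDENCE = {
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
}

def to_bits(vector: str) -> int:
    """Turn a bit string such as '110100' into an integer bit vector."""
    return int(vector, 2)

def matching_plays(bits: int) -> list:
    """List the plays whose bit is set; the leftmost bit is the first play."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if (bits >> (n - 1 - i)) & 1]

# Complementing must stay within 6 bits, hence the mask.
mask = (1 << len(PLAYS)) - 1
not_calpurnia = ~to_bits(INCIDENCE["Calpurnia"]) & mask

answer = to_bits(INCIDENCE["Brutus"]) & to_bits(INCIDENCE["Caesar"]) & not_calpurnia
result = matching_plays(answer)
print(format(answer, "06b"), result)
```

Under these assumed vectors the answer vector is 100100, i.e. Antony and Cleopatra and Hamlet.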
14 Query Results. Antony and Cleopatra, Act III, Scene ii … Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii … Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
15 Bigger Corpora. Consider n = 1M documents, each with about 1K terms. Avg 6 bytes/term incl. spaces/punctuation: 6GB of data in the documents. Say there are m = 500K distinct terms among these.
16 Bigger Corpora. The 500K x 1M matrix has half-a-trillion 0’s and 1’s, but no more than one billion 1’s. Why? Each of the 1M documents has about 1K terms, so at most 1M x 1K entries can be 1. The matrix is extremely sparse. Can’t build the matrix straightforwardly. What’s a better representation? Only record the 1 positions.
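The numbers on the Bigger Corpora slides can be checked with a little arithmetic. The variable names are chosen here for illustration; the figures are the ones stated on the slides.

```python
n_docs = 1_000_000        # n = 1M documents
terms_per_doc = 1_000     # about 1K terms per document
bytes_per_term = 6        # average, including spaces/punctuation
n_terms = 500_000         # m = 500K distinct terms

corpus_bytes = n_docs * terms_per_doc * bytes_per_term  # size of the raw text
matrix_cells = n_terms * n_docs                         # cells in the term-document matrix
max_ones = n_docs * terms_per_doc                       # upper bound on the number of 1's
density = max_ones / matrix_cells                       # fraction of cells that can be 1

print(corpus_bytes)   # 6 * 10**9 bytes, i.e. 6GB
print(matrix_cells)   # 5 * 10**11, half a trillion
print(max_ones)       # 10**9, one billion
print(density)        # at most 0.2% of the cells are 1
```

At most 0.2% of the cells are 1, which is why storing only the 1 positions pays off.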
17 Inverted Index. For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? Brutus: 2, 4, 8, 16, 32, 64, 128. Calpurnia: 1, 2, 3, 5, 8, 13, 21, 34. Caesar: 13, 16. What happens if the word Caesar is added to document 14?
18 Inverted Index. Linked lists are generally preferred to arrays: dynamic space allocation; insertion of terms into documents is easy; the cost is the space overhead of pointers. Dictionary: Brutus, Calpurnia, Caesar. Postings (sorted by docID): Brutus: 2, 4, 8, 16, 32, 64, 128. Calpurnia: 1, 2, 3, 5, 8, 13, 21, 34. Caesar: 13, 16.
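A minimal sketch of the dictionary-plus-postings structure, using the postings shown on the slide. Note the assumption: it uses a Python dict and Python lists (dynamic arrays), not linked lists, so inserting docID 14 into the middle of a postings list shifts the later entries. The function name `add_posting` is made up for illustration.

```python
from bisect import bisect_left

# Dictionary: term -> postings list, kept sorted by docID.
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

def add_posting(index, term, doc_id):
    """Record that doc_id contains term, keeping the postings sorted and unique."""
    postings = index.setdefault(term, [])
    pos = bisect_left(postings, doc_id)      # binary search for the insert position
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)         # O(length) shift in an array-backed list

# "What happens if the word Caesar is added to document 14?"
add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

With a linked list the insertion itself would be a pointer update, which is the trade-off the slide describes.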
19 Inverted Index - Construction. Documents (to be indexed): Friends, Romans, countrymen. Tokenization gives the token stream: Friends, Romans, Countrymen. Linguistic preprocessing gives the modified tokens: friend, roman, countryman. Indexing gives the inverted index (a postings list for each of friend, roman, countryman).
20 Linguistic Preprocessing. Case-folding: often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization. Removal of stopwords: very common words like the, of, to, etc. List of stopwords (stop lists): linguistic_utils/stop_words
21 Linguistic Preprocessing. Stemming: reduce terms to their “roots” before indexing, e.g., automate(s), automatic, automation all reduced to automat. Porter’s stemmer: implementations in C, Java, Perl, Python, etc. …
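The preprocessing steps (tokenization, case-folding, stopword removal, stemming) can be sketched as a small pipeline. Everything here is an illustrative assumption: the stop list is a tiny hypothetical one, and `toy_stem` only strips a plural "s"; it is not Porter's stemmer.

```python
import re

# Hypothetical, very small stop list; real stop lists are much longer.
STOP_WORDS = {"the", "of", "to", "a", "and", "in"}

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)

def toy_stem(token):
    """Crude stand-in for a stemmer: strip one trailing plural 's'. NOT Porter's algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]                 # case-folding
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stopword removal
    return [toy_stem(t) for t in tokens]                 # stemming (toy version)

print(preprocess("Friends, Romans, countrymen."))
print(preprocess("The fall of the Romans"))
```

The toy stemmer leaves countrymen unchanged, whereas the slide shows countryman; mapping irregular plurals needs a real stemmer or lemmatizer.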
22 Indexing. Input: a sequence of (term, docID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
24 Merge. Merge multiple term entries in a single document. Add frequency information. Why? Will discuss later.
25 Split the result into a dictionary file and a postings file.
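The indexing steps (generate (term, docID) pairs, sort, merge duplicates with a frequency count, split into dictionary and postings) can be sketched on the two example documents. This is a sketch under assumptions: the regex tokenizer, the in-memory dicts standing in for the two files, and the choice to store the document frequency in the dictionary are all illustrative.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def term_doc_pairs(docs):
    """Step 1: emit one (term, docID) pair per token occurrence."""
    pairs = []
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z']+", text.lower()):
            pairs.append((term, doc_id))
    return pairs

def build_index(docs):
    # Step 2: sort by term, then by docID.
    pairs = sorted(term_doc_pairs(docs))
    dictionary = {}  # term -> number of documents containing it
    postings = {}    # term -> [(docID, term frequency in that document), ...]
    for term, term_group in groupby(pairs, key=lambda p: p[0]):
        entries = []
        # Step 3: merge repeated (term, docID) pairs into one entry with a count.
        for doc_id, doc_group in groupby(term_group, key=lambda p: p[1]):
            entries.append((doc_id, sum(1 for _ in doc_group)))
        # Step 4: split into a dictionary entry and a postings entry.
        postings[term] = entries
        dictionary[term] = len(entries)
    return dictionary, postings

dictionary, postings = build_index(docs)
print(dictionary["caesar"], postings["caesar"])
print(postings["killed"])
```

Because the pairs are sorted before merging, each postings list comes out ordered by docID, which the query-processing merge relies on.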
26 The index we just built. Where do we pay in storage? How do we process a query?
27 Query Processing. Consider processing the query: Brutus AND Caesar. Locate Brutus in the dictionary; retrieve its postings. Locate Caesar in the dictionary; retrieve its postings. “Merge” the two postings: Brutus: 2, 4, 8, 16, 32, 64, 128. Caesar: 1, 2, 3, 5, 8, 13, 21, 34.
28 Query Processing - Merge. Walk through the two postings simultaneously, in time linear in the total number of postings entries. In other words, if the postings list lengths are x and y, the merge takes O(x+y) operations. Brutus: 2, 4, 8, 16, 32, 64, 128. Caesar: 1, 2, 3, 5, 8, 13, 21, 34. Result: 2, 8. What’s crucial: postings must be sorted by docID.
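The merge for Brutus AND Caesar can be sketched as a two-pointer walk over the sorted postings. The function name `intersect` is chosen here for illustration, and plain Python lists stand in for the postings.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

Each step advances at least one pointer, so the loop runs at most x + y times; this only works because both lists are sorted by docID.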
29 Query Processing - Exercise. Adapt the merge algorithm for the queries: Brutus AND NOT Caesar; Brutus OR NOT Caesar. Can we still run through the merge in time O(x+y)?
30 Query Processing - Exercise. What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra). Can we always merge in “linear” time? The time complexity is linear in what? Can we do better?
31 Query Processing - Exercise. Extend the merge algorithm to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size? [Hint] Begin with the case of a Boolean formula query, in which each query term appears only once in the query.
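One possible answer to the first query, Brutus AND NOT Caesar, is sketched here as a variant of the same two-pointer walk. It is one way to do it, not the official solution; the function name `and_not` is made up. The OR NOT case is left open.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that are absent from p2; both lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p2 has no entry for this docID
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # docID is in both lists, so it is excluded
            j += 1
        else:
            j += 1                # catch up in p2
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```

As in the AND merge, every iteration advances a pointer, so this sketch also runs in O(x+y).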
32 Take Home Messages. IR is about search engines. Search is hot! Search is cool! Free textbook online! Inverted index: dictionary + postings. Boolean search: query processing.