1 Web Search and Information Retrieval
2 Definition of information retrieval Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need within large collections (usually stored on computers)
3 Structured vs unstructured data Structured data: information in “tables”
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith
Ivy       Smith
Typically allows numerical range and exact match (for text) queries, e.g., Salary < AND Manager = Smith.
4 Unstructured data Typically refers to free text Allows Keyword-based queries including operators More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse
5 Ultimate Focus of IR Satisfying user information need Emphasis is on retrieval of information (not data) Predicting which documents are relevant, and then linearly ranking them.
6 SIGIR 2005 Basic assumptions of Information Retrieval Collection: Fixed set of documents Goal: Retrieve documents with information that is relevant to user’s information need and helps him complete a task
7 The classic search model TASK: Get rid of mice in a politically correct way → (Mis-conception) → Info Need: Info about removing mice without killing them → (Mis-translation) → Verbal form: How do I trap mice alive? → (Mis-formulation) → Query: mouse trap → SEARCH ENGINE (over the Corpus) → Results → Query Refinement
8 Boolean Queries Some simple query examples Documents containing the word “Java” Documents containing the word “Java” but not the word “coffee” Documents containing the phrase “Java beans” or the term “API” Documents where “Java” and “island” occur in the same sentence The last two queries are called proximity queries
9 Before processing the queries … Documents in the collection should be tokenized in a suitable manner. We need to decide what terms should be put in the index.
10 Tokens and Terms
11 Tokenization Input: “Friends, Romans and Countrymen” Output: Tokens Friends Romans Countrymen Each such token is now a candidate for an index entry, after further processing Described below
12 Why tokenization is difficult – even in English Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. Tokenize this sentence
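The two tokenization slides above can be tried out with a deliberately naive tokenizer. This is only a sketch: the function name `tokenize` and the rule "split on runs of non-word characters" are assumptions for illustration, not the method the slides prescribe. Note that this rule keeps "and", which the slide's sample output omits.

```python
import re

def tokenize(text):
    """Naive tokenizer (assumed rule): split on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text) if tok]

simple = tokenize("Friends, Romans and Countrymen")

# The "difficult" sentence from the slide: apostrophes split O’Neill, Chile’s
# and aren’t into fragments, which is rarely what an index should store.
hard = tokenize(
    "Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing."
)
print(simple)
print(hard)
```

Running it shows fragments such as `O`, `Neill`, `s`, and `t`, which is one concrete way to see why a single splitting rule is not enough even for English.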
13 One word or two? (or several) fault-finder co-education state-of-the-art data base San Francisco cheap San Francisco-Los Angeles fares
14 Tokenization: language issues Chinese and Japanese have no spaces between words: 莎拉波娃現在居住在美國東南部的佛羅里達。 Not always guaranteed a unique tokenization
15 Ambiguous segmentation in Chinese The two characters can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.
16 Normalization Need to “normalize” terms in indexed text as well as query terms into the same form. Example: We want to match U.S.A. and USA. Two general solutions: (1) We most commonly implicitly define equivalence classes of terms. (2) Alternatively, do asymmetric expansion: window → window, windows; windows → Windows, windows; Windows (no expansion). More powerful, but less efficient.
17 Case folding Reduce all letters to lower case exception: upper case in mid-sentence? Fed vs. fed Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
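The two normalization strategies, together with case folding, can be sketched as follows. The helper names `normalize` and `expand_query` are hypothetical, and the equivalence-class rule used here (drop periods, then lowercase) is an assumption chosen to cover the U.S.A./USA example; the expansion table copies the window/windows/Windows example from the slide.

```python
def normalize(term):
    """Equivalence classing (assumed rule): drop periods, then lowercase."""
    return term.replace(".", "").lower()

# Asymmetric expansion: each query term maps to the index terms to search.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],  # no expansion
}

def expand_query(term):
    """Return the index terms to look up for a query term."""
    return EXPANSIONS.get(term, [term])

print(normalize("U.S.A."), normalize("USA"))
print(expand_query("windows"))
```

Equivalence classing is applied once to both documents and queries, while expansion does extra lookups at query time, which is where the "more powerful, but less efficient" trade-off comes from.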
18 Lemmatization Reduce inflectional/variant forms to base form. E.g., am, are, is → be; car, cars, car's, cars' → car; the boy's cars are different colors → the boy car be different color. Lemmatization implies doing “proper” reduction to dictionary headword form.
19 Stemming Definition of stemming: Crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge. Reduce terms to their “roots” before indexing. “Stemming” suggests crude affix chopping; it is language dependent, e.g., automate(s), automatic, automation all reduced to automat. Example: “for example compressed and compression are both accepted as equivalent to compress.” → “for exampl compress and compress ar both accept as equival to compress”
20 Porter algorithm Most common algorithm for stemming English. Results suggest that it is at least as good as other stemming options. Phases are applied sequentially. Each phase consists of a set of commands. Sample command: Delete final “ement” if what remains is longer than 1 character: replacement → replac, cement → cement. Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
21 Porter stemmer: A few rules
Rule        Example
SSES → SS   caresses → caress
IES → I     ponies → poni
SS → SS     caress → caress
S →         cats → cat
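The four rules above, combined with the "longest matching suffix wins" convention, can be sketched in a few lines. This covers only these four rules, not the full Porter algorithm, and the function name `apply_rules` is hypothetical.

```python
# (suffix, replacement) pairs taken from the rule table above.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_rules(word):
    """Apply the rule whose suffix is the longest one matching the word."""
    matches = [(suffix, repl) for suffix, repl in RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_rules(w))
```

The SS → SS rule looks like a no-op, but it matters: without it, "caress" would fall through to the S → rule and lose its final letter.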
22 Other stemmers Other stemmers exist, e.g., Lovins stemmer Single-pass, longest suffix removal (about 250 rules) Full morphological analysis – at most modest benefits for retrieval Do stemming and other normalizations help? English: very mixed results. Helps recall for some queries but harms precision on others E.g., Porter Stemmer equivalence class oper contains all of operate operating operates operation operative operatives operational Definitely useful for Spanish, German, Finnish, …
23 Thesauri Handle synonyms and homonyms Hand-constructed equivalence classes e.g., car = automobile color = colour Rewrite to form equivalence classes Index such equivalences When the document contains automobile, index it under car as well (usually, also vice-versa) Or expand query? When the query contains automobile, look under car as well
24 Stop words(1) stop words = extremely common words which would appear to be of little value in helping select documents matching a user need They have little semantic content Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with Without suitable compression techniques, it needs a lot of space to index stop words. Stop word elimination used to be standard in older IR systems.
25 Stop words(2) But the trend is away from doing this: Good compression techniques mean the space for including stopwords in a system is very small Good query optimization techniques mean you pay little at query time for including stop words. You need them for: Phrase queries: “King of Denmark” Various song titles, etc.: “Let it be”, “To be or not to be” ‘can’ as a verb is not very useful for keyword queries, but ‘can’ as a noun could be central to a query Most web search engines index stop words
26 Start to process Boolean queries(1) Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious. The information contained in Doc 1 and Doc 2 can be represented in the table on the right.
27 Start to process Boolean queries(2) The table mentioned above is called POSTING. By using a table like this, it is simple to answer the queries using SQL.
Documents containing the word “Java”:
select did from POSTING where tid = 'java'
Documents containing the word “Java” but not the word “coffee”:
(select did from POSTING where tid = 'java') except (select did from POSTING where tid = 'coffee')
28 Start to process Boolean queries(3) Documents containing the phrase “Java beans” or the term “API”:
with D_JAVA(did, pos) as (select did, pos from POSTING where tid = 'java'),
D_BEANS(did, pos) as (select did, pos from POSTING where tid = 'beans'),
D_JAVABEANS(did) as (select D_JAVA.did from D_JAVA, D_BEANS where D_JAVA.did = D_BEANS.did and D_JAVA.pos + 1 = D_BEANS.pos),
D_API(did) as (select did from POSTING where tid = 'api')
(select did from D_JAVABEANS) union (select did from D_API)
Documents where “Java” and “island” occur in the same sentence: If sentence terminators are well defined, one can keep a sentence counter and maintain sentence positions as well as token positions in the POSTING table.
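To try these queries end to end, here is a small sketch using Python's built-in `sqlite3` module. The three sample documents are invented for illustration, and the column layout `POSTING(tid, did, pos)` is inferred from the queries on the slides. SQLite does not accept parentheses around the operands of `except`/`union`, so they are dropped here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POSTING (tid TEXT, did INTEGER, pos INTEGER)")

# Hypothetical mini-collection: one row per token occurrence.
docs = {1: "java beans and coffee", 2: "java island", 3: "the api guide"}
for did, text in docs.items():
    for pos, tid in enumerate(text.split()):
        conn.execute("INSERT INTO POSTING VALUES (?, ?, ?)", (tid, did, pos))

def dids(sql):
    """Run a query and return the matching document ids, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

java_not_coffee = dids(
    "SELECT did FROM POSTING WHERE tid = 'java' "
    "EXCEPT SELECT did FROM POSTING WHERE tid = 'coffee'"
)

phrase_or_api = dids("""
    WITH D_JAVA(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'java'),
         D_BEANS(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'beans'),
         D_JAVABEANS(did) AS (
             SELECT D_JAVA.did FROM D_JAVA, D_BEANS
             WHERE D_JAVA.did = D_BEANS.did AND D_JAVA.pos + 1 = D_BEANS.pos),
         D_API(did) AS (SELECT did FROM POSTING WHERE tid = 'api')
    SELECT did FROM D_JAVABEANS UNION SELECT did FROM D_API
""")
print(java_not_coffee, phrase_or_api)
```

The phrase query works because adjacent tokens have consecutive `pos` values within the same `did`.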
29 Is it efficient? Although the three-column table makes it easy to write keyword queries, it wastes a great deal of space. To reduce the storage space Document-term matrix -> term-document matrix Inverted index For each term T, we must store a list of all documents that contain T.
30 Inverted index: the basic concept
31 Inverted index Linked lists generally preferred to arrays: dynamic space allocation; insertion of terms into documents easy; but space overhead of pointers. [Figure: a Dictionary (Brutus, Calpurnia, Caesar) pointing to Postings lists sorted by docID; each list entry is a Posting]
32 Query processing: AND Consider processing the query: Brutus AND Caesar. Locate Brutus in the Dictionary; Retrieve its postings. Locate Caesar in the Dictionary; Retrieve its postings. “Merge” the two postings. [Figure: the Brutus and Caesar postings lists]
33 The merge Walk through the two postings simultaneously, in time linear in the total number of postings entries. [Figure: merging the Brutus and Caesar postings yields 2, 8] If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
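The linear-time merge can be sketched as a two-pointer walk over the sorted lists. The function name `intersect` and the two sample postings lists are illustrative assumptions; the lists were chosen so that the result is 2 and 8, matching the slide.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID in O(x + y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]      # illustrative postings
caesar = [1, 2, 3, 5, 8, 13, 21, 34]     # illustrative postings
print(intersect(brutus, caesar))
```

Each step advances at least one pointer, so the loop runs at most x + y times. This bound relies on both lists being sorted by docID.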
34 Index construction Sequence of (Modified token, Document ID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
35 Sort by terms. Core indexing step. For a large-scale indexer, an external sort is used (N-way merge sort).
36 Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.
37 The result is split into a Dictionary file and a Postings file.
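The four index-construction steps (emit pairs, sort, merge duplicates with frequencies, split into dictionary and postings) can be sketched in memory on the two sample documents. This is a toy version: a real indexer would use the external sort mentioned above, and the token rule (letters and apostrophes, lowercased) is an assumption.

```python
import re
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(tok.lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in re.findall(r"[A-Za-z']+", text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge multiple entries of a term in one document, keeping the count.
term_freq = Counter(pairs)  # (term, docID) -> occurrences in that document

# Step 4: split into a postings structure and a dictionary.
postings = {}
for (term, doc_id), freq in sorted(term_freq.items()):
    postings.setdefault(term, []).append((doc_id, freq))
dictionary = {term: len(plist) for term, plist in postings.items()}  # doc frequency

print(postings["caesar"], dictionary["caesar"])
```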
38 Distributed indexing For web-scale indexing (don’t try this at home!): must use a distributed computing cluster Individual machines are fault-prone Can unpredictably slow down or fail How do we exploit such a pool of machines?
39 Google data centers Google data centers mainly contain commodity machines. Data centers are distributed around the world. Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007) Estimate: Google installs 100,000 servers each quarter.
40 Distributed indexing Maintain a master machine directing the indexing job – considered “safe”. Break up indexing into sets of (parallel) tasks. Master machine assigns each task to an idle machine from a pool.
41 Parallel tasks We will use two sets of parallel tasks Parsers Inverters Break the input document corpus into splits Each split is a subset of documents
42 Parsers Master assigns a split to an idle parser machine Parser reads a document at a time and emits (term, doc) pairs Parser writes pairs into j partitions Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j=3. Now to complete the index inversion
43 Inverters An inverter collects all (term,doc) pairs (= postings) for one term-partition. Sorts and writes to postings lists
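The parser and inverter roles can be simulated in a single process. In this sketch the functions `parse` and `invert`, the sample splits, and whitespace tokenization are all assumptions; on a real cluster the master would hand each call to an idle machine. The three term partitions (a-f, g-p, q-z) follow the slide.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]  # j = 3

def partition_of(term):
    """Pick the partition by the term's first letter."""
    for idx, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return idx
    raise ValueError(f"no partition for {term!r}")

def parse(split):
    """Parser: read each document in a split, emit (term, doc) pairs into j partitions."""
    buckets = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            buckets[partition_of(term)].append((term, doc_id))
    return buckets

def invert(pairs):
    """Inverter: sort the pairs of one term-partition and build postings lists."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

# Hypothetical corpus broken into two splits.
splits = [[(1, "caesar came")], [(2, "caesar conquered"), (3, "brutus killed")]]
parsed = [parse(s) for s in splits]            # one parser task per split
index = {}
for p in range(len(PARTITIONS)):               # one inverter task per partition
    pairs = [pair for buckets in parsed for pair in buckets[p]]
    index.update(invert(pairs))
print(index)
```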
45 MapReduce The index construction algorithm we just described is an instance of MapReduce. MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … … without having to write code for the distribution part.
46 MapReduce Index construction was just one phase. Another phase: transforming a term-partitioned index into a document-partitioned index. Term-partitioned: one machine handles a subrange of terms. Document-partitioned: one machine handles a subrange of documents. (As we discuss in the web part of the course, most search engines use a document-partitioned index … better load balancing, etc.)
47 Dynamic indexing Up to now, we have assumed that collections are static. They rarely are: Documents come in over time and need to be inserted. Documents are deleted and modified. This means that the dictionary and postings lists have to be modified: Postings updates for terms already in dictionary New terms added to dictionary
48 Simplest approach Maintain “big” main index Insertions New docs go into “small” auxiliary index Search across both, merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation bit-vector Periodically, re-index into one main index
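The main-plus-auxiliary scheme can be sketched as a small class. The class name `DynamicIndex`, the `merge_threshold` parameter, and the use of a Python set in place of a real invalidation bit-vector are all assumptions made for brevity.

```python
class DynamicIndex:
    def __init__(self, merge_threshold=3):
        self.main = {}         # "big" main index: term -> sorted docIDs
        self.aux = {}          # "small" auxiliary index for new documents
        self.deleted = set()   # stands in for the invalidation bit-vector
        self.aux_docs = 0
        self.merge_threshold = merge_threshold

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Search both indexes, merge results, filter out invalidated documents.
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(d for d in hits if d not in self.deleted)

    def merge(self):
        # Periodic re-index into one main index, dropping deleted documents.
        rebuilt = {}
        for term in set(self.main) | set(self.aux):
            docs = set(self.main.get(term, [])) | set(self.aux.get(term, []))
            docs -= self.deleted
            if docs:
                rebuilt[term] = sorted(docs)
        self.main, self.aux, self.aux_docs = rebuilt, {}, 0
        self.deleted.clear()

idx = DynamicIndex(merge_threshold=3)
idx.add(1, "brutus killed caesar")
idx.add(2, "caesar was ambitious")
before_delete = idx.search("caesar")
idx.delete(1)
after_delete = idx.search("caesar")
idx.add(3, "noble brutus")          # third document triggers the merge
after_merge = idx.search("brutus")
```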
49 Dynamic indexing at search engines All the large search engines now do dynamic indexing Their indices have frequent incremental changes News items, new topical web pages But (sometimes/typically) they also periodically reconstruct the index from scratch Query processing is then switched to the new index, and the old index is then deleted
50 Something about dictionary
51 A naïve dictionary An array of structs, each holding a char array (20 bytes), an int (4/8 bytes), and a Postings * pointer (4/8 bytes). How do we quickly look up elements at query time? How do we store a dictionary in memory efficiently?
52 Dictionary data structures Two main choices: Hash table Tree Some IR systems use hashes, some trees
53 Hashes Each vocabulary term is hashed to an integer (We assume you’ve seen hashtables before). Pros: Lookup is faster than for a tree: O(1). Cons: No easy way to find minor variants: judgment/judgement. No prefix search [tolerant retrieval]. If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.
54 Trees Simplest: binary tree. More usual: B+-tree. Pros: Solves the prefix problem (terms starting with hyp). Cons: Slower: O(log M) [and this requires balanced tree]. Rebalancing binary trees is expensive, but B+-trees mitigate the rebalancing problem.
55 Other issues Wild-card query. Example mon*: find all docs containing any word beginning with “mon”. Spell correction. Two main flavors: Isolated word: check each word on its own for misspelling; will not catch typos resulting in correctly spelled words, e.g., from → form. Context-sensitive: look at surrounding words, e.g., I flew form Heathrow to Narita.
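A prefix query such as mon* is easy on any ordered dictionary, because all matching terms sit next to each other. The sketch below uses a sorted Python list with `bisect` as a stand-in for a tree; the function name `prefix_terms` and the sample vocabulary are assumptions.

```python
import bisect

def prefix_terms(sorted_terms, prefix):
    """Return all terms starting with prefix, assuming sorted_terms is sorted."""
    i = bisect.bisect_left(sorted_terms, prefix)   # first term >= prefix
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

vocabulary = sorted(["moon", "month", "monday", "money", "more", "man"])
print(prefix_terms(vocabulary, "mon"))
```

A hash table offers no such shortcut: the hashes of "monday" and "money" are unrelated, so every term would have to be checked.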
56 Why compress the dictionary Must keep in memory Search begins with the dictionary Embedded/mobile devices
57 Dictionary storage - first cut Array of fixed-width entries: ~400,000 terms; 28 bytes/term = 11.2 MB. [Figure: dictionary search structure; the term field takes 20 bytes, the remaining two fields 4 bytes each]
58 Fixed-width terms are wasteful Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms. And we still can’t handle supercalifragilisticexpialidocious. Ave. dictionary word in English: ~8 characters How do we use ~8 characters per dictionary term?
59 Compressing the term list: Dictionary-as-a-String Store dictionary as a (long) string of characters: ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. Pointer to next word shows end of current word. Hope to save up to 60% of dictionary space. Total string length = 400K x 8B = 3.2MB. Pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits, i.e., 3 bytes.
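The dictionary-as-a-string idea and its arithmetic can be checked with a short sketch. The names `big_string`, `offsets`, and `term_at` are hypothetical. The per-term total assumes the two 4-byte fields from the fixed-width layout are kept, with the 20-byte term field replaced by a 3-byte pointer plus about 8 characters.

```python
import math

terms = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"]

# One long string plus, per term, a pointer (offset) to where it starts.
big_string = "".join(terms)
offsets = [0]
for t in terms[:-1]:
    offsets.append(offsets[-1] + len(t))

def term_at(i):
    """The pointer to the next term marks the end of the current one."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
    return big_string[offsets[i]:end]

# Size estimate with the figures from the slides.
n_terms, avg_len = 400_000, 8
string_bytes = n_terms * avg_len                   # 3,200,000 bytes = 3.2 MB
ptr_bits = math.ceil(math.log2(string_bytes))      # 22 bits
ptr_bytes = math.ceil(ptr_bits / 8)                # 3 bytes
bytes_per_term = 4 + 4 + ptr_bytes + avg_len       # 19 bytes, down from 28
total_bytes = n_terms * bytes_per_term             # 7,600,000 bytes = 7.6 MB
print(term_at(1), ptr_bits, ptr_bytes, total_bytes)
```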
60 Blocking Store pointers to every kth term string. Example below: k=4. Need to store term lengths (1 extra byte) ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo…. Save 9 bytes on 3 pointers. Lose 4 bytes on term lengths.
61 Net Where we used 3 bytes/pointer without blocking (3 x 4 = 12 bytes for k=4 pointers), now we use 3+4=7 bytes for 4 pointers. Shaved another ~0.5MB; can save more with larger k. Why not go with larger k?
62 Dictionary search without blocking Assuming each dictionary term equally likely in query (not really so in practice!), average number of comparisons = (1+2*2+4*3+4)/8 ≈ 2.6
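Blocking and its savings can be sketched as follows. The helper names are hypothetical, and the term length is written as decimal digits here where a real implementation would use one byte; terms are assumed not to start with a digit.

```python
def encode_block(terms):
    """Concatenate the terms of one block, each prefixed by its length."""
    return "".join(f"{len(t)}{t}" for t in terms)

def decode_block(block):
    """Linear scan through a block: read a length, then that many characters."""
    terms, pos = [], 0
    while pos < len(block):
        end = pos
        while block[end].isdigit():
            end += 1
        length = int(block[pos:end])
        terms.append(block[end:end + length])
        pos = end + length
    return terms

def bytes_saved(n_terms, k, ptr_bytes=3):
    """Savings from one pointer per k terms, paid for with k length bytes."""
    per_block = ptr_bytes * k - (ptr_bytes + k)   # k=4: 12 - 7 = 5 bytes
    return (n_terms // k) * per_block

block = encode_block(["systile", "syzygetic", "syzygial", "syzygy"])
print(block)
print(bytes_saved(400_000, 4))
```

For k=4 and 400,000 terms this gives 500,000 bytes, the ~0.5MB quoted above. A larger k saves more space, but every lookup then ends in a longer linear scan inside the block.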
63 Dictionary search with blocking Binary search down to 4-term block; Then linear search through terms in block. Blocks of 4 (binary tree), avg. = (1+2*2+2*3+2*4+5)/8 = 3 compares
64 Front coding Front-coding: Sorted words commonly have long common prefix, so store differences only (for last k-1 in a block of k). 8automata8automate9automatic10automation → 8automat*a1 e2 ic3 ion. Here automat* encodes the shared prefix automat, and each number that follows gives the extra length beyond automat. Begins to resemble general string compression.
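Front coding of one block can be sketched as below. The function name `front_code` is hypothetical, and the separator between a suffix length and its suffix is an assumption: a space is used so that the output matches the encoded string as it appears on the slide.

```python
import os

def front_code(terms, sep=" "):
    """Front-code one block of sorted terms that share a common prefix."""
    prefix = os.path.commonprefix(terms)
    first = terms[0]
    # First term: full length, shared prefix, '*' marker, then the remainder.
    encoded = f"{len(first)}{prefix}*{first[len(prefix):]}"
    # Remaining terms: only the extra length and the characters after the prefix.
    for term in terms[1:]:
        suffix = term[len(prefix):]
        encoded += f"{len(suffix)}{sep}{suffix}"
    return encoded

coded = front_code(["automata", "automate", "automatic", "automation"])
print(coded)
```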
66 B+-tree Records must be ordered over an attribute Queries: exact match and range queries over the indexed attribute: “ find the name of the student with ID= ” or “ find all students with gpa between 3.00 and 3.5 ”
67 B+-tree: properties Insert/delete at log_F(N/B) cost; keep tree height-balanced (F = fanout). Two types of nodes: index nodes and data nodes; each node is 1 page (disk-based method).
68 Index node [Figure: an index node holding keys 57, 81, 95, with pointers to keys k < 57, 57 ≤ k < 81, 81 ≤ k < 95, and k ≥ 95]
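Routing a search key through an index node amounts to finding which key range it falls into. The sketch below uses the separator keys 57, 81, and 95 from the index-node figure; the function name `child_slot` is hypothetical, and slot numbering from 0 is an assumption.

```python
import bisect

KEYS = [57, 81, 95]  # separator keys of the index node

def child_slot(keys, k):
    """Slot 0 covers k < keys[0]; slot i covers keys[i-1] <= k < keys[i];
    the last slot covers k >= keys[-1]."""
    return bisect.bisect_right(keys, k)

for k in (10, 57, 80, 81, 95, 100):
    print(k, "-> pointer", child_slot(KEYS, k))
```

Repeating this step at each index level leads to the data node that holds the key, which is why lookup cost grows with the tree height rather than with the number of records.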
69 Data node To record with key 57 To record with key 81 To record with key 85 From non-leaf node to next leaf in sequence
70 EX: B+ Tree of order 3. (a) Initial tree [Figure: the initial tree, with an index level above a data level]