1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information Retrieval and Web Mining
2 (Offline) Search Engine Data Flow
[Figure: pipeline from the crawler to the inverted text index, with these stages]
- Crawler: fetches web pages
- Parse & Tokenize: parse, tokenize, per-page analysis; produces tokenized web pages
- Global Analysis, in background: dup detection, static rank, anchor text, spam analysis, …; produces the dup table, rank table, anchor text, and spam table
- Index Build: scan tokenized web pages, anchor text, etc.; generate the inverted text index
3 Inverted index
For each term T, we must store a list of all documents that contain T.
Dictionary     Postings lists
Brutus      -> 2 4 8 16 32 64 128
Calpurnia   -> 1 2 3 5 8 13 21 34
Caesar      -> 13 16
Each docID in a list is a posting. Postings are sorted by docID (more later on why).
4 Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
-> Tokenizer -> token stream: Friends Romans Countrymen
-> Linguistic modules -> modified tokens: friend roman countryman
-> Indexer -> inverted index:
   friend     -> 2 4
   roman      -> 1 2
   countryman -> 13 16
5 Indexer steps
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
7 Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.
8 The result is split into a Dictionary file and a Postings file.
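The indexer steps on the preceding slides (emit pairs, sort, merge duplicates with frequencies, split into dictionary and postings) can be sketched in a few lines. This is a minimal in-memory illustration, not the lecture's implementation: `tokenize` and `build_index` are hypothetical names, and the tokenizer (lowercasing and stripping punctuation) is only a stand-in for the linguistic modules.

```python
from collections import Counter, defaultdict

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Lowercase and strip surrounding punctuation (assumed, simplified normalization)."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?").lower()
        if token:
            tokens.append(token)
    return tokens


def build_index(docs):
    # Step 1: sequence of (modified token, document ID) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge multiple entries per document and record the frequency.
    counts = Counter(pairs)
    postings = defaultdict(list)
    for (term, doc_id), freq in sorted(counts.items()):
        postings[term].append((doc_id, freq))
    # Step 4: split into a dictionary (term -> number of documents) and postings.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)


dictionary, postings = build_index(DOCS)
print(postings["caesar"])  # [(1, 1), (2, 2)]
print(postings["killed"])  # [(1, 2)]
```

Each postings entry here is a (docID, frequency) pair, and every list comes out sorted by docID because the pairs were sorted before grouping.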
9 The index we just built How do we process a query?
10 Query processing: AND
Consider processing the query: Brutus AND Caesar
- Locate Brutus in the Dictionary; retrieve its postings.
- Locate Caesar in the Dictionary; retrieve its postings.
- “Merge” the two postings:
  Brutus -> 2 4 8 16 32 64 128
  Caesar -> 1 2 3 5 8 13 21 34
11 The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.
Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34
Result -> 2 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
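The linear-time merge can be written as a two-pointer walk over the two sorted lists. The sketch below is illustrative; the function name `intersect` is an assumption, and the two lists are the Brutus and Caesar postings as read from the slide's figure.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The docID is in both lists: it matches the AND query.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is why the sort order by docID is what makes the O(x+y) bound possible.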
12 Index construction How do we construct an index? What strategies can we use with limited main memory?
13 Our corpus for this lecture Number of docs = n = 1M Each doc has 1K terms Number of distinct terms = m = 500K 667 million postings entries
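14 How many postings?
Number of 1’s in the i-th block = nJ/i.
Summing this over m/J blocks, we have $\sum_{i=1}^{m/J} \frac{nJ}{i} = nJ\,H_{m/J} \approx nJ \ln(m/J)$, where $H_k$ is the k-th harmonic number.
For our numbers, this should be about 667 million postings.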
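As a quick check of this estimate, the logarithmic approximation nJ ln(m/J) can be evaluated directly. The value of J is not given on this slide; J = 76 below is an assumption (the number of terms per block in the Zipf-style analysis), chosen because it reproduces the quoted figure.

```python
import math

n = 1_000_000  # number of documents
m = 500_000    # number of distinct terms
J = 76         # ASSUMPTION: block size; not stated on this slide


def estimate_postings(n, m, J):
    """Approximate the sum of n*J/i over i = 1..m/J by n * J * ln(m / J)."""
    return n * J * math.log(m / J)


estimate = estimate_postings(n, m, J)
print(f"{estimate / 1e6:.0f} million postings")
```

Under that assumption the result lands within about one percent of the 667 million used in the rest of the lecture.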
15 Recall index construction
Documents are processed to extract words and these are saved with the Document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
16 Key step
After all documents have been processed the inverted file is sorted by terms. We focus on this sort step. We have 667M items to sort.
17 Index construction
At 10-12 bytes per postings entry, sorting 667M entries demands several gigabytes of temporary space.
18 System parameters for design Disk seek ~ 10 milliseconds Block transfer from disk ~ 1 microsecond per byte (following a seek) All other ops ~ 1 microsecond E.g., compare two postings entries and decide their merge order
19 Bottleneck
- Build postings entries one doc at a time
- Now sort postings entries by term (then by doc within each term)
- Doing this with random disk seeks would be too slow – must sort N=667M records
If every comparison took 2 disk seeks, and N items could be sorted with N log_2 N comparisons, how long would this take?
20 Disk-based sorting
- Build postings entries one doc at a time
- Now sort postings entries by term
- Doing this with random disk seeks would be too slow – must sort N=667M records
If every comparison took 2 disk seeks, and N items could be sorted with N log_2 N comparisons, how long would this take? 12.4 years!!!
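The 12.4-year figure follows from the system parameters given earlier (10 ms per disk seek). A short back-of-the-envelope calculation, with variable names chosen here for illustration:

```python
import math

N = 667_000_000          # postings entries to sort
SEEK_SECONDS = 0.010     # disk seek ~ 10 milliseconds
SEEKS_PER_COMPARISON = 2

comparisons = N * math.log2(N)
seconds = comparisons * SEEKS_PER_COMPARISON * SEEK_SECONDS
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")  # 12.4 years
```

The comparison count is about 2 x 10^10, so even a few milliseconds per comparison puts the total far beyond anything practical.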
21 Sorting with fewer disk seeks
- 12-byte (4+4+4) records (term, doc, freq). These are generated as we process docs.
- Must now sort 667M such 12-byte records by term.
- Define a block as ~10M such records; we can “easily” fit a couple of blocks into memory.
- Will have 64 such blocks to start with.
- Will sort within blocks first, then merge the blocks into one long sorted order.
22 Sorting 64 blocks of 10M records
First, read each block and sort within:
- Quicksort takes 2N ln N expected steps
- In our case 2 x (10M ln 10M) steps
- Time to Quicksort each block = 320 seconds
- Total time to read each block from disk and write it back: 120M x 2 x 10^-6 = 240 seconds
64 times this estimate gives us 64 sorted runs of 10M records each:
- Total Quicksort time = 5.6 hours
- Total read+write time = 4.2 hours
- Total for this phase ~ 10 hours
Need 2 copies of data on disk, throughout.
23 Merging 64 sorted runs
Merge tree of log_2 64 = 6 layers. During each layer, read into memory runs in blocks of 10M, merge, write back.
[Figure: sorted runs on disk being merged pairwise into a merged run]
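The merge tree can be illustrated with a small in-memory sketch that merges runs pairwise, layer by layer. This is only a model of the control flow: a real implementation would stream runs from and to disk in blocks rather than hold lists in memory, and `merge_tree` is a hypothetical name.

```python
import heapq
import random


def merge_tree(runs):
    """Merge sorted runs pairwise, layer by layer; return (merged run, layer count)."""
    layers = 0
    while len(runs) > 1:
        next_layer = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                # Two-way merge of adjacent sorted runs.
                next_layer.append(list(heapq.merge(runs[i], runs[i + 1])))
            else:
                next_layer.append(runs[i])  # odd run out is carried up unchanged
        runs = next_layer
        layers += 1
    return (runs[0] if runs else []), layers


random.seed(0)
runs = [sorted(random.sample(range(100_000), 5)) for _ in range(64)]
merged, layers = merge_tree(runs)
print(layers)  # 6
```

With 64 initial runs the loop executes six times, matching the log_2 64 = 6 layers on the slide; every layer touches every record once, which is why the disk transfer cost is multiplied by the number of layers.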
25 Merging 64 runs
- Time estimate for disk transfer: 6 x time to read+write 64 blocks = 6 x 4.2 hours ~ 25 hours
- Time estimate for the merge operation: 6 x 640M x 10^-6 = 1 hour
- Time estimate for the overall algorithm: Sort time + Merge time ~ 10 + 26 ~ 36 hours
Lower bound (main memory sort):
- Time to read+write = 4.2 hours
- Time to sort in memory = 10.7 hours
- Total time ~ 15 hours
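The estimates for the blocked sort-and-merge algorithm can be recomputed from the system parameters (1 microsecond per byte transferred and per other operation). The sketch below uses unrounded intermediate values, so its totals differ slightly from the rounded slide figures; the constant names are chosen here for illustration.

```python
import math

MICROSECOND = 1e-6
RECORDS_PER_BLOCK = 10_000_000
BLOCKS = 64
RECORD_BYTES = 12

# Phase 1: Quicksort each block (2 N ln N steps) and read + write it once.
quicksort_per_block = 2 * RECORDS_PER_BLOCK * math.log(RECORDS_PER_BLOCK) * MICROSECOND
read_write_per_block = RECORDS_PER_BLOCK * RECORD_BYTES * 2 * MICROSECOND
sort_phase = BLOCKS * (quicksort_per_block + read_write_per_block)

# Phase 2: binary merge tree; every layer reads, merges, and writes all records.
layers = int(math.log2(BLOCKS))
merge_transfer = layers * BLOCKS * read_write_per_block
merge_compare = layers * BLOCKS * RECORDS_PER_BLOCK * MICROSECOND

total_hours = (sort_phase + merge_transfer + merge_compare) / 3600
print(f"sort phase  ~ {sort_phase / 3600:.1f} h")
print(f"merge phase ~ {(merge_transfer + merge_compare) / 3600:.1f} h")
print(f"total       ~ {total_hours:.1f} h")
```

Disk transfer in the merge phase dominates the total, which is what motivates larger sort buffers and fewer merge layers.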
29 Indexing improvements
- Radix sort: linear-time sorting; flexibility in defining the sort criteria
- Bigger sort buffers increase performance (contradicting previous literature; see the VLDB paper in the references)
- Pipelining the read phase and the sort + write phase
[Figure: timeline in which buffers B1 and B2 alternate between the Read phase and the Sort + Write phase]
30 Positional indexing
Given documents:
D1: This is a test
D2: Is this a test
D3: This is not a test
Reorganize by term:
TERM  DOC  LOC  DATA(caps)
this  1    0    1
is    1    1    0
a     1    2    0
test  1    3    0
is    2    0    1
this  2    1    0
a     2    2    0
test  2    3    0
this  3    0    1
is    3    1    0
not   3    2    0
a     3    3    0
test  3    4    0
31 Positional indexing
Sorted by term, then document:
TERM  DOC  LOC  DATA(caps)
a     1    2    0
a     2    2    0
a     3    3    0
is    1    1    0
is    2    0    1
is    3    1    0
not   3    2    0
test  1    3    0
test  2    3    0
test  3    4    0
this  1    0    1
this  2    1    0
this  3    0    1
In “postings list” format:
a    (1,2,0),(2,2,0),(3,3,0)
is   (1,1,0),(2,0,1),(3,1,0)
not  (3,2,0)
test (1,3,0),(2,3,0),(3,4,0)
this (1,0,1),(2,1,0),(3,0,1)
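The two positional-indexing slides can be reproduced with a short sketch that emits (term, doc, loc, caps) records and then groups them into postings lists of (doc, loc, caps) triples. The helper names `emit_records` and `to_postings` are hypothetical, and the "caps" data bit is assumed to mean that the original word started with a capital letter, which is consistent with the tables.

```python
DOCS = {
    1: "This is a test",
    2: "Is this a test",
    3: "This is not a test",
}


def emit_records(docs):
    """Produce (term, doc, loc, caps) records in document order."""
    records = []
    for doc_id, text in docs.items():
        for loc, word in enumerate(text.split()):
            caps = 1 if word[0].isupper() else 0
            records.append((word.lower(), doc_id, loc, caps))
    return records


def to_postings(records):
    """Sort records by (term, doc, loc) and group them into postings lists."""
    index = {}
    for term, doc_id, loc, caps in sorted(records):
        index.setdefault(term, []).append((doc_id, loc, caps))
    return index


index = to_postings(emit_records(DOCS))
for term, plist in index.items():
    print(term, plist)
# a [(1, 2, 0), (2, 2, 0), (3, 3, 0)]
# is [(1, 1, 0), (2, 0, 1), (3, 1, 0)]
# not [(3, 2, 0)]
# test [(1, 3, 0), (2, 3, 0), (3, 4, 0)]
# this [(1, 0, 1), (2, 1, 0), (3, 0, 1)]
```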
32 Positional indexing with radix sort
Radix key:
- Token hash = 8 bytes
- Document ID = 8 bytes
- Location = 4 bytes, but no need to sort by location since radix sort is stable!
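The stability argument can be demonstrated with a small least-significant-digit radix sort over a 16-byte key (token hash followed by document ID). This is a sketch under stated assumptions: `token_hash` uses BLAKE2b truncated to 8 bytes purely as an example hash, the function names are hypothetical, and the records are assumed to arrive in location order within each document, as they do when documents are scanned left to right.

```python
import hashlib


def token_hash(term):
    """8-byte hash of a term (BLAKE2b here is an illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(term.encode(), digest_size=8).digest(), "big")


def radix_key(record):
    """16-byte key: token hash in the high 8 bytes, document ID in the low 8 bytes."""
    term, doc_id, _loc = record
    return (token_hash(term) << 64) | doc_id


def radix_sort(records, key, key_bytes):
    """Stable LSD radix sort, one byte per pass; key must fit in key_bytes bytes."""
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for rec in records:
            buckets[(key(rec) >> shift) & 0xFF].append(rec)  # appending keeps input order
        records = [rec for bucket in buckets for rec in bucket]
    return records


docs = {1: "to be or not to be", 2: "be not to"}
records = [(w, d, loc) for d, text in docs.items() for loc, w in enumerate(text.split())]
result = radix_sort(records, radix_key, key_bytes=16)
print([r for r in result if r[0] == "to"])  # [('to', 1, 0), ('to', 1, 4), ('to', 2, 2)]
```

Location never enters the key, yet the locations within each (term, document) group stay in ascending order because every pass preserves the relative order of records that fall into the same bucket. Terms come out grouped in hash order rather than alphabetical order.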
33 Distributed indexing Maintain a master machine directing the indexing job – considered “safe” Break up indexing into sets of (parallel) tasks Master machine assigns each task to an idle machine from a pool
34 Parallel tasks We will use two sets of parallel tasks Parsers Inverters Break the input document corpus into splits Each split is a subset of documents Master assigns a split to an idle parser machine Parser reads a document at a time and emits (term, doc) pairs
35 Parallel tasks Parser writes pairs into j partitions Each for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j=3. Now to complete the index inversion
37 Inverters
- Collect all (term, doc) pairs for a partition
- Sort the pairs and write them out as postings lists
- Each partition contains a set of postings
The above process flow is a special case of MapReduce.
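The parser and inverter roles can be simulated sequentially in a single process. The sketch below is only a model of the data flow, with no master machine or real parallelism; `parse`, `invert`, and `partition_of` are hypothetical names, and the three term ranges are the a-f, g-p, q-z example from the parser slide.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]  # j = 3 term ranges by first letter


def partition_of(term):
    for j, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return j
    raise ValueError(f"no partition for {term!r}")


def parse(split):
    """Parser: read one document at a time, emit (term, doc) pairs into j partitions."""
    out = [[] for _ in PARTITIONS]
    for doc_id, text in split.items():
        for word in text.lower().split():
            term = word.strip(".,;:!?'")
            if term and "a" <= term[0] <= "z":
                out[partition_of(term)].append((term, doc_id))
    return out


def invert(pair_lists):
    """Inverter: collect all pairs for one partition, sort, and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted({pair for pairs in pair_lists for pair in pairs}):
        postings[term].append(doc_id)
    return dict(postings)


splits = [
    {1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me."},
    {2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"},
]
parsed = [parse(split) for split in splits]  # one parser task per split
index_parts = [invert([p[j] for p in parsed]) for j in range(len(PARTITIONS))]
print(index_parts[0]["brutus"])  # [1, 2]
```

Each inverter only ever sees one term range, so the per-partition postings can be built independently and written out without coordination between inverters.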
38 MapReduce
- Model for processing large data sets.
- Contains Map and Reduce functions.
- Runs on a large cluster of machines.
- Many MapReduce programs are executed on Google’s cluster every day.
39 Motivation Input data is large The whole Web, billions of Pages Lots of machines Use them efficiently
40 A real example Term frequencies through the whole Web repository. Count of URL access frequency. Reverse web-link graph ….
41 Programming model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
- Processes input key/value pair
- Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
- Combines all intermediate values for a particular key
- Produces a set of merged output values (usually just one)
42 Example Page 1: the weather is good Page 2: today is good Page 3: good weather is good.
43 Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
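The word-count pseudocode can be exercised on the three example pages with a tiny single-process driver. This is a sketch of the programming model only: `run_mapreduce` is a hypothetical stand-in for the library, with no distribution or fault tolerance, and the map function additionally lowercases words and strips punctuation, which the pseudocode does not specify.

```python
import string
from collections import defaultdict

PAGES = {
    "Page 1": "the weather is good",
    "Page 2": "today is good",
    "Page 3": "good weather is good.",
}


def map_fn(input_key, input_value):
    """input_key: document name; input_value: document contents."""
    for word in input_value.split():
        w = word.strip(string.punctuation).lower()
        if w:
            yield (w, 1)


def reduce_fn(output_key, intermediate_values):
    """output_key: a word; intermediate_values: a list of counts."""
    return sum(intermediate_values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Group all intermediate values by key, then reduce each group.
    groups = defaultdict(list)
    for in_key, in_value in inputs.items():
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


counts = run_mapreduce(PAGES, map_fn, reduce_fn)
print(counts)  # {'good': 4, 'is': 3, 'the': 1, 'today': 1, 'weather': 2}
```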
49 Fault tolerance
Typical cluster:
- 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
- Storage is on local IDE disks
- GFS: distributed file system manages data (SOSP'03)
- Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs.
50 Fault tolerance
On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion committed through master
Master failure:
- Could handle, but don't yet (master failure unlikely)
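The worker-failure rules can be modelled as a small piece of master-side bookkeeping. Everything below is a hypothetical illustration rather than the actual library: the `Task` record, the heartbeat table, and the timeout value are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    kind: str                     # "map" or "reduce"
    state: str                    # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # machine the task was assigned to


def failed_workers(last_heartbeat, now, timeout):
    """Workers whose last heartbeat is older than the timeout are considered failed."""
    return {w for w, seen in last_heartbeat.items() if now - seen > timeout}


def reschedule(tasks, failed):
    """Reset tasks that must be re-executed; return their IDs."""
    reset = []
    for task in tasks:
        if task.worker not in failed:
            continue
        redo_map = task.kind == "map" and task.state in ("in_progress", "completed")
        redo_reduce = task.kind == "reduce" and task.state == "in_progress"
        if redo_map or redo_reduce:
            task.state, task.worker = "idle", None  # eligible for reassignment
            reset.append(task.task_id)
    return reset


tasks = [
    Task("m1", "map", "completed", "w1"),
    Task("m2", "map", "in_progress", "w2"),
    Task("r1", "reduce", "completed", "w1"),
    Task("r2", "reduce", "in_progress", "w1"),
]
dead = failed_workers({"w1": 100.0, "w2": 158.0}, now=160.0, timeout=30.0)
print(reschedule(tasks, dead))  # ['m1', 'r2']
```

Completed map tasks are redone because their output lives on the failed machine's local disk, whereas completed reduce output is already in the distributed file system and can be kept.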
51 Performance Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): 150 seconds. Sort 10^10 100-byte records (modeled after TeraSort benchmark): 839 seconds.
53 Experience: Rewrite of Production Indexing System Rewrote Google's production indexing system using MapReduce Set of 24 MapReduce operations New code is simpler, easier to understand MapReduce takes care of failures, slow machines Easy to make indexing faster by adding more machines
54 MapReduce Overview MapReduce has proven to be a useful abstraction Greatly simplifies large-scale computations at Google Fun to use: focus on problem, let library deal w/ messy details
55 Resources
- MG Chapter 5
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
- Indexing Shared Content in Information Retrieval Systems, A. Broder et al., EDBT 2006
- High Performance Index Build Algorithms for Intranet Search Engines, M. F. Fontoura et al., VLDB 2004