7CCSMWAL Algorithmic Issues in the WWW


1 7CCSMWAL Algorithmic Issues in the WWW
Lecture 7

2 Text searching
To search for a keyword within text, we can either:
- Scan the text sequentially, when
  - the text is small, e.g., a few MB
  - the text collection is very volatile (modified very frequently)
  - no extra space is available (for building indices)
- Build a data structure over the text (called an index) to speed up the search, when
  - the text collection is large and static or semi-static (can be updated at reasonably regular intervals)

3 Inverted files
Also called inverted indices. Mainly composed of two elements:
- Dictionary (or vocabulary, or lexicon): the set of all different words (tokens, index terms)
- Posting list (or inverted list): each word has a list of the positions where it appears; "position" is used here to refer to terms within documents, e.g., each page of this PPT presentation
The postings file is the set of all posting lists. Postings lists are much larger than the dictionary, so the dictionary is commonly kept in memory while the postings lists are normally kept on disk. The structure of postings lists can vary (it is problem dependent). A toy sketch follows.
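
As an illustration (my own sketch, not from the lecture), the dictionary and postings file can be modelled in Python as a mapping from terms to sorted lists of document IDs:

# Toy inverted file: the dictionary maps each term to its postings list
# (a sorted list of document IDs). In a real system the dictionary stays
# in memory and the postings lists are read from disk on demand.
inverted_file = {
    "brutus": [1, 2],
    "caesar": [1, 2],
    "capitol": [1],
}

# Vocabulary lookup followed by retrieval of the postings list.
print(inverted_file.get("brutus", []))   # [1, 2]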

4 Example (see Intro to IR)
- The dictionary is sorted alphabetically by term
- Each posting list is sorted by document ID
- The numbers are the documents in which the term occurs (or lines in a page or book, or whatever)

5 Construct an inverted file
Input: a list of normalized tokens for each document.
Example:
Doc 1: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

6 Construct an inverted index
- The core indexing step is sorting the list of tokens so that the terms are in alphabetical order
- Multiple occurrences of the same term from the same document are then merged, and the term frequency (the number of occurrences of the term in the document) is recorded
- Instances of the same term are then grouped, and the result is split into a dictionary (of terms) and postings (lists of the documents containing each term)
- The dictionary also records some statistics, such as the number of documents containing each term (the document frequency), and the total number of occurrences
These steps are illustrated in the sketch below.
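
A minimal Python sketch of these steps (the function and variable names are mine); it sorts the (term, docID) pairs, merges duplicates while counting term frequency, and splits the result into a dictionary and postings:

from collections import defaultdict

def build_index(token_pairs):
    # token_pairs: list of (term, docID) for all documents
    postings = defaultdict(list)     # term -> list of docIDs containing it
    tf = defaultdict(int)            # (term, docID) -> term frequency in doc
    for term, doc in sorted(token_pairs):        # core step: sort by term
        tf[(term, doc)] += 1                     # count occurrences in doc
        if not postings[term] or postings[term][-1] != doc:
            postings[term].append(doc)           # merge repeats of same doc
    # dictionary statistics: document frequency per term
    dictionary = {term: len(docs) for term, docs in postings.items()}
    return dictionary, dict(postings), dict(tf)

pairs = [("i", 1), ("did", 1), ("enact", 1), ("i", 1), ("so", 2), ("did", 2)]
dictionary, postings, tf = build_index(pairs)
print(postings["did"])    # [1, 2]
print(dictionary["did"])  # 2 (document frequency)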

7 Example
[Figure: the inverted file built from Doc 1 and Doc 2]

8 Data structures for postings lists
Fixed-length arrays
- Each postings list is a fixed-length array
- The arrays may not be fully utilized
- Reading a postings list is time-efficient, as it is stored in contiguous memory locations

9 Data structures for postings lists
Singly linked lists
- Each postings list is a singly linked list
- No empty entries, but there is the overhead of the pointers (see the sketch below)
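
For contrast, a hypothetical singly linked postings list in Python; note the extra pointer stored per posting:

class PostingNode:
    # One node per docID plus a 'next' pointer (the space overhead).
    def __init__(self, doc_id, nxt=None):
        self.doc_id = doc_id
        self.next = nxt

def to_list(node):
    # Traversal follows pointers, so entries need not be contiguous.
    out = []
    while node is not None:
        out.append(node.doc_id)
        node = node.next
    return out

# Postings list for "brutus": doc 1 -> doc 2
brutus = PostingNode(1, PostingNode(2))
print(to_list(brutus))   # [1, 2]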

10 Methods to index the terms
Various approaches include:
- External sort and merge
- In-memory indexing based on hashing, followed by merging
- Distributed indexing: Google (for example) processes documents using MapReduce

11 Hardware constraint
The indexing algorithm is governed by hardware constraints. Characteristics of computer hardware:
- Access to data in memory is much faster than access to data on disk
- It takes a few clock cycles to access a byte in memory, but much longer to transfer it from disk
- We therefore want to keep as much data as possible in memory, especially the data that we need to access frequently

12 Index construction with disk
The list of tokens may be too large to be stored and sorted in memory. External sorting algorithms minimize the number of random disk seeks during sorting.
Blocked sort-based indexing (BSBI):
- segments the collection into parts of equal size (Step 4 of the pseudo code)
- constructs the intermediate inverted file for each part in memory (Step 5 of the pseudo code); this step is the same as when the list of tokens fits in memory, where the inverted file is constructed entirely in memory
- stores the intermediate inverted files on disk (Step 6 of the pseudo code)
- merges all intermediate inverted files into the final index (Step 7 of the pseudo code)

13 BSBI pseudo code
BSBIndexConstruction()
1  n ← 0
2  while (all documents have not been processed)
3    do n ← n + 1
4       block ← ParseNextBlock()
5       BSBI-Invert(block)
6       WriteBlockToDisk(block, f_n)
7  MergeBlocks(f_1, ..., f_n; f_merged)
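
A hedged Python sketch of the same loop (my own; the slides leave ParseNextBlock and the disk layer abstract). Each block is inverted in memory as a sorted run, and heapq.merge stands in for the on-disk k-way merge:

import heapq
import itertools

def bsbi(token_stream, block_size):
    runs = []                                    # stand-ins for files f_1..f_n
    stream = iter(token_stream)
    while True:
        block = list(itertools.islice(stream, block_size))  # ParseNextBlock()
        if not block:
            break
        runs.append(sorted(block))               # BSBI-Invert + WriteBlockToDisk
    # MergeBlocks(f_1, ..., f_n; f_merged): k-way merge of the sorted runs
    return list(heapq.merge(*runs))

pairs = [("so", 2), ("did", 1), ("enact", 1), ("let", 2)]
print(bsbi(pairs, block_size=2))   # sorted (term, docID) pairs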

14 Merging intermediate inverted files
Two blocks ("posting lists to be merged") are loaded from disk into memory, merged in memory ("merged posting lists"), and written back to disk:

Block 1 (disk):          Block 2 (disk):
brutus: d1,d3            brutus: d6,d7
caesar: d1,d2,d4         caesar: d8,d9
noble: d5                julius: d10
with: d1,d2,d3,d5        killed: d8

Merged (disk):
brutus: d1,d3,d6,d7
caesar: d1,d2,d4,d8,d9
julius: d10
killed: d8
noble: d5
with: d1,d2,d3,d5

15 Single-pass in-memory indexing (SPIMI)
- No sorting of tokens is required; tokens are processed one by one in memory
- When a term occurs for the first time, it is added to the dictionary (implemented as a hash table) and a new postings list is created
- Otherwise, find the corresponding postings list and add the docID to it
- The process continues until the memory is full; the dictionary is then sorted and written to disk
Note: the dictionary is much shorter than the complete list of all tokens which occur (or that's the idea).

16 Hash table
- A data structure that supports the operations Lookup and Insert (and possibly Delete) in expected constant time
- Can be viewed as a table of data: each term is stored in one of the entries of the table
- A hash function determines which term is stored in which table entry; typically, the hash function maps a string (an index term, or key) to an integer (a table entry)
More details on slide 37.

17 Example
[Hash table diagram from Wikipedia]

18 SPIMI pseudo code
SPIMI-Invert(token_stream)
  output_file = NewFile()
  dictionary = NewHash()
  while (free memory available) do
    token ← next(token_stream)
    if term(token) ∉ dictionary then
      postings_list = AddToDictionary(dictionary, term(token))
    else
      postings_list = GetPostingsList(dictionary, term(token))
    if full(postings_list) then
      postings_list = DoublePostingsList(dictionary, term(token))
    AddToPostingsList(postings_list, docID(token))
  sorted_terms ← SortTerms(dictionary)
  WriteBlockToDisk(sorted_terms, dictionary, output_file)
  return output_file

The pseudo code only shows how an intermediate inverted file is constructed; the final merging of the inverted files is the same as in BSBI. A Python sketch follows.
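
A minimal Python rendering of SPIMI-Invert (my own names; Python lists grow automatically, so the explicit postings-list doubling disappears):

def spimi_invert(token_stream, memory_limit):
    dictionary = {}                              # hash table: term -> postings
    for processed, (term, doc_id) in enumerate(token_stream):
        if processed == memory_limit:            # "free memory" exhausted
            break
        # new term: add to dictionary; known term: fetch its postings list
        dictionary.setdefault(term, []).append(doc_id)
    # sort the dictionary (much smaller than the token stream) and "write"
    return sorted(dictionary.items())

tokens = [("did", 1), ("enact", 1), ("so", 2), ("did", 2)]
print(spimi_invert(tokens, memory_limit=100))
# [('did', [1, 2]), ('enact', [1]), ('so', [2])]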

19 Example
List of tokens (term, docID): (did, 1), (enact, 1), (julius, 1), (caesar, 1), (so, 2), (did, 2), (it, 2), (the, 3), (you, 3), (hold, 3), ...
- The tokens are read one by one and inserted into the hash table in main memory until the memory is full
- The entries in the hash table are then sorted and written to disk as an inverted file

Hash table (main memory):    Inverted file (disk):
caesar: 1                    caesar: 1
enact: 1                     did: 1,2
so: 2                        enact: 1
julius: 1                    julius: 1
did: 1,2                     so: 2

20 Distributed indexing
Perform indexing on a large computer cluster:
- A computer cluster is a group of tightly coupled computers that work closely together
- The group may consist of hundreds or thousands of nodes (computers), and individual nodes can fail at any time
- The result of the construction process is a distributed index that is partitioned across several machines, either according to term or according to document
We focus on the term-partitioned index.

21 Distributed indexing
MapReduce: a general architecture for distributed computing. A master node (computer) directs the process of:
- dividing the work up into small tasks
- assigning the tasks to individual nodes
- re-assigning tasks in case of node failure

22 Distributed indexing
- The master node breaks the input documents into splits; each split is a subset of documents (corresponding to the partitions of the list of tokens made in BSBI/SPIMI)
- There are two sets of tasks: parsers and inverters

23 Parsers
- The master assigns a split to an idle parser node
- The parser reads one document at a time and produces (term, doc) pairs
- The parser writes the pairs into j partitions for passing on to the inverters; each partition covers a range of terms' first letters, e.g., a-f, g-p, q-z (here j = 3)

24 Inverters
To complete the index inversion:
- Parsers pass the term partitions to the inverters (or can send the (term, doc) pairs one at a time)
- An inverter collects all (term, doc) pairs (= postings) for its term partition
- It then sorts them and writes the postings lists (see the sketch below)
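
A single-machine imitation of the parser/inverter pipeline (illustrative only; real MapReduce runs these on separate nodes), using the j = 3 letter ranges from the previous slide:

def partition_of(term):
    # j = 3 partitions by the term's first letter: a-f, g-p, q-z
    first = term[0]
    return "a-f" if first <= "f" else ("g-p" if first <= "p" else "q-z")

def parse(split):
    # Parser: emit (term, doc) pairs into the j partitions.
    partitions = {"a-f": [], "g-p": [], "q-z": []}
    for doc_id, terms in split:
        for term in terms:
            partitions[partition_of(term)].append((term, doc_id))
    return partitions

def invert(pairs):
    # Inverter: collect, sort, and group the pairs of one partition.
    postings = {}
    for term, doc_id in sorted(pairs):
        postings.setdefault(term, []).append(doc_id)
    return postings

split = [(1, ["did", "enact"]), (2, ["so", "did"])]
parts = parse(split)
print(invert(parts["a-f"]))   # {'did': [1, 2], 'enact': [1]}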

25 Data flow
[Diagram: the master assigns splits of documents to parsers and partitions to inverters; each parser writes its (term, doc) pairs into the a-f, g-p, and q-z partitions, and each inverter collects one partition (a-f, g-p, or q-z) from all parsers and writes its postings]

26 Dynamic indexing
Up to now, we have assumed that collections are static. They rarely are:
- New documents come in over time and need to be inserted
- Documents are deleted and modified
This means that the dictionary and the postings have to be modified:
- Postings updates for terms already in the dictionary
- New terms added to the dictionary

27 Simplest approach: block update
- Maintain a "big" main index on disk
- New documents go into a "small" auxiliary index in memory
- Merge the auxiliary index and the main index when the auxiliary index grows bigger than a threshold value
- Assume that the threshold value for refreshing the auxiliary index is a large constant n

28 Example
Suppose the symbol ⊕ represents the merge operation.

Merge       Auxiliary index   Main index        New main index after merging
1st merge   n postings        0 postings        n postings
2nd merge   n postings        n postings        2n postings
3rd merge   n postings        2n postings       3n postings
...
k-th merge  n postings        (k-1)n postings   kn postings

29 Time complexity
- To process T = kn items uses k = T/n merges
- Merging two sorted lists of sizes n and Jn takes O(n + Jn) = O(Jn) time
- The process of building a main index with T postings needs the merges J = 1, ..., T/n, so it takes O(1n + 2n + 3n + ... + (T/n)n) = O(T²/n) time (worked out below)
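
Spelling out the sum behind the last step (standard arithmetic, in LaTeX notation):

$$\sum_{J=1}^{T/n} O(Jn) \;=\; O\Big(n \sum_{J=1}^{T/n} J\Big) \;=\; O\Big(n \cdot \frac{(T/n)(T/n+1)}{2}\Big) \;=\; O\big(T^2/n\big).$$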

30 Logarithmic merge
Basic idea: don't merge the auxiliary and main index directly. This speeds up merging and index construction in dynamic indexing.
- Maintain a series of indexes, each twice as large as the previous one
- Keep the smallest index (Z0) in memory
- Keep the larger indices (I0, I1, ...) on disk, with sizes doubling: I0 of size n, I1 of size 2n, I2 of size 4n, and so on
The scheme for merging:
- If Z0 gets too big (≥ n), write it to disk as I0, or merge it with I0 (if I0 already exists) to form Z1
- Either write Z1 to disk as I1 (if there is no I1) or merge it with I1 to form Z2, and so on

31 Pseudo code of logarithmic merging
LMergeAddToken(indexes, Z0, token)
  Z0 ← Merge(Z0, {token})
  if |Z0| = n then
    for i ← 0 to ∞ do
      if Ii ∈ indexes then
        Zi+1 ← Merge(Ii, Zi)   (Zi+1 is a temporary index on disk)
        indexes ← indexes − {Ii}
      else
        Ii ← Zi   (Zi becomes the permanent index Ii)
        indexes ← indexes ∪ {Ii}
        break
    Z0 ← ∅

LogarithmicMerge()
  Z0 ← ∅   (Z0 is the in-memory index)
  indexes ← ∅
  while true do
    LMergeAddToken(indexes, Z0, GetNextToken())

A runnable sketch follows.
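
A Python sketch of logarithmic merging (my own; the "disk" indexes are plain sorted lists, and the index at level i holds 2^i · n postings):

def l_merge_add_token(indexes, z0, token, n):
    # indexes: dict mapping level i -> on-disk index I_i (a sorted list)
    z0.append(token)                 # Z0 <- Merge(Z0, {token})
    if len(z0) < n:
        return z0
    z = sorted(z0)                   # Z0 is full: start carrying upward
    level = 0
    while level in indexes:          # I_i exists: Z_{i+1} <- Merge(I_i, Z_i)
        z = sorted(indexes.pop(level) + z)
        level += 1
    indexes[level] = z               # Z_i becomes the permanent index I_i
    return []                        # fresh empty Z0

indexes, z0 = {}, []
for token in ["f", "e", "d", "c", "b", "a"]:
    z0 = l_merge_add_token(indexes, z0, token, n=2)
print(indexes)   # {1: ['c', 'd', 'e', 'f'], 0: ['a', 'b']}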

32 Example
The symbol ⊕ represents the merge operation.

Event                     Actions taken                                              Indexes on disk
1st time when |Z0| = n    I0 ← Z0; Z0 ← ∅                                            I0
2nd time when |Z0| = n    Z1 ← I0 ⊕ Z0; I1 ← Z1; remove I0; Z0 ← ∅                   I1
3rd time when |Z0| = n    I0 ← Z0; Z0 ← ∅                                            I0, I1
4th time when |Z0| = n    Z1 ← I0 ⊕ Z0; Z2 ← I1 ⊕ Z1; I2 ← Z2; remove I0, I1; Z0 ← ∅  I2
5th time when |Z0| = n    ...                                                        I0, I2
6th time when |Z0| = n    ...                                                        I1, I2
7th time when |Z0| = n    ...                                                        I0, I1, I2
k-th time                 the indexes present are the set bits in the binary representation of k

33 Time complexity
- Size doubling: for T postings, the series of indexes consists of at most log T indexes I0, I1, I2, ..., I(log T). Why? We need k = log(T/n) levels for (2^k)n = T items
- To build a main index with T postings, the overall construction time is O(T log T): each posting is processed (i.e., merged) only once on each of the (at most) log T levels. Why? Whenever a merge occurs, an item moves up a level
- So logarithmic merge is more efficient for index construction than block update, as T log T < T²

34 Searching the Index

35 Search structures for dictionaries
- Given a keyword of a query, determine whether the keyword exists in the vocabulary (dictionary); if so, identify the pointer to the corresponding postings list
- If no search structure exists, we have to check the terms of the dictionary one by one until a match is found or all terms are exhausted; this takes O(n) time, where n is the number of terms in the dictionary
- Search structures help speed up this vocabulary lookup operation

36 Search structures for dictionaries
Two main choices:
- Hash table (introduced on slide 16)
- Search tree
Factors affecting the choice:
- How many terms are we likely to have?
- Is the number likely to remain static, or change a lot?
- Are we likely to only have new terms inserted, or also to have some terms in the dictionary deleted?
- What are the relative frequencies with which various terms will be accessed?
Generally speaking, a hash table is preferable for more static data, while a search tree handles dynamic data more efficiently.

37 Hash table
- A hash table is an array with a hash function and collision management
- It is mainly operated by the hash function, which determines where to find (or insert) a term
- The hash function maps a term to an integer between 0 and N-1, where N is the number of entries of the hash table
- Hashing is reproducible randomness: it looks like a term is mapped to a random array index, but every time we map the term we get the same index

38 An example of hash function
- Suppose the dictionary consists of terms that are composed of lower-case letters or white-space only, and each term consists of at most 20 characters
- Let f() be a function that maps white-space to 0, 'a' to 1, 'b' to 2, ..., 'z' to 26
- Let N be a large prime number
- The hash function F(word) can then be defined as
  F(word) = [ f(1st character) + f(2nd character)·26 + f(3rd character)·26² + f(4th character)·26³ + ... ] mod N
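
A direct transcription in Python (my code; the slides' F is defined only up to the pattern shown):

def f(ch):
    # white-space -> 0, 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
    return 0 if ch == " " else ord(ch) - ord("a") + 1

def F(word, N):
    # F(word) = [ f(c1) + f(c2)*26 + f(c3)*26^2 + ... ] mod N
    return sum(f(ch) * 26 ** i for i, ch in enumerate(word)) % N

print(F("caesar", 13))   # 3 (matches the worked example on the next slide)
print(F("enact", 13))    # 5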

39 Hash function
Suppose N = 13. Powers of 26: 1, 26, 676, 17576, 456976, 11881376.
For the term 'caesar':
F('caesar') = (3 + 1·26 + 5·26² + 19·26³ + 1·26⁴ + 18·26⁵) mod 13 = 214659097 mod 13 = 3
For the term 'enact':
F('enact') = (5 + 14·26 + 1·26² + 3·26³ + 20·26⁴) mod 13 = 9193293 mod 13 = 5
Exercise: the, let, it, best

Entry   Term
0
1
2
3       caesar
4       did
5       enact
6       so
7       the
8
9       i
10      julius
11      killed
12

40 Collision
A collision occurs when two different terms are mapped to the same entry. For example, for the term 'was':
F('was') = (23 + 1·26 + 19·26²) mod 13 = 12893 mod 13 = 10
so 'was' is mapped to the same entry as 'julius' (entry 10 in the table on the previous slide). Collisions can be resolved by auxiliary structures, a secondary hash function, or rehashing; chaining with an auxiliary list per entry is sketched below.
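
A small sketch of chaining (one common resolution; the slides leave the choice open), reusing the hash function from slide 38:

def f(ch):
    return 0 if ch == " " else ord(ch) - ord("a") + 1

def F(word, N=13):
    return sum(f(ch) * 26 ** i for i, ch in enumerate(word)) % N

table = [[] for _ in range(13)]      # one bucket (auxiliary list) per entry

def insert(term):
    table[F(term)].append(term)      # colliding terms share a bucket

def lookup(term):
    return term in table[F(term)]    # scan only the bucket, not the table

insert("julius")
insert("was")                        # both hash to entry 10
print(table[10], lookup("was"))      # ['julius', 'was'] True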

41 Search tree
Two kinds: the binary search tree and the B-tree.
- The terms are in sorted order in the in-order traversal of the tree
- The binary search tree is only practical for in-memory operations
(Read for interest only)

42 Binary search tree (BST)
- Binary tree: a tree with every node having at most two children
- Binary search tree: every node is associated with a key (term) such that
  - the term associated with the left child is lexicographically smaller than that of the parent node, and
  - the term associated with the parent is lexicographically smaller than that of the right child
- E.g., a node 'did' with left child 'caesar' and right child 'enact'

43 Example
Note: a posting lists the documents containing the term.
[Figure: a binary search tree over the vocabulary caesar, did, enact, i, julius, killed, so, the, with each term pointing to its postings list of document IDs (1, 2, or both)]

44 Searching in BST
- Starting from the root node, the search proceeds into one of the two subtrees below by comparing the term you are searching for with the term associated with the current node
- The search stops when a match is found or a leaf node is reached
- The search (or lookup) operation takes O(log T) time, where T is the number of terms, provided that the BST is balanced
- A balance criterion is, e.g., that the numbers of terms under the two subtrees of any node are either equal or differ by 1
A minimal lookup sketch is given below.
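
A minimal BST lookup in Python (node layout assumed, not from the slides):

class Node:
    def __init__(self, term, postings, left=None, right=None):
        self.term, self.postings = term, postings
        self.left, self.right = left, right

def search(node, term):
    # Compare lexicographically at each node and descend left or right.
    while node is not None:
        if term == node.term:
            return node.postings          # match found
        node = node.left if term < node.term else node.right
    return None                           # reached a leaf: term absent

# The small tree from slide 42: 'did' with children 'caesar' and 'enact'.
root = Node("did", [1, 2], Node("caesar", [1, 2]), Node("enact", [1]))
print(search(root, "enact"))              # [1]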

45 B-tree
- The number of subtrees under an internal node varies in a fixed interval [a, b], where a ≤ b are positive integers
- The number of terms associated with an internal node, except the root, is between a-1 and b-1
- Can be viewed as "collapsing" multiple levels of the binary tree into one
- Good for the case where the dictionary is disk-resident, in which case this collapsing serves the function of prefetching imminent binary tests
- The integers a and b are determined by the sizes of disk blocks

46 Example
[Figure: a B-tree with a = 2 and b = 4; the root holds the terms 'capitol' and 'hath', and its children hold the remaining vocabulary (ambitious, be, brutus, caesar, did, enact, I, i', it, julius, killed, let, me), each term pointing to its postings list of document IDs]

47 Reminder
Doc 1: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

