
1 Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata

2 Hardware parameters
Caching
 Accessing data in memory is much faster than accessing it on disk (no seek time, lower per-byte cost)
 So keep as much information as possible in main memory

Parameter                                    HDD
Average seek time                            5 × 10⁻³ s (5 ms)
Average time to read a byte from disk        2 × 10⁻⁸ s (50 MB/s)
Average time to access a byte in memory      5 × 10⁻⁹ s (5 ns)
Processor's clock rate                       10⁹ per second
Low-level operation (compare, swap a word)   10⁻⁸ s (0.01 μs)

3 Hardware parameters
Reading from disk
 About 10 MB of data in one contiguous chunk on disk
– One seek (~5 ms) + read (~0.2 s) ≈ 0.2 s
 About 10 MB of data in 1000 non-contiguous chunks on disk
– 1000 seeks (~5 s) + read (~0.2 s) ≈ 5.2 s
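The arithmetic on this slide can be checked with a short script using the constants from the parameter table:

```python
# Reading 10 MB from an HDD: contiguous vs. fragmented layout.
SEEK_TIME = 5e-3         # average seek: 5 ms
BYTE_TRANSFER = 2e-8     # 2 x 10^-8 s per byte, i.e. 50 MB/s
DATA_BYTES = 10 * 10**6  # 10 MB

# One contiguous chunk: a single seek, then one sequential read.
contiguous = 1 * SEEK_TIME + DATA_BYTES * BYTE_TRANSFER
# 1000 scattered chunks: a seek per chunk, same total read time.
fragmented = 1000 * SEEK_TIME + DATA_BYTES * BYTE_TRANSFER

print(f"contiguous: {contiguous:.3f} s")   # 0.205 s
print(f"fragmented: {fragmented:.3f} s")   # 5.200 s
```

The read time is identical in both cases; the 25× slowdown comes entirely from the seeks.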

4 Hardware parameters
 The OS reads and writes data in blocks
– Block sizes may be 8, 16, 32 or 64 KB
– Reading 1 byte takes the same time as reading the entire block
– Buffer: the part of main memory where a block is stored after reading or while writing

5 Hardware parameters
 Simultaneous reading and processing
– Disk → main memory data transfer is done by the system bus
– The processor is free and can work while data is being read
– So the system can read data and decompress it at the same time

6 Hardware parameters
 The parameters change a lot for SSDs
– Seek: about 50 times faster than HDD
– Read: about 4 times faster than HDD

Parameter                               HDD                    SSD
Average seek time                       5 × 10⁻³ s (5 ms)      10⁻⁴ s (0.1 ms)
Average time to read a byte from disk   2 × 10⁻⁸ s (50 MB/s)   5 × 10⁻⁹ s (200 MB/s)

7 Blocked sort-based indexing
Step 1: partition the collection into segments
Step 2: make a pass over all documents, writing out ⟨term, docID⟩ pairs for each segment
Step 3: separately in each segment, sort the pairs by term (may require an external-memory sort)
Step 4: separately in each segment, create the posting lists for each term
Step 5: for each term, merge the posting lists from all segments into one single posting list
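The five steps can be sketched in a few lines of Python. This is only an illustrative in-memory sketch: the whitespace tokenizer, the tiny segment size, and keeping the sorted runs in Python lists are all assumptions; a real implementation maps terms to termIDs and sorts each segment on disk.

```python
from itertools import groupby
from heapq import merge

def bsbi(docs, segment_size=2):
    """docs: dict mapping docID -> text. Returns term -> posting list."""
    segments = []
    doc_items = list(docs.items())
    # Steps 1-2: partition into segments and emit (term, docID) pairs.
    for i in range(0, len(doc_items), segment_size):
        pairs = [(term, doc_id)
                 for doc_id, text in doc_items[i:i + segment_size]
                 for term in text.split()]
        # Step 3: sort each segment's pairs by term
        # (an external sort on disk in a real system).
        pairs.sort()
        segments.append(pairs)
    # Step 4 is implicit: in a sorted run, each term's postings are adjacent.
    # Step 5: merge the sorted runs and build one posting list per term.
    index = {}
    for term, group in groupby(merge(*segments), key=lambda p: p[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in sales"}
print(bsbi(docs)["sales"])  # [1, 2, 3]
```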

8 Single-pass in-memory indexing
Step 1: keep parsing documents, converting them to ⟨term, docID⟩ pairs
Step 2: invert the pairs in memory to build posting lists; these are immediately available for searching as well
Step 3: when memory is exhausted, sort the posting lists and the dictionary, and write the whole block index (dictionary and posting lists) to disk
Step 4: merge the index files of the different blocks into the final index
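A minimal SPIMI-style sketch of these steps follows. The memory limit counted in postings and the Python list standing in for per-block files on disk are illustrative assumptions:

```python
def spimi(token_stream, memory_limit=4):
    """token_stream: iterable of (term, docID) pairs in document order."""
    blocks = []          # stands in for per-block index files on disk
    block = {}           # Step 2: posting lists built directly in memory
    n_postings = 0
    for term, doc_id in token_stream:          # Step 1: parsed pairs
        postings = block.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            n_postings += 1
        if n_postings >= memory_limit:         # Step 3: memory exhausted,
            blocks.append(dict(sorted(block.items())))  # sort and "write out"
            block, n_postings = {}, 0
    if block:
        blocks.append(dict(sorted(block.items())))
    final = {}                                 # Step 4: merge the blocks
    for b in blocks:
        for term, postings in b.items():
            final.setdefault(term, []).extend(postings)
    return {t: sorted(set(p)) for t, p in sorted(final.items())}

stream = [("home", 1), ("sales", 1), ("home", 2), ("rise", 2), ("sales", 3)]
print(spimi(stream)["home"])  # [1, 2]
```

Unlike the blocked sort-based approach, no global sort of pairs is needed: postings go straight into per-term lists, and only the dictionary is sorted when a block is written out.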

9 Dynamic indexing
 For fast access, posting lists are written in contiguous blocks
 Changes to an existing index therefore require a lot of moving of data
 Approach: create an auxiliary index for incremental changes, and keep growing that index
– Searches are performed on both the old and the auxiliary index, and the results are merged
– When the auxiliary index grows significantly large, merge it into the original one (a costly operation, but not done often)
 Ease of merging
– If each posting list were its own file, merging would be easy
– But that is a bad idea: the OS cannot handle that many files (one per term) well
– Tradeoff between the number of files and the number of posting lists per file: keep several posting lists in one file, but not all in one file
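The auxiliary-index idea can be sketched as follows; the class name, the postings-count merge threshold, and the dict-based indices are illustrative assumptions (real indices live on disk):

```python
class DynamicIndex:
    def __init__(self, merge_threshold=1000):
        self.main = {}        # large index, rarely rewritten
        self.aux = {}         # small index for incremental changes
        self.aux_size = 0
        self.merge_threshold = merge_threshold

    def add(self, doc_id, text):
        # New documents go into the auxiliary index only.
        for term in set(text.split()):
            self.aux.setdefault(term, []).append(doc_id)
            self.aux_size += 1
        if self.aux_size >= self.merge_threshold:
            self._merge()     # costly, but done infrequently

    def search(self, term):
        # Query both indices and merge the result lists.
        return sorted(set(self.main.get(term, []) + self.aux.get(term, [])))

    def _merge(self):
        for term, postings in self.aux.items():
            self.main.setdefault(term, []).extend(postings)
        self.aux, self.aux_size = {}, 0

idx = DynamicIndex()
idx.add(1, "index compression")
idx.add(2, "dynamic index")
print(idx.search("index"))  # [1, 2]
```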

10 Security
Search in enterprise data
 Not all users may be allowed to view all documents
 Approach 1: use access control lists
– For each user / group of users, maintain a posting list of the documents the user / group CAN access
– Intersect that list with the search result list
– Problem: difficult to maintain when access rights change
– Access control lists may be very large
 Approach 2: check at query time
– From the retrieved results, filter out those the user is not allowed to access
– May slow down retrieval
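Approach 1 reduces to a merge-style intersection of two sorted docID lists, the same primitive used for intersecting posting lists; the list names below are illustrative:

```python
def intersect(results, acl):
    """Linear-time intersection of two sorted docID lists."""
    i, j, out = 0, 0, []
    while i < len(results) and j < len(acl):
        if results[i] == acl[j]:
            out.append(results[i])
            i += 1
            j += 1
        elif results[i] < acl[j]:
            i += 1
        else:
            j += 1
    return out

search_results = [3, 7, 12, 19, 42]
user_can_access = [1, 3, 12, 42, 57]   # the user's access control list
print(intersect(search_results, user_can_access))  # [3, 12, 42]
```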

11 Index Compression Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata

12 Data compression
Represent the information using (hopefully) fewer bits than the original representation
 Lossy: may lose some "information"
– Consequently, we may not be able to get back the original
 Lossless: just a different representation; we can "decompress" and get back the original exactly
– We will consider only lossless compression here
Question: can a lossless compression algorithm "compress" every possible input?

13 Index compression
 Use less space to store the index
– On disk or in memory
– A saving is a saving either way
 Dictionary compression
– May allow storing the full dictionary, or most of it, in memory
 Posting list compression
– Assumption: the posting lists cannot be stored in memory
– Compressed lists transfer from disk faster
– Typically an index can be compressed by a factor of about 4
 Decompression must be fast
– Without compression: transfer 4x bytes from disk
– With compression: transfer x bytes and decompress

14 The posting lists
 For a "small" dataset, 4 bytes (or even fewer) per doc ID suffice
 For a "very large" dataset, ~8 bytes per doc ID may be needed
 Observation 1: the docIDs are large integers, but the gaps between consecutive docIDs are not so large!
– Idea: store the first doc ID, and then store only the gaps (gap encoding)
 Observation 2(a): for some posting lists the gaps are very small
– Very frequent terms, e.g. the: … 45, 46, 47, …
 Observation 2(b): for some posting lists the gaps can be large as well
– Very infrequent terms, e.g. a rare name such as sankhadeep
 Some posting lists (e.g. obama) may have some large and some small gaps
– A fixed size for the gaps would not work!
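Gap encoding itself is a two-line transformation; the docIDs below are the ones used in the variable byte example on the next slide:

```python
def to_gaps(doc_ids):
    # Store the first docID, then only the differences between neighbours.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    # Decoding is a running sum over the gaps.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [824, 829, 215406]
gaps = to_gaps(postings)
print(gaps)             # [824, 5, 214577]
print(from_gaps(gaps))  # [824, 829, 215406]
```

The gaps are small exactly when the term is frequent, which is what makes the variable-length codes on the following slides effective.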

15 Gap encoding: variable byte encoding
 Approach: store the first doc ID, then store the gaps
 Use 1-byte units
 In each byte, the first bit is an indicator and the next 7 bits are payload (part of the number)
 The indicator bit is 1 if this is the last byte of a number, 0 otherwise

doc IDs     824                 829        215406
ID, gaps    824                 5          214577
encoding    00000110 10111000   10000101   00001101 00001100 10110001

824 (= 512 + 256 + 32 + 16 + 8) is encoded as the payloads 0000110 0111000, i.e. the two bytes without their indicator bits. The indicator bit of the first byte is 0 because it is not the last byte. Some gaps take more bytes, some take fewer.
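The scheme on this slide (indicator bit 1 marks the last byte of a number) can be implemented directly:

```python
def vb_encode_number(n):
    # Split n into 7-bit payload groups, low-order group last;
    # set the high bit of the final byte as the "last byte" indicator.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return bytes(out)

def vb_decode(data):
    # A byte with high bit 0 continues the current number;
    # a byte with high bit 1 finishes it.
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

code = b"".join(vb_encode_number(g) for g in [824, 5, 214577])
print([f"{b:08b}" for b in code])  # matches the bytes in the table above
print(vb_decode(code))             # [824, 5, 214577]
```

Because each number ends on a byte boundary, decoding needs no bit manipulation, which is why variable byte decompression is fast.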

16 Elias γ encoding
 Unary code: n → n 1s followed by a 0
– For example, 3 → 1110
– Some conventions use n 0s followed by a 1 (equivalent)
 γ encoding (due to Peter Elias)
– Consider the binary representation of n; it starts with a 1, so chop that bit off (we know it is there, hence it is redundant)
– Example: 9 → 1001 → 001 (call this part the offset)
– Encode the length of the offset in unary
– Example: the offset of 9 has length 3; in unary: 1110
– γ encoding of 9 → 1110 001 = 1110001 (the length of the offset in unary, followed by the offset)
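Both directions of γ coding fit in a few lines (bit strings are used here for readability; a real codec would pack bits):

```python
def gamma_encode(n):
    # n >= 1: the offset is n's binary form without its leading 1;
    # prefix it with the offset's length in unary (k ones, then a zero).
    offset = bin(n)[3:]            # strip '0b' and the leading 1 bit
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    k = bits.index("0")            # unary part: k ones, then a zero
    offset = bits[k + 1:k + 1 + k]
    return int("1" + offset, 2)    # re-attach the implicit leading 1

print(gamma_encode(9))             # 1110001
print(gamma_decode("1110001"))     # 9
print(gamma_encode(1))             # 0  (empty offset, unary length 0)
```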

17 Elias γ encoding
 Better than variable byte encoding because γ coding works at the bit level (empirically about 15% better on posting lists)
 Decompression is less efficient than for variable byte encoding
 Observation: the unary encoding of the length may itself use a lot of space
 Solution: encode length + 1 using γ encoding again (Elias δ encoding)
– More effective for large numbers
– Example: 509 → 11111111011111101 (γ, 17 bits) → 111000111111101 (δ, 15 bits)
 Question: why length + 1?
 Omega encoding: recursively apply γ encoding to the length
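In δ encoding, length + 1 is exactly the total number of bits of n, and the + 1 is needed because γ has no codeword for 0 (the offset of n = 1 is empty). A sketch reusing the γ encoder from the previous slide:

```python
def gamma_encode(n):
    offset = bin(n)[3:]
    return "1" * len(offset) + "0" + offset

def delta_encode(n):
    # Gamma-encode the bit length of n (= offset length + 1),
    # then append the offset itself.
    offset = bin(n)[3:]
    return gamma_encode(len(offset) + 1) + offset

print(gamma_encode(509))  # 11111111011111101  (17 bits)
print(delta_encode(509))  # 111000111111101    (15 bits)
```

For 509 the unary length prefix of γ costs 9 bits, while δ spends only 7 bits (γ of 9) on the same information; the advantage grows with the magnitude of the number.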

18 References
 Primarily: the IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/

