Information Retrieval Space occupancy evaluation.


Information Retrieval Space occupancy evaluation

Storage analysis
First we will consider the space for the postings; recall that access to them is sequential.
Then we will do the same for the dictionary; recall that access to it is random.
Finally we will analyze the storage of the documents. Random access is crucial here for "snippet retrieval".

Information Retrieval Postings storage

Recall that… [figure: an inverted index with postings lists for the terms Brutus, the, Calpurnia]

Postings: two conflicting forces
A term like Calpurnia occurs in maybe one doc out of a million; hence we would like to store each docID pointer using log2(#docs) bits.
A term like the occurs in virtually every doc, so that number of bits per occurrence is too much: we would prefer a 0/1 bit vector in this case.

Gap-coding for postings
Store the gaps between consecutive docIDs:
Brutus: 33, 47, 154, 159, 202, … becomes 33, 14, 107, 5, 43, …
Advantages: we store smaller integers, which get smaller and smaller if the docIDs are clustered.
How much is the saving with γ-encoding?
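A minimal sketch of the gap transformation described above (the function names are illustrative, not from the slides):

```python
def to_gaps(docids):
    """Convert a sorted postings list of docIDs into gaps.

    The first docID is kept as-is; every other entry becomes the
    difference from its predecessor."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Rebuild the original docIDs by prefix-summing the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

On the Brutus list, `to_gaps([33, 47, 154, 159, 202])` yields `[33, 14, 107, 5, 43]`, exactly the gaps shown on the slide.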

γ code for integer encoding
For x > 0, Length(x) = ⌊log2 x⌋ + 1 is the number of binary digits of x.
The γ code writes Length − 1 in unary (as zeros), followed by the binary digits of x; e.g., 9 (binary 1001) is represented as 000 1001.
The γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence:
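A short sketch of the γ encoder and decoder, assuming the standard Elias-γ convention used above (Length − 1 leading zeros, then the binary digits); the function names are illustrative:

```python
def gamma_encode(x):
    """Elias gamma code: (Length-1) zeros, then the binary digits of x."""
    assert x > 0
    b = bin(x)[2:]                 # binary digits, leading 1 included
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers.

    Works because the code is prefix-free: the run of leading zeros
    tells us exactly how many bits the binary part occupies."""
    out, i = [], 0
    while i < len(bits):
        j = i
        while bits[j] == "0":      # count leading zeros = Length - 1
            j += 1
        n = j - i
        out.append(int(bits[j:j + n + 1], 2))
        i = j + n + 1
    return out
```

For example, `gamma_encode(9)` gives `"0001001"` (7 = 2⌊log2 9⌋ + 1 bits), and a concatenation of several codes decodes unambiguously.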

A rough analysis of Zipf's law and γ-coding
Zipf's law: the kth most frequent term occurs ≈ c·n/k times, for a constant c depending on the data collection.
γ-coding the postings of the kth term costs: n/k gaps, using 2⌊log2 k⌋ + 1 bits for each gap.
Since log is concave, the maximum cost is (2n/k)·log2 k + n/k bits.

Sum over k from 1 to m = 500K
Do this by breaking the values of k into groups: group i consists of 2^(i−1) ≤ k < 2^i.
Group i has 2^(i−1) terms in the sum, each contributing ~(2n/k)·log2 k ≈ (2ni)/2^(i−1).
Summing over i from 1 to 19 (500K terms), we get a net estimate of 340 Mbits.
Then add 1 bit per occurrence (there are 1G of them), because of the +1 in the γ length: 1.34 Gbits ≈ 170 MB.
A flat 20-bit coding would have required 20 Gbits (≈ 2.5 GB).
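The slide's estimate can be sanity-checked by summing the per-term γ-cost directly, under the same assumptions (n = 1M docs, m = 500K terms, the kth term appearing in ~n/k docs). The direct sum lands in the same few-hundred-Mbit ballpark as the grouped estimate:

```python
import math

# Parameters from the slide: n = 1M documents, m = 500K terms.
n, m = 1_000_000, 500_000

# Under Zipf, the kth term has ~n/k postings (hence ~n/k gaps);
# each gap costs 2*floor(log2 k) + 1 bits under gamma-coding.
bits = sum((n / k) * (2 * math.floor(math.log2(k)) + 1)
           for k in range(1, m + 1))
print(f"gamma cost of all gaps: ~{bits / 1e6:.0f} Mbits")
```

The exact sum is somewhat below the slide's 340 Mbits because the grouped bound rounds every log2 k in group i up to i.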

δ code for integer encoding
Use γ-coding to reduce the length of the first field: the unary length of the γ code is itself replaced by its γ code.
Useful for medium-sized integers; e.g., 19 (binary 10011, length 5) is represented as γ(5) followed by 0011, i.e., 00101 0011.
δ-coding x takes about ⌊log2 x⌋ + 2⌊log2(⌊log2 x⌋)⌋ + 2 bits.
Optimal for Pr(x) = 1/(2x·(log x)²), and i.i.d. integers.

Variable-byte  codes [10.2 bits per TREC12] Wish to get very fast (de)compress  byte-align Given a binary representation of an integer Append 0s to front, to get a multiple-of-7 number of bits Form groups of 7-bits each Append to the last group the bit 0, and to the other groups the bit 1 (tagging) e.g., v=  binary(v) = Note: We waste 1 bit per byte, and avg 4 for the first byte. But it is a prefix code, and encodes also the value 0 !!

Fixed binary codewords [~8 bits per posting on TREC12]
Get fast (de)compression and reduce space waste: a fixed number of bits (width) for a varying number of items (span).
Example: 38, 17, 13, 34, 6, 2, 1, 3, 1, 2, 3, 1.
A flat binary code needs 6 bits per item, 72 bits in total; the γ-code would need 58 bits.
Greedily split and use fixed-binary codes: <12; (6,4: 38,17,13,34), (3,1: 6), (2,7: 2,1,3,1,2,3,1)>, where each pair is (width = #bits, span = #items).
Which widths and spans are preferable? Every group is forced to fit in one machine word!

Golomb  codes [7.54 bits on TREC12] It is a parametric code: depends on k We set d = 2  log k  +1 - k, hence d may be represented in  log k  bits Quotient q=  (v-1)/k , and the rest is r= v – k * q – 1 In the second case r > d, so that (r+d)/2 > d  Consequently, the first  log k  bits discriminate the case in decompression Useful when integers concentrated around k: k=3, v=9  d=1, Usually k  0.69 * mean(v) [Bernoulli model] Optimal for Pr(x) = p (1-p) x-1, where mean(x)=1/p, and i.i.d ints If k is a power-of-2, then we have the simpler Rice code [q times 0s] 1 If r ≤ d use  log k  bits (enough), else use  log k  bits r ≤ d ? r : (r+d)

PFor and PForDelta
[Figure: decompression speed; thick line: decompression R2c, thin line: decompression R2R]

IL compression
S9 and S16 (Simple-9, Simple-16) stuff integers into a machine word using 9/16 possible configurations.
Storing positions impacts space by a factor of 4, and even more on decompression time because of worse cache use (measured on 1000 queries).

Integer coding vs. postings length
S9 and S16 (Simple-9, Simple-16) stuff integers into a machine word using 9/16 possible configurations.

Information Retrieval Dictionary storage (several incremental solutions)

Recall the Heaps law…
Empirically validated model: V = k·N^b, where b ≈ 0.5, k ≈ 30–100, and N = # tokens.
Some considerations:
V is decreased by case-folding and stemming;
indexing all numbers could make it extremely large (so usually we don't);
spelling errors contribute a fair bit to its size.
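As a worked instance of the formula, here is a small sketch using the fit the IIR book reports for the RCV1 collection (k ≈ 44, b ≈ 0.49; the function name is illustrative):

```python
def heaps_vocab(n_tokens, k=44, b=0.49):
    """Heaps' law estimate of vocabulary size: V = k * N**b.

    Defaults are the RCV1 fit from the IIR book; the slide gives the
    typical ranges k in 30..100 and b ~ 0.5."""
    return k * n_tokens ** b
```

For N = 100M tokens this predicts a vocabulary of roughly 400K terms, which matches the 500K-term dictionary used in the analysis above.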

1st solution: basic idea…
Array of fixed-width entries: 500,000 terms at 28 bytes/term = 14 MB, searched by binary search.
Each entry: 20 bytes for the term, plus 4 bytes each for the frequency and the postings pointer.
Wasteful: the average word is 8 characters.

2nd solution: raw term-sequence
Store the dictionary as one (long) string of characters:
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Each entry keeps a pointer into the string; the pointer to the next word marks the end of the current word, and binary search is done on these pointers.
Hope to save up to 60% of dictionary space.

3rd solution: blocking
Store pointers into the term string only for every kth term. Example below: k = 4; we now need to store term lengths (1 extra byte each):
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
We save 12 bytes on 3 pointers, and lose 4×1 bytes on term lengths: net saving is 8 bytes every 4 dictionary words.

Impact of k on search & space
Searching for a dictionary word: binary search down to a k-term block, then linear search through the terms in the block.
Increasing k slows down the linear scan but reduces the dictionary space occupancy (to ~8 MB here).

4th solution: front coding
Idea: sorted words commonly have a long common prefix; store only the difference with respect to the previous term, within a block of k terms.
8automata8automate9automatic10automation → 8automata 1e 1ic 1on
The first term of each block is stored in full; each number then gives the suffix length to drop from the previous term before appending the new suffix (so "1e" encodes automate: drop 1 char of automata, append "e").
Lucene stores blocks with k = 128.
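The scheme above can be sketched as follows, storing for each term after the first a pair (chars of the previous term to drop, suffix to append); function names are illustrative:

```python
def front_encode(terms):
    """Front coding of a sorted block: first term in full, then for each
    following term the number of trailing characters of the previous
    term to drop and the new suffix to append."""
    blocks = [terms[0]]
    for prev, cur in zip(terms, terms[1:]):
        p = 0                      # longest common prefix length
        while p < min(len(prev), len(cur)) and prev[p] == cur[p]:
            p += 1
        blocks.append((len(prev) - p, cur[p:]))
    return blocks

def front_decode(blocks):
    """Rebuild the block of terms from its front-coded form."""
    terms = [blocks[0]]
    for drop, suffix in blocks[1:]:
        prev = terms[-1]
        terms.append(prev[: len(prev) - drop] + suffix)
    return terms
```

On the slide's block this yields "automata", (1, "e"), (1, "ic"), (1, "on"), matching the encoding shown above.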