INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID


1 INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture # 11 Compression

2 ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources:
“Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
“Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
“Modern information retrieval” by Ricardo Baeza-Yates
“Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla

3 Outline
Compression for inverted indexes
Dictionary storage
Dictionary-as-a-String
Blocking

4 Basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer → token stream: Friends, Romans, Countrymen
Linguistic modules → modified tokens: friend, roman, countryman
Indexer → inverted index: friend, roman, countryman, each with its postings list of document IDs
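The pipeline above can be sketched in a few lines of Python. This is a toy illustration, not the lecture's implementation: the `NORMALIZE` table is a hypothetical stand-in for real linguistic modules (stemming or lemmatization), and the second document is made up so that a postings list has more than one entry.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules" step; a real system
# would use a stemmer or lemmatizer instead of a hand-written table.
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Split raw text into a stream of alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Lowercase the token and map it to a base form if one is known."""
    lowered = token.lower()
    return NORMALIZE.get(lowered, lowered)

def build_index(docs):
    """Map each normalized term to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Document 2 is invented for this example.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_index(docs)
```

The sorted term list and the per-term postings lists are the two structures that the rest of the lecture compresses.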

5 Why compression for inverted indexes?
Dictionary
  Make it small enough to keep in main memory
  Make it so small that you can keep some postings lists in main memory too
Postings file(s)
  Reduce disk space needed
  Decrease time needed to read postings lists from disk
  Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory

6 Dictionary storage - first cut
Array of fixed-width entries: 20 bytes for the term, 4 bytes each for the frequency and the pointer to the postings list.
500,000 terms; 28 bytes/term = 14MB.
Allows for fast binary search into dictionary.
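A minimal sketch of such a fixed-width table, assuming ASCII terms and little-endian 4-byte unsigned integers (both are assumptions for illustration; the slide only gives the field widths). Because every entry is exactly 28 bytes, entry i starts at byte 28 x i, which is what makes binary search possible. The sample frequencies and pointers used in the test are illustrative values.

```python
import struct

# 20-byte term, 4-byte frequency, 4-byte postings pointer = 28 bytes per entry.
ENTRY = struct.Struct("<20sII")

def pack_dictionary(entries):
    """Pack (term, freq, postings_ptr) tuples, already sorted by term."""
    out = bytearray()
    for term, freq, ptr in entries:
        raw = term.encode("ascii")
        if len(raw) > 20:
            raise ValueError(f"term does not fit in 20 bytes: {term}")
        out += ENTRY.pack(raw, freq, ptr)  # short terms are padded with NUL bytes
    return bytes(out)

def lookup(table, term):
    """Binary search over the fixed-width entries; returns (freq, ptr) or None."""
    key = term.encode("ascii")
    lo, hi = 0, len(table) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw, freq, ptr = ENTRY.unpack_from(table, mid * ENTRY.size)
        current = raw.rstrip(b"\0")
        if current == key:
            return freq, ptr
        if current < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Note how the one-letter term still occupies a full 20-byte slot, while a 34-letter term cannot be stored at all.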

7 Fixed-width terms are wasteful
Most of the bytes in the Term column are wasted: we allot 20 bytes even for 1-letter terms, and still can’t handle supercalifragilisticexpialidocious.
Written English averages ~4.5 characters per word.
Exercise: Why is/isn’t this the number to use for estimating the dictionary size?
Short words dominate token counts, but the dictionary stores each distinct word only once.
Average dictionary word in English: ~8 characters. Explain this.
What are the corresponding numbers for Italian text?
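The gap between the two averages comes from counting tokens (every occurrence) versus types (each distinct word once). A small sketch on a made-up sentence shows the effect; the sentence and the resulting numbers are illustrative only and say nothing about real English statistics.

```python
def average_lengths(text):
    """Return (average token length, average type length) for whitespace-split text."""
    tokens = text.split()
    types = set(tokens)
    avg_token = sum(len(t) for t in tokens) / len(tokens)
    avg_type = sum(len(t) for t in types) / len(types)
    return avg_token, avg_type

# Made-up sample: short words repeat often, long words occur once.
sample = "the cat saw the dog and the dog saw the cat in the extraordinary garden"
avg_token, avg_type = average_lengths(sample)
```

Frequent short words pull the token average down, but each of them contributes only one entry to the dictionary, so the type average is the relevant figure for dictionary size.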

8 Compressing the term list: Dictionary-as-a-String
Store dictionary as a (long) string of characters:
  Pointer to next word shows end of current word
  Hope to save up to 60% of dictionary space.
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Total string length = 500K terms x 8 bytes = 4MB
Pointers resolve 4M positions: log2(4M) ≈ 22 bits, rounded up to 3 bytes
Binary search these pointers
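A minimal sketch of the dictionary-as-a-string idea, using the seven terms from the slide's example. The function names are hypothetical, and a Python list of integers stands in for the packed 3-byte pointer array. The end of each term is found from the next term's pointer, as described above.

```python
import math

def build_string_dictionary(sorted_terms):
    """Concatenate sorted terms; offsets[i] is where term i starts in the string."""
    offsets, parts, position = [], [], 0
    for term in sorted_terms:
        offsets.append(position)
        parts.append(term)
        position += len(term)
    return "".join(parts), offsets

def term_at(string, offsets, i):
    """Term i runs from its own pointer up to the next term's pointer."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(string)
    return string[offsets[i]:end]

def find(string, offsets, term):
    """Binary search over the pointer array; returns the term index or -1."""
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        current = term_at(string, offsets, mid)
        if current == term:
            return mid
        if current < term:
            lo = mid + 1
        else:
            hi = mid
    return -1

terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
string, offsets = build_string_dictionary(terms)

# Pointer width needed to address 4M character positions.
pointer_bits = math.ceil(math.log2(4_000_000))
pointer_bytes = (pointer_bits + 7) // 8
```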

9 Total space for compressed list
4 bytes per term for Freq.
4 bytes per term for pointer to Postings.
3 bytes per term pointer.
Avg. 8 bytes per term in term string.
Total space = 500K terms x (4 bytes Freq + 4 bytes Postings + 3 bytes term pointer + 8 bytes in term string) = 500K terms x 19 bytes/term = 9.5MB, down from 14MB.
The term itself now costs avg. 11 bytes (3-byte pointer + 8 bytes in the string), not 20.
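The arithmetic can be checked with a short script. The helper name and its parameters are hypothetical; the byte counts are the ones given on the slides.

```python
TERMS = 500_000

def dictionary_bytes(term_bytes, term_ptr_bytes=0, freq_bytes=4, postings_ptr_bytes=4):
    """Total dictionary size in bytes for a given per-term layout."""
    return TERMS * (term_bytes + term_ptr_bytes + freq_bytes + postings_ptr_bytes)

# Fixed-width layout: the term occupies a 20-byte slot, no term pointer needed.
fixed_width = dictionary_bytes(term_bytes=20)

# Dictionary-as-a-string: avg. 8 bytes in the string plus a 3-byte term pointer.
as_string = dictionary_bytes(term_bytes=8, term_ptr_bytes=3)

saved = fixed_width - as_string
```

With these figures the string layout takes 9.5MB against 14MB for the fixed-width table, a saving of 4.5MB.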

10 Blocking
Store pointers to every kth term string.
Example below: k=4.
Need to store term lengths (1 extra byte per term).
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Per block of 4 terms: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.
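A minimal sketch of blocked storage, assuming ASCII terms of at most 255 bytes so that each length fits in one byte. Function names are hypothetical. Lookup binary-searches the first term of each block and then scans linearly inside the block, using the length bytes to step from term to term.

```python
def build_blocked_dictionary(sorted_terms, k=4):
    """Return (data, block_offsets): length-prefixed terms, one pointer per k terms."""
    data, block_offsets = bytearray(), []
    for i, term in enumerate(sorted_terms):
        raw = term.encode("ascii")
        if len(raw) > 255:
            raise ValueError(f"term too long for a 1-byte length: {term}")
        if i % k == 0:
            block_offsets.append(len(data))  # pointer to the start of a new block
        data.append(len(raw))                # 1-byte term length
        data += raw
    return bytes(data), block_offsets

def read_block(data, start, k):
    """Decode up to k length-prefixed terms starting at byte offset start."""
    terms, pos = [], start
    while pos < len(data) and len(terms) < k:
        length = data[pos]
        terms.append(data[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return terms

def find(data, block_offsets, term, k=4):
    """Return the index of term in the sorted term list, or -1 if absent."""
    def first_term(block):
        start = block_offsets[block]
        return data[start + 1:start + 1 + data[start]].decode("ascii")

    # Find the first block whose leading term is greater than the query.
    lo, hi = 0, len(block_offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        if first_term(mid) <= term:
            lo = mid + 1
        else:
            hi = mid
    block = lo - 1  # the only block that can contain the term
    if block < 0:
        return -1
    for j, candidate in enumerate(read_block(data, block_offsets[block], k)):
        if candidate == term:
            return block * k + j
    return -1

terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
data, block_offsets = build_blocked_dictionary(terms, k=4)
```

For these seven terms and k=4 only two pointers are stored instead of seven, at the cost of seven length bytes.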

11 Net
Without blocking we used 3 bytes/pointer: 3 x 4 = 12 bytes for every k=4 terms. Now we use 3+4=7 bytes per 4 terms (one pointer plus four length bytes).
Shaved another ~0.5MB; can save more with larger k.
Why not go with larger k?
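The saving as a function of k can be tabulated with a short sketch. It uses the 500K-term dictionary from the earlier slides; the helper name is hypothetical. The saving approaches 2 bytes per term (1MB in total) as k grows, while the worst-case linear scan inside a block grows with k, which is the usual argument against a very large k.

```python
TERMS = 500_000
POINTER_BYTES = 3   # one term pointer
LENGTH_BYTES = 1    # one length byte per term when blocking

def pointer_overhead(k):
    """Bytes spent on term pointers (and length bytes) for block size k.

    k=1 means no blocking: one pointer per term and no length bytes.
    """
    if k == 1:
        return TERMS * POINTER_BYTES
    blocks = -(-TERMS // k)  # ceiling division
    return blocks * POINTER_BYTES + TERMS * LENGTH_BYTES

unblocked = pointer_overhead(1)
savings = {k: unblocked - pointer_overhead(k) for k in (4, 8, 16)}
```

For k=4 this comes to 625,000 bytes, in line with the roughly half a megabyte quoted above.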

12 Resources
Chapter 5 of IIR
Resources at http://ifnlp.org/ir
Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002)
More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006)

