Compressed Prefix Sums
O’Neil Delpratt, Naila Rahman, Rajeev Raman

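Prefix Sums problem
- Given a static sequence of positive integers $x = (x_1, \ldots, x_n)$ such that $\sum_{i=1}^{n} x_i = m$.
- Support $\mathrm{Sum}(i) = \sum_{j=1}^{i} x_j$.
- Problem: minimise the space for storing $x$ according to some compressibility criteria, while supporting Sum rapidly.
- Trivial solution: explicitly store the Sum values. Requires $n \lceil \log m \rceil$ bits, supports Sum in O(1) time.
- A data structure for this problem is a prefix sums data structure (PSDS).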
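A minimal sketch of the trivial solution, assuming nothing beyond the problem statement: every Sum value is precomputed, so a query is one array lookup. The class name `ExplicitPSDS` and the sample values are illustrative and do not come from the slides.

```python
from itertools import accumulate


class ExplicitPSDS:
    """Trivial prefix sums structure: store every Sum value explicitly."""

    def __init__(self, xs):
        if any(x <= 0 for x in xs):
            raise ValueError("the sequence must contain positive integers only")
        # sums[i] holds x_1 + ... + x_i; sums[0] = 0 is a convenience sentinel.
        self.sums = [0] + list(accumulate(xs))

    def sum(self, i):
        """Return x_1 + ... + x_i (1-based i) with a single lookup."""
        return self.sums[i]


psds = ExplicitPSDS([4, 2, 3, 7, 1])
print(psds.sum(3))  # 4 + 2 + 3 = 9
```

The price of the constant-time lookup is that every stored value may be as large as the total, which is what the compressed structures on the later slides try to avoid.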

Motivation: Inverted List
- Locations of keywords in the main text.
- A sequence of strictly increasing positive integers.
- Example (Bible doc): Term = Moses: 650, 687, 696.

Inverted List
- Store the differences between consecutive locations: significant space saving, standard technique [Managing Gigabytes, Witten, Moffat, Bell].
- We can store the differences in a PSDS => Sum(i) is the i-th location of the keyword.
- Direct access to individual locations helps to answer conjunction queries.
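The gap idea can be sketched on the Moses example from the previous slide. This is only an illustration: the plain Python list `prefix` stands in for a real PSDS, and the function name `location` is hypothetical.

```python
from itertools import accumulate

locations = [650, 687, 696]  # positions of "Moses" taken from the slide

# Gap encoding: the first location, then differences between neighbours.
gaps = [locations[0]] + [b - a for a, b in zip(locations, locations[1:])]

# Stand-in for a PSDS over the gaps: explicit prefix sums.
prefix = list(accumulate(gaps))


def location(i):
    """Return the i-th location of the keyword (1-based), i.e. Sum(i)."""
    return prefix[i - 1]


print(gaps)         # [650, 37, 9]
print(location(3))  # 696
```

The gaps are much smaller than the absolute locations, which is where the space saving comes from once they are stored with a variable-length code.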

String Collection
- Collection of non-empty strings $s_1, \ldots, s_n$.
- Store $x$ in a PSDS, where $x_i = |s_i|$.
- Concatenate the strings; store the concatenated string in an array, or in a compressed self-index, e.g. FM-index or CSA.
- Get the i-th string: Offset = Sum(i-1), Length = Sum(i) - Sum(i-1) = $x_i$.
- Text string: “selmanjava3d programming2000”
- Offsets: 0,6, 13,10, 24
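A small sketch of string retrieval through prefix sums, assuming the three strings visible in the slide's example ("selman", "java3d programming", "2000"). The explicit list `sums` stands in for a PSDS, and `get_string` is a hypothetical helper name.

```python
from itertools import accumulate

strings = ["selman", "java3d programming", "2000"]
text = "".join(strings)  # the concatenated string

# sums[i] = total length of the first i strings, with sums[0] = 0.
sums = [0] + list(accumulate(len(s) for s in strings))


def get_string(i):
    """Return the i-th string (1-based) using only prefix sums of lengths."""
    offset = sums[i - 1]
    length = sums[i] - sums[i - 1]
    return text[offset:offset + length]


print(get_string(2))  # java3d programming
```

No per-string pointer is kept here; the only index is the sequence of lengths, which is what a compressed PSDS would store.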

URLs
- Web search engines keep large databases of URLs.
- URLs are strings, about 60 chars long on average, and compress fairly well.
- An explicit pointer for each URL requires 64 bits.

XML Documents
[Figure: an XML document for a book with title “java3d programming”, author “selman” and year “2000”, shown next to its document tree; the tree's text nodes include whitespace strings such as “[cr][sp]”.]
- Text nodes are [...] chars in length on average, compressed to 3-4 bytes on average.
- 32-bit pointer overhead for each string (naive).

Related Work
- [CJM]: Clark, thesis; Jacobson, FOCS 89; Clark, Munro, SODA 96
- [Geary et al.]: Geary, Raman, Rahman, CPM 04, TCS 06
- [Kim et al.]: Kim, Chae Na, Kim, Park, WEA 05
- [Gupta et al. (a)]: Gupta, Hon, Shah, Vitter, DCC 06
- [Gupta et al. (b)]: Gupta, Hon, Shah, Vitter, WEA 06
- [GV]: Grossi, Vitter, STOC 00, SICOMP 05
- [MG]: Witten, Moffat, Bell, Managing Gigabytes

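Bitvector Representation
- Write each $x_i$ in unary, e.g. 4 is “0001”, and concatenate the codes into a bitvector B with |B| = m bits.
- Sum(i) = Select(B, i): position of the i-th 1 bit in B.
- Select: space usage $m + o(m)$ bits, time O(1) [CJM, Kim et al.].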
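The unary bitvector can be sketched as follows. The linear-scan `select1` is a deliberately naive stand-in for the constant-time select structures cited on the slide, and the sample sequence is illustrative.

```python
def to_bitvector(xs):
    """Write each positive integer x as x-1 zeros followed by a 1 (unary)."""
    return "".join("0" * (x - 1) + "1" for x in xs)


def select1(bits, i):
    """Return the 1-based position of the i-th 1 bit (naive linear scan)."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("the bitvector has fewer than i one-bits")


xs = [4, 2, 3]
B = to_bitvector(xs)
print(B)              # 000101001
print(select1(B, 2))  # 6, which equals 4 + 2
```

The bitvector has exactly m = 9 bits for this input, one bit per unit of the total, so it pays off only when the values are small.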

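Succinct Bound
- Given that the number of 1s in B is n, there are $L = \binom{m}{n}$ different bit sequences.
- Lower bound to store all L sequences: $\lceil \log L \rceil$ bits, roughly $n \log(m/n) + O(n)$ bits.
- The space usage is based on the average value $m/n$. Could we do better?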
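To get a feel for the numbers, the sketch below compares two bit counts. Both formulas are assumptions made for illustration: the explicit solution is taken to cost n * ceil(log2 m) bits, and the succinct bound is taken to be ceil(log2 C(m, n)), counting m-bit vectors with n ones. The function names are hypothetical.

```python
from math import comb


def ceil_log2(value):
    """Exact ceil(log2(value)) for a positive integer, without floats."""
    return (value - 1).bit_length()


def explicit_bits(n, m):
    """Assumed cost of storing n prefix sums, each up to m, explicitly."""
    return n * ceil_log2(m)


def succinct_bound_bits(n, m):
    """Assumed bound: bits needed to distinguish all C(m, n) bitvectors."""
    return ceil_log2(comb(m, n))


print(explicit_bits(3, 9))        # 12
print(succinct_bound_bits(3, 9))  # 7
```

For larger inputs the gap between the two counts grows with n, since the explicit cost pays log m per element while the bound pays about log(m/n) plus a constant.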

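Data-aware encoding
- Exploit a skewed distribution of the values.
- Self-delimiting encodings of the values, e.g. the γ-code: concatenate $\lfloor \log x_i \rfloor$ in unary and $x_i$ in binary.
- The $x_i$ add up to m, so the average value is then $m/n$.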
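As an illustration of a self-delimiting code built from a unary part and a binary part, here is a sketch of the Elias gamma code. That this is the code meant on the slide is an assumption (the later slides mention a "Gamma tree"); the function names are hypothetical, and zeros-then-one is used for the unary part to match the "0001" convention seen earlier.

```python
def gamma_encode(x):
    """Elias gamma: floor(log2 x) zeros, then x in binary (x >= 1)."""
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers")
    binary = bin(x)[2:]  # always starts with a 1 bit
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # the unary part gives the binary length - 1
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


xs = [4, 2, 3, 7, 1]
stream = "".join(gamma_encode(x) for x in xs)
print(gamma_encode(9))           # 0001001
print(gamma_decode_all(stream))  # [4, 2, 3, 7, 1]
```

Small values get short codes (1 is a single bit), which is how such a code exploits a skewed distribution.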

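Data-aware encoding: Golomb(b, x)
- Concatenate $q = \lfloor (x-1)/b \rfloor$ in unary and $r = x - qb - 1$ in binary, using $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits.
- Example with b = 3: for Golomb(3, 9), q = 2, in unary (q+1) = 001, and r = 2 => 11, so Golomb(3, 9) = 00111.
- Best encoding for inverted lists if [...] [MG].
- [...] [Gupta et al. (b)]: not achievable.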
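The Golomb(3, 9) example can be checked with a short sketch. It assumes the definition used in Managing Gigabytes (quotient q = floor((x-1)/b), remainder r = x - q*b - 1 in truncated binary) and the zeros-then-one unary convention of the slide; the function name is hypothetical.

```python
def golomb_encode(b, x):
    """Golomb code of x >= 1 with parameter b >= 1, as a bit string."""
    if b < 1 or x < 1:
        raise ValueError("b and x must be positive integers")
    q, r = divmod(x - 1, b)
    unary = "0" * q + "1"     # q + 1 written in unary
    k = (b - 1).bit_length()  # ceil(log2 b)
    if k == 0:                # b == 1: the remainder is always 0, no bits needed
        return unary
    threshold = (1 << k) - b  # number of remainders that get the short code
    if r < threshold:
        return unary + format(r, "0{}b".format(k - 1))
    return unary + format(r + threshold, "0{}b".format(k))


print(golomb_encode(3, 9))  # 00111
```

With b = 3 the remainders 0, 1, 2 are written as 0, 10, 11, so the remainder takes one or two bits, matching the floor/ceiling remark on the slide.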

Contributions of paper
a) GOLOMB vs. SUCCINCT
b) New Select DS
c) Data aware PSDS. Space: [...] bits, Time: [...]
d) Implementation and Experimental Evaluation

Succinct vs Golomb
- If [...] [GV, Elias]

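Succinct PSDS [GV, Elias]
- Given $x$, compute the prefix sums $y_i = \mathrm{Sum}(i)$.
- Lower half: the lower-order bits of each $y_i$, stored explicitly in an array V.
- Upper half: the upper-order bits of each $y_i$, e.g. 0,1,1,2,2,4,5,6; their multiplicities are encoded in a bitvector B.
- Example: 5 = 00101 is split into its upper-order and lower-order bits.
- A query combines a select on B [CJM] with a lookup in V, e.g. get(B,4) = [...].
- Space usage: [...] bits; time: [...].
- Simple to do.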
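The split into lower and upper halves can be sketched in the style of the standard Elias-Fano layout. This is an assumption about the details, not the authors' implementation: the number of low bits is passed in by the caller, the upper halves are marked by setting bit (upper value + index) in a bit list, and select is a naive scan. All names are hypothetical.

```python
def ef_build(prefix_sums, low_bits):
    """Split each prefix sum into low bits (array) and high bits (bit list)."""
    mask = (1 << low_bits) - 1
    lows = [y & mask for y in prefix_sums]
    highs = [y >> low_bits for y in prefix_sums]
    # One 1 per element; element i (0-based) sets bit highs[i] + i.
    bits = [0] * (highs[-1] + len(highs))
    for i, h in enumerate(highs):
        bits[h + i] = 1
    return lows, bits


def select1(bits, i):
    """0-based position of the (i+1)-th 1 bit (naive linear scan)."""
    seen = -1
    for pos, bit in enumerate(bits):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError("not enough one-bits")


def ef_sum(lows, bits, low_bits, i):
    """Recover the i-th prefix sum (0-based) from the two halves."""
    high = select1(bits, i) - i
    return (high << low_bits) | lows[i]


ys = [4, 6, 9, 16, 17]  # prefix sums of 4, 2, 3, 7, 1
lows, bits = ef_build(ys, low_bits=1)
print([ef_sum(lows, bits, 1, i) for i in range(len(ys))])  # [4, 6, 9, 16, 17]
```

Because the prefix sums are non-decreasing, the upper values are too, so the positions of the 1 bits never collide and a single select recovers each upper half.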

New select DS
- Select(B, i) = position of the i-th 1 bit in bitstring B of length N.
- Extracted string & contracted string [Kim et al.].
- Remove zero blocks [Geary et al.]: fast select, since every block has at least a single 1 bit.
[Figure: a bitstring A divided into blocks, some of which are blocks of zeros, together with P and the contracted string A’.]
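The zero-block removal idea can be illustrated with a toy version. This is a sketch under simplifying assumptions, not the structure of [Kim et al.] or [Geary et al.]: blocks have a fixed size, a plain list of flags records which blocks were kept, and both lookups are linear scans rather than constant-time operations. The names `marks` and `contracted` are hypothetical.

```python
def build(bits, block):
    """Split bits into blocks and drop the blocks that contain no 1 bit."""
    blocks = [bits[i:i + block] for i in range(0, len(bits), block)]
    marks = [1 if "1" in blk else 0 for blk in blocks]  # one flag per block
    contracted = "".join(blk for blk in blocks if "1" in blk)
    return marks, contracted


def nth(seq, i, symbol):
    """0-based index of the i-th (1-based) occurrence of symbol in seq."""
    seen = 0
    for pos, item in enumerate(seq):
        if item == symbol:
            seen += 1
            if seen == i:
                return pos
    raise IndexError("not enough occurrences")


def select1(marks, contracted, block, i):
    """0-based position in the original bitstring of its i-th 1 bit."""
    pos = nth(contracted, i, "1")              # answer inside the contracted string
    kept_block, offset = divmod(pos, block)
    original_block = nth(marks, kept_block + 1, 1)  # map back to the original block
    return original_block * block + offset


A = "0000" + "0100" + "0000" + "0000" + "1001" + "00"
marks, contracted = build(A, block=4)
print(contracted)                                               # 01001001
print([select1(marks, contracted, 4, i) for i in (1, 2, 3)])  # [5, 16, 19]
```

After contraction every remaining block holds at least one 1 bit, which bounds how far a select query has to look inside the contracted string.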

New select DS
- Assume a bitstring (BS) of N bits.
- Results: Select & rank in O(1) time, space N+o(N) bits.
- Select1 and select0.
- Partitioned BV [Delpratt, Raman, Rahman, WEA 06].
- In practice: joint fastest with CJM.

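New select DS: space usage

 | Typical: NewDS | Typical: CJM | Typical: KIM | Worst-case: NewDS | Worst-case: CJM | Worst-case: KIM
Input BS | (1-[...])N | N | N | N | N | N
Select | (1-[...])0.94N | (1+[...])0.52N | (1+[...])0.63N | 0.94N | 2.77N | 1.17N
rank | 0.03N | 0.5N | 0.25N | 0.02N | 0.5N | 0.25N

sum: typical ~2N; worst-case ~1.94N (NewDS), 4.27N (CJM), 2.42N (KIM)

- Reliable space bound.
- Speed evaluation on Orders.xml: NewDS=0.101, CJM=0.105, KIM=0.178 oper./per sec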

Data aware tree PSDS
- Results: space usage [...] bits, time [...].
- [Gupta et al. (b)] achieved [...], time [...].

Data aware tree PSDS
- Delete the larger child.
- Indicate which nodes were removed: n-1 extra bits.

Implementation and Experimental Evaluation
- Implemented Succinct, Explicit-γ & Succinct-γ PSDS.
- Gamma tree PSDS: removing right child nodes vs the largest node = negligible difference. The tree is slow.
- Data: lengths of text node strings.
- Compressibility measures (per node): Succinct measure close to GOLOMB measure.

File | #Text nodes | Gap | GOLOMB | Succ.
orders.xml | 150K | [...] | [...] | [...]
Xpath.xml | 1.7m | [...] | [...] | [...]

Experimental Evaluation: Results
- Comparative space usage for data structures.
- Linux machine, 8 million random operation calls, 10 repeated runs.
- Time: sec per operation.

File | Succinct | Explicit-γ | Succinct-γ
orders.xml | [...] | [...] | [...]
Xpath.xml | [...] | [...] | [...]

- Succinct PSDS performed best.

Conclusions and future work
- Compression of prefix sums is important.
- Space-efficient data-aware PSDS.
- Succinct PSDS was more appropriate in our application.
- New select DS.
- Future improvements: make Succinct-γ more competitive (a single γ-decode is x20 faster than a single select), and improvements to the data aware tree PSDS.

Thank you!