IBM Almaden Research Center © 2006 IBM Corporation Wringing a Table Dry: Using CSVZIP to Compress a Relation to its Entropy Vijayshankar Raman & Garret Swart

Presentation transcript:

IBM Almaden Research Center © 2006 IBM Corporation Wringing a Table Dry: Using CSVZIP to Compress a Relation to its Entropy Vijayshankar Raman & Garret Swart

Oxide is cheap, so why compress?
 Make better use of memory
  – Increase capacity of in-memory database
  – Increase effective cache size of on-disk database
 Make better use of bandwidth
  – I/O and memory bandwidth are expensive to scale
  – ALU operations are cheap and getting cheaper
 Minimize storage and replication costs

Why compress relations?
 Relations are important for structured information
 Text, video, audio, and image compression is more advanced than relational compression
 Statistical and structural properties of the relation can be exploited to improve compression
 Relational data have special access patterns
  – Don't just "inflate": need to run selections, projections, and aggregations

Our results
 Near-optimal compression of relational data
  – Exploits data skew, column correlations, and lack of ordering
  – Theory: compress m i.i.d. tuples to within 4.3 m bits of entropy (but the theory doesn't count dictionaries)
  – Practice: between 8x and 40x compression
 Scanning compressed relational data
  – Directly perform projections, equality and range selections, and joins on entropy-compressed data
  – Cache-efficient dictionary usage
  – Query short circuiting

This Talk
[CSVZIP flow diagram: Raw Data → Analyze → Meta Data & Dictionaries → Compress → Compressed Data → Query → Results; Update folds New Raw Data back in]
 Analyze to determine compression plan
 Compress to reduce size
 Execute many queries over compressed data
 Periodically update data and dictionaries

Sources of Redundancy in Relations
 Column value space much smaller than domain: |C| << |domain(C)|
  – e.g. {"Apple", "Pear", "Mango"} in CHAR(10)
  – Type-specific transformations, dictionaries
 Skew in value frequency: H(C) << lg |C|
  – e.g. 90% of fruits are "Apple"
  – Entropy encoding (e.g. Huffman codes)
 Column correlations within a tuple: H(C1, C2) << H(C1) + H(C2)
  – e.g. mangos are mainly sold in August
  – Column co-coding
 Incidental tuple ordering: H({T1, T2, …, Tm}) ≈ H(T1, T2, …, Tm) − m lg m
  – Sort and delta-code
 Tuple correlations
  – e.g. mango buyers also buy paper towels
  – If correlated tuples share common columns, sort first on those columns

Compression Process: Step 1
[Figure: encoding one input tuple (Male, John, 08/10/06, Mango)]
 Type-specific transform splits the date: 08/10/06 → week w35 / Sat (Column 3.A) and year 2006 (Column 3.B)
 Co-code transform combines correlated columns: Columns 1 & 2 → Male/John; with the weekday → Male/John/Sat; week and fruit → w35/Mango
 Each transformed column is Huffman-encoded against its dictionary (figure shows code probabilities p = 1/512, p = 1/8, p = 1/512)
 The column codes are concatenated into the tuplecode
Background statistics from the figure:
 First-name frequencies: Michael 4.2%, David 3.8%, James 3.6%, Robert 3.5%, John 3.5%, William 2.5%, Mark 2.4%, Richard 2.3%, Thomas 1.9%, Steven 1.5%
 Sale day by gender: Male – Mon 3%, Tue 4%, Wed 10%, Thu 6%, Fri 23%, Sat 42%, Sun 12%; Female – Mon 4%, Tue 5%, Wed 9%, Thu 15%, Fri 17%, Sat 28%, Sun 22%
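The final concatenation step can be sketched as MSB-first bit-packing of variable-length column codes into one tuplecode; the code values and widths below are invented stand-ins, not the figure's actual dictionaries:

```c
/* Sketch: concatenating variable-length column codes into a single
   tuplecode with no delimiters. Codes and widths are made up. */
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t bits; /* packed codes, most recent code in the low bits */
    int len;       /* number of valid bits */
} tuplecode;

/* Append an nbits-bit column code (nbits < 32) to the tuplecode. */
void append_code(tuplecode *t, uint32_t code, int nbits) {
    t->bits = (t->bits << nbits) | (code & ((1u << nbits) - 1));
    t->len += nbits;
}

/* Worked example: codes 1 | 101 | 0010 pack into 0b11010010 = 0xD2. */
uint64_t demo_tuplecode(void) {
    tuplecode t = {0, 0};
    append_code(&t, 0x1, 1); /* e.g. a 1-bit co-coded column */
    append_code(&t, 0x5, 3); /* a 3-bit column code           */
    append_code(&t, 0x2, 4); /* a 4-bit column code           */
    return t.bits;
}
```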

Compression Process: Step 2
[Figure: sorted tuplecodes; each tuplecode after the first is replaced by its delta from the previous tuplecode, and the Huffman-encoded delta is appended to the compression block]
 Sort the tuplecodes
 Delta: subtract the previous tuplecode from the current one
 Huffman-encode the delta and append the delta code to the compression block
 "Look Ma, no delimiters!"
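The sort-then-delta step can be sketched on toy fixed-width tuplecodes (the real pipeline then Huffman-codes the deltas; this sketch stops at the deltas, and the 16-bit codes are invented):

```c
/* Sketch: sort tuplecodes, then store each as a delta from its
   predecessor. Sorting discards the incidental tuple order, and the
   small deltas are what the entropy coder then compresses. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u16(const void *a, const void *b) {
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* In place: sort, then replace codes[i] (i > 0) by its delta. */
void sort_delta(uint16_t *codes, int n) {
    qsort(codes, n, sizeof *codes, cmp_u16);
    for (int i = n - 1; i > 0; i--)
        codes[i] -= codes[i - 1];
}

/* Worked example: {500, 20, 310} -> sorted {20, 310, 500}
   -> deltas {20, 290, 190}. */
uint16_t demo_first_code(void) {
    uint16_t c[3] = {500, 20, 310};
    sort_delta(c, 3);
    return c[0];
}
uint16_t demo_second_delta(void) {
    uint16_t c[3] = {500, 20, 310};
    sort_delta(c, 3);
    return c[1];
}
```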

Compression Results
 P1 – P6: various projections of TPC-H tables
 P7: SAP SEOCOMPODF
 P8: TPC-E Customer

Huffman Code Scan Operations
 Example query:
  SELECT SUM(price) FROM Sale
  WHERE week(saleDate) = 23 AND fruit = "Mango"
   AND year(saleDate) BETWEEN 1997 AND 2005
 Scan this:
  – Skip over the 1st column: need only its length
  – Range-compare the 2nd column: year in 1997 to 2005
  – Equality-compare the 3rd column: week = 23, fruit = Mango
  – Decode the 4th column for aggregation
 Segregated coding: faster operations, same compression
  – Assign Huffman codes in order of length: |code(v)| < |code(w)| ⇒ code(v) < code(w)
  – Sort codes within a length: |code(v)| = |code(w)| ⇒ (v < w ⇒ code(v) < code(w))
[Year → code example table from the figure omitted]
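The two segregated-coding rules amount to canonical Huffman code assignment. A minimal sketch, with a toy set of code lengths (not a real dictionary): handing out consecutive codes in length-then-value order, and left-shifting whenever the length grows, yields codes that compare in exactly that order when left-aligned in a machine word.

```c
/* Sketch: segregated (canonical) code assignment. Symbols arrive
   sorted by code length, then by value; toy lengths, helper names ours. */
#include <assert.h>
#include <stdint.h>

/* len[0..n-1] must be nondecreasing; writes the codes into code[]. */
void assign_segregated(const int *len, int n, uint32_t *code) {
    uint32_t c = 0;
    int prev = len[0];
    for (int i = 0; i < n; i++) {
        c <<= (len[i] - prev);  /* widen the code when the length grows */
        prev = len[i];
        code[i] = c++;
    }
}

/* Left-align a code of l bits in a 32-bit word, as a scanner sees it
   at the front of the bit stream. */
uint32_t left_align(uint32_t code, int l) { return code << (32 - l); }

/* Check the slide's property on lengths {2,2,3,3,3,4}: the resulting
   codes {00, 01, 100, 101, 110, 1110} are strictly increasing when
   left-aligned, so shorter codes compare smaller and equal-length
   codes keep their value order. */
int demo_monotone(void) {
    int len[6] = {2, 2, 3, 3, 3, 4};
    uint32_t code[6];
    assign_segregated(len, 6, code);
    for (int i = 1; i < 6; i++)
        if (left_align(code[i - 1], len[i - 1]) >= left_align(code[i], len[i]))
            return 0;
    return 1;
}
```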

Segregated Coding: Computing Code Length
 One code length: constant function
  – #define codeLen(w) 6
 Second-largest code length << lg(L1 cache size): use a lookup table
  – #define codeLen(w) codeTable[w >> 26]
 Otherwise: compare the input with the max code of each length
  – #define codeLen(w) \
     ((w <= 0b…) ? 3 \
     : (w <= 0b…) ? 6 \
     : (w <= 0b…) ? 7 … )))
[Year → code example table from the figure omitted]
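The third case, with its elided 0b… constants, can be made concrete for an assumed toy code set {00, 01, 100, 101, 110, 1110} (lengths 2 to 4; ours, not the paper's). The thresholds are the maximum code of each length, left-aligned and padded with 1s:

```c
/* Sketch: code length by comparing against the max code of each
   length, for the toy code set {00, 01, 100, 101, 110, 1110}.
   w holds the upcoming input bits, left-aligned in 32 bits. */
#include <assert.h>
#include <stdint.h>

int code_len(uint32_t w) {
    if (w <= 0x7FFFFFFFu) return 2;  /* at most 01 followed by 1s  */
    if (w <= 0xDFFFFFFFu) return 3;  /* at most 110 followed by 1s */
    return 4;                        /* must start with 1110       */
}
```

No dictionary access is needed, which is the point of the slide's three-way case split.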

Segregated Coding: Range Query
 Example: SELECT * WHERE col BETWEEN 112 AND 302
 The range test compiles to comparisons on the codes themselves, one case per code length:
  switch (codeLen(w)) {
   case 3: return w >> 28 != 0;
   case 4: return w >= 0b… && w <= 0b…;
   case 5: return w >= 0b… && w <= 0b…;
  }
[Value → code example table from the figure omitted]

Advantages of Segregated Coding
 Find code length quickly
  – No access to dictionary
 Fast range queries
  – No access to dictionary for constant ranges
 Cache locality
  – Because values are sorted by code length, commonly used values are clustered near the beginning of the array
  – The beginning of the array is most likely to be in cache, improving the cache hit ratio

Query Short Circuiting
 Reuse predicates and values that depend on unchanged columns
 Sorting causes many unchanged columns
[Figure: previous tuple + delta value → next tuple; the common leading bits cover the unchanged Gender/FName columns, while Year changes]
 Reused predicates: Sex = Male, Name = John, Year ≥ 2005
 Reduces instructions but adds a branch!
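The figure's "common bits" idea can be sketched as follows: after reconstructing the next tuplecode, XOR it with the previous one; the number of leading zero bits bounds the unchanged prefix, and any column lying entirely inside that prefix keeps its cached predicate result. This assumes GCC/Clang's __builtin_clz, and the 32-bit tuplecode layout is invented for illustration:

```c
/* Sketch: detecting unchanged columns between consecutive sorted
   tuplecodes. Column offsets are counted from the MSB; the layout
   here is made up, not the paper's. */
#include <assert.h>
#include <stdint.h>

/* Number of leading bits shared by the previous and next tuplecode. */
int common_prefix_bits(uint32_t prev, uint32_t next) {
    uint32_t x = prev ^ next;
    return x == 0 ? 32 : __builtin_clz(x);
}

/* A column occupying bits [start, start+len) from the MSB is
   unchanged iff the common prefix covers it completely. */
int column_unchanged(uint32_t prev, uint32_t next, int start, int len) {
    return common_prefix_bits(prev, next) >= start + len;
}
```

When a column is unchanged, the scanner skips both decoding and predicate evaluation for it, which is where the instruction savings (and the extra branch) come from.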

Selected Prior Work
 Entropy coding
  – Shannon (1948), Huffman (1952); arithmetic coding – Abramson (1963), Pasco, Rissanen (1976)
 Row or page coding
  – Compress each row or page independently; decompress on page load or row touch. Compression code is localized. [Oracle, DB2, IMS]
 Column-wise coding
  – Each column value gets a fixed-length code from a per-column dictionary [Sybase IQ, C-Store, MonetDB]
  – Pack multiple short values into 16-bit quantities and decode them as a unit to save CPU [Abadi/Madden/Ferreira]
 Delta coding
  – Sort and difference, or remove the common prefix from adjacent codes [inverted indices, B-trees, C-Store]
 Text coding
  – "gzip"-style coding using n-grams, Huffman codes, and sliding dictionaries [Ziv, Lempel, Welch, Katz]
 Order-preserving codes
  – Allow range queries at a cost in compression [Hu/Tucker, Antoshenkov/Murray/Lomet, Zandi/Iyer/Langdon]
 Lossy coding
  – Model-based lossy compression: SPARTAN, vector quantization

Work in Progress
 Analysis to find the best:
  – Dictionaries that fit in the L2 cache
  – Set of columns to co-code
  – Column ordering for the sort
 Generate code for efficient queries on x86-64, Power5 and Cell
  – Don't interpret metadata at run time
  – Utilize architecture features
 Update
  – Incremental update of dictionaries; background merge of new rows
 Release of CSVZIP utilities

Observations
 Entropy decoding uses less I/O but more ALU ops than conventional decoding
  – Our technique removes the cache as a problem
  – Have to squeeze every ALU op: trends are in our favor
 Variable-length codes make vectorization and out-of-order execution hard
  – Exploit compression-block parallelism instead
 These techniques can be exploited in a column store

Backup

Entropy Encoding on a Column Store
 Don't build the tuplecode: treat the tuple as a vector of column codes and sort lexicographically
 Columns early in the sort: run-length encoded deltas
 Columns in the middle of the sort: entropy-encoded deltas
 Columns late in the sort: concatenated column codes
 Independently break columns into compression blocks
 Make dictionaries bigger, because only one is in use at a time
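The "early in the sort" case can be sketched directly: after a lexicographic sort, the leading column is a long run of repeated codes, which run-length encoding collapses to (value, count) pairs. A minimal sketch with toy codes (representation and names are ours):

```c
/* Sketch: run-length encoding the leading sort column of a column
   store. Toy (value, count) representation. */
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t value; uint32_t count; } run;

/* Encode codes[0..n) into runs; returns the number of runs.
   out must have room for n runs in the worst case. */
int rle_encode(const uint32_t *codes, int n, run *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (m > 0 && out[m - 1].value == codes[i])
            out[m - 1].count++;
        else
            out[m++] = (run){codes[i], 1};
    }
    return m;
}

/* Worked example: {3,3,3,5,5,9,9} -> runs (3,3) (5,2) (9,2). */
int demo_runs(void) {
    const uint32_t col[7] = {3, 3, 3, 5, 5, 9, 9};
    run r[7];
    return rle_encode(col, 7, r);
}
```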

Entropy: A Measure of Information Content
 Entropy of a random variable R
  – The expected number of bits needed to represent the outcome of R
  – H(R) = ∑_{r ∈ domain(R)} Pr(R = r) lg(1 / Pr(R = r))
 Conditional entropy of R given S
  – The expected number of bits needed to represent the outcome of R given that we already know the outcome of S
  – H(R | S) = ∑_{s ∈ domain(S)} ∑_{r ∈ domain(R)} Pr(R = r, S = s) lg(1 / Pr(R = r, S = s)) − H(S)
 If R is a random relation of size n, then R is a multi-set of random variables {T1, …, Tn}, where each random tuple Ti is a cross product of random attributes C1i × … × Cki

The Entropy of a Relation
 We define a random relation R of size m over D as a random variable whose outcomes are multi-sets of size m, where each element is chosen identically and independently from an arbitrary tuple distribution D. The results depend on H(D) and thus on the optimal encoding of tuples chosen from D.
  – If we do a good job of co-coding and Huffman coding, then the tuplecodes are entropy coded: they are random bit strings whose length depends on the distribution of the column values but whose entropy is equal to their length
 Lemma 2: The entropy of a random relation R of size m over a distribution D is at least m H(D) − lg m!
 Theorem 3: The algorithm presented compresses a random relation R of size m to within 4.3 m bits of H(R), if m > 100

Proof of Lemma 2
 Let R′ be a random vector of m tuples i.i.d. over distribution D, whose outcomes are sequences of m tuples t1, …, tm. Clearly H(R′) = m H(D).
 Consider an augmentation of R′ that appends the index i to each tuple ti. Define R1 as the set consisting of exactly those indexed values. H(R1) = m H(D), as there is a bijection between R1 and R′.
 But the random multi-set R is a projection of the set R1, and there are exactly m! equal-probability sets R1 that project to each outcome of R, so H(R1) ≤ H(R) + lg m!, and thus H(R) ≥ m H(D) − lg m!

Proof Sketch of Theorem 3
 Lemma 1 says: if R is a random multi-set of m values drawn uniformly from 1..m and m > 100, then H(delta(sort(R))) < 2.67 m
 But we have values from an arbitrary distribution, so work by cases:
  – For values longer than lg m bits, truncate, getting a uniform distribution in the range
  – For values shorter than lg m bits, append random bits, also getting a uniform distribution