How To Build a Compressed Bitmap Index


1 How To Build a Compressed Bitmap Index
Theodore Johnson

2 What Are Bitmap Indices?
Let R be a relation (a table of records), with records 1 .. N. A bitmap of predicate P on R is: bit i is set to 1 if P(Ri) is true, else bit i is set to zero. Typical use: P is of the form, “attribute x has value y”.

Name   State   Has_cell_phone
Bob    NY      Y
Mary   NJ      N
Pete   CT      N
Sue    CT      Y
Anne   NJ      Y
Gus    NY      Y
Ken    NJ      N

Figure: bitmaps on State (NY, CT).
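As a minimal sketch of the definition above, the following builds one bitmap per distinct value of a column. The rows are copied from the slide's example table; the function name `build_bitmaps` and the list-of-bits representation are assumptions made for illustration.

```python
# Rows mirror the slide's example table: (Name, State, Has_cell_phone).
rows = [
    ("Bob", "NY", "Y"), ("Mary", "NJ", "N"), ("Pete", "CT", "N"),
    ("Sue", "CT", "Y"), ("Anne", "NJ", "Y"), ("Gus", "NY", "Y"),
    ("Ken", "NJ", "N"),
]


def build_bitmaps(records, column):
    """Return {value: list of 0/1 bits}; bit i is 1 when record i has that value."""
    bitmaps = {}
    for i, rec in enumerate(records):
        bitmaps.setdefault(rec[column], [0] * len(records))[i] = 1
    return bitmaps


state_bitmaps = build_bitmaps(rows, 1)
print(state_bitmaps["NY"])  # [1, 0, 0, 0, 0, 1, 0]
print(state_bitmaps["CT"])  # [0, 0, 1, 1, 0, 0, 0]
```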

3 Why Use Bitmap Indices?
Efficient representation of duplicates in an index. A record ID requires 32 bits per indexed record; a bitmap index requires 1 bit per record per attribute value.
Efficient evaluation of complex Boolean selection conditions. Word-wise Boolean operations. “New England states” vs. “Tri-state area” vs. “Coastal states”. “People who work in a state different from their residence, drive a BMW or a Chevy, are married, and do not have a cell phone”.
Other special tricks: fast COUNT aggregates.
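A small sketch of word-wise Boolean evaluation and a COUNT aggregate, assuming each bitmap is packed into a single Python integer with record i stored in bit i. The helper names `to_word` and `count` are hypothetical; the bit patterns come from the seven-record example table.

```python
def to_word(bits):
    """Pack a list of 0/1 bits into one integer; record i maps to bit i."""
    word = 0
    for i, b in enumerate(bits):
        word |= b << i
    return word


def count(word):
    """COUNT aggregate: number of set bits in the bitmap."""
    return bin(word).count("1")


ny = to_word([1, 0, 0, 0, 0, 1, 0])
nj = to_word([0, 1, 0, 0, 1, 0, 1])
has_cell = to_word([1, 0, 0, 1, 1, 1, 0])

# "Lives in NY or NJ and has a cell phone": one OR and one AND on whole words.
selected = (ny | nj) & has_cell
print(count(selected))  # 3
```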

4 Why NOT Use Bitmap Indices?
High-cardinality attributes pose a problem:
- More than 32 attribute values => high space overhead.
- Inherently sequential bitmaps => expensive to perform large range queries.

5 Is There a Solution?
Bitslice indices: Represent integer values as bits: 5 = 101. Create a bitmap for each bit: B4, B2, B1. Special algorithms for range queries, etc. Problem: useful only in highly specialized cases.
Compression: Compress the bitmaps using, e.g., gzip or one of the special bitmap compression algorithms. Oracle B-tree indices: all duplicates represented with a compressed bitmap. Problem: I have no guidelines for optimizing the index. What do I compress? Using which algorithm? How do I perform Boolean operations? HELP!
S. Sarawagi, Data Engineering Bulletin, 1997, “Indexing OLAP Data”
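The bitslice idea from this slide can be sketched as follows: one bitmap per bit position of the value (B1, B2, B4 for 3-bit integers), with an equality query answered by combining the slices. The function names and the sample values are assumptions for illustration; the special range-query algorithms the slide mentions are not shown.

```python
def build_bitslices(values, nbits):
    """slices[k][i] is bit k of values[i]; slices[0] is B1, slices[1] is B2, ..."""
    return [[(v >> k) & 1 for v in values] for k in range(nbits)]


def equals(slices, target):
    """Bitmap of records whose value equals target, computed slice by slice."""
    result = [1] * len(slices[0])
    for k, bitslice in enumerate(slices):
        want = (target >> k) & 1
        # Keep a record only if its bit k matches bit k of the target.
        result = [r & (1 if b == want else 0) for r, b in zip(result, bitslice)]
    return result


values = [5, 3, 6, 5, 1]
slices = build_bitslices(values, 3)
print(slices[2])          # B4: [1, 0, 1, 1, 0]
print(equals(slices, 5))  # [1, 0, 0, 1, 0]
```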

6 The Performance of Compressed Bitmap Indices
There are many bitmap compression algorithms:
No compression (verbatim).
Gzip (using the zlib library).
Run Length Encoding (list of # of 0’s between 1’s).
Variable bit length encoding (compressed RLE).
Variable byte length encoding (a hybrid technique).
Which algorithm is best? Best compression? Fastest for Boolean operations? Is there an overall best algorithm, or should I choose on a per-bitmap basis?

7 Special Bitmap Compression Codes
Variable bit length codes (ExpGol). Compress a run length encoding. Use fewer bits for smaller runs. Gamma code: 1, 010, 011, 00100, 00101, 00110, 00111, ... ExpGol: near optimal, N of M bits set => N(log(M/N)+2) bits.
Variable byte length codes (BBC). Compress the bitmap in units of bytes, create code words that are byte sequences. Hybrid: represent long runs of zeros with a “gap” code, short runs with a piece of the verbatim bitmap. Lots of code word packing tricks. Operations can be fast because you avoid bit manipulations. 1-pass compression => you can perform Boolean operations directly on the compressed representation. But the algorithms are very complex and hard to code.
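A minimal sketch of the gamma code listed above, applied to a run length encoding. This illustrates the variable bit length idea only; it is not the ExpGol or BBC code from the talk. Encoding the distance between successive set bits (always at least 1) and dropping trailing zeros are assumptions of this sketch.

```python
def gamma(n):
    """Gamma code of n >= 1: (len - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(code):
    """Decode a string of concatenated gamma codes into a list of integers."""
    out, i = [], 0
    while i < len(code):
        zeros = 0
        while code[i] == "0":
            zeros += 1
            i += 1
        out.append(int(code[i:i + zeros + 1], 2))
        i += zeros + 1
    return out


def compress(bits):
    """Gamma-code the distance from each set bit to the previous one.
    Trailing zeros are not recorded, so the bitmap length is kept separately."""
    codes, prev = [], -1
    for i, b in enumerate(bits):
        if b:
            codes.append(gamma(i - prev))
            prev = i
    return "".join(codes)


print([gamma(n) for n in range(1, 5)])   # ['1', '010', '011', '00100']
code = compress([0, 0, 1, 0, 0, 0, 0, 1, 1])
print(code, gamma_decode(code))          # 011001011 [3, 5, 1]
```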

8 Algorithms for Boolean Operations
Bitmaps are used in data warehouses to perform complex Boolean selections. Compression and decompression time is secondary. I want fast algorithms to perform an operation. Operation between a compressed operand and a foundset.
Four main algorithms:
Basic : Uncompress the bitmap, byte-wise operations with the foundset.
Inplace : Extract the list of set bits from the compressed bitmap, operate on the foundset in place.
Merge : The foundset and the bitmap are in RLE form. The list of bits is merged.
Direct : Encoding specific.

9 Basic
output: verbatim; foundset: verbatim; operand: verbatim
Figure: the verbatim operand is applied to the foundset with &= or |=.
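A sketch of the Basic algorithm, assuming the operand has already been uncompressed to verbatim bytes. The name `basic_op` and the use of `bytearray` for the foundset are choices made for illustration.

```python
def basic_op(foundset, operand, op):
    """Apply operand to foundset byte by byte; foundset is updated in place."""
    if len(foundset) != len(operand):
        raise ValueError("foundset and operand must have the same length")
    if op not in ("and", "or"):
        raise ValueError("op must be 'and' or 'or'")
    for i, byte in enumerate(operand):
        if op == "and":
            foundset[i] &= byte
        else:
            foundset[i] |= byte
    return foundset


operand = bytes([0b10101010, 0b10101010])
print(list(basic_op(bytearray([0b11110000, 0b00001111]), operand, "and")))  # [160, 10]
print(list(basic_op(bytearray([0b11110000, 0b00001111]), operand, "or")))   # [250, 175]
```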

10 Inplace
output: verbatim; foundset: verbatim; operand: RLE, ExpGol, BBC
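A sketch of the Inplace algorithm for an RLE operand, assuming the RLE form is a list of zero-run lengths before each 1 and the foundset is a verbatim list of bits. The function names are hypothetical; a real implementation would work on packed words.

```python
def set_bits_from_rle(gaps):
    """gaps[k] is the number of 0s before the k-th 1; yield set-bit positions."""
    pos = -1
    for gap in gaps:
        pos += gap + 1
        yield pos


def inplace_or(foundset, gaps):
    """OR: switch on every foundset bit that is set in the operand."""
    for p in set_bits_from_rle(gaps):
        foundset[p] = 1


def inplace_and(foundset, gaps):
    """AND: clear every foundset bit that lies inside a run of operand zeros."""
    prev = -1
    for p in set_bits_from_rle(gaps):
        for i in range(prev + 1, p):
            foundset[i] = 0
        prev = p
    for i in range(prev + 1, len(foundset)):
        foundset[i] = 0


a = [1, 1, 0, 1, 0, 1, 1, 0]
inplace_or(a, [1, 2, 1])      # operand has bits 1, 4 and 6 set
print(a)                      # [1, 1, 0, 1, 1, 1, 1, 0]
b = [1, 1, 0, 1, 0, 1, 1, 0]
inplace_and(b, [1, 2, 1])
print(b)                      # [0, 1, 0, 0, 0, 0, 1, 0]
```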

11 Merge
output: RLE; foundset: RLE; operand: RLE
Figure: the foundset and operand bit lists are merged (AND / OR) to produce the output foundset.
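A sketch of the Merge algorithm, assuming both inputs are RLE lists of zero-run lengths. For readability the runs are expanded to set-bit positions before merging; a tuned version would merge the runs directly. All names here are hypothetical.

```python
def positions(gaps):
    """Expand zero-run lengths into a sorted list of set-bit positions."""
    out, pos = [], -1
    for gap in gaps:
        pos += gap + 1
        out.append(pos)
    return out


def to_gaps(pos_list):
    """Convert sorted set-bit positions back into zero-run lengths."""
    gaps, prev = [], -1
    for p in pos_list:
        gaps.append(p - prev - 1)
        prev = p
    return gaps


def merge(found_gaps, operand_gaps, op):
    """AND or OR two RLE bitmaps with a single merge pass; the result is RLE."""
    if op not in ("and", "or"):
        raise ValueError("op must be 'and' or 'or'")
    a, b = positions(found_gaps), positions(operand_gaps)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if op == "or":
                out.append(a[i])
            i += 1
        else:
            if op == "or":
                out.append(b[j])
            j += 1
    if op == "or":
        out.extend(a[i:])
        out.extend(b[j:])
    return to_gaps(out)


print(merge([0, 2, 1], [3, 3], "and"))  # [3]
print(merge([0, 2, 1], [3, 3], "or"))   # [0, 2, 1, 1]
```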

12 Direct
output: BBC; foundset: BBC; operand: BBC
The idea is to create a new merged code word from the code words of the foundset and the bitmap. Creating BBC code words is a completely local process, so extensive partial results do not need to be saved. Unfortunately, the details are very complex.

13 Compression, Synthetic Data

14 Compression, Real Bitmap Indices
ExpGol : 7.3 bits per tuple BBC 2S: 1.04 bits per tuple

15 Performing an Operation

16 Performing an Operation

17 Trends

18 What Is The Best Compression?
Space : It depends on the properties of the bitmap. Density, clusteredness. Using the best compressor for all bitmaps in an index gives a 10% to 20% space savings.
Time : It depends on the bitmap and the operations.
Simple analysis: I can store bitmaps in the Verbatim or the ExpGol format. Assume the use of Inplace, 3X as many ORs as ANDs. Include the cost of reading the bitmap from disk. Compress the bitmap if the bit density is .05 or less.
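The rule of thumb from the simple analysis can be written as a one-line decision. The function name, the handling of an empty bitmap, and the idea of applying the rule per bitmap are assumptions; the .05 threshold is the figure quoted on the slide for its stated workload.

```python
def choose_storage_format(bits, threshold=0.05):
    """Pick 'ExpGol' when the fraction of set bits is at or below threshold."""
    if not bits:
        return "verbatim"  # assumption: nothing to gain on an empty bitmap
    density = sum(bits) / len(bits)
    return "ExpGol" if density <= threshold else "verbatim"


print(choose_storage_format([1] + [0] * 99))  # ExpGol
print(choose_storage_format([1, 0, 1, 0]))    # verbatim
```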

19 What is the best way to evaluate a Boolean expression?
Parse the Boolean expression into an operator tree. Assign an evaluation algorithm to each node. Requires a global optimization. Rewrite the expression ? Joint work with Sihem Amer-Yahia.

20 Assigning Evaluation Algorithms
Assume that the expression tree and bitmap index encodings are fixed. We can estimate algorithm costs from the properties of the operand bitmaps. But different algorithms expect and produce results in different formats. An additional translation step might be required. Global effects of local decisions => we need a global optimizer.

Algorithm   Foundset   Operand            Output
Basic       verbatim   verbatim           verbatim
Inplace     verbatim   RLE, ExpGol, BBC   verbatim
Merge       RLE        RLE                RLE
Direct      BBC        BBC                BBC

21 Bitmap Format Translations

22 Dynamic Programming Algorithm
Take advantage of the fact that the decisions are localized. The cost to use an algorithm to evaluate an operator is:
The cost of the algorithm.
The cost of generating bitmaps from the subtrees in the necessary format.
The cost of translating the output to the desired output format.
Figure: each subtree supplies a cost for each output format; the operator node (op) has a cost for each algorithm and, after the translation cost, a cost for each output format.
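A sketch of the dynamic programming recurrence, under loudly stated assumptions: the expression tree is a nested tuple, the left child plays the role of the foundset, and every cost figure is invented for illustration (the talk estimates costs from bitmap properties). Only the format requirements per algorithm follow the slides, with ExpGol left out to keep the sketch short.

```python
from functools import lru_cache

# algorithm -> (foundset format, allowed operand formats, output format, cost)
# The formats follow the slides; the cost numbers are made up for this sketch.
ALGORITHMS = {
    "Basic":   ("verbatim", ("verbatim",),  "verbatim", 4.0),
    "Inplace": ("verbatim", ("RLE", "BBC"), "verbatim", 2.0),
    "Merge":   ("RLE",      ("RLE",),       "RLE",      1.0),
    "Direct":  ("BBC",      ("BBC",),       "BBC",      3.0),
}


def translate_cost(src, dst):
    """Hypothetical flat cost for converting between two bitmap formats."""
    return 0.0 if src == dst else 5.0


@lru_cache(maxsize=None)
def best_cost(node, out_fmt):
    """Cheapest cost of producing node in out_fmt.
    node is ("leaf", stored_format) or (op, left_subtree, right_subtree)."""
    if node[0] == "leaf":
        return translate_cost(node[1], out_fmt)
    _, left, right = node
    best = float("inf")
    for found_fmt, operand_fmts, alg_out, alg_cost in ALGORITHMS.values():
        total = (alg_cost
                 + best_cost(left, found_fmt)                      # foundset side
                 + min(best_cost(right, f) for f in operand_fmts)  # operand side
                 + translate_cost(alg_out, out_fmt))               # output format
        best = min(best, total)
    return best


inner = ("and", ("leaf", "RLE"), ("leaf", "RLE"))
tree = ("or", inner, ("leaf", "verbatim"))
print(best_cost(inner, "RLE"))      # 1.0
print(best_cost(tree, "verbatim"))  # 10.0
```

Because each node only needs its children's best cost per output format, the table has one entry per (node, format) pair, which is what keeps the search local.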

23 Rewriting the Boolean Expression
Eliminating the NOT operation:
X And Not Y ==> X And_Not Y
X Or Not Y ==> X Or_Not Y
NOT can be an expensive operation, because it generally can be implemented using a Direct algorithm only. However, And_Not and Or_Not have fast algorithms:
And_Not : Inplace and Merge
Or_Not : Inplace
Rewrite collections of AND, OR clauses to encourage the use of fast algorithms:
OR: Convert the densest to Verbatim, then use Inplace
AND: Gather sparse operands and use Merge
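A sketch of the NOT-elimination rewrite and of an Inplace And_Not, assuming expressions are nested tuples with bitmap names as strings and that the NOT appears on the right-hand operand. The tuple layout and function names are hypothetical.

```python
def rewrite(node):
    """Fuse a NOT on the right-hand operand into its parent AND / OR."""
    if isinstance(node, str):          # a bitmap name
        return node
    if node[0] == "not":
        return ("not", rewrite(node[1]))
    op, left, right = node
    left, right = rewrite(left), rewrite(right)
    if op in ("and", "or") and isinstance(right, tuple) and right[0] == "not":
        return (op + "_not", left, right[1])
    return (op, left, right)


def inplace_and_not(foundset, operand_positions):
    """X And_Not Y: clear the foundset bits that are set in the operand."""
    for p in operand_positions:
        foundset[p] = 0


print(rewrite(("and", "X", ("not", "Y"))))  # ('and_not', 'X', 'Y')
found = [1, 1, 1, 0, 1]
inplace_and_not(found, [1, 4])
print(found)                                # [1, 0, 1, 0, 0]
```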

24 Current Status
Compressed Bitmap Representation: Convert between any pair of representations (verbatim, BBC, etc.). Perform AND, OR, NOT, AND_NOT, OR_NOT operations using Basic, Inplace, Merge, or Direct algorithms.
Index: One index per data file in delimited ASCII format. Simple load-on-demand index to bitmap blocks for attribute values.
Optimizer: Cost model for operations, transforms. Dynamic programming optimizer. Expression rewriting.
In combination: Works for simple expressions, still testing.

25 Furthermore ...
Test data set: 5 attributes, up to 100 unique values per attribute Mbytes. Gzip : .8 Mbytes. Compressed bitmap index on all attributes: .9 Mbytes.
So, I can compress a data set almost as efficiently as gzip, but with the data fully indexed.
There are fast algorithms for computing COUNT aggregates over bitmaps. Can I use the bitmap index to answer some OLAP queries? Conventional OLAP structures do not handle high dimensional data well, but high dimensionality poses only minor problems for bitmap indices.

26 Query Processing for OLAP
Queries are:
Select G1, ..., Gn, count(*)
From Fact_Table
Where C
Group By G1, ..., Gn
Strategy:
Compute the bitmap representing C.
For each value gij of Gi, compute the bitmap of gij.
Compute count(C And g1j1 And ... And gnjn).
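The strategy above can be sketched with bitmaps stored as Python integers (record i in bit i). The function name `group_by_count`, the dictionary-per-attribute index layout, and the choice to skip empty groups are assumptions; the sample bitmaps encode the seven-record State / Has_cell_phone table from the start of the talk.

```python
from itertools import product


def group_by_count(condition, group_indexes):
    """condition: bitmap of the Where clause C.
    group_indexes: one {value: bitmap} dict per Group By attribute.
    Returns {(g1, ..., gn): count} for the non-empty groups."""
    counts = {}
    for combo in product(*(index.items() for index in group_indexes)):
        word = condition
        for _, bitmap in combo:
            word &= bitmap               # C And g1j1 And ... And gnjn
        n = bin(word).count("1")         # COUNT aggregate
        if n:
            counts[tuple(value for value, _ in combo)] = n
    return counts


state = {"NY": 0b0100001, "NJ": 0b1010010, "CT": 0b0001100}
cell = {"Y": 0b0111001, "N": 0b1000110}
all_records = 0b1111111                  # C selects every record

result = group_by_count(all_records, [state, cell])
print(result[("NJ", "N")])               # 2
print(sum(result.values()))              # 7
```

The loop visits every combination of group values, so its cost grows with the product of the attribute cardinalities; skipping combinations whose partial AND is already empty would be a natural refinement.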

