March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani.


1 March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani

2 Problem Statement
Main objective: map logical requests to qualified objects
- A logical request: 20001015<=eventTime & 200<energy<300 …
- Objects: set of object ids; set of files containing the objects; offsets within the files, …

3 Application: STAR
A portion of the STAR tag dataset: 3 events with 12 attributes, from millions of events with 502 attributes.
OID dst hist mEventNumber mEventTime mRunNumber NLb
0159625159627263520000827.0 11759 12390291341
1159625159627263620000827.0 11759 12390291470
2159625159627263720000827.0 11759 12390291663
OID n_clus_tpc_in[13] numberOfPrimaryTracks ChargedParticles_Means[1] PrimaryVertexX qxb[2] zdc2Energy
09091228266.56-26.4048
112431415317.46-29.0853
212851533281.53-6.7548

4 Application: Combustion
Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations)
- A dozen or more variables are computed at each time step and each grid point
- Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
- Time steps: 100 >>> 1000s
- Data size: 1 GB >>> 10 TB
Task: identify features and track them across time steps
- e.g. find the flame front across time
- Find “600<temp<700” for 1 billion points per time step, and discover overlap between time steps
Use compressed bitmaps to accelerate both feature extraction and feature tracking
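The feature-tracking task can be illustrated with a minimal sketch, assuming one uncompressed bitmap per time step held in a Python integer. The helper names (`range_bitmap`, `set_bits`) and the temperature values are made up for illustration and are not from the slides.

```python
def range_bitmap(values, low, high):
    """Return an int whose bit i is 1 iff low < values[i] < high."""
    bits = 0
    for i, v in enumerate(values):
        if low < v < high:
            bits |= 1 << i
    return bits


def set_bits(bitmap):
    """List the positions of the 1 bits, lowest position first."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical temperatures at the same six grid points in two time steps.
temp_t0 = [550.0, 620.0, 655.0, 710.0, 690.0, 400.0]
temp_t1 = [560.0, 590.0, 660.0, 680.0, 705.0, 610.0]

front_t0 = range_bitmap(temp_t0, 600, 700)  # feature extraction at step 0
front_t1 = range_bitmap(temp_t1, 600, 700)  # feature extraction at step 1
overlap = front_t0 & front_t1               # feature tracking: one bitwise AND

print(set_bits(front_t0), set_bits(front_t1), set_bits(overlap))
```

Once each time step has its bitmap, finding the overlap costs a single AND, regardless of how the feature was defined.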

5 Building a Bitmap Index
1. Partition each property into bins (binning)
- e.g. for 0<NLb<4000, 20 equal-size bins: [0, 200), [200, 400), …
2. Generate a bit vector for each bin (encoding)
- Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector
[Figure: one group of bit vectors per property, labeled property 1, property 2, …, property n]
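Steps 1 and 2 can be sketched as follows, assuming equal-width bins and Python integers as uncompressed bit vectors. The function name `build_bitmap_index` is hypothetical, step 3 (compression) is left out, and the three NLb values are the ones that can be read off the STAR sample.

```python
def build_bitmap_index(values, low, high, num_bins):
    """Return one bit vector (as an int) per equal-width bin over [low, high).

    Bit i of bit vector j is 1 iff values[i] falls in bin j.
    Assumes every value lies in [low, high]; high itself goes to the last bin.
    """
    width = (high - low) / num_bins
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        j = min(int((v - low) // width), num_bins - 1)
        bitmaps[j] |= 1 << i
    return bitmaps


nlb = [1341, 1470, 1663]                       # NLb of the three sample events
index = build_bitmap_index(nlb, 0, 4000, 20)   # 20 bins of width 200

for j, bitmap in enumerate(index):
    if bitmap:
        print(f"bin {j} [{j * 200}, {(j + 1) * 200}): {bitmap:03b}")
```

Each row sets exactly one bit across the 20 bit vectors, which is why the vectors are sparse and compress well.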

6 Advantages of Bitmap Index
Bitmap index: a specialized index that takes advantage of
- Read-mostly data: data produced by scientific experiments can be appended in large groups
Fast operations
- “Predicate queries” can be performed with bitwise logical operations
- Predicate ops: =, <, >, <=, >=, range
- Logical ops: AND, OR, XOR, NOT
- They are well supported by hardware
Easy to compress, potentially small index size
Each individual bitmap is small, and frequently used ones can be cached in memory
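A minimal sketch of how a predicate query reduces to bitwise logical operations, assuming uncompressed bitmaps stored as Python integers. The bin layouts, the two-bin eventTime index, and the helper `or_all` are invented for illustration.

```python
def or_all(bitmaps):
    """OR a sequence of bitmaps together."""
    result = 0
    for b in bitmaps:
        result |= b
    return result


num_rows = 8
all_rows = (1 << num_rows) - 1

# Hypothetical energy index, bins [0,100), [100,200), [200,300), [300,400).
energy_bins = [0b00000011, 0b00001100, 0b00110000, 0b11000000]
# Hypothetical eventTime index with two bins: before a cutoff, at or after it.
time_bins = [0b01010101, 0b10101010]

energy_200_400 = or_all(energy_bins[2:4])  # range predicate: OR of the bins it covers
late = time_bins[1]                        # eventTime at or after the cutoff

both = energy_200_400 & late               # AND of two predicates
not_late = all_rows & ~late                # NOT, masked to the number of rows
one_only = energy_200_400 ^ late           # XOR: exactly one predicate holds
hit_count = bin(both).count("1")           # COUNT of qualifying rows

print(f"{both:08b} {not_late:08b} {one_only:08b} {hit_count}")
```

The mask in `not_late` is needed because Python integers are unbounded; a fixed-width word would not need it.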

7 Operation-efficient Compression Methods
Best known: byte-aligned bitmap code (BBC)
- Uses run-length encoding (next slide)
- Byte alignment, optimized for space efficiency
- Decoding at the bit level, not optimal for operations
- Used in Oracle
We developed a new word-aligned scheme: WAH
- Uses run-length encoding
- Word alignment
- Designed for minimal decoding to gain speed

8 Operation-efficient Compression Methods
Based on variations of run-length compression
Uncompressed: 0000000000001111000000000......0000001000000001111111100000000.... 000000
Compressed: 12, 4, 1000, 1, 8, 1000
Store very short sequences as-is
Advantage: can perform AND, OR, COUNT operations on compressed data
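The idea of operating on compressed data can be sketched with plain run-length encoding. This is neither BBC nor WAH; it assumes a bitmap is stored as alternating run lengths that start with a run of 0s, and the function names are hypothetical. The sample bitmap reproduces the first five runs of the slide's example.

```python
def rle_encode(bits):
    """Encode a '0'/'1' string as alternating run lengths, starting with a run of 0s."""
    runs, current, count = [], "0", 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs


def rle_count(runs):
    """COUNT the 1 bits: runs at odd positions are runs of 1s."""
    return sum(runs[1::2])


def rle_and(a, b):
    """AND two run lists covering the same number of bits, without expanding them."""
    out, out_bit, out_len = [], 0, 0
    i = j = 0
    rem_a, rem_b = a[0], b[0]
    while True:
        while rem_a == 0 and i + 1 < len(a):   # move to the next run of a
            i += 1
            rem_a = a[i]
        while rem_b == 0 and j + 1 < len(b):   # move to the next run of b
            j += 1
            rem_b = b[j]
        if rem_a == 0 or rem_b == 0:
            break
        step = min(rem_a, rem_b)               # consume the overlap of the two runs
        bit = (i % 2) & (j % 2)
        if bit == out_bit:
            out_len += step
        else:
            out.append(out_len)
            out_bit, out_len = bit, step
        rem_a -= step
        rem_b -= step
    out.append(out_len)
    return out


slide_like = rle_encode("0" * 12 + "1" * 4 + "0" * 1000 + "1" + "0" * 8)
x, y = rle_encode("0011100"), rle_encode("0110110")
print(slide_like, rle_count(slide_like), rle_and(x, y))
```

The AND loop advances a whole run at a time, so its cost depends on the number of runs rather than the number of bits.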

9 Trade-off of Compression Schemes
[Figure: compression schemes plotted by space and speed, with the "better" direction marked: uncompressed, WAH, gzip, BBC, ExpGol, PacBits]

10 Information About the Test Machines
Hardware and system
- Sun Enterprise 450 (UltraSPARC II, 400 MHz)
- 4 GB RAM
- VERITAS volume manager (striped disk)
Real application data from STAR
- Above 2 million objects, 12 attributes
Synthetic data
- 100 million objects, 10 attributes
Terms
- Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
- Times reported are wall-clock times in seconds

11 Logical Operation Time (Synthetic Data)
10X improvement

12 Logical Operation Time (STAR Data)
Also 10X improvement

13 Encoding Schemes – Main Idea
Equality encoding
Range encoding
Interval encoding
[Figure: the three encodings illustrated over 12 bins, numbered 1 to 12]
Interval, Range encoding: operates on 2 bins only!
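A minimal sketch contrasting equality and range encoding, assuming the common definitions: equality bitmap j marks rows in bin j, and range bitmap j marks rows in bins 0 through j. Interval encoding is not shown. Function names and the sample bin assignments are hypothetical, and bins are numbered from 0 here.

```python
def equality_encode(bin_ids, num_bins):
    """Bitmap j marks the rows whose bin is exactly j."""
    bitmaps = [0] * num_bins
    for row, b in enumerate(bin_ids):
        bitmaps[b] |= 1 << row
    return bitmaps


def range_encode(bin_ids, num_bins):
    """Bitmap j marks the rows whose bin is <= j."""
    bitmaps = [0] * num_bins
    for row, b in enumerate(bin_ids):
        for j in range(b, num_bins):
            bitmaps[j] |= 1 << row
    return bitmaps


def query_equality(eq, lo, hi):
    """Rows in bins lo..hi: one bitmap per bin in the range, combined with OR."""
    result = 0
    for j in range(lo, hi + 1):
        result |= eq[j]
    return result


def query_range(rng, lo, hi):
    """Rows in bins lo..hi: at most two bitmaps, (bin <= hi) AND NOT (bin <= lo - 1)."""
    if lo == 0:
        return rng[hi]
    return rng[hi] & ~rng[lo - 1]


bin_ids = [0, 3, 7, 11, 5, 3, 9, 4]     # hypothetical bin of each of 8 rows
eq = equality_encode(bin_ids, 12)
rng = range_encode(bin_ids, 12)

print(f"{query_equality(eq, 3, 7):08b}", f"{query_range(rng, 3, 7):08b}")
```

Both queries return the same rows, but the equality version touches five bitmaps for bins 3 to 7 while the range version touches two.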

14 Total Effect of Compression and Encoding Schemes
Bottom line on queries
- Compression scheme determines efficiency of logical operations
- Encoding scheme determines number of operations
  Range & interval: only one logical operation over 2 bitmaps
  Equality: many operations, depending on number of bins
- But space may be a consideration
What is the trade-off?

15 Interval Encoding Is Better Overall (WAH Compression)
Average time for random range queries
Points on the graphs represent 10, 20, 30, 50, and 100 bins.

16 Timing Results
Method                          Index (X data)   Time (sec)   Speed
ORACLE Scan                     0                6            0.1
ORACLE B-tree                   3.6              0.95         0.6
Native vertical partition Scan  0                0.57         1
20 bins                         0.18             0.11         5
50 bins                         0.43             0.07         8
100 bins                        0.90             0.05         11

17 Summary
Compressed bitmap indices are effective for range queries
Better compression scheme
- 50% more space, but 12 times faster !!!
Among the different encoding schemes
- Interval encoding is the overall winner

18 Future Work
- Support NULL values and categorical values
- On-line update: add new data and update the index without interrupting request processing
- Recovery mechanism for robustness
- Potential new applications: climate, astrophysics, biology (microarrays)
- Study non-uniform binning strategies
- Study more encoding schemes
- Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end

19 How Many Bins for Continuous Domains?
[Figure: a two-dimensional range query over Range(x) and Range(y), with the edge bins along its boundary marked]
More bins: fewer objects in edge bins
Searching edge bins: skip-scan over “attribute vertical partition”
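The edge-bin step can be sketched for a single attribute, assuming the raw column values stay available alongside the bin bitmaps. Bins fully inside the query range are accepted from the index alone, and only rows in partially covered bins are checked against the raw values. The function names, bin edges, and data are invented for illustration.

```python
from bisect import bisect_right


def build_bin_bitmaps(values, edges):
    """One bitmap per bin [edges[j], edges[j+1]); assumes values lie within the edges."""
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        bitmaps[bisect_right(edges, v) - 1] |= 1 << row
    return bitmaps


def range_query(values, bitmaps, edges, lo, hi):
    """Return (hits, scanned) for lo <= value < hi.

    hits is a bitmap of qualifying rows; scanned counts raw values examined.
    """
    sure = 0         # rows in bins fully inside the query range
    candidates = 0   # rows in edge bins, which only partly overlap the range
    for j, bitmap in enumerate(bitmaps):
        b_lo, b_hi = edges[j], edges[j + 1]
        if lo <= b_lo and b_hi <= hi:
            sure |= bitmap
        elif b_lo < hi and lo < b_hi:
            candidates |= bitmap
    hits, scanned = sure, 0
    for row in range(candidates.bit_length()):
        if (candidates >> row) & 1:          # skip rows that are not candidates
            scanned += 1
            if lo <= values[row] < hi:
                hits |= 1 << row
    return hits, scanned


edges = [0, 100, 200, 300, 400]
values = [50, 120, 180, 250, 310, 390, 199, 260]
bitmaps = build_bin_bitmaps(values, edges)
hits, scanned = range_query(values, bitmaps, edges, 150, 320)

print(f"{hits:08b}", scanned)
```

With narrower bins, fewer rows land in edge bins, so fewer raw values need to be read; the cost is more bitmaps to store.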

