March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani.

Presentation transcript:

Problem Statement
Main objective: map logical requests to qualified objects
- A logical request: <=eventTime & 200<energy<300 …
- Objects: set of object IDs; set of files containing the objects; offsets within the files, …

Application: STAR
[Table: a portion of the STAR tag dataset, 3 events with 12 attributes, from millions of events with 502 attributes. Column headers: OID, dsthist, mEventNumber, mEventTime, mRunNumber, NLb, n_clus_tpc_in[13], numberOfPrimaryTracks, ChargedParticles_Means[1], PrimaryVertexX, qxb[2], zdc2Energy. The cell values are missing from the transcript.]

Application: Combustion
- Direct numerical simulation of auto-ignition process (solution of complex partial differential equations)
- A dozen or more variables are computed at each time step and each grid point
- Number of grid points: 2D 600 x 600 >>> 3D 1000 x 1000 x 1000
- Time steps: 100 >>> 1000s
- Data size: 1 GB >>> 10 TB
- Task: identify features and track them across time steps
  - e.g., find the flame front across time
  - Find "600<temp<700" for 1 billion points per time step, and discover overlap between time steps
- Use compressed bitmaps to accelerate both feature extraction and feature tracking
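
The feature-tracking task above can be sketched as follows. This is only an illustration under stated assumptions: the bitmaps are uncompressed Python integers, the six temperatures per time step are made up, and the names `feature_bitmap` and `overlap_count` are hypothetical rather than taken from the presentation.

```python
def feature_bitmap(temps, lo, hi):
    """Bit i is set iff lo < temps[i] < hi (one bit per grid point)."""
    bits = 0
    for i, t in enumerate(temps):
        if lo < t < hi:
            bits |= 1 << i
    return bits


def overlap_count(bitmap_a, bitmap_b):
    """Number of grid points that satisfy the condition in both time steps."""
    return bin(bitmap_a & bitmap_b).count("1")


# Hypothetical temperatures at the same six grid points in two time steps.
step0 = [550.0, 620.0, 680.0, 710.0, 650.0, 590.0]
step1 = [610.0, 640.0, 720.0, 690.0, 660.0, 580.0]

front0 = feature_bitmap(step0, 600, 700)   # grid points 1, 2, 4
front1 = feature_bitmap(step1, 600, 700)   # grid points 0, 1, 3, 4
shared = overlap_count(front0, front1)     # grid points 1 and 4 are in both
```

Once the per-time-step bitmaps exist, the overlap is a single bitwise AND and a bit count, with no second pass over the raw temperatures.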

Building a Bitmap Index
1. Partition each property into bins (binning)
   - e.g., for 0<NLb<4000, 20 equal-size bins: [0, 200), [200, 400), …
2. Generate a bit vector for each bin (encoding)
   - Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector
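
Steps 1 and 2 above can be sketched in a few lines. The sketch assumes equal-width bins over the half-open range [lo, hi), stores each bit vector as a Python integer, and leaves out step 3 (compression). The function name `build_bitmap_index` and the sample NLb values are made up for illustration.

```python
def build_bitmap_index(values, lo, hi, num_bins):
    """Return one bit vector (a Python int) per equal-width bin over [lo, hi)."""
    width = (hi - lo) / num_bins
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        if not lo <= v < hi:
            raise ValueError(f"value {v} outside [{lo}, {hi})")
        # Binning: pick bin j; min() guards against floating-point rounding.
        j = min(int((v - lo) // width), num_bins - 1)
        # Encoding: bit i of bit vector j is 1 iff values[i] is in bin j.
        bitmaps[j] |= 1 << i
    return bitmaps


nlb = [50, 250, 399, 3999, 200]              # made-up NLb values
index = build_bitmap_index(nlb, 0, 4000, 20)
# index[1] has bits 1, 2 and 4 set: rows 250, 399 and 200 fall in [200, 400).
```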

Advantages of Bitmap Index
- Bitmap index: a specialized index that takes advantage of
  - Read-mostly data: data produced from scientific experiments can be appended in large groups
- Fast operations
  - "Predicate queries" can be performed with bitwise logical operations
    - Predicate ops: =, <, >, <=, >=, range
    - Logical ops: AND, OR, XOR, NOT
  - They are well supported by hardware
- Easy to compress, potentially small index size
- Each individual bitmap is small, and frequently used ones can be cached in memory
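
As a rough sketch of how a predicate query reduces to bitwise logical operations, the example below combines two hypothetical pre-binned attributes. The bin assignments, the helper `bin_bitmaps`, and the use of Python integers as uncompressed bitmaps are all assumptions made for illustration.

```python
def bin_bitmaps(bin_ids, num_bins):
    """Equality-encoded index: bitmaps[j] has bit i set iff row i is in bin j."""
    bitmaps = [0] * num_bins
    for i, j in enumerate(bin_ids):
        bitmaps[j] |= 1 << i
    return bitmaps


num_rows = 6
all_rows = (1 << num_rows) - 1               # mask used to implement NOT

# Hypothetical bin number of each row, for two attributes.
energy = bin_bitmaps([2, 0, 2, 1, 3, 2], num_bins=4)
nlb = bin_bitmaps([0, 1, 1, 0, 2, 2], num_bins=3)

energy_1_or_2 = energy[1] | energy[2]        # range over adjacent bins: OR
nlb_not_0 = all_rows & ~nlb[0]               # NOT, masked to the row count
answer = energy_1_or_2 & nlb_not_0           # conjunction of predicates: AND
matching_rows = [i for i in range(num_rows) if answer >> i & 1]
```

Each operator touches whole machine words at a time, which is why these operations map well onto hardware.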

Operation-efficient Compression Methods
- Best known: byte-aligned bitmap code (BBC)
  - Uses run-length encoding (next slide)
  - Byte alignment, optimized for space efficiency
  - Decoding at the bit level, not optimal for operations
  - Used in Oracle
- We developed a new word-aligned scheme: WAH
  - Uses run-length encoding
  - Word alignment
  - Designed for minimal decoding to gain speed

Operation-efficient Compression Methods
- Uncompressed: (bit sequence missing from the transcript)
- Compressed: 12, 4, 1000, 1, 8, 1000
- Store very short sequences as-is
- Advantage: can perform AND, OR, COUNT operations on compressed data
- Based on variations of run-length compression
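
To illustrate the claim that AND and COUNT can run directly on compressed data, here is a minimal sketch using plain run-length encoding. It is neither BBC nor WAH: it ignores byte and word alignment and does not store short sequences as-is. The `(bit, run_length)` representation and the function names are assumptions made for this example.

```python
def rle_encode(bits):
    """Collapse a list of 0/1 values into (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs


def rle_and(a, b):
    """AND two run lists of equal total length without expanding them."""
    out = []
    i = j = 0
    used_a = used_b = 0            # bits already consumed from the current runs
    while i < len(a) and j < len(b):
        bit_a, len_a = a[i]
        bit_b, len_b = b[j]
        step = min(len_a - used_a, len_b - used_b)
        bit = bit_a & bit_b
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)   # extend the previous run
        else:
            out.append((bit, step))
        used_a += step
        used_b += step
        if used_a == len_a:
            i += 1
            used_a = 0
        if used_b == len_b:
            j += 1
            used_b = 0
    return out


def rle_count(runs):
    """COUNT: number of 1 bits, computed from the run lengths alone."""
    return sum(length for bit, length in runs if bit)


a = rle_encode([0] * 12 + [1] * 4 + [0] * 1000)
b = rle_encode([0] * 10 + [1] * 4 + [0] * 1002)
both = rle_and(a, b)               # three runs instead of 1016 separate bits
```

The work done by `rle_and` is proportional to the number of runs, not the number of bits, which is the property that operation-efficient schemes aim to preserve.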

Trade-off of Compression Schemes
[Figure: space versus speed for uncompressed, WAH, gzip, BBC, ExpGol, and PacBits, with the "better" direction marked.]

Information About the Test Machines
- Hardware and system
  - Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  - 4 GB RAM
  - VERITAS Volume Manager (striped disk)
- Real application data from STAR
  - Above 2 million objects, 12 attributes
- Synthetic data
  - 100 million objects, 10 attributes
- Terms
  - Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
  - Times reported are wall-clock times in seconds

Logical Operation Time (Synthetic Data)
[Figure: logical operation time on synthetic data, annotated "10X improvement".]

Logical Operation Time (STAR Data)
[Figure: logical operation time on STAR data, annotated "Also 10X improvement".]

Encoding Schemes – Main Idea
[Figure: equality encoding, range encoding, and interval encoding, illustrated with 12 bins.]
- Interval, range encoding: operates on 2 bins only!

Total Effect of Compression and Encoding Schemes
- Bottom line on queries
  - Compression scheme determines efficiency of logical operations
  - Encoding scheme determines number of operations
    - Range & interval: only one logical operation over 2 bitmaps
    - Equality: many operations, depending on number of bins
  - But space may be a consideration
- What is the trade-off?
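
The difference in operation counts can be sketched for equality and range encoding; interval encoding is not shown. The sketch assumes the standard definition of range encoding, in which bitmap j marks every row whose bin number is at most j, and uses uncompressed Python integers. The sample bin assignments and function names are hypothetical.

```python
def equality_encode(bin_ids, num_bins):
    """bitmaps[j] has bit i set iff row i is in bin j."""
    bitmaps = [0] * num_bins
    for i, j in enumerate(bin_ids):
        bitmaps[j] |= 1 << i
    return bitmaps


def range_encode(bin_ids, num_bins):
    """bitmaps[j] has bit i set iff row i is in a bin numbered <= j."""
    encoded, acc = [], 0
    for bm in equality_encode(bin_ids, num_bins):
        acc |= bm
        encoded.append(acc)
    return encoded


def query_equality(eq, first, last):
    """Rows in bins first..last: OR together one bitmap per bin."""
    result = 0
    for j in range(first, last + 1):
        result |= eq[j]
    return result, last - first + 1          # (answer, bitmaps read)


def query_range(rng, first, last):
    """Rows in bins first..last: at most two bitmaps and one XOR."""
    if first == 0:
        return rng[last], 1
    # rng[first - 1] is a subset of rng[last], so XOR removes exactly it.
    return rng[last] ^ rng[first - 1], 2


bin_ids = [0, 3, 5, 11, 7, 3]                # made-up bin number per row
eq = equality_encode(bin_ids, 12)
rng = range_encode(bin_ids, 12)
eq_answer, eq_reads = query_equality(eq, 3, 7)
rng_answer, rng_reads = query_range(rng, 3, 7)
```

Both queries return the same rows, but the equality-encoded index reads five bitmaps where the range-encoded index reads two. The price is that range-encoded bitmaps are denser, which is where the space consideration comes in.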

Interval Encoding Is Better Overall (WAH Compression)
[Figure: average time for random range queries. Points on the graphs represent 10, 20, 30, 50, and 100 bins.]

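Timing Results
[Table: columns Method, Index (X data), Time (sec), Speed. Row labels as extracted: ORACLE, Scan, B-tree, Native, vertical partition, Scan, and three rows labeled by number of bins. The only numeric text that survives is "060.1" on the ORACLE scan row.]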

Summary
- Compressed bitmap indices are effective for range queries
- Better compression scheme
  - 50% more space, but 12 times faster!
- Among the different encoding schemes
  - Interval encoding is the overall winner

Future Work
- Support NULL values and categorical values
- On-line update: add new data and update the index without interrupting request processing
- Recovery mechanism for robustness
- Potential new applications: climate, astrophysics, biology (microarrays)
- Study non-uniform binning strategies
- Study more encoding schemes
- Integrate with a conventional database system to better handle metadata and to provide a more versatile front-end

How Many Bins for Continuous Domains?
[Figure: a two-dimensional query region over Range(x) and Range(y), with the edge bins marked.]
- More bins: fewer objects in edge bins
- Searching edge bins: skip-scan over the "attribute vertical partition"
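
A minimal sketch of the edge-bin idea for one attribute follows. Bins that lie fully inside the query range are answered from the bitmaps alone, while rows in partially covered (edge) bins are checked against the raw column. The bin edges, sample values, and function names are assumptions made for illustration; the raw column stands in for the "attribute vertical partition".

```python
import bisect


def build_index(values, edges):
    """One bitmap per bin; bin j covers [edges[j], edges[j + 1])."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        j = bisect.bisect_right(edges, v) - 1
        bitmaps[j] |= 1 << i
    return bitmaps


def binned_range_query(values, edges, bitmaps, qlo, qhi):
    """Rows with qlo <= value < qhi, plus how many raw values were checked."""
    hits = 0
    candidates = 0
    for j, bm in enumerate(bitmaps):
        b_lo, b_hi = edges[j], edges[j + 1]
        if qlo <= b_lo and b_hi <= qhi:
            hits |= bm                       # bin fully inside: sure hits
        elif b_lo < qhi and qlo < b_hi:
            candidates |= bm                 # edge bin: needs a raw check
    checked = 0
    i = 0
    rest = candidates
    while rest:                              # visit only candidate rows
        if rest & 1:
            checked += 1
            if qlo <= values[i] < qhi:
                hits |= 1 << i
        rest >>= 1
        i += 1
    return hits, checked


values = [50, 250, 399, 610, 650, 700, 900]  # made-up attribute values
edges = [0, 200, 400, 600, 800, 1000]
bitmaps = build_index(values, edges)
hits, checked = binned_range_query(values, edges, bitmaps, 300, 800)
```

Only the two rows in the edge bin [200, 400) are compared against raw values here. With narrower bins, fewer rows land in edge bins and less of the raw column has to be read.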