RUBIK: Efficient Threshold Queries on Massive Time Series
Thomas Heinis*, Eleni Tzirita Zacharatou ‡, Farhan Tauheed §, Anastasia Ailamaki ‡

Presentation transcript:

1 RUBIK: Efficient Threshold Queries on Massive Time Series
Thomas Heinis*, Eleni Tzirita Zacharatou ‡, Farhan Tauheed §, Anastasia Ailamaki ‡
* Imperial College London, ‡ École Polytechnique Fédérale de Lausanne, § Oracle Labs, Zurich

2 Scaling up Brain Simulations
[Figure: 3D neuron model with voltage-over-time traces; axes: temporal resolution and model resolution]
Time series analysis: key to neuroscientific discovery.

3 Threshold queries fuel efficient data analysis
Uses: exploration and hypothesis testing.
Neuron firing: which and when.
Identify subsets of interest: time series where voltage > -40 and time step ∈ [300,400] (a threshold query).
[Figure: voltage over time]
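To make the query on this slide concrete, here is a minimal sketch of a threshold query evaluated by a plain scan, with no index. The function name, the toy data, and the reading of the query as "above the threshold at any step in the range" are assumptions made for illustration; the slides do not state these details.

```python
from typing import Dict, List


def threshold_query(series: Dict[int, List[float]], threshold: float,
                    t_start: int, t_end: int) -> List[int]:
    """Return ids of time series exceeding `threshold` at any step in [t_start, t_end]."""
    hits = []
    for series_id, values in series.items():
        window = values[t_start:t_end + 1]  # inclusive time-step range
        if any(v > threshold for v in window):
            hits.append(series_id)
    return hits


# Toy data: three series of five time steps each (values are made up).
toy = {
    0: [-70.0, -65.0, -38.0, -60.0, -70.0],
    1: [-70.0, -70.0, -70.0, -70.0, -35.0],
    2: [-70.0, -69.0, -68.0, -67.0, -66.0],
}
result = threshold_query(toy, threshold=-40.0, t_start=1, t_end=3)
```

A scan like this touches every value in the queried window, which is the cost an index over precomputed answers tries to avoid.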

4 Time Series Correlation… …enables efficient time series-specific compression
Trend | Correlation | Opportunity to scale with
Increased simulation duration | Across time | Increase in temporal resolution
Increasingly detailed models | Across time series | Increase in spatial resolution
[Figure: voltage by time step and time series id]

5 Time Series Data Discretization
Binning: partition the values into bins. Bins: 0: [0-5), 1: [5-10), 2: [10-15), 3: [15-20).
Range encoding: set bin to ‘1’ if condition satisfied, ‘0’ otherwise. Conditions: ≥ 5, ≥ 10, ≥ 15.
Precomputed answers stored as a bitmap.
Increased similarity across time series.
[Figure: value per timestep and the resulting timestep × bin bitmap]
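The binning and range encoding described on this slide can be sketched as follows. The bin edges are taken from the slide; the function names, the sample values, and the row-per-condition layout of the bitmap are assumptions for illustration only.

```python
from typing import List

BIN_EDGES = [0, 5, 10, 15, 20]  # bins 0: [0-5), 1: [5-10), 2: [10-15), 3: [15-20)


def bin_of(value: float) -> int:
    """Return the index of the bin [lo, hi) that contains `value`."""
    for i in range(len(BIN_EDGES) - 1):
        if BIN_EDGES[i] <= value < BIN_EDGES[i + 1]:
            return i
    raise ValueError(f"value {value} outside the binned range")


def range_encode(values: List[float]) -> List[List[int]]:
    """Return one bitvector per condition 'value >= lower edge of bin b', b = 1..3.

    bitmap[k][t] is 1 if the value at timestep t lies in bin k + 1 or above.
    """
    bins = [bin_of(v) for v in values]
    return [[1 if b >= k else 0 for b in bins] for k in range(1, len(BIN_EDGES) - 1)]


values = [3, 7, 12, 18, 9]  # hypothetical series, one value per timestep
bitmap = range_encode(values)
# bitmap[1] answers "value >= 10" for every timestep without touching the raw data.
```

With this layout a threshold that coincides with a bin edge is answered by reading a single precomputed row.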

6 Bitmap Compression Today
Run-length encoding compresses each bitvector ⇒ Word-Aligned Hybrid Code (WAH) [SSDBM ’02].
Example runs: 4×‘0’; 2×‘0’, 1×‘1’, 1×‘0’; 3×‘1’, 1×‘0’.
Compression prevents direct access ⇒ timesteps don’t correspond to bit positions.
Values filtered independently of timesteps.
Similarities across time series are not exploited.
[Figure: timestep × bin bitmap]
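A small sketch of why run-length encoding gets in the way of direct access: once a bitvector is stored as runs, reading the bit for one timestep means walking the runs. This is plain run-length encoding, not WAH itself, and the function names are made up for the example. The three sample rows reproduce the runs listed on the slide.

```python
from typing import List, Tuple


def rle_encode(bits: List[int]) -> List[Tuple[int, int]]:
    """Compress a bitvector into (bit, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((b, 1))              # start a new run
    return runs


def bit_at(runs: List[Tuple[int, int]], position: int) -> int:
    """Read one bit; the runs must be scanned because positions are implicit."""
    offset = 0
    for bit, length in runs:
        if position < offset + length:
            return bit
        offset += length
    raise IndexError("position beyond the encoded bitvector")


rows = [[0, 0, 0, 0], [0, 0, 1, 0], [1, 1, 1, 0]]
encoded = [rle_encode(r) for r in rows]
```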

7 Our Approach: RUBIK
Bitmap index creation.
Bitmap stacking.
Quadtree-based bitmap decomposition.
Access specific timesteps.
Exploit similarities.

8 Quadtree-based 3D Bitmap Decomposition
[Figure: 3D bitmap with axes timestep, time series, and bins; the start node is labeled Mix]

9 Quadtree-based 3D Bitmap Decomposition
[Figure: quadtree with the start node (Mix), a first split, and a second split; nodes are labeled All 0, All 1, or Mix]
Apply WAH.
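The splitting into All 0, All 1, and Mix nodes can be illustrated with a minimal quadtree over a single 2D bitmap. This is only a sketch under simplifying assumptions: a square bitmap whose side is a power of two, nested tuples as the tree representation, and no WAH step on the remaining nodes. It is not RUBIK's actual 3D decomposition.

```python
from typing import List, Tuple, Union

# A node is 0 (All 0), 1 (All 1), or a 4-tuple of child quadrants (Mix).
Node = Union[int, Tuple["Node", "Node", "Node", "Node"]]


def decompose(bitmap: List[List[int]], row: int, col: int, size: int) -> Node:
    """Recursively split the size x size square at (row, col) until it is uniform."""
    cells = [bitmap[r][c] for r in range(row, row + size)
             for c in range(col, col + size)]
    if all(b == 0 for b in cells):
        return 0
    if all(b == 1 for b in cells):
        return 1
    half = size // 2
    return (decompose(bitmap, row, col, half),                # top-left
            decompose(bitmap, row, col + half, half),         # top-right
            decompose(bitmap, row + half, col, half),         # bottom-left
            decompose(bitmap, row + half, col + half, half))  # bottom-right


bitmap = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
tree = decompose(bitmap, 0, 0, 4)
```

In this toy input the first split already yields one All 0, one All 1, and one further All 0 quadrant; only the mixed quadrant is split a second time.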

10 Query Execution
Query: voltage > 11 in time steps 1 and 2.
Transformation into a 2D bitmap problem.
One tree traversal to retrieve multiple bitmaps.
[Figure: quadtree with nodes labeled Mix, All 0, and All 1 over a timestep × bin bitmap]
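One way to picture query execution as a 2D problem is a rectangle lookup on a quadtree: the queried bins and timesteps form a rectangle, and a single traversal skips All 0 nodes and answers All 1 nodes without descending. The sketch below assumes a tree stored as nested tuples (0 for All 0, 1 for All 1, a 4-tuple for Mix), with rows standing for bins and columns for timesteps. These choices and the function name are illustrative, not taken from the slides.

```python
from typing import Tuple, Union

Node = Union[int, Tuple["Node", "Node", "Node", "Node"]]


def count_ones(node: Node, row: int, col: int, size: int,
               r0: int, r1: int, c0: int, c1: int) -> int:
    """Count set bits of the quadrant at (row, col) inside rows [r0, r1) x cols [c0, c1)."""
    # Overlap of this quadrant with the query rectangle.
    rows = min(row + size, r1) - max(row, r0)
    cols = min(col + size, c1) - max(col, c0)
    if rows <= 0 or cols <= 0 or node == 0:
        return 0            # disjoint, or an All 0 leaf: nothing to report
    if node == 1:
        return rows * cols  # All 1 leaf: answered without descending further
    half = size // 2
    nw, ne, sw, se = node
    return (count_ones(nw, row, col, half, r0, r1, c0, c1)
            + count_ones(ne, row, col + half, half, r0, r1, c0, c1)
            + count_ones(sw, row + half, col, half, r0, r1, c0, c1)
            + count_ones(se, row + half, col + half, half, r0, r1, c0, c1))


# Hypothetical quadtree of a 4 x 4 bitmap: quadrants in order
# top-left, top-right, bottom-left, bottom-right.
tree = (0, 1, (0, 1, 1, 1), 0)
# Rows 2..3 (two bins) and columns 1..2 (two timesteps) in one traversal.
hits = count_ones(tree, 0, 0, 4, r0=2, r1=4, c0=1, c1=3)
total = count_ones(tree, 0, 0, 4, r0=0, r1=4, c0=0, c1=4)
```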

11 Stacking Time Series Bitmaps
Goal: maximize size and number of common squares ⇒ maximize compression across time series.
[Figure: bitmaps 1 to 3 grouped into cluster 1 and cluster 2, with regions labeled Mix, All 0, and All 1]
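The slides do not say how the bitmaps are clustered before stacking. Purely as an illustration of the goal, the sketch below orders bitmaps greedily so that each one is followed by the most similar remaining bitmap, using Hamming distance as the similarity measure. Both the greedy ordering and the distance measure are assumptions and should not be read as RUBIK's method.

```python
from typing import List


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions where two equally long bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_stack_order(bitmaps: List[List[int]]) -> List[int]:
    """Order bitmap indices so each one follows its most similar unplaced neighbor."""
    if not bitmaps:
        return []
    order = [0]
    remaining = set(range(1, len(bitmaps)))
    while remaining:
        last = bitmaps[order[-1]]
        # Ties are broken by the smaller index to keep the result deterministic.
        nxt = min(remaining, key=lambda i: (hamming(last, bitmaps[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Flattened toy bitmaps, one per time series (values are made up).
bitmaps = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
order = greedy_stack_order(bitmaps)
```

Placing similar bitmaps next to each other makes it more likely that neighboring layers share uniform regions, which is the stated goal of stacking.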

12 Scaling with Data Volume
Datasets: 300K – 1.2M time series, 1000 time steps, 1.2GB – 4.8GB.
Benchmark: 60 threshold queries, random thresholds, up to 11% selectivity.
In-memory indexes: FastBitF (WAH-compressed bitmap index), FastBit API, and RUBIK.
Configuration: 128 bins.
Hardware: AMD Opteron, 2.7GHz, 32GB RAM.
The speedup increases from 9x to 23x.
RUBIK index size scales sublinearly.

13 RUBIK Sensitivity Analysis
Datasets: 500K – 2M time series, 1024 time steps, 2.1GB – 8.4GB.
Benchmark: 60 threshold queries, random thresholds, up to 15% selectivity.
Configuration: 128 bins.
Hardware: AMD Opteron, 2.7GHz, 32GB RAM.
~80% of the time is spent on filtering.
Increased similarity ⇒ increased compression.
[Figure labels: 6.7X, 5.8X, 7.5X]

14 Threshold Queries on Time Series
Subsets of interest in neuroscience simulations.
RUBIK outperforms state-of-the-art by using:
– Quadtree decomposition ⇒ transformation into a 2D bitmap problem
– Time series clustering ⇒ similarities across time series are exploited
RUBIK scales particularly well with time series from increasingly detailed simulation models.
Thank you!

15 Scientific Simulations
[Figure: experimental measurement, model, simulation, and analysis; axis: time]

16 Stacking Time Series Bitmaps
[Figure: bitmap regions labeled All 0, All 1, and Mix, grouped into clusters (cluster 1, cluster 2)]

17 Experimental Methodology
Datasets:
– Neuroscience: 300K – 1.2M time series, 1000 time steps, 1.2GB – 4.8GB on disk
– Synthetic: 500K – 2M time series, 1024 time steps, 2.1GB – 8.4GB on disk
Benchmark: 60 threshold queries, random thresholds, selectivity up to 15%.
Software: RUBIK, FastBitF (WAH-compressed bitmap index), FastBit API.
Hardware: AMD Opteron, 2.7GHz, 32GB RAM.

18 Datasets
Neuroscience dataset and synthetic dataset.
Synthetic data generation: impulse response, spike excitation.
Parameters:
– time offset of the excitation
– time constant of the model
– sensitivity factor of the model (amplitude of the response)
Additional Gaussian noise (activity independent of the excitation).
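The parameters listed for synthetic data generation can be tied together in a small generator. The slide does not give the response function, so the exponential decay used below is an assumption, as are the function name and all parameter values.

```python
import math
import random
from typing import List


def synthetic_series(n_steps: int, offset: int, time_constant: float,
                     sensitivity: float, noise_std: float,
                     rng: random.Random) -> List[float]:
    """Exponentially decaying response to one spike at `offset`, plus Gaussian noise."""
    series = []
    for t in range(n_steps):
        # Assumed impulse response: zero before the spike, exponential decay after it.
        if t >= offset:
            response = sensitivity * math.exp(-(t - offset) / time_constant)
        else:
            response = 0.0
        series.append(response + rng.gauss(0.0, noise_std))
    return series


rng = random.Random(42)  # fixed seed so the sketch is reproducible
clean = synthetic_series(8, offset=3, time_constant=2.0, sensitivity=10.0,
                         noise_std=0.0, rng=rng)
noisy = synthetic_series(8, offset=3, time_constant=2.0, sensitivity=10.0,
                         noise_std=0.5, rng=rng)
```

Here `offset` plays the role of the time offset of the excitation, `time_constant` controls how fast the response decays, and `sensitivity` sets the amplitude of the response.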

19 Bitmap Compression: FastBit Approach
Indexing software for scientific applications.
Key innovation: Word-Aligned Hybrid (WAH) compression
– Variation of Run-Length Encoding
– Encode/decode bitmaps in word size chunks
– Minimal decoding to gain speed
FastBitF: one-dimensional indexing on the observation value; filtering according to queried time boundaries.
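The idea of encoding bitmaps in word-size chunks can be sketched with a simplified, WAH-like encoder: bits are cut into groups of one bit less than the word size, uniform groups are merged into fill entries, and mixed groups are kept as literals. This is not FastBit's implementation. Real WAH packs fills and literals into machine words, whereas the sketch uses Python tuples for readability, and the function name is made up.

```python
from typing import List, Tuple, Union

# ("fill", bit, number_of_groups) or ("literal", bits_of_one_group)
Word = Union[Tuple[str, int, int], Tuple[str, Tuple[int, ...]]]


def wah_like_encode(bits: List[int], word_bits: int = 32) -> List[Word]:
    """Group bits into (word_bits - 1)-bit chunks; merge uniform chunks into fills."""
    group = word_bits - 1
    words: List[Word] = []
    for start in range(0, len(bits), group):
        chunk = tuple(bits[start:start + group])
        uniform = len(chunk) == group and len(set(chunk)) == 1
        if uniform:
            bit = chunk[0]
            if words and words[-1][0] == "fill" and words[-1][1] == bit:
                words[-1] = ("fill", bit, words[-1][2] + 1)  # extend the previous fill
            else:
                words.append(("fill", bit, 1))
        else:
            # Mixed chunk (or trailing partial chunk) is kept verbatim.
            words.append(("literal", chunk))
    return words


# A small word size keeps the example readable: 8-bit words hold 7-bit groups.
bits = [0] * 14 + [0, 1, 0, 0, 1, 1, 0] + [1] * 7 + [1, 0]
encoded = wah_like_encode(bits, word_bits=8)
```

Because every entry covers a whole number of groups, a decoder can operate on full words at a time, which matches the slide's point about minimal decoding.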

20 Impact of Binning
Datasets: 300K time series, 1000 time steps, 1.2GB.
In-memory indexes: FastBitF (WAH-compressed bitmap index), FastBit API, and RUBIK.
Hardware: AMD Opteron, 2.7GHz, 32GB RAM.
Higher resolution binning for higher indexing precision.
FastBitF-128 bins almost as big as RUBIK-256 bins.
FastBitF-512 bins bigger than the indexed data.

21 Scaling with Temporal Resolution
Datasets: 300K time series, time steps, 1.2GB – 4.8GB.
Benchmark: 60 threshold queries, random thresholds, stretched time ranges.
In-memory indexes: FastBitF (WAH-compressed bitmap index), FastBit API, and RUBIK.
Configuration: 128 bins.
Hardware: AMD Opteron, 2.7GHz, 32GB RAM.
FastBitF compresses efficiently along time dimension.
Speedup decreases from 9x to 6x.

22 Comparative Analysis
Dataset: 300K time series, 1000 time steps, 1.2GB.
Benchmark: 60 threshold queries.
In-memory indexes: FastBit10, FastBit25, FastBitF, and RUBIK.
Fixed space budget: 150MB.
Hardware: AMD Opteron, 2.7GHz, 32GB RAM.

23 Comparative Analysis
Dataset: 2M time series, 1024 time steps, 8.4GB.
Benchmark: 60 threshold queries.
In-memory indexes: FastBitF and RUBIK.
Configuration: 128 bins.
Hardware: AMD Opteron, 2.7GHz, 32GB RAM.