Fine-grained Partitioning for Aggressive Data Skipping (SIGMOD 2014, UC Berkeley). Presented by Calvin, 2015-06-03.

Fine-grained Partitioning for Aggressive Data Skipping Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin† UC Berkeley and †Databricks Inc.


Contents
- Background
- Contribution
- Overview
- Algorithm
- Data skipping
- Experiment

Background
How can we gain insights from enormous datasets interactively? How can we shorten query response times on huge datasets?
Existing systems (Oracle / HBase / Hive / LogBase) prune data blocks (partitions) according to per-block metadata. Drawbacks of these schemes:
1. Blocks (partitions) are coarse-grained
2. Block sizes are not balanced
3. The blocks that survive pruning still contain many irrelevant tuples
4. Block boundaries do not match the skew in the workload
5. Correlations between the data and the query filters are not exploited

Goals
A workload-driven blocking technique that is:
- Fine-grained
- Balanced in block size
- Computed offline
- Re-executable as the workload evolves
- Able to coexist with existing partitioning techniques

Example
Workflow: extract features from the workload, vectorize each tuple against the features, split the table into blocks, and store the blocks with their feature metadata. Two questions arise: how to choose the features, and how to split the data.
With features F1, F2, F3 and blocks P1, P2, P3, the blocks a query condition can skip:

  Condition    Skip
  F3           P1, P3
  F1 ^ F2      P2, P3

Contribution
- Feature selection: identify representative filters, modeled as frequent itemset mining
- Optimal partitioning: the Balanced-Max-Skip partitioning problem is NP-hard; a bottom-up framework yields an approximate solution

Overview
(1) Extract features from the workload
(2) Scan the table and transform each tuple into a (vector, tuple) pair
(3) Count tuples by vector to reduce the partitioner's input
(4) Generate the blocking map (vector -> blockId)
(5) Route each tuple to its destination block
(6) Store each block's union feature vector in the catalog
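The six steps above can be sketched end to end. This is a minimal illustration with hypothetical predicates, a toy table, and an arbitrary blocking map standing in for the paper's partitioner:

```python
from collections import Counter

# (1) Features extracted from the workload: hypothetical predicate functions.
features = [
    lambda t: t["product"] == "shoes",              # F1
    lambda t: t["product"] in ("shoes", "shirts"),  # F2
    lambda t: t["revenue"] > 21,                    # F3
]

table = [
    {"product": "shoes",  "revenue": 30},
    {"product": "shirts", "revenue": 10},
    {"product": "hats",   "revenue": 50},
    {"product": "shoes",  "revenue": 30},
]

# (2) Vectorize: bit j records whether the tuple satisfies feature F_j.
def vectorize(t):
    return tuple(int(f(t)) for f in features)

pairs = [(vectorize(t), t) for t in table]

# (3) Count by vector: distinct vectors, not raw tuples, feed the partitioner.
counts = Counter(v for v, _ in pairs)

# (4) Blocking map (vector -> blockId); chosen arbitrarily here, produced by
#     the partitioning algorithm in the paper.
blocking_map = {v: i for i, v in enumerate(sorted(counts))}

# (5) Route each tuple to its destination block.
blocks = {}
for v, t in pairs:
    blocks.setdefault(blocking_map[v], []).append(t)

# (6) Catalog: each block's union feature vector (bitwise OR over its tuples).
def union(vectors):
    return tuple(max(bits) for bits in zip(*vectors))

catalog = {bid: union([vectorize(t) for t in ts]) for bid, ts in blocks.items()}
```

Identical tuples collapse onto one vector in step (3), which is what makes the partitioner's input small relative to the table.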

Workload Assumptions
Filters in the workload's queries exhibit commonality and stability, e.g.:
- Scheduled or reporting queries
- Template queries instantiated with different value ranges

Workload Modeling
The workload is a set of queries Q = {Q1, Q2, ..., Qm}. Examples:
- Q1: product = 'shoes'
- Q2: product in ('shoes', 'shirts'), revenue > 32
- Q3: product = 'shirts', revenue > 21
F: all predicates in Q; Fi: Qi's predicates; fij: each predicate in Fi.
Predicates are compared for subsumption: product in ('shoes', 'shirts') subsumes product = 'shoes', whereas product in ('shoes', 'shirts') and revenue > 21 are unrelated.

Filter Augmentation
Each query's predicate set is augmented with every workload predicate that is implied by one of its own predicates:
- Q1: product = 'shoes', product in ('shoes', 'shirts')
- Q2: product in ('shoes', 'shirts'), revenue > 32, revenue > 21
- Q3: product = 'shirts', revenue > 21, product in ('shoes', 'shirts')
Features are then chosen by frequent itemset mining over the augmented sets with support threshold T (= 2 here); numFeat bounds the number of features selected.
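The mining step on the three augmented queries above can be sketched by brute force. Predicates are encoded as opaque strings (a hypothetical encoding; the paper operates on parsed filter expressions), and every predicate subset of every query is counted:

```python
from itertools import combinations
from collections import Counter

# Augmented predicate sets for Q1..Q3, predicates named symbolically.
queries = [
    {"product='shoes'", "product in ('shoes','shirts')"},
    {"product in ('shoes','shirts')", "revenue>32", "revenue>21"},
    {"product='shirts'", "revenue>21", "product in ('shoes','shirts')"},
]

T = 2  # support threshold

# Count every non-empty predicate subset of every query. This is exhaustive
# itemset counting; fine for tiny sets, where Apriori-style pruning is overkill.
support = Counter()
for preds in queries:
    for k in range(1, len(preds) + 1):
        for itemset in combinations(sorted(preds), k):
            support[frozenset(itemset)] += 1

# Candidate features: itemsets reaching the support threshold.
frequent = {s: c for s, c in support.items() if c >= T}
```

On this workload the frequent itemsets are {product in ('shoes','shirts')}, {revenue>21}, and their conjunction, matching the intuition that augmentation lets a broad predicate collect support from the narrow ones it subsumes.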

Partitioning Problem Modeling
- F = {F1, F2, ..., Fm}: the selected features, each with a weight wj
- V = {v1, v2, ..., vn}: the transformed tuples (feature vectors), where vij indicates whether vi satisfies Fj
- P = {P1, P2, ..., Pk}: a partitioning of the tuples into blocks
- Cost function C(Pi): the number of tuples block Pi lets the workload's queries skip; the objective is to maximize C(P)
This Balanced-Max-Skip partitioning problem is NP-hard.
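The cost of a candidate block follows directly from the vectors it contains: feature Fj is skippable for a block exactly when no vector in it satisfies Fj, and each skippable feature saves wj scans of every tuple in the block. A small sketch under that reading (the weights and vectors are made up):

```python
# Hypothetical feature weights w_j (how often feature j appears in the workload).
weights = [3, 1, 2]

def block_cost(vectors, weights):
    """C(P_i): tuples skipped across the workload if these vectors form one block.

    Feature j is skippable iff no vector in the block has bit j set; each
    skippable feature saves w_j scans of every tuple in the block.
    """
    size = len(vectors)
    union = [max(bits) for bits in zip(*vectors)]
    return sum(w * size for w, bit in zip(weights, union) if bit == 0)

# A 2-tuple block whose vectors all fail F1 (w=3) and F3 (w=2):
cost = block_cost([(0, 1, 0), (0, 0, 0)], weights)  # (3 + 2) * 2 = 10
```

Grouping vectors that fail the same features concentrates zeros in the union vector, which is exactly what the objective rewards.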

The Bottom-up Framework
Inspired by Ward's method: hierarchical (agglomerative) grouping to optimize the objective function, with O(n^2 log n) complexity. The output R is the blocking map: {vector -> blockId, ...}
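A simplified agglomerative sketch of the idea, not the paper's exact algorithm: start with each distinct vector as its own group and repeatedly perform the merge that loses the least skipping cost until every group reaches minSize. (This naive version rescans all pairs each round, so it is O(n^3); the paper's framework reaches O(n^2 log n).)

```python
def union_cost(vecs, counts, weights):
    # Skipping cost of one block: features with a 0 in the union bit-vector
    # can be skipped, saving w_j scans per tuple in the block.
    size = sum(counts[v] for v in vecs)
    union = [max(bits) for bits in zip(*vecs)]
    return sum(w * size for w, bit in zip(weights, union) if bit == 0)

def bottom_up(counts, weights, min_size):
    """counts: {vector: tuple count}. Returns the blocking map R."""
    groups = [[v] for v in counts]  # one group per distinct vector
    while any(sum(counts[v] for v in g) < min_size for g in groups) and len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # Skipping cost lost by merging groups i and j.
                loss = (union_cost(groups[i], counts, weights)
                        + union_cost(groups[j], counts, weights)
                        - union_cost(groups[i] + groups[j], counts, weights))
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups.pop(j)
    # Blocking map R: vector -> blockId.
    return {v: bid for bid, g in enumerate(groups) for v in g}
```

For example, with counts {(1,0): 2, (1,1): 1, (0,0): 3}, unit weights, and min_size 3, the two vectors that both satisfy F1 merge into one block while (0,0) stays alone, keeping its fully skippable block intact.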

Data Skipping
1. Translate the query's filter into a bit-vector over the features it implies
2. Compare it against each block's union vector (the OR of its tuples' vectors)
3. A block can be skipped if, for some feature the query implies, the block's union bit is 0 (no tuple in the block satisfies that feature)
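The check reduces to one bitwise comparison per block. A sketch, with hypothetical union vectors chosen to reproduce the earlier example table (F3 skips P1 and P3; F1 ^ F2 skips P2 and P3):

```python
def can_skip(query_bits, block_union):
    """Skippable when the query implies feature j (query bit 1) but the
    block's union vector shows no tuple satisfies it (union bit 0)."""
    return any(q and not u for q, u in zip(query_bits, block_union))

# Union vectors over features (F1, F2, F3) for three blocks.
unions = {"P1": (1, 1, 0), "P2": (0, 1, 1), "P3": (1, 0, 0)}

f3_query = (0, 0, 1)   # filter implies F3
f1_and_f2 = (1, 1, 0)  # filter implies F1 and F2

skip_f3 = [b for b, u in unions.items() if can_skip(f3_query, u)]
skip_f1f2 = [b for b, u in unions.items() if can_skip(f1_and_f2, u)]
```

Here skip_f3 is ["P1", "P3"] and skip_f1f2 is ["P2", "P3"], matching the example table.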

Experiment Environment
- Amazon EC2 cluster with 25 instances, each with 8 CPU cores at 2.66 GHz, 64 GB RAM, and 2 x 840 GB disks
- Implemented and evaluated on Shark (SQL on Spark)

Datasets
TPC-H
- 600 million rows, 700 GB in size
- Query templates q3, q5, q6, q8, q10, q12, q14, q19
- 800 training queries (100 per template) and 80 testing queries (10 per template)
TPC-H Skewed
- The standard TPC-H query generator draws parameters uniformly; here the 800 training queries (100 per template) follow a Zipf distribution
Conviva
- User access log of video streams
- 104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, ...
- 674 training queries and 61 testing queries
- 680 million tuples, 1 TB in size

TPC-H results: query performance
Measures the number of tuples scanned and the response time under different blocking and skipping schemes:
- Full scan: no data skipping (baseline)
- Range1: range-partitioned on o_orderdate, about 2300 partitions; Shark's data skipping used
- Range2: range-partitioned on {o_orderdate, r_name, c_mktsegment, quantity}, about 9000 partitions; Shark's data skipping used
- Fineblock: numFeat = 15 features mined from the 800 training queries, minSize = 50k; both Shark's data skipping and feature-based skipping used

TPC-H results - efficiency

TPC-H results – effect of minSize
The smaller the block size, the more data can be skipped. numFeat = 15 with varying minSize; the y-axis is the ratio of tuples scanned to tuples that must be scanned.

TPC-H results – effect of numFeat

TPC-H results – blocking time
Blocking one month's partition of TPC-H (7.7 million tuples, 8 GB) into 1000 blocks with numFeat = 15 and minSize = 50 takes about one minute.

Conviva results: query performance
- Fullscan: no data skipping
- Range: partitioned on date and a frequently queried column; Shark's skipping used
- Fineblock: first partitioned on date, then numFeat = 40, minSize = 50k; both Shark's skipping and feature-based skipping used