Fine-grained Partitioning for Aggressive Data Skipping
Presented by Calvin, 2015-06-03
SIGMOD 2014, UC Berkeley
Contents
- Background
- Contribution
- Overview
- Algorithm
- Data skipping
- Experiment
Background
How can we interactively gain insight into enormous datasets? How can we shorten query response time on huge datasets?

Block / partition pruning: systems such as Oracle, HBase, Hive, and LogBase prune data blocks (partitions) according to metadata.

Drawbacks of existing approaches:
1. Coarse-grained blocks (partitions)
2. Unbalanced block sizes
3. The remaining blocks still contain many tuples
4. Blocks do not match the workload skew
5. Correlations between the data and query filters are not exploited
Goals
A workload-driven blocking technique that is:
- Fine-grained
- Balanced in block size
- Computed offline
- Re-executable
- Able to coexist with existing partitioning techniques
Example
Extract features from the workload, then vectorize the tuples.
1. Split the block
2. Store the resulting blocks

The two questions are how to choose features and how to split:

Condition | Blocks skipped
F3        | P1, P3
F1 ∧ F2   | P2, P3
Contribution
Feature selection
- Identify representative filters
- Modeled as frequent itemset mining

Optimal partitioning
- The Balanced-Max-Skip partitioning problem is NP-hard
- A bottom-up framework yields an approximate solution
Overview
(1) Extract features from the workload
(2) Scan the table and transform each tuple into a (vector, tuple) pair
(3) Count by vector to reduce the partitioner's input
(4) Generate the blocking map (vector -> blockId)
(5) Route each tuple to its destination block
(6) Update each block's union feature vector in the catalog
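Steps (2) and (3) can be sketched as follows; the feature predicates and tuples below are made up for illustration, not taken from the paper:

```python
from collections import Counter

# Hypothetical features (predicates mined from the workload).
features = [
    lambda t: t["product"] == "shoes",
    lambda t: t["revenue"] > 21,
]

def vectorize(row, features):
    """Step (2): evaluate every feature on the tuple -> bit vector."""
    return tuple(1 if f(row) else 0 for f in features)

tuples = [
    {"product": "shoes",  "revenue": 10},
    {"product": "shirts", "revenue": 30},
    {"product": "shoes",  "revenue": 10},
]

# Step (3): identical vectors collapse into one weighted entry, shrinking
# the partitioner's input from n tuples to the distinct vectors.
counts = Counter(vectorize(t, features) for t in tuples)
print(counts)  # Counter({(1, 0): 2, (0, 1): 1})
```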
Workload Assumptions
Filters in the workload's queries exhibit commonality and stability:
- Scheduled or reporting queries
- Template queries instantiated with different value ranges
Workload Modeling
Q = {Q1, Q2, …, Qm}

Examples:
Q1: product = 'shoes'
Q2: product in ('shoes', 'shirts'), revenue > 32
Q3: product = 'shirts', revenue > 21

F: all predicates in Q
Fi: Qi's predicates
fij: each item in Fi

Comparing predicates:
- product in ('shoes', 'shirts') vs. product = 'shoes': the former subsumes the latter
- product in ('shoes', 'shirts') vs. revenue > 21: unrelated predicates
Filter Augmentation
Before:
Q1: product = 'shoes'
Q2: product in ('shoes', 'shirts'), revenue > 32
Q3: product = 'shirts', revenue > 21

After augmentation (each filter set is extended with the filters it implies):
Q1: product = 'shoes', product in ('shoes', 'shirts')
Q2: product in ('shoes', 'shirts'), revenue > 32, revenue > 21
Q3: product = 'shirts', revenue > 21, product in ('shoes', 'shirts')

Frequent itemset mining with support threshold T (= 2) then yields the numFeat features.
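The augmentation-plus-mining step can be sketched with a brute-force frequent itemset count (fine here because each filter set is tiny; a real implementation would use Apriori or FP-growth; the filter strings are the slide's examples):

```python
from itertools import combinations
from collections import Counter

# Augmented filter sets from the slide: each query's filters plus the
# filters they imply.
augmented = [
    {"product='shoes'", "product in ('shoes','shirts')"},
    {"product in ('shoes','shirts')", "revenue>32", "revenue>21"},
    {"product='shirts'", "revenue>21", "product in ('shoes','shirts')"},
]

T = 2  # support threshold from the slide

# Count every sub-itemset of every filter set (exponential in the set
# size, acceptable only because the sets are small).
support = Counter()
for fs in augmented:
    for k in range(1, len(fs) + 1):
        for itemset in combinations(sorted(fs), k):
            support[itemset] += 1

frequent = {s for s, c in support.items() if c >= T}
```

With T = 2, the IN-predicate (support 3), `revenue>21` (support 2), and their pair (support 2) survive as features.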
Partitioning Problem Modeling
F = {F1, F2, …, Fm}: the features, each with weight wi
V = {v1, v2, …, vn}: the transformed tuples (vectors)
vij indicates whether vi satisfies Fj
P = {P1, P2, P3}: a partition
Cost function C(Pi): the number of tuples that Pi can skip, summed over all queries in the workload
Objective: maximize C(P) — this (the Balanced-Max-Skip partitioning problem) is NP-hard
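A minimal sketch of the cost intuition (weights and vectors are illustrative): a block Pi is skippable for feature Fj exactly when no vector in the block has bit j set, and it then skips |Pi| tuples for each of the wj queries that use Fj:

```python
def block_skip_benefit(block_vectors, weights):
    """C(P_i): sum over features F_j of w_j * |P_i| whenever no tuple
    in the block satisfies F_j (the whole block is skippable for F_j)."""
    size = len(block_vectors)
    benefit = 0
    for j, w in enumerate(weights):
        if all(v[j] == 0 for v in block_vectors):
            benefit += w * size
    return benefit

weights = [2, 1]              # how often each feature appears in queries
block = [(0, 1), (0, 0)]      # neither tuple satisfies F1 (bit 0)
print(block_skip_benefit(block, weights))  # 2 queries * 2 tuples = 4
```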
The Bottom-Up Framework
Ward's method: hierarchical grouping to optimize an objective function
Complexity: O(n^2 log n)
Output R: the blocking map {vector -> blockId, …}
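A much-simplified sketch of the bottom-up idea (a cubic-time toy, not the paper's framework): start with one block per distinct vector and greedily merge the pair of blocks whose merge loses the least skipping benefit:

```python
def benefit(vectors, weights):
    # Tuples skipped: w_j * |block| for every feature no vector satisfies.
    return sum(w * len(vectors)
               for j, w in enumerate(weights)
               if all(v[j] == 0 for v in vectors))

def bottom_up(vectors, weights, target_blocks):
    """Agglomerative merging until only target_blocks blocks remain."""
    blocks = [[v] for v in vectors]
    while len(blocks) > target_blocks:
        best = None
        for a in range(len(blocks)):
            for b in range(a + 1, len(blocks)):
                # Skipping benefit lost by merging blocks a and b.
                loss = (benefit(blocks[a], weights)
                        + benefit(blocks[b], weights)
                        - benefit(blocks[a] + blocks[b], weights))
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        _, a, b = best
        blocks[a] += blocks.pop(b)
    return blocks

# Identical vectors merge first (zero loss); the dissimilar one stays apart.
print(bottom_up([(1, 0), (1, 0), (0, 1)], [1, 1], 2))
```

Ward's method follows the same pattern: at every step, merge the pair of clusters whose union least degrades the objective.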
Data Skipping
1. Generate the query's feature vector
2. OR it with each block's union vector
3. A block whose result has at least one 0 bit can be skipped
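One reading of the three steps above (the bit conventions here are an assumption): the query vector carries a 0 bit for each feature its filter implies, and each block stores the OR (union) of its tuples' vectors; OR-ing the two leaves a 0 bit exactly where the query needs a feature that no tuple in the block satisfies:

```python
def can_skip(query_vec, block_union_vec):
    """Skippable iff some bit is 0 in both the query vector and the
    block's union vector (their OR still has a 0 bit there)."""
    return any(q | u == 0 for q, u in zip(query_vec, block_union_vec))

# Query vector: 0 at bit j means the query's filter implies feature F_j.
query = (0, 1, 1)        # query requires F1
block_a = (0, 1, 0)      # union vector: no tuple in A satisfies F1
block_b = (1, 1, 0)      # some tuple in B satisfies F1
print(can_skip(query, block_a), can_skip(query, block_b))  # True False
```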
Experiment Environment
Amazon EC2 cluster with 25 instances, each with:
- 8 x 2.66 GHz CPU cores
- 64 GB RAM
- 2 x 840 GB disk storage
Implemented and evaluated on Shark (SQL on Spark)
Datasets
TPC-H
- 600 million rows, 700 GB in size
- Query templates: q3, q5, q6, q8, q10, q12, q14, q19
- 800 queries as the training workload (100 from each template)
- 80 testing queries (10 from each template)

TPC-H Skewed
- The standard TPC-H query generator draws values uniformly; here the 800 training queries (100 from each template) are generated under a Zipf distribution

Conviva
- User access log of video streams
- 104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …
- 674 training queries and 61 testing queries
- 680 million tuples, 1 TB in size

TPC-H background notes: http://blog.csdn.net/fivedoumi/article/details/12356807
TPC-H results – query performance
Measured the number of tuples scanned and the response time under different blocking and skipping schemes:
- Full scan: no data skipping (baseline)
- Range1: partitioned on o_orderdate, about 2300 partitions; Shark's data skipping used
- Range2: partitioned on {o_orderdate, r_name, c_mktsegment, quantity}, about 9000 partitions; Shark's data skipping used
- Fineblock: numFeat = 15 features from the 800 training queries, minSize = 50k; both Shark's data skipping and feature-based data skipping used
TPC-H results - efficiency
TPC-H results – effect of minSize
The smaller the block size, the more data we can skip.
numFeat = 15 with various minSize values
Y-axis: ratio of the number of tuples scanned to the number that must be scanned
TPC-H results – effect of numFeat
TPC-H results – blocking time
A one-month partition of TPC-H: 7.7 million tuples, 8 GB in size, split into 1000 blocks
numFeat = 15, minSize = 50
Blocking takes about one minute
Conviva results – query performance
- Full scan: no data skipping
- Range: partitioned on date and a frequently queried column; Shark's skipping used
- Fineblock: first partitioned on date, then numFeat = 40, minSize = 50k; both Shark's skipping and feature-based skipping used