
1 Fine-grained Partitioning for Aggressive Data Skipping Calvin 2015-06-03 SIGMOD 2014 UC Berkeley

2 Contents: Background, Contribution, Overview, Algorithm, Data skipping, Experiment

3 Background How can we get insights from enormous datasets interactively? How can we shorten query response time on huge datasets? Existing approach: block/partition pruning, as in Oracle / HBase / Hive / LogBase, which prunes data blocks (partitions) according to per-block metadata. Drawbacks:
1. Coarse-grained blocks (partitions)
2. Unbalanced block sizes
3. The remaining blocks still contain many irrelevant tuples
4. Blocks do not match the workload skew
5. Correlation between the data and the query filters is ignored

4 Goals A workload-driven blocking technique that is fine-grained, balanced in block size, computed offline, re-executable, and able to co-exist with existing partitioning techniques.

5 Example Extract features from the workload, then vectorize each tuple. Two questions: how to choose the features, and how to split the data.
1. Split blocks
2. Storage
Condition | Blocks skipped
F3        | P1, P3
F1 ∧ F2   | P2, P3

6 Contribution Feature selection: identify representative filters, modeled as frequent itemset mining. Optimal partitioning: the Balanced-Max-Skip partitioning problem is NP-hard; a bottom-up framework yields an approximate solution.

7 Overview
(1) Extract features from the workload
(2) Scan the table and transform each tuple into a (vector, tuple) pair
(3) Count by vector to reduce the partitioner's input
(4) Generate the blocking map (vector -> blockId)
(5) Route each tuple to its destination block
(6) Update each block's union feature vector in the catalog
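Steps (2) and (3) can be sketched as follows; the tuples and feature vectors here are illustrative, not taken from the paper:

```python
from collections import Counter

# Step (2): each scanned tuple is transformed into a feature bit vector
# recording which workload features it satisfies (vectors shown directly).
# Step (3): many tuples share the same vector, so counting by vector shrinks
# the partitioner's input from n tuples to the number of distinct vectors.
tuple_vectors = [(1, 0), (1, 0), (0, 1), (1, 0)]  # hypothetical vectors

vector_counts = Counter(tuple_vectors)
print(vector_counts[(1, 0)], vector_counts[(0, 1)])  # 3 1
```

The partitioner then only needs to place each distinct vector, weighted by its count, rather than every individual tuple.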

8 Workload Assumptions Filters in the workload's queries show commonality and stability: scheduled or reporting queries, and template queries instantiated with different value ranges.

9 Workload Modeling Q = {Q1, Q2, …, Qm}
Examples:
Q1: product='shoes'
Q2: product in ('shoes','shirts'), revenue > 32
Q3: product='shirts', revenue > 21
F: all predicates in Q; Fi: Qi's predicates; fij: each item in Fi
Compare: product in ('shoes','shirts') vs. product='shoes' (the former subsumes the latter); product in ('shoes','shirts') vs. revenue > 21 (unrelated predicates)
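The tuple-to-vector transformation can be sketched as below; the feature predicates and example row are assumptions chosen to match the slide's examples:

```python
# Sketch (assumed data model): each feature is a predicate over a tuple, and
# each tuple is mapped to a bit vector recording which features it satisfies.
features = [
    lambda t: t["product"] == "shoes",              # F1
    lambda t: t["product"] in ("shoes", "shirts"),  # F2
    lambda t: t["revenue"] > 21,                    # F3
]

def vectorize(row):
    """Transform a tuple into its feature bit vector."""
    return tuple(int(f(row)) for f in features)

print(vectorize({"product": "shirts", "revenue": 30}))  # (0, 1, 1)
```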

10 Filter augmentation Each query's filter set is augmented with every other workload filter it implies.
Before:
Q1: product='shoes'
Q2: product in ('shoes','shirts'), revenue > 32
Q3: product='shirts', revenue > 21
After:
Q1: product='shoes', product in ('shoes','shirts')
Q2: product in ('shoes','shirts'), revenue > 32, revenue > 21
Q3: product='shirts', revenue > 21, product in ('shoes','shirts')
Frequent itemset mining with support threshold T (=2) then selects the top numFeat features.
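A brute-force sketch of the mining step, using the augmented filter sets above (predicate strings are just labels here, and this enumeration is illustrative rather than the paper's mining algorithm):

```python
from itertools import combinations
from collections import Counter

# Augmented workload: each query is a set of predicate labels.
workload = [
    {"product='shoes'", "product in ('shoes','shirts')"},                 # Q1
    {"product in ('shoes','shirts')", "revenue>32", "revenue>21"},        # Q2
    {"product='shirts'", "revenue>21", "product in ('shoes','shirts')"},  # Q3
]

def frequent_itemsets(queries, threshold):
    """Return every predicate set that appears in >= threshold queries."""
    counts = Counter()
    for preds in queries:
        for k in range(1, len(preds) + 1):
            for combo in combinations(sorted(preds), k):
                counts[frozenset(combo)] += 1
    return {itemset for itemset, c in counts.items() if c >= threshold}

feats = frequent_itemsets(workload, threshold=2)
```

With T=2, only the IN-list predicate, revenue > 21, and their pair survive; the raw, un-augmented filters would have been too infrequent.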

11 Partitioning problem modeling
F = {F1, F2, …, Fm}: the features, each with weight wi
V = {v1, v2, …, vn}: the transformed tuples; vij indicates whether vi satisfies Fj
P = {P1, P2, P3}: a partition
Cost function C(Pi): the number of tuples that Pi lets queries skip, summed over all queries in the workload
Objective: maximize C(P); the problem is NP-hard
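The skipping benefit of one block can be computed as below; this is a sketch of the idea behind C(Pi) under the stated model, not the paper's exact formulation:

```python
# A block P_i can be skipped for feature F_j exactly when no tuple in P_i
# satisfies F_j; that saves |P_i| tuple scans, weighted by F_j's workload
# frequency w_j. Summing over features gives the block's skipping benefit.
def block_skip_benefit(block_vectors, weights):
    """block_vectors: feature bit vectors of the tuples in one block.
    weights: w_j for each feature F_j."""
    size = len(block_vectors)
    benefit = 0
    for j, w in enumerate(weights):
        if all(v[j] == 0 for v in block_vectors):  # no tuple matches F_j
            benefit += w * size                    # whole block is skipped
    return benefit

print(block_skip_benefit([(0, 1), (0, 0)], [2, 1]))  # 4
```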

12 The bottom-up framework Based on Ward's method: hierarchical grouping to optimize an objective function. Complexity: O(n² log n). Output R: the blocking map {vector -> blockId, …}
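A greedy bottom-up grouping in the spirit of Ward's method can be sketched as follows; this is an illustrative simplification (merge cost = number of feature bits that differ, i.e. newly-set bits in the union), not the paper's exact algorithm:

```python
# Start from one group per distinct vector and repeatedly merge the pair of
# groups whose union loses the least skipping ability.
def merge_cost(a, b):
    # Bits that differ become 1 in the union on a side where they were 0,
    # so each differing bit is skipping ability lost by the merge.
    return sum((x | y) - (x & y) for x, y in zip(a, b))

def bottom_up(vectors, target_groups):
    groups = [list(v) for v in vectors]
    while len(groups) > target_groups:
        i, j = min(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda ab: merge_cost(groups[ab[0]], groups[ab[1]]),
        )
        merged = [x | y for x, y in zip(groups[i], groups[j])]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups

print(bottom_up([(1, 0), (1, 0), (0, 1)], 2))
```

Identical vectors merge first at zero cost, so homogeneous blocks form naturally; a real implementation would also enforce the balanced minimum block size.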

13 Data skipping
1. Generate the query's feature vector
2. OR it with each partition's union vector
3. A block whose result has at least one 0 bit can be skipped
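One way to read this check (an interpretation, since the slide is terse): the catalog stores each block's union vector, the OR of its tuples' feature vectors, and a block is skippable when some feature the query requires has a 0 bit in that union, i.e. no tuple in the block can match:

```python
# block_union: bitwise OR of the feature vectors of all tuples in the block.
# query_bits: marks the features that the query's filter implies.
def can_skip(block_union, query_bits):
    """Skip the block if a required feature is unsatisfied by every tuple."""
    return any(q == 1 and b == 0 for q, b in zip(query_bits, block_union))

print(can_skip(block_union=(1, 0, 1), query_bits=(0, 1, 0)))  # True
```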

14 Experiment Environment:
Amazon EC2 Spark cluster with 25 instances
8 × 2.66 GHz CPU cores per instance
64 GB RAM
2 × 840 GB disk storage
Implemented and evaluated on Shark (SQL on Spark)

15 Datasets
TPC-H: 600 million rows, 700 GB. Query templates (q3, q5, q6, q8, q10, q12, q14, q19); 800 training queries (100 per template) and 80 testing queries (10 per template).
TPC-H Skewed: the TPC-H query generator draws values uniformly; here the 800 training queries (100 per template) are drawn under a Zipf distribution.
Conviva: user access log of video streams. 104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …; 674 training queries and 61 testing queries; 680 million tuples, 1 TB.
TPC-H reference: http://blog.csdn.net/fivedoumi/article/details/12356807

16 TPC-H results: query performance Measured: number of tuples scanned and response time under different blocking and skipping schemes.
Full scan: no data skipping (baseline)
Range1: partitioned on o_orderdate, about 2300 partitions; Shark's data skipping used
Range2: partitioned on {o_orderdate, r_name, c_mktsegment, quantity}, about 9000 partitions; Shark's data skipping used
Fineblock: numFeat=15 features from the 800 training queries, minSize=50k; both Shark's data skipping and feature-based data skipping used

17 TPC-H results - efficiency

18 TPC-H results – effect of minSize The smaller the block size, the more data can be skipped. numFeat=15 with varying minSize. Y-axis: ratio of the number of tuples scanned to the number that must be scanned.

19 TPC-H results – effect of numFeat

20 TPC-H results – blocking time Blocking one month's partition of TPC-H (7.7 million tuples, 8 GB) into 1000 blocks with numFeat=15, minSize=50 takes about one minute.

21 Conviva results: query performance
Fullscan: no data skipping
Range: partitioned on date and a frequently queried column; Shark's skipping used
Fineblock: first partitioned on date; numFeature=40, minSize=50k; both Shark's skipping and feature-based skipping used

