Fine-grained Partitioning for Aggressive Data Skipping
Presented by Calvin, 2015-06-03
SIGMOD 2014, UC Berkeley
Contents
- Background
- Contribution
- Overview
- Algorithm
- Data skipping
- Experiment
Background
How can we interactively gain insight into enormous datasets? How can we shorten query response time on huge datasets?

Block / partition pruning: systems such as Oracle, HBase, Hive, and LogBase prune data blocks (partitions) according to metadata.

Drawbacks of existing approaches:
1. Coarse-grained blocks (partitions)
2. Unbalanced block sizes
3. The remaining blocks still contain many tuples
4. Blocks do not match the workload skew
5. Correlations between the data and query filters are not exploited
Goals
A workload-driven blocking technique that is:
- Fine-grained
- Balanced in block size
- Computed offline
- Re-executable
- Able to coexist with existing partitioning techniques
Example
Extract features from the workload, then vectorize the tuples.
1. Split the block
2. Store the resulting blocks

The two questions are how to choose features and how to split:

Condition | Blocks skipped
F3        | P1, P3
F1 ∧ F2   | P2, P3
Contribution
Feature selection
- Identify representative filters
- Modeled as frequent itemset mining

Optimal partitioning
- The Balanced-Max-Skip partitioning problem is NP-hard
- A bottom-up framework yields an approximate solution
Overview
(1) Extract features from the workload
(2) Scan the table and transform each tuple into a (vector, tuple) pair
(3) Count by vector to reduce the partitioner's input
(4) Generate the blocking map (vector -> blockId)
(5) Route each tuple to its destination block
(6) Update each block's union feature vector in the catalog
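Steps (2) and (3) can be sketched as follows; the feature predicates and tuples below are made up for illustration, not taken from the paper:

```python
from collections import Counter

# Hypothetical features (predicates mined from the workload).
features = [
    lambda t: t["product"] == "shoes",
    lambda t: t["revenue"] > 21,
]

def vectorize(row, features):
    """Step (2): evaluate every feature on the tuple -> bit vector."""
    return tuple(1 if f(row) else 0 for f in features)

tuples = [
    {"product": "shoes",  "revenue": 10},
    {"product": "shirts", "revenue": 30},
    {"product": "shoes",  "revenue": 10},
]

# Step (3): identical vectors collapse into one weighted entry, shrinking
# the partitioner's input from n tuples to the distinct vectors.
counts = Counter(vectorize(t, features) for t in tuples)
print(counts)  # Counter({(1, 0): 2, (0, 1): 1})
```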
Workload Assumptions
Filters in the workload's queries exhibit commonality and stability:
- Scheduled or reporting queries
- Template queries instantiated with different value ranges
Workload Modeling
Q = {Q1, Q2, …, Qm}

Examples:
Q1: product = 'shoes'
Q2: product in ('shoes', 'shirts'), revenue > 32
Q3: product = 'shirts', revenue > 21

F: all predicates in Q
Fi: Qi's predicates
fij: each item in Fi

Comparing predicates:
- product in ('shoes', 'shirts') vs. product = 'shoes': the former subsumes the latter
- product in ('shoes', 'shirts') vs. revenue > 21: unrelated predicates
Filter Augmentation
Before:
Q1: product = 'shoes'
Q2: product in ('shoes', 'shirts'), revenue > 32
Q3: product = 'shirts', revenue > 21

After augmentation (each filter set is extended with the filters it implies):
Q1: product = 'shoes', product in ('shoes', 'shirts')
Q2: product in ('shoes', 'shirts'), revenue > 32, revenue > 21
Q3: product = 'shirts', revenue > 21, product in ('shoes', 'shirts')

Frequent itemset mining with support threshold T (= 2) then yields the numFeat features.
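The augmentation-plus-mining step can be sketched with a brute-force frequent itemset count (fine here because each filter set is tiny; a real implementation would use Apriori or FP-growth; the filter strings are the slide's examples):

```python
from itertools import combinations
from collections import Counter

# Augmented filter sets from the slide: each query's filters plus the
# filters they imply.
augmented = [
    {"product='shoes'", "product in ('shoes','shirts')"},
    {"product in ('shoes','shirts')", "revenue>32", "revenue>21"},
    {"product='shirts'", "revenue>21", "product in ('shoes','shirts')"},
]

T = 2  # support threshold from the slide

# Count every sub-itemset of every filter set (exponential in the set
# size, acceptable only because the sets are small).
support = Counter()
for fs in augmented:
    for k in range(1, len(fs) + 1):
        for itemset in combinations(sorted(fs), k):
            support[itemset] += 1

frequent = {s for s, c in support.items() if c >= T}
```

With T = 2, the IN-predicate (support 3), `revenue>21` (support 2), and their pair (support 2) survive as features.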
Partitioning Problem Modeling
F = {F1, F2, …, Fm}: the features, each with weight wi
V = {v1, v2, …, vn}: the transformed tuples (vectors)
vij indicates whether vi satisfies Fj
P = {P1, P2, P3}: a partition
Cost function C(Pi): the number of tuples that Pi can skip, summed over all queries in the workload
Objective: maximize C(P) — this (the Balanced-Max-Skip partitioning problem) is NP-hard
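A minimal sketch of the cost intuition (weights and vectors are illustrative): a block Pi is skippable for feature Fj exactly when no vector in the block has bit j set, and it then skips |Pi| tuples for each of the wj queries that use Fj:

```python
def block_skip_benefit(block_vectors, weights):
    """C(P_i): sum over features F_j of w_j * |P_i| whenever no tuple
    in the block satisfies F_j (the whole block is skippable for F_j)."""
    size = len(block_vectors)
    benefit = 0
    for j, w in enumerate(weights):
        if all(v[j] == 0 for v in block_vectors):
            benefit += w * size
    return benefit

weights = [2, 1]              # how often each feature appears in queries
block = [(0, 1), (0, 0)]      # neither tuple satisfies F1 (bit 0)
print(block_skip_benefit(block, weights))  # 2 queries * 2 tuples = 4
```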
The Bottom-Up Framework
Ward's method: hierarchical grouping to optimize an objective function
Complexity: O(n^2 log n)
Output R: the blocking map {vector -> blockId, …}
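A much-simplified sketch of the bottom-up idea (a cubic-time toy, not the paper's framework): start with one block per distinct vector and greedily merge the pair of blocks whose merge loses the least skipping benefit:

```python
def benefit(vectors, weights):
    # Tuples skipped: w_j * |block| for every feature no vector satisfies.
    return sum(w * len(vectors)
               for j, w in enumerate(weights)
               if all(v[j] == 0 for v in vectors))

def bottom_up(vectors, weights, target_blocks):
    """Agglomerative merging until only target_blocks blocks remain."""
    blocks = [[v] for v in vectors]
    while len(blocks) > target_blocks:
        best = None
        for a in range(len(blocks)):
            for b in range(a + 1, len(blocks)):
                # Skipping benefit lost by merging blocks a and b.
                loss = (benefit(blocks[a], weights)
                        + benefit(blocks[b], weights)
                        - benefit(blocks[a] + blocks[b], weights))
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        _, a, b = best
        blocks[a] += blocks.pop(b)
    return blocks

# Identical vectors merge first (zero loss); the dissimilar one stays apart.
print(bottom_up([(1, 0), (1, 0), (0, 1)], [1, 1], 2))
```

Ward's method follows the same pattern: at every step, merge the pair of clusters whose union least degrades the objective.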
Data Skipping
1. Generate the query's feature vector
2. OR it with each block's union vector
3. A block whose result has at least one 0 bit can be skipped
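One reading of the three steps above (the bit conventions here are an assumption): the query vector carries a 0 bit for each feature its filter implies, and each block stores the OR (union) of its tuples' vectors; OR-ing the two leaves a 0 bit exactly where the query needs a feature that no tuple in the block satisfies:

```python
def can_skip(query_vec, block_union_vec):
    """Skippable iff some bit is 0 in both the query vector and the
    block's union vector (their OR still has a 0 bit there)."""
    return any(q | u == 0 for q, u in zip(query_vec, block_union_vec))

# Query vector: 0 at bit j means the query's filter implies feature F_j.
query = (0, 1, 1)        # query requires F1
block_a = (0, 1, 0)      # union vector: no tuple in A satisfies F1
block_b = (1, 1, 0)      # some tuple in B satisfies F1
print(can_skip(query, block_a), can_skip(query, block_b))  # True False
```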
Experiment Environment
Amazon EC2 cluster with 25 instances, each with:
- 8 x 2.66 GHz CPU cores
- 64 GB RAM
- 2 x 840 GB disk storage
Implemented and evaluated on Shark (SQL on Spark)
Datasets
TPC-H
- 600 million rows, 700 GB in size
- Query templates: q3, q5, q6, q8, q10, q12, q14, q19
- 800 queries as the training workload (100 from each template)
- 80 testing queries (10 from each template)

TPC-H Skewed
- The standard TPC-H query generator draws values uniformly; here the 800 training queries (100 from each template) are generated under a Zipf distribution

Conviva
- User access log of video streams
- 104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …
- 674 training queries and 61 testing queries
- 680 million tuples, 1 TB in size

TPC-H background notes: http://blog.csdn.net/fivedoumi/article/details/12356807
TPC-H results – query performance
Measured the number of tuples scanned and the response time under different blocking and skipping schemes:
- Full scan: no data skipping (baseline)
- Range1: partitioned on o_orderdate, about 2300 partitions; Shark's data skipping used
- Range2: partitioned on {o_orderdate, r_name, c_mktsegment, quantity}, about 9000 partitions; Shark's data skipping used
- Fineblock: numFeat = 15 features from the 800 training queries, minSize = 50k; both Shark's data skipping and feature-based data skipping used
TPC-H results - efficiency
TPC-H results – effect of minSize
The smaller the block size, the more data we can skip.
numFeat = 15 with various minSize values
Y-axis: ratio of the number of tuples scanned to the number that must be scanned
TPC-H results – effect of numFeat
TPC-H results – blocking time
A one-month partition of TPC-H: 7.7 million tuples, 8 GB in size, split into 1000 blocks
numFeat = 15, minSize = 50
Blocking takes about one minute
Conviva results – query performance
- Full scan: no data skipping
- Range: partitioned on date and a frequently queried column; Shark's skipping used
- Fineblock: first partitioned on date, then numFeat = 40, minSize = 50k; both Shark's skipping and feature-based skipping used