Fine-grained Partitioning for Aggressive Data Skipping (SIGMOD 2014, UC Berkeley). Presented by Calvin.
Contents: Background, Contribution, Overview, Algorithm, Data Skipping, Experiment
Background
How can we get insights from enormous datasets interactively? How can we shorten query response time on huge datasets?
Existing approach: block/partition pruning, i.e., skip a data block (partition) based on its metadata, as used in Oracle, HBase, Hive, and LogBase.
Drawbacks:
1. Blocks (partitions) are coarse-grained.
2. Block sizes are not balanced.
3. The remaining blocks still contain many irrelevant tuples.
4. Blocks do not match the workload skew.
5. Correlations between the data and the query filters are not exploited.
Goals
A workload-driven blocking technique that is:
- Fine-grained
- Balanced in block size
- Computed offline
- Re-executable
- Able to co-exist with the original partitioning techniques
Example
Extract features from the workload, vectorize the tuples, then 1. split the block and 2. store the result.
Two key questions: how to choose the features, and how to split the block.
Condition and blocks skipped: F3 skips P1, P3; F1 ∧ F2 skips P2, P3.
Contribution
1. Feature selection: identify representative filters, modeled as frequent itemset mining.
2. Optimal partitioning: the Balanced-Max-Skip partitioning problem (NP-hard); a bottom-up framework gives an approximate solution.
Overview
(1) Extract features from the workload.
(2) Scan the table and transform each tuple into a (vector, tuple) pair.
(3) Count by vector to reduce the partitioner's input size.
(4) Generate the blocking map (vector -> blockId).
(5) Route each tuple to its destination block.
(6) Update each block's union feature vector in the catalog.
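Steps (2) and (3) of the overview can be sketched as follows. This is an illustrative sketch, not the paper's Shark implementation; the feature predicates and table rows are made-up examples.

```python
from collections import Counter

# Hypothetical features extracted from the workload in step (1);
# each feature is a predicate over a tuple, represented here as a dict.
features = [
    lambda t: t["product"] in ("shoes", "shirts"),
    lambda t: t["revenue"] > 21,
]

def vectorize(row):
    """Step (2): transform a tuple into its feature bit vector."""
    return tuple(int(f(row)) for f in features)

table = [
    {"product": "shoes",  "revenue": 10},
    {"product": "shirts", "revenue": 30},
    {"product": "hats",   "revenue": 5},
    {"product": "shoes",  "revenue": 25},
]

# Step (3): count tuples by vector, so the partitioner only sees one
# (vector, count) pair per distinct vector instead of every tuple.
counts = Counter(vectorize(t) for t in table)
print(sorted(counts.items()))  # [((0, 0), 1), ((1, 0), 1), ((1, 1), 2)]
```

Counting by vector is what makes the partitioner's input small: it depends on the number of distinct feature vectors, not on the number of tuples.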
Workload Assumptions
Filters in the workload's queries show commonality and stability, e.g., scheduled or reporting queries, or template queries instantiated with different value ranges.
Workload Modeling
Q = {Q1, Q2, …, Qm}. Examples:
Q1: product = 'shoes'
Q2: product in ('shoes', 'shirts'), revenue > 32
Q3: product = 'shirts', revenue > 21
F: all predicates in Q; Fi: Qi's predicate set; fij: each predicate in Fi.
Note the relations between predicates: product in ('shoes', 'shirts') subsumes product = 'shoes', while product in ('shoes', 'shirts') and revenue > 21 are unrelated.
Filter Augmentation
Before:
Q1: product = 'shoes'
Q2: product in ('shoes', 'shirts'), revenue > 32
Q3: product = 'shirts', revenue > 21
After (each filter set is augmented with the filters it implies):
Q1: product = 'shoes', product in ('shoes', 'shirts')
Q2: product in ('shoes', 'shirts'), revenue > 32, revenue > 21
Q3: product = 'shirts', revenue > 21, product in ('shoes', 'shirts')
Then run frequent itemset mining with support threshold T (= 2 here) and keep the top numFeat itemsets as features.
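The mining step can be sketched with a naive subset enumeration (the paper models this as frequent itemset mining; a real system would use Apriori or FP-growth, and the single-letter predicate names below are shorthand for the example's filters):

```python
from collections import Counter
from itertools import combinations

# Augmented filter sets from the example above, with predicates named
# for brevity: A = product='shoes', B = product in ('shoes','shirts'),
# C = revenue>32, D = revenue>21, E = product='shirts'.
queries = [
    {"A", "B"},       # Q1 after augmentation
    {"B", "C", "D"},  # Q2 after augmentation
    {"B", "D", "E"},  # Q3 after augmentation
]

T = 2  # support threshold

def frequent_itemsets(queries, threshold, max_size=2):
    """Naive enumeration: count every predicate subset up to max_size
    and keep those appearing in at least `threshold` queries."""
    support = Counter()
    for q in queries:
        for k in range(1, max_size + 1):
            for subset in combinations(sorted(q), k):
                support[subset] += 1
    return {s: c for s, c in support.items() if c >= threshold}

print(frequent_itemsets(queries, T))
# {('B',): 3, ('D',): 2, ('B', 'D'): 2}
```

With T = 2, only the broad IN-list predicate and revenue > 21 (and their combination) survive, which matches the intuition that features should be filters shared across many queries.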
Partitioning Problem Modeling
F = {F1, F2, …, Fm}: the features, each with a weight wi.
V = {v1, v2, …, vn}: the transformed tuples (bit vectors); vij indicates whether vi satisfies Fj.
P = {P1, P2, P3}: a partitioning of the tuples.
Cost function C(Pi): the number of tuples that Pi can skip, summed over all queries in the workload.
Objective: maximize C(P). This problem is NP-hard.
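A minimal sketch of the cost function (the vectors and unit weights below are illustrative, not from the paper): a block skips feature Fj entirely when no tuple in it satisfies Fj, and in that case it saves one block-sized scan, weighted by wj.

```python
def block_cost(block, weights):
    """C(P_i): tuples this block skips, summed over all features.
    `block` is a list of feature bit vectors; feature j is skippable
    for the whole block iff every vector has a 0 in position j."""
    cost = 0
    for j, w in enumerate(weights):
        if all(v[j] == 0 for v in block):   # no tuple matches feature j
            cost += w * len(block)          # the whole block is skipped
    return cost

# Two illustrative blocks over 2 features with unit weights.
w = [1, 1]
p1 = [(0, 1), (0, 0)]  # feature 0 matches no tuple -> skips 2 tuples
p2 = [(1, 0), (0, 1)]  # every feature matches some tuple -> skips nothing
print(block_cost(p1, w), block_cost(p2, w))  # 2 0
```

This makes the trade-off concrete: grouping tuples with similar vectors keeps more feature columns all-zero per block, which is exactly what the objective rewards.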
The Bottom-up Framework
Ward's method: hierarchical grouping to optimize an objective function; runs in O(n² log n).
Output R: the blocking map {vector -> blockId, …}.
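The bottom-up idea can be sketched as a greedy agglomerative loop in the spirit of Ward's method (this naive version is O(n³) and is not the paper's O(n² log n) implementation; the vectors, counts, and weights are illustrative):

```python
def merge_loss(a, b, counts, weights):
    """Skipping cost lost by merging groups a and b (frozensets of
    vectors). A feature is skippable for a group iff every vector in
    the group has a 0 at that feature's position."""
    def cost(group):
        size = sum(counts[v] for v in group)
        skippable = sum(w for j, w in enumerate(weights)
                        if all(v[j] == 0 for v in group))
        return skippable * size
    return cost(a) + cost(b) - cost(a | b)

def bottom_up(counts, weights, num_blocks):
    """Start with each distinct vector as its own group; repeatedly
    merge the pair whose merge loses the least skipping cost."""
    groups = [frozenset([v]) for v in counts]
    while len(groups) > num_blocks:
        _, i, k = min((merge_loss(a, b, counts, weights), i, k)
                      for i, a in enumerate(groups)
                      for k, b in enumerate(groups) if i < k)
        merged = groups[i] | groups[k]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, k)]
        groups.append(merged)
    return groups

# Three distinct vectors with their tuple counts, two unit-weight features.
counts = {(0, 0): 3, (0, 1): 2, (1, 1): 4}
blocks = bottom_up(counts, [1, 1], num_blocks=2)
print(blocks)
```

On this input the loop merges (0, 1) with (1, 1), since putting either of them next to (0, 0) would destroy a skippable feature column worth more.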
Data Skipping
1. Generate the query's feature vector.
2. Compare it against each block's union vector (the OR of the block's tuple vectors).
3. A block can be skipped if at least one feature bit is 1 in the query vector but 0 in the block's union vector.
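The steps above can be sketched directly on bit vectors (an illustrative sketch; the blocks and query below are made up):

```python
def union_vector(block):
    """Per-block metadata: bitwise OR of all tuple vectors in the block.
    Bit j is 0 iff no tuple in the block satisfies feature j."""
    bits = [0] * len(block[0])
    for v in block:
        bits = [b | x for b, x in zip(bits, v)]
    return bits

def can_skip(query_vec, block_union):
    """A block is skippable when some feature subsumes the query
    (bit 1 in query_vec) but matches no tuple in the block
    (bit 0 in the union vector)."""
    return any(q == 1 and u == 0 for q, u in zip(query_vec, block_union))

blocks = [
    [(1, 0), (1, 1)],  # union vector (1, 1)
    [(0, 1), (0, 0)],  # union vector (0, 1)
]
query = (1, 0)  # the query is subsumed by feature 0 only
print([can_skip(query, union_vector(b)) for b in blocks])  # [False, True]
```

The second block's union vector has a 0 where the query needs a 1, so no tuple in it can match and the whole block is pruned without being read.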
Experiment Environment
Amazon EC2 Spark cluster with 25 instances; each has 8 × 2.66 GHz CPU cores, 64 GB RAM, and 2 × 840 GB of disk storage.
Implemented and evaluated on Shark (SQL on Spark).
Datasets
TPC-H: 600 million rows, 700 GB. Query templates q3, q5, q6, q8, q10, q12, q14, q19; 800 queries (100 per template) as the training workload, and 80 testing queries (10 per template).
TPC-H Skewed: the TPC-H query generator draws values uniformly; here the 800 training queries (100 per template) are instead generated under a Zipf distribution.
Conviva: user access log of video streams; 104 columns (customerId, city, mediaUrl, genre, date, time, responseTime, …); 674 training queries and 61 testing queries; 680 million tuples, 1 TB.
TPC-H Results: Query Performance
Measure the number of tuples scanned and the response time under different blocking and skipping schemes:
- Full scan: no data skipping (baseline).
- Range1: range-partition on o_orderdate, about 2300 partitions; Shark's data skipping used.
- Range2: range-partition on {o_orderdate, r_name, c_mktsegment, quantity}, about 9000 partitions; Shark's data skipping used.
- Fineblock: numFeat = 15 features from the 800 training queries, minSize = 50k; both Shark's data skipping and feature-based data skipping used.
TPC-H Results: Efficiency
TPC-H Results: Effect of minSize
The smaller the block size, the more chance we have to skip data. numFeat = 15 with various minSize values.
Y-axis: the ratio of the number of tuples scanned to the number that must be scanned.
TPC-H Results: Effect of numFeat
TPC-H Results: Blocking Time
One month's partition of TPC-H (7.7 million tuples, 8 GB) is split into 1000 blocks with numFeat = 15, minSize = 50; blocking takes about one minute.
Conviva Results: Query Performance
- Full scan: no data skipping.
- Range: partition on date and a frequently queried column; Shark's skipping used.
- Fineblock: first partition on date, then numFeat = 40, minSize = 50k; Shark's skipping and feature-based skipping used.