
1 LIBRA: Lightweight Data Skew Mitigation in MapReduce Qi Chen, Jinyu Yao, and Zhen Xiao Nov 2014 To appear in IEEE Transactions on Parallel and Distributed Systems

2 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

3 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

4 Introduction
 The new era of Big Data is coming!
 – 20 PB per day (2008)
 – 30 TB per day (2009)
 – 60 TB per day (2010)
 – petabytes per day
 What does big data mean? Important user information and significant business value.

5 MapReduce
 What is MapReduce? The most popular parallel computing model, proposed by Google.
 Applications:
 – Database operations: Select, Join, Group
 – Search engines: PageRank, Inverted Index, Log analysis
 – Machine learning: Clustering, Machine translation, Recommendation
 – Cryptanalysis
 – Scientific computation
 – …

6 Data skew in MapReduce
 The imbalance in the amount of data assigned to each task.
 Fundamental reasons:
 – Real-world datasets are often skewed (physical properties, hot spots)
 – The data distribution is not known beforehand
 It cannot be solved by speculative execution: Mantri observed coefficients of variation in data size across tasks of 0.34 and 3.1 at the 50th and 90th percentiles in a Microsoft production cluster.
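The skew metric quoted above, the coefficient of variation, is just the standard deviation of per-task data sizes divided by their mean. A minimal sketch (the task sizes below are invented for illustration):

```python
import statistics

def coefficient_of_variation(task_sizes):
    """CV = population standard deviation / mean of per-task data sizes.
    A CV of 0 means perfectly balanced tasks; larger values mean more skew."""
    return statistics.pstdev(task_sizes) / statistics.mean(task_sizes)

# Perfectly balanced tasks: no skew.
print(coefficient_of_variation([64, 64, 64, 64]))  # 0.0

# One straggler task with far more data than the rest: CV well above 1.
print(coefficient_of_variation([64, 64, 64, 640]))
```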

7 0 Outlines 2. Background3. Previous work4. System Design1. Introduction5. Evaluation6. Conclusion

8 Architecture
 Input files are divided into splits (Split 1 … Split M); the master assigns map and reduce tasks to workers.
 Map stage: [(K1, V1)] → map → [(K2, V2)] → combine → [(K2, [V2])], emitted as partitions (Part 1, Part 2, …) per map task.
 Reduce stage: [(K2, [V2])] → copy → sort → reduce → [(K3, V3)], producing the output files (Output 1, Output 2, …).
 Intermediate data are divided according to a user-defined partitioner.
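The map/combine/partition/sort/reduce pipeline above can be sketched as a toy single-process simulation (this is generic MapReduce, not LIBRA; all names here are illustrative):

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, num_partitions, partitioner):
    """Toy single-process MapReduce: map -> partition -> sort -> reduce."""
    # Map stage: each split yields intermediate (K2, V2) pairs, routed to a
    # partition by the user-defined partitioner.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for split in splits:
        for k2, v2 in map_fn(split):
            partitions[partitioner(k2, num_partitions)][k2].append(v2)
    # Reduce stage: sort keys within each partition, then reduce [V2] -> V3.
    return [{k2: reduce_fn(k2, vs) for k2, vs in sorted(part.items())}
            for part in partitions]

# Word count as the classic example.
word_map = lambda line: [(w, 1) for w in line.split()]
word_reduce = lambda k, vs: sum(vs)
hash_part = lambda key, n: hash(key) % n

out = run_mapreduce(["a b a", "b c"], word_map, word_reduce, 2, hash_part)
counts = {k: v for part in out for k, v in part.items()}
print(dict(sorted(counts.items())))  # {'a': 2, 'b': 2, 'c': 1}
```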

9 Challenges in solving data skew
 Many real-world applications exhibit data skew: Sort, Grep, Join, Group, Aggregation, PageRank, Inverted Index, etc.
 The data distribution cannot be determined ahead of time.
 The computing environment can be heterogeneous: diversity of hardware, resource competition in cloud environments.

10 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

11 Previous work
 In the parallel database area: limited to join, group, and aggregate operations.
 Pre-run sampling jobs:
 – Adding two pre-run sampling and counting jobs for theta-join (SIGMOD'11)
 – Pre-processing extraction and sampling procedures for spatial feature extraction (SOCC'11)
 – Drawbacks: significant overhead; applicable only to certain applications
 Collecting data information during job execution:
 – Collecting key frequencies on each node and aggregating them on the master after all maps finish (CloudCom'10)
 – Partitioning intermediate data into many partitions and greedily bin-packing them after all maps finish (CLOSER'11, ICDE'12)
 – Drawbacks: brings a barrier between the map and reduce phases; bin-packing cannot support total order
 SkewTune (SIGMOD'12):
 – Splits skewed tasks when detected; reconstructs the output by concatenating the results
 – Drawbacks: needs more task slots; cannot detect large keys; cannot split in the copy and sort phases

12 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

13 LIBRA – Solving data skew
 1. The master issues sample map tasks first
 2. Sample map tasks read input splits from HDFS and sample the data
 3. The master calculates the partitions
 4. The master asks workers to partition the map output; reduce tasks then write results back to HDFS

14 Sampling and partitioning
 Sampling strategies: random, TopCluster (ICDE'12); LIBRA samples the p largest keys plus q random keys.
 Estimating the intermediate data distribution: each large key represents only itself; each random key represents a small range of keys.
 Partitioning strategies: hash, bin-packing, range; LIBRA uses range partitioning.
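LIBRA's sampling strategy (the p largest keys plus q random keys) can be sketched roughly as follows; the function name, the per-split key counts, and the seed are all made up for illustration:

```python
import heapq
import random

def sample_split(key_counts, p, q, rng):
    """From one sample map task's key -> count statistics, keep the p keys
    with the largest counts plus q random keys from the rest (LIBRA-style)."""
    largest = heapq.nlargest(p, key_counts, key=key_counts.get)
    rest = [k for k in key_counts if k not in set(largest)]
    random_keys = rng.sample(rest, min(q, len(rest)))
    # Large keys each represent only themselves; random keys each stand in
    # for a small range of unsampled keys around them.
    return ({k: key_counts[k] for k in largest},
            {k: key_counts[k] for k in random_keys})

rng = random.Random(42)
counts = {"the": 900, "of": 700, "zebra": 3, "apple": 5, "quark": 2, "kite": 4}
large, rand = sample_split(counts, p=2, q=2, rng=rng)
print(sorted(large))  # ['of', 'the']
```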

15 Heterogeneity consideration
 Example: three nodes with performance 1.5, 1, and 0.5 share 300 tuples of intermediate data. Reducers 1, 2, and 3 receive 150, 100, and 50 tuples in proportion to node performance, so all reducers start and finish processing at the same time.
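Balancing processing time rather than data volume amounts to giving each reducer a quota proportional to its node's performance. A minimal sketch of that arithmetic (the helper name is an assumption):

```python
def proportional_quotas(total_tuples, performance):
    """Assign each reducer a share of the intermediate data proportional to
    its node's measured performance, so all reducers finish together."""
    total_perf = sum(performance)
    quotas = [round(total_tuples * p / total_perf) for p in performance]
    # Fix any rounding drift so the quotas still sum to the total.
    quotas[-1] += total_tuples - sum(quotas)
    return quotas

# Slide 15's example: 300 tuples, node performance 1.5 : 1 : 0.5.
print(proportional_quotas(300, [1.5, 1.0, 0.5]))  # [150, 100, 50]
```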

16 Problem Statement

17 Distribution estimation
 The sorted sample L = K1, K2, …, K|L| is divided by "marked keys" into intervals; interval i is estimated to contain Pi keys and Qi tuples (a marked key's interval has Pi = 1).
 Steps: (a) sum up the samples, (b) pick the marked keys, (c) estimate the distribution.
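One way to read the three steps above: between consecutive marked keys, scale the sampled key and tuple counts up by the inverse sampling rate. A guessed sketch (the function, the sample format, and the rate are assumptions, not the paper's actual estimator):

```python
def estimate_intervals(sampled, marked, sample_rate):
    """Walk the sorted sample; each marked (large) key forms its own interval
    with P_i = 1, while runs of random keys between marked keys are scaled
    up by 1 / sample_rate to estimate P_i keys and Q_i tuples per interval."""
    intervals = []
    run_keys, run_tuples = 0, 0
    for key, cnt in sampled:  # sampled is sorted by key
        if key in marked:
            if run_keys:  # close the run of random keys before the marked key
                intervals.append((run_keys / sample_rate, run_tuples / sample_rate))
                run_keys, run_tuples = 0, 0
            intervals.append((1, cnt / sample_rate))  # large key stands alone
        else:
            run_keys += 1
            run_tuples += cnt
    if run_keys:
        intervals.append((run_keys / sample_rate, run_tuples / sample_rate))
    return intervals

sampled = [("a", 2), ("c", 3), ("hot", 50), ("m", 1), ("q", 4)]
print(estimate_intervals(sampled, marked={"hot"}, sample_rate=0.25))
# [(8.0, 20.0), (1, 200.0), (8.0, 20.0)]
```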

18 Sparse index to speed up partitioning
 The sorted intermediate data are stored in chunks; a sparse index keeps one entry (Kb, Offset, L, Checksum) per chunk, recording its first key, byte offset, length, and checksum. Locating partition boundaries through the index, instead of scanning the intermediate data, decreases the partition time by an order of magnitude.
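A sparse index of this shape can be searched with plain binary search instead of a scan; a hypothetical sketch (the entry layout and values are invented):

```python
import bisect

# Each entry summarizes one chunk of sorted map output:
# (first key in chunk, byte offset, length, checksum).
index = [
    ("apple", 0, 4096, 0xA1),
    ("kiwi", 4096, 4096, 0xB2),
    ("mango", 8192, 4096, 0xC3),
]

def locate_chunk(index, boundary_key):
    """Binary-search the sparse index for the chunk that may contain
    boundary_key, instead of scanning all intermediate (k, v) pairs."""
    first_keys = [entry[0] for entry in index]
    # Rightmost chunk whose first key is <= boundary_key.
    i = bisect.bisect_right(first_keys, boundary_key) - 1
    return index[max(i, 0)]

print(locate_chunk(index, "lemon"))  # ('kiwi', 4096, 4096, 178)
```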

19 Large cluster splitting
 When the reduce phase treats each intermediate (k, v) pair independently (e.g., sort, grep, join), a large key cluster can be split across reducers.
 Example: clusters A (cnt = 100), B (cnt = 10), C (cnt = 10) with two reducers.
 – Split not allowed: Reducer 1 gets A (100); Reducer 2 gets B and C (20). The result is skewed.
 – Split allowed: A is split into 60 + 40, so Reducer 1 gets A (60) and Reducer 2 gets A (40), B (10), C (10).
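The split shown above can be produced by a simple range-partitioning pass that overflows a large cluster into the next reducer; a sketch under the assumption that cluster counts are known exactly (the helper name is invented):

```python
def place_with_split(clusters, num_reducers):
    """Range-partition key clusters across reducers, splitting any cluster
    whose count overflows the per-reducer target (allowed when each (k, v)
    pair can be reduced independently, e.g. sort/grep/join)."""
    total = sum(cnt for _, cnt in clusters)
    target = total / num_reducers
    assignments = [[] for _ in range(num_reducers)]
    r, load = 0, 0
    for key, cnt in clusters:  # clusters are sorted by key
        while cnt > 0:
            room = target - load
            # The last reducer absorbs everything that remains.
            take = cnt if (cnt <= room or r == num_reducers - 1) else int(room)
            if take <= 0:  # no room left on this reducer; move to the next
                r, load = r + 1, 0
                continue
            assignments[r].append((key, take))
            cnt -= take
            load += take
            if load >= target and r < num_reducers - 1:
                r, load = r + 1, 0
    return assignments

# Slide 19's example: A is a large key next to small clusters B and C.
print(place_with_split([("A", 100), ("B", 10), ("C", 10)], 2))
# [[('A', 60)], [('A', 40), ('B', 10), ('C', 10)]]
```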

20 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

21 Experiment environment
 Cluster: 30 virtual machines on 15 physical machines.
 Each physical machine: dual processors (2.4 GHz Xeon E5620), 24 GB of RAM, two 150 GB disks, connected by 1 Gbps Ethernet.
 Each virtual machine: 2 virtual cores, 4 GB of RAM, and 40 GB of disk space.
 Benchmarks: Sort, Grep, Inverted Index, Join.

22 Evaluation – Accuracy of the sampling method

23 Evaluation – LIBRA execution (sort): 80% faster than Hadoop hash partitioning; 167% faster than Hadoop range partitioning.

24 Evaluation – Degree of skew (sort): the overhead of LIBRA is minimal.

25 Evaluation – Different applications: Grep – searching for different words in the full English Wikipedia archive, with a total data size of 31 GB.

26 Evaluation – Different applications: Inverted Index, on the full English Wikipedia archive.

27 Evaluation – Different applications: Join.

28 Evaluation – Heterogeneous environments (sort): 30% faster than without the heterogeneity consideration.

29 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

30 Conclusion
 We present LIBRA, a system that implements a set of innovative skew mitigation strategies in MapReduce:
 – A new sampling method for general user-defined programs: the p largest keys plus q random keys
 – An approach to balance the load among the reduce tasks, with support for splitting large keys
 – Consideration of heterogeneous environments: balance the processing time instead of just the amount of data
 Performance evaluation demonstrates that the improvement is significant (up to 4× faster) and that the overhead is negligible even in the absence of skew.

31 Thank You!

