SkewReduce: Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon, Magdalena Balazinska, Bill Howe (University of Washington), Jerome Rolia (HP Labs)
Published in SoCC 2010

Motivation
– Science is becoming a data management problem
– MapReduce is an attractive solution: simple API, declarative layer, seamless scalability, …
– But it is hard to express complex algorithms and to get good performance (14 hours vs. 70 minutes)
– SkewReduce goal: scalable analysis with minimal effort, toward scalable feature-extraction analysis

Application 1: Extracting Celestial Objects
– Input: { (x, y, ir, r, g, b, uv, …) }: coordinates, light intensities, …
– Output: list of celestial objects (star, galaxy, planet, asteroid, …)
[Figure: M34 from the Sloan Digital Sky Survey]

Application 2: Friends of Friends
– Simple clustering algorithm used in astronomy
– Input: points in multi-dimensional space
– Output: list of clusters (e.g., galaxies, …) and the original data annotated with cluster IDs
– Friends: two points within distance ε of each other
– Friends of Friends: transitive closure of the Friends relation
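The Friends-of-Friends definition above can be sketched as a naive single-node implementation: points within ε are friends, and a cluster is the transitive closure of that relation. This is an illustrative O(N²) sketch assuming Euclidean distance, not the paper's distributed algorithm.

```python
from collections import deque

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def friends_of_friends(points, eps):
    """Naive single-node Friends-of-Friends: grow each cluster by
    breadth-first search over the 'within eps' friends relation,
    which computes exactly its transitive closure."""
    n = len(points)
    cluster = [None] * n
    next_id = 0
    for i in range(n):
        if cluster[i] is not None:
            continue
        cluster[i] = next_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in range(n):
                if cluster[q] is None and dist2(points[p], points[q]) <= eps * eps:
                    cluster[q] = next_id
                    queue.append(q)
        next_id += 1
    return cluster

# Points 0-2 form a chain of friends (each pair within eps = 1.5),
# so they join transitively; point 3 is isolated.
labels = friends_of_friends([(0, 0), (1, 0), (2, 0), (10, 10)], eps=1.5)
```

Note that points 0 and 2 are 2.0 apart (farther than ε) yet land in the same cluster, which is exactly the transitive-closure behavior the slide describes.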

Parallel Friends of Friends
– Partition the data (P1, P2, P3, P4)
– Local clustering within each partition (clusters C1 … C6)
– Merge P1-P2 and P3-P4, then P1-P2-P3-P4, reconciling cluster IDs across partition boundaries (e.g., C6 → C5, C5 → C3, C4 → C3)
– Finalize: annotate the original data with the merged cluster IDs
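The merge step above reconciles cluster IDs discovered in different partitions. A minimal sketch of that reconciliation, using union-find over the observed equivalences (the representation of mappings as ID pairs is an assumption for illustration, not the paper's data structure):

```python
def merge_cluster_ids(equivalences):
    """Resolve cross-partition cluster equivalences (pairs like
    ('C5', 'C3') meaning the two local clusters are one global cluster)
    with union-find, so every local ID maps to a single global root."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in equivalences:
        parent[find(a)] = find(b)

    # Final map: every local cluster ID -> its global representative
    return {c: find(c) for c in parent}

roots = merge_cluster_ids([("C5", "C3"), ("C6", "C5"), ("C4", "C3")])
```

After the merge, the finalize step would rewrite each point's local cluster ID through this map.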

Parallel Feature Extraction
– Partition the multi-dimensional input data
– Extract features from each partition (Map)
– Merge (or reconcile) features across partitions ("hierarchical reduce")
– Finalize the output
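The "hierarchical reduce" step can be sketched as a binary merge tree over the per-partition results: each round combines neighbors, halving the number of partial results. A minimal sketch, assuming only that the combiner is an associative two-way merge:

```python
def hierarchical_reduce(partials, merge):
    """Merge per-partition results pairwise, tournament-style:
    round 1 merges neighbors, round 2 merges those results, and so on,
    until one result remains. `merge` is any associative combiner."""
    while len(partials) > 1:
        partials = [
            merge(partials[i], partials[i + 1]) if i + 1 < len(partials)
            else partials[i]            # odd element passes through
            for i in range(0, len(partials), 2)
        ]
    return partials[0]

# Toy example: each "partition result" is a set of feature labels,
# and merging is set union.
total = hierarchical_reduce([{"a"}, {"b"}, {"c"}, {"d"}], lambda x, y: x | y)
```

In a real run each `merge` call would also reconcile features that straddle the boundary between the two merged regions, as in the Friends-of-Friends example.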

Problem: Skew
– Local clustering (Map) followed by Merge (Reduce)
[Figure: task timeline (Task ID vs. Time) on an 8-node cluster with 32 map / 32 reduce slots; most tasks finish in roughly 5 to 35 minutes, but the longest task runs for 1.5 hours]

Unbalanced Computation: Skew
– Computation skew is a characteristic of the algorithm: the same amount of input data does not imply the same runtime
– Runtime ranges from O(N log N) with 0 friends per particle to O(N²) with O(N) friends per particle
– Question: can we scale out an off-the-shelf implementation with no (or minimal) modifications?

Solution 1? Micro-Partitions
– Assign a tiny amount of work to each task to reduce skew

How About Micro-Partitions?
– It works, but framework overhead grows as partitions shrink
– Finding the sweet spot requires trying many different granularities
– Question: can we find a good partitioning plan without trial and error?
[Figure: runtime vs. partition granularity on an 8-node cluster with 32 map / 32 reduce slots]

Outline
– Motivation
– SkewReduce: API (in the paper), partition optimization
– Evaluation
– Summary

Partition Optimization
– Varying granularities of partitions
– Can we automatically find a good partition plan and schedule?
– Inputs: the serial feature-extraction algorithm and the merge algorithm
[Figure: example partition plan with numbered partitions of varying sizes]

Approach
– The SkewReduce optimizer takes a data sample, the cluster configuration, and cost functions
– Goal: minimize expected total runtime
– Output: a SkewReduce runtime plan, consisting of bounding boxes for the data partitions and a schedule

Partition Plan Guided by Cost Functions
– "Given a sample, how long will it take to process?"
– Two cost functions:
– Feature cost: (bounding box, sample, sample rate) → cost
– Merge cost: (bounding boxes, sample, sample rate) → cost
– Basis of two optimization decisions: how (axis, point) to split a partition, and when to stop partitioning

Searching for a Partition Plan
– Greedy top-down search: split a partition only if the total expected runtime improves
– Evaluate the costs of the subpartitions and the merge, then estimate the new runtime
[Figure: partition 1 (cost 100) is split into partitions 2 and 3 (costs 50 and 10); the resulting schedule with expected runtime 60 is accepted, while an alternative with expected runtime 110 is rejected]

Summary of Contributions
– Given a feature-extraction application, possibly with computation skew:
– SkewReduce automatically partitions the input data and improves runtime in spite of computation skew
– Key technique: user-defined cost functions

Evaluation
– 8-node cluster: dual quad-core CPUs, 16 GB RAM, Hadoop 0.20.1 plus a custom patch to the MapReduce API
– Distributed Friends of Friends on two datasets:
– Astro: gravitational simulation snapshot, 900 M particles
– Seaflow: flow cytometry survey, 59 M observations

Does SkewReduce Work?
– The SkewReduce plan yields 2 to 8 times faster running time than fixed-granularity MapReduce partitions; the manual plan took about 1 hour of preparation

                               SkewReduce  Manual   2 MB   4 MB  16 MB  128 MB
  Astro (18 GB, 3D), hours         1.6       2.0     5.7    4.1    8.8    14.1
  Seaflow (1.9 GB, 3D), minutes   14.1       -      98.7   77.7   63.1    87.2

Impact of Cost Function
– Higher-fidelity cost functions yield better performance (Astro dataset)

Highlights of Evaluation
– Sample size: representativeness of the sample is important
– Runtime of the SkewReduce optimization: less than 15% of the real runtime of the SkewReduce plan
– Data volume in the Merge phase: total volume during Merge is about 1% of the input data
– Details in the paper

Conclusion
– Scientific analysis should be easy to write, scalable, and predictable in performance
– SkewReduce: an API for feature-extracting functions, scalable execution, and good performance in spite of skew
– Key idea: cost-based partition optimization using a data sample
– Published in SoCC 2010; a more general version is forthcoming
