SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.


2 SkewReduce: Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon, Magdalena Balazinska, Bill Howe, Jerome Rolia*
University of Washington, *HP Labs
Published in SoCC 2010

3 Motivation
Science is becoming a data management problem
MapReduce is an attractive solution
– Simple API, declarative layer, seamless scalability, …
But it is hard to
– Express complex algorithms and
– Get good performance (14 hours vs. 70 minutes)
SkewReduce:
– Goal: Scalable analysis with minimal effort
– Toward scalable feature extraction analysis

4 Application 1: Extracting Celestial Objects
Input
– { (x,y,ir,r,g,b,uv,…) }: coordinates, light intensities, …
Output
– List of celestial objects: star, galaxy, planet, asteroid, …
[Figure: M34 from Sloan Digital Sky Survey]

5 Application 2: Friends of Friends
Simple clustering algorithm used in astronomy
Input:
– Points in multi-dimensional space
Output:
– List of clusters (e.g., galaxy, …)
– Original data annotated with cluster ID
Friends
– Two points within ε distance
Friends of Friends
– Transitive closure of Friends relation
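
The definition above (friends are points within ε, clusters are the transitive closure) can be sketched with a union-find structure. This is a minimal serial illustration, not the implementation used in the talk: the function name `friends_of_friends`, the all-pairs loop, and the choice of `<=` for "within ε" are assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def friends_of_friends(points, eps):
    """Return one cluster label per point (assumed sketch, all-pairs O(N^2))."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= eps:  # i and j are "friends"
            ri, rj = find(i), find(j)
            if ri != rj:
                # Union: the smaller root index becomes the cluster label.
                parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(points))]


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
labels = friends_of_friends(points, eps=0.6)
print(labels)  # [0, 0, 0, 3]
```

Points 0 and 2 are 1.0 apart, so they are not friends directly, but both are friends of point 1 and therefore share a cluster.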

6 Parallel Friends of Friends
Partition
Local clustering
Merge
– P1-P2
– P3-P4
Merge
– P1-P2-P3-P4
Finalize
– Annotate original data
[Figure: partitions P1, P2, P3, P4 with clusters C1 to C6; relabelings shown: C5 → C3, C6 → C3, C4 → C3, C6 → C5]
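
The merge step can be pictured as producing a relabeling such as "C5 → C3" whenever two locally found clusters touch across a partition boundary. The sketch below is an assumed illustration: `merge_clusters` and its inputs are hypothetical, and it compares every cross-partition pair, whereas a practical version would presumably only look at points near the shared boundary.

```python
from math import dist


def merge_clusters(left, right, eps):
    """left/right: lists of (point, cluster_id). Return {old_id: new_id}."""
    ids = {c for _, c in left} | {c for _, c in right}
    parent = {c: c for c in ids}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for p, cp in left:
        for q, cq in right:
            if dist(p, q) <= eps:  # friends across the partition boundary
                a, b = find(cp), find(cq)
                if a != b:
                    parent[max(a, b)] = min(a, b)  # keep the smaller ID
    # Report only the clusters whose ID changes.
    return {c: find(c) for c in ids if find(c) != c}


left = [((0.0, 0.0), "C1"), ((0.9, 0.0), "C3")]
right = [((1.2, 0.0), "C5"), ((4.0, 0.0), "C6")]
mapping = merge_clusters(left, right, eps=0.5)
print(mapping)  # {'C5': 'C3'}
```

The finalize step would then apply such mappings to the original data to annotate every point with its global cluster ID.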

7 Parallel Feature Extraction
Partition multi-dimensional input data
Extract features from each partition
Merge (or reconcile) features
Finalize output
[Figure labels: INPUT DATA, Features, Map, “Hierarchical Reduce”]
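
The four steps above can be sketched as a map over partitions followed by a pairwise, level-by-level merge. This is only a toy illustration of the pattern: `hierarchical_reduce`, `extract`, and `merge` are hypothetical names, and the "feature" here is just (min, max, count) of a list of numbers.

```python
def hierarchical_reduce(features, merge):
    """Merge per-partition features pairwise, level by level (needs >= 1 feature)."""
    level = list(features)
    while len(level) > 1:
        nxt = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out is carried up unchanged
        level = nxt
    return level[0]


def extract(part):
    # Toy feature extracted from one partition.
    return (min(part), max(part), len(part))


def merge(a, b):
    # Reconcile two features into one.
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


partitions = [[3, 1], [4, 1, 5], [9, 2], [6]]
result = hierarchical_reduce([extract(p) for p in partitions], merge)
print(result)  # (1, 9, 8)
```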

8 Problem: Skew
The top red line runs for 1.5 hours
8 node cluster, 32 map/32 reduce slots
[Figure: task timeline, Task ID vs. Time, with Local Clustering (MAP) and Merge (REDUCE) phases; annotations: 5 minutes, 35 minutes]

9 Unbalanced Computation: Skew
Computation skew
– Characteristics of an algorithm
Same amount of input data != Same runtime
– 0 friends per particle: O(N log N)
– O(N) friends per particle: O(N²)
Can we scale out an off-the-shelf implementation with no (or minimal) modifications?
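
To make "same amount of input data != same runtime" concrete, the sketch below counts how many distance checks a grid-indexed friend search would perform on two inputs of identical size. The grid index and the function `candidate_pairs` are assumptions for illustration, not the algorithm used in the talk.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(points, eps):
    """Count distance checks for a 2D grid-indexed friend search (cell size eps)."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // eps), int(y // eps))].append(idx)
    checks = 0
    for (cx, cy), members in grid.items():
        # Only points in the same or an adjacent cell can be within eps.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in grid.get((cx + dx, cy + dy), ()):
                checks += sum(1 for i in members if i < j)  # each pair once
    return checks


n = 100
sparse = [(10.0 * i, 0.0) for i in range(n)]   # no particle has a neighbor
dense = [(0.001 * i, 0.0) for i in range(n)]   # every particle neighbors all others
print(candidate_pairs(sparse, eps=1.0))  # 0
print(candidate_pairs(dense, eps=1.0))   # 4950
```

Both inputs hold 100 points, yet the dense one needs N(N-1)/2 checks while the sparse one needs none beyond building the index.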

10 Solution 1? Micro partition
Assign a tiny amount of work to each task to reduce skew

11 How about having micro partitions?
It works!
Framework overhead!
To find sweet spot, need to try different granularities!
Can we find a good partitioning plan without trial and error?
8 node cluster, 32 map/32 reduce slots

12 Outline
Motivation
SkewReduce
– API (in the paper)
– Partition Optimization
Evaluation
Summary

13 Partition Optimization
Varying granularities of partitions
Can we automatically find a good partition plan and schedule?
[Figure: partitions numbered 1 to 15; labels: Serial Feature Extraction Algorithm, Merge Algorithm]

14 Approach
Goal: minimize expected total runtime
SkewReduce runtime plan
– Bounding boxes for data partitions
– Schedule
[Figure: SkewReduce Optimizer with labels Sample, Cost functions, Cluster configuration, and Runtime Plan; partitions numbered 1 to 15]

15 Partition Plan Guided By Cost Functions
“Given sample, how long will it take to process?”
Two cost functions:
– Feature cost: (Bounding box, sample, sample rate) → cost
– Merge cost: (Bounding boxes, sample, sample rate) → cost
Basis of two optimization decisions
– How (axis, point) to split a partition
– When to stop partitioning
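
The two signatures above could look roughly like the following. Everything in this sketch is hypothetical: the slides only give the argument lists, so the n·log2(n) and linear cost models, the helper `in_box`, and the box representation as (lower corner, upper corner) are assumptions chosen to keep the example small.

```python
from math import log2


def in_box(point, box):
    """True if point lies in the half-open box given as (lower, upper) corners."""
    lo, hi = box
    return all(l <= x < h for x, l, h in zip(point, lo, hi))


def feature_cost(box, sample, sample_rate):
    """Assumed model: scale the sample count up to n points, cost ~ n * log2(n)."""
    n = sum(in_box(p, box) for p in sample) / sample_rate
    return n * log2(n) if n > 1 else n


def merge_cost(boxes, sample, sample_rate):
    """Assumed model: linear in the estimated number of points being merged."""
    return sum(sum(in_box(p, b) for p in sample) for b in boxes) / sample_rate


sample = [(0.1, 0.1), (0.2, 0.8), (0.7, 0.3), (0.9, 0.9)]
left = ((0.0, 0.0), (0.5, 1.0))
right = ((0.5, 0.0), (1.0, 1.0))
print(feature_cost(left, sample, sample_rate=0.5))         # 8.0
print(merge_cost([left, right], sample, sample_rate=0.5))  # 8.0
```

With a 50% sample, two sampled points in the left box stand for an estimated four points, which gives 4 · log2(4) = 8 under the assumed model.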

16 Search Partition Plan
Greedy top-down search
– Split if total expected runtime improves
– Evaluate costs for subpartitions and merge
– Estimate new runtime
[Figure: original partition with cost 100 and a possible split into tasks 1, 2, 3 (costs 50 and 10 shown); Schedule 1 = 60: ACCEPT; Schedule 2 = 110: REJECT]
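
The accept/reject step of the greedy search can be sketched as: estimate the runtime of the split under the available slots and keep the split only if it beats the unsplit cost. The function `split_decision`, the longest-task-first scheduling heuristic, and the reading of the slide's figure as two subpartitions of cost 50 plus a merge of cost 10 are all assumptions; they are chosen because they reproduce the totals 60 and 110 shown on the slide.

```python
def split_decision(original, sub_costs, merge, slots):
    """Return (estimated runtime, accept?) for splitting one partition."""
    # Assign subpartition tasks to slots, longest task first, least-loaded slot next.
    loads = [0.0] * slots
    for cost in sorted(sub_costs, reverse=True):
        loads[loads.index(min(loads))] += cost
    # The merge can only start once every subpartition has finished.
    estimate = max(loads) + merge
    return estimate, estimate < original


print(split_decision(100, [50, 50], merge=10, slots=2))  # (60.0, True)
print(split_decision(100, [50, 50], merge=10, slots=1))  # (110.0, False)
```

A full optimizer would apply this test recursively, trying candidate (axis, point) splits and stopping when no split improves the expected runtime.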

17 Summary of Contributions
Given a feature extraction application
– Possibly with computation skew
SkewReduce
– Automatically partitions input data
– Improves runtime in spite of computation skew
Key technique: user-defined cost functions

18 Evaluation
8 node cluster
– Dual quad core CPU, 16 GB RAM
– Hadoop 0.20.1 + custom patch in MapReduce API
Distributed Friends of Friends
– Astro: Gravitational simulation snapshot, 900 M particles
– Seaflow: flow cytometry survey, 59 M observations

19 Does SkewReduce work?
SkewReduce plan runs 2 to 8 times faster
Unit      128 MB   16 MB   4 MB   2 MB   Manual   SkewReduce
Hours     14.1     8.8     4.1    5.7    2.0      1.6
Minutes   87.2     63.1    77.7   98.7   -        14.1
Chart labels: MapReduce; 1 hour preparation; (1.9 GB, 3D); (18 GB, 3D)

20 Impact of Cost Function
Higher fidelity = Better performance
[Figure: Astro]

21 Highlights of Evaluation
Sample size
– Representativeness of sample is important
Runtime of SkewReduce optimization
– Less than 15% of real runtime of SkewReduce plan
Data volume in Merge phase
– Total volume during Merge = 1% of input data
Details in the paper

22 Conclusion
Scientific analysis should be easy to write, scalable, and have predictable performance
SkewReduce
– API for feature extracting functions
– Scalable execution
– Good performance in spite of skew
Cost-based partition optimization using a data sample
Published in SoCC 2010
– More general version is coming out soon!

