SkewReduce: Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon, Magdalena Balazinska, Bill Howe (University of Washington), Jerome Rolia (HP Labs)
Published in SoCC 2010

3
Motivation Science is becoming a data management problem MapReduce is an attractive solution – Simple API, declarative layer, seamless scalability, … But it is hard to – Express complex algorithms and – Get good performance (14 hours vs. 70 minutes) SkewReduce: – Goal: Scalable analysis with minimal effort – Toward scalable feature extraction analysis 2

4
Application 1: Extracting Celestial Objects Input – { (x,y,ir,r,g,b,uv,…) } Coordinates Light intensities … Output – List of celestial objects Star Galaxy Planet Asteroid … M34 from Sloan Digital Sky Survey 3

5
Application 2: Friends of Friends Simple clustering algorithm used in astronomy Input: – Points in multi-dimensional space Output: – List of clusters (e.g., galaxy, …) – Original data annotated with cluster ID Friends – Two points within ε distance Friends of Friends – Transitive closure of Friends relation 4 ε
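The definition above – friends within ε, clusters as the transitive closure – can be sketched as a small serial implementation. This is not the paper's code; it is a naive O(N²) illustration (hypothetical `friends_of_friends` function, BFS over the friends relation):

```python
import math
from collections import deque

def friends_of_friends(points, eps):
    """Cluster points: two points are friends if within eps of each other;
    clusters are the transitive closure of the friends relation."""
    n = len(points)
    cluster = [None] * n          # cluster ID per point, None = unassigned
    next_id = 0
    for seed in range(n):
        if cluster[seed] is not None:
            continue
        cluster[seed] = next_id
        queue = deque([seed])
        while queue:              # BFS over the friends relation
            i = queue.popleft()
            for j in range(n):
                if cluster[j] is None and math.dist(points[i], points[j]) <= eps:
                    cluster[j] = next_id
                    queue.append(j)
        next_id += 1
    return cluster

# Two well-separated groups of points yield two clusters
pts = [(0, 0), (0.5, 0), (1.0, 0), (10, 10), (10.5, 10)]
print(friends_of_friends(pts, eps=1.0))  # → [0, 0, 0, 1, 1]
```

A production version would use a spatial index (e.g., a k-d tree) for the neighbor search instead of the quadratic scan.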

Parallel Friends of Friends
– Partition
– Local clustering
– Merge: P1-P2, P3-P4
– Merge: P1-P2-P3-P4
– Finalize: annotate original data
[Figure: four partitions P1–P4 with local clusters C1–C6; merging relabels clusters across partition boundaries, e.g. C5 → C3, C6 → C3, C4 → C3]
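The merge steps above relabel per-partition cluster IDs whenever clusters touch across a boundary (e.g. C5 → C3 in the slide's figure). One standard way to resolve such mappings into canonical IDs is union-find; a minimal sketch (hypothetical `merge_clusters` helper, not the paper's implementation):

```python
def merge_clusters(mappings):
    """Resolve cross-partition cluster-ID mappings (e.g. C5 -> C3) into a
    canonical relabeling using union-find with path compression."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for a, b in mappings:                  # union: a's cluster joins b's
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    return {c: find(c) for c in parent}    # cluster -> canonical cluster

# The relabelings from the slide's example
final = merge_clusters([("C5", "C3"), ("C6", "C3"), ("C4", "C3")])
print(final["C4"], final["C5"], final["C6"])  # all resolve to C3
```

Only the mappings (not the points) travel between merge steps, which is what keeps the hierarchical merge cheap.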

Parallel Feature Extraction
– Partition multi-dimensional input data
– Extract features from each partition (Map)
– Merge (or reconcile) features ("hierarchical Reduce")
– Finalize output

Problem: Skew
Local clustering (MAP), then merge (REDUCE).
[Figure: per-task timeline (Task ID vs. time) on an 8-node cluster with 32 map / 32 reduce slots; annotations: "5 minutes", "35 minutes", "The top red line runs for 1.5 hours"]

Unbalanced Computation: Skew
Computation skew is a characteristic of an algorithm: the same amount of input data does not imply the same runtime – O(N log N) vs. O(N²), depending on whether a particle has 0 friends or O(N) friends.
Can we scale out an off-the-shelf implementation with no (or minimal) modifications?
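The O(N log N) vs. O(N²) gap is easy to demonstrate: the work done by friends-of-friends is driven by the number of friend pairs, not by N alone. A small sketch (hypothetical `friend_pairs` counter) comparing a dense clump against spread-out points of the same size:

```python
import math

def friend_pairs(points, eps):
    """Count pairs of points within eps of each other: a proxy for the
    work a friends-of-friends pass does on this data."""
    return sum(1 for i in range(len(points))
                 for j in range(i + 1, len(points))
                 if math.dist(points[i], points[j]) <= eps)

n, eps = 200, 1.0
dense  = [(0.0, 0.0)] * n                      # every particle has O(N) friends
sparse = [(10.0 * i, 0.0) for i in range(n)]   # no particle has any friend

print(friend_pairs(dense, eps))   # 19900 = n*(n-1)/2 pairs
print(friend_pairs(sparse, eps))  # 0 pairs
```

Two partitions of identical size can therefore differ by a factor of N in work, which is exactly the skew uniform partitioning cannot fix.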

Solution 1? Micro-partitions
Assign a tiny amount of work to each task to reduce skew.

How about having micro-partitions?
It works – but framework overhead grows. To find the sweet spot, one needs to try different granularities.
Can we find a good partitioning plan without trial and error?
(8-node cluster, 32 map / 32 reduce slots)

Outline
– Motivation
– SkewReduce
  – API (in the paper)
  – Partition optimization
– Evaluation
– Summary

Partition Optimization
Partitions can have varying granularities. Can we automatically find a good partition plan and schedule?
Inputs: the serial feature-extraction algorithm and the merge algorithm.

Approach
The SkewReduce optimizer takes a data sample, the cluster configuration, and cost functions, and produces a runtime plan:
– Bounding boxes for data partitions
– A schedule
Goal: minimize expected total runtime.

Partition Plan Guided by Cost Functions
"Given a sample, how long will it take to process?"
Two cost functions:
– Feature cost: (bounding box, sample, sample rate) → cost
– Merge cost: (bounding boxes, sample, sample rate) → cost
These are the basis of two optimization decisions:
– How to split a partition (axis, point)
– When to stop partitioning

Search for a Partition Plan
Greedy top-down search: split a partition if the total expected runtime improves.
– Evaluate costs for the subpartitions and the merge
– Estimate the new runtime
[Figure: original partition with cost 100; one possible split yields schedule 1 = 60 (ACCEPT), another yields schedule 2 = 110 (REJECT)]

Summary of Contributions
Given a feature-extraction application, possibly with computation skew, SkewReduce:
– Automatically partitions input data
– Improves runtime in spite of computation skew
Key technique: user-defined cost functions.

Evaluation
8-node cluster:
– Dual quad-core CPUs, 16 GB RAM
– Hadoop with a custom patch to the MapReduce API
Distributed Friends of Friends:
– Astro: gravitational simulation snapshot, 900 M particles
– Seaflow: flow cytometry survey, 59 M observations

Does SkewReduce work?
The SkewReduce plan yields 2–8× faster runtimes.
[Figure: completion times (hours/minutes) on Seaflow (1.9 GB, 3D) and Astro (18 GB, 3D) for MapReduce with micro-partition granularities 128 MB, 16 MB, 4 MB, 2 MB, a manual plan, and SkewReduce; annotation: "1 hour preparation"]

Impact of Cost Function
Higher fidelity = better performance (Astro dataset).

Highlights of Evaluation
– Sample size: representativeness of the sample is important
– Runtime of SkewReduce optimization: less than 15% of the real runtime of the SkewReduce plan
– Data volume in the Merge phase: total volume during Merge = 1% of input data
Details in the paper.

Conclusion
Scientific analysis should be easy to write, scalable, and have predictable performance.
SkewReduce:
– API for feature-extracting functions
– Scalable execution with good performance in spite of skew
– Cost-based partition optimization using a data sample
Published in SoCC 2010 – a more general version is coming out soon!
