Meeting Service Level Objectives of Pig Programs Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo University of Pennsylvania Hewlett-Packard.


1 Meeting Service Level Objectives of Pig Programs Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo University of Pennsylvania Hewlett-Packard Labs

2 Cloud Environment Advantages ▫Large amount of resources ▫Elasticity ▫Pay-as-you-go pricing model Challenges ▫Distributed resources ▫Error-prone

3 MapReduce and Pig MapReduce: Simple, fault-tolerant framework for data processing in the cloud Pig ▫Advanced MapReduce-based platform ▫Widely used: Yahoo!, Twitter, LinkedIn ▫PigLatin: A high-level declarative language for expressing data analysis tasks as Pig programs [Figure: jobs j1 to j7]

4 Motivation Latency-sensitive applications ▫Personalized advertising ▫Spam and fraud detection ▫Real-time log analysis How many resources does an application need to meet its deadline?

5 Contributions Performance modeling for Pig programs ▫Given a Pig program, estimate its completion time as a function of the assigned resources Deadline-driven resource allocation estimates for Pig programs ▫Given a completion time target, determine the amount of resources a Pig program needs to achieve it

6 Outline Introduction Building block ▫Performance model for single MapReduce jobs Resource allocation for Pig programs Evaluation Conclusion and ongoing work

7 Theoretical Makespan Bounds Bounds-based makespan estimates ▫n tasks, k servers ▫avg: average duration of the n tasks ▫max: maximum duration of the n tasks Lower bound Upper bound

8 Illustration Schedule 1: 1 4 3 2 3 1 2, makespan = 4 (lower bound = 4) Schedule 2: 3 1 2 3 2 1 4, makespan = 7 (upper bound = 8)
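The formulas for the two bounds did not survive in this transcript, so the sketch below relies on assumptions rather than on the authors' slides. It simulates a greedy assignment (each task goes to the slot that frees up first) and computes the classic bounds for such schedules: a lower bound of n·avg/k and an upper bound of (n−1)·avg/k + max. The looser form n·avg/k + max is also returned because it matches the value 8 quoted in the illustration. The function names, and the choice of k = 4 slots, are assumptions made for this example.

```python
import heapq


def greedy_makespan(durations, k):
    """Run tasks in the given order, each on the slot that frees up first."""
    slots = [0.0] * k                # current finish time of every slot
    heapq.heapify(slots)
    for d in durations:
        earliest = heapq.heappop(slots)      # slot that becomes free first
        heapq.heappush(slots, earliest + d)
    return max(slots)


def makespan_bounds(durations, k):
    """Return (lower, upper_tight, upper_loose) for greedy scheduling."""
    n = len(durations)
    total = sum(durations)           # equals n * avg
    longest = max(durations)
    lower = total / k                               # n * avg / k
    upper_tight = (n - 1) * (total / n) / k + longest  # (n - 1) * avg / k + max
    upper_loose = total / k + longest               # n * avg / k + max
    return lower, upper_tight, upper_loose


# Task order listed for Schedule 2 on the illustration slide, on 4 assumed slots.
schedule_2 = [3, 1, 2, 3, 2, 1, 4]
makespan_2 = greedy_makespan(schedule_2, 4)
lower, upper_tight, upper_loose = makespan_bounds(schedule_2, 4)
```

Under these assumptions the simulated makespan for the Schedule 2 ordering is 7, and it falls between the lower bound and either upper bound.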

9 Estimate Completion Time for Single MR Job Estimate the bounds of the job completion time based on the job profile ▫Most production jobs are executed routinely on new data sets ▫Job profile built from previous runs ▪ Map stage: M_avg, M_max, AvgInputSize, Selectivity ▪ Reduce stage: Sh_avg, Sh_max, R_avg, R_max, Selectivity ▫Predict the completion time of future runs from the profile

10 Estimate Completion Time for Single MR Job Estimating bounds on the duration of map and reduce stages Map stage duration depends on: ▫N_M: the number of map tasks ▫S_M: the number of map slots Reduce stage duration depends on: ▫N_R: the number of reduce tasks ▫S_R: the number of reduce slots Job duration T_J^low, T_J^up, T_J^avg ▫Sum of the map and reduce stage durations
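A minimal sketch of how the stage bounds could be combined into job-level estimates. It assumes each stage is bounded with the greedy-scheduling formulas (lower: N·avg/S, upper: (N−1)·avg/S + max), that the job bounds are the sums of the stage bounds, and that the average estimate is the midpoint of the two bounds. The shuffle phase is folded into the reduce stage for brevity. `StageProfile`, `stage_bounds` and `job_bounds` are hypothetical names, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class StageProfile:
    avg_duration: float  # average task duration in the stage (e.g. M_avg)
    max_duration: float  # maximum task duration in the stage (e.g. M_max)


def stage_bounds(num_tasks, num_slots, profile):
    """Lower and upper bound on the duration of one stage."""
    lower = num_tasks * profile.avg_duration / num_slots
    upper = ((num_tasks - 1) * profile.avg_duration / num_slots
             + profile.max_duration)
    return lower, upper


def job_bounds(n_m, s_m, map_profile, n_r, s_r, reduce_profile):
    """Return (T_low, T_up, T_avg) as sums of the map and reduce stage bounds."""
    m_low, m_up = stage_bounds(n_m, s_m, map_profile)
    r_low, r_up = stage_bounds(n_r, s_r, reduce_profile)
    t_low = m_low + r_low
    t_up = m_up + r_up
    t_avg = (t_low + t_up) / 2   # assumed: midpoint of the two bounds
    return t_low, t_up, t_avg


# Made-up profile values, for illustration only.
estimate = job_bounds(100, 10, StageProfile(10, 20),
                      20, 5, StageProfile(30, 40))
```

With these made-up numbers the map stage is bounded by 100 and 119 time units and the reduce stage by 120 and 154, so the job estimate spans 220 to 273.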

11 Resource Allocation for Single MR Job Given a deadline D and the job profile, find the minimal resources needed to complete the job within D ▫Given: the number of map/reduce tasks and the statistics from the job profile ▫Find: the values of S_M^J and S_R^J that minimize S_M^J + S_R^J, using Lagrange multipliers
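The equation being minimized is not preserved in the transcript. The sketch below assumes the completion-time estimate has the form a/S_M + b/S_R = d, where a and b are derived from the job profile and the task counts and d is the deadline minus any slot-independent time. Setting the partial derivatives of the Lagrangian of S_M + S_R to zero gives S_M/S_R = sqrt(a/b), which leads to the closed form used here. The function name `min_slots` and the coefficient names are assumptions.

```python
import math


def min_slots(a, b, d):
    """Minimise s_m + s_r subject to a / s_m + b / s_r = d.

    Stationarity of the Lagrangian gives s_m / s_r = sqrt(a / b);
    substituting into the constraint yields the closed form below.
    """
    if min(a, b, d) <= 0:
        raise ValueError("a, b and d must all be positive")
    scale = (math.sqrt(a) + math.sqrt(b)) / d
    return math.sqrt(a) * scale, math.sqrt(b) * scale


# Made-up coefficients: a = 100, b = 25, time budget d = 5.
s_m, s_r = min_slots(100, 25, 5)
```

For these made-up coefficients the result is 30 map slots and 15 reduce slots, and 100/30 + 25/15 equals the budget of 5. In practice the values would be rounded up to whole slots.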

12 Outline Introduction Building block ▫Performance model for single MapReduce jobs Resource allocation for Pig programs Evaluation Conclusion and ongoing work

13 Performance Model for Pig Programs Let P = {J_1, J_2, ..., J_N}; extract the job profile of each job contained in P ▫Assign a unique name to each job within a program The program completion time is estimated as the sum of the completion times of all the jobs contained in P
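As a small illustration of the sum described above, the following sketch adds up per-job lower and upper bounds to obtain program-level bounds. It assumes the jobs run one after another, which is what a plain sum implies. The dictionary layout and the job names are hypothetical.

```python
def program_bounds(job_bounds_by_name):
    """Sum per-job (lower, upper) bounds into program-level bounds."""
    lower = sum(low for low, _ in job_bounds_by_name.values())
    upper = sum(up for _, up in job_bounds_by_name.values())
    return lower, upper


# Made-up per-job bounds, keyed by the unique job name within the program.
bounds = program_bounds({"J1": (10, 14), "J2": (20, 26), "J3": (5, 9)})
```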

14 Resource Allocation for Pig Programs Possible strategy: find an appropriate pair of map and reduce slots for each job in the program Problem: difficult for the scheduler to implement and manage

15 Resource Allocation for Pig Programs A simpler and more elegant solution ▫Allocate the same set of resources to the entire program instead of to each job Rewrite the previous equations accordingly Find the minimum set of map and reduce slots (S_M^P, S_R^P) for the entire Pig program
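The rewritten equations are missing from the transcript, so this is a sketch under stated assumptions. Suppose each job i has an estimated completion time of the form a_i/S_M + b_i/S_R + c_i. With one shared pair of slots for the whole program, the per-job terms can be summed, giving (Σa_i)/S_M + (Σb_i)/S_R = D − Σc_i, which has the same shape as the single-job problem and the same Lagrange-multiplier solution. The function names and the (a, b, c) tuple layout are hypothetical.

```python
import math


def program_slots(jobs, deadline):
    """Smallest shared (map, reduce) slot pair that meets the deadline.

    jobs: list of (a, b, c) tuples with a, b > 0, where a job's estimated
    time is a / s_m + b / s_r + c.
    """
    total_a = sum(a for a, _, _ in jobs)
    total_b = sum(b for _, b, _ in jobs)
    total_c = sum(c for _, _, c in jobs)
    budget = deadline - total_c      # time left for slot-dependent work
    if budget <= 0:
        raise ValueError("deadline cannot be met with any number of slots")
    scale = (math.sqrt(total_a) + math.sqrt(total_b)) / budget
    # Round up to whole slots; extra slots can only shorten the estimate.
    s_m = math.ceil(math.sqrt(total_a) * scale)
    s_r = math.ceil(math.sqrt(total_b) * scale)
    return s_m, s_r


def program_time(jobs, s_m, s_r):
    """Estimated program completion time with a shared slot allocation."""
    return sum(a / s_m + b / s_r + c for a, b, c in jobs)


# Two made-up jobs and a deadline of 7 time units.
example_jobs = [(60, 15, 1), (40, 10, 1)]
slots = program_slots(example_jobs, 7)
```

Because the allocation is shared, the scheduler only has to manage one pair of numbers per program, which is the simplification this slide argues for.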

16 Experiment Setup 66-node cluster in 2 racks ▫4 AMD 2.39GHz cores ▫8 GB RAM ▫two 160GB hard disks Configuration ▫1 jobtracker, 1 namenode, 64 worker nodes ▫2 map slots and 1 reduce slot for each node

17 Benchmark PigMix benchmark ▫17 programs ▫8 tables as the input data Dataset ▫Test dataset ▪ Generated with the PigMix data generator ▪ Total size around 1TB ▫Experimental dataset ▪ Same layout as the test dataset ▪ 20% larger in size

18 Model Accuracy How well does our performance model capture Pig program completion time? Normalized results for predicted and measured completion time

19 Meeting Deadlines Are we meeting deadlines with our resource allocation model? PigMix executed on the experimental dataset: do we meet deadlines?

20 Conclusion ▫The performance model can accurately estimate the completion time of MapReduce workflows ▫Enables automatic resource provisioning for MapReduce workflows with deadlines Ongoing work ▫Refine the performance model for workflows with concurrent jobs ▫Incorporate failure scenarios in the current model

21 Thank you

