Making Sense of Performance in Data Analytics Frameworks Kay Ousterhout Joint work with Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun UC Berkeley

About me: PhD student at UC Berkeley. Thesis work centers around the performance of large-scale distributed systems. Spark PMC member.

[Diagram: a Spark (or Hadoop/Dryad/etc.) job is composed of many parallel tasks.]

How can we make this job faster? Cache input data in memory.

How can we make this job faster? Cache input data in memory. Optimize the network.

How can we make this job faster? Cache input data in memory. Optimize the network. Mitigate the effect of stragglers. [Diagram: one task takes 30s while the others take 2-3s.]

Stragglers: Scarlett [EuroSys '11], SkewTune [SIGMOD '12], LATE [OSDI '08], Mantri [OSDI '10], Dolly [NSDI '13], GRASS [NSDI '14], Wrangler [SoCC '14]. Disk: Themis [SoCC '12], PACMan [NSDI '12], Spark [NSDI '12], Tachyon [SoCC '14]. Network: load balancing: VL2 [SIGCOMM '09], Hedera [NSDI '10], Sinbad [SIGCOMM '13]; application semantics: Orchestra [SIGCOMM '11], Baraat [SIGCOMM '14], Varys [SIGCOMM '14]; reduce data sent: PeriSCOPE [OSDI '12], SUDO [NSDI '12]; in-network aggregation: Camdoop [NSDI '12]; better isolation and fairness: Oktopus [SIGCOMM '11], EyeQ [NSDI '12], FairCloud [SIGCOMM '12].

[Same related-work list as above.] Missing: what's most important to end-to-end performance?

[Same related-work list as above.] Widely-accepted mantras: network and disk I/O are bottlenecks; stragglers are a major issue with unknown causes.

This work: (1) How can we quantify performance bottlenecks? Blocked time analysis. (2) Do the mantras hold? Takeaways based on three workloads run with Spark.

Takeaways based on three Spark workloads: network optimizations can reduce job completion time by at most 2%; CPU (not I/O) is often the bottleneck; <19% reduction in completion time from optimizing disk; many straggler causes can be identified and fixed.

These takeaways will not hold for every analytics workload, nor for all time.

This work: accepted mantras are often not true; a methodology to avoid performance misunderstandings in the future.

Outline Methodology: How can we measure Spark bottlenecks? Workloads: What workloads did we use? Results: How well do the mantras hold? Why?: Why do our results differ from past work? Demo: How can you understand your own workload?

What’s the job’s bottleneck?

What exactly happens in a Spark task? Example: a task that reads shuffle data and generates in-memory output. [Timeline diagram with compute, network, and disk rows:] (1) request a few shuffle blocks; (2) start processing local data; (3) process data fetched remotely; (4) continue fetching remote data. (Each small block in the diagram is the time to handle one shuffle block.)

What's the bottleneck for this task? [Same timeline:] the task is bottlenecked on the network and disk at the start, then on the network alone, then on the CPU.

What's the bottleneck for the job? [Per-task timelines colored by compute, network, and disk:] a given task may be bottlenecked on different resources at different times, and at a given time, different tasks may be bottlenecked on different resources.

How does the network affect the job's completion time? [Per-task timelines highlighting the time when each task is blocked on the network.] Blocked time analysis: how much faster would the job complete if tasks never blocked on the network?

Blocked time analysis: (1) measure the time when tasks are blocked on the network; (2) simulate how the job completion time would change.

(1) Measure the time when tasks are blocked on the network. [Task timeline with compute, network, and disk rows, marking the time blocked on the network and the time blocked on the disk.] Best case: the task's runtime if the network were infinitely fast is its original runtime minus the time it spent blocked on the network.
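A minimal sketch of step (1), assuming hypothetical per-task records with runtime, network_blocked, and disk_blocked fields (field names are illustrative, not Spark's actual metric names):

```python
# Sketch of step (1): per-task best-case runtime if a given resource
# were infinitely fast. Field names are hypothetical stand-ins for
# Spark's per-task metrics.

def best_case_runtime(task, resource):
    """Original runtime minus the time the task spent blocked on `resource`."""
    blocked = task[f"{resource}_blocked"]
    return max(task["runtime"] - blocked, 0.0)

tasks = [
    {"runtime": 10.0, "network_blocked": 2.5, "disk_blocked": 1.0},
    {"runtime": 7.0,  "network_blocked": 0.5, "disk_blocked": 3.0},
    {"runtime": 4.0,  "network_blocked": 0.0, "disk_blocked": 0.5},
]
no_network = [best_case_runtime(t, "network") for t in tasks]  # [7.5, 6.5, 4.0]
```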

(2) Simulate how the job completion time would change. [Example: Tasks 0-2 scheduled on 2 slots; t_o is the original job completion time.] Simply subtracting each task's network-blocked time in place computes the new time incorrectly: it doesn't account for task scheduling.

(2) Simulate how the job completion time would change. [The shortened tasks are re-packed onto the same 2 slots; t_n is the job completion time with an infinitely fast network.]
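A sketch of step (2), under the simplifying assumption that tasks are replayed in their original launch order onto whichever slot frees up first (a simplified stand-in for the full simulation):

```python
import heapq

def simulate_completion_time(task_runtimes, num_slots=2):
    """Greedy replay: each task starts on whichever slot frees up first.
    Run once with original runtimes (t_o) and once with network-blocked
    time subtracted (t_n)."""
    slots = [0.0] * num_slots          # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for runtime in task_runtimes:      # tasks in original launch order
        start = heapq.heappop(slots)
        end = start + runtime
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish

original = [10.0, 7.0, 4.0]
no_network = [7.5, 6.5, 4.0]           # runtimes minus network-blocked time
t_o = simulate_completion_time(original)
t_n = simulate_completion_time(no_network)
print(f"Network could reduce completion time by at most {1 - t_n / t_o:.0%}")
```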

Blocked time analysis: how quickly could a job have completed if a resource were infinitely fast?

Outline Methodology: How can we measure Spark bottlenecks? Workloads: What workloads did we use? Results: How well do the mantras hold? Why?: Why do our results differ from past work? Demo: How can you understand your own workload?

Why not use large-scale traces? They don't have enough instrumentation for blocked-time analysis.

SQL workloads run on Spark: TPC-DS (20 machines, 850GB; 60 machines, 2.5TB; 200 machines, 2.5TB); Big Data Benchmark (5 machines, 60GB); Databricks (production; 9 machines, tens of GB). Two versions of each: in-memory and on-disk. Caveats: only 3 workloads, small cluster sizes.

Outline Methodology: How can we measure Spark bottlenecks? Workloads: What workloads did we use? Results: How well do the mantras hold? Why?: Why do our results differ from past work? Demo: How can you understand your own workload?

How much faster could jobs get from optimizing network performance?

[CDF across jobs, in percentiles:] median improvement: 2%; 95th-percentile improvement: 10%.

How much faster could jobs get from optimizing network performance? Median improvement: at most 2%. [Percentiles plot.]

How much faster could jobs get from optimizing disk performance? Median improvement: at most 19%.

How important is CPU? CPU is much more highly utilized than disk or network!

What about stragglers? 5-10% improvement from eliminating stragglers (based on simulation). We can explain >60% of stragglers in >75% of jobs. Fixing the underlying cause can speed up other tasks too: 2x speedup from fixing one straggler cause.
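One hedged way to approximate the straggler simulation, assuming (for illustration only, not the paper's exact definition) that a straggler is a task taking more than 1.5x its stage's median runtime and that fixing its cause brings it down to the median:

```python
from statistics import median

def runtimes_without_stragglers(task_runtimes, threshold=1.5):
    """Replace each straggler's runtime with its stage's median runtime.
    The '>1.5x median' straggler definition is an illustrative assumption."""
    med = median(task_runtimes)
    return [med if r > threshold * med else r for r in task_runtimes]

stage = [10.0, 7.0, 4.0, 30.0]                 # the 30s task is a straggler
fixed = runtimes_without_stragglers(stage)     # [10.0, 7.0, 4.0, 8.5]
# Feed `fixed` into simulate_completion_time() from the step-(2) sketch
# to estimate the job completion time with stragglers eliminated.
```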

Takeaways based on three Spark workloads: network optimizations can reduce job completion time by at most 2%; CPU (not I/O) is often the bottleneck; <19% reduction in completion time from optimizing disk; many straggler causes can be identified and fixed.

Outline Methodology: How can we measure Spark bottlenecks? Workloads: What workloads did we use? Results: How well do the mantras hold? Why?: Why do our results differ from past work? Demo: How can you understand your own workload?

Why are our results so different from what's stated in prior work? Are the workloads we measured unusually network-light? How can we compare our workloads to the large-scale traces used to motivate prior work?

How much data is transferred per CPU second? Microsoft '09-'10: 1.9–6.35 Mb / task second. Google '04-'07: 1.34–1.61 Mb / machine second.
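For comparison, a rough sketch of this metric; the byte and CPU-second totals below are made-up aggregates, not numbers from the traces:

```python
# Rough sketch of the comparison metric: megabits transferred per CPU second.
# `network_bytes` and `cpu_seconds` are hypothetical aggregate trace fields.

def mb_per_cpu_second(network_bytes, cpu_seconds):
    megabits = network_bytes * 8 / 1e6
    return megabits / cpu_seconds

# e.g., 40 GB shuffled + replicated over 50,000 task-seconds of compute
print(f"{mb_per_cpu_second(40e9, 50_000):.2f} Mb / task second")
```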

Why are our results so different from what's stated in prior work? It's not that our workloads are unusually network-light (see the comparison above). Instead: (1) incomplete metrics; (2) conflation of CPU and network time.

When is the network used? [Diagram: map tasks read input data locally; reduce tasks write output data.] (1) To shuffle intermediate data; (2) to replicate output data. Some work focuses only on the shuffle.

How does the data transferred over the network compare to the input data? Shuffled data is only ~1/3 of the input data, and there is even less output data. So it's not realistic to look only at the shuffle, or to use workloads where all input data is shuffled.

Prior work conflates CPU and network time. To send data over the network: (1) serialize objects into bytes; (2) send the bytes. (1) and (2) are often conflated, and reducing the application data sent reduces both!
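A toy illustration of keeping the two separate (Python's pickle and a raw socket are stand-ins here; Spark itself uses Java/Kryo serialization):

```python
import pickle
import socket
import time

def timed_send(sock: socket.socket, records):
    """Separate the CPU cost of serialization from the time spent sending."""
    t0 = time.perf_counter()
    payload = pickle.dumps(records)        # (1) serialize objects: CPU work
    t1 = time.perf_counter()
    sock.sendall(payload)                  # (2) send bytes: network work
    t2 = time.perf_counter()
    return {"serialize_s": t1 - t0, "send_s": t2 - t1}
```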

When does the network matter? The network is important when: (1) computation is optimized, (2) serialization time is low, and (3) a large amount of data is sent over the network.

Why are our results so different from what's stated in prior work? It's not that our workloads are network-light. (1) Incomplete metrics, e.g., looking only at shuffle time. (2) Conflation of CPU and network time: sending data over the network has an associated CPU cost.

Limitations: only three workloads (but industry-standard, and results sanity-checked with larger production traces); small cluster sizes (but results don't change when we move between cluster sizes).

Limitations aren't fatal: only three workloads, but they are industry-standard and results were sanity-checked with larger production traces; small cluster sizes, but takeaways don't change when we move between cluster sizes.

Outline Methodology: How can we measure Spark bottlenecks? Workloads: What workloads did we use? Results: How well do the mantras hold? Why?: Why do our results differ from past work? Demo: How can you understand your own workload?

Demo

Often you can tune configuration parameters to shift the bottleneck (e.g., change the compression codec from snappy to lzf).
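For example, a sketch of switching the codec in PySpark via spark.io.compression.codec (a real Spark setting; whether it helps depends on the workload):

```python
from pyspark import SparkConf, SparkContext

# Switch the codec Spark uses for shuffle and other internal I/O
# from snappy to lzf, trading compression ratio for CPU time.
conf = (SparkConf()
        .setAppName("codec-experiment")
        .set("spark.io.compression.codec", "lzf"))
sc = SparkContext(conf=conf)
```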

What's missing from Spark metrics? Time blocked on reading input data and writing output data (HADOOP-11873); time spent spilling intermediate data to disk (SPARK-3577).

Recap: network optimizations can reduce job completion time by at most 2%; CPU (not I/O) is often the bottleneck; <19% reduction in completion time from optimizing disk; many straggler causes can be identified and fixed. These results will change with time! Takeaway: performance understandability should be a first-class concern! (Almost) all instrumentation is now part of Spark. All traces and tools are publicly available: tinyurl.com/summit-traces. I want your workload!