
Spark Performance
Patrick Wendell, Databricks

About me
- Work on performance benchmarking and testing in Spark
- Co-author of spark-perf
- Wrote instrumentation/UI components in Spark

This talk
- Geared towards existing users
- Current as of Spark 0.8.1

Outline
- Part 1: Spark deep dive
- Part 2: Overview of UI and instrumentation
- Part 3: Common performance mistakes

Why gain a deeper understanding?

Input: (patrick, $24), (matei, $30), (patrick, $1), (aaron, $23), (aaron, $2), (reynold, $10), (aaron, $10)...

spendPerUser = rdd
  .groupByKey()
  .map(lambda pair: sum(pair[1]))
  .collect()
# Copies all data over the network

spendPerUser = rdd
  .reduceByKey(lambda x, y: x + y)
  .collect()
# Reduces locally before shuffling

Let’s look under the hood

How Spark works
- RDD: a parallel collection w/ partitions
- User application creates RDDs, transforms them, and runs actions
- These result in a DAG of operators
- DAG is compiled into stages
- Each stage is executed as a series of tasks

sc.textFile("/some-hdfs-data") Example sc.textFile("/some-hdfs-data") RDD[String] .map(line => line.split("\t")) RDD[List[String]] .map(parts => (parts[0], int(parts[1]))) RDD[(String, Int)] .reduceByKey(_ + _, 3) RDD[(String, Int)] Array[(String, Int)] .collect() textFile map map reduceByKey collect

Execution Graph
- Stage 1: textFile -> map -> map
- Stage 2: reduceByKey -> collect

Execution Graph
- Stage 1: read HDFS split, apply both maps, partial reduce, write shuffle data
- Stage 2: read shuffle data, final reduce, send result to driver

Stage execution
- Create a task for each partition in the new RDD
- Serialize the task
- Schedule and ship the task to slaves

Task execution
The fundamental unit of execution in Spark:
A. Fetch input from an InputFormat or a shuffle
B. Execute the task
C. Materialize task output as shuffle data or a driver result
These steps are pipelined: fetch input -> execute task -> write output

Spark Executor
[Diagram: each executor core (Core 1, Core 2, Core 3) runs its tasks as a pipeline of fetch input -> execute task -> write output]

Summary of Components
- Task: fundamental unit of work
- Stage: set of tasks that run in parallel
- DAG: logical graph of RDD operations
- RDD: parallel dataset with partitions

Demo of perf UI
- First, show executors
- Back to stage page, show stage summaries
- Zoom in on a stage
- Show histograms
- Show sorting based on time
- Show GC/disk/runtime/executor
- Back to parent page
- Show environment data and storage

Where can you have problems?
1. Scheduling and launching tasks
2. Execution of tasks
3. Writing data between stages
4. Collecting results

1. Scheduling and launching tasks

Serialized task is large due to a closure

hash_map = some_massive_hash_map()
rdd.map(lambda x: hash_map(x))
   .countByValue()

Detecting: Spark will warn you! (starting in 0.9…)
Fixing:
- Use broadcast variables for the large object (see the sketch below)
- Make your large object into an RDD
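A minimal sketch of the broadcast-variable fix, assuming some_massive_hash_map() returns a plain Python dict and sc is the SparkContext:

    big_map = some_massive_hash_map()             # assumed: a large plain dict
    bcast = sc.broadcast(big_map)                 # shipped to each executor once
    result = (rdd
              .map(lambda x: bcast.value.get(x))  # closure captures only the small handle
              .countByValue())

The lambda's closure now contains only the broadcast handle, so the serialized task stays small.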

Large number of "empty" tasks due to a selective filter

rdd = sc.textFile("s3n://bucket/2013-data")
        .map(lambda x: x.split("\t"))
        .filter(lambda parts: parts[0] == "2013-10-17")
        .filter(lambda parts: parts[1] == "19:00")
rdd.map(lambda parts: (parts[2], parts[3])).reduceBy…

Detecting:
- Many short-lived (< 20 ms) tasks

Fixing:
- Use the coalesce or repartition operator to shrink the number of partitions after filtering:
  rdd.coalesce(30).map(lambda parts: (parts[2]…
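A self-contained sketch of the fix, under the slide's apparent layout of tab-separated records (date, hour, key, value); the bucket path and the numeric value column are assumptions:

    parsed = (sc.textFile("s3n://bucket/2013-data")
                .map(lambda line: line.split("\t"))
                .filter(lambda parts: parts[0] == "2013-10-17")
                .filter(lambda parts: parts[1] == "19:00"))
    result = (parsed
              .coalesce(30)                                   # collapse the mostly-empty partitions
              .map(lambda parts: (parts[2], float(parts[3])))
              .reduceByKey(lambda x, y: x + y)
              .collect())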

2. Execution of Tasks

Tasks with high per-record overhead

rdd.map(lambda x:
    conn = new_mongo_db_cursor()
    conn.write(str(x))
    conn.close())

Detecting: task run time is high

Fixing: use mapPartitions (or mapWith in Scala) to amortize the overhead:

rdd.mapPartitions(lambda records:
    conn = new_mongo_db_cursor()
    [conn.write(str(x)) for x in records])
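The snippets above are slide pseudocode (a Python lambda cannot contain statements). A runnable sketch of the mapPartitions fix, keeping new_mongo_db_cursor as the slide's hypothetical connection helper:

    def write_partition(records):
        conn = new_mongo_db_cursor()   # one connection per partition, not per record
        for x in records:
            conn.write(str(x))
        conn.close()
        return []                      # mapPartitions must return an iterable

    rdd.mapPartitions(write_partition).count()   # an action forces the writes to run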

Skew between tasks

Detecting:
- Stage response time dominated by a few slow tasks

Fixing:
- Data skew: poor choice of partition key
  - Consider a different way of parallelizing the problem
  - Can also use intermediate partial aggregations (see the sketch below)
- Worker skew: some executors are slow/flaky nodes
  - Set spark.speculation to true
  - Remove flaky/slow nodes over time
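One common form of intermediate partial aggregation is to "salt" a skewed key so hot keys are spread across many tasks. This sketch is an illustration, not taken from the slides; the salt factor of 10 is arbitrary:

    import random

    SALT = 10
    salted = rdd.map(lambda kv: ((kv[0], random.randrange(SALT)), kv[1]))
    partial = salted.reduceByKey(lambda x, y: x + y)   # hot keys now split up to 10 ways
    final = (partial
             .map(lambda kv: (kv[0][0], kv[1]))        # drop the salt
             .reduceByKey(lambda x, y: x + y))         # combine the partial sums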

3. Writing data between stages

Not having enough buffer cache

Spark writes out shuffle data to the OS buffer cache.

Detecting:
- Tasks spend a lot of time writing shuffle data

Fixing:
- If running large shuffles on large heaps, allow several GB for the buffer cache
- Rule of thumb: leave 20% of memory free for the OS and caches

Not setting spark.local.dir

spark.local.dir is where shuffle files are written; ideally a dedicated disk or set of disks:

spark.local.dir=/mnt1/spark,/mnt2/spark,/mnt3/spark

Mount drives with noatime, nodiratime.

Not setting the number of reducers

Default behavior: inherits the number of reducers from the parent RDD
- Too many reducers: task launching overhead becomes an issue (you will see many small tasks)
- Too few reducers: limits parallelism in the cluster
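Shuffle operations accept an explicit reducer count, as the earlier example's reduceByKey(_ + _, 3) showed. A small PySpark illustration; the name pairs and the value 48 are assumptions:

    # `pairs` is assumed to be an RDD of (key, value) tuples
    counts = pairs.reduceByKey(lambda x, y: x + y, 48)   # second argument: 48 reduce tasks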

4. Collecting results

Collecting massive result sets

sc.textFile("/big/hdfs/file/").collect()

Fixing:
- If processing, push the computation into Spark
- If storing, write directly to parallel storage
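Both fixes in one hedged sketch; the HDFS paths are the slide's placeholders and the filter condition is illustrative:

    rdd = sc.textFile("/big/hdfs/file/")
    n = rdd.filter(lambda line: "ERROR" in line).count()   # push the processing into Spark
    rdd.saveAsTextFile("/big/hdfs/output/")                # or write results out in parallel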

Advanced Profiling

JVM utilities:
- jstack <pid>            JVM stack trace
- jmap -histo:live <pid>  heap summary

System utilities:
- dstat                   I/O and CPU stats
- iostat                  disk stats
- lsof -p <pid>           tracks open files

Conclusion
- Spark 0.8 provides good tools for monitoring performance
- Understanding Spark's concepts provides a major advantage in performance debugging

Questions?