Introduction to Spark
Shannon Quinn (with thanks to Paco Nathan and Databricks)

Quick Demo

API Hooks
Scala / Java – all Java libraries (*.jar) – scala-lang.org
Python – Anaconda: continuum.io/cshop/anaconda/

Introduction

Spark Structure
– Start Spark on a cluster
– Submit code to be run on it

Another Perspective

Step by step

Example: WordCount
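A minimal PySpark sketch of the WordCount pattern (the file name is a placeholder; sc is the context provided by the interactive shell):

    # classic WordCount on the RDD API
    lines = sc.textFile("input.txt")                      # one element per line
    counts = (lines.flatMap(lambda line: line.split())    # split lines into words
                   .map(lambda word: (word, 1))           # pair each word with a count of 1
                   .reduceByKey(lambda a, b: a + b))      # sum the counts per word
    print(counts.take(5))                                 # action: triggers the computation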

Limitations of MapReduce
– Performance bottlenecks: not all jobs can be cast as batch processes (graphs?)
– Programming in Hadoop is hard: boilerplate, boilerplate everywhere

Initial Workaround: Specialization

Along Came Spark
Spark's goal was to generalize MapReduce to support new applications within the same engine.
Two additions:
– Fast data sharing
– General DAGs (directed acyclic graphs)
Best of both worlds: easy to program and a more efficient engine in general.

Codebase Size

More on Spark
More general:
– Supports the map/reduce paradigm
– Supports vertex-based paradigms
– General compute engine (DAG)
More API hooks: Scala, Java, and Python
More interfaces: batch (Hadoop), real-time (Storm), and interactive (???)

Interactive Shells
Spark creates a SparkContext object holding cluster information.
In either shell, the context is available as sc.
External programs use a static constructor to instantiate the context.
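As a sketch, a standalone Python program would build the context itself (the app name and master URL below are placeholders):

    from pyspark import SparkConf, SparkContext

    # the interactive shells do this for us and expose the result as `sc`
    conf = SparkConf().setAppName("MyApp").setMaster("local[2]")
    sc = SparkContext(conf=conf)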

Interactive Shells
spark-shell --master <master-URL>
The master URL can be, for example, local (run with one worker thread), local[N] (run with N worker threads), or spark://host:port (connect to a standalone cluster manager).

Interactive Shells
– The SparkContext connects to the cluster manager, which allocates resources across applications
– Spark acquires executors on cluster nodes: worker processes that run computations and store data
– It then sends app code to the executors
– Finally, it sends tasks for the executors to run

Resilient Distributed Datasets (RDDs)
RDDs are the primary data abstraction in Spark:
– Fault-tolerant
– Can be operated on in parallel
Two ways to create them:
1. Parallelized collections
2. Hadoop datasets
Two types of RDD operations:
1. Transformations (lazy)
2. Actions (immediate)
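A minimal sketch of the two creation routes (assuming sc from the shell; the file name is a placeholder):

    nums = sc.parallelize([1, 2, 3, 4, 5])   # 1. parallelized collection, from a local list
    text = sc.textFile("data.txt")           # 2. Hadoop dataset, one element per line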

Resilient Distributed Datasets (RDDs)
Can create RDDs from any file stored in HDFS or any other storage system supported by Hadoop:
– Local filesystem
– Amazon S3
– HBase
Supports text files, SequenceFiles, and any other Hadoop InputFormat
Accepts any directory or glob, e.g. /data/201414*
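For illustration, all of these sources go through the same textFile call, and sequenceFile handles SequenceFiles (every path here is hypothetical):

    local = sc.textFile("file:///tmp/data.txt")        # local filesystem
    hdfs  = sc.textFile("hdfs://namenode:9000/data")   # HDFS
    s3    = sc.textFile("s3n://bucket/logs")           # Amazon S3
    month = sc.textFile("/data/201414*")               # directory glob, as above
    seq   = sc.sequenceFile("/tmp/pairs")              # Hadoop SequenceFile of key/value pairs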

Resilient Distributed Datasets (RDDs)
Transformations:
– Create a new RDD from an existing one
– Lazily evaluated: results are not immediately computed
  – The pipeline of subsequent transformations can be optimized
  – Lost data partitions can be recovered
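A sketch of lazy evaluation (nothing below touches the data yet):

    nums = sc.parallelize(range(1000))
    evens = nums.filter(lambda x: x % 2 == 0)   # transformation: recorded, not executed
    squares = evens.map(lambda x: x * x)        # transformation: still nothing computed
    # Spark now holds the whole pipeline (the lineage), so it can optimize it
    # and recompute any lost partition by replaying these steps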

Closures in Java

Resilient Distributed Datasets (RDDs)
Actions:
– Return a value to the driver program after running a computation on the RDD (they do not create a new RDD)
– Eagerly evaluated: results are immediately computed, applying all of the previous transformations in the pipeline (whose results may be cached)
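A sketch of actions forcing computation (the values in comments follow from the data):

    nums = sc.parallelize([1, 2, 3, 4, 5])
    total = nums.reduce(lambda a, b: a + b)   # action: runs immediately, returns 15
    first = nums.take(2)                      # action: returns [1, 2]
    size  = nums.count()                      # action: returns 5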

Resilient Distributed Datasets (RDDs)
Spark can persist (cache) an RDD in memory across operations
– Each slice (partition) is persisted in memory and reused in subsequent actions involving that RDD
– The cache is fault-tolerant: if a partition is lost, it is recomputed using the transformations that originally created it
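A sketch of caching (the file name is a placeholder):

    from pyspark import StorageLevel

    words = sc.textFile("data.txt").flatMap(lambda line: line.split())
    words.cache()     # keep partitions in memory once computed
    words.count()     # first action: computes the RDD and fills the cache
    words.count()     # second action: served from memory, no recomputation
    # persist() takes an explicit storage level instead, e.g.
    # rdd.persist(StorageLevel.MEMORY_AND_DISK)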

Broadcast Variables
– Spark's version of Hadoop's DistributedCache
– A read-only variable cached on each node
– Spark internally distributes broadcast variables so as to minimize communication cost
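A sketch with a hypothetical lookup table shipped once to every node:

    lookup = sc.broadcast({"a": 1, "b": 2})            # read-only, cached per node

    data = sc.parallelize(["a", "b", "a"])
    mapped = data.map(lambda key: lookup.value[key])   # tasks read via .value
    print(mapped.collect())                            # [1, 2, 1]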

Accumulators
– Spark's version of Hadoop's Counter
– Variables that can only be added to, through an associative operation
– Native support for numeric accumulator types and standard mutable collections; users can extend support to new types
– Only the driver program can read an accumulator's value
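A minimal sketch, following the standard accumulator pattern:

    acc = sc.accumulator(0)                                      # numeric accumulator, starts at 0
    sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))   # tasks may only add
    print(acc.value)                                             # 10; only the driver can read it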

Key/Value Pairs
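A sketch of common pair-RDD (key/value) operations; the data is illustrative:

    pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
    totals  = pets.reduceByKey(lambda a, b: a + b)   # [('cat', 3), ('dog', 1)]
    ordered = pets.sortByKey()                       # ordered by key
    grouped = pets.groupByKey()                      # an iterable of values per key
    print(totals.collect())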

Resources
Original slide deck: shop.pdf
Code samples:
– c9610eac1d
– 08c08a1132