Introduction to Spark
Outline
- A brief history of Spark
- Programming with RDDs
  - Transformations
  - Actions
A brief history
Limitations of MapReduce
MapReduce use cases revealed two major limitations:
- Difficulty of programming directly in MapReduce
  - Batch processing does not fit many use cases
- Performance bottlenecks
  - Data are frequently loaded from and saved to hard drives
Spark is designed to overcome the limitations of MapReduce:
- Handles batch, interactive, and real-time processing within a single framework
- Native integration with Java, Python, and Scala
- Programming at a higher level of abstraction
- More general: map/reduce is just one set of supported constructs
Spark components
Data are partitioned across, and operations are executed on, multiple worker nodes
Resilient Distributed Dataset (RDD)
- An RDD is an immutable distributed collection of objects
- Each RDD is split into multiple partitions
- In Spark, all work is expressed as one of three operations:
  - Creation: creating new RDDs (non-RDD data => RDD)
  - Transformation: transforming existing RDDs (RDD => RDD)
  - Action: calling operations on RDDs to compute a result (RDD => non-RDD value)
- Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them
Creation of an RDD
Users create RDDs in two ways:
- Loading an external dataset
- Parallelizing a collection in your driver program
Transformations on RDDs
- Transformations are operations on RDDs that return a new RDD, such as map() and filter()
- Creation of an RDD can be considered a special type of transformation
- Transformed RDDs are computed lazily: only when you use them in an action
- Creation of RDDs is also carried out lazily
Actions on RDDs
- Operations that do something with the dataset, e.g., return a final value to the driver program or write data to an external storage system, such as count() and first()
- Actions force the evaluation of the transformations required for the RDD they were called on
Lazy evaluation
- An operation is not performed immediately when we call a transformation on an RDD
- Spark internally records metadata to indicate that the operation has been requested
- Spark does not begin execution until it sees an action
- Spark will re-compute the RDD and all of its dependencies each time we call an action on that RDD
  - An input RDD shared by several result RDDs may therefore be computed multiple times if it is not persisted
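The lazy-then-force pattern can be sketched with plain Scala collections, using a view as a stand-in for a transformed RDD (this illustrates the idea only; it is not Spark code):

```scala
var evaluations = 0 // counts how many times the mapping function actually runs

val input = List(1, 2, 3, 4)
// Like a transformation: a view only records the mapping, nothing runs yet
val squared = input.view.map { x => evaluations += 1; x * x }
val before = evaluations // still 0: no work has been done before the "action"

// Like an action: forcing the view triggers the recorded computation
val result = squared.toList
println(result.mkString(","))
```

Forcing the view a second time would rerun the mapping, just as calling a second action recomputes an unpersisted RDD.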
Persistence (caching) Ask Spark to persist the data to avoid computing an RDD multiple times
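Why persisting helps can be sketched in plain Scala, with a lazy val standing in for a cached RDD (illustrative only; in Spark you would call rdd.persist() or rdd.cache()):

```scala
var computations = 0

// A stand-in for an RDD's lineage: recomputed every time it is evaluated
def expensive(): List[Int] = { computations += 1; List(1, 2, 3, 4).map(x => x * x) }

// Without persisting: each "action" recomputes the whole lineage
val sum1 = expensive().sum
val cnt1 = expensive().length
val withoutCache = computations

// With persisting: compute once, then reuse the materialized result
computations = 0
lazy val cached = expensive()
val sum2 = cached.sum
val cnt2 = cached.length
val withCache = computations

println(s"without cache: $withoutCache computations; with cache: $withCache")
```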
Element-wise transformations
- map(): takes in a function and applies it to each element in the RDD; the result of the function is the new value of the corresponding element in the resulting RDD
  - map()'s return type does not have to be the same as its input type
- filter(): takes in a function and returns an RDD that only has elements that pass the filter function
The sample code for map()
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
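filter() works element-wise in the same way; the sketch below uses a plain Scala list, whose filter has the same semantics as the RDD version:

```scala
val input = List(1, 2, 3, 4)
// Keep only the even elements, analogous to rdd.filter(x => x % 2 == 0)
val evens = input.filter(x => x % 2 == 0)
println(evens.mkString(","))
```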
Element-wise transformations
- flatMap(): the function we provide to flatMap() is called individually for each element in our input RDD
- Instead of returning a single element, the function returns an iterator with the return values
- Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators
flatMap() vs map() flatMap(): “flattening” the iterators returned to it
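The difference can be seen with plain Scala collections, whose map and flatMap mirror the RDD versions (a common word-splitting illustration, not Spark code):

```scala
val lines = List("hello world", "hi")

// map: exactly one output element per input element (a list of word lists)
val mapped = lines.map(line => line.split(" ").toList)

// flatMap: the per-element collections are flattened into one collection
val flat = lines.flatMap(line => line.split(" ").toList)
println(flat.mkString(","))
```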
Pseudo set operations
- The union operation keeps duplicates
- The intersection operation removes duplicates
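A plain-Scala sketch of these semantics (Spark's union() concatenates without deduplicating, while intersection() returns each common value once):

```scala
val a = List(1, 2, 2, 3)
val b = List(2, 3, 4)

// Like union(): simple concatenation, so duplicates survive
val unioned = a ++ b

// Like intersection(): each common value appears once
val intersected = a.toSet.intersect(b.toSet)

println(unioned.mkString(",") + " | " + intersected.toList.sorted.mkString(","))
```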
cartesian() transform
cartesian(other) returns all possible pairs (a, b), where a is in the source RDD and b is in the other RDD
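The pairing can be sketched with a for-comprehension over plain lists, which produces the same cross product as the RDD transformation:

```scala
val a = List(1, 2)
val b = List("x", "y")

// Like a.cartesian(b): every (x, y) pair across the two datasets
val pairs = for { x <- a; y <- b } yield (x, y)
println(pairs.mkString(","))
```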
Actions
- collect(): returns all the elements of the dataset as an array to the driver program
- countByValue(): returns a map of each unique value to its count
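countByValue()'s result can be reproduced with plain Scala collections, grouping by value and counting each group:

```scala
val input = List("a", "b", "a", "c", "a")

// Like countByValue(): each distinct value mapped to its number of occurrences
val counts = input.groupBy(identity).map { case (k, v) => (k, v.size) }
println(counts)
```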
More actions
- take(num): returns the first num elements of the RDD
  - On the difference between take(1) and first(): http://stackoverflow.com/questions/37495039/difference-between-spark-rdds-take1-and-first
- top(): returns the top elements, using the default ordering on the data
- reduce() and fold(): both reduce the input RDD to a single element of the same type
  - fold() additionally needs an initial value
- Python fold: https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780
- Scala fold: https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943
More details on reduce() and fold()
- RDD.reduce((x, y) => x + y), RDD.fold(initialValue)((x, y) => x + y)
  - x: the accumulator; y: an element in the partition
- In reduce(), the accumulator first takes the first element in a partition, then updates its value by adding each subsequent element
  - Example: for a partition (1, 2, 3, 4, 5), RDD.reduce((x, y) => x + y):
    - Iteration 1: x=1, y=2 => x=(1+2)=3
    - Iteration 2: x=3, y=3 => x=(3+3)=6
    - Iteration 3: x=6, y=4 => x=(6+4)=10
    - Iteration 4: x=10, y=5 => x=(10+5)=15
- In fold(), the accumulator first takes the initial value, then updates its value by adding each element
  - Example: for a partition (1, 2, 3, 4, 5), RDD.fold(0)((x, y) => x + y):
    - Iteration 1: x=0, y=1 => x=(0+1)=1
- The function passed to reduce() is supposed to be a commutative and associative binary operation
References:
- http://stackoverflow.com/questions/29150202/pyspark-fold-method-output
- https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
- https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
- http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python
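The per-partition behavior of fold() can be simulated with plain Scala collections: fold each partition from the initial value, then fold the list of partition accumulators, again starting from the initial value (not Spark code, just the same arithmetic):

```scala
// Simulate an RDD split into two partitions
val partitions = List(List(1, 2, 3), List(4, 5))
val zero = 0
val op = (x: Int, y: Int) => x + y

// Each partition is folded sequentially, starting from the initial value
val accumulators = partitions.map(p => p.foldLeft(zero)(op))

// The driver merges the accumulators, using the initial value once more
val result = accumulators.foldLeft(zero)(op)
println(result)
```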
More details on reduce() and fold()
The overall process:
- Partitions are processed in parallel using multiple executors (or executor threads)
- Each partition is processed sequentially using a single thread
- The final merge is performed sequentially using a single thread on the driver
For multiple partitions of an RDD:
- First, the function is applied on each partition; each partition produces an accumulator
- All accumulators are collected by the driver in a nondeterministic order
- Then the function is applied to the list of accumulators
- For fold(), the initial value is used again when aggregating the accumulators
The partitioning behavior, plus ordering nondeterminism, can make reduce()/fold() nondeterministic for non-commutative operations. What is the output of the following?
sc.parallelize(Seq(2.0, 3.0), 2).fold(1.0)((a, b) => math.pow(b, a))
Exercises: predict the result of each of the following
val inputrdd = sc.parallelize(List(1,25,8,4,2))
inputrdd.partitions.size
val result = inputrdd.fold(0)((x,y) => x+1)
val inputrdd = sc.parallelize(List(1,25,8,4,2), 2)
val inputrdd = sc.parallelize(List(1,1,1,1,1))
val result = inputrdd.fold(1)((x,y) => x+y)
val inputrdd = sc.parallelize(List(1,1,1,1,1), 2)
val inputrdd = sc.parallelize(List(1,25,8,4,2), 50)
val result = inputrdd.reduce((x,y) => x+1)
val inputrdd = sc.parallelize(List(2,25,8,4,2))
val inputrdd = sc.parallelize(List(1,25,8,4,2), 10)
val result = inputrdd.reduce((x,y) => x+y)
val inputrdd = sc.parallelize(List(10,25,8,4,2), 10)
val inputrdd = sc.parallelize(List(1,25,8,4,0,2), 2)
val inputrdd = sc.parallelize(List(100,25,8,4,0,2), 2)
Reference: http://stackoverflow.com/questions/33788600/parallelize-method-in-sparkcontext
Answer to the fold question above:
1. With one partition (2.0, 3.0): pow(2.0, 1.0) = 2.0, then pow(3.0, 2.0) = 9.0; the driver then computes pow(9.0, 1.0) = 9.0
2. With two partitions (2.0) and (3.0): the partitions yield pow(2.0, 1.0) = 2.0 and pow(3.0, 1.0) = 3.0; the accumulators may arrive at the driver in either order
   - Order (2.0, 3.0): pow(2.0, 1.0) = 2.0, then pow(3.0, 2.0) = 9.0
   - Order (3.0, 2.0): pow(3.0, 1.0) = 3.0, then pow(2.0, 3.0) = 8.0
Try it:
val inputrdd = sc.parallelize(Seq(2.0, 3.0), 2)
val result = inputrdd.fold(1.0)((a, b) => math.pow(b, a))
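The order dependence of the final merge can be checked with plain Scala: fold the two partition accumulators in both possible arrival orders (simulating the driver-side merge only, not Spark itself):

```scala
val zero = 1.0
val op = (a: Double, b: Double) => math.pow(b, a)

// Per-partition accumulators for partitions (2.0) and (3.0): pow(x, 1.0) = x
val accs = List(2.0, 3.0)

// Spark may merge the accumulators in either order, giving different results
val merged1 = accs.foldLeft(zero)(op)          // arrival order (2.0, 3.0)
val merged2 = accs.reverse.foldLeft(zero)(op)  // arrival order (3.0, 2.0)
println(s"$merged1 vs $merged2")
```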
aggregate()
- The output type of aggregate() can be different from the input RDD's element type
- Prototype: def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B
  - i.e., aggregate(zeroValue)(seqOp, combOp)
- It traverses the elements in different partitions
  - seqOp updates the accumulator within each partition
  - combOp (the combine operation) is then applied to the accumulators from different partitions
  - The zeroValue is used in both seqOp and combOp
Example: calculating the average of an input RDD (1, 2, 3, 3)
val sum = inputRDD.aggregate(0)((x, y) => x + y, (x, y) => x + y)
val count = inputRDD.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
val average = sum / count.toDouble
How about val count = inputRDD.fold(0)((x, y) => x + 1)?
- The output is the number of partitions, because the driver-side merge also adds 1 per accumulator rather than summing them. See http://stackoverflow.com/questions/29150202/pyspark-fold-method-output
Reference: http://stackoverflow.com/questions/26761087/explanation-of-the-aggregate-scala-function
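The two-phase seqOp/combOp flow can be simulated with plain Scala, computing the average of (1, 2, 3, 3) in one pass with a (total, count) accumulator (not Spark code; foldLeft plays the role of the per-partition traversal):

```scala
// Simulate the RDD (1, 2, 3, 3) split into two partitions
val partitions = List(List(1, 2), List(3, 3))

val zero = (0, 0) // (running total, running count)
val seqOp  = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)

// seqOp folds within each partition; combOp merges the partition accumulators
val accs = partitions.map(p => p.foldLeft(zero)(seqOp))
val (sum, count) = accs.foldLeft(zero)(combOp)
val average = sum / count.toDouble
println(average)
```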
aggregate()
Using a tuple x as the accumulator:
- x._1: the running total
- x._2: the running count
val inputrdd = sc.parallelize(List(1, 25, 8, 4, 2))
val result = inputrdd.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val ave = result._1 / result._2.toDouble