Introduction to Spark
Outline
- A brief history of Spark
- Programming with RDDs
  - Transformations
  - Actions
A brief history
Limitations of MapReduce
MapReduce use cases revealed two major limitations:
- Difficulty of programming directly in MapReduce
  - Batch processing does not fit many use cases
- Performance bottlenecks
  - Data are frequently loaded from and saved to hard drives
Spark is designed to overcome the limitations of MapReduce:
- Handles batch, interactive, and real-time processing within a single framework
- Native integration with Java, Python, and Scala
- Programming at a higher level of abstraction
- More general: map/reduce is just one set of supported constructs
Spark components
Data are partitioned across, and operations are executed on, multiple worker nodes
Resilient Distributed Dataset (RDD)
- An RDD is an immutable distributed collection of objects
- Each RDD is split into multiple partitions
- In Spark, all work is expressed as one of three operations:
  - Creation: creating new RDDs (non-RDD data => RDD)
  - Transformation: transforming existing RDDs (RDD => RDD)
  - Action: calling operations on RDDs to compute a result (RDD => non-RDD value)
- Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them
Creation of an RDD
Users create RDDs in two ways:
- Loading an external dataset
- Parallelizing a collection in your driver program
Transformations on RDDs
- Transformations are operations on RDDs that return a new RDD, such as map() and filter()
- Creation of an RDD can be considered a special type of transformation
- Transformed RDDs are computed lazily: only when you use them in an action
- Creation of RDDs is also carried out lazily
Actions on RDDs
- Operations that do something with the dataset, e.g., return a final value to the driver program or write data to an external storage system, such as count() and first()
- Actions force the evaluation of the transformations required for the RDD they were called on
Lazy evaluation
- An operation is not performed immediately when we call a transformation on an RDD
- Spark internally records metadata to indicate that the operation has been requested
- Spark does not begin execution until it sees an action
- Spark will re-compute the RDD and all of its dependencies each time we call an action on that RDD
  - An input RDD shared by several result RDDs may therefore be computed multiple times if it is not persisted
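The lazy-then-force pattern can be sketched with plain Scala collections, using a view as a stand-in for a transformed RDD (this illustrates the idea only; it is not Spark code):

```scala
var evaluations = 0 // counts how many times the mapping function actually runs

val input = List(1, 2, 3, 4)
// Like a transformation: a view only records the mapping, nothing runs yet
val squared = input.view.map { x => evaluations += 1; x * x }
val before = evaluations // still 0: no work has been done before the "action"

// Like an action: forcing the view triggers the recorded computation
val result = squared.toList
println(result.mkString(","))
```

Forcing the view a second time would rerun the mapping, just as calling a second action recomputes an unpersisted RDD.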
Persistence (caching) Ask Spark to persist the data to avoid computing an RDD multiple times
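Why persisting helps can be sketched in plain Scala, with a lazy val standing in for a cached RDD (illustrative only; in Spark you would call rdd.persist() or rdd.cache()):

```scala
var computations = 0

// A stand-in for an RDD's lineage: recomputed every time it is evaluated
def expensive(): List[Int] = { computations += 1; List(1, 2, 3, 4).map(x => x * x) }

// Without persisting: each "action" recomputes the whole lineage
val sum1 = expensive().sum
val cnt1 = expensive().length
val withoutCache = computations

// With persisting: compute once, then reuse the materialized result
computations = 0
lazy val cached = expensive()
val sum2 = cached.sum
val cnt2 = cached.length
val withCache = computations

println(s"without cache: $withoutCache computations; with cache: $withCache")
```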
Element-wise transformations
- map(): takes in a function and applies it to each element in the RDD; the result of the function is the new value of the corresponding element in the resulting RDD
  - map()'s return type does not have to be the same as its input type
- filter(): takes in a function and returns an RDD that only has elements that pass the filter function
The sample code for map()
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
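filter() works element-wise in the same way; the sketch below uses a plain Scala list, whose filter has the same semantics as the RDD version:

```scala
val input = List(1, 2, 3, 4)
// Keep only the even elements, analogous to rdd.filter(x => x % 2 == 0)
val evens = input.filter(x => x % 2 == 0)
println(evens.mkString(","))
```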
Element-wise transformations
- flatMap(): the function we provide to flatMap() is called individually for each element in our input RDD
- Instead of returning a single element, the function returns an iterator with the return values
- Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators
flatMap() vs map() flatMap(): “flattening” the iterators returned to it
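The difference can be seen with plain Scala collections, whose map and flatMap mirror the RDD versions (a common word-splitting illustration, not Spark code):

```scala
val lines = List("hello world", "hi")

// map: exactly one output element per input element (a list of word lists)
val mapped = lines.map(line => line.split(" ").toList)

// flatMap: the per-element collections are flattened into one collection
val flat = lines.flatMap(line => line.split(" ").toList)
println(flat.mkString(","))
```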
Pseudo set operations
- The union operation keeps duplicates
- The intersection operation removes duplicates
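A plain-Scala sketch of these semantics (Spark's union() concatenates without deduplicating, while intersection() returns each common value once):

```scala
val a = List(1, 2, 2, 3)
val b = List(2, 3, 4)

// Like union(): simple concatenation, so duplicates survive
val unioned = a ++ b

// Like intersection(): each common value appears once
val intersected = a.toSet.intersect(b.toSet)

println(unioned.mkString(",") + " | " + intersected.toList.sorted.mkString(","))
```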
cartesian() transform
cartesian(other) returns all possible pairs (a, b), where a is in the source RDD and b is in the other RDD
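The pairing can be sketched with a for-comprehension over plain lists, which produces the same cross product as the RDD transformation:

```scala
val a = List(1, 2)
val b = List("x", "y")

// Like a.cartesian(b): every (x, y) pair across the two datasets
val pairs = for { x <- a; y <- b } yield (x, y)
println(pairs.mkString(","))
```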
Actions
- collect(): returns all the elements of the dataset as an array to the driver program
- countByValue(): returns a map of each unique value to its count
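countByValue()'s result can be reproduced with plain Scala collections, grouping by value and counting each group:

```scala
val input = List("a", "b", "a", "c", "a")

// Like countByValue(): each distinct value mapped to its number of occurrences
val counts = input.groupBy(identity).map { case (k, v) => (k, v.size) }
println(counts)
```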
More actions
- take(num): returns the first num elements of the RDD
  - On the difference between take(1) and first(): http://stackoverflow.com/questions/37495039/difference-between-spark-rdds-take1-and-first
- top(): returns the top elements, using the default ordering on the data
- reduce() and fold(): both reduce the input RDD to a single element of the same type
  - fold() additionally needs an initial value
- Python fold: https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780
- Scala fold: https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943
More details on reduce() and fold()
- RDD.reduce((x, y) => x + y), RDD.fold(initialValue)((x, y) => x + y)
  - x: the accumulator; y: an element in the partition
- In reduce(), the accumulator first takes the first element in a partition, then updates its value by adding each subsequent element
  - Example: for a partition (1, 2, 3, 4, 5), RDD.reduce((x, y) => x + y):
    - Iteration 1: x=1, y=2 => x=(1+2)=3
    - Iteration 2: x=3, y=3 => x=(3+3)=6
    - Iteration 3: x=6, y=4 => x=(6+4)=10
    - Iteration 4: x=10, y=5 => x=(10+5)=15
- In fold(), the accumulator first takes the initial value, then updates its value by adding each element
  - Example: for a partition (1, 2, 3, 4, 5), RDD.fold(0)((x, y) => x + y):
    - Iteration 1: x=0, y=1 => x=(0+1)=1
- The function passed to reduce() is supposed to be a commutative and associative binary operation
References:
- http://stackoverflow.com/questions/29150202/pyspark-fold-method-output
- https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
- https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
- http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python
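The per-partition behavior of fold() can be simulated with plain Scala collections: fold each partition from the initial value, then fold the list of partition accumulators, again starting from the initial value (not Spark code, just the same arithmetic):

```scala
// Simulate an RDD split into two partitions
val partitions = List(List(1, 2, 3), List(4, 5))
val zero = 0
val op = (x: Int, y: Int) => x + y

// Each partition is folded sequentially, starting from the initial value
val accumulators = partitions.map(p => p.foldLeft(zero)(op))

// The driver merges the accumulators, using the initial value once more
val result = accumulators.foldLeft(zero)(op)
println(result)
```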
More details on reduce() and fold()
The overall process:
- Partitions are processed in parallel using multiple executors (or executor threads)
- Each partition is processed sequentially using a single thread
- The final merge is performed sequentially using a single thread on the driver
For multiple partitions of an RDD:
- First, the function is applied on each partition; each partition produces an accumulator
- All accumulators are collected by the driver in a nondeterministic order
- Then the function is applied to the list of accumulators
- For fold(), the initial value is used again when aggregating the accumulators
The partitioning behavior, plus ordering nondeterminism, can make reduce()/fold() nondeterministic for non-commutative operations. What is the output of the following?
sc.parallelize(Seq(2.0, 3.0), 2).fold(1.0)((a, b) => math.pow(b, a))
Exercises: predict the result of each of the following
val inputrdd = sc.parallelize(List(1,25,8,4,2))
inputrdd.partitions.size
val result = inputrdd.fold(0)((x,y) => x+1)
val inputrdd = sc.parallelize(List(1,25,8,4,2), 2)
val inputrdd = sc.parallelize(List(1,1,1,1,1))
val result = inputrdd.fold(1)((x,y) => x+y)
val inputrdd = sc.parallelize(List(1,1,1,1,1), 2)
val inputrdd = sc.parallelize(List(1,25,8,4,2), 50)
val result = inputrdd.reduce((x,y) => x+1)
val inputrdd = sc.parallelize(List(2,25,8,4,2))
val inputrdd = sc.parallelize(List(1,25,8,4,2), 10)
val result = inputrdd.reduce((x,y) => x+y)
val inputrdd = sc.parallelize(List(10,25,8,4,2), 10)
val inputrdd = sc.parallelize(List(1,25,8,4,0,2), 2)
val inputrdd = sc.parallelize(List(100,25,8,4,0,2), 2)
Reference: http://stackoverflow.com/questions/33788600/parallelize-method-in-sparkcontext
Answer to the fold question above:
1. With one partition (2.0, 3.0): pow(2.0, 1.0) = 2.0, then pow(3.0, 2.0) = 9.0; the driver then computes pow(9.0, 1.0) = 9.0
2. With two partitions (2.0) and (3.0): the partitions yield pow(2.0, 1.0) = 2.0 and pow(3.0, 1.0) = 3.0; the accumulators may arrive at the driver in either order
   - Order (2.0, 3.0): pow(2.0, 1.0) = 2.0, then pow(3.0, 2.0) = 9.0
   - Order (3.0, 2.0): pow(3.0, 1.0) = 3.0, then pow(2.0, 3.0) = 8.0
Try it:
val inputrdd = sc.parallelize(Seq(2.0, 3.0), 2)
val result = inputrdd.fold(1.0)((a, b) => math.pow(b, a))
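The order dependence of the final merge can be checked with plain Scala: fold the two partition accumulators in both possible arrival orders (simulating the driver-side merge only, not Spark itself):

```scala
val zero = 1.0
val op = (a: Double, b: Double) => math.pow(b, a)

// Per-partition accumulators for partitions (2.0) and (3.0): pow(x, 1.0) = x
val accs = List(2.0, 3.0)

// Spark may merge the accumulators in either order, giving different results
val merged1 = accs.foldLeft(zero)(op)          // arrival order (2.0, 3.0)
val merged2 = accs.reverse.foldLeft(zero)(op)  // arrival order (3.0, 2.0)
println(s"$merged1 vs $merged2")
```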
aggregate()
- The output type of aggregate() can be different from the input RDD's element type
- Prototype: def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B
  - i.e., aggregate(zeroValue)(seqOp, combOp)
- It traverses the elements in different partitions
  - seqOp updates the accumulator within each partition
  - combOp (the combine operation) is then applied to the accumulators from different partitions
  - The zeroValue is used in both seqOp and combOp
Example: calculating the average of an input RDD (1, 2, 3, 3)
val sum = inputRDD.aggregate(0)((x, y) => x + y, (x, y) => x + y)
val count = inputRDD.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
val average = sum / count.toDouble
How about val count = inputRDD.fold(0)((x, y) => x + 1)?
- The output is the number of partitions, because the driver-side merge also adds 1 per accumulator rather than summing them. See http://stackoverflow.com/questions/29150202/pyspark-fold-method-output
Reference: http://stackoverflow.com/questions/26761087/explanation-of-the-aggregate-scala-function
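The two-phase seqOp/combOp flow can be simulated with plain Scala, computing the average of (1, 2, 3, 3) in one pass with a (total, count) accumulator (not Spark code; foldLeft plays the role of the per-partition traversal):

```scala
// Simulate the RDD (1, 2, 3, 3) split into two partitions
val partitions = List(List(1, 2), List(3, 3))

val zero = (0, 0) // (running total, running count)
val seqOp  = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)

// seqOp folds within each partition; combOp merges the partition accumulators
val accs = partitions.map(p => p.foldLeft(zero)(seqOp))
val (sum, count) = accs.foldLeft(zero)(combOp)
val average = sum / count.toDouble
println(average)
```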
aggregate()
Using a tuple x as the accumulator:
- x._1: the running total
- x._2: the running count
val inputrdd = sc.parallelize(List(1, 25, 8, 4, 2))
val result = inputrdd.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val ave = result._1 / result._2.toDouble