Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC


1 Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
2012

2 Why Spark?
Another system for big data analytics
Isn’t MapReduce good enough?
  Simplifies batch processing on large commodity clusters

3 Expensive save to disk for fault tolerance

4 Why Spark?
MapReduce can be expensive for some applications, e.g.,
  Iterative
  Interactive
Lacks efficient data sharing
Specialized frameworks did evolve for different programming models
  Bulk Synchronous Parallel (Pregel)
  Iterative MapReduce (HaLoop)
  ...

5 Solution: Resilient Distributed Datasets (RDDs)
Immutable, partitioned collection of records
Built through coarse-grained transformations (map, join, ...)
Can be cached for efficient reuse
Apply the same operations to many pieces of data
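
The properties listed on this slide can be sketched with a small pure-Python model. This is not the Spark API: `ToyRDD`, `from_list`, and the `computations` counter are hypothetical names made up for illustration, and a round-robin split stands in for reading blocks from HDFS.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection of records."""

    def __init__(self, compute: Callable[[int], list], num_partitions: int):
        self._compute = compute              # how to build partition i
        self.num_partitions = num_partitions
        self._cache: Optional[List[Optional[list]]] = None
        self.computations = 0                # partition builds, for illustration only

    @staticmethod
    def from_list(data: list, num_partitions: int) -> "ToyRDD":
        # Round-robin split; an assumption standing in for HDFS blocks.
        parts = [data[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(lambda i: list(parts[i]), num_partitions)

    def map(self, f) -> "ToyRDD":
        # Coarse-grained: the same function is applied to every record.
        # A new dataset is returned; the parent is never modified.
        return ToyRDD(lambda i: [f(x) for x in self.partition(i)],
                      self.num_partitions)

    def filter(self, pred) -> "ToyRDD":
        return ToyRDD(lambda i: [x for x in self.partition(i) if pred(x)],
                      self.num_partitions)

    def cache(self) -> "ToyRDD":
        # Ask for computed partitions to be kept for reuse.
        self._cache = [None] * self.num_partitions
        return self

    def partition(self, i: int) -> list:
        if self._cache is not None and self._cache[i] is not None:
            return self._cache[i]
        self.computations += 1
        result = self._compute(i)
        if self._cache is not None:
            self._cache[i] = result
        return result

    def collect(self) -> list:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list(list(range(10)), num_partitions=2)
squares = numbers.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
odds = squares.filter(lambda x: x % 2 == 1)
even_values = evens.collect()
odd_values = odds.collect()
```

Both `evens` and `odds` reuse the cached `squares` partitions, so each partition of `squares` is built only once in this model.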

6 [Figure: chain of three RDDs; labels: HDFS, Read, Map, Reduce, Cache]

7 Solution: Resilient Distributed Datasets (RDDs)
Immutable, partitioned collection of records
Built through coarse-grained, ordered transformations (map, join, ...)
Fault recovery? Lineage!
  Log the coarse-grained operation applied to a partitioned dataset
  Simply recompute the lost partition if a failure occurs
  No cost if there is no failure
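
Lineage-based recovery can be sketched as follows. This is a toy model under stated assumptions, not Spark internals: `LineageDataset` and its methods are hypothetical names, and only per-partition operations (map, filter) are modeled, so a lost partition depends on exactly one input partition.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[list], list]]


class LineageDataset:
    """Toy dataset that logs coarse-grained operations instead of replicating data."""

    def __init__(self, source: List[list], lineage: Tuple[Step, ...] = ()):
        self.source = source        # stable input partitions (stand-in for HDFS)
        self.lineage = lineage      # ordered log of operations applied so far
        self.materialized: Dict[int, list] = {}

    def map(self, f) -> "LineageDataset":
        step = ("map", lambda part: [f(x) for x in part])
        return LineageDataset(self.source, self.lineage + (step,))

    def filter(self, pred) -> "LineageDataset":
        step = ("filter", lambda part: [x for x in part if pred(x)])
        return LineageDataset(self.source, self.lineage + (step,))

    def compute_partition(self, i: int) -> list:
        # Replay the logged operations on input partition i only.
        part = list(self.source[i])
        for _name, op in self.lineage:
            part = op(part)
        return part

    def materialize(self) -> None:
        for i in range(len(self.source)):
            self.materialized[i] = self.compute_partition(i)

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        self.materialized.pop(i, None)

    def get_partition(self, i: int) -> list:
        if i not in self.materialized:
            # Recovery: recompute just the lost partition from its lineage.
            self.materialized[i] = self.compute_partition(i)
        return self.materialized[i]


source = [[1, 2, 3], [4, 5, 6]]
ds = LineageDataset(source).map(lambda x: x * 10).filter(lambda x: x > 20)
ds.materialize()
ds.lose_partition(1)
recovered = ds.get_partition(1)
logged_ops = [name for name, _ in ds.lineage]
```

When nothing fails, the only overhead in this model is the small `lineage` tuple, which matches the "no cost if there is no failure" point above.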

8 [Figure: lineage graph of RDDs; labels: HDFS, Read, Map, Reduce, Cache, Lineage]

9 RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
[Figure: lineage graph of RDDs; labels: HDFS, Read, Map, Reduce, Cache, Lineage]

10 What can you do with Spark?
RDD operations
  Transformations, e.g., filter, join, map, group-by, ...
  Actions, e.g., count, print, ...
Control
  Partitioning
  Persistence
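
The split between transformations and actions can be sketched with a toy class. In Spark, transformations are lazy and actions trigger the computation. The model below imitates that in plain Python; `LazyDataset` and `noisy_double` are hypothetical names, not Spark calls.

```python
class LazyDataset:
    """Toy model: transformations record a plan, actions run it."""

    def __init__(self, records, plan=()):
        self._records = records
        self._plan = plan

    # Transformations: return a new dataset description, compute nothing.
    def map(self, f):
        return LazyDataset(self._records, self._plan + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._records, self._plan + (("filter", pred),))

    def _run(self):
        for record in self._records:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: execute the plan and return a value to the caller.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


calls = []


def noisy_double(x):
    calls.append(x)          # records when the function actually runs
    return 2 * x


doubled = LazyDataset(range(5)).map(noisy_double)
large = doubled.filter(lambda x: x >= 4)
calls_before_action = list(calls)   # no action has run yet
n_large = large.count()             # first action: the plan executes here
calls_after_count = list(calls)
values = large.collect()            # second action: the plan runs again
```

Because nothing is persisted in this model, the second action replays the whole plan. That is the situation the persistence control on this slide is meant to avoid.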

11 Partitioning
PageRank
Joins take place repeatedly
Good partitioning reduces shuffles
[Figure: Links (url, neighbors) and Ranks (url, ranks) are joined to produce Contributions, which produce new Ranks (url, ranks); the cycle repeats]
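
A minimal sketch of why partitioning matters for the repeated join, in plain Python rather than Spark. Assumptions, none of which come from the slides: the sample graph, the `hash_partition` and `shifted_partition` functions, the constants 0.15 and 0.85, and the rule that every page has at least one outgoing link. The `join_moves` counter tallies rank records that live in a different partition from their links record and would have to be moved for the join.

```python
def hash_partition(url: str, n: int) -> int:
    # Deterministic stand-in for a hash partitioner.
    return sum(url.encode("utf-8")) % n


def shifted_partition(url: str, n: int) -> int:
    # A deliberately mismatched partitioner, for contrast.
    return (sum(url.encode("utf-8")) + 1) % n


def pagerank(links, num_partitions=2, iterations=50,
             rank_partitioner=hash_partition):
    n = num_partitions
    # Links are partitioned once by url and never change.
    link_parts = [{} for _ in range(n)]
    for url, neighbors in links.items():
        link_parts[hash_partition(url, n)][url] = neighbors

    rank_parts = [{} for _ in range(n)]
    for url in links:
        rank_parts[rank_partitioner(url, n)][url] = 1.0

    join_moves = 0
    for _ in range(iterations):
        contribs = {}
        for p in range(n):
            for url, neighbors in link_parts[p].items():
                # Join links with ranks on url.
                rp = rank_partitioner(url, n)
                if rp != p:
                    join_moves += 1      # rank record must cross partitions
                share = rank_parts[rp][url] / len(neighbors)
                for dest in neighbors:
                    contribs[dest] = contribs.get(dest, 0.0) + share
        # New ranks from summed contributions.
        for url in links:
            rank_parts[rank_partitioner(url, n)][url] = (
                0.15 + 0.85 * contribs.get(url, 0.0))

    ranks = {url: r for part in rank_parts for url, r in part.items()}
    return ranks, join_moves


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks, moves_copartitioned = pagerank(links)
ranks_mismatched, moves_mismatched = pagerank(
    links, rank_partitioner=shifted_partition)
```

With links and ranks partitioned the same way, the join needs no data movement in this model. With mismatched partitioners, every rank record moves in every iteration, while the resulting ranks are identical.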

12 Generality
RDDs allow unification of different programming models
  Stream Processing
  Graph Processing
  Machine Learning
  ...

13 Gather-Apply-Scatter on GraphX
Graph represented in a table
[Figure: graph with vertices A, B, C, D; tables: Triplets, Vertices, Neighbors]

14 Gather-Apply-Scatter on GraphX
Gather at A
[Figure: graph with vertices A, B, C, D; operation: Group-By]

15 Gather-Apply-Scatter on GraphX
Apply
[Figure: graph with vertices A, B, C, D; operation: Map]

16 Gather-Apply-Scatter on GraphX
Scatter
[Figure: graph with vertices A, B, C, D; operation: Join on Triplets]
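
The three steps on slides 13 to 16 can be sketched as table operations in plain Python: gather as a group-by, apply as a map, scatter as a join. This is not the GraphX API. The edge list, the vertex values, and the update rule (a vertex takes the sum of its in-neighbors' values) are assumptions chosen only to make the example concrete.

```python
from collections import defaultdict


def build_triplets(vertices, edges):
    # Join vertex values onto each edge: (src, dst, src_value, dst_value).
    return [(src, dst, vertices[src], vertices[dst]) for src, dst in edges]


def gather(triplets):
    # Group-by destination: collect the values sent by in-neighbors.
    grouped = defaultdict(list)
    for _src, dst, src_value, _dst_value in triplets:
        grouped[dst].append(src_value)
    return dict(grouped)


def apply_step(vertices, gathered):
    # Map over vertices: compute each new value from the gathered messages.
    # Vertices with no incoming messages keep their old value.
    return {v: sum(gathered[v]) if v in gathered else value
            for v, value in vertices.items()}


def scatter(new_vertices, edges):
    # Join the updated vertex table back onto the edges.
    return build_triplets(new_vertices, edges)


# Hypothetical graph: B, C, and D point to A; A points to B.
vertices = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")]

triplets = build_triplets(vertices, edges)
gathered = gather(triplets)
new_vertices = apply_step(vertices, gathered)
new_triplets = scatter(new_vertices, edges)
```

Each step returns a new table and leaves its inputs untouched, which mirrors how the same group-by, map, and join transformations operate on immutable RDDs.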

17 Summary
RDDs provide a simple and efficient programming model
Generalized to a broad set of applications
Leverages the coarse-grained nature of parallel algorithms for failure recovery

