Lecture 29: Distributed Systems

1 Lecture 29: Distributed Systems
CS May 8, 2019 Slides drawn heavily from Vitaly's CS 5450, Fall 2018

2 Conventional HPC System
Compute nodes: high-end processors, lots of RAM
Network: specialized, very high performance
Storage server: RAID disk array

3 Conventional HPC Programming Model
Programs are described at a very low level, with detailed control of processing and scheduling
Applications rely on a small number of software packages written by specialists, which limits the problems and solution methods that can be used

4 Typical HPC Operation
Characteristics: long-lived processes; make use of spatial locality; hold all data in memory; high-bandwidth communication
Strengths: high utilization of resources; effective for many scientific applications
Weaknesses: requires careful tuning of applications to resources; intolerant of any variability

5 HPC Fault Tolerance
Checkpoint: periodically store the state of all processes; significant I/O traffic
Restore after failure: reset state to the last checkpoint; all intervening computation is wasted
Performance scaling: very sensitive to the number of failing components
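This checkpoint/restore cycle can be illustrated with a small Scala sketch (the state type, interval k, and file name are hypothetical, and a real HPC job would coordinate every process and write through the storage server):

import java.io._

// Hypothetical single-process state; real HPC jobs checkpoint every process's memory.
case class State(step: Int, values: Array[Double]) extends Serializable

def saveCheckpoint(s: State, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(s) finally out.close()   // the I/O cost: all state goes to disk
}

def loadCheckpoint(path: String): State = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[State] finally in.close()
}

// Checkpoint every k steps; after a failure, restart from the last checkpoint
// and redo the up-to-(k - 1) steps computed since then; that work is wasted.
val k = 100
var state = State(0, Array.fill(4)(0.0))
while (state.step < 1000) {
  state = State(state.step + 1, state.values.map(_ + 1.0))   // one compute step
  if (state.step % k == 0) saveCheckpoint(state, "checkpoint.bin")
}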

6 Datacenters

7 Ideal Cluster Programming Model
Applications are written in terms of high-level operations on the data
The runtime system controls scheduling and load balancing

8 MapReduce

9 MapReduce Programming Model
Map computation across many data objects
Aggregate results in many different ways
System deals with resource allocation and availability

10 Example: Word Count
In parallel, each worker computes word counts from individual files
Collect results, wait until all finished
Merge intermediate output
Compute word count on merged intermediates
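A minimal in-memory Scala sketch of those steps, where the two example strings below stand in for the workers' input files:

val files = Seq("Welcome everyone", "Hello everyone")   // one hypothetical "file" per map worker

// Map step: each worker emits (word, 1) pairs for its own file.
val intermediate = files.flatMap(file => file.split(" ").map(word => (word, 1)))

// Merge step: after all workers finish, group the intermediate pairs by word
// and compute the final count per word.
val counts = intermediate.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// Map(everyone -> 2, Welcome -> 1, Hello -> 1)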

11 Parallel Map
Process pieces of the dataset to generate (key, value) pairs in parallel
Example: Map Task 1 processes "Welcome everyone" and emits (Welcome, 1), (everyone, 1); Map Task 2 processes "Hello everyone" and emits (Hello, 1), (everyone, 1)

12 Reduce
Merge all intermediate values per key
Example: (Welcome, 1), (everyone, 1), (Hello, 1), (everyone, 1) merge into (everyone, 2), (Welcome, 1), (Hello, 1)

13 Partition
Merge all intermediate values in parallel: partition the keys, assigning each key to one reduce task
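A common choice for that assignment is a hash of the key modulo the number of reduce tasks, so every mapper routes a given key to the same reducer; a Scala sketch (the task count is arbitrary):

val numReduceTasks = 4   // hypothetical; set by the job configuration

// Every map worker applies the same function, so all (key, value) pairs
// for a given key land in the same reduce task's partition.
def partitionFor(key: String): Int =
  Math.floorMod(key.hashCode, numReduceTasks)

partitionFor("everyone")   // same reduce task no matter which mapper emitted the pair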

14 MapReduce API

15 WordCount with MapReduce

16 WordCount with MapReduce
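A sketch of the two user-supplied word-count functions in Scala, written against a hypothetical emit-style interface (the signatures are an assumption, not the exact API):

// Assumed shape of the interface:
//   map:    (input key, input value)  -> list of (key, value) pairs
//   reduce: (key, list of values)     -> aggregated value

def wordCountMap(docName: String, contents: String): Seq[(String, Int)] =
  contents.split("\\s+").toSeq.map(word => (word, 1))   // emit (word, 1) for every word

def wordCountReduce(word: String, counts: Seq[Int]): Int =
  counts.sum                                            // total occurrences of this word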

17 MapReduce Execution

18 Fault Tolerance in MapReduce
A map worker writes its intermediate output to local disk, separated by partition; once completed, it tells the master node
A reduce worker is told the locations of the map task outputs, pulls its partition's data from each mapper, and executes the reduce function across that data
Note: there is an "all-to-all" shuffle between mappers and reducers, and data is written to disk ("materialized") before each stage

19 Fault Tolerance in MapReduce
The master node monitors the state of the system; if the master fails, the job aborts
Map worker failure: in-progress and completed tasks are marked as idle; reduce workers are notified when a map task is re-executed on another map worker
Reduce worker failure: in-progress tasks are reset and re-executed; completed tasks have already been written to the global file system

20 Stragglers
A straggler is a task that takes a long time to execute, due to bugs, flaky hardware, or poor partitioning
For slow map tasks, execute them in parallel on a second "map" worker as a backup and race to complete the task
When most tasks are done, reschedule any remaining executing tasks, keeping track of the redundant executions
This significantly reduces the overall run time
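The backup-task race can be sketched with Scala futures: run the same task twice and keep whichever copy finishes first (the task body and names are placeholders, not the MapReduce implementation):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Placeholder for a map task; a real task would read its input split
// and write partitioned intermediate output.
def runMapTask(split: Int): String = { Thread.sleep(50); s"output-of-split-$split" }

// Speculative execution: launch the original and a backup copy and
// take whichever finishes first; the scheduler tracks the redundant run.
def withBackup(split: Int): Future[String] =
  Future.firstCompletedOf(Seq(Future(runMapTask(split)), Future(runMapTask(split))))

val result = Await.result(withBackup(7), 10.seconds)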

21 Modern Data Processing

22 Apache Spark
Goal 1: extend the MapReduce model to better support two common classes of analytics apps: iterative algorithms (machine learning, graphs) and interactive data mining
Goal 2: enhance programmability: integrate into the Scala programming language, allow interactive use from the Scala interpreter, and also support Java, Python, ...

23 Data Flow Models
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage (example: MapReduce)
These models are inefficient for applications that repeatedly reuse a working set of data: iterative algorithms (machine learning, graphs) and interactive data mining (R, Excel, Python)

24 Resilient Distributed Datasets (RDDs)
Resilient distributed datasets (RDDs) are immutable, partitioned collections of objects spread across a cluster, stored in RAM or on disk
Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
Allow apps to cache working sets in memory for efficient reuse
Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
Actions on RDDs (count, reduce, collect, save, ...) support many applications

25 Spark Operations
Transformations define a new RDD: map, flatMap, filter, sample, groupByKey, sortByKey, union, join, etc.
Actions return a result to the driver program: collect, reduce, count, lookupKey, save
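A small sketch of the split, assuming (as in the slides' later examples) a SparkContext bound to the name spark and a placeholder input path; the transformations only define RDDs, and nothing runs until an action is called:

// Transformations are lazy: these lines just build up new RDDs.
val lines  = spark.textFile("hdfs://...")               // placeholder path
val errors = lines.filter(line => line.contains("ERROR"))
errors.cache()                                          // keep the working set in memory

// Actions return results to the driver and trigger the actual computation.
val total = errors.count()                              // first action: scans the input
val few   = errors.take(10)                             // reuses the cached RDD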

26 Example: WordCount
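The usual Spark word count looks like the sketch below (the paths are placeholders and the SparkContext is again assumed to be named spark):

val counts = spark.textFile("hdfs://.../input")         // placeholder input path
  .flatMap(line => line.split(" "))                     // split lines into words
  .map(word => (word, 1))                               // emit (word, 1) pairs
  .reduceByKey(_ + _)                                   // sum counts per word

counts.saveAsTextFile("hdfs://.../output")              // placeholder output path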

27 Example: Logistic Regression
Goal: find the best line separating two sets of points

val rdd = spark.textFile(...).map(readPoint)
val data = rdd.cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)

(Figure: an initial random line converges to the best-fit line.)

28 Example: Logistic Regression

29 Spark Scheduler
Creates a DAG of stages
Pipelines functions within a stage
Cache-aware for work reuse and locality
Partitioning-aware to avoid shuffles
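One way to see the stage structure is an RDD's printed lineage: in the sketch below the flatMap and map are pipelined into a single stage, while reduceByKey introduces a shuffle and hence a stage boundary (the path and names are placeholders):

val words  = spark.textFile("hdfs://.../input").flatMap(line => line.split(" "))
val pairs  = words.map(word => (word, 1))        // pipelined with flatMap in the same stage
val counts = pairs.reduceByKey(_ + _)            // shuffle dependency: starts a new stage

println(counts.toDebugString)                    // prints the lineage, including the shuffle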

30 RDD Fault Tolerance
An RDD maintains lineage information that can be used to reconstruct lost partitions

val rdd = spark.textFile(...).map(readPoint).filter(...)

Lineage: File -> Mapped RDD -> Filtered RDD

31 Distributed Systems Summary
Machines fail: if you have lots of machines, machines will fail frequently
Goals: reliability, consistency, scalability, transparency
Abstractions are good, as long as they don't cost you too much

32 So what's the takeaway…

