Efficient On-Demand Operations in Large-Scale Infrastructures Final Defense Presentation by Steven Y. Ko (presentation transcript)
1 Efficient On-Demand Operations in Large-Scale Infrastructures Final Defense Presentation by Steven Y. Ko

2 Thesis Focus
One class of operations that faces one common set of challenges
◦ Cutting across four diverse and popular types of distributed infrastructures

3 Large-Scale Infrastructures
They are everywhere:
◦ Wide-area peer-to-peer: BitTorrent, LimeWire, etc. (for Web users)
◦ Grids: TeraGrid, Grid5000, … (for scientific communities)
◦ Research testbeds: CCT, PlanetLab, Emulab, … (for computer scientists)
◦ Clouds: Amazon, Google, HP, IBM, … (for Web users and app developers)

4 On-Demand Operations
Operations that act upon the most up-to-date state of the infrastructure
◦ E.g., what’s going on right now in my data center?
Four operations in the thesis
◦ On-demand monitoring (prelim)
 Clouds, research testbeds
◦ On-demand replication (this talk)
 Clouds, research testbeds
◦ On-demand scheduling
 Grids
◦ On-demand key/value store
 Peer-to-peer
Diverse infrastructures & essential operations: all face one common set of challenges

5 Common Challenges: Scale and Dynamism
On-demand monitoring
◦ Scale: 200K machines (Google)
◦ Dynamism: values can change at any time (e.g., CPU-util)
On-demand replication
◦ Scale: ~300TB total, adding 2TB/day (Facebook HDFS)
◦ Dynamism: resource availability (failures, bandwidth, etc.)
On-demand scheduling
◦ Scale: 4K CPUs running 44,000 tasks accessing 588,900 files (Coadd)
◦ Dynamism: resource availability (CPU, memory, disk, etc.)
On-demand key/value store
◦ Scale: millions of peers (BitTorrent)
◦ Dynamism: short-term unresponsiveness

6 Common Challenges: Scale and Dynamism
[Chart: on-demand operations occupy the high-scale, high-dynamism corner of a scale vs. dynamism (static to dynamic) plane]

7 On-Demand Monitoring (Recap)
Need
◦ Users and admins need to query & monitor the most up-to-date attributes (CPU-util, disk-cap, etc.) of an infrastructure
Scale
◦ Number of machines, the amount of monitoring data, etc.
Dynamism
◦ Static attributes (e.g., CPU-type) vs. dynamic attributes (e.g., CPU-util)
Moara
◦ Expressive queries, e.g., avg. CPU-util where disk > 10 G or (mem-util < 50% and # of processes < 20)
◦ Quick response time and bandwidth efficiency
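The slide's example query can be illustrated with a toy evaluator over per-machine attribute dictionaries. This is a hedged sketch in plain Python, not Moara's actual API or query language; the attribute names (`disk_gb`, `mem_util`, `procs`, `cpu_util`) are hypothetical stand-ins for the attributes named on the slide, and a real system would evaluate this in-network rather than centrally.

```python
# Illustrative sketch (NOT Moara's API): evaluate the slide's query
# "avg. CPU-util where disk > 10G or (mem-util < 50% and # of processes < 20)"
# over a list of per-machine attribute dictionaries.

def matches(node):
    """Predicate from the example query; attribute names are hypothetical."""
    return node["disk_gb"] > 10 or (node["mem_util"] < 50 and node["procs"] < 20)

def avg_cpu_util(nodes):
    """Average CPU utilization over the machines satisfying the predicate."""
    selected = [n["cpu_util"] for n in nodes if matches(n)]
    return sum(selected) / len(selected) if selected else None

nodes = [
    {"cpu_util": 80, "disk_gb": 50, "mem_util": 90, "procs": 100},  # matches (disk)
    {"cpu_util": 20, "disk_gb": 5,  "mem_util": 40, "procs": 10},   # matches (mem & procs)
    {"cpu_util": 60, "disk_gb": 5,  "mem_util": 90, "procs": 100},  # no match
]
print(avg_cpu_util(nodes))  # -> 50.0
```

Moara's contribution is answering such queries quickly and bandwidth-efficiently at scale; the predicate-then-aggregate shape is the same.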

8 Common Challenges: On-Demand Monitoring (DB vs. Moara)
[Chart: (data) scale vs. (attribute) dynamism]
◦ Static attributes (e.g., CPU type), small scale: centralized (e.g., DB)
◦ Static attributes, large scale: scaling centralized (e.g., replicated DB)
◦ Dynamic attributes (e.g., CPU util.), small scale: centralized (e.g., DB)
◦ Dynamic attributes, large scale: my solution (Moara), etc.

9 Common Challenges: On-Demand Replication of Intermediate Data (in detail later)
[Chart: (data) scale vs. (availability) dynamism]
◦ Static, small scale: local file systems (e.g., compilers)
◦ Static, large scale: distributed file systems (e.g., Hadoop w/ HDFS)
◦ Dynamic, small scale: replication (e.g., Hadoop w/ HDFS)
◦ Dynamic, large scale: my solution (ISS)

10-11 Thesis Statement (build)
On-demand operations can be implemented efficiently in spite of scale and dynamism in a variety of distributed systems.
◦ Efficiency: responsiveness & bandwidth

12 Thesis Contributions
◦ Identifying on-demand operations as an important class of operations in large-scale infrastructures
◦ Identifying scale and dynamism as two common challenges
◦ Arguing the need for an on-demand operation in each case
◦ Showing how efficient each can be through simulations and real deployments

13 Thesis Overview
ISS (Intermediate Storage System)
◦ Implements on-demand replication
◦ USENIX HotOS ’09
Moara
◦ Implements on-demand monitoring
◦ ACM/IFIP/USENIX Middleware ’08
Worker-centric scheduling
◦ Implements on-demand scheduling
◦ ACM/IFIP/USENIX Middleware ’07
MPIL (Multi-Path Insertion and Lookup)
◦ Implements on-demand key/value store
◦ IEEE DSN ’05

14 Thesis Overview
[Diagram: the four infrastructures from slide 3, annotated with the matching solutions: Moara & ISS (clouds, research testbeds), Worker-Centric Scheduling (Grids), MPIL (wide-area peer-to-peer)]

15 Related Work
◦ Management operations: MON [Lia05], vxargs, PSSH [Pla], etc.
◦ Scheduling [Ros04, Ros05, Vis04]
◦ Gossip-based multicast [Bir02, Gup02, Ker01]
◦ On-demand scaling: Amazon AWS, Google AppEngine, MS Azure, RightScale, etc.

16 ISS: Intermediate Storage System

17-20 Our Position (build)
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
 Scale and dynamism
◦ ISS (Intermediate Storage System) with on-demand replication

21 Dataflow Programming Frameworks
Runtime systems that execute dataflow programs
◦ MapReduce (Hadoop), Pig, Hive, etc.
◦ Gaining popularity for massive-scale data processing
◦ Distributed and parallel execution on clusters
A dataflow program consists of
◦ Multi-stage computation
◦ Communication patterns between stages

22 Example 1: MapReduce
Two-stage computation with all-to-all communication
◦ Google introduced it; Yahoo! open-sourced it (Hadoop)
◦ Two functions, Map and Reduce, supplied by a programmer
◦ Massively parallel execution of Map and Reduce
[Diagram: Stage 1 (Map) feeds Stage 2 (Reduce) through an all-to-all Shuffle]
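The two-stage Map-Shuffle-Reduce pattern can be sketched as a toy, single-process word count. This is the standard textbook illustration of the model, not Hadoop's code; real execution runs the Map and Reduce calls in parallel across a cluster, with the Shuffle as the all-to-all network step.

```python
from collections import defaultdict

# Toy word count illustrating Map -> Shuffle (all-to-all) -> Reduce.
def map_fn(doc):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in doc.split()]

def reduce_fn(key, values):
    """Reduce: combine all counts for one word."""
    return (key, sum(values))

def mapreduce(docs):
    # Map stage: run map_fn on every input split (parallel in a real cluster).
    intermediate = [kv for doc in docs for kv in map_fn(doc)]
    # Shuffle: group intermediate pairs by key -- this is the all-to-all step,
    # and the grouped data is exactly the "intermediate data" of this talk.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce stage: one reduce_fn call per key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["a b a", "b c"]))  # -> {'a': 2, 'b': 2, 'c': 1}
```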

23 Example 2: Pig and Hive
Multi-stage with either all-to-all or 1-to-1 communication
[Diagram: Stage 1 (Map), Stage 2 (Reduce), Stage 3 (Map), Stage 4 (Reduce), connected by all-to-all Shuffle and 1-to-1 communication]

24 Usage
Google (MapReduce)
◦ Indexing: a chain of 24 MapReduce jobs
◦ ~200K jobs processing 50PB/month (in 2006)
Yahoo! (Hadoop + Pig)
◦ WebMap: a chain of 100 MapReduce jobs
Facebook (Hadoop + Hive)
◦ ~300TB total, adding 2TB/day (in 2008)
◦ 3K jobs processing 55TB/day
Amazon
◦ Elastic MapReduce service (pay-as-you-go)
Academic clouds
◦ Google-IBM Cluster at UW (Hadoop service)
◦ CCT at UIUC (Hadoop & Pig service)

25 One Common Characteristic
Intermediate data
◦ Intermediate data? Data between stages
Similarities to traditional intermediate data [Bak91, Vog99]
◦ E.g., .o files
◦ Critical to produce the final output
◦ Short-lived, written once and read once, used immediately
◦ Computational barrier

26 One Common Characteristic: Computational Barrier
[Diagram: the Shuffle between Stage 1 (Map) and Stage 2 (Reduce) forms a computational barrier]

27 Why Important?
Scale and dynamism
◦ Large scale: possibly a very large amount of intermediate data
◦ Dynamism: loss of intermediate data => the task can’t proceed
[Diagram: Stage 1 (Map), Stage 2 (Reduce)]

28 Failure Stats
◦ 5 average worker deaths per MapReduce job (Google in 2006)
◦ One disk failure in every run of a 6-hour MapReduce job with 4,000 machines (Google in 2008)
◦ 50 machine failures out of a 20K-machine cluster (Yahoo! in 2009)

29 Hadoop Failure Injection Experiment
Emulab setting
◦ 20 machines sorting 36GB
◦ 4 LANs and a core switch (all 100 Mbps)
1 failure after Map
◦ Re-execution of Map-Shuffle-Reduce
◦ ~33% increase in completion time
[Timeline: Map, Shuffle, Reduce, followed by a second Map, Shuffle, Reduce after the failure]

30 Re-Generation for Multi-Stage
Cascaded re-execution: expensive
[Diagram: Stage 1 (Map), Stage 2 (Reduce), Stage 3 (Map), Stage 4 (Reduce)]

31 Importance of Intermediate Data
Why?
◦ Scale and dynamism
◦ A lot of data + very costly when lost
Current systems handle it themselves.
◦ Re-generate when lost: can lead to expensive cascaded re-execution
We believe the storage layer can provide a better solution than the dataflow programming frameworks.

32 Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
◦ ISS (Intermediate Storage System)
 Why storage?
 Challenges
 Solution hypotheses
 Hypotheses validation

33 Why Storage? On-Demand Replication
Stops cascaded re-execution
[Diagram: Stage 1 (Map), Stage 2 (Reduce), Stage 3 (Map), Stage 4 (Reduce) with replicated intermediate data between stages]

34 So, Are We Done? No!
Challenge: minimal interference
◦ The network is heavily utilized during Shuffle.
◦ Replication requires network transmission too, and needs to replicate a large amount of data.
◦ Minimizing interference is critical for the overall job completion time.
◦ Efficiency: completion time + bandwidth
HDFS (Hadoop Distributed File System): much interference

35 Default HDFS Interference
On-demand replication of Map and Reduce outputs (2 copies in total)

36 Background Transport Protocols
TCP Nice [Ven02] & TCP-LP [Kuz06]
◦ Support background & foreground flows
Pros
◦ Background flows do not interfere with foreground flows (functionality)
Cons
◦ Designed for the wide-area Internet
◦ Application-agnostic
◦ Not designed for data center replication
We can do better!

37 Revisiting Common Challenges
◦ Scale: need to replicate a large amount of intermediate data
◦ Dynamism: failures, foreground traffic (bandwidth availability)
Together, the two challenges create much interference.

38 Revisiting Common Challenges: Intermediate Data Management
[Chart: (data) scale vs. (availability) dynamism]
◦ Static, small scale: local file systems (e.g., compilers)
◦ Static, large scale: distributed file systems (e.g., Hadoop w/ HDFS)
◦ Dynamic, small scale: replication (e.g., Hadoop w/ HDFS)
◦ Dynamic, large scale: ISS

39 Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
◦ ISS (Intermediate Storage System)
 Why is storage the right abstraction?
 Challenges
 Solution hypotheses
 Hypotheses validation

40 Three Hypotheses
1. Asynchronous replication can help.
◦ HDFS replication works synchronously.
2. The replication process can exploit the inherent bandwidth heterogeneity of data centers (next).
3. Data selection can help (later).

41 Bandwidth Heterogeneity
Data center topology: hierarchical
◦ Top-of-the-rack switches (under-utilized)
◦ Shared core switch (fully utilized)

42 Data Selection
Only replicate locally-consumed data
[Diagram: Stage 1 (Map), Stage 2 (Reduce), Stage 3 (Map), Stage 4 (Reduce)]
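The data-selection hypothesis can be sketched as a simple filter. The intuition: a Map output partition that was shuffled to a remote consumer already has a second copy on that remote node, so only partitions consumed on the Map node itself need explicit replication. This is an illustrative sketch of the idea, not ISS's code; the function name and the `(partition_id, consumer_node)` representation are hypothetical.

```python
# Sketch of the data-selection hypothesis (illustrative, not ISS's code):
# keep only the partitions whose consumer runs on the same node as the
# map task that produced them -- those are the ones with no remote copy.

def partitions_to_replicate(map_node, partitions):
    """partitions: list of (partition_id, consumer_node) for one map task."""
    return [pid for pid, consumer in partitions if consumer == map_node]

# p1 was shuffled to node-B, so node-B already holds a copy; p0 and p2
# are consumed locally on node-A and would vanish if node-A fails.
parts = [("p0", "node-A"), ("p1", "node-B"), ("p2", "node-A")]
print(partitions_to_replicate("node-A", parts))  # -> ['p0', 'p2']
```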

43 Three Hypotheses
1. Asynchronous replication can help.
2. The replication process can exploit the inherent bandwidth heterogeneity of data centers.
3. Data selection can help.
The question is not if, but how much. If effective, these become techniques.

44 Experimental Setting
Emulab with 80 machines
◦ 4 LANs, each with 20 machines
◦ 4 x 100Mbps top-of-the-rack switches
◦ 1 x 1Gbps core switch
◦ Various configurations give similar results.
Input data: 2GB/machine, randomly generated
Workload: sort
5 runs
◦ Std. dev. ~ 100 sec: small compared to the overall completion time
2 replicas of Map outputs in total

45 Asynchronous Replication
Modification for asynchronous replication
◦ With an increasing level of interference
Four levels of interference
◦ Hadoop: original, no replication, no interference
◦ Read: disk read, no network transfer, no actual replication
◦ Read-Send: disk read & network send, no actual replication
◦ Rep.: full replication
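The asynchronous-replication hypothesis contrasts with HDFS-style synchronous replication, where the writer blocks until all copies exist. A minimal sketch of the asynchronous shape, assuming a producer-consumer queue and a background thread standing in for the actual network transfer (all names here are hypothetical, not the modified Hadoop's code):

```python
import queue
import threading

# Minimal sketch of asynchronous replication: the writer enqueues a block
# and returns immediately; a background thread replicates it later.
replicated = []            # stand-in for the remote replica store
pending = queue.Queue()

def replicator():
    """Background thread: drain the queue and 'replicate' each block."""
    while True:
        block = pending.get()
        if block is None:          # shutdown sentinel
            break
        replicated.append(block)   # stand-in for the network transfer
        pending.task_done()

def write_block(block):
    pending.put(block)             # returns immediately; no waiting on replicas

t = threading.Thread(target=replicator, daemon=True)
t.start()
for b in ["blk_1", "blk_2"]:
    write_block(b)
pending.join()                     # only this demo waits; real writers would not
print(replicated)                  # -> ['blk_1', 'blk_2']
```

The experiment's "Read" and "Read-Send" levels correspond to progressively enabling the disk-read and network-send halves of the replicator's work.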

46 Asynchronous Replication
Network utilization makes the difference
Both Map & Shuffle get affected
◦ Some Maps need to read remotely

47 Three Hypotheses (Validation)
1. Asynchronous replication can help, but still can’t eliminate the interference.
2. The replication process can exploit the inherent bandwidth heterogeneity of data centers.
3. Data selection can help.

48 Rack-Level Replication
Rack-level replication is effective.
◦ Only 20~30 rack failures per year, mostly planned (Google 2008)
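Rack-level placement exploits the bandwidth heterogeneity from slide 41: a replica placed in the writer's own rack travels only over the under-utilized top-of-the-rack switch, never the shared core switch. A hedged sketch of such placement, with a hypothetical rack map and the simplest possible target choice (not ISS's actual placement policy):

```python
# Sketch of rack-level replica placement (illustrative): pick a replica
# target in the SAME rack as the writer, so replication traffic stays on
# the under-utilized top-of-the-rack switch instead of the shared core.

racks = {
    "rack-1": ["n1", "n2", "n3"],
    "rack-2": ["n4", "n5", "n6"],
}

def same_rack_target(writer):
    for members in racks.values():
        if writer in members:
            candidates = [n for n in members if n != writer]
            # Simplest choice; a real policy might balance load or disk usage.
            return candidates[0]
    raise ValueError("unknown node: " + writer)

print(same_rack_target("n2"))  # -> 'n1'
```

The trade-off accepted here is that a whole-rack failure loses both copies, which the slide's Google statistic (20~30 mostly planned rack failures per year) argues is rare enough.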

49 Three Hypotheses (Validation)
1. Asynchronous replication can help, but still can’t eliminate the interference.
2. Rack-level replication can reduce the interference significantly.
3. Data selection can help.

50 Locally-Consumed Data Replication
It significantly reduces the amount of replication.

51 Three Hypotheses (Validation)
1. Asynchronous replication can help, but still can’t eliminate the interference.
2. Rack-level replication can reduce the interference significantly.
3. Data selection can reduce the interference significantly.

52 ISS Design Overview
Implements asynchronous rack-level selective replication (all three hypotheses)
Replaces the Shuffle phase
◦ MapReduce does not implement Shuffle.
◦ Map tasks write intermediate data to ISS, and Reduce tasks read intermediate data from ISS.
Extends HDFS (next)

53 ISS Design Overview
Extends HDFS
◦ iss_create()
◦ iss_open()
◦ iss_write()
◦ iss_read()
◦ iss_close()
Map tasks
◦ iss_create() => iss_write() => iss_close()
Reduce tasks
◦ iss_open() => iss_read() => iss_close()
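The call sequences above can be exercised with an in-memory stub. The `iss_*` names come from the slide, but the bodies here are simplified stand-ins (a dict instead of HDFS-backed, replicated storage), so this shows only the calling convention, not the real implementation.

```python
# In-memory stub of the ISS call sequence (names from the slide; bodies are
# simplified stand-ins, not the real HDFS-extending implementation).

_store = {}   # block name -> bytes; stands in for replicated storage

def iss_create(name):
    _store[name] = b""
    return name          # the returned handle is just the name here

def iss_open(name):
    return name

def iss_write(handle, data):
    _store[handle] += data

def iss_read(handle):
    return _store[handle]

def iss_close(handle):
    pass                 # real ISS would drive asynchronous replication here

# Map task: iss_create() => iss_write() => iss_close()
h = iss_create("map-0.part-3")
iss_write(h, b"intermediate bytes")
iss_close(h)

# Reduce task: iss_open() => iss_read() => iss_close()
h = iss_open("map-0.part-3")
print(iss_read(h))  # -> b'intermediate bytes'
iss_close(h)
```

Because the interface mirrors ordinary file operations, swapping Shuffle for ISS reads and writes leaves the Map and Reduce task logic largely unchanged.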

54 Hadoop + Failure
75% slowdown compared to no-failure Hadoop

55 Hadoop + ISS + Failure (Emulated)
10% slowdown compared to no-failure Hadoop
◦ Speculative execution can leverage ISS.

56 Replication Completion Time
Replication completes before Reduce
◦ ‘+’ indicates replication time for each block

57 Summary
Our position
◦ Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
Problem: cascaded re-execution
Requirements
◦ Intermediate data availability (scale and dynamism)
◦ Interference minimization (efficiency)
Findings
◦ Asynchronous replication can help, but still can’t eliminate the interference.
◦ Rack-level replication can reduce the interference significantly.
◦ Data selection can reduce the interference significantly.
◦ Hadoop and Hadoop + ISS show comparable completion times.

58 Thesis Overview
[Diagram: same as slide 14: the four infrastructures annotated with Moara & ISS, Worker-Centric Scheduling, and MPIL]

59 On-Demand Scheduling
One-line summary
◦ On-demand scheduling of Grid tasks for data-intensive applications
Scale
◦ # of tasks, # of CPUs, the amount of data
Dynamism
◦ Resource availability (CPU, memory, disk, etc.)
Worker-centric scheduling
◦ The worker’s availability is the first-class criterion in scheduling tasks.

60 On-Demand Scheduling Taxonomy
[Chart: scale vs. dynamism, placing static scheduling, task-centric scheduling, and worker-centric scheduling]

61 On-Demand Key/Value Store
One-line summary
◦ On-demand key/value lookup algorithm for dynamic environments
Scale
◦ # of peers
Dynamism
◦ Short-term unresponsiveness
MPIL
◦ A DHT-style lookup algorithm that can operate over any topology and is resistant to perturbation

62 On-Demand Key/Value Store Taxonomy
[Chart: scale vs. dynamism, placing Napster-style central directories, DHTs, and MPIL]

63 Conclusion
On-demand operations can be implemented efficiently in spite of scale and dynamism in a variety of distributed systems.
Each on-demand operation has a different need, but a common set of challenges.

64 References
[Lia05] J. Liang, S. Y. Ko, I. Gupta, and K. Nahrstedt. MON: On-demand Overlays for Distributed System Management. In USENIX WORLDS, 2005.
[Pla] PlanetLab.
[Ros05] A. L. Rosenberg and M. Yurkewych. Guidelines for Scheduling Some Common Computation-Dags for Internet-Based Computing. IEEE TC, 54(4), April 2005.
[Ros04] A. L. Rosenberg. On Scheduling Mesh-Structured Computations for Internet-Based Computing. IEEE TC, 53(9), September 2004.
[Vis04] S. Viswanathan, B. Veeravalli, D. Yu, and T. G. Robertazzi. Design and Analysis of a Dynamic Scheduling Strategy with Resource Estimation for Large-Scale Grid Systems. In IEEE/ACM GRID, 2004.
[Bir02] K. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal Multicast. ACM TOCS, 17(2):41–88.
[Gup02] I. Gupta, A.-M. Kermarrec, and A. J. Ganesh. Efficient Epidemic-Style Protocols for Reliable and Scalable Multicast. In SRDS, 2002.
[Ker01] A.-M. Kermarrec, L. Massoulié, and A. J. Ganesh. Probabilistic Reliable Dissemination in Large-Scale Systems. IEEE TPDS, 14:248–258.
[Vog99] W. Vogels. File System Usage in Windows NT 4.0. In SOSP, 1999.
[Bak91] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a Distributed File System. SIGOPS OSR, 25(5), 1991.
[Ven02] A. Venkataramani, R. Kokku, and M. Dahlin. TCP Nice: A Mechanism for Background Transfers. In OSDI, 2002.
[Kuz06] A. Kuzmanovic and E. W. Knightly. TCP-LP: Low-Priority Service via End-Point Congestion Control. IEEE/ACM TON, 14(4):739–752, 2006.

65 Backup Slides

66 Hadoop Experiment
Emulab setting
◦ 20 machines sorting 36GB
◦ 4 LANs and a core switch (all 100 Mbps)
Normal execution: Map, Shuffle, Reduce

67 Worker-Centric Scheduling
One-line summary
◦ On-demand scheduling of Grid tasks for data-intensive applications
Background
◦ Data-intensive Grid applications are divided into chunks.
◦ Data-intensive applications access a large set of files; file transfer time is a bottleneck (from Coadd experience).
◦ Different tasks access many files together (locality).
Problem
◦ How to schedule tasks in a data-intensive application that accesses a large set of files?

68 Worker-Centric Scheduling
Solution: on-demand scheduling based on worker availability and file reuse
◦ Task-centric vs. worker-centric: consideration of a worker’s availability for execution
◦ We have proposed a series of worker-centric scheduling heuristics that minimize file transfers.
Results
◦ Tested our heuristics with a real trace
◦ Compared them with state-of-the-art task-centric scheduling
◦ Showed task completion time reductions of up to 20%
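The core idea of such a heuristic can be sketched in a few lines: when a worker becomes available, assign it the task that reuses the most files already present on that worker, so the fewest new transfers are needed. This is a simplified illustration of the file-reuse principle, not any specific heuristic from the thesis.

```python
# Sketch of a worker-centric, file-reuse heuristic (illustrative only):
# the available worker pulls the task that needs the fewest NEW file
# transfers given the files the worker already holds.

def pick_task(worker_files, tasks):
    """worker_files: set of files cached on the worker.
    tasks: dict of task_id -> set of files the task needs."""
    def transfers_needed(tid):
        return len(tasks[tid] - worker_files)
    return min(tasks, key=transfers_needed)

worker_files = {"f1", "f2"}
tasks = {
    "t1": {"f1", "f2", "f3"},   # needs 1 new transfer
    "t2": {"f4", "f5"},         # needs 2 new transfers
}
print(pick_task(worker_files, tasks))  # -> 't1'
```

The contrast with task-centric scheduling is the driver: here the worker's availability triggers the decision, rather than a task queue pushing work at workers regardless of what they have cached.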

69 MPIL (Multi-Path Insertion/Lookup)
One-line summary
◦ On-demand key/value lookup algorithm for dynamic environments
Background
◦ Various DHTs (Distributed Hash Tables) have been proposed.
◦ Real-world trace studies show churn is a threat to DHTs’ performance.
◦ DHTs employ aggressive maintenance algorithms to combat churn: low lookup cost, but high maintenance cost.

70 MPIL (Multi-Path Insertion/Lookup)
Alternative: MPIL
◦ A new DHT routing algorithm
◦ Low maintenance cost, but slightly higher lookup cost
Features
◦ Uses multi-path routing to combat dynamism (perturbation in the environment)
◦ Runs over any overlay: no need for a maintenance algorithm
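The multi-path idea can be shown with a toy lookup over a tiny overlay: when one path crosses a short-term-unresponsive peer, the lookup still succeeds along another path. This is a deliberately simplified illustration of multi-path tolerance, not MPIL's actual routing algorithm; it assumes a small acyclic overlay (a real overlay would need cycle detection and bounded fan-out).

```python
# Toy multi-path lookup (illustrative, NOT MPIL's routing): try several
# neighbor paths so one unresponsive peer cannot kill the whole lookup.
# Assumes an acyclic toy overlay; a real algorithm bounds paths and fan-out.

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}
unresponsive = {"B"}            # dynamism: a short-term-unresponsive peer
store = {"D": {"k": "v"}}       # the key lives on peer D

def lookup(start, key):
    if start in unresponsive:
        return None             # this path fails silently
    if key in store.get(start, {}):
        return store[start][key]
    for nxt in graph[start]:    # fall through to the next path on failure
        result = lookup(nxt, key)
        if result is not None:
            return result
    return None

print(lookup("A", "k"))  # -> 'v'  (succeeds via A -> C -> D although B is down)
```

The extra paths are what trade a slightly higher lookup cost for freedom from aggressive overlay maintenance.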

