The Case for Drill-Ready Cloud Computing (Vision Paper). Tanakorn Leesatapornwongsa and Haryadi S. Gunawi


1 The Case for Drill-Ready Cloud Computing (Vision Paper)
Tanakorn Leesatapornwongsa and Haryadi S. Gunawi

2 Cloud Services
- Cheap
- Convenient
- Reliable

3 Yahoo Mail Disruption
- Hardware failures
- Wrong failover
- Disruptions
  - Some users could not access the service
  - Some users saw wrong notifications
  - Recovery took several days

4 Outlook Disruption
- Hardware failures in a caching server
- Failover to the backend servers worked correctly
- Requests then flooded the backend servers
- The service went down
- Microsoft needed to change its software infrastructure

5 Cloud Outages
Outage | Root event | Supposedly tolerable failure | Incorrect recovery | Major outage
Amazon EBS | Network misconfig | Network partition | Re-mirroring storm | Clusters collapsed
Gmail | Upgrade event | Servers offline | Bad request routing | All routing servers down
App Engine | Power failure | 25% of machines offline | Bad failover | All user apps were degraded
Skype | Overload | 30% of nodes failed | Positive feedback loop | Almost all nodes failed
Google Drive | Network bug | Network offline | Timeout during failover | 33% of requests affected
Outlook | Caching failure | Failover to backend | Request flooding | 7-hour outage
Yahoo Mail | Hardware failures | Servers offline | Buggy failover | 1% of users affected

6 Journey of Cloud Dependability Research

7 Fault-Tolerant Systems
- Complex failures are hard to handle and implement correctly
- Recovery protocols are very complex
- Recovery code is one of the buggiest parts

8 Offline Testing
- Thoroughly verify the recovery mechanism

9 Offline Testing
- Thoroughly verify the recovery mechanism
- Fault injection, model checking, stress testing, etc.
- A “mini cluster” that represents production runs
- Testing and production environments differ
  - Cluster, workload, failures
[Figure: a mini cluster running a test workload vs. a production run with the real workload]

10 Offline Testing
- Thoroughly verify the recovery mechanism
- Fault injection, model checking, stress testing, etc.
- A “mini cluster” that represents production runs
- Testing and production environments differ
  - Cluster, workload, failures
- Orders of magnitude different in scale
  - Facebook used 100 machines to mimic a 3,000-machine production run (2011)
- Small start-ups forgo even this luxury
  - Many tests are much smaller than this

11 Diagnosis
- Helps administrators pinpoint and reproduce the causes of outages
- BUT
  - Post-mortem: it does not prevent disruptions
  - Passive: it waits for outages to happen before diagnosing them

12 Online Testing and Failure Drills
- “Inject failures online”
- Users outnumber testers
- Real, deep scenarios
[Figure: customers send requests to the live service while administrators test it]

13 A Missing Piece
Employee: "Boss, let's inject failures online using Chaos Monkey."
Boss: "Hmm …"
"Dear beloved customers, thank you for trusting our services, but we accidentally lost your data because of the failure drills that we ran..."

14 Future of Failure Drill
- Current drill: a team of engineers standing by
- Drill-ready clouds

15 Drill-Ready Cloud Computing
- Automatic failure drill and automatic cancellation
- Done in a safe, efficient, and easy manner
- Ideally, no engineering effort required

16 Drill-Ready Cloud Computing
- The administrator hands a drill spec to the drill-ready system, which enters drill mode
- Example drill spec: kill 25%; if it disrupts, revert back
- Drill-ready cloud computing: systems take care of failure injection and cancellation
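
The "kill 25%, revert if it disrupts" flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_drill`, its parameters, and the `is_disrupted` callback are hypothetical names, and the "kill" is a bookkeeping mark rather than a real failure injection.

```python
import math
import random

def run_drill(nodes, kill_fraction, is_disrupted, max_checks, rng=None):
    """Mark a fraction of nodes as failed; revert if the service is disrupted.

    is_disrupted(alive_nodes) is a hypothetical health probe supplied by the caller.
    """
    rng = rng or random.Random()
    victim_count = math.floor(len(nodes) * kill_fraction)
    victims = set(rng.sample(sorted(nodes), victim_count))
    alive = set(nodes) - victims
    for _ in range(max_checks):
        if is_disrupted(alive):
            # Cancellation: no node stays marked as failed.
            return {"outcome": "reverted", "victims": set()}
    return {"outcome": "completed", "victims": victims}
```

With eight nodes and `kill_fraction=0.25`, two nodes are marked as failed; a probe that reports disruption causes the drill to return with an empty victim set.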

17 Outline
- Safety
- Efficiency
- Usability
- Generality

18 Safety
- Learn about failure implications without suffering through them
- Learn whether data can be lost
  - But do not lose the data
- Learn whether the SLA can be violated
  - But do not violate it for a long time

19 Safety Solutions
- Normal and drill states
[Figure: a system that is not drill-aware]

20 Safety Solutions
- Normal and drill states
  - “Maintaining 2 states”: a normal topology and a drill topology
  - Revert back to the normal state easily
- Normal and drill states are the first and most important thing for drill-ready clouds
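
One way to picture "maintaining 2 states" is a topology object that keeps the normal membership untouched and layers the drill's simulated failures on top. The sketch below is an assumption-laden illustration, not the authors' design; the class `DrillAwareTopology` and its methods are hypothetical.

```python
class DrillAwareTopology:
    """Keeps the normal topology intact and overlays drill failures on it."""

    def __init__(self, nodes):
        self._normal = frozenset(nodes)  # never mutated by a drill
        self._drill_failed = None        # None means no drill is active

    def start_drill(self, failed_nodes):
        failed = frozenset(failed_nodes)
        unknown = failed - self._normal
        if unknown:
            raise ValueError(f"not in the normal topology: {sorted(unknown)}")
        self._drill_failed = failed

    def cancel_drill(self):
        # Reverting is just dropping the overlay; the normal state was never changed.
        self._drill_failed = None

    @property
    def in_drill(self):
        return self._drill_failed is not None

    def view(self):
        """Topology as the system should currently see it."""
        if self._drill_failed is None:
            return self._normal
        return self._normal - self._drill_failed
```

Because the drill state is an overlay, cancellation costs one assignment, which is what makes reverting "easy" in this sketch.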

21 Safety Solutions
- Drill state isolation
- Self cancellation
  - Real failures can happen during the drill
  - Drill master and drill agents
  - The drill master commands the agents
  - What if there is a network partition? Agents are left in a limbo state
  - Agents cancel the drill themselves when they cannot contact the master
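
Self cancellation can be sketched as a timeout on the agent side: if the agent has not heard from the master for too long while in drill mode, it leaves drill mode on its own. The class name, message strings, and timeout policy below are assumptions made for illustration only.

```python
import time

class DrillAgent:
    """Agent that leaves drill mode if the drill master goes silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock        # injectable so the logic can be tested without sleeping
        self.in_drill = False
        self.last_heard = None

    def on_master_message(self, command):
        self.last_heard = self.clock()
        if command == "start":
            self.in_drill = True
        elif command == "cancel":
            self.in_drill = False
        # Any other message (e.g. a heartbeat) only refreshes last_heard.

    def tick(self):
        """Call periodically. Returns True if the agent cancelled the drill itself."""
        if self.in_drill and self.clock() - self.last_heard > self.timeout_s:
            self.in_drill = False
            return True
        return False
```

The agent never needs to distinguish a crashed master from a network partition: in both cases silence past the timeout ends the limbo state.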

22 Safety Solutions
- Drill state isolation
- Self cancellation
- Safe drill specification
  - A drill spec states: what failures? How long? Cancellation conditions, etc.
  - Example: kill 25%; if the SLA is violated, revert back
  - Check whether the specification can run safely
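
A check of "whether the specification can run safely" might start with simple structural rules. The rules below (a bounded kill fraction, a bounded duration, at least one cancellation condition) and the names `DrillSpec`, `unsafe_reasons`, and `max_kill_fraction` are assumptions for illustration; the slides do not define what the real check would contain.

```python
from dataclasses import dataclass, field

@dataclass
class DrillSpec:
    kill_fraction: float                 # what failures: share of machines to kill
    max_duration_s: float                # how long the drill may run
    cancel_conditions: list = field(default_factory=list)

def unsafe_reasons(spec, max_kill_fraction=0.5):
    """Return a list of reasons the spec should not run; empty means it passed."""
    reasons = []
    if not 0 < spec.kill_fraction <= max_kill_fraction:
        reasons.append("kill fraction outside the allowed range")
    if not 0 < spec.max_duration_s < float("inf"):
        reasons.append("drill duration is not bounded")
    if not spec.cancel_conditions:
        reasons.append("no cancellation condition given")
    return reasons
```

For example, `unsafe_reasons(DrillSpec(0.25, 600, ["SLA violated"]))` returns an empty list, while a spec with no cancellation condition is rejected before any failure is injected.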

23 Efficiency
- Failures trigger data migration
- Monetary cost
  - Bandwidth
  - Storage space
- System performance
  - Affects users

24 Efficiency Solutions
- Low-overhead drill setup and cleanup
  - Do we need to do a real key re-balance?
  - It depends on the objective of the test
  - Yes, if we want to see the impact of background re-balancing
[Figure: six nodes hold key ranges [1-10], [11-20], [21-30], [31-40], [41-50], [51-60]; ranges [11-20] and [41-50] are split into [11-15], [16-20], [41-45], [46-50] and moved; data is read and written to check whether the SLA is okay]

25 Efficiency Solutions
- Low-overhead drill setup and cleanup
  - Do we need to do a real key re-balance?
  - It depends on the objective of the test
  - No, if we want to measure performance when we lose 2 nodes
[Figure: four remaining nodes own key ranges [1-15], [16-30], [31-45], [46-60]; reads and writes to [46-60] check whether the SLA is okay; no key [11]]
- Low-overhead setup and cleanup: the cost depends on the drill objectives, so the drill objectives must be part of the drill specification
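
The key ranges in the two figures are consistent with splitting each lost node's range between its surviving neighbors. A metadata-only version of that reassignment, which changes ownership without copying any data, could look like the sketch below. The function name and the midpoint split rule are assumptions inferred from the figure, not a description of any particular system.

```python
def reassign_lost_ranges(ranges, lost):
    """Recompute range ownership after losing nodes, without moving data.

    ranges: sorted, contiguous list of inclusive (lo, hi) key ranges, one per node.
    lost:   set of indices into ranges for the nodes that are (virtually) killed.
    """
    survivors = [list(r) for i, r in enumerate(ranges) if i not in lost]
    if not survivors:
        raise ValueError("at least one node must survive")
    # Lost ranges at either end go entirely to the nearest survivor.
    survivors[0][0] = ranges[0][0]
    survivors[-1][1] = ranges[-1][1]
    # Each gap between neighboring survivors is split at its midpoint.
    for left, right in zip(survivors, survivors[1:]):
        gap_lo, gap_hi = left[1] + 1, right[0] - 1
        if gap_lo <= gap_hi:
            mid = (gap_lo + gap_hi) // 2
            left[1] = mid
            right[0] = mid + 1
    return [tuple(r) for r in survivors]

six_nodes = [(1, 10), (11, 20), (21, 30), (31, 40), (41, 50), (51, 60)]
print(reassign_lost_ranges(six_nodes, {1, 4}))
# [(1, 15), (16, 30), (31, 45), (46, 60)]
```

A drill whose objective is only to measure performance with two nodes gone could stop here; a drill that wants to observe background re-balancing would additionally have to move the keys.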

26 Efficiency Solutions
- Low-overhead drill setup and cleanup
- Cheap drill specification
  - Smarter and cheaper drill specifications
  - If replication is 50% correct, assume that the rest is correct
  - Stop halfway and report success
[Figure: replication progress status]
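
The "stop halfway and report success" idea amounts to an early exit once enough of the recovery has been verified. Below is a minimal sketch, assuming recovery progress arrives as a stream of per-replica verification results; `drill_verdict` and its return values are hypothetical.

```python
import math

def drill_verdict(replica_checks, total, threshold=0.5):
    """Consume verification results until the threshold is met or a check fails.

    replica_checks: iterable of booleans, one per re-replicated item as it is verified.
    total:          number of items the recovery has to re-replicate.
    """
    needed = math.ceil(total * threshold)
    verified = 0
    for ok in replica_checks:
        if not ok:
            return "failure", verified
        verified += 1
        if verified >= needed:
            # Enough evidence: stop the drill early and skip the remaining work.
            return "success", verified
    return "incomplete", verified
```

Since the function stops reading as soon as the threshold is reached, the remaining re-replication never has to happen, which is where the bandwidth and storage savings would come from.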

27 Usability Solutions
- Declarative drill specification language
  - Describes results
  - Easy to read and write
- Example drill specification:
    During peak load
      Kill 5% machines
    If SLA violated > 1 mins
      Cancel the drill
    If recovery is 50% good
      Stop the drill
      Report success
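
To make the example specification concrete, it can be written as plain data and paired with a small evaluator that decides what the drill should do next. Everything in this sketch is an assumption: the slides do not define a syntax, so the keys, the metric names, and the choice to let cancellation take priority over success are all illustrative.

```python
# Hypothetical data form of the example specification on the slide.
DRILL_SPEC = {
    "when": "peak_load",
    "inject": {"kill_fraction": 0.05},
    "cancel_if": {"sla_violated_for_s": 60},    # "If SLA violated > 1 mins"
    "succeed_if": {"recovery_fraction": 0.5},   # "If recovery is 50% good"
}

def next_action(spec, metrics):
    """Decide the next step of a running drill from the current metrics."""
    if metrics["sla_violated_for_s"] > spec["cancel_if"]["sla_violated_for_s"]:
        return "cancel"
    if metrics["recovery_fraction"] >= spec["succeed_if"]["recovery_fraction"]:
        return "stop_and_report_success"
    return "continue"
```

The spec only states outcomes and conditions; how failures are injected and reverted is left to the system, which is the point of keeping the language declarative.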

28 Generality Solutions
- Elasticity drill
- Configuration change drill
- Software upgrade drill
- Security attack drill

29 Conclusion
- Drill-ready cloud computing
  - A new reliability paradigm
- Sketching a first draft
- We want your FEEDBACK

30 Thank You

