
Local Touch – Global Reach Avoiding the Chaos Monkey Brent Stineman – National Cloud Solution Specialist.


1 Local Touch – Global Reach Avoiding the Chaos Monkey Brent Stineman – National Cloud Solution Specialist

2 Your Moderator: Microsoft MVP for the Windows Azure Platform

3 Chaos Monkey? Hardware fails. Software has bugs. People make mistakes.

4 What is an SLA? A negotiated agreement or contract. Defines service availability/accessibility. Penalties for violation. Not a guarantee! What we really want: availability, not promises; protection from loss of revenue.

5 What are we looking for? Protection from: hardware failures; data corruption (malicious and accidental); failure of network; loss of facilities. Accessible vs. available: reachable by clients; degraded performance/function. See for more:

6 What we’re trying to achieve

7 How do we create resilient systems?

8 Assume everything will fail. Common points of failure: machine/application crashes; throttling (exceeding capacity); connectivity/network; external service dependencies.

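9 Try/catch != Resilient

    import java.io.File;
    import java.io.IOException;

    public class CreateFileExample {
        public static void main(String[] args) {
            String filename = "/nosuchdir/myfilename";
            try {
                // Try to create the file; this fails because the parent directory does not exist
                new File(filename).createNewFile();
            } catch (IOException e) {
                // Report the exception that occurred; nothing here makes the operation succeed
                System.out.println("Unable to create " + filename + ": " + e.getMessage());
            }
        }
    }

This addresses the symptom; it does not resolve the underlying problem.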

10 Internal buffering. Retry policies: wait and try again; queue until available. Go asynchronous: increase capacity, if you’re willing to wait; queue semantics.
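The "wait and try again" idea on this slide could be sketched roughly as follows. This is a minimal illustration, not code from the talk: the `RetryPolicy` class, its parameters, and the choice to double the wait after each failure are all assumptions.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: names and defaults are illustrative, not from the talk.
final class RetryPolicy {
    private final int maxAttempts;
    private final long initialDelayMillis;

    RetryPolicy(int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
        this.initialDelayMillis = initialDelayMillis;
    }

    // Runs the operation, waiting and trying again after each failure.
    // The wait doubles after every failed attempt (an assumed backoff choice).
    <T> T execute(Callable<T> operation) throws Exception {
        long delay = initialDelayMillis;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        // Every attempt failed; surface the last error to the caller.
        throw lastFailure;
    }
}
```

A caller might wrap a flaky dependency as `new RetryPolicy(3, 100).execute(() -> callRemoteService())`, where `callRemoteService` is a placeholder for whatever operation may fail transiently. Retrying only helps with transient faults; a permanently missing resource still fails after the last attempt.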

11 Degrade, but don’t fail. “Due to higher than average volumes, processing of your request may be delayed.” 404/503 error vs. placeholder content. Try, try, and try yet again.
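One way to read "placeholder content" instead of a 404/503 is a fallback wrapper around the call that may fail. The sketch below assumes a hypothetical `Degrade.withFallback` helper; it is not taken from the talk.

```java
import java.util.function.Supplier;

// Hypothetical helper; the names are illustrative, not from the talk.
final class Degrade {
    // Returns the primary content, or the placeholder if the primary source fails.
    static <T> T withFallback(Supplier<T> primary, T placeholder) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // A real system would also log the failure or raise an alert here.
            return placeholder;
        }
    }
}
```

For example, a page could render `Degrade.withFallback(() -> loadRecommendations(), "Recommendations are temporarily unavailable.")`, where `loadRecommendations` stands in for any non-essential dependency. The user sees reduced function rather than an error page.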

12 Virtualization and Automation. Virtualization: provides greater flexibility to move workloads. Automation: reduces ‘mean time to recovery’. Don’t forget the human factor!

13 The “HI” Point. Animation from TechEd NA 2012, “Windows Azure Internals” by Mark Russinovich.

14 Dept. of Redundancy Dept. Have a backup, somewhere else. More than one? Cost-to-benefit ratio? Ready state: Hot = full capacity; Warm = scaled down, but ready to grow; Cold = mothballed, starts from zero.

15 It’s about probability
95% uptime per box, assuming the boxes fail independently:
1 box: 5% downtime, or 438 hrs per year (that’s 18 ¼ days!)
2 boxes: 5/100 * 5/100 = 25/10,000 = 0.25% downtime, or 22 hrs per year
4 boxes: 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime, or 3.285 MINUTES per year
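The arithmetic on this slide can be checked with a few lines of code. The sketch below assumes the boxes fail independently and that a year has 365 days; the `Downtime` class and its method names are made up for illustration.

```java
// Hypothetical helper for checking the slide's arithmetic; not from the talk.
final class Downtime {
    static final double HOURS_PER_YEAR = 365 * 24; // 8,760 hours

    // Probability that every box is down at the same time,
    // assuming each box fails independently of the others.
    static double combinedDowntime(double perBoxDowntime, int boxes) {
        return Math.pow(perBoxDowntime, boxes);
    }

    // Converts a downtime fraction (0.05 means 5%) into hours per year.
    static double hoursPerYear(double downtimeFraction) {
        return downtimeFraction * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (int boxes : new int[] {1, 2, 4}) {
            double fraction = combinedDowntime(0.05, boxes);
            System.out.printf("%d box(es): %.6f%% downtime, %.3f hours per year%n",
                    boxes, fraction * 100, hoursPerYear(fraction));
        }
    }
}
```

With 5% downtime per box, two boxes come out at roughly 21.9 hours per year (the slide rounds this to 22), and four boxes at roughly 0.055 hours, which is about 3.285 minutes. Correlated failures, such as a shared power feed or a bad deployment pushed to every box, would make the real number worse than this model suggests.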

16 N+1: Extra Capacity. Carry extra capacity to help even out spikes. If you fail over, service degrades but doesn’t fail completely. Buy time to react. Speed recovery.

17 Always carry a spare
75% capacity, half of our load: 50% more capacity than needed. Can absorb temporary spikes. Time to react if we need to add capacity.
100% of load, 150% capacity.
0% capacity, redirect all load: over-allocated, but still functioning. Degrade, but don’t fail.
SYSTEM FAILURE!!!
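The figures on this slide appear to describe two nodes, each sized at 75% of the total load, so the pair offers 150% capacity. Under that reading (an assumption, since the original diagram is not available), utilization before and after a failure can be sketched as below. `SpareCapacity` is a hypothetical name.

```java
// Hypothetical helper; the two-node reading of the slide is an assumption.
final class SpareCapacity {
    // Fraction of the surviving capacity consumed by the load after some nodes fail.
    // A value above 1.0 means the survivors are over-allocated.
    static double utilizationAfterFailure(int nodes, int failedNodes,
                                          double capacityPerNode, double totalLoad) {
        int survivors = nodes - failedNodes;
        if (survivors <= 0) {
            return Double.POSITIVE_INFINITY; // nothing left to serve the load
        }
        return totalLoad / (survivors * capacityPerNode);
    }
}
```

With two nodes of capacity 75 and a load of 100, utilization is about 0.67 while both are healthy, about 1.33 when one fails (over-allocated, so the service degrades), and unbounded when both fail.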

18 Controlled Chaos. “Best way to avoid failure is to fail constantly!” – John Ciancutti, Netflix. “An untested plan is just a hypothesis.” Via Twitter @BrentCodeMonkey

19 Detection: Seek out Issues. If you do not monitor for issues, how can you react when they happen? Be an active participant. Multiple notification channels. Leverage “runtime governance”. Raise the alarm before failures occur.
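Raising the alarm before failures occur usually means alerting on a warning threshold that sits below the point of failure. A minimal sketch, assuming a single utilization metric and hypothetical threshold values chosen by the operator:

```java
// Hypothetical helper; the thresholds and names are illustrative, not from the talk.
final class CapacityAlarm {
    enum Level { OK, WARNING, CRITICAL }

    // Classifies a utilization reading (0.0 to 1.0 and above) against two thresholds.
    // WARNING is meant to fire early enough that someone can react before CRITICAL.
    static Level check(double utilization, double warnAt, double criticalAt) {
        if (utilization >= criticalAt) {
            return Level.CRITICAL;
        }
        if (utilization >= warnAt) {
            return Level.WARNING;
        }
        return Level.OK;
    }
}
```

For example, `CapacityAlarm.check(0.75, 0.70, 0.90)` returns `WARNING`, which could be routed to more than one notification channel.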

20 Functional Transparency

21 Setting Expectations

22 Different Environments. Setting up the infrastructure isn’t easy. Each environment has unique needs. Build environments to meet needs. Reduce environmental factors: dependencies on hardware and system components.

23 Mean Time to Recovery. “We need to be back up within 5 minutes!” Don’t set an artificial limit… Total outage duration = Time to Detect + Time to Diagnose + Time to Decide + Time to Act.
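The outage formula on this slide is a plain sum, which makes it easy to see why a five-minute target can be unrealistic. The sketch below uses `java.time.Duration`; the `Outage` class and the sample figures in the usage note are assumptions for illustration.

```java
import java.time.Duration;

// Hypothetical helper; not from the talk.
final class Outage {
    // Total outage duration = time to detect + diagnose + decide + act.
    static Duration total(Duration detect, Duration diagnose, Duration decide, Duration act) {
        return detect.plus(diagnose).plus(decide).plus(act);
    }
}
```

With made-up values of 2 minutes to detect, 10 to diagnose, 5 to decide, and 3 to act, `Outage.total(...)` returns 20 minutes. Detection alone already consumes a large share of a five-minute budget.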

24 Change the SLA. Component based: “Our email server must have 99% uptime.” Directly dependent on components. Little business context, hard to articulate the value. Scenario based: “99% of our emails will be sent in 5 minutes or less.” Directly relates to business value, provides flexibility in achieving objectives.
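A scenario-based SLA like the email example can be measured directly from delivery times rather than from server uptime. The following is a minimal sketch; the `ScenarioSla` class, its method names, and the treatment of an empty sample as compliant are all assumptions.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical helper; not from the talk.
final class ScenarioSla {
    // Fraction of deliveries that completed within the limit.
    // An empty sample is treated as fully compliant (an assumed convention).
    static double fractionWithin(List<Duration> deliveryTimes, Duration limit) {
        if (deliveryTimes.isEmpty()) {
            return 1.0;
        }
        long onTime = deliveryTimes.stream()
                .filter(d -> d.compareTo(limit) <= 0)
                .count();
        return (double) onTime / deliveryTimes.size();
    }

    // True when at least the target fraction (0.99 for 99%) met the limit.
    static boolean meetsSla(List<Duration> deliveryTimes, Duration limit, double target) {
        return fractionWithin(deliveryTimes, limit) >= target;
    }
}
```

Calling `ScenarioSla.meetsSla(times, Duration.ofMinutes(5), 0.99)` answers the business question from the slide without reference to any particular server, which leaves room to meet the objective with queues, retries, or a second mail provider.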

25 Do, or do not! Your entire organization must be committed. This will take time. This will be expensive. You will still make mistakes; plan for and learn from them.

26 Contact Info. Microsoft MVP for the Windows Azure Platform. Twitter: @BrentCodeMonkey Web: Questions??

27 Thank you
