System, heal thy self (presentation transcript)
The major difference between a thing which might go wrong, and a thing which cannot possibly go wrong, is that, when a thing which cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair. - unattributed, posted in a Microsoft conference room
Failure is always an option…
- At (cloud) scale…
  - Hardware fails just as much as software, and often more
  - Build world-class server hardware with an MTBF of 30 years, then buy 10,000 of these servers. Watch roughly one fail each day
  - The internet is not “five nines” (99.999%). At most it’s two nines (99%) in any given location (very geography dependent)
    - Expect it to be down for a minimum of roughly 4 (whole) days out of the year (1% of 365 days is 3.65)
  - It’s impossible for humans to monitor, detect and react to issues at scale
    - Ops : server ratio is 1:50 – 1:150
    - Google has ~1M servers == 6,500+ ops?
      - http://serverfault.com/questions/77374/whats-the-server-to-admin-ratio-at-your-workplace
      - http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/
- When a system gets large/complex enough, failure is not an option; it’s a fact of everyday life
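The arithmetic behind these bullets can be checked with a few lines of Python. This is a minimal sketch, assuming independent failures at a constant rate and a 365-day year; the function names are illustrative and not from the slides.

```python
import math

DAYS_PER_YEAR = 365


def expected_failures_per_day(fleet_size: int, mtbf_years: float) -> float:
    """Expected failures per day across a fleet.

    Assumes each server fails independently at a constant rate of
    one failure per `mtbf_years` years.
    """
    return fleet_size / (mtbf_years * DAYS_PER_YEAR)


def downtime_days_per_year(availability: float) -> float:
    """Days per year spent down at a given availability (0.99 = two nines)."""
    return (1.0 - availability) * DAYS_PER_YEAR


def ops_headcount(servers: int, servers_per_admin: int) -> int:
    """Admins needed at a given servers-per-admin ratio, rounded up."""
    return math.ceil(servers / servers_per_admin)


if __name__ == "__main__":
    print(f"failures/day: {expected_failures_per_day(10_000, 30):.2f}")
    print(f"two nines, days down/year: {downtime_days_per_year(0.99):.2f}")
    print(f"ops at 1:150 for 1M servers: {ops_headcount(1_000_000, 150)}")
```

With 10,000 servers and a 30-year MTBF this gives about 0.91 failures per day, and two nines works out to 3.65 days of downtime per year, which is where the "one a day" and "4 days" figures come from.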
Livesite first
- Traditional way of reacting to livesite issues/outages
  - Page support staff, open support tickets for technicians, email programmers
  - This is way too slow!
- Service-level (network-based) failover is common
  - Load balancers, GTM routing, hot-hot BCP
- Software failover is less common
  - Code has much more context on whether something is “wrong” or not
- Let your DEV team be the ops
  - Have them feel the pain!
  - Often, they are the people who will ultimately have to fix the problem anyway
- Most ops activities should be scripted/coded, which makes fat-finger mistakes less likely
Ain’t nothing like the real thing…
- How much of any given codebase is validation/verification and error handling?
- How much testing is performed on a system under real-world conditions/load?
  - How easy is this even to simulate?
- How much (time/money) are we spending on QA?
- How many servers/network/etc. for test and staging vs. production?
  - How often is it fully utilized?
- What if we rolled the whole lot together…
  - TiP – Testing in Production
- …and didn’t worry about the spectre of “quality”
  - Let bugs run free!
Towards Nirvana
- How do you get there, and how much does this cost?
  - This isn’t a small undertaking: it took MSFT 12+ years to get here, through many guises and lessons learned
    - BRS’s, SPoF’s, wiring “oddities”
  - The feedback/learning cycle is getting shorter though: Facebook has only been around for 8 years
- You need a service-oriented culture
  - It’s not about shipping software
- You need team(s) focused on the “platform” and/or “fabric”
  - These are absolutely critical and not cheap to build
- Buy vs. build?
  - IaaS, PaaS, SaaS
  - Azure, AppEngine, AWS
[Diagram labels: Deploy, Monitoring, Rules, Service, Routing, Failure Domain]
System, heal thy self
- Don’t sweat the hard stuff
  - Allow things to crash: let the system pick up the pieces
  - Encourage things to crash: chaos monkey
- Simplicity in programming
  - Shared and hardened services are key: take as much thought/effort out of individual developers’ hands as possible
- Not everything is equally important
  - Maslow's hierarchy of needs: acceptable losses!
- Automate as many repair actions as possible
  - Humans can’t be around all the time and can be slow to react (which is sometimes a good thing)
  - Tier 3 isn't 24/7, and for the major things that happen at 2:00am you just know those are the people you need
DoS/Bot/Load protection in Bing
- We are more concerned about protecting the “Good Guys” (our carbon-based users) than we are about blocking the “Bad Guys” (‘synthetic’ traffic)
- Some amount of synthetic traffic is OK
  - We have agreements in place to be scraped
  - We scrape our own content
Crash Protection Service – Watson for services
- If a query is “expensive” or causes crash(es)
  - Cache and/or block the request
  - “Bucket” crashes/errors and turn off features/flights if there’s a pattern
- Can help with complex scenarios
  - 80/20 rule: is a small set of bugs responsible for most issues?
  - Gather data on bugs with a large cause-effect chasm
  - Catch (and respond to) things not seen before
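The block-and-bucket idea on this slide might look roughly like the following. This is a toy sketch, not Bing's crash protection service: `CrashGuard`, `FeatureCrash`, the thresholds, and the string return values are all hypothetical names and choices made for illustration.

```python
from collections import Counter


class FeatureCrash(Exception):
    """Hypothetical exception that names the feature that crashed."""

    def __init__(self, feature):
        super().__init__(f"crash in feature {feature!r}")
        self.feature = feature


class CrashGuard:
    """Wraps a request handler, buckets its failures, and reacts to patterns."""

    def __init__(self, handler, query_threshold=2, feature_threshold=3):
        self.handler = handler
        self.query_threshold = query_threshold      # crashes before a query is blocked
        self.feature_threshold = feature_threshold  # crashes before a feature is disabled
        self.query_crashes = Counter()
        self.buckets = Counter()  # (exception type, feature) -> count
        self.blocked_queries = set()
        self.disabled_features = set()

    def serve(self, query, features):
        if query in self.blocked_queries:
            return "blocked"
        active = [f for f in features if f not in self.disabled_features]
        try:
            return self.handler(query, active)
        except Exception as exc:  # a real service would be more selective
            self._record(query, exc)
            return "error"

    def _record(self, query, exc):
        # Block queries that crash repeatedly.
        self.query_crashes[query] += 1
        if self.query_crashes[query] >= self.query_threshold:
            self.blocked_queries.add(query)
        # Bucket by exception type and feature; disable a feature on a pattern.
        feature = getattr(exc, "feature", None)
        bucket = (type(exc).__name__, feature)
        self.buckets[bucket] += 1
        if feature is not None and self.buckets[bucket] >= self.feature_threshold:
            self.disabled_features.add(feature)
```

Once a feature is disabled, later requests are served without it, so users see a degraded result instead of a crash while the bucket data is used to find the underlying bug.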
Experimentation/Flighting
- Limit exposure, and impact, to a small subset of users
  - Some may not like it, some may really like it
  - If there are issues, can eject users from a flight (implicitly or explicitly), or stop the flight altogether
- Roll out changes gradually
  - Allow systems to “warm up”
  - Manage demand
  - Allow for roll-back: N+1 / N / N-1 versions
- A/B testing
  - Control group, treatment group. Look for differences
  - Let users define acceptance criteria: scorecard off some key metrics
  - In other complex systems this is common: the FDA does this
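One common way to implement this kind of flighting is deterministic hash bucketing. The sketch below assumes that approach; the slides do not say how Bing assigns users, and the `Flight` class and its method names are hypothetical.

```python
import hashlib


class Flight:
    """Assigns a stable percentage of users to a treatment group."""

    def __init__(self, name: str, percent: int):
        self.name = name
        self.ejected = set()
        self.stopped = False
        self.ramp(percent)

    def ramp(self, percent: int) -> None:
        """Set the share of users (0-100) exposed to the treatment."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent

    def _bucket(self, user_id: str) -> int:
        # Hash the flight name with the user id so that each flight
        # slices the user population differently. Result is in 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % 100

    def variant(self, user_id: str) -> str:
        if self.stopped or user_id in self.ejected:
            return "control"
        return "treatment" if self._bucket(user_id) < self.percent else "control"

    def eject(self, user_id: str) -> None:
        """Explicitly remove one user from the flight."""
        self.ejected.add(user_id)

    def stop(self) -> None:
        """Stop the flight altogether; everyone falls back to control."""
        self.stopped = True
```

Because a user's bucket never changes, ramping from 10% to 50% only adds users to the treatment group, which keeps the roll-out gradual and makes control vs. treatment comparisons stable.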
Resiliency through redundancy
- Assume failure
  - Disable faulty services/software
  - Roll back to a “known good” state
- WARNING: Computer pr0n ahead!
- Assume machines/services will crash
  - Have enough redundancy to continue to operate
Autopilot
[Diagram labels: Application Services (Frontdoor, Web Index, Other service; SU1, SU2, SU3, SU4); Core Autopilot Services (Collection Service, Cockpit, Watchdog, Device Manager, Provisioning Service, Repair Service, Deployment Service)]
Failure by design - Autopilot
- Using Autopilot means Bing has failure “designed in”
  - All systems/services are designed such that any instance can be killed unexpectedly without destabilizing the rest of the system
  - If service/machine(s) are failing, they will a) be restarted, then b) reimaged, then c) RMA’ed
  - Also roll back to the previous “known good” version
- Allows for simpler development
  - Don’t worry (too much) about failure cases / clean-up code
  - “Crash early, [crash often]”
  - Fork (“T-ed”) real traffic into pre-release and scale units during roll-out
    - Customers are helping us test our v-next product without knowing it
- Allows for some simple security management
  - Out of spec with current configuration? Reimage.
- More info at http://research.microsoft.com/pubs/64604/osr2007.pdf
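The restart, reimage, RMA escalation can be sketched as a small state machine. This is an illustration of the ladder described on the slide only, not Autopilot's actual repair service; `RepairService`, `report_failure`, and `report_healthy` are hypothetical names.

```python
from enum import Enum


class Repair(Enum):
    RESTART = "restart"
    REIMAGE = "reimage"
    RMA = "rma"


# Escalation order taken from the slide: restart, then reimage, then RMA.
LADDER = [Repair.RESTART, Repair.REIMAGE, Repair.RMA]


class RepairService:
    """Tracks how far each machine has climbed the repair ladder."""

    def __init__(self):
        self._next_step = {}  # machine name -> index of the next action to try

    def report_failure(self, machine: str) -> Repair:
        """Return the repair action to apply for this failure report."""
        step = self._next_step.get(machine, 0)
        action = LADDER[min(step, len(LADDER) - 1)]
        self._next_step[machine] = step + 1
        return action

    def report_healthy(self, machine: str) -> None:
        """A machine that recovers starts again from the cheapest repair."""
        self._next_step.pop(machine, None)
```

The cheapest action is always tried first, and no human has to decide when to escalate. A production system would presumably also rate-limit repairs so that a bad deployment cannot reimage a whole cluster at once; that part is left out here.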
Learning some painful lessons on the way
- Learning from previous mistakes: something the software industry really should understand by now.
Closing Guidance
- As software “experts”, we know abstraction is good.
  - Abstract failure away from developers: allow (most of) them to think that the environment they are writing code for is perfect
    - THEY DO THIS ANYWAY!
- The best BCP is no BCP
  - You have systems that are on, and are being paid for: make use of them.
- Aim towards Testing in Production
  - Monitoring, reliability, and QA become the same thing
  - Services have to harden against each other
- Let the system regulate itself
  - It’s way quicker at identifying issues, triaging problems, debugging, and performing repair actions
  - MTTR is more important than MTTF
- Granted, this isn’t easy, and there can be painful lessons
  - Humans are still needed: we’re not looking for a sentient system
  - This isn’t necessarily one-off, bespoke: if you build it, they will come
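One way to see why MTTR matters so much is the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The formula and the numbers below are not from the slides; they are an assumption used to illustrate the claim.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    baseline = availability(1000, 1.0)      # fails every 1000 h, 1 h to repair
    faster_repair = availability(1000, 0.5)  # same hardware, repairs twice as fast
    better_hardware = availability(2000, 1.0)  # twice as reliable, same repair time
    print(f"baseline:        {baseline:.6f}")
    print(f"halve MTTR:      {faster_repair:.6f}")
    print(f"double MTTF:     {better_hardware:.6f}")
```

Halving repair time buys the same availability as doubling the time between failures. At scale, where failures are a given, automated repair is usually the lever that is actually within reach.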