Software Testing Doesn't Scale
James Hamilton, Microsoft SQL Server

Slide 2: Overview
- The Problem:
  - S/W size & complexity are inevitable
  - Short cycles reduce S/W reliability
  - S/W testing is the real issue
  - Testing doesn't scale: trading complexity for quality
- Cluster-based solution:
  - The Inktomi lesson
  - Shared-nothing cluster architecture
  - Redundant data & metadata
  - Fault isolation domains

Slide 3: S/W Size & Complexity Inevitable
- Successful S/W products grow large
  - The # of features used by a given user is small
  - But the union of per-user feature sets is huge
- Reality of commodity, high-volume S/W: large feature sets
  - Same trend as consumer electronics
- Example mid-tier & server-side S/W stack:
  - SAP: ~47 mloc
  - DB: ~2 mloc
  - NT: ~50 mloc
- Testing all feature interactions is impossible (see the arithmetic below)
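To put numbers behind that last bullet, here is a back-of-envelope calculation (mine, not from the talk) of how interaction counts grow with feature count: pairwise interactions grow quadratically, and full feature-subset coverage grows exponentially.

```python
from math import comb

# Illustrative arithmetic: how many interactions a test matrix would
# need to cover at various feature counts.
for n in (10, 100, 1000):
    pairs = comb(n, 2)        # distinct feature pairs
    subsets = 2 ** n          # all on/off feature combinations
    print(f"{n:5d} features: {pairs:>9,} pairs, ~{subsets:.2e} subsets")
```

At 1000 features there are already roughly 10^301 feature subsets; no test budget touches that space, which is the sense in which exhaustive interaction testing is impossible.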

Slide 4: Short Cycles Reduce S/W Reliability
- Reliable TP systems typically evolve slowly & conservatively
- Modern ERP systems can go through 6+ minor revisions/year
- Many e-commerce sites change even faster
  - Fast revisions are a competitive advantage
- Current testing and release methodology:
  - As much testing time as dev time
  - Significant additional beta-cycle time
- An unacceptable choice: reliable but slow-evolving, or fast-changing yet unstable and brittle

Slide 5: Testing the Real Issue
- 15 years ago, test teams were a tiny fraction of the dev group
- Now test teams are of similar size to dev & growing rapidly
- Current test methodology is improving incrementally:
  - Random grammar-driven test case generation (sketch below)
  - Fault injection
  - Code-path coverage tools
- Testing remains effective at feature testing
  - Ineffective at finding inter-feature interactions
  - Only a tiny fraction of Heisenbugs are found in testing
    (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
- Beta testing exists because testing is known to be inadequate
- Test team growth scales exponentially with system complexity
- Test and beta cycles are already intolerably long
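As a concrete illustration of the first technique, here is a minimal sketch of grammar-driven random test-case generation. The toy grammar, productions, and depth cutoff are all invented for illustration; production tools of this kind (e.g., RAGS for SQL) use a full SQL grammar and cross-check results between engines.

```python
import random

# Toy SQL grammar: each nonterminal maps to a list of alternative
# productions, each production being a sequence of symbols.
GRAMMAR = {
    "query": [["SELECT ", "cols", " FROM ", "table", "where"]],
    "cols":  [["*"], ["a"], ["a", ", ", "b"]],
    "table": [["t1"], ["t2"]],
    "where": [[""], [" WHERE ", "pred"]],
    "pred":  [["a = 1"], ["b < 2"], ["pred", " AND ", "pred"]],
}

def generate(symbol: str, depth: int = 0) -> str:
    """Expand a nonterminal by picking a random production."""
    if symbol not in GRAMMAR:
        return symbol                      # terminal: emit literally
    # Force the first (simplest) production past a depth cutoff so that
    # recursive rules like `pred AND pred` always terminate.
    prods = GRAMMAR[symbol]
    prod = prods[0] if depth > 4 else random.choice(prods)
    return "".join(generate(s, depth + 1) for s in prod)

for _ in range(3):
    print(generate("query"))
```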

Slide 6: The Inktomi Lesson
- Inktomi web search engine (SIGMOD'98)
- Quickly evolving software:
  - Memory leaks, race conditions, etc. considered normal
  - Don't attempt to test & beta until quality is high
- System availability is of paramount importance
  - Individual node availability is unimportant
- Shared-nothing cluster
- Exploit the ability to fail individual nodes (supervisor sketch below):
  - Automatic reboots avoid memory leaks
  - Automatic restart of failed nodes
  - Fail fast: fail & restart when redundant checks fail
  - Replace failed hardware weekly (mostly disks)
- Dark machine room
  - No panic midnight calls to admins
- Mask failures rather than futilely attempting to avoid them
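A minimal sketch of that restart discipline, with an invented stand-in command in place of a real node binary: the supervisor restarts the node process whenever it fails fast, and only gives up (marking the node dead for the weekly hardware pass) if it crash-loops.

```python
import subprocess, sys, time

# Stand-in "node" that always crashes; a real deployment would launch
# the node server binary here instead.
NODE_CMD = [sys.executable, "-c", "raise SystemExit(1)"]

def mark_dead_for_replacement() -> None:
    print("node marked dead; hardware swapped on the weekly pass")

def supervise(max_restarts_per_hour: int = 6) -> None:
    restarts: list[float] = []
    while True:
        subprocess.call(NODE_CMD)          # blocks until the node process dies
        now = time.time()
        restarts = [t for t in restarts if now - t < 3600]
        if len(restarts) >= max_restarts_per_hour:
            # Crash-looping: stop restarting. No midnight page to an
            # admin; just flag the node for replacement and move on.
            mark_dead_for_replacement()
            return
        restarts.append(now)
        time.sleep(1)                      # brief backoff before restart

supervise()
```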

Slide 7: Apply to High-Value TP Data?
- The Inktomi model:
  - Scales to 100s of nodes
  - S/W evolves quickly
  - Low testing costs and no beta requirement
  - Exploits the ability to lose an individual node without impacting system availability
  - Can temporarily lose some data without significantly impacting query quality
- But we can't lose data availability in most TP systems
- Redundant data allows node loss without loss of data availability (sketch below)
- The Inktomi model with redundant data & metadata is a solution to the exploding test problem
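A toy sketch of why redundant data preserves availability: with every partition stored on more than one node, a read (or query fragment) survives any single node failure by falling back to another replica. The partition map and node names are invented for illustration.

```python
# partition id -> nodes holding a copy of that partition
REPLICAS = {
    "p0": ["node1", "node4"],
    "p1": ["node2", "node5"],
}
DOWN = {"node2"}   # nodes currently failed

def read_partition(pid: str) -> str:
    # Try each redundant copy in turn; only lose availability if
    # every node holding the partition is down at once.
    for node in REPLICAS[pid]:
        if node not in DOWN:
            return f"read {pid} from {node}"
    raise RuntimeError(f"all replicas of {pid} down: data unavailable")

print(read_partition("p1"))   # -> read p1 from node5
```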

Slide 8: Connection Model/Architecture
[Diagram: clients connecting to a server cloud of symmetric server nodes]
- All data & metadata multiply redundant
- Shared nothing
- Single system image
- Symmetric server nodes
- Any client connects to any server
- All nodes SAN-connected
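A small sketch of the symmetric connection model, with invented node names: because every node presents the same single system image, a client can open its session against any live node and fail over to another.

```python
import random

NODES = ["node1", "node2", "node3", "node4"]
DOWN: set[str] = {"node3"}   # nodes currently failed

def connect(client_id: str) -> str:
    # Any symmetric, live node will do; no client-visible "primary".
    candidates = [n for n in NODES if n not in DOWN]
    node = random.choice(candidates)
    return f"client {client_id}: session on {node}"

print(connect("c42"))
```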

Slide 9: Compilation & Execution Model
[Diagram: client request entering the server cloud; a server thread runs the pipeline lex analyze -> parse -> normalize -> optimize -> code generate -> query execute]
- Query execution on many subthreads, synchronized by a root thread
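A toy rendering of the diagram's pipeline as a chain of stage functions. The stage bodies are placeholders for the real engine phases, and `run_fragment` stands in for a subthread executing one generated plan fragment.

```python
# Placeholder compilation stages on the coordinating server thread.
def lex_analyze(sql):   return {"tokens": sql.split()}
def parse(s):           return {"ast": s["tokens"]}
def normalize(s):       return {"tree": s["ast"]}
def optimize(s):        return {"plan": s["tree"]}
def code_generate(s):   return {"fragments": [s["plan"], s["plan"]]}

def run_fragment(fragment):          # would run on its own subthread
    return ("done", fragment)

def execute(sql: str):
    state = lex_analyze(sql)
    for stage in (parse, normalize, optimize, code_generate):
        state = stage(state)
    # Root thread fans fragments out and synchronizes the results.
    return [run_fragment(f) for f in state["fragments"]]

print(execute("SELECT a FROM t1"))
```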

Slide 10: Node Loss/Rejoin
[Diagram: client query executing against the server cloud]
- Execution in progress when a node is lost: recompile, then re-execute
- On rejoin: node-local recovery, rejoin the cluster, then recover global data at the rejoining node
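A runnable toy of the loss path, with invented class and function names: if a node dies mid-execution, the root drops it from the membership, recompiles against the survivors (possible because all data is redundant), and re-executes.

```python
class NodeLost(Exception):
    pass

class Cluster:
    def __init__(self, nodes):
        self.nodes = set(nodes)
    def live_nodes(self):
        return sorted(self.nodes)
    def lose(self, node):
        self.nodes.discard(node)

def compile_for(sql, nodes):
    return {"sql": sql, "nodes": nodes}    # stand-in for a full plan

def execute_plan(plan, fail_on=None):
    if fail_on in plan["nodes"]:
        raise NodeLost(fail_on)            # node died mid-execution
    return f"{plan['sql']} ran on {plan['nodes']}"

def run_query(sql, cluster, fail_once_on=None):
    while True:
        plan = compile_for(sql, cluster.live_nodes())
        try:
            return execute_plan(plan, fail_on=fail_once_on)
        except NodeLost as err:
            cluster.lose(str(err))         # drop the node, replan, re-run
            fail_once_on = None

c = Cluster(["n1", "n2", "n3"])
print(run_query("SELECT 1", c, fail_once_on="n2"))
# -> SELECT 1 ran on ['n1', 'n3']
# Rejoin (not shown): node-local recovery, re-admit to membership,
# then refresh the rejoining node's stale replica copies.
```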

Slide 11: Redundant Data Update Model
[Diagram: client issuing an update into the server cloud]
- Updates are standard parallel plans
- The optimizer knows all redundant data paths
  - The generated plan updates them all (sketch below)
- No significant new technology
  - Like materialized view & index updates today
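A sketch of that update model, assuming (with invented names) a catalog that maps each logical table to all of its physical copies: the generated plan is the usual parallel plan with one write operator per copy, the same way index and materialized-view maintenance piggybacks on base-table updates today.

```python
# Catalog: logical table -> physical copies on different nodes.
COPIES = {
    "orders": [("node1", "orders_a"), ("node3", "orders_b")],
}

def plan_update(table, row):
    # One parallel plan: a write operator for every redundant copy,
    # so no copy can drift from the others.
    return [("WRITE", node, physical, row)
            for node, physical in COPIES[table]]

for op in plan_update("orders", {"id": 7, "qty": 2}):
    print(op)
```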

Slide 12: Fault Isolation Domains
- Trade single-node perf for redundant data checks:
  - Fairly common... but complex error-recovery code is even more likely to be wrong than the original forward-processing code
  - Many of the best redundant checks are compiled out of "retail" builds when shipped (when they're needed most)
- Fail fast rather than attempting to repair (sketch below):
  - Bring down the node on memory-based data structure faults
  - Never patch inconsistent data... other copies keep the system available
- If anything goes wrong, "fire" the node and continue:
  - Attempt node restart
  - Auto-reinstall the O/S and DB, and recreate the DB partition
  - Mark the node "dead" for later replacement
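A minimal sketch of a fail-fast redundant check, with invented names: on an inconsistency the node exits immediately, leaving recovery to the supervisor and the redundant copies rather than running repair code that is itself likely to be buggy. Unlike debug asserts, a check like this stays enabled in retail builds.

```python
import os, sys

def check(invariant: bool, msg: str) -> None:
    if not invariant:
        sys.stderr.write(f"fail fast: {msg} (pid {os.getpid()})\n")
        os._exit(1)   # no cleanup, no patching; the supervisor restarts us

# Example: a page that redundantly tracks its row count.
page = {"rows": ["r1", "r2"], "row_count": 2}

def insert_row(page: dict, row: str) -> None:
    page["rows"].append(row)
    page["row_count"] += 1
    # Redundant check after every mutation; a corrupted count kills
    # the node instead of being "repaired" in place.
    check(page["row_count"] == len(page["rows"]), "row count mismatch")

insert_row(page, "r3")   # invariant holds, node keeps running
```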

Slide 13: Summary
- 100 MLOC of server-side code and growing:
  - Can't fight it & can't test it...
  - Quality will continue to decline if we don't do something different
  - Can't afford a 2-to-3-year dev cycle
- The '60s large-system mentality still prevails:
  - Optimizing precious machine resources is a false economy
- The continuing focus on single-system perf is dead wrong:
  - Scalability & whole-system perf rather than individual node performance
- Why are we still incrementally attacking an exponential problem?
- Are there any reasonable alternatives to clusters?
