Slide 1: Software Testing Doesn’t Scale
James Hamilton, JamesRH@microsoft.com
Microsoft SQL Server
Slide 2: Overview
- The problem: S/W size & complexity inevitable
- Short cycles reduce S/W reliability
- S/W testing is the real issue: testing doesn’t scale; we are trading complexity for quality
- Cluster-based solution: the Inktomi lesson
- Shared-nothing cluster architecture
- Redundant data & metadata
- Fault isolation domains
Slide 3: S/W Size & Complexity Inevitable
- Successful S/W products grow large
- The number of features used by any given user is small, but the union of per-user feature sets is huge
- Reality of commodity, high-volume S/W: large feature sets (same trend as consumer electronics)
- Example mid-tier & server-side S/W stack: SAP ~47 MLOC, DB ~2 MLOC, NT ~50 MLOC
- Testing all feature interactions is impossible
Slide 4: Short Cycles Reduce S/W Reliability
- Reliable TP systems typically evolve slowly & conservatively
- Modern ERP systems can go through 6+ minor revisions/year; many e-commerce sites change even faster
- Fast revisions are a competitive advantage
- Current testing and release methodology: as much testing time as dev time, plus significant additional beta-cycle time
- Unacceptable choice: reliable but slow-evolving, or fast-changing yet unstable and brittle
Slide 5: Testing the Real Issue
- 15 years ago test teams were a tiny fraction of the dev group; now test teams are of similar size to dev & growing rapidly
- Current test methodology is improving incrementally: random grammar-driven test case generation, fault injection, code-path coverage tools
- Testing remains effective at feature testing but ineffective at finding inter-feature interactions
- Only a tiny fraction of Heisenbugs are found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
- Beta testing exists because testing is known to be inadequate
- Test team growth scales exponentially with system complexity; test and beta cycles are already intolerably long
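The grammar-driven test generation mentioned above can be sketched as a tiny recursive expander over a toy grammar. The grammar here is invented for illustration; a real generator would feed the product’s full SQL grammar with weighted productions.

```python
import random

# Hypothetical toy grammar; nonterminals map to lists of productions.
GRAMMAR = {
    "query": [["SELECT ", "cols", " FROM ", "table", "where"]],
    "cols":  [["a"], ["b"], ["a, b"]],
    "table": [["t1"], ["t2"]],
    "where": [[""], [" WHERE ", "cols", " = ", "value"]],
    "value": [["0"], ["42"]],
}

def generate(symbol, rng):
    """Expand a grammar symbol into one random concrete test case."""
    if symbol not in GRAMMAR:            # terminal: emit literally
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return "".join(generate(s, rng) for s in production)

rng = random.Random(1)                   # seeded so failures are reproducible
cases = [generate("query", rng) for _ in range(3)]
```

Each generated statement is syntactically valid by construction, which is what makes this technique good at feature testing and, as the slide notes, weak at finding cross-feature interactions.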
Slide 6: The Inktomi Lesson
- Inktomi web search engine (SIGMOD’98)
- Quickly evolving software: memory leaks, race conditions, etc. considered normal; don’t attempt to test & beta until quality is high
- System availability is of paramount importance; individual node availability is unimportant
- Shared-nothing cluster exploits the ability to fail individual nodes: automatic reboots avoid memory leaks; automatic restart of failed nodes; fail fast (fail & restart when redundant checks fail)
- Replace failed hardware weekly (mostly disks)
- Dark machine room: no panicked midnight calls to admins
- Mask failures rather than futilely attempting to avoid them
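The fail-fast-and-restart policy above can be sketched as a tiny supervision loop. Node names, the failure trigger, and the restart policy are all invented for illustration; the point is that a failed redundant check downs the node and an unattended supervisor brings it back.

```python
# Minimal sketch of the "dark machine room" loop, assuming a hypothetical
# in-process Node abstraction rather than real machines.
class Node:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def fail_fast(self):
        """A redundant internal check failed: mark the node down at once."""
        self.healthy = False

def supervise(nodes):
    """One pass of the supervisor: restart anything down, no admin call."""
    for node in nodes:
        if not node.healthy:
            node.healthy = True          # automatic restart
            node.restarts += 1

cluster = [Node("n0"), Node("n1"), Node("n2")]
cluster[1].fail_fast()                   # e.g. a consistency check fired
supervise(cluster)
```

Restarting is preferred over in-place repair because a fresh process sheds leaked memory and corrupted in-memory state along with the fault.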
Slide 7: Apply to High-Value TP Data?
- Inktomi model: scales to 100s of nodes; S/W evolves quickly; low testing costs and no beta requirement
- Exploits the ability to lose an individual node without impacting system availability, and to temporarily lose some data without significantly impacting query quality
- Can’t lose data availability in most TP systems
- Redundant data allows node loss without loss of data availability
- The Inktomi model with redundant data & metadata is a solution to the exploding test problem
Slide 8: Connection Model/Architecture
- Clients connect to a server cloud of server nodes
- All data & metadata multiply redundant
- Shared nothing; single system image
- Symmetric server nodes: any client connects to any server
- All nodes SAN-connected
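The symmetric, multiply-redundant layout above can be sketched as a hash-placed key-value cluster. Node count, replication factor, and the placement function are illustrative assumptions, not the talk’s actual design.

```python
# Sketch: every key lives on two nodes, and any node can serve any lookup
# because the placement metadata is itself known everywhere.
N_NODES, REPLICAS = 4, 2

def placement(key):
    """Metadata: the nodes holding a copy of `key` (hash-based sketch)."""
    home = hash(key) % N_NODES
    return [(home + i) % N_NODES for i in range(REPLICAS)]

stores = [dict() for _ in range(N_NODES)]

def put(key, value):
    for n in placement(key):
        stores[n][key] = value           # write all redundant copies

def get(key, via_node=0):
    """Client may connect to any node; it routes to a live replica."""
    for n in placement(key):
        if key in stores[n]:
            return stores[n][key]
    raise KeyError(key)

put("row1", "hello")
```

Because every copy path is known, losing one node (or one copy) leaves the lookup answerable from the surviving replica, which is the property the TP variant of the Inktomi model depends on.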
Slide 9: Compilation & Execution Model
- A client’s query is handled by a server thread in the cloud: lex analyze, parse, normalize, optimize, code generate, then execute
- Query execution runs on many subthreads synchronized by a root thread
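The pipeline above can be sketched as a chain of stage functions whose final stage fans work out to subthreads under a root thread. Every stage body here is a placeholder; only the shape of the pipeline comes from the slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy per-connection pipeline; each stage is a stand-in for the real phase.
def lex(q):        return q.split()
def parse(toks):   return {"op": toks[0], "args": toks[1:]}
def normalize(t):  return {**t, "op": t["op"].upper()}
def optimize(t):   return t                       # no-op in this sketch
def codegen(t):    return [("scan", arg) for arg in t["args"]]

def execute(plan):
    # Root thread coordinates; each plan step runs on a subthread.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda step: f"{step[0]}:{step[1]}", plan))

def run(query):
    return execute(codegen(optimize(normalize(parse(lex(query))))))
```

For example, `run("select t1 t2")` yields one executed scan step per argument, gathered in order by the root thread.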
Slide 10: Node Loss/Rejoin
- Losing a node while execution is in progress: recompile the query, then re-execute
- Rejoin: node-local recovery, rejoin the cluster, then recover global data at the rejoining node
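The lose-node/recompile/re-execute path can be sketched as a retry loop: when a node fails mid-query, the root thread marks it dead, recompiles the plan against the survivors, and re-executes. Node names and the failure model are invented for illustration.

```python
class NodeDown(Exception):
    pass

live = {"n0": True, "n1": True, "n2": True}
dies_once = {"n1"}                        # n1 will fail on first contact

def execute_on(node):
    if node in dies_once:
        dies_once.discard(node)
        raise NodeDown(node)
    return f"part@{node}"

def plan_for(nodes):
    return nodes[:2]                      # toy plan: use the first two nodes

def run_with_retry(max_attempts=3):
    for _ in range(max_attempts):
        survivors = [n for n, up in live.items() if up]
        plan = plan_for(survivors)        # "recompile" against survivors
        try:
            return [execute_on(n) for n in plan]   # "re-execute"
        except NodeDown as err:
            live[err.args[0]] = False     # mark the lost node dead
    raise RuntimeError("query failed after retries")

result = run_with_retry()
```

The first attempt touches n0 and n1, loses n1, and the second attempt completes on n0 and n2; redundant data is what makes the recompiled plan possible.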
Slide 11: Redundant Data Update Model
- Updates are standard parallel plans
- The optimizer knows all redundant data paths; the generated plan updates all of them
- No significant new technology: like materialized view & index updates today
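The "optimizer knows all redundant data paths" idea can be sketched as compiling one logical update into a write step per redundant structure, much as index and materialized-view maintenance is planned today. The base/replica/index structures here are invented for illustration.

```python
# Hypothetical redundant paths for one table: a base copy, a replica,
# and a derived index, all maintained by the same compiled update plan.
base, replica, name_index = {}, {}, {}

def compile_update(key, value):
    """Return one write step per redundant data path (a toy 'plan')."""
    return [
        lambda: base.__setitem__(key, value),
        lambda: replica.__setitem__(key, value),
        lambda: name_index.setdefault(value, set()).add(key),
    ]

for step in compile_update("k1", "alice"):
    step()                                # execute the plan's write steps
```

In a real system these steps would run as parallel plan operators on the nodes holding each copy, not as local lambdas, but the planning shape is the same.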
Slide 12: Fault Isolation Domains
- Trade single-node performance for redundant data checks: fairly common, but complex error-recovery code is even more likely to be wrong than the original forward-processing code
- Many of the best redundant checks are compiled out of “retail” builds when shipped, which is when they are needed most
- Fail fast rather than attempting to repair: bring the node down on memory-based data structure faults; never patch inconsistent data, since other copies keep the system available
- If anything goes wrong, “fire” the node and continue: attempt node restart; auto-reinstall the O/S and DB and recreate the DB partition; mark the node “dead” for later replacement
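A redundant check that ships in the release build, and fails fast instead of patching, can be sketched as below. The page layout and checksum rule are invented; the point is that on inconsistency the node surrenders rather than repairs, because another replica keeps the data available.

```python
class InconsistencyError(Exception):
    """Raised to down the node; never caught to 'fix' the data in place."""
    pass

def check_invariant(page):
    """Redundant check: stored checksum must match the recomputed one.
    Unlike a debug assert, this is NOT compiled out of the shipped build."""
    if sum(page["rows"]) % 251 != page["checksum"]:
        # Never patch the inconsistent copy; fail the node fast and let a
        # redundant copy on another node serve the data.
        raise InconsistencyError("checksum mismatch: failing node fast")
    return True

good = {"rows": [1, 2, 3], "checksum": 6 % 251}
bad = {"rows": [1, 2, 3], "checksum": 0}
```

This is the single-node-performance trade the slide names: every access pays for the check, and the payoff is that corruption surfaces immediately instead of propagating.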
Slide 13: Summary
- 100 MLOC of server-side code and growing: can’t fight it & can’t test it; quality will continue to decline if we don’t do something different
- Can’t afford a 2-to-3-year dev cycle
- ’60s large-system mentality still prevails: optimizing precious machine resources is a false economy
- Continuing focus on single-system performance is dead wrong: pursue scalability & whole-system performance rather than individual node performance
- Why are we still incrementally attacking an exponential problem? Are there any reasonable alternatives to clusters?
14