
1. Software Testing Doesn’t Scale
James Hamilton
Microsoft SQL Server

2. Overview

- The problem:
  - S/W size & complexity inevitable
  - Short cycles reduce S/W reliability
  - S/W testing is the real issue
  - Testing doesn’t scale: trading complexity for quality
- Cluster-based solution:
  - The Inktomi lesson
  - Shared-nothing cluster architecture
  - Redundant data & metadata
  - Fault isolation domains

3. S/W Size & Complexity Inevitable

- Successful S/W products grow large
  - The # of features used by any given user is small
  - But the union of per-user feature sets is huge
- Reality of commodity, high-volume S/W:
  - Large feature sets
  - Same trend as consumer electronics
- Example mid-tier & server-side S/W stack:
  - SAP: ~47 MLOC
  - DB: ~2 MLOC
  - NT: ~50 MLOC
- Testing all feature interactions is impossible

4. Short Cycles Reduce S/W Reliability

- Reliable TP systems typically evolve slowly & conservatively
- Modern ERP systems can go through 6+ minor revisions/year
  - Many e-commerce sites change even faster
  - Fast revisions are a competitive advantage
- Current testing and release methodology:
  - As much testing time as dev time
  - Significant additional beta-cycle time
- Unacceptable choice: reliable but slow-evolving, or fast-changing yet unstable and brittle

5. Testing Is the Real Issue

- 15 yrs ago, test teams were a tiny fraction of the dev group
  - Now test teams are similar in size to dev & growing rapidly
- Current test methodology is improving incrementally:
  - Random grammar-driven test case generation
  - Fault injection
  - Code-path coverage tools
- Testing remains effective at feature testing
  - Ineffective at finding inter-feature interactions
  - Only a tiny fraction of Heisenbugs are found in testing (ability_talk.ppt)
- Beta testing exists because testing is known to be inadequate
- Test team growth scales exponentially with system complexity
- Test and beta cycles are already intolerably long

6. The Inktomi Lesson

- Inktomi web search engine (SIGMOD’98)
- Quickly evolving software:
  - Memory leaks, race conditions, etc. considered normal
  - Don’t attempt to test & beta until quality is high
- System availability of paramount importance
  - Individual node availability unimportant
- Shared-nothing cluster
- Exploit the ability to fail individual nodes:
  - Automatic reboots avoid memory leaks
  - Automatic restart of failed nodes
  - Fail fast: fail & restart when redundant checks fail
  - Replace failed hardware weekly (mostly disks)
- Dark machine room
  - No panicked midnight calls to admins
- Mask failures rather than futilely attempting to avoid them
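The fail-fast discipline above can be sketched as a small supervisor loop. This is an illustration, not Inktomi's actual code: `Node`, `supervise`, and the scripted check results are hypothetical, standing in for real redundant checks over in-memory state.

```python
class Node:
    """Hypothetical node whose redundant checks pass or fail according
    to a scripted sequence (a real node would verify memory-resident
    data structures here)."""
    def __init__(self, node_id, check_results):
        self.node_id = node_id
        self.check_results = list(check_results)

    def run_checks(self):
        # Pop the next scripted outcome; default to healthy when the
        # script runs out.
        return self.check_results.pop(0) if self.check_results else True

def supervise(node, max_restarts=3):
    """Fail fast: on any failed check, restart the node rather than
    patching state in place; mark it dead after repeated failures."""
    restarts = 0
    while not node.run_checks():
        restarts += 1                 # automatic restart of the failed node
        if restarts > max_restarts:
            return "dead"             # flag for weekly hardware replacement
    return "running"
```

For example, `supervise(Node("n1", [False, True]))` comes back `"running"` after one restart, while a node that keeps failing its checks is marked `"dead"` for replacement rather than repaired in place.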

7. Apply to High-Value TP Data?

- Inktomi model:
  - Scales to 100s of nodes
  - S/W evolves quickly
  - Low testing costs and no beta requirement
  - Exploits the ability to lose an individual node without impacting system availability
  - Ability to temporarily lose some data w/o significantly impacting query quality
- Can’t lose data availability in most TP systems
  - Redundant data allows node loss w/o losing data availability
- Inktomi model with redundant data & metadata is a solution to the exploding test problem

8. Connection Model/Architecture

[Diagram: clients connecting to server nodes in a server cloud]

- All data & metadata multiply redundant
- Shared nothing
- Single system image
- Symmetric server nodes
- Any client connects to any server
- All nodes SAN-connected
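The symmetry claim above ("any client connects to any server") can be illustrated with a trivial sketch; the node names and the random choice of coordinator are hypothetical, not the actual connection protocol.

```python
import random

# Hypothetical symmetric cluster; node names are illustrative.
NODES = ["node1", "node2", "node3", "node4"]

def connect(client_id, nodes=NODES):
    """Single system image: because every node is symmetric and can
    reach every replica over the SAN, the client may attach to any
    node, chosen here by simple random load spreading."""
    return random.choice(nodes)
```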

9. Compilation & Execution Model

[Diagram: client request entering the server cloud; a server thread runs the pipeline]

- Server-thread pipeline: lex analyze → parse → normalize → optimize → code generate → query execute
- Query execution on many subthreads synchronized by a root thread
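The pipeline and fan-out on the slide can be sketched as follows. The stage names come from the slide, but every body is a stand-in stub; `compile_and_execute` and the partition closures are hypothetical illustrations of "compile once on the root thread, execute on subthreads".

```python
from concurrent.futures import ThreadPoolExecutor

# Stage names follow the slide; the bodies are stand-in stubs.
def lex(sql):        return sql.split()
def parse(tokens):   return {"tokens": tokens}
def normalize(tree): return tree
def optimize(tree):  return {"ops": tree["tokens"]}

def codegen(plan):
    # Produce a per-partition executable closure.
    return lambda partition: (partition, len(plan["ops"]))

def compile_and_execute(sql, partitions):
    """The root thread compiles once (lex -> parse -> normalize ->
    optimize -> code generate), then fans execution out across the
    partitions on subthreads and synchronizes on all results."""
    run = codegen(optimize(normalize(parse(lex(sql)))))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run, partitions))
```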

10. Node Loss/Rejoin

[Diagram: server cloud with a node failing and later rejoining during execution]

- Execution in progress
- Lose node:
  - Recompile
  - Re-execute
- Rejoin:
  - Node-local recovery
  - Rejoin cluster
  - Recover global data at the rejoining node
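The "lose node, recompile, re-execute" step above can be sketched as a retry loop; `NodeDown`, `run_query`, and the `flaky` parameter are hypothetical illustrations of a mid-query node failure, not the real execution engine.

```python
class NodeDown(Exception):
    """Raised when a participating node fails mid-execution."""

def run_query(sql, live_nodes, fail_first_on=None):
    # Stand-in execution: fails if the named flaky node is still live.
    if fail_first_on in live_nodes:
        raise NodeDown(fail_first_on)
    return f"{sql} on {sorted(live_nodes)}"

def execute_with_retry(sql, nodes, flaky=None):
    """Node-loss handling from the slide: on failure, drop the dead
    node, recompile against the survivors (which hold redundant
    copies of its data), and re-execute from scratch."""
    live = set(nodes)
    while live:
        try:
            return run_query(sql, live, fail_first_on=flaky)
        except NodeDown as dead:
            live.discard(dead.args[0])   # lose node; replicas remain
    raise RuntimeError("no redundant copy left")
```

Re-executing from scratch, rather than resuming a half-finished parallel plan, keeps the recovery path simple, in line with the fail-fast theme of the talk.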

11. Redundant Data Update Model

[Diagram: client update fanning out across the server cloud]

- Updates are standard parallel plans
- Optimizer knows all redundant data paths
  - Generated plan updates all of them
- No significant new technology
  - Like materialized view & index updates today
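The update model can be sketched in a few lines: the optimizer expands one logical write into physical writes against every copy it knows about. The catalog contents and node names here are hypothetical.

```python
# Hypothetical catalog: each table's redundant copies, by node.
CATALOG = {"orders": ["node1", "node3", "node5"]}

def plan_update(table, row, catalog=CATALOG):
    """Expand one logical write into physical writes against every
    redundant copy, the way index and materialized-view maintenance
    is folded into update plans today."""
    return [("write", node, table, row) for node in catalog[table]]
```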

12. Fault Isolation Domains

- Trade single-node perf for redundant data checks:
  - Fairly common… but complex error-recovery code is even more likely to be wrong than the original forward-processing code
  - Many of the best redundant checks are compiled out of “retail” versions when shipped (when needed most)
- Fail fast rather than attempting to repair:
  - Bring down the node on memory-based data structure faults
  - Never patch inconsistent data… other copies keep the system available
- If anything goes wrong, “fire” the node and continue:
  - Attempt node restart
  - Auto-reinstall O/S & DB and recreate the DB partition
  - Mark node “dead” for later replacement
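The point about checks being compiled out of retail builds suggests a check that always runs in production and fails the whole node. A minimal sketch, assuming a hypothetical `NodeFailFast` signal that the supervisor turns into a restart or replacement:

```python
class NodeFailFast(Exception):
    """Signal that the whole node should be restarted or replaced."""

def check(condition, message):
    """Unlike C's assert() (compiled out under NDEBUG in retail
    builds), this check always runs in production; on failure it
    fires the node rather than patching inconsistent data, relying
    on redundant copies to keep the system available."""
    if not condition:
        raise NodeFailFast(message)
```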

13. Summary

- 100 MLOC of server-side code and growing:
  - Can’t fight it & can’t test it…
  - Quality will continue to decline if we don’t do something different
  - Can’t afford a 2-to-3-year dev cycle
- ’60s large-system mentality still prevails:
  - Optimizing precious machine resources is a false economy
- Continuing focus on single-system perf is dead wrong:
  - Scalability & system perf rather than individual node performance
- Why are we still incrementally attacking an exponential problem?
- Any reasonable alternatives to clusters?

14. Software Testing Doesn’t Scale
James Hamilton
Microsoft SQL Server
