Slide 1: Software Testing Doesn’t Scale
James Hamilton, JamesRH@microsoft.com
Microsoft SQL Server
Slide 2: Overview
- The problem: S/W size & complexity inevitable
- Short cycles reduce S/W reliability
- S/W testing is the real issue: testing doesn’t scale; we are trading complexity for quality
- Cluster-based solution: the Inktomi lesson
- Shared-nothing cluster architecture
- Redundant data & metadata
- Fault isolation domains
Slide 3: S/W Size & Complexity Inevitable
- Successful S/W products grow large
- The number of features used by any given user is small, but the union of per-user feature sets is huge
- Reality of commodity, high-volume S/W: large feature sets (same trend as consumer electronics)
- Example mid-tier & server-side S/W stack: SAP ~47 MLOC, DB ~2 MLOC, NT ~50 MLOC
- Testing all feature interactions is impossible
Slide 4: Short Cycles Reduce S/W Reliability
- Reliable TP systems typically evolve slowly & conservatively
- Modern ERP systems can go through 6+ minor revisions/year; many e-commerce sites change even faster
- Fast revisions are a competitive advantage
- Current testing and release methodology: as much testing time as dev time, plus significant additional beta-cycle time
- Unacceptable choice: reliable but slow-evolving, or fast-changing yet unstable and brittle
Slide 5: Testing the Real Issue
- 15 years ago test teams were a tiny fraction of the dev group; now test teams are of similar size to dev & growing rapidly
- Current test methodology is improving incrementally: random grammar-driven test case generation, fault injection, code-path coverage tools
- Testing remains effective at feature testing but ineffective at finding inter-feature interactions
- Only a tiny fraction of Heisenbugs are found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
- Beta testing exists because testing is known to be inadequate
- Test team growth scales exponentially with system complexity; test and beta cycles are already intolerably long
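The grammar-driven test generation mentioned above can be sketched as a tiny recursive expander over a toy grammar. The grammar here is invented for illustration; a real generator would feed the product’s full SQL grammar with weighted productions.

```python
import random

# Hypothetical toy grammar; nonterminals map to lists of productions.
GRAMMAR = {
    "query": [["SELECT ", "cols", " FROM ", "table", "where"]],
    "cols":  [["a"], ["b"], ["a, b"]],
    "table": [["t1"], ["t2"]],
    "where": [[""], [" WHERE ", "cols", " = ", "value"]],
    "value": [["0"], ["42"]],
}

def generate(symbol, rng):
    """Expand a grammar symbol into one random concrete test case."""
    if symbol not in GRAMMAR:            # terminal: emit literally
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return "".join(generate(s, rng) for s in production)

rng = random.Random(1)                   # seeded so failures are reproducible
cases = [generate("query", rng) for _ in range(3)]
```

Each generated statement is syntactically valid by construction, which is what makes this technique good at feature testing and, as the slide notes, weak at finding cross-feature interactions.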
Slide 6: The Inktomi Lesson
- Inktomi web search engine (SIGMOD’98)
- Quickly evolving software: memory leaks, race conditions, etc. considered normal; don’t attempt to test & beta until quality is high
- System availability is of paramount importance; individual node availability is unimportant
- Shared-nothing cluster exploits the ability to fail individual nodes: automatic reboots avoid memory leaks; automatic restart of failed nodes; fail fast (fail & restart when redundant checks fail)
- Replace failed hardware weekly (mostly disks)
- Dark machine room: no panicked midnight calls to admins
- Mask failures rather than futilely attempting to avoid them
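The fail-fast-and-restart policy above can be sketched as a tiny supervision loop. Node names, the failure trigger, and the restart policy are all invented for illustration; the point is that a failed redundant check downs the node and an unattended supervisor brings it back.

```python
# Minimal sketch of the "dark machine room" loop, assuming a hypothetical
# in-process Node abstraction rather than real machines.
class Node:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def fail_fast(self):
        """A redundant internal check failed: mark the node down at once."""
        self.healthy = False

def supervise(nodes):
    """One pass of the supervisor: restart anything down, no admin call."""
    for node in nodes:
        if not node.healthy:
            node.healthy = True          # automatic restart
            node.restarts += 1

cluster = [Node("n0"), Node("n1"), Node("n2")]
cluster[1].fail_fast()                   # e.g. a consistency check fired
supervise(cluster)
```

Restarting is preferred over in-place repair because a fresh process sheds leaked memory and corrupted in-memory state along with the fault.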
Slide 7: Apply to High-Value TP Data?
- Inktomi model: scales to 100s of nodes; S/W evolves quickly; low testing costs and no beta requirement
- Exploits the ability to lose an individual node without impacting system availability, and to temporarily lose some data without significantly impacting query quality
- Can’t lose data availability in most TP systems
- Redundant data allows node loss without loss of data availability
- The Inktomi model with redundant data & metadata is a solution to the exploding test problem
Slide 8: Connection Model/Architecture
- Clients connect to a server cloud of server nodes
- All data & metadata multiply redundant
- Shared nothing; single system image
- Symmetric server nodes: any client connects to any server
- All nodes SAN-connected
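The symmetric, multiply-redundant layout above can be sketched as a hash-placed key-value cluster. Node count, replication factor, and the placement function are illustrative assumptions, not the talk’s actual design.

```python
# Sketch: every key lives on two nodes, and any node can serve any lookup
# because the placement metadata is itself known everywhere.
N_NODES, REPLICAS = 4, 2

def placement(key):
    """Metadata: the nodes holding a copy of `key` (hash-based sketch)."""
    home = hash(key) % N_NODES
    return [(home + i) % N_NODES for i in range(REPLICAS)]

stores = [dict() for _ in range(N_NODES)]

def put(key, value):
    for n in placement(key):
        stores[n][key] = value           # write all redundant copies

def get(key, via_node=0):
    """Client may connect to any node; it routes to a live replica."""
    for n in placement(key):
        if key in stores[n]:
            return stores[n][key]
    raise KeyError(key)

put("row1", "hello")
```

Because every copy path is known, losing one node (or one copy) leaves the lookup answerable from the surviving replica, which is the property the TP variant of the Inktomi model depends on.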
Slide 9: Compilation & Execution Model
- A client’s query is handled by a server thread in the cloud: lex analyze, parse, normalize, optimize, code generate, then execute
- Query execution runs on many subthreads synchronized by a root thread
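The pipeline above can be sketched as a chain of stage functions whose final stage fans work out to subthreads under a root thread. Every stage body here is a placeholder; only the shape of the pipeline comes from the slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy per-connection pipeline; each stage is a stand-in for the real phase.
def lex(q):        return q.split()
def parse(toks):   return {"op": toks[0], "args": toks[1:]}
def normalize(t):  return {**t, "op": t["op"].upper()}
def optimize(t):   return t                       # no-op in this sketch
def codegen(t):    return [("scan", arg) for arg in t["args"]]

def execute(plan):
    # Root thread coordinates; each plan step runs on a subthread.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda step: f"{step[0]}:{step[1]}", plan))

def run(query):
    return execute(codegen(optimize(normalize(parse(lex(query))))))
```

For example, `run("select t1 t2")` yields one executed scan step per argument, gathered in order by the root thread.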
Slide 10: Node Loss/Rejoin
- Losing a node while execution is in progress: recompile the query, then re-execute
- Rejoin: node-local recovery, rejoin the cluster, then recover global data at the rejoining node
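The lose-node/recompile/re-execute path can be sketched as a retry loop: when a node fails mid-query, the root thread marks it dead, recompiles the plan against the survivors, and re-executes. Node names and the failure model are invented for illustration.

```python
class NodeDown(Exception):
    pass

live = {"n0": True, "n1": True, "n2": True}
dies_once = {"n1"}                        # n1 will fail on first contact

def execute_on(node):
    if node in dies_once:
        dies_once.discard(node)
        raise NodeDown(node)
    return f"part@{node}"

def plan_for(nodes):
    return nodes[:2]                      # toy plan: use the first two nodes

def run_with_retry(max_attempts=3):
    for _ in range(max_attempts):
        survivors = [n for n, up in live.items() if up]
        plan = plan_for(survivors)        # "recompile" against survivors
        try:
            return [execute_on(n) for n in plan]   # "re-execute"
        except NodeDown as err:
            live[err.args[0]] = False     # mark the lost node dead
    raise RuntimeError("query failed after retries")

result = run_with_retry()
```

The first attempt touches n0 and n1, loses n1, and the second attempt completes on n0 and n2; redundant data is what makes the recompiled plan possible.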
Slide 11: Redundant Data Update Model
- Updates are standard parallel plans
- The optimizer knows all redundant data paths; the generated plan updates all of them
- No significant new technology: like materialized view & index updates today
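The "optimizer knows all redundant data paths" idea can be sketched as compiling one logical update into a write step per redundant structure, much as index and materialized-view maintenance is planned today. The base/replica/index structures here are invented for illustration.

```python
# Hypothetical redundant paths for one table: a base copy, a replica,
# and a derived index, all maintained by the same compiled update plan.
base, replica, name_index = {}, {}, {}

def compile_update(key, value):
    """Return one write step per redundant data path (a toy 'plan')."""
    return [
        lambda: base.__setitem__(key, value),
        lambda: replica.__setitem__(key, value),
        lambda: name_index.setdefault(value, set()).add(key),
    ]

for step in compile_update("k1", "alice"):
    step()                                # execute the plan's write steps
```

In a real system these steps would run as parallel plan operators on the nodes holding each copy, not as local lambdas, but the planning shape is the same.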
Slide 12: Fault Isolation Domains
- Trade single-node performance for redundant data checks: fairly common, but complex error-recovery code is even more likely to be wrong than the original forward-processing code
- Many of the best redundant checks are compiled out of “retail” builds when shipped, which is when they are needed most
- Fail fast rather than attempting to repair: bring the node down on memory-based data structure faults; never patch inconsistent data, since other copies keep the system available
- If anything goes wrong, “fire” the node and continue: attempt node restart; auto-reinstall the O/S and DB and recreate the DB partition; mark the node “dead” for later replacement
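A redundant check that ships in the release build, and fails fast instead of patching, can be sketched as below. The page layout and checksum rule are invented; the point is that on inconsistency the node surrenders rather than repairs, because another replica keeps the data available.

```python
class InconsistencyError(Exception):
    """Raised to down the node; never caught to 'fix' the data in place."""
    pass

def check_invariant(page):
    """Redundant check: stored checksum must match the recomputed one.
    Unlike a debug assert, this is NOT compiled out of the shipped build."""
    if sum(page["rows"]) % 251 != page["checksum"]:
        # Never patch the inconsistent copy; fail the node fast and let a
        # redundant copy on another node serve the data.
        raise InconsistencyError("checksum mismatch: failing node fast")
    return True

good = {"rows": [1, 2, 3], "checksum": 6 % 251}
bad = {"rows": [1, 2, 3], "checksum": 0}
```

This is the single-node-performance trade the slide names: every access pays for the check, and the payoff is that corruption surfaces immediately instead of propagating.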
Slide 13: Summary
- 100 MLOC of server-side code and growing: can’t fight it & can’t test it; quality will continue to decline if we don’t do something different
- Can’t afford a 2-to-3-year dev cycle
- ’60s large-system mentality still prevails: optimizing precious machine resources is a false economy
- Continuing focus on single-system performance is dead wrong: pursue scalability & whole-system performance rather than individual node performance
- Why are we still incrementally attacking an exponential problem? Are there any reasonable alternatives to clusters?
14