Scaling The Software Development Process: Lessons Learned from The Sims Online
Greg Kearney, Larry Mellon, Darrin West — Spring 2003, GDC


1 Scaling The Software Development Process: Lessons Learned from The Sims Online
Greg Kearney, Larry Mellon, Darrin West — Spring 2003, GDC

2 Talk Overview
Covers: software engineering techniques to help when projects get big
–Code structure
–Work processes (for programmers)
–Testing
Does not cover:
–Game design / content pipeline
–Operations / project management

3 How to Apply It
We didn’t do all of this right away
Improve what you can
Don’t change too much at once
Prove that it works, and others will take up the cause
Iterate

4 Match Process to Scale
[Chart: team efficiency vs. team size. A process suited to 5–15 programmers collapses into “Everything’s Broken Hell” as the team grows; a process suited to 30–50 programmers starts out in “Meeting Hell”. Change to a new process as you scale.]

5 What You Should Leave With
TSO “Lessons Learned”
–Where we were with our software process
–What we did about it
–How it helped
Some rules of thumb
–General practices that tend to smooth software scale
–Not a blueprint for MMP development
–A useful “frame of reference”

6 Classes of Lessons Learned & Rules
Architecture / Design: Keep It Simple
–Minimizing dependencies, fatal couplings
–Minimizing complexity, brittleness
Workspace Management: Keep It Clean
–Code and directory structure
–Check-in and integration strategies
Dev. Support Structure: Make It Easy, Prove It
–Testing
–Automation
All of these had to change as we scaled up; each eventually exceeded the team’s ability to cope using existing tools and processes.

7 Non-Geek Analogy
–Sharpen your tools.
–Clean up your mess.
–Measure twice, cut once.
–Stay with your buddy.

8 Key Factors Affecting Efficiency
High “churn rate”: a large number of coders times tightly coupled code equaled frequent breaks
–Our code had a deep root system
–And we had a forest of changes to make

9 Evolve: Make It Smaller

10 Key Factors Affecting Efficiency
“Key logs”: some issues were preventing other issues from even being worked on

11 Key Factors Affecting Efficiency
A chain of single points of failure took out the entire team:
Login → Create an avatar → Enter a city → Buy a house → Enter a house → Buy the chair → Sit on a chair

12 So, What Did We Do That Worked?
Switched to a logical architecture with less coupling
Switched to a code structure with fewer dependencies
Put in scaffolding to keep everyone working
Developed sophisticated configuration management
Instituted automated testing
Metrics, metrics, metrics

13 So, What Did We Do That Didn’t?
Long-range milestone planning
Network emulator(s)
Over-engineered a few things (too general)
Some tasks failed due to:
–Not replanning or reviewing long tasks
–Not breaking up long tasks
Coding standard changed partway through…

14 What We Were Faced With
750K lines of legacy Windows code
Port it to Linux
Change from “multiplayer” to client/server
18 months
Developers must remain alive after shipping
Continuous releases starting at Beta

15 Go to Final Architecture ASAP

16 Go to Final Architecture ASAP
[Diagram: evolve from the multiplayer topology (peer clients each running their own Sim — “here be sync hell”) to client/server (a single authoritative server Sim, with clients issuing requests and receiving commands — nice and undemocratic).]

17 Final Architecture ASAP: “Refactoring”
Decomposed into multiple DLLs
–Found the simulator interfaces
Reference counting
Client/server subclassing
How it helped:
–Reduced coupling; even reduced compile times!
–Developers in different modules broke each other less often.
–We went everywhere and learned the code base.

18 Final Architecture ASAP: It Had to Always Run
But clients would not behave predictably
We could not even play test
Game design was demoralized
We needed a bridge, now!

19 Final Architecture ASAP: Incremental Sync
A quick, temporary solution…
–Couldn’t wait for the final system to be finished
–High overhead; couldn’t ship it
We took partial state snapshots on the server and restored to them on the client
How it helped:
–We could finally see the game as it would be
–Allowed parallel game design and coding
–Bought time to lay in the “right” stuff

20 Final Architecture ASAP: Null View
Created a Null View HouseSim on Windows
–Same interface
–Null (text output) implementation
How it helped:
–No #ifdef’s!
–Done under Windows, so we could test this first step
–We knew it was working during the port
–Allowed us to port only the “needed” parts to Linux
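The "null view" idea on this slide — one simulator interface with a graphical implementation and a text-only implementation selected at runtime, instead of #ifdef'ed builds — might be sketched as follows. All class and method names here are illustrative, not TSO's actual API:

```python
# Sketch of the Null View pattern: game logic is written against one
# interface; the null implementation emits text and has no graphics
# dependencies, so it runs headless on a server.

class HouseSimView:
    """Interface every view implementation must satisfy."""
    def draw_avatar(self, name: str) -> str:
        raise NotImplementedError

class GraphicalView(HouseSimView):
    def draw_avatar(self, name: str) -> str:
        return f"[render sprite for {name}]"   # stand-in for real rendering

class NullView(HouseSimView):
    """Text-only implementation: same interface, no graphics stack."""
    def draw_avatar(self, name: str) -> str:
        return f"avatar:{name}"                # plain text output

def run_sim(view: HouseSimView) -> str:
    # Simulation code never knows which view it has, so the same
    # build runs on client and server with no #ifdef's.
    return view.draw_avatar("bob")

print(run_sim(NullView()))       # avatar:bob
print(run_sim(GraphicalView()))  # [render sprite for bob]
```

Because the substitution happens at the interface, the Windows build could validate the null implementation before any Linux porting began.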

21 Final Architecture ASAP: More “Bridges”
HSBs: a proxy on Linux, passing through to a Windows Sim. Disabled authentication, etc.
How it helped:
–Could exercise Linux components before finishing the HouseSim port
–Allowed us to debug server scale, performance, and stability issues early
–Made best use of Windows developers
–Allowed single-platform development; faster compiles
–Could keep working even when some of the system wasn’t available

22 Mainline *Must* Work!

23 If Mainline Doesn’t Work, Nobody Works
The mainline source-control branch *must* run
Never go dark: demo/play test every day
If you hit a bug, do you sync to mainline, hoping someone else fixed it? Or did you just add it?
–If mainline breaks for “only” an hour, the project loses a man-week.
–If each developer breaks the mainline “only” once a month, it is broken every day.

24 Mainline Must Work: Sniff Test
Mainline was breaking for “simple” things
–Features you “didn’t touch” (and didn’t test)
Created an auto-test to exercise all core functions. Quick to run. Fun to watch. Checked results.
Mandated that it pass before submitting code changes. Break the build: “feed the pig”.
How it helped:
–A very simple test; an amazing difference
–Sometimes we got lazy and trusted it too much. Doh!
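A sniff test of this kind reduces to "drive every core function in order and refuse the submit on any failure." The sketch below is a minimal stand-in; the client class and step names are hypothetical, not TSO's actual test client:

```python
# Minimal pre-submit sniff test: exercise the critical path
# (login -> enter lot -> use object) and check each result.

class FakeClient:
    """Stand-in for the scripted game client; each step reports success."""
    def login(self):      return True
    def enter_lot(self):  return True
    def use_object(self): return True

def sniff_test(client) -> bool:
    # Any core-function failure blocks the submission outright.
    for step in (client.login, client.enter_lot, client.use_object):
        if not step():
            return False
    return True

assert sniff_test(FakeClient())   # must pass before code may be submitted
```

The value is in the mandate, not the sophistication: a test this simple catches the "feature you didn't touch" breaks precisely because it always runs.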

25 Mainline Must Work: Stages to “Sandboxing”
1. Got it to build reliably
2. Instituted auto-builds: all on failure
3. Used a “Pumpkin” to avoid duplicate merge-test cycles, pulling partial submissions, …
4. Used a Pumpkin Queue when we really got rolling
How it helped:
–Far fewer thumbs twiddled
–The extra process got on some people’s nerves

26 Mainline Must Work: Sandboxing
5. Finally, went to per-developer branching
–Develop on your own branch
–Submit changes to an integration engineer
–A full smoke test runs per submission/feature
–If it passed, it was integrated to mainline in priority order; otherwise it was bounced
How it helped:
–Mainline *always* runs; pull any time
–Releases are not delayed by partial features
–No more code freezes going to release
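The integration flow described above — queued submissions, a smoke test per change, passing changes merged in priority order and failing ones bounced — can be sketched as a priority queue. The names and the smoke-test predicate are invented for illustration:

```python
# Sketch of a gated integration queue: only smoke-tested changes
# reach mainline, highest-priority first; failures are bounced back.
import heapq

def integrate(submissions, smoke_test):
    """submissions: (priority, change) pairs; lower number = more urgent."""
    heap = list(submissions)
    heapq.heapify(heap)
    mainline, bounced = [], []
    while heap:
        _prio, change = heapq.heappop(heap)
        (mainline if smoke_test(change) else bounced).append(change)
    return mainline, bounced

# Hypothetical smoke test: one submission is known-broken.
passes = lambda change: change != "broken-feature"
merged, rejected = integrate(
    [(2, "ui-fix"), (1, "login-fix"), (3, "broken-feature")], passes)
print(merged)    # ['login-fix', 'ui-fix']
print(rejected)  # ['broken-feature']
```

The property the slide cares about falls out directly: nothing in `merged` ever failed its smoke test, so mainline always runs.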

27 Support Structure

28 Background: Support Structure
Team size placed design constraints on supporting tools
–Automation: a big win in big teams
–Churn rate: tool accuracy / support cost
Types of tools
–Data management: collection / correlation
–Testing: controlled, synchronized, repeatable inputs
–Baselines: my bug, your bug, or our bug?

29 Overview: Support Structure
Automated testing: designs to minimize the impact of churn rate
Automated data collection / correlation
–Distributed system == distributed data
–Dashboard / Esper / Monkey Watcher
Use case: load testing
–Controlled (tunable) inputs, observable results
–“Scale & Break”

30 Problem: Testing Accuracy
Load & regression: inputs must be
–Accurate
–Repeatable
Churn rate: logic/data in constant motion
–How do you keep the test client accurate?
Solution: the game client becomes the test client
–Exact mimicry
–Lower maintenance costs

31 Test Client == Game Client
[Diagram: in the test client, the game GUI is replaced by test control commands; both clients share the same client-side game logic and presentation layer, exchanging the same state.]

32 Game Client: How Much to Keep?
[Diagram: the game client decomposed into view logic atop the presentation layer.]

33 What Level to Test At?
Driving the game client with mouse clicks, above the view logic and presentation layer:
–Regression: too brittle (pixel shift)
–Load: too bulky

34 What Level to Test At?
Driving the game client with internal events:
–Regression: too brittle (churn rate vs. logic & data)

35 Semantic Abstractions
[Diagram: a NullView client is roughly ¾ view logic over ¼ presentation layer; tests drive semantic primitives such as Buy Lot, Enter Lot, Buy Object, Use Object, …]
Basic gameplay changes less frequently than UI or protocol implementations.

36 Scriptable User Play Sessions
Test scripts: specific, ordered inputs
–Single-user play session
–Multiple-user play session
SimScript
–Collection: presentation-layer “primitives”
–Synchronization: wait_until, remote_command
–State probes: arbitrary game state (an avatar’s body skill, a lamp on/off, …)

37 Scriptable User Play Sessions
Scriptable play sessions: a big win
–Load: tunable based on actual play
–Regression: walk a set of avatars through various play sessions, validating correctness per step
Gameplay semantics: very stable
–UI / protocols shifted constantly
–Game play remained (about) the same
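A scripted play session built from presentation-layer primitives like those named two slides up (commands, `wait_until`, state probes) might look like the sketch below. SimScript was an internal tool; this Python stand-in only mirrors its shape, and every name here is invented:

```python
# Sketch of a scripted play session: ordered semantic commands,
# a synchronization point, and state probes validating each step.

def play_session(client):
    client.command("enter_lot", lot=42)
    client.wait_until(lambda: client.probe("in_lot") == 42)
    client.command("use_object", obj="lamp")
    assert client.probe("lamp_on")         # state probe checks the result

class StubClient:
    """Trivial stand-in that records commands and satisfies the probes."""
    def __init__(self):
        self.state = {"in_lot": None, "lamp_on": False}
        self.log = []
    def command(self, name, **kw):
        self.log.append(name)
        if name == "enter_lot":
            self.state["in_lot"] = kw["lot"]
        if name == "use_object":
            self.state["lamp_on"] = True
    def wait_until(self, cond):
        assert cond()                      # stub: condition holds at once
    def probe(self, key):
        return self.state[key]

c = StubClient()
play_session(c)
print(c.log)   # ['enter_lot', 'use_object']
```

Because the script speaks in gameplay semantics ("enter lot", "use object") rather than pixels or wire protocols, it survives UI and protocol churn — the stability property the slide credits for the win.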

38 Automated Test: Team Baselines
Hourly “critical path” stability tests
–Sync / clean / build / test
–Validate mainline / servers
Snifftest weather report
–Hourly testing
–Constant reporting

39 How Automated Testing Helped
A current, accurate baseline for developers
Scale & Break found many bugs
Greatly increased stability
–The code base was “safe”
–Server health was known (and better)

40 Tools & Large Teams
High tool ROI
–team_size * automation_savings
Faster triage
–Quickly narrow down a problem across any system component
Monitoring tools became a focal point
Wiki: central doc repository

41 Monitoring / Diagnostics
“When you can measure what you are speaking about and can express it in numbers, you know something about it. But when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind.” — Lord Kelvin
DeMarco: You cannot control what you cannot measure.
Maxwell: To measure is to know.
Pasteur: A science is as mature as its measurement tools.

42 Dashboard
System resource & health tool
–CPU / memory / disk / …
Central point of access to
–Status
–Test results
–Errors
–Logs
–Cores
–…

43 Test Central / Monkey Watcher
Test Central UI
–Control rig for developers & testers
Monkey Watcher
–Collects & stores (distributed) test results
–Produces summarized reports across tests
–Filters known defects
–Provides a baseline of correctness
–Web frontend, unique IDs per test

44 Esper
In-game profiler for a distributed system
Internal probes may be viewed
–Per process / machine / cluster
–Time view or summary view
Automated data management
–Coders: add a one-line probe
–Esper: the data shows up on the web site
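The "one-line probe" workflow — a coder drops a single call into game code and a collector aggregates it for the web view — could look like the sketch below. The probe API is hypothetical; only the one-line-to-instrument shape comes from the slide:

```python
# Sketch of a one-line instrumentation probe with a central collector.
import time
from collections import defaultdict

_samples = defaultdict(list)

def probe(name, value):
    _samples[name].append(value)      # the one line coders add

def summary():
    """Average every probe — a stand-in for Esper's summary view."""
    return {k: sum(v) / len(v) for k, v in _samples.items()}

# In (simulated) game code:
start = time.perf_counter()
time.sleep(0.01)                      # stand-in for one simulation tick
probe("sim.tick_ms", (time.perf_counter() - start) * 1000)

print(summary())                      # e.g. {'sim.tick_ms': 10.3}
```

Keeping the coder-facing surface to one call is what makes instrumentation cheap enough that it actually gets added.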

45 Use Case: Scale & Break
It’s never too early to begin scaling
–Idle: keep doubling server processes
–Busy: double #users and dataset size
Fix what broke, then start again
Tune input scripts using Beta data
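The doubling loop above can be sketched in a few lines. The capacity threshold is an invented stand-in for whatever component actually falls over first in a real cluster:

```python
# Sketch of the Scale & Break loop: keep doubling load until the
# (hypothetical) cluster would break, and report the last level that held.

def scale_until_break(capacity, start=1):
    users = start
    while users * 2 <= capacity:      # "double #users, dataset size"
        users *= 2
    return users                      # highest load level that still held

print(scale_until_break(capacity=1000))   # 512
```

In practice the loop is "fix what broke, start again": each run raises `capacity` by removing the bottleneck the previous run exposed.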

46 Load Testing: Data Flow
[Diagram: test drivers on multiple CPUs each run banks of test clients against the server cluster; a control rig feeds the load-testing team client metrics, game traffic, resource metrics, and debugging data gathered from system monitors and internal probes.]

47 Outline: Wrap-up
Wins / losses
Rules: analysis & discussion
Recommended reading
Questions

48 Process: Wins / Losses
Wins
–Module decomposition
  Logical: client/server architecture
  Physical: code structure
–Scaffolding for parallel development
–Tools to improve workflow
–Automated regression / load

49 Process: Wins / Losses
Losses
–Early lack of tools
–#ifdef as a cross-platform port
–Single points of failure blocked the entire development team

50 Not Done Yet: More Challenges
How to ship, and ship, and ship…
How to balance infrastructure cleanup against new feature development…

51 Rules of Thumb (1)
KISS: software and processes
Incremental changes: “baby steps”
Continual tool/process improvement

52 Rules of Thumb (2)
Mainline has got to work
Get something on the ground. Quickly.

53 Rules of Thumb (3)
Key logs: break them up quickly, ruthlessly
Scaffolding: keep others working
Do important things, not urgent things
Module separation (logical, physical)
If you can’t measure it, you don’t understand it

54 Final Rule: “Sharpen the Saw”
Efficiency is impacted by
–Component coupling / team size
–The compile / load / test / analyze cycle
Tool justification in large teams
–Large gains at large scale
–A 5% gain across 30 programmers
–“Fred Brooks”: the 31st programmer…
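The slide's tool-justification arithmetic, made explicit: a 5% efficiency gain across 30 programmers buys the output of 1.5 programmers, and unlike Brooks's hypothetical 31st hire it adds no communication overhead (team size and gain are the slide's own figures):

```python
# Why tool investment pays off at scale: small per-developer gains
# multiply by head count.
team_size = 30
gain_per_programmer = 0.05
extra_capacity = team_size * gain_per_programmer
print(extra_capacity)   # 1.5 programmer-equivalents, overhead-free
```

The same multiplication that makes mainline breakage expensive makes tooling cheap: both scale linearly with the team.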

55 Recommended Reading
Influences
–Extreme Programming
–Scott Meyers: large-scale software engineering
–Gamma et al.: Design Patterns
Caveat emptor: slavish following is not encouraged
–Consider the “ground conditions” for your project

56 Questions & Answers

