GDC Tutorial, 2005. Building Multi-Player Games Case Study: The Sims Online Lessons Learned, Larry Mellon.


1 GDC Tutorial, 2005. Building Multi-Player Games Case Study: The Sims Online Lessons Learned, Larry Mellon

2 TSO: Overview
- Initial team: little to no MMP experience
  - Engineering estimate: switching from 4-8 player peer to peer to MMP client/server would take no additional development time!
- No code / architecture / tool support for:
  - Long-term, continually changing nature of game
  - Non-deterministic execution, dual platform (win32 / Linux)
- Overall process designed for single-player complexity, small development team
  - Limited nightly builds, minimal daily testing
  - Limited design reviews, limited scalability testing, no “maintainable/extensible” implementation requirement

3 TSO: Case Study Outline (Lessons Learned)
- Poorly designed SP → MP → MMP transitions
- Scaling
  - Team & code size, data set size
  - Build & distribution
  - Architecture: logical & code
- Visibility: development & operations
- Testability: development, release, load
- Multi-Player, Non-determinism
- Persistent user data vs code/content updates
- Patching / new content / custom content

4 Scalability (Team Size & Code Size)
- What were the problems
  - Side effect breaks & ability to work in parallel
  - Limited encapsulation + poor testability + non-determinism = TROUBLE
  - Independent module design & impact on overall system (initially, no system architect)
  - #include structure
    - win32 / Linux, compile times, pre-compiled headers, ...
- What worked
  - Move to new architecture via Refactoring & Scaffolding
    - HSB, incSync, nullView Simulator, nullView client, …
  - Rolling integrations: never dark
  - Sandboxing & pumpkins

5 Scalability (Build & Distribution)
- To developers, customers & fielded servers
- What didn’t work (well enough)
  - Pulling builds from developers’ workstations
  - Shell scripts & manual publication
- What worked well
  - Heavy automation with web tracking
  - Repeatability, Speed, Visibility
  - Hierarchies of promotion & test

6 Scalability (Architecture)
- Logical versus physical versus code structure
  - Only physical was not a major, MAJOR issue
- Logical: Replicated computing vs client / server
  - Security & stability implications
- Code: Client / server isolation & code sharing
  - Multiple, concurrent logic threads were sharing code & data, each impacting the others
  - Nullview client & simulator
- Regulators vs Protocols: bug counts & state machines

7 Go to final architecture ASAP
[Diagram labels: “Multiplayer”: four Client/Sim pairs (“Here be Sync Hell”); “Evolve”; “Client/Server”: Clients and a Sim linked by Request/Command (“Nice, Undemocratic”)]

8 Evolve Final Architecture ASAP: Make Everything Smaller&Separate

9 Final Architecture ASAP: Reduce Complexity of Branches
[Diagram labels: Packet Arrival; If (client); If (server); #ifdef (nullview); Shared Code; Shared State; Client Event; Server Event; More Packets!!]
- Client & server teams would constantly break each other via changes to shared state & code

10 Final Architecture ASAP: “Refactoring”
- Decomposed into multiple DLLs
- Found the Simulator
- Interfaces
- Reference Counting
- Client/Server subclassing
- How it helped:
  - Reduced coupling. Even reduced compile times!
  - Developers in different modules broke each other less often.
  - We went everywhere and learned the code base.
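
The slides name "Client/Server subclassing" and "Interfaces" as the replacement for `If (client)` / `If (server)` branches in shared code, but show no code. A minimal sketch of that idea in Python follows (TSO itself was not written in Python; the class and method names here are hypothetical, chosen only for illustration):

```python
from abc import ABC, abstractmethod


class PacketHandler(ABC):
    """Shared packet-arrival path; role-specific behaviour lives in subclasses."""

    def __init__(self):
        self.log = []

    def on_packet(self, packet):
        # Shared code: validate once, then hand off to the role-specific hook.
        if "type" not in packet:
            raise ValueError("packet has no type")
        self.handle(packet)

    @abstractmethod
    def handle(self, packet):
        """Role-specific reaction to a validated packet."""


class ClientHandler(PacketHandler):
    def handle(self, packet):
        # Hypothetical client behaviour: apply what the server decided.
        self.log.append(f"client applied {packet['type']}")


class ServerHandler(PacketHandler):
    def handle(self, packet):
        # Hypothetical server behaviour: validate the client's request.
        self.log.append(f"server validated {packet['type']}")


client, server = ClientHandler(), ServerHandler()
client.on_packet({"type": "move"})
server.on_packet({"type": "move"})
```

With this shape, a change to client behaviour touches only `ClientHandler`, so the shared path carries no role checks for the other team to break.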

11 Final Architecture ASAP: It Had to Always Run
- Initially clients wouldn’t behave predictably
- We could not even play test
- Game design was demoralized
- We needed a bridge, now!

12 Final Architecture ASAP: Incremental Sync
- A quick temporary solution…
  - Couldn’t wait for final system to be finished
  - High overhead, couldn’t ship it
- We took partial state snapshots on the server and restored to them on the client
- How it helped:
  - Could finally see the game as it would be.
  - Allowed parallel game design and coding.
  - Bought time to lay in the “right” stuff.
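
The snapshot-and-restore step described above can be sketched roughly as follows. This is only an illustration of the general idea, not TSO's incSync code: the state layout, the key names and both function names are assumptions.

```python
import copy


def take_partial_snapshot(server_state, keys):
    """Copy only the requested slices of server state (hypothetical layout)."""
    return {k: copy.deepcopy(server_state[k]) for k in keys if k in server_state}


def restore_snapshot(client_state, snapshot):
    """Overwrite the client's copy of each snapshotted slice with the server's."""
    for key, value in snapshot.items():
        client_state[key] = copy.deepcopy(value)


# The client has drifted: its Sim position and tick no longer match the server.
server_state = {"sims": {"bob": {"x": 4, "y": 7}}, "objects": {"lamp": "on"}, "tick": 120}
client_state = {"sims": {"bob": {"x": 3, "y": 7}}, "objects": {"lamp": "on"}, "tick": 118}

snapshot = take_partial_snapshot(server_state, ["sims", "tick"])
restore_snapshot(client_state, snapshot)
```

Copying whole slices on every sync is simple but expensive, which fits the slide's note that the overhead was too high to ship.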

13 Architecture: Conclusions
- Keep it simple, stupid!
  - Client/server
- Keep it clean
  - DLL/module integration points
  - #ifdef’s must die!
- Keep it alive
  - Plan for a constant system architect role: review all modules for impact on team, other modules & extensibility
- Expose & control all inter-process communication
  - See Regulators: state machines that control transactions
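
The slides describe Regulators only as "state machines that control transactions". One way such a state machine might look is sketched below; the states, events and class layout are invented for illustration and are not taken from TSO.

```python
class Regulator:
    """Hypothetical transaction state machine: only listed transitions are legal."""

    TRANSITIONS = {
        ("idle", "send_request"): "awaiting_reply",
        ("awaiting_reply", "reply_ok"): "done",
        ("awaiting_reply", "reply_error"): "failed",
        ("awaiting_reply", "timeout"): "failed",
    }

    def __init__(self, name):
        self.name = name
        self.state = "idle"
        self.history = ["idle"]

    def fire(self, event):
        key = (self.state, event)
        if key not in self.TRANSITIONS:
            # An unexpected message is reported at once instead of corrupting state.
            raise RuntimeError(
                f"{self.name}: event {event!r} not allowed in state {self.state!r}"
            )
        self.state = self.TRANSITIONS[key]
        self.history.append(self.state)
        return self.state


buy = Regulator("buy_object")
buy.fire("send_request")
buy.fire("reply_ok")
```

Because every inter-process exchange passes through an explicit transition table, an out-of-order message fails loudly at the point where it arrives.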

14 TSO: Case Study Outline (Lessons Learned)
- Poorly designed SP → MP → MMP transitions
- Scaling
  - Team & code size, data set size
  - Build & distribution
  - Architecture: logical & code
- Visibility: development & operations
- Testability: development, release, load
- Multi-Player, Non-determinism
- Persistent user data vs code/content updates
- Patching / new content / custom content

15 Visibility
- Problems
  - Debugging a client/server issue was very slow & painful
  - Knowing what to work on next was largely guesswork
  - Reproducing system failures from live environment
  - Knowing how one build or server cluster differed from another was again largely guesswork
- What we did that worked
  - Log / crash aggregators & filters
  - Live “critical event” monitor
  - Esper: live player & engine metrics
  - Repeatable load testing
  - Web-based Dashboard: health, status, where is everything
  - Fully automated build & publish procedures

16 Visibility via “Bread Crumbs”: Aggregated Instrumentation Flags Trouble Spots Server Crash

17 Quickly Find Trouble Spots DB byte count oscillates out of control, server crashes

18 Drill Down For Details A single DB Request is clearly at fault
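
Slides 16 to 18 walk from an aggregated view down to the single DB request at fault. A minimal sketch of that summarize-then-drill-down pattern, assuming instrumentation records with made-up field names and values (none of this data comes from TSO):

```python
from collections import defaultdict


def aggregate(records, key):
    """Sum the byte counts of instrumentation records, grouped by one field."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += rec["bytes"]
    return dict(totals)


# Hypothetical instrumentation records gathered from server logs.
records = [
    {"subsystem": "db", "request": "load_house", "bytes": 900_000},
    {"subsystem": "db", "request": "save_avatar", "bytes": 2_000},
    {"subsystem": "net", "request": "chat", "bytes": 5_000},
    {"subsystem": "db", "request": "load_house", "bytes": 1_100_000},
]

# Summary view: which subsystem moves the most bytes?
by_subsystem = aggregate(records, "subsystem")
worst = max(by_subsystem, key=by_subsystem.get)

# Drill down: within that subsystem, which request type is responsible?
detail = aggregate([r for r in records if r["subsystem"] == worst], "request")
culprit = max(detail, key=detail.get)
```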

19 TSO: Case Study Outline (Lessons Learned)
- Poorly designed SP → MP → MMP transitions
- Scaling
  - Team & code size, data set size
  - Build & distribution
  - Architecture: logical & code
- Visibility: development & operations
- Testability: development, release, load
- Multi-Player, Non-determinism
- Persistent user data vs code/content updates
- Patching / new content / custom content

20 Testability
- Development, release, load: all show stopper problems
- QA coordination / speed / cost
- Repeatability, non-determinism
- Need for many, many tests per day, each with multiple inputs (two to two thousand players per test)

21 Testability: What Worked
- Automated testing for repeatability & scale
  - Scriptable test clients: mirrored actual user play sessions
  - Changed the game’s architecture to increase testability
  - External test harnesses to control 50+ test clients per CPU, 4,000+ per session
  - Push-button UI to configure, run & analyze tests (developer & QA)
  - Constantly updated Baselines, with “Monkey Test” stats
  - Pre-checkin regression
- QA: web-driven state machine to control testers & collect/publish results
- What didn’t work
  - Event Recorders, unit testing
  - Manual-only testing

22 MMP Automated Testing: Approach
- Push-button ability to run large-scale, repeatable tests
- Cost
  - Hardware / Software
  - Human resources
  - Process changes
- Benefit
  - Accurate, repeatable, measurable tests during development and operations
  - Stable software, faster, measurable progress
  - Base key decisions on fact, not opinion

23 Why Spend The Time & Money?
- System complexity, non-determinism, scale
  - Tests provide hard data in a confusing sea of possibilities
- End users: high Quality of Service bar
- Dev team: greater comfort & confidence
- Tools augment your team’s ability to do their jobs
  - Find problems faster
  - Measure / change / measure: repeat as necessary
- Production & executives: come to depend on this data to a high degree

24 Scripted Test Clients
- Scripts are emulated play sessions: just like somebody plays the game
  - Command steps: what the player does to the game
  - Validation steps: what the game should do in response
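
A script built from command steps and validation steps could be sketched as below. The step format, the `FakeGame` stand-in and its commands are hypothetical, since the slides do not show TSO's actual script syntax.

```python
class FakeGame:
    """Stand-in for the game client; real scripts would drive the actual client."""

    def __init__(self):
        self.state = {"room": "lobby", "money": 100}

    def execute(self, command, *args):
        if command == "enter_room":
            self.state["room"] = args[0]
        elif command == "buy":
            # args are (item_name, price)
            self.state["money"] -= args[1]
        else:
            raise ValueError(f"unknown command {command!r}")


def run_script(game, script):
    """Run each step in order; collect validation failures instead of stopping."""
    failures = []
    for step_no, (kind, *payload) in enumerate(script, start=1):
        if kind == "command":
            game.execute(*payload)  # what the player does to the game
        elif kind == "validate":
            field, expected = payload  # what the game should do in response
            actual = game.state.get(field)
            if actual != expected:
                failures.append((step_no, field, expected, actual))
        else:
            raise ValueError(f"unknown step kind {kind!r}")
    return failures


script = [
    ("command", "enter_room", "house_1"),
    ("validate", "room", "house_1"),
    ("command", "buy", "chair", 30),
    ("validate", "money", 70),
]
failures = run_script(FakeGame(), script)
```

Collecting failures rather than aborting on the first one lets a single run report every broken expectation in a play session.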

25 Scripts Tailored To Each Test Application
- Unit testing: 1 feature = 1 script
- Load testing: Representative play session
  - The average Joe, times thousands
- Shipping quality: corner cases, feature completeness
- Integration: test code changes for catastrophic failures

26 Scripted Players: Implementation
[Diagram labels: Test Client (Script Engine) and Game Client (Game GUI); Commands and State pass through the Presentation Layer to the Client-Side Game Logic]
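
Read as an architecture, this slide suggests that the script engine and the game GUI are interchangeable drivers sitting on one presentation layer. A rough sketch under that reading, with all class and method names invented for illustration:

```python
class ClientLogic:
    """Stand-in for the client-side game logic."""

    def __init__(self):
        self.position = (0, 0)

    def move_to(self, x, y):
        self.position = (x, y)


class PresentationLayer:
    """Single entry point: commands go in, state comes out."""

    def __init__(self, logic):
        self._logic = logic

    def send_command(self, name, *args):
        getattr(self._logic, name)(*args)

    def read_state(self, field):
        return getattr(self._logic, field)


class GuiDriver:
    """What the game GUI would do: turn a mouse click into a command."""

    def __init__(self, layer):
        self.layer = layer

    def on_click(self, x, y):
        self.layer.send_command("move_to", x, y)


class ScriptDriver:
    """What the script engine would do: replay commands without any GUI."""

    def __init__(self, layer):
        self.layer = layer

    def run(self, steps):
        for name, args in steps:
            self.layer.send_command(name, *args)


gui_layer = PresentationLayer(ClientLogic())
GuiDriver(gui_layer).on_click(5, 9)

script_layer = PresentationLayer(ClientLogic())
ScriptDriver(script_layer).run([("move_to", (5, 9))])
```

Both drivers reach the game logic through the same two calls, so a scripted client exercises the same code path a human player would.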

27 Process Shift: Earlier Tools Investment Equals More Gain
[Chart “MMP Developer Efficiency”: amount of work done vs. time, from Project Start to Target Launch; labels: Strong test support, Weak test support, Not Good Enough]

28 Process Shifts: Automated Testing Changes The Shape Of The Development Progress Curve
- Scale & Feature Completeness
- Keep Developers moving forward, not bailing water
- Stability (Code Base & Servers)
- Focus Developers on key, measurable roadblocks

29 Process Shift: Measurable Targets, Projected Trend Lines
[Chart labels: Core Functionality Tests; Any Feature (e.g. # clients); Time; First Passing Test; Now; Any Time (e.g. Alpha); Target; Complete]
- Actionable progress metrics, early enough to react
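
A projected trend line can be as simple as a straight-line fit through the measurements taken so far, extended to the target. The sketch below uses ordinary least squares on invented numbers (day of measurement, peak concurrent test clients sustained); neither the data nor the function name comes from the talk.

```python
def project_completion(samples, target):
    """Fit y = slope * day + intercept by least squares and return the day on
    which the fitted line reaches `target`, or None if there is no upward trend."""
    n = len(samples)
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(value for _, value in samples) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in samples)
    if sxx == 0:
        raise ValueError("need measurements from at least two different days")
    sxy = sum((day - mean_x) * (value - mean_y) for day, value in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or falling trend: the target is never reached
    intercept = mean_y - slope * mean_x
    return (target - intercept) / slope


# Hypothetical measurements: (day, max clients sustained in the load test).
samples = [(0, 100), (10, 300), (20, 500), (30, 700)]
projected_day = project_completion(samples, target=4000)
```

Here the fitted line gains 20 clients per day from a base of 100, so a 4,000-client target is projected for day 195. If that lands after the milestone date, the team knows early enough to react.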

30 Process Shift: Load Testing (Before Paying Customers Show Up)
- Expose issues that only occur at scale
- Establish hardware requirements
- Establish play is acceptable @ scale

31 Client-Server Comparison

32 TSO: Case Study Outline (Lessons Learned)
- Poorly designed SP → MP → MMP transitions
- Scaling
  - Team & code size, data set size
  - Build & distribution
  - Architecture: logical & code
- Visibility: development & operations
- Testability: development, release, load
- Multi-Player, Non-determinism
- Persistent user data vs code/content updates
- Patching / new content / custom content

33 User Data
- Oops!
  - Users stored much more data (with much more variance) than we had planned for
  - Caused many DB failures, city failures
  - BIG problem: their persistent data has to work, always, across all builds & DB instances
- What helped
  - Regression testing, each build, against live set of user data
- What would have helped more
  - Sanity checks against the DB
  - Range checks against user data
  - Better code & architecture support for validation of user data
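
A range check against user data, of the kind the slide says would have helped, might look like the sketch below. The field names and limits are placeholders invented for illustration, not TSO's actual schema.

```python
# Hypothetical per-field limits: (lowest allowed, highest allowed).
LIMITS = {
    "money": (0, 10_000_000),
    "objects_owned": (0, 5_000),
    "friends": (0, 1_000),
}


def check_record(record, limits=LIMITS):
    """Return a list of problems found in one user record (empty means OK)."""
    problems = []
    for field, (low, high) in limits.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{field}: not an integer ({value!r})")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


good = {"money": 250, "objects_owned": 12, "friends": 3}
bad = {"money": -5, "objects_owned": 12}
good_problems = check_record(good)
bad_problems = check_record(bad)
```

Running such a check on every build against a copy of live user data would surface out-of-range records before they reach a live city.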

34 Patching / New Content / Custom Content
- Oops!
  - Initial Patch budget of 1 Meg blown in 1st week of operations
  - New Content required stronger, more predictable process
  - Custom Content required infrastructure able to easily add new content, on the fly
- Key Issue: all effort had gone into going Live, not creating a sustainable process once Live
- Conclusion: designing these in would have been much easier than retrofitting…

35 Lessons Learned
- autoTest: Scripted test clients and instrumented code rock!
  - Collection, aggregation and display of test data is vital in making decisions on a day to day basis
  - Lessen the panic
  - Scale & Break is a very clarifying experience
  - Stable code & servers greatly ease the pain of building a MMP game
  - Hard data (not opinion) is both illuminating and calming
- autoBuild: make it pushbutton with instant web visibility
  - Use early, use often to get bugs out before going live
- Budget for a strong architect role & a strong design review process for the entire game lifecycle
  - Scalability, testability, patching & new content & long-term persistence are requirements: MUCH cheaper to design in than frantic retrofitting
  - KISS principle is mandatory, as is expecting changes

36 Lessons Learned
- Visibility: tremendous volumes of data require automated collection & summarization
  - Provide drill-down access to details from summary view web pages
- Get some people on board who’ve been burned before: a lot of TSO’s pain could have been easily avoided, but little distributed system experience & MMP design issues existed in early phases of project
  - Fred Brooks, the 31st programmer
- Strong tools & process pays off for large teams & long-term operations
  - Measure & improve your workspace, constantly
- Non-determinism is painful & unavoidable
  - Minimize impact via explicit design support & use strong, constant calibration to understand it

37 Biggest Wins
- Code Isolation
- Scaffolding
- Tools: Build / Test / Measure, Information Management
- Pre-Checkin Regression / Load Testing

38 Biggest Losses
- Architecture: Massively peer to peer
- Early lack of tools
- #ifdef across platform / function
- “Critical Path” dependencies
More Details: www.maggotranch.com/MMP (3 TSO Lessons Learned talks)

