Hitting 60Hz with the Unreal Engine: Inside the Tech of Mortal Kombat vs DC Universe Jon Greenberg Graphics Programming Lead MK Team.


1 Hitting 60Hz with the Unreal Engine: Inside the Tech of Mortal Kombat vs DC Universe Jon Greenberg Graphics Programming Lead MK Team

2 Why Bother?
- In general, “twitch” games require a very high framerate.
- Fast input response demands fast feedback to the player.
- Running at 60Hz is a basic requirement of the fighting genre.

3 Why Is 60 So Rare?
- Very few games target 60Hz (< 10% of games).
- Only 16.7 ms in which to do everything vs 33.3 ms at 30Hz. Implies half the time to do everything… this is not correct.
- In general, this means you have ~1/3 the time, due to fixed cost overhead which can’t be removed.
- Customer doesn’t “care” that you have less time to do everything – still wants the game to look great.
- Game must hit 60Hz on both PS3 and Xbox 360, and both versions must look as close as possible!

4 Why 1/3rd The Time?
- Game must run at >= 60Hz – not allowed to drop frames (bog).
- This means we have to set aside headroom that can absorb instantaneous spikes.
- MK vs DC steady state ~= 9.5 ms per frame.
- Allows for a lot of particle effects and variability. Other genres (even other fighting games) likely need a great deal less slack.
- Philosophy: Always address worst-case scenarios up front.
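The budget arithmetic on these two slides can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the 9.5 ms figure is the steady-state number quoted above.

```cpp
// Frame time budget in milliseconds for a given refresh rate.
constexpr double FrameBudgetMs(double hz) { return 1000.0 / hz; }

// Headroom left to absorb spikes once the steady-state cost is paid.
constexpr double HeadroomMs(double hz, double steadyStateMs) {
    return FrameBudgetMs(hz) - steadyStateMs;
}
```

With these, `FrameBudgetMs(60.0)` is about 16.7 ms and `HeadroomMs(60.0, 9.5)` is about 7.2 ms of slack per frame. The 9.5 ms steady state is roughly 29% of the 33.3 ms available at 30Hz, which is consistent with the "~1/3 the time" claim.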

5 The Problem (part 1)
- Midway had decided to use Unreal Engine 3 (UE3) as basic middleware across all internal games. Using UE3 was required by management.
- UE3 was (is) designed for 30Hz FPS/3rd-person action titles.
- We started with the October 2006 (post Gears of War 1) codebase. Some additional features were taken from Epic à la carte, e.g. MITV, file caching, misc fixes.
- About 22 months to develop the game.

6 The Problem (part 2)
- UE3 brings a lot to the table (nice tools, wide feature set) but imposes a lot of heavy fixed costs.
- There are also some choices made in the engine that have problematic side effects for 60Hz play (UObject overhead, Garbage Collection, etc).
- Out-of-the-box fixed cost baseline (especially GPU) is too high for a 60Hz title.
- E.g., the Oct06 build GPU baseline is ~9 ms.

7 Breaking it Down
- GPU Overhead
  - GPU fixed costs
  - General rendering overhead
  - Multipass overhead
  - Lighting cost
  - Particle cost
- CPU Overhead
  - Particle cost
  - Cloth & Water
  - Render thread virtual overhead/state caching

8 GPU Fixed Costs
- Post-processing
  - Usually the biggest fixed cost.
  - Combine as many operations together as possible to hide work (e.g., Bloom+DOF+Gamma+Resolution retarget).
  - Cut as many corners as possible and special case as necessary – e.g. we use 1 of 3 different DOF methods depending on the case:
    - Normal gameplay: classic blur cross-fade
    - Main Menu/Cinematics: dilating Poisson disc
    - Klose-Kombat: a series of blur planes
  - “Normal” DOF+Bloom effect cost = 1.8 ms

9 Bloom
- Bloom is done a little strangely to compensate for linear color range and not having a separate downsample/blur:
  - Per-environment thresholding value determines which pixels bloom.
  - Thresholding is done inside the downsample pass and written out into the alpha channel as 0 or 1.
  - This bloom mask is then blurred along with color.
- We had separate thresholding and strength values for characters and the general background to allow the two to be tuned differently.
- Character masks were written/read from the stencil buffer.
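The thresholding-inside-the-downsample idea can be sketched on the CPU as follows. This is a minimal illustration, not the game's shader: the `Pixel` type, the function name, the 2x2 box filter, and the Rec. 709 luminance weights are all assumptions, since the slides do not say which brightness measure or filter was used.

```cpp
#include <cstddef>
#include <vector>

struct Pixel { float r, g, b, a; };

// 2x2 box downsample that also writes a 0/1 bloom mask into alpha.
// Assumes width and height are even and src holds width * height pixels.
std::vector<Pixel> DownsampleWithBloomMask(const std::vector<Pixel>& src,
                                           std::size_t width, std::size_t height,
                                           float threshold) {
    const std::size_t outW = width / 2;
    const std::size_t outH = height / 2;
    std::vector<Pixel> dst(outW * outH);
    for (std::size_t y = 0; y < outH; ++y) {
        for (std::size_t x = 0; x < outW; ++x) {
            Pixel sum = {0.0f, 0.0f, 0.0f, 0.0f};
            for (std::size_t dy = 0; dy < 2; ++dy) {
                for (std::size_t dx = 0; dx < 2; ++dx) {
                    const Pixel& p = src[(2 * y + dy) * width + (2 * x + dx)];
                    sum.r += p.r; sum.g += p.g; sum.b += p.b;
                }
            }
            Pixel out = {sum.r * 0.25f, sum.g * 0.25f, sum.b * 0.25f, 0.0f};
            // Assumed brightness measure: Rec. 709 luminance of the averaged color.
            const float luma = 0.2126f * out.r + 0.7152f * out.g + 0.0722f * out.b;
            out.a = (luma > threshold) ? 1.0f : 0.0f;  // bloom mask, blurred later with color
            dst[y * outW + x] = out;
        }
    }
    return dst;
}
```

Because the mask lives in alpha, a later blur pass that filters all four channels would soften the mask together with the color, which matches the description above.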

10 Distortion
- Normal UE3 distortion effect has 3 ms overhead!
- Instead, fold Distortion into Translucency.
- Sample from a snapshot of the opaque pass, and do a depth-based selection to prevent near-distortion.
- Overhead is now just “capturing snapshot”: a copy blit of the color buffer, ~0.4 ms.
- Now usable everywhere!
- Optionally support recapture of the “snapshot” per distorting effect to allow for layered distortion effects as well. Needed for the water level.

11 Motion Blur
- Very expensive to do full-screen.
- Epic doesn’t support motion blurring of skinned geometry!
- Instead, motion blur effects are done by rendering velocity-stretched fading geometry.
- Required changing GPU skinning (PC/360) and Edge (PS3-SPU) to support skinning against previous bone positions.
- Requires a localized blur-only Z-prepass to prevent additive blur effects from blending badly.

12 Shadows and MSAA
- Game made use of MSAA-2x on both platforms.
- Resolving MSAA is very expensive on PS3.
- Combine full-screen modulated shadow blit with MSAA color/depth resolve!
- Hide heavy texture bandwidth operations inside math-heavy shadow work. Shadow ALU overhead is high enough that we can also hide the Distortion copy blit!
- No self-shadowing – disabled via stencil mask.
- Since there’s no self-shadowing anyway, we use proxy shadow characters.
- Total cost ~= 1.33 ms

13 Fog
- Fullscreen per-pixel fog costs ~2 ms on the GPU.
- Visible vertices < visible pixels!
- Per-pixel fog is often overkill. Replaced with per-vertex fog and per-object fog (characters).
- To keep per-vertex costs low, only support 2 active fog actors.
- Height fog is optional, and controlled via static branching.
- Also added optional undulating height fog, via pulsing sine waves through the fog height.
- Dramatically cheaper!
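A per-vertex height fog with an undulating fog height might look like the sketch below. The slides give no formula, so everything here is an assumption made for illustration: the parameter names, the single sine wave along world X, and the linear density ramp.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical parameters; none of these names come from the talk.
struct HeightFogParams {
    float baseHeight;     // resting fog height in world units
    float amplitude;      // how far the fog surface rises and falls
    float waveFrequency;  // spatial frequency of the wave along world X
    float waveSpeed;      // temporal speed of the wave
    float density;        // fog gained per world unit below the surface
};

// Fog surface height at a world X position and time, pulsed by a sine wave.
float UndulatingFogHeight(const HeightFogParams& p, float worldX, float timeSeconds) {
    return p.baseHeight +
           p.amplitude * std::sin(p.waveFrequency * worldX + p.waveSpeed * timeSeconds);
}

// Per-vertex fog factor in [0, 1]: 0 above the fog surface, ramping up below it.
float VertexFogFactor(const HeightFogParams& p, float worldX, float vertexHeight,
                      float timeSeconds) {
    const float depthBelow = UndulatingFogHeight(p, worldX, timeSeconds) - vertexHeight;
    return std::min(1.0f, std::max(0.0f, depthBelow * p.density));
}
```

The factor would be computed once per vertex and interpolated across the triangle, which is where the saving over a full-screen per-pixel pass comes from when visible vertices are fewer than visible pixels.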

14 General Rendering
- 8 bpc render targets, linear color scale of 0..2.
- We light in a combination of γ=1.0 and γ=2.2, depending on what we’re lighting, to save cost.
- Opaque: uses MSAA
- Translucent: post-MSAA resolve
- Heavy use of the PlayStation Edge library for skinned and world geometry on PS3.
- 3D resolution of the game was 1040x624, which was then scaled up to allow the HUD to render at 1280x720.

15 Multipass Overhead
- Pass-per-light overhead is simply too high.
- We’re mostly prelit, so we chose forward rendering.
- Z-Prepass? Typical depth complexity < 1.5.
- Loosely sort opaque objects front to back via “rings of detail”. Removing the Z-prepass saves ~0.75 ms.
- Touch each pixel only once if possible.
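One way to read the loose front-to-back sort is as bucketing by distance band, which avoids a full sort. The slides do not define "rings of detail", so treating rings as fixed-width distance bands is an assumption, as are the `DrawItem` type and the function name.

```cpp
#include <cstddef>
#include <vector>

struct DrawItem { int id; float distance; };  // distance from the camera

// Buckets items into distance rings and emits the rings nearest-first.
// Items inside a ring keep their submission order (no exact sort).
// Assumes ringWidth > 0 and ringCount >= 1.
std::vector<DrawItem> OrderByRings(const std::vector<DrawItem>& items,
                                   float ringWidth, std::size_t ringCount) {
    std::vector<std::vector<DrawItem>> rings(ringCount);
    for (const DrawItem& item : items) {
        std::size_t ring = 0;
        if (item.distance > 0.0f) {
            ring = static_cast<std::size_t>(item.distance / ringWidth);
        }
        if (ring >= ringCount) ring = ringCount - 1;  // everything far away shares the last ring
        rings[ring].push_back(item);
    }
    std::vector<DrawItem> ordered;
    ordered.reserve(items.size());
    for (const std::vector<DrawItem>& ring : rings) {
        ordered.insert(ordered.end(), ring.begin(), ring.end());
    }
    return ordered;
}
```

For example, items at distances 25, 3, 12 and 4 with a ring width of 10 come out in the order 3, 4, 12, 25. With depth complexity under 1.5, a coarse ordering like this is plausibly enough to let early Z rejection stand in for a prepass.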

16 World Lighting (static)
- World is prelit using Illuminate Labs’ Beast, with some “dynamic” RNMs built with Turtle. Dynamic RNMs are animated in materials or via MITVs.
- Prelit lighting was a mix of texture and vertex RNM lighting, with a fast path added to support per-vertex, diffuse-only RNM evaluation for distant objects.

17 World Lighting (dynamic)
- Effect point lighting is done via a mix of per-pixel lighting (floors) and per-vertex lighting (the rest of the environment).
- To account for maximum load, shaders are built with three diffuse-only point lights active and burned into the material.
- No branching! All three lights are always evaluated.
- These lights are globally assigned and managed in a 3-deep FIFO.
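The 3-deep FIFO of effect lights could be sketched as a tiny ring buffer. The class and field names are hypothetical, and the slides do not say how lights are retired, so this sketch only shows the eviction order.

```cpp
#include <array>
#include <cstddef>

struct PointLight {
    float x, y, z;   // position
    float r, g, b;   // color; all zeros contributes nothing
    float radius;
};

// Fixed set of three global effect lights. Pushing a fourth light evicts the
// oldest, so shaders can always evaluate exactly three lights without branching.
class EffectLightFifo {
public:
    static constexpr std::size_t kSlots = 3;

    void Push(const PointLight& light) {
        lights_[next_] = light;          // overwrite the oldest slot
        next_ = (next_ + 1) % kSlots;
    }

    const std::array<PointLight, kSlots>& Slots() const { return lights_; }

private:
    std::array<PointLight, kSlots> lights_{};  // zero-initialised: black lights
    std::size_t next_ = 0;
};
```

Unused slots stay black, so evaluating all three lights every time costs a fixed amount and needs no light count in the shader.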

18 Character Lighting (part 1)
- Custom lighting model: irradiance volume of SH coefficient sets.
- Evaluate gradients to determine an SH set per object.
- Diffuse light the model using only the first 4 coefficients (“ambient” and “directional” terms).
- The 3 effect point lights are evaluated per-vertex and combined into the final diffuse lighting result.
- Spec is faked via power-scaling of (E·N) and multiplying by diffuse lighting.
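Evaluating only the first four SH coefficients amounts to one constant term plus a dot product with the normal. The sketch below handles a single color channel and uses the standard real SH basis constants. The coefficient ordering and the assumption that any cosine convolution is already baked into the coefficients are mine, not the talk's.

```cpp
struct Vec3 { float x, y, z; };

// One color channel of a 4-coefficient SH set.
// c[0] is the band-0 ("ambient") term; c[1..3] are the band-1
// ("directional") terms in the order y, z, x (an assumed layout).
struct SH4 { float c[4]; };

// Evaluates the SH set in the direction of a unit-length normal.
float EvalSH4(const SH4& sh, const Vec3& n) {
    const float kY0 = 0.282095f;  // band-0 basis constant, 1 / (2 * sqrt(pi))
    const float kY1 = 0.488603f;  // band-1 basis scale, sqrt(3) / (2 * sqrt(pi))
    return sh.c[0] * kY0 + kY1 * (sh.c[1] * n.y + sh.c[2] * n.z + sh.c[3] * n.x);
}
```

A full-color version would run this three times, or pack the coefficients so that one matrix multiply yields RGB.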

19 Specularity

20 Particle Effects
- Very large problem. Cascade is not very optimal.
- Solution: port the Cascade runtime to run async on separate worker threads (on SPU on PS3)!
- All emitters for a particle system are updated in a single block of async work (particles, emitter state, system state).
- All particle Modules ported to SPU, except for collision (due to data complexity).

21 Character Lighting (part 2)
- Skin transmission is faked by using (E·N) as the lerp factor between diffuse lighting and the SH ambient term.
- Rim lighting: power-scale (1 - E·N) for falloff and then multiply by a hard threshold of (1 - E·N).
- If the threshold is raised high enough (~0.7), it ends up looking like chrome mapping!
- Final rendering cost ~= 0.8 ms per character.
- Character mesh chunks are batch rendered.
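The rim lighting term described above can be written out as a small function. This is a sketch under assumptions: the function names are hypothetical, the input is taken to be the dot product of the unit eye vector and the unit normal, and the hard threshold is modeled as a simple step.

```cpp
#include <algorithm>
#include <cmath>

// Clamps a value to [0, 1].
float Saturate(float v) { return std::min(1.0f, std::max(0.0f, v)); }

// Rim term: power-scaled (1 - E.N) for falloff, multiplied by a hard
// threshold on (1 - E.N) so that only the silhouette edge lights up.
float RimLight(float eDotN, float power, float threshold) {
    const float rim = 1.0f - Saturate(eDotN);
    const float falloff = std::pow(rim, power);
    const float mask = (rim >= threshold) ? 1.0f : 0.0f;
    return falloff * mask;
}
```

For a surface facing the viewer (E·N = 1) the term is 0, and at a grazing angle (E·N = 0) it is 1. With a threshold near 0.7, the lit band becomes a hard-edged stripe around the silhouette, which may be why the slide compares the result to chrome mapping.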

22 Skin and Metal

23 The Story So Far…
- So far costs are:
  - Misc: ~0.5 ms
  - Shadowmaps: 0.5 ms
  - Characters: 1.6 ms
  - Environment: ~4.X ms
  - MSAA Resolve/Shadow: 1.3 ms
  - PostFX: 1.8 ms
  - Total: ~9.X ms
- What about particle effects?

24 Particle Effects (CPU load)
- All per-particle overhead removed from the Game/Render thread!
- Particle overhead is now a simple linear relationship between system count and emitter count.
- On PC/360, vertex data for sprites is created JIT by an async worker thread.
- No changes/compromises to artist tools or workflow.

25 Particle Effects (SPU load)
- SPUs are extremely fast.
- Just used basic C++ code (including templates and polymorphism). No need to bother with intrinsics or ASM.
- Same module code runs on PS3/360.
- Complex (dependent) DMAs are done synchronously. Simpler to deal with and fast enough that it doesn’t matter.
- Update is done via a SPURS job.

26 Particle Effects (GPU load)
- GPU overhead is less straightforward.
- Attempt 1: Lie to the hardware and tell it we’re in MSAA-4x on a non-MSAA target. Looks okay on wispy stuff in general (smoke, fire, etc.), but looks terrible on 360.

27 Particle Effects (GPU cont…)
- Attempt 2: for somewhat opaque particles, break the effect out into a masked pass and an unmasked pass, sorting particles for a system front to back before rendering to prime Z.
  1. Render particles with alpha-test set to =1.0, front to back.
  2. Render particles with alpha-test set to <1.0, back to front.
- Didn’t help! Alpha-test disables ZCull writes, negating the benefits of the priming pass.

28 Particle Effects (GPU cont…)
- Attempt 3: Observation – for flipbook effects, lots of time is wasted rendering alpha-0 space around meaningful content.
- Idea: For flipbook effects, reduce particle dimensions (and UVs) to bound the content of the particular flipbook page!
- Works great! Dramatic fillrate improvement from doing this (>50%). Requires the artist to identify the channel to scan for image bounds.
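The scan for image bounds can be sketched as a single pass over one channel of a flipbook page. The `TexelRect` type, the function name, and the cutoff parameter are hypothetical. The slides only say that the artist picks the channel to scan.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive texel bounds of the meaningful content on one flipbook page.
struct TexelRect {
    std::size_t minX, minY, maxX, maxY;
    bool empty;  // true when no texel exceeded the cutoff
};

// Scans one channel (row-major, width * height values) of a flipbook page and
// returns the tightest rectangle containing every texel above the cutoff.
TexelRect FindContentBounds(const std::vector<std::uint8_t>& channel,
                            std::size_t width, std::size_t height,
                            std::uint8_t cutoff) {
    TexelRect rect = {width, height, 0, 0, true};
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            if (channel[y * width + x] <= cutoff) continue;
            if (x < rect.minX) rect.minX = x;
            if (y < rect.minY) rect.minY = y;
            if (x > rect.maxX) rect.maxX = x;
            if (y > rect.maxY) rect.maxY = y;
            rect.empty = false;
        }
    }
    return rect;
}
```

The resulting rectangle would then be converted to a UV sub-range and a matching shrink of the sprite quad, for instance `minX / width` to `(maxX + 1) / width` along U. Running this offline per page keeps the runtime cost at a table lookup.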

29 General Render Thread Optimizations
- Lots of work to reduce unnecessary operations.
- Render thread virtuals = death by a thousand paper cuts.
- Cache as much state as possible to reduce redundant virtual calls. E.g., replaced FMaterialRenderProxy’s GetMaterial virtual call with a caching call.
- Remove tons of unneeded repeated calls to GetXXX() state accessors (e.g., GetPixelShader) from inside Shader processing.
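The caching pattern can be illustrated with a non-virtual accessor that resolves through the virtual call once and then returns the stored pointer. The classes below are hypothetical stand-ins and do not reproduce UE3's FMaterialRenderProxy interface. `CountingProxy` exists only to make the number of virtual calls observable.

```cpp
struct Material { int id; };

class MaterialProxyBase {
public:
    virtual ~MaterialProxyBase() = default;

    // Non-virtual accessor: pays for the virtual call only on a cache miss.
    const Material* GetMaterialCached() const {
        if (cached_ == nullptr) cached_ = ResolveMaterial();
        return cached_;
    }

    // Must be called whenever the underlying material can change.
    void InvalidateMaterialCache() { cached_ = nullptr; }

protected:
    virtual const Material* ResolveMaterial() const = 0;

private:
    mutable const Material* cached_ = nullptr;
};

// Example subclass that counts how often the virtual path is taken.
class CountingProxy : public MaterialProxyBase {
public:
    explicit CountingProxy(const Material* material) : material_(material) {}
    mutable int resolveCalls = 0;

protected:
    const Material* ResolveMaterial() const override {
        ++resolveCalls;
        return material_;
    }

private:
    const Material* material_;
};
```

Calling `GetMaterialCached()` many times per frame then costs one virtual call plus cheap pointer checks. The trade-off is that every code path that changes the material has to remember to invalidate the cache.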

30 Misc Further optimizations
- Cloth simulation moved to run async in another thread (SPU on PS3).
- Epic’s water simulation code ported to run on SPU on PS3.
- Animation is still synchronous and Game-thread based, but doesn’t use AnimTrees. Very limited blend options for designers.
- No Occlusion pass – Vis is simple frustum culling.
- Lots of work to reduce the amount of memory allocation via pools and isolated heaps. Still, it accounts for 25% of CPU time.

31 Garbage Collection
- Based on work by the Stranglehold team.
- Not quite as aggressive as they were, but removes all live calling of GC from gameplay – only called when exiting modes.
- Memory management switched to deferred (by a frame) cleanup of UObjects/AActors.
- All “loaded” data trapped via Rootset.
- Introduces the UResource class, a reference-counting UObject.
- All USurface-derived classes (e.g., UMaterial, UTexture, etc) are reference counted via UResource to prevent unwanted deletion.
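Deferring cleanup by a frame can be sketched with two queues that rotate at the end of each frame. This is a generic illustration, not the game's UObject code: `DeferredDeleter` and its methods are hypothetical names, and plain `delete` stands in for whatever destruction path the engine used.

```cpp
#include <cstddef>
#include <vector>

// Objects queued during frame N are destroyed at the end of frame N + 1,
// so anything still referencing them during frame N stays valid.
template <typename T>
class DeferredDeleter {
public:
    DeferredDeleter() = default;
    DeferredDeleter(const DeferredDeleter&) = delete;
    DeferredDeleter& operator=(const DeferredDeleter&) = delete;
    ~DeferredDeleter() { Destroy(previous_); Destroy(current_); }

    void QueueForDelete(T* object) { current_.push_back(object); }

    // Call once per frame, after all work for the frame has finished.
    void EndFrame() {
        Destroy(previous_);        // queued one frame ago: safe to free now
        previous_.swap(current_);  // this frame's queue waits one more frame
    }

    std::size_t PendingCount() const { return previous_.size() + current_.size(); }

private:
    static void Destroy(std::vector<T*>& list) {
        for (T* object : list) delete object;
        list.clear();
    }

    std::vector<T*> current_;
    std::vector<T*> previous_;
};
```

An object queued mid-frame survives the next `EndFrame()` call and is freed on the one after, which gives in-flight render or worker-thread work a full frame to stop using it.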

32 Additional Game Details
- We don’t use UnrealScript. Minimally use Kismet. Use our own scripting engine (C/C++-ish) for AI, object management, menu logic, etc.
- Game scripts are expected to manage resource lifetimes.
- Main advantage – dynamically reloadable for fast iteration!
- MKScripts describe resource usage to determine cooked resources that need to be added to characters/backgrounds.

33 Artist Limitations
- UE3 gives artists a lot of rope to hang themselves with.
- Big thing was to limit who could use the Material Editor.
- All character art uses the same small set of materials.
- Characters budgeted at 20k polys visible at a time.
- Backgrounds budgeted based on visible object count and storage limitations more than polycount.
- Environment material/lighting complexity managed by the background lead to ensure overall performance hit GPU performance targets, with various metrics helping to tell them where they were.

34 General Recommendations for hitting 60Hz in UE3
- Budget performance up front!
- Given Edge and the 360’s unified shaders, geometry is less of a problem than fillrate.
- Predetermine valid PostFx and hardwire the majority of permutations.
- Reduce dynamic, critical-sectioned memory allocation as much as possible. It massively stalls all performance.
- Use pool allocators whenever possible, and watch for reallocs.
- Force designers and artists to run with performance metrics on!
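A fixed-size pool allocator of the kind recommended above can be sketched in a few dozen lines. This is a minimal single-threaded illustration with hypothetical names, not the allocator the game shipped. It reserves its free list up front so that neither `Allocate` nor `Free` touches the general heap.

```cpp
#include <cstddef>
#include <vector>

// Hands out fixed-size blocks from one up-front allocation.
// Not thread-safe: a per-thread pool avoids the critical section entirely.
class FixedPool {
public:
    FixedPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(RoundUp(blockSize)), storage_(blockSize_ * blockCount) {
        freeList_.reserve(blockCount);  // no reallocation after construction
        for (std::size_t i = blockCount; i > 0; --i) {
            freeList_.push_back(storage_.data() + (i - 1) * blockSize_);
        }
    }

    // Returns a block, or nullptr when the pool is exhausted.
    void* Allocate() {
        if (freeList_.empty()) return nullptr;
        void* block = freeList_.back();
        freeList_.pop_back();
        return block;
    }

    // Returns a block obtained from Allocate() to the pool.
    void Free(void* block) { freeList_.push_back(static_cast<unsigned char*>(block)); }

    std::size_t FreeCount() const { return freeList_.size(); }

private:
    // Rounds up so every block is suitably aligned for any fundamental type.
    static std::size_t RoundUp(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        if (size == 0) size = 1;
        return ((size + align - 1) / align) * align;
    }

    std::size_t blockSize_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> freeList_;
};
```

Exhaustion returns `nullptr` rather than growing, which makes budget overruns visible instead of hiding them behind a silent reallocation.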

35 Recommendations for hitting 60Hz in UE3 on PS3 (well, and 360)
- Consider what can be deferred and/or made to run async, and consider moving that work.
- Consider using Edge on PS3.
- Even sync’d work can be done way faster on SPU if divided over multiple SPUs/threads!
- Don’t be intimidated by the SPUs on PS3. Prototype SPU code on 360/PC where it’s easier to debug.
- Template-heavy C++ might not be the ideal performance case for SPUs, but it is certainly a LOT better than not using them at all.

36 Things We Have Yet to Address
- Serialization – as we tend to only stream content underneath movie playback or load screens, the CPU impact wasn’t too problematic for us, though it does impact load times.
- Animation – need to explore making it run on worker threads/SPU for deferrable (background and LOD’d) objects.

37 Acknowledgements
- Nathan Mefford
- Chicago ATG Team

38 Questions?
- Thanks for listening!

