4 Graphics & WDDM
[Diagram labels: Session Space, Kernel Mode, Win32 kernel, Dxgkrnl, Kernel Mode Driver (KMD), User Mode, Application Process, Application, D3D Runtime, User Mode Driver (UMD), DWM Process, DWM]

Before we take a more in-depth look at what GPUView is, and how it can be used to analyze a DX11 game, we first need an understanding of the WDDM (Windows Display Driver Model). As of Windows Vista, and now Windows 7, the graphics driver model changed significantly. There was an emphasis on keeping as much as possible in user mode. Hence you see here that everything below the central blue line represents user mode work, and everything above it represents kernel mode work.
On the bottom right we have the DWM (Desktop Window Manager), which is essentially responsible for rendering the desktop windows themselves. We shall see later how this shows up in GPUView.
On the bottom left we have an application (your game) making calls into the D3D runtime, which in turn talks to the user mode graphics driver (UMD) supplied by the graphics hardware vendor.
Above the line we see that the D3D runtime and UMD interact with the Win32 kernel, Dxgkrnl, and the graphics kernel mode driver (KMD).
So that’s the setup we’re working with on a PC.
5 Feeding the GPU…
4/6/2017 2:25 PM
[Diagram labels: GPU, GPU Scheduler Database, 1, 2, DMA Buffer, Wait, Win32k & dxgkrnl, KMD, Kernel Mode, D3D Runtime, UMD, Application #1, Command Buffer, Application #2, Command Buffer, User Mode]

Just to continue painting the picture of what is happening, let’s look at how the GPU is being fed.
Again, at the bottom of the slide we have user mode work. In this case we have multiple D3D applications talking with the D3D runtime and the UMD, and building command buffers.
Above the line, in kernel mode, the Win32 kernel and dxgkrnl talk with the KMD and produce DMA packets. These are placed in a queue for the relevant GPU. The GPU scheduler finally places the packets in a ring buffer for the GPU to process.
The key takeaway from this slide is that feeding the GPU is actually not that simple. That is partly why GPUView was created: so that developers could get a good handle on what’s going on in the system.
7 What is GPUView?
An additional Microsoft performance tool
Complements existing tools
Part of the Windows 7 SDK
Built on Event Tracing for Windows
Perfect for monitoring CPU/GPU interaction (even for multiple GPU setups)
Allows you to see how well the GPU is being fed
Supports DX9, DX10 & DX11 on Win7

GPUView is an additional tool you should have in the box. As you will see, it adds something on top of traditional profilers for both the CPU and the GPU. It was introduced by Microsoft as part of the Windows 7 SDK, and is essentially built on ETW (Event Tracing for Windows).
The tool does work on Windows Vista, but only with a limited set of counters, so we would recommend using Windows 7. Windows XP is not supported.
So what is GPUView good for? Basically it is perfect for monitoring CPU/GPU interaction, keeping track of how well threaded your app is, and even monitoring GPU/GPU interaction.
You’ll be able to see just how well you’re feeding the GPU.
It supports DX9, DX10 and DX11, but only on Windows 7.
8 Capturing Data
Run an elevated command prompt
\Program Files\Microsoft Windows Performance Toolkit\GPUView
Start your game in windowed mode
For fullscreen mode perhaps use PsExec from a remote machine
Start capturing with log.cmd
Capture a few seconds of your game
Stop logging with log.cmd
Open merged.etl file with GPUView.exe

As already mentioned, GPUView essentially works on ETW, so to perform a capture you basically need to start logging. To do this, navigate to the GPUView install folder in an elevated command prompt and start capturing with log.cmd. To stop capturing you simply run log.cmd again.
It may be easiest to run your game in windowed mode and simply alt-tab between the command window and your game window. It may sometimes be necessary to run your game in fullscreen mode (maybe you wish to look at an MGPU setup). In this case you can still alt-tab if you wish, or you could use a handy tool called PsExec to start the logging from a remote machine.
It makes sense to capture for a few (10 or so) seconds, and then stop. If you capture for too long, the ETL (Event Trace Log) file will be rather large, which may make viewing the results slow.
Once capturing has stopped you can pick up the results in the produced Merged.etl file, and view them with GPUView.exe.
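The capture steps above can be scripted. The following is only a sketch under stated assumptions: the function name `capture_trace` and its parameters are hypothetical, the GPUView folder is passed in by the caller because the drive is not given on the slide, and it assumes Merged.etl is written into that same folder. It must be run from an elevated process on Windows.

```python
import ntpath
import subprocess
import time


def capture_trace(gpuview_dir, seconds=10.0, run=subprocess.run, sleep=time.sleep):
    """Start logging with log.cmd, wait, stop logging with log.cmd again.

    Returns the path where Merged.etl is assumed to be written
    (assumption: log.cmd writes it into gpuview_dir).
    """
    log_cmd = ntpath.join(gpuview_dir, "log.cmd")
    # First invocation starts the ETW logging session.
    run(["cmd", "/c", log_cmd], cwd=gpuview_dir, check=True)
    # Keep the capture short (10 or so seconds) so the log stays manageable.
    sleep(seconds)
    # Second invocation stops logging.
    run(["cmd", "/c", log_cmd], cwd=gpuview_dir, check=True)
    return ntpath.join(gpuview_dir, "Merged.etl")
```

The `run` and `sleep` parameters exist only so the sequence can be exercised without actually starting a trace. The returned file is the one you would then open with GPUView.exe.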
9 Was this tool created for driver programmers?

So when you open up one of the merged.etl files, this is what you’ll be greeted with…
I know what you’re thinking. No really, I know what you’re thinking: was this tool created for driver programmers? Yes. And the reason has been alluded to already: the WDDM, and feeding the GPU, are more intricate than some may have thought.
So GPUView is not the prettiest of performance tools, but that doesn’t make it any less useful…
10 Navigating the Data
Use the mouse to select a region
Ctrl+Z zooms in to a selection
Z zooms out
Use +/- to see more or less detail
Ctrl+E opens the event menu
Click on objects for additional details
More on this later…

The main controls you need to be aware of are:
Use the mouse to select a region of the capture you’re interested in, and then hit Ctrl+Z to zoom into that region. Z zooms back out.
+/- lets you see more or fewer threads and processes.
Simply clicking on a data object will bring up its details.
11 Zooming in…

So here we have zoomed in significantly from the original startup view you saw.
Along the top you can see the grey timeline, which shows both the offset into the log and the unit of measurement. The unit here is 1 millisecond, and we are roughly 18 seconds into the log.
Immediately below that we have the HW GPU queue. The colored bars represent DMA packets of work sitting in the queue. Different colors mean different things, and we’ll come to those later. The height of the stack of bars represents the depth of the queue. Bars in the bottom row represent work being executed by the GPU, and bars above the bottom row are waiting in the queue to be executed. We will look in more detail at what all of this means.
At the bottom we have the SW CPU queue. Again the colored bars represent packets of work sitting in the queue, and as above the height of the stack represents the depth of the queue.
So let’s look at both the GPU and CPU queues in more detail, and explain a little more about what all this color means.
12 DMA Packet Color Coding
Various types of DMA packets may be submitted to the GPU:
Red: Paging packet
Black: Preemption packet
Brown: DWM packet
Other Color: Standard packet
Other Color + Cross-Hatch: Present packet

I have already mentioned that the colored bars represent different kinds of DMA packets. We have met the bottom 3 so far, but not a paging packet, and we shall come to that later on.
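As a quick reference, the color legend can be written down as a lookup. This is a minimal sketch; the function `classify_packet` and the color strings are hypothetical names used for illustration, not anything GPUView exposes.

```python
# Colors reserved by the legend on this slide.
RESERVED_COLORS = {"red": "paging", "black": "preemption", "brown": "DWM"}


def classify_packet(color, cross_hatched=False):
    """Map a bar's appearance to the DMA packet type from the legend."""
    kind = RESERVED_COLORS.get(color.lower())
    if kind is not None:
        return kind
    # Any other color: cross-hatched bars are present packets,
    # solid bars are standard rendering packets.
    return "present" if cross_hatched else "standard"


print(classify_packet("red"))                       # paging
print(classify_packet("blue"))                      # standard
print(classify_packet("blue", cross_hatched=True))  # present
```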
13 What does a Standard DMA Packet Represent?
Graphics system state objects
Draw commands
References to resource allocations
  Textures
  Vertex & Index Buffers
  Render Targets
  Constant Buffers
15 SW Context CPU Queues (1)
D3D app stacking up 3 frames of packets
Desktop Window Manager packet

Here we have 2 SW CPU queues. The top one is the DWM (Desktop Window Manager). We showed this on some of the first slides, and it is responsible for rendering the desktop windows themselves. It’s always colored brown, and later on we’ll see how it gets added into the GPU HW queue for rendering.
Below this we have another SW CPU queue, which represents packets being queued up by a DX SDK sample (DrawPredicated). The solid blue bars are normal packets for rendering the scene. The blue bars with the red cross-hatch represent present packets. So it’s clear that we are in fact queueing up 3 frames’ worth of work.
16 SW Context CPU Queue (2)
CPU queue depth is 6
Task submitted to HW queue
CPU queue is empty!
New Task submitted to CPU queue

Just to illustrate this more plainly: here we have a queue depth of 6 on the left side (in reality this represents 3 frames, as 3 of the packets are present packets).
When an object reaches the bottom row it is submitted into the GPU HW queue.
When spaces occur between objects in the bottom row, the CPU queue has gone empty. No work is being queued up for the GPU, and ultimately the GPU will go idle too.
On the right side we start adding new packets again.
17 SW Context CPU Queues (3)
Objects represent work submitted to a GPU context
Queue is represented through time as a stack
Stack grows on submission of work by the UMD
Stack shrinks as work is completed by the GPU

Just to reiterate, and for the benefit of anyone not seeing me present this:
The colored bars (objects) represent work submitted to the GPU. The queue depth grows as work is submitted by the UMD, and shrinks as the GPU completes the work.
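The stack behavior can be modeled in a few lines. This is a minimal sketch, assuming each packet is reduced to a hypothetical (submit time, completion time) pair; GPUView itself draws this from ETW events, and the numbers below are made up.

```python
def depth_timeline(packets):
    """Return [(time, depth)] after every submit/complete event.

    packets: list of (submit_time, complete_time) tuples.
    """
    events = []
    for submit, complete in packets:
        events.append((submit, +1))    # stack grows on submission
        events.append((complete, -1))  # stack shrinks on completion
    # At equal timestamps, apply completions (-1) before submissions (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    depth = 0
    timeline = []
    for t, delta in events:
        depth += delta
        timeline.append((t, depth))
    return timeline


# Three packets queue up, drain, and a fourth arrives after a gap.
timeline = depth_timeline([(0, 4), (1, 6), (2, 8), (9, 10)])
print(timeline)
print("max depth:", max(depth for _, depth in timeline))
```

In this made-up trace the depth reaches 0 at time 8 and stays there until time 9, which corresponds to the empty-queue gap described on the previous slide.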
18 GPU HW Context Queue (1)
[Screenshot labels: Present Packet, Preemption packet, Queued DMA Packet, GPU Processing DMA Packet, DWM]

So that was the SW CPU queue. Let’s turn our attention to the HW GPU queue.
The black bar is a preemption packet. These are used to insert DMA packets into the HW queue. If you look at what follows the preemption packet on the bottom row, it is in fact a brown packet, which has originated from the DWM SW queue. So basically these black and brown packets represent windows getting rendered in amongst the work of a real D3D application.
Just as in the SW queue, the solid bars and cross-hatched bars represent normal rendering packets and present packets respectively.
19 GPU HW Context Queue (2)
GPU starts working on packet
GPU finishes working on packet
GPU has no work to do

As an example of this, we can see exactly when the GPU starts working on a DMA packet, and when it finishes. Likewise we can see precisely when it has nothing to do.
There really is no other tool out there that can show this kind of information.
20 GPU HW Context Queue (3)
Queue is represented through time as a stack
Stack grows on submission of work by the KMD
Stack shrinks as work is completed by the GPU
Gaps indicate a CPU side bottleneck

Just to reiterate, and for the benefit of anyone not seeing me present this:
The colored bars (objects) represent work submitted to the GPU. The queue depth grows as work is submitted by the KMD, and shrinks as the GPU completes the work.
Gaps in the GPU HW queue mean that the GPU is idle, and therefore suggest some kind of CPU side bottleneck.
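To make the idea of gaps concrete, here is a small sketch that finds idle periods given the intervals during which the GPU was executing packets. The interval data and the function name `idle_gaps` are hypothetical; in practice you read this off the bottom row of the HW queue.

```python
def idle_gaps(busy, capture_start, capture_end):
    """Return the (start, end) intervals in which no packet was executing.

    busy: list of (start, end) times of packets on the bottom row.
    """
    gaps = []
    cursor = capture_start
    for start, end in sorted(busy):
        if start > cursor:
            gaps.append((cursor, start))  # nothing executing: GPU idle
        cursor = max(cursor, end)
    if cursor < capture_end:
        gaps.append((cursor, capture_end))
    return gaps


busy = [(0, 5), (5, 9), (12, 16)]
gaps = idle_gaps(busy, capture_start=0, capture_end=20)
idle_time = sum(end - start for start, end in gaps)
print(gaps)            # [(9, 12), (16, 20)]
print(idle_time / 20)  # 0.35
```

A large idle fraction on a trace like this is the cue to go and look at what the CPU side was doing during those intervals.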
21 Object Selection
Represents latency

GPUView allows you to select an object simply by clicking on it with the mouse. Selecting an object in either the SW or HW queue will highlight the entire lifetime of that object in both views.
Above you can see how an object has trickled down through the SW queue, and then how it has trickled down through the HW queue.
The length of time between entering the SW queue and completing in the SW queue represents the latency.
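As a worked example of that definition, the sketch below computes latency per packet and summarizes it. The class `PacketLifetime`, its field names, and the timestamps are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PacketLifetime:
    sw_enter_ms: float     # time the packet entered the SW queue
    sw_complete_ms: float  # time the packet completed in the SW queue

    @property
    def latency_ms(self):
        return self.sw_complete_ms - self.sw_enter_ms


def latency_stats(packets):
    """Return (average, worst) latency in milliseconds."""
    latencies = [p.latency_ms for p in packets]
    return sum(latencies) / len(latencies), max(latencies)


packets = [
    PacketLifetime(0.0, 48.0),
    PacketLifetime(16.0, 66.0),
    PacketLifetime(32.0, 84.0),
]
average, worst = latency_stats(packets)
print(f"average {average} ms, worst {worst} ms")
```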
22 Object Details (1)
Packet type & timing information
Allocation references in DMA packet

When you click on an object you’ll also see this dialog box appear.
It contains a lot of detailed information about the specific packet you’ve clicked on, such as its type and the time it spent in the HW queue.
Of special interest is the fact that you can see all of the allocation references within the packet.
23 Object Details (2)
Preferred memory segment
P0 = Preferred
P1 = Less
P2 = Least
(w) = Writable by GPU

The (w) means that the resource is writable by the GPU, and therefore could be a render target. Just after that we have the preferred memory segment being used by the resource, P0 being the most preferred and P2 the least. So if you see resources getting allocated in P2 you may have a performance problem, especially if the resource is large.
We’ll come on to what P0, P1, and P2 mean on the next slide.
Simply by highlighting one of the allocation references and hitting the Locate Object button, we can enter the Object Viewer.
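If you note down the allocation references from this dialog, a small filter makes the suspicious ones stand out. This is a sketch only: the `AllocationRef` class, the example allocations, and the 1 MiB threshold are all assumptions for illustration and do not come from GPUView.

```python
from dataclasses import dataclass


@dataclass
class AllocationRef:
    name: str
    size_bytes: int
    preference: int            # 0 = P0 (preferred), 1 = P1, 2 = P2 (least)
    gpu_writable: bool = False  # the (w) marker


def suspicious_allocations(allocs, min_size_bytes=1 << 20):
    """Large allocations that did not land in their preferred segment (P0).

    The default 1 MiB threshold is an arbitrary illustrative choice.
    """
    hits = [a for a in allocs
            if a.preference > 0 and a.size_bytes >= min_size_bytes]
    # Worst first: least preferred segment, then largest size.
    return sorted(hits, key=lambda a: (-a.preference, -a.size_bytes))


allocs = [
    AllocationRef("depth buffer", 8 << 20, 0, gpu_writable=True),
    AllocationRef("terrain texture", 16 << 20, 1),
    AllocationRef("small vertex buffer", 64 << 10, 2),
    AllocationRef("shadow map", 4 << 20, 2, gpu_writable=True),
]
for a in suspicious_allocations(allocs):
    print(f"P{a.preference} {a.size_bytes >> 20} MiB {a.name}")
```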
24 Object Viewer
Segment Numbers:
1 = Vid Mem (CPU visible)
2 = Vid Mem (Non visible)
3 = PCI Express Mem
Clearly the depth buffer

Now we can see what the preferred segment order is for this particular object.
You can see that this object sits in non-CPU-visible video memory, and is likely the depth buffer.
In this way you can take a look at resources and see if they are ending up in bad segments. You may wish to use another tool in conjunction with this one to fully track down which resource in your game this really is, but the size and format are a good start.
I should mention that this segment numbering is AMD specific, and you should consult your NVIDIA dev rel engineer to get this right on their HW.
25 Paging Buffer Packet
Submitted as the result of a paging operation (perhaps a large texture)
Usually caused by preparing a DMA buffer
Look at the DMA packet that follows the paging operation

One type of packet we haven’t discussed yet is the paging packet, which is always denoted by a red bar. These packets are submitted as a result of a memory paging operation. Perhaps an application is streaming in a lot of large textures, for example. To find out which resources have caused the paging operation, we recommend you look at the DMA packets that follow the paging packet. It is worth checking whether all of the allocations are getting their preferred segments.
Obviously paging is not always possible to avoid. But it should be noted that while paging is going on, the GPU is not getting on with the job at hand: rendering the scene. So it is best to have a good look at what is causing this kind of behavior, and whether anything can be done about it.
27 HW Threads
Colored bars represent idle time
Gaps represent work

In addition to the SW and HW queues, there are a number of other views supported in the tool. Above is the idle HW threads view, and you can see that this represents a 6 core CPU system. Since this is the idle view, the colored sections represent idle time. The gaps represent work done by various threads. The color assigned to each HW thread here is reused wherever that HW thread appears elsewhere in the tool.
28 Thread Execution
Thread segments are color coded:
Light blue: Kernel mode
Dark blue: dxgkrnl
Red: KMD (Kernel Mode Driver)

If we look more closely at a single threaded SW queue, you can see the thread activity below it. The main green color means that it ran on HW thread 4. The colored sections represent time spent in various modes. Ideally you don’t want to see a lot of kernel mode time, and typically we see far less of this in DX10/11 than in DX9.
29 Charts: FPS / Latency / Memory

Additionally, there are various charts you can bring up, such as FPS, latency, and memory consumption.
30 Viewing Events
Ctrl+E opens the Event View window
Can track whatever events take your interest
DX - Create / Destroy Allocation
DX Block
  Suggests possible resource contention
  Perhaps trying to lock an in use buffer

You can bring up the Event View window with Ctrl+E, and track whatever events you like. Of particular interest may be things like DX create / destroy calls, or DX Block. Also worthy of note is that you can trace events not related to DX, so GPUView can have a slightly wider scope.
31 V-Sync Event

Just as a quick example of how an event shows up in GPUView: here we have enabled the v-sync event, which appears as the regular blue vertical lines. Other events will appear in different colors.
33 DrawPredicated SDK Sample
GPU is busy, no gaps
CPU queue is buffering up nicely
App thread not saturated

OK, so let’s take a look at a couple of log files from various DX SDK applications, and see what they can tell us. Firstly I wanted to show you the DrawPredicated DX10 sample. If you recall, a predicated draw is a non-blocking, conservative way to deal with occlusion testing.
As you can see from the GPU HW queue, there are no gaps between the colored bars, and the queue looks to be 2 deep with rendering packets all of the time. This suggests that the application is GPU limited. This was running on a low end GPU. You can also see the preemption packets, and the DWM packets for rendering windows.
Looking now at the SW CPU queue, you can see that there are no gaps between bars, and also that the queue looks to be around 3 deep in terms of real rendering packets. This all looks good: the GPU is being fed well.
Lastly, if you look at the thread bar below the SW CPU queue, you’ll note there are lots of gaps, indicating that the main thread of this app is not fully saturated.
34 DrawPredicated SDK Sample: + blocking occlusion queries
GPU is going idle
Not enough being queued up
App thread fully saturated

So what changes if we add blocking occlusion queries back into the mix?
Looking at the HW GPU queue, you’ll first notice that there are now gaps appearing between bars in the bottom row, which indicates that the GPU is going idle. It is being starved of useful work to do.
Looking at the SW CPU queue, you’ll also notice that there are gaps appearing between bars, and worse still, there is clearly not enough work being queued up.
The real giveaway that the application is doing a busy wait for occlusion query results is the thread activity bar at the bottom, which is now almost completely saturated.
Watch out for this kind of behavior!
If you’re going to use occlusion queries then don’t do a busy wait in the same frame. Instead try to pick up the result N frames later, where N is the number of GPUs in the system.
35 Getting Occlusion Queries Right
Delay picking up results by N frames
Where N = Number of GPUs
May need to artificially inflate occlusion volumes to avoid popping
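The delayed pick-up can be sketched independently of any graphics API. In the sketch below the class `DelayedQueries` is hypothetical, and `fetch_result` is a stand-in callable for whatever result read your API provides; a real application would issue actual occlusion queries rather than the fake lambda used here.

```python
from collections import deque


class DelayedQueries:
    """Collect query results N frames after they were issued (N = number of GPUs)."""

    def __init__(self, num_gpus):
        self.delay_frames = num_gpus
        self._pending = deque()  # (frame issued, fetch_result callable)

    def submit(self, frame, fetch_result):
        self._pending.append((frame, fetch_result))

    def collect(self, current_frame):
        """Return [(frame issued, result)] for queries that are old enough."""
        ready = []
        while (self._pending
               and current_frame - self._pending[0][0] >= self.delay_frames):
            frame, fetch = self._pending.popleft()
            ready.append((frame, fetch()))
        return ready


queries = DelayedQueries(num_gpus=2)
for frame in range(5):
    # Pick up results from 2 frames ago instead of busy waiting this frame.
    for issued, visible in queries.collect(frame):
        print(f"frame {frame}: query from frame {issued} -> visible={visible}")
    # Fake query: pretend the object is visible on even frames.
    queries.submit(frame, lambda f=frame: f % 2 == 0)
```

Because visibility decisions are based on results that are N frames old, objects can appear late, which is why the slide suggests inflating occlusion volumes.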
36 What else could cause this problem?
Locking a Render Target
Use CopyResource & Staging Textures
This is a queued operation
37 ContentStreaming SDK Sample (1)
Paging packets
GPU is going idle

Let’s look at another example. Here we’re looking at the ContentStreaming SDK sample. Straight away we can see those red paging packets in the second HW context queue. Note that while that is going on, the main rendering context queue is not being fed, so it is important to try to avoid this kind of situation if possible. We see large gaps in the HW GPU queue, meaning the GPU is going idle.
Looking at the SW CPU queue, we can also see that while there are no gaps appearing, we are still failing to queue up enough work. Again, this is likely due to the IO occurring in the application.
Let’s look at the contents of the packets that follow the paging packets, to see if they get their preferred segments…
38 ContentStreaming SDK Sample (2)
Large resources not getting preferred segments

Clicking on one of those packets brings up the Object Details dialog. We can then see from the allocation references that a number of the allocations are going into P1 (so not the preferred segment). Critically, if we look at the size of these resources, we can see that they are quite large, on the order of MBs.
You should definitely be looking for problems like this. Talk with your local developer relations engineer if you’re not sure.
Specifically, I should mention that segment numbering and meanings differ between AMD and NVIDIA.
39 Avoiding Paging
Keep your video memory usage under control
  Especially in MSAA modes
  Drop texture resolution for lower end HW
Avoid excessively large amounts of dynamic data
  Textures & Vertex Buffers
If not sure, talk to us!
40 MultithreadedRendering11 SDK Sample
Additional threads preparing packets
But there is a lot of D3D runtime / driver overhead

In this last example, we look at the MultithreadedRendering11 DX SDK sample. What is interesting to note here is that there are now several more threads preparing rendering command buffers. This is clear to see in the thread activity view.
However, what is equally important to note is that there is a lot of D3D runtime / UMD overhead.
But this talk is not about the use of deferred contexts (DCs). If you have questions about this, come and talk to us before wasting valuable time and effort getting nowhere…
41 Multi-Threaded Rendering and Deferred Contexts
It is a complex issue
Don’t expect it to be a magic bullet
Strongly recommend you talk to developer relations from AMD & NVIDIA
43 Summary
Make sure you’re keeping the ever hungry GPU fed
Keep track of CPU/GPU interaction
Keep track of your threads
Monitor multi-GPU interaction
Add GPUView to your toolbox

In the end, GPUView provides additional information not to be found anywhere else. Why not check it out, especially if you have some performance woes in your game?
Check to see how well you’re feeding the GPU, keep track of your threads, and even monitor GPU to GPU interaction.
That’s all.
44 Acknowledgments
Microsoft for creating GPUView
Microsoft for providing background content

My thanks to Microsoft for creating GPUView in the first place, and also for providing some of the background material for this presentation.