
GDC’99
Performance Tuning with Intel® Graphics Tools
Larry Wickstrom, Sr. Software Engineer
Judith Stanley, Application Engineer
Intel Corporation
March 17, 1999

Purpose
To provide two tools that give more performance information than you can get anywhere else!

Tuning D3D App. Performance Using IPEAK/GPT
– Finding FPS problems in your game
– Measuring concurrency in your game
– Pinpointing performance through API logging in your game

The Tool Family Tree
[Diagram: layers Your Game, DirectX*, GFX Driver, Intel® Graphics Hardware; tools VTune Analyzer, IPEAK GPT, and the Intel® Graphics Profiler in VTune™ Analyzer 4.0]
*Third party marks and brands are the property of their respective owners

Half-Life* FPS Demo
GPT finds frame rate problems
* Other brands and names are property of their respective owners.

GPT and the Graphics Pipeline
GPT intercepts DX6.1: DirectDraw* and Direct3D Immediate Mode*
– Retained Mode* partially supported
– OpenGL* planned
[Diagram: Application → GPT interceptor → API (DirectDraw*/Direct3D IM*) → Display Driver → Graphics Controller]
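The interceptor idea can be sketched in a few lines. This is a minimal Python illustration, not GPT's implementation (GPT hooks the DirectDraw/Direct3D COM interfaces): `InterceptedDevice` and `FakeDevice` are hypothetical names, and the `null_api` switch, which swallows calls instead of forwarding them, is an assumed stand-in for the NULL API mode the following slides describe.

```python
from typing import Any, List, Tuple


class InterceptedDevice:
    """Hypothetical API interceptor: wraps a device object, records every
    method call, and either forwards it or (null_api=True) swallows it."""

    def __init__(self, device: Any, null_api: bool = False, null_result: Any = 0) -> None:
        self._device = device
        self._null_api = null_api
        # Value returned for swallowed calls; 0 stands in for a success code (assumption).
        self._null_result = null_result
        self.log: List[Tuple[str, tuple]] = []

    def __getattr__(self, name: str) -> Any:
        # Only called for names not defined on the interceptor itself.
        target = getattr(self._device, name)
        if not callable(target):
            return target

        def hook(*args: Any, **kwargs: Any) -> Any:
            self.log.append((name, args))
            if self._null_api:
                # Skip the API, driver, and hardware work but keep the app happy.
                return self._null_result
            return target(*args, **kwargs)

        return hook


class FakeDevice:
    """Stand-in for a real graphics device, for illustration only."""

    def __init__(self) -> None:
        self.triangles_drawn = 0

    def draw_primitive(self, triangle_count: int) -> int:
        self.triangles_drawn += triangle_count
        return 0
```

Because the application only ever talks to the wrapper, the same hook point can log calls, forward them unchanged, or drop them, which is what lets one tool measure, remove load, and log without modifying the game.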

Now let’s take control...
– Remove graphics load: to measure load balance of CPU vs. graphics
– Remove parallelism: to measure concurrency

Measuring Load Balance
[Diagram: Application → GPT interceptor → API (DirectDraw*/Direct3D IM*) → Driver → Graphics Controller → Display]
GPT can remove, one layer at a time:
– Graphics Controller
– Driver CPU load
– API CPU load
… and keep the app happy

Comparison of NULL API to Normal
[Chart: fps vs. time for the Unmodified and NULL API runs, annotated "API Overhead" and "App Bound"]

What can be inferred...
If performance increases dramatically:
– Too much graphics
– Too little app: add more AI/physics/...
If performance doesn’t increase:
– Too much app: could the HW do more?
– Too little graphics
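The rule of thumb above can be written down as a small function. This is a hedged sketch: the slides give no number for "increases dramatically", so `dramatic_ratio` is a hypothetical cutoff, and the function names are illustrative. The overhead helper simply converts the two frame rates into the per-frame time gap labeled "API Overhead" on the comparison chart.

```python
def api_overhead_ms(normal_fps: float, null_api_fps: float) -> float:
    """Per-frame time (ms) removed by running with the NULL API."""
    if normal_fps <= 0 or null_api_fps <= 0:
        raise ValueError("frame rates must be positive")
    return 1000.0 / normal_fps - 1000.0 / null_api_fps


def infer_from_null_api(normal_fps: float, null_api_fps: float,
                        dramatic_ratio: float = 1.5) -> str:
    """Classify a NULL API experiment; dramatic_ratio is a hypothetical cutoff."""
    if normal_fps <= 0 or null_api_fps <= 0:
        raise ValueError("frame rates must be positive")
    if null_api_fps / normal_fps >= dramatic_ratio:
        # Removing graphics load helps a lot: too much graphics, too little app.
        return "graphics-bound"
    # Little change: the application's own CPU work dominates the frame.
    return "app-bound"
```

For example, a game that runs at 25 fps normally and 50 fps with the NULL API spends about 20 ms per frame below the API and would be classified as graphics-bound.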

Concurrency
– Parallel performance: the CPU and the graphics controller (GC) work at the same time
– Serial performance: the CPU waits on the GC, and vice versa
[Diagram: timelines of CPU and 3D HW activity]
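An idealized model makes the difference between the two cases concrete. This sketch assumes each frame needs a fixed amount of CPU time and GC time, and that perfect overlap is possible; real frames fall somewhere between the two extremes.

```python
def frame_time_ms(cpu_ms: float, gc_ms: float, concurrent: bool) -> float:
    """Idealized frame time for CPU work plus graphics controller (GC) work."""
    if concurrent:
        # Parallel: CPU and GC overlap, so the slower of the two sets the pace.
        return max(cpu_ms, gc_ms)
    # Serial: the CPU waits on the GC and vice versa, so the times add up.
    return cpu_ms + gc_ms


def fps(frame_ms: float) -> float:
    """Convert a frame time in milliseconds to frames per second."""
    return 1000.0 / frame_ms
```

With 10 ms of CPU work and 8 ms of GC work, the serial frame takes 18 ms (about 56 fps) while the concurrent frame takes 10 ms (100 fps). If either side has no work at all, both cases give the same time, which is the "extreme load imbalance" case listed later.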

Measuring Concurrency
[Diagram: Application → GPT interceptor → API (DirectDraw*/Direct3D IM*) → Driver → Graphics Controller → Display]
GPT can introduce locks here that serialize CPU and graphics hardware activity.

Comparison of Serialize to Normal
[Chart: fps vs. time for the Unmodified and Serial runs, annotated "Concurrency" and "Lack of Concurrency"]

What can be inferred...
If Serial << Normal:
– Good. A wider gap means more concurrency
If Serial == Normal:
– The application isn’t benefiting from CPU/GC concurrency
– The app is causing the CPU and GC to serialize
– Extreme load imbalance: either no graphics load or no CPU load
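The comparison can be reduced to a single number: the fraction of frame rate the normal run gains over the forced-serial run. This is an illustrative sketch; the function names are hypothetical and `tolerance` is an assumed noise margin for deciding that "Serial == Normal".

```python
def concurrency_gain(normal_fps: float, serial_fps: float) -> float:
    """Fractional fps gained from CPU/GC overlap; 0.0 means no concurrency."""
    if normal_fps <= 0 or serial_fps <= 0:
        raise ValueError("frame rates must be positive")
    return normal_fps / serial_fps - 1.0


def benefits_from_concurrency(normal_fps: float, serial_fps: float,
                              tolerance: float = 0.05) -> bool:
    """False when the two runs are equal within a hypothetical noise margin."""
    return concurrency_gain(normal_fps, serial_fps) > tolerance
```

A game at 60 fps normally and 40 fps when serialized gains 50% from concurrency; one at 40 fps vs. 39.5 fps is effectively serialized already.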

Half-Life* Load Balance/Concurrency Demo
GPT finds frame concurrency problems

API Logging
[Screenshot: log entries for Direct3D calling DirectDraw, and DirectDraw calling DirectDraw]

Coverage

Duration (Frame Marking)
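Frame marking turns a flat API log into per-frame measurements. This sketch assumes a log of `(timestamp_ms, call_name)` tuples and uses the DirectDraw `Flip` call as the end-of-frame marker; both the log format and the choice of marker are assumptions, not GPT's actual log layout.

```python
from typing import List, Tuple


def frame_durations(log: List[Tuple[float, str]],
                    marker: str = "Flip") -> List[Tuple[float, int]]:
    """Return (duration_ms, calls_in_frame) for each frame bounded by two markers."""
    frames: List[Tuple[float, int]] = []
    last_marker_t = None
    calls = 0
    for t_ms, call in log:
        if call == marker:
            if last_marker_t is not None:
                frames.append((t_ms - last_marker_t, calls))
            # Calls before the first marker belong to an incomplete frame and are dropped.
            last_marker_t = t_ms
            calls = 0
        else:
            calls += 1
    return frames
```

Sorting the result by duration points straight at the slow frames, and the call count shows whether those frames simply issued more API calls.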

Half-Life* Load Balance/Concurrency Demo
GPT pinpoints performance problems

GPT Summary
– GPT quickly finds FPS problems in your game
– GPT measures concurrency and load balance
– GPT pinpoints API-level performance problems

Intel® Graphics Profiling Capability of VTune™ Performance Analyzer 4.0
– What is it?
– What’s it do?
– Show me how...

What Is It?
VTune™ Performance Analyzer 4.0:
– System monitoring
– Software execution examination
– Dynamic simulation and analysis

What Is It?
Intel® Graphics Profiling Capability:
– Integrated into VTune™ Performance Analyzer 4.0
– 3D application profiling

What Is It?
The Tool Family Tree
[Diagram: layers Your Game, DirectX*, GFX Driver, Intel® Graphics Hardware; tools VTune Analyzer, IPEAK GPT, and the Intel® Graphics Profiler in VTune™ Analyzer 4.0]

What Is It?
Architecture: select and view events
[Diagram: CPU with L2 cache, chip set, system memory, PCI bus, AGP, and the Intel® Graphics Accelerator with local video memory]
– Intel740™ driver: setup, pixel fill, frames/sec, CPU utilization, state changes
– Intel® graphics hardware driver: tri/sec and utilization, pix/sec and utilization
– Intel® graphics chip: 3D pipe, 2D engine

What’s It Do?
Analyze Intel® graphics hardware:
– Maximum fill rate
– Clocks the app sits idle
– 3D clocks that can be recovered

What’s It Do?
Watch Intel® Graphics D3D*/OpenGL* drivers:
– Total time in driver
– Duty cycle for the average triangle
– Frames per second
– Total time in each driver callback

What’s It Do?
Reports bottlenecks:
– Triangle packet size
– CPU/Intel740™ chip concurrency
– Locks to render targets

Show Me How...
Get started:
– Profile your app with VTune™ Analyzer 4.0
– Look for hot spots
– Look at the HW/driver counter graphs
– Find the problem, then “drill down” to the CPU time frame

Show Me How...
Find the bottleneck: serialization vs. concurrency
– The CPU sits idle (waits for the HW)
– The graphics HW sits idle (CPU busy)
Why?
– Improperly placed 2D instructions
– Triangle-at-a-time methodology
[Diagram: one frame's timeline. Processor row: Light/Transform/Game Control, then driver (Drv) and GfxHW work, repeated; Gfx HW row: Raster Triangles bursts; the driver duty cycle is marked]
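Why triangle-at-a-time submission hurts can be seen with a simple cost model. This sketch assumes the driver spends a fixed overhead per call plus a fixed cost per triangle; the function name and all cost figures are hypothetical, chosen only to show the shape of the effect.

```python
import math


def driver_cpu_ms(triangles: int, tris_per_call: int,
                  per_call_us: float, per_triangle_us: float) -> float:
    """Hypothetical driver cost: fixed overhead per call plus work per triangle."""
    if tris_per_call < 1:
        raise ValueError("tris_per_call must be at least 1")
    calls = math.ceil(triangles / tris_per_call)
    return (calls * per_call_us + triangles * per_triangle_us) / 1000.0
```

With an assumed 20 µs per call and 1 µs per triangle, 1,000 triangles sent one at a time cost 21 ms of driver CPU time, while batches of 100 cost 1.2 ms. The per-call overhead, not the triangles, dominates the one-at-a-time case, and the CPU time spent there is time the graphics hardware may sit idle.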

Show Me How...
Demo: Guess What the Bottleneck Is?

Show Me How...
What to look for:
– An app requires triple buffering...
– An app requires mipmapping...
– You can gather 3D statistics...

Show Me How...
Demo: Guess What the Bottleneck Is?

Summary
– Intel® Graphics Profiler is a new capability of VTune™ Analyzer 4.0
– Intel Graphics Profiler monitors graphics HW and driver performance
– What you learn can apply to other graphics hardware
– Usage: find the problem, then drill down!

Support & Information
IPEAK GPT
– Questions, comments:
– IPEAK Web site:
Intel® Graphics Profiling Capability in VTune™ Analyzer 4.0
– s/swdev/index.htm
Download the Demo!!!

BACKUP

OA Profiler Capability Is Included with VTune 4.0 Analyzer
Installation
– Included in the VTune™ 4.0 Analyzer installation
– Select the Intel740™ Graphics Accelerator counters at the component installation configuration menu
Enabling Graphics Profiling
– Under “Configure”, under “Options” and “Sampling”, select “Chronology Objects”
– Enable the Intel® Graphics Counters (Intel740™ Graphics Accelerator)
– Double-click the Intel740 chip counters in the same menu to configure individual counters
– Finally, under “Sampling”, select “Advanced” and enable “Collect Chronology Data”

Tuning Your App with the OA Tool Set: Accounting for Lost Clocks
Intel® graphics hardware
– Maximum fill rate is 1 pixel/clock, i.e. 66M pixels/sec
– Of those 66M clocks per second, the clocks not counted by Cmd_Stream_Busy are clocks the Intel740™ chip sits idle
– 3D clocks that do not produce a pixel can potentially be recovered by modifying application code
D3D*/OpenGL* driver
– Total number of CPU clocks used by the driver
– The duty cycle for the average triangle sizes listed in the SUG can be used to predict where your game should be running
– The totals for each callback can be observed to narrow down bottlenecks
– Typical bottlenecks: triangle packet size, CPU/Intel740 chip concurrency, locks to render targets
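The clock accounting on this slide is simple arithmetic. This sketch takes the slide's figures (1 pixel/clock, 66M pixels/sec, so 66M clocks/sec) as given; the function names are illustrative and the counter values in the example are made up, not real Intel740 readings.

```python
CLOCKS_PER_SEC = 66_000_000  # 1 pixel/clock at the slide's 66M pixels/sec peak


def idle_clocks(cmd_stream_busy_clocks: int) -> int:
    """Clocks per second in which the chip's command stream was not busy."""
    return CLOCKS_PER_SEC - cmd_stream_busy_clocks


def recoverable_3d_clocks(busy_3d_clocks: int, pixels_written: int) -> int:
    """3D clocks that produced no pixel (each pixel needs at least one clock)."""
    return max(0, busy_3d_clocks - pixels_written)


def fill_utilization(pixels_per_sec: float) -> float:
    """Achieved fill rate as a fraction of the 1 pixel/clock peak."""
    return pixels_per_sec / CLOCKS_PER_SEC
```

For example, 50M busy clocks per second leaves 16M idle clocks, and 40M busy 3D clocks that wrote only 30M pixels leave 10M clocks that application changes might recover.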

What Intel® Graphics Counters Tell About Your App
Graphics Counters Correlate GFX Hardware Events
% 2D or 3D Cmd Stream Busy
– Total amount of time the graphics hardware is in use
% 3D Fill Engine Busy vs. % 3D Fill Engine Stall
– If Fill Engine Busy is very high, the app can be fill-rate limited (very large triangles)
– Contrast the two to see whether the engine is busy but stalled, indicating it is either waiting for pixel data or waiting for information to finish the pixel calculation
% 3D Pipeline Busy
– If higher than % 3D Fill Engine Busy, this indicates too many small triangles: the app is setup limited
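The three readings can be combined into a quick diagnosis. This is a sketch of the slide's reasoning only: the function name is hypothetical, the inputs are percentages as the counters report them, and the `high` and `stalled` thresholds are assumed values, since the slide gives none.

```python
from typing import List


def diagnose_3d(fill_busy: float, fill_stall: float, pipeline_busy: float,
                high: float = 90.0, stalled: float = 30.0) -> List[str]:
    """Interpret % 3D Fill Engine Busy/Stall and % 3D Pipeline Busy.

    high and stalled are hypothetical thresholds, in percent."""
    findings: List[str] = []
    if pipeline_busy > fill_busy:
        # Setup work outpaces filling: many small triangles.
        findings.append("setup limited")
    if fill_busy >= high:
        # Busy but stalled means waiting on pixel data; otherwise pure fill load.
        findings.append("busy but stalled" if fill_stall >= stalled else "fill-rate limited")
    return findings
```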

What Intel® Graphics Counters Tell About Your App
Counters Used in Combination Uncover Added Information
Pixels (Z Tested) and (Z Failed)
– Number of pixels processed by the graphics card
– Z Failed / Z Tested gives the percentage of tested pixels that fail the Z-buffer depth test
Z Writes to Z Buffer
– Counts the number of 16-bit Z writes
Pixel Reads from Render Buffer
– Contrasted with Pixel Writes, shows what percentage of your scene gets alpha blended
Color Calculator Stalled by Color Read
– If this is high, alpha blending could be causing a local-memory bandwidth bottleneck
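The two ratios on this slide can be computed directly from the counter totals. This sketch assumes that each alpha-blended pixel reads the render buffer exactly once and that unblended pixels do not read it, so reads divided by writes approximates the blended share; the function names are illustrative.

```python
def z_fail_pct(z_failed: int, z_tested: int) -> float:
    """Percentage of Z-tested pixels rejected by the depth test."""
    return 100.0 * z_failed / z_tested if z_tested else 0.0


def blended_pct(pixel_reads: int, pixel_writes: int) -> float:
    """Estimated % of written pixels that were alpha blended (assumes one read each)."""
    return 100.0 * pixel_reads / pixel_writes if pixel_writes else 0.0
```

A high Z-fail percentage suggests a lot of hidden geometry is still being sent and tested; a high blended percentage together with a high "Color Calculator Stalled by Color Read" count points at alpha blending as the bandwidth consumer.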

What Intel® Graphics Counters Tell About Your App
What You Learn Can Apply to Other Graphics Hardware
“Triangles Processed” & “Triangles Rendered”
– Triangles per second. A large discrepancy indicates zero-pixel triangles
“AGP Texture Data Bytes Read”
– AGP bandwidth used for textures, in bytes
“Texture Cache Busy” & “Texture Cache Fetch Stall”
– All texel data goes through the texture cache, so this indicates texture usage
– A very high Texture Cache Fetch Stall indicates AGP texture bandwidth is overrun: mipmapping is needed
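These counters also reduce to simple rates. This is a rough sketch: treating every processed-but-not-rendered triangle as a zero-pixel triangle is an assumption, the function names are hypothetical, and MB here means 1,000,000 bytes.

```python
def zero_pixel_triangles(processed: int, rendered: int) -> int:
    """Rough estimate of triangles that were set up but produced no pixels."""
    return max(0, processed - rendered)


def zero_pixel_pct(processed: int, rendered: int) -> float:
    """Zero-pixel triangles as a percentage of triangles processed."""
    if processed <= 0:
        return 0.0
    return 100.0 * zero_pixel_triangles(processed, rendered) / processed


def agp_texture_mb_per_s(bytes_read: int, seconds: float) -> float:
    """AGP texture bandwidth in MB/s from the 'AGP Texture Data Bytes Read' total."""
    return bytes_read / seconds / 1_000_000
```

If 120,000 triangles are processed but only 90,000 rendered, a quarter of the setup work produced nothing visible; tracking texture MB/s alongside the fetch-stall counter shows whether mipmapping is reducing the AGP load.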