Temperature-Aware GPU Design
Jeremy W. Sheaffer, Kevin Skadron, David P. Luebke
{jws9c, skadron, luebke}@cs.virginia.edu
University of Virginia, Charlottesville, VA 22904
http://qsilver.cs.virginia.edu

Problem Statement
Cooling for graphics processors is becoming prohibitively expensive, yet cooling solutions are designed for worst-case behavior. Since power dissipation is spatially non-uniform across the chip, localized heating occurs much faster than chip-wide heating, which leads to "hot spots" and spatial gradients that can cause accelerated aging and timing errors. Reducing hot spots reduces cooling requirements. In fact, as true worst-case behavior is rare, a solution designed for the worst case is overdesigned for typical operating conditions. However, a package designed for typical behavior could be overcome by some unusual application, requiring dynamic thermal management (DTM).

Architecture-Level Thermal Modeling
Requirements: general, simple, and fast, modeling heating at the granularity of architectural objects. The model:
- must be able to dynamically calculate temperatures for each block in the architecture
- must be able to simulate billions of clock cycles in a few hours
- must be general enough to use for modeling a variety of processor architectures
- must allow reasoning about results at the architecture level
Solution: derive an equivalent circuit of lumped thermal resistances and capacitances. This circuit must be derived at the granularity of the processor architecture. Key components:
- floorplanning
- lumped-RC circuit derivation (a minimal RC-update sketch appears below, after the Qsilver overview)

GPU Simulation with Qsilver
To study thermal issues in a GPU, we have developed a simulator called Qsilver that:
- models GPU clock-cycle-by-cycle activity and power in the microarchitecture domain (see the accounting sketch at the end of this page)
- uses the Chromium† system to intercept a stream of OpenGL calls, annotating it with aggregate information about the vertices and fragments, textures, lighting, and other relevant rendering state
Qsilver is useful for:
- analyzing performance bottlenecks
- estimating power
- exploring new graphics architectural ideas
We have used Qsilver to analyze a hypothetical fixed-function, console-like GPU architecture. For these results, we augment Qsilver with an architectural thermal model called HotSpot‡ that tracks temperature in each functional unit over time.
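
As an illustration of the lumped-RC approach described under Architecture-Level Thermal Modeling, the sketch below treats each architectural block as a single thermal node and advances its temperature with forward-Euler steps. The block names, R and C values, power numbers, and time step are assumptions chosen only for illustration; this is a simplified stand-in for the kind of network a tool like HotSpot derives automatically, not HotSpot itself.

```python
# Minimal lumped-RC thermal sketch: each architectural block is one thermal
# node with a capacitance and a resistance to ambient; adjacent blocks are
# coupled by lateral resistances. All names and values are illustrative.

AMBIENT = 45.0  # deg C, assumed ambient/heatsink temperature

blocks = {                    # name: (R_to_ambient [K/W], C [J/K])
    "vertex_engine":   (2.0, 0.015),
    "fragment_engine": (1.5, 0.020),
    "framebuffer_ops": (2.5, 0.010),
}
lateral = {                   # (block_a, block_b): lateral resistance [K/W]
    ("vertex_engine", "fragment_engine"): 4.0,
    ("fragment_engine", "framebuffer_ops"): 4.0,
}

def step(temps, power, dt):
    """Advance all block temperatures by one forward-Euler step of size dt."""
    new = dict(temps)
    for name, (r_amb, cap) in blocks.items():
        q = power.get(name, 0.0)                      # power dissipated in block [W]
        q -= (temps[name] - AMBIENT) / r_amb          # heat flow to ambient
        for (a, b), r_lat in lateral.items():         # lateral conduction
            if name == a:
                q -= (temps[name] - temps[b]) / r_lat
            elif name == b:
                q -= (temps[name] - temps[a]) / r_lat
        new[name] = temps[name] + dt * q / cap
    return new

# Usage: start at ambient and apply a made-up constant power profile.
temps = {name: AMBIENT for name in blocks}
power = {"vertex_engine": 6.0, "fragment_engine": 8.0, "framebuffer_ops": 5.0}
for _ in range(100_000):          # 100k steps of 10 microseconds = 1 s
    temps = step(temps, power, dt=1e-5)
print({k: round(v, 1) for k, v in temps.items()})
```

In a real architecture-level model the per-block power trace would come from the cycle-level simulator rather than a constant profile, and the R and C values would be derived from block areas and the package.
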
Floorplans
[Figure: four candidate floorplans, from left: Default, Separating Hot Units, High Resolution, and Partitioned High Resolution. The labeled blocks are Host Interface, Vertex Engine, Rasterizer, Fragment Engine, Texture Cache, Framebuffer and Data Compression, Framebuffer Control, 2D Video, and unused area.]
In order to add thermal modeling to Qsilver, the simulator must first be instrumented with an architectural floorplan. From the left, these floorplans are:
- Default: based on an nVIDIA marketing photo. We use this chip to drive an 800×600, console-like display in our simulations.
- Separating Hot Units: based on the default floorplan, but with the two hottest units, framebuffer operations and the vertex engine, separated.
- High Resolution: also based on the default, but modified to drive a PC display at 1280×1024. The framebuffer, fragment engine, and texture cache are enlarged to maintain reasonable power densities under the higher workload (see the power-density sketch at the end of this page).
- Partitioned High Resolution: this novel floorplan maintains the functional unit area of the high resolution design, but partitions units into separate blocks per pipe and separates hot blocks from cooler ones.

Simulator Setup and Output
For these results, our simulator is configured to model a system:
- built on a 180nm process at 1.8V and 300MHz
- using an aluminum cooling solution with no fan
- with a temperature sensor on each functional unit block
We assume that the vendor specifies a 100°C maximum safe operating temperature and enable dynamic thermal management at 97°C to account for sensor imprecision. We have implemented the following DTM techniques in Qsilver (a control-loop sketch follows the footnotes below):
- Clock Gating: the clock is stopped until the chip drops below the threshold temperature.
- Fetch Gating: a single stage in the pipeline is slowed down. We implement this in both the vertex fetch and rasterization stages.
- Dynamic Voltage Scaling: DVS scales the core voltage, and with it the frequency, yielding a cubic reduction in power.
- Multiple Clock Domains: MCD also scales voltage and frequency, but at the granularity of individual functional units.
Both DVS and MCD incur a synchronization-time penalty when they are enabled or disabled.

Thermal Simulation Results
Performance cost and maximum temperature (°C) of each DTM technique on each floorplan:

Technique                    | Default        | Separating Hot Units | High Resolution | Partitioned High Res.
                             | Cost    Max °C | Cost    Max °C       | Cost    Max °C  | Cost    Max °C
No DTM                       | 0.0%    106.4  | 0.0%    105.5        | 0.0%    103.7   | 0.0%    100.9
Clock Gating                 | 62.0%   97.0   | 13.6%   97.0         | 14.8%   97.0    | 0.7%    97.0
Fetch Gating (Vertex Fetch)  | 25.9%   102.9  | 10.2%   98.7         | 9.2%    101.3   | 0.5%    98.1
Fetch Gating (Rasterizer)    | 90.1%   98.1   | 17.7%   97.8         | 17.4%   97.0    | 0.7%    97.8
Dynamic Voltage Scaling      | 13.1%   100.7  | 3.4%    98.2         | 3.4%    97.4    | 0.1%    97.0
Multiple Clock Domains       | 16.7%   98.4   | 4.1%    97.0         | 3.7%    97.0    | 0.5%    97.4

[Thermal maps, from left to right:] No architectural thermal management with the default floorplan yields a very hot vertex engine; moving the hot units apart, combined with DVS, makes the chip cooler with a less pronounced thermal spatial gradient; fetch gating on the high resolution system; and DVS on the redesigned high-resolution chip, where the effect of separating hot spots on the spatial gradient is more obvious. Combining static and dynamic techniques is a double win. Note that, to better illustrate their full dynamic range, these thermal maps are not all on the same scale.

http://gfx.cs.virginia.edu
http://lava.cs.virginia.edu
† http://chromium.sourceforge.net/
‡ http://lava.cs.virginia.edu/HotSpot/
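
The DTM techniques above all hinge on the same trigger: engage when a sensor reaches 97°C, release once the chip has cooled. The sketch below illustrates that control loop with DVS as the response, including the cubic power relation mentioned in the setup section. The function names, the hysteresis margin, and the specific voltage step are assumptions for illustration, not Qsilver's actual implementation.

```python
# Hedged sketch of a DTM control loop: engage a technique when any block's
# sensor reaches the trigger threshold, release it after the chip has cooled.
# Names, the hysteresis margin, and the power model are illustrative assumptions.

TRIGGER_C = 97.0       # engage DTM here (100 C vendor limit minus sensor margin)
V_NOMINAL = 1.8        # volts, nominal core voltage of the modeled system

def dvs_power(p_nominal, v):
    """Dynamic power under DVS: P ~ C*V^2*f, and f scales with V, so P ~ V^3."""
    return p_nominal * (v / V_NOMINAL) ** 3

def dtm_step(sensor_temps, p_nominal, engaged):
    """One sampling interval: return (power after DTM, whether DTM is engaged)."""
    hottest = max(sensor_temps.values())
    if hottest >= TRIGGER_C:
        engaged = True
    elif engaged and hottest < TRIGGER_C - 1.0:   # assumed 1 C hysteresis
        engaged = False
    if not engaged:
        return p_nominal, engaged
    # Example response: drop the core to 1.5 V (frequency scales down with it),
    # which the cubic relation turns into a ~42% power reduction.
    return dvs_power(p_nominal, 1.5), engaged

# Usage with made-up sensor readings for one interval:
power, engaged = dtm_step({"vertex_engine": 98.2, "rasterizer": 92.0},
                          p_nominal=8.0, engaged=False)
print(round(power, 2), engaged)   # 8.0 * (1.5/1.8)^3 = 4.63 W, DTM engaged
```

Clock gating would set power near idle instead of scaling it, fetch gating would throttle only one unit's activity, and MCD would apply the same voltage/frequency scaling per functional unit; all share the threshold-and-release structure shown here.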

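The floorplan choices above turn largely on power density (power per unit area): enlarging a hot block at roughly constant power lowers its density and therefore its steady-state temperature rise. The sketch below shows that bookkeeping; the block areas and powers are invented for illustration and do not correspond to the poster's actual floorplans.

```python
# Power-density bookkeeping over a floorplan: each block has an area (mm^2)
# and an average power (W); density = power / area. The areas and powers
# here are invented for illustration only.

def power_density(floorplan):
    """Return W/mm^2 for each named block."""
    return {name: p / area for name, (area, p) in floorplan.items()}

default_plan = {            # name: (area [mm^2], avg power [W])
    "fragment_engine": (12.0, 9.0),
    "texture_cache":   (8.0,  4.0),
    "framebuffer_ops": (6.0,  6.0),
}

# A "High Resolution"-style revision: the heavier workload raises power in the
# hot blocks, so those blocks are enlarged to keep density in check.
high_res_plan = {
    "fragment_engine": (18.0, 12.0),
    "texture_cache":   (12.0, 5.5),
    "framebuffer_ops": (9.0,  8.0),
}

for label, plan in [("default", default_plan), ("high-res", high_res_plan)]:
    dens = power_density(plan)
    worst = max(dens, key=dens.get)
    print(label, {k: round(v, 2) for k, v in dens.items()}, "worst:", worst)
```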

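Finally, the cycle-level simulation and the thermal model connect through a simple accounting step: per-interval activity counts for each functional unit are weighted by per-event energies to produce the power trace that the lumped-RC model consumes. The unit names, energies, and counts below are assumptions for illustration, not Qsilver's actual power model.

```python
# Sketch of activity-to-power accounting: per-interval event counts for each
# functional unit are weighted by assumed per-event energies (joules) and
# divided by the interval length to yield average power for the thermal model.
# All unit names, energies, and counts are illustrative assumptions.

CLOCK_HZ = 300e6
INTERVAL_CYCLES = 10_000              # aggregate activity over 10k cycles

ENERGY_PER_EVENT = {                  # joules per event (made-up values)
    "vertex_engine":   2.0e-9,        # per vertex transformed
    "rasterizer":      0.5e-9,        # per fragment generated
    "fragment_engine": 1.2e-9,        # per fragment shaded
    "texture_cache":   0.8e-9,        # per texel fetch
}
IDLE_POWER = {u: 0.2 for u in ENERGY_PER_EVENT}   # watts of idle/clock load

def interval_power(event_counts):
    """Convert one interval's event counts into per-unit average power (W)."""
    seconds = INTERVAL_CYCLES / CLOCK_HZ
    return {
        unit: IDLE_POWER[unit]
              + ENERGY_PER_EVENT[unit] * event_counts.get(unit, 0) / seconds
        for unit in ENERGY_PER_EVENT
    }

# Usage with made-up counts for one interval:
counts = {"vertex_engine": 1_200, "rasterizer": 40_000,
          "fragment_engine": 35_000, "texture_cache": 60_000}
print({k: round(v, 2) for k, v in interval_power(counts).items()})
```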