ISCA 2004 Tutorial: Thermal Issues for Temperature-Aware Computer Systems. Saturday, June 19th, 8:00am - 5:00pm.


1 ISCA 2004 Tutorial: Thermal Issues for Temperature-Aware Computer Systems. Saturday, June 19th, 8:00am - 5:00pm

2 Presenters: Kevin Skadron (skadron@cs.virginia.edu), CS Department, University of Virginia; David Brooks (dbrooks@eecs.harvard.edu), CS Department, Harvard University; Antonio Gonzalez (antonio@ac.upc.es), UPC-Barcelona and Intel Barcelona Research Center; Lev Finkelstein (lev.finkelstein@intel.com), Intel Haifa; Mircea Stan (mircea@virginia.edu), ECE Department, University of Virginia

3 Overview: 1. Motivation (Kevin) and 2. Thermal issues (Kevin), 1.5 hrs; 3. Power modeling (David) and 4. Thermal management (David), 1.5 hrs; 5. Optimal DTM (Lev), 0.5 hrs; 6. Clustering (Antonio), 1 hr; 7. Power distribution (David), 15 min; 8. What current chips do (Lev), 45 min; 9. HotSpot and sensors (Kevin), 1 hr

4 Overview: 1. Motivation (Kevin); 2. Thermal issues (Kevin); 3. Power modeling (David); 4. Thermal management (David); 5. Optimal DTM (Lev); 6. Clustering (Antonio); 7. Power distribution (David); 8. What current chips do (Lev); 9. HotSpot (Kevin)

5 Motivation. Power consumption is a first-order design constraint: unconstrained power is a theoretical max; peak (≈ instantaneous) power limits power delivery; sustained power limits thermal design/packaging; max sustained power (a thermal “virus”) is the same as thermal design power; average active power and idle power limit mobile battery life, etc. Common fallacy: equating instantaneous power with temperature. Power density is increasing exponentially, an unfortunate corollary of Moore’s Law, so thermal effects become more problematic. Need power/temperature-aware computing!

6 Power Dissipation. Source: Microprocessor Report

7 Effects of Technology Scaling on Power Dissipation. Feature size is scaling down by 30%. Frequency is increasing by ~2x. Area increases by 25% due to microarchitecture improvements (ideal scaling: decreases by 50%). Active capacitance increases by at least 30% (ideal scaling: decreases by 30%). Vdd is not scaled down at the same rate as feature size: 0-10% (ideal scaling: 30%). Ideal scaling: $P \propto C V^2 f$ gives a $0.7^2 \approx 0.5$ reduction. Observed scaling: a 2-2.5x increase. Power density becomes a problem, especially since the power density is non-uniform!
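The scaling arithmetic on this slide can be checked with a short sketch. It assumes that dynamic power follows P ∝ C·V²·f and that, under ideal scaling, frequency rises by 1/0.7 per generation; that frequency factor is a standard constant-field assumption, not a number from the slide. The function name is illustrative.

```python
# Relative change in dynamic power, P ∝ C * V^2 * f, across one generation.
def power_scale(c_scale, v_scale, f_scale):
    return c_scale * v_scale ** 2 * f_scale

# Ideal scaling: capacitance and Vdd both shrink by 30%.
# Assumption (not on the slide): frequency rises by 1/0.7.
ideal = power_scale(0.7, 0.7, 1 / 0.7)

# Observed trends from the slide: capacitance +30%, Vdd down 0-10%, frequency ~2x.
observed_low = power_scale(1.3, 0.9, 2.0)
observed_high = power_scale(1.3, 1.0, 2.0)

print(f"ideal scaling:    {ideal:.2f}x power")
print(f"observed scaling: {observed_low:.2f}x to {observed_high:.2f}x power")
```

Under these assumptions the ideal case roughly halves power per generation, while the observed trends more than double it, which is in line with the slide's 2-2.5x figure.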

8 Trends in Power Density. [Figure: power density in Watts/cm² on a log scale from 1 to 1000 for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with hot plate, nuclear reactor, and rocket nozzle as reference levels.] Source: “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.

9 ITRS Projections (ITRS 2001). These are targets. The power-density problem is still getting worse. Intel papers suggest that in the 45-75W range cooling costs $1/W, but then the rate of increase goes up: $2/W, $3/W, probably more! (Borkar, IEEE Micro ’99; Gunther et al., ITJ ’01)

10 Leakage Power. The fraction of leakage power is increasing exponentially with each generation. Leakage is also exponentially dependent on temperature. [Figure: increasing ratio across generations.] Source: Sankaranarayanan et al., University of Virginia

11 Power-aware figures of merit. Power (P), 1/W: battery time (mobile), packaging (high-performance). Energy (PD), MIPS/W: battery life (mobile), fundamental limits (kT). Energy-delay (PD²), MIPS²/W: performance and low power. Energy-delay² (PD³), MIPS³/W: independent of Vdd, emphasis on performance. Power-aware ≠ low power. Similar to “old” VLSI complexity measures (A, AD, AD²). None of these are appropriate for thermal, and this is a problem. Refs: R. Gonzalez et al., “Supply and threshold voltage scaling for low power CMOS”, JSSC, Aug. 1997; A. Martin et al., “Design of an Asynchronous MIPS R3000”, ARVLSI ’97; J. Ullman, “Computational Aspects of VLSI”, CS Press, 1984
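One way to see why the energy-delay² metric is described as independent of Vdd is the first-order sketch below. It assumes, beyond what the slide states, that switching energy scales as C·Vdd² and that on-current scales as Vdd² when Vdd is well above the threshold voltage, so gate delay scales as 1/Vdd.

```latex
E \propto C V_{dd}^{2}, \qquad
D \propto \frac{C V_{dd}}{I_{on}} \propto \frac{1}{V_{dd}}
\quad (\text{assuming } I_{on} \propto V_{dd}^{2})
\;\Longrightarrow\;
E D^{2} = P D^{3} \propto V_{dd}^{2} \cdot V_{dd}^{-2} = \text{const.}
```

Under the same assumptions, PD ∝ Vdd² and PD² ∝ Vdd, so the lower-order metrics can be improved simply by lowering the supply voltage.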

12 Cooking-aware computing. Some chips are rated for 100°C+

13 Power and temperature are BAD and can be EVIL. Source: Tom’s Hardware Guide, http://www6.tomshardware.com/cpu/01q3/010917/heatvideo-01.html

14 Other Costs of High Heat Flux. Some chips may already be underclocked due to thermal constraints (especially in mobile and sealed systems)!

15 Temporal, Spatial Variations. [Figure: temperature variation of SPEC applu over time.] Hot spots increase cooling costs: the package must cool for the hot spot.

16 Application Variations. Wide variation across applications. Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT). Leakage is an especially severe problem: it is exponentially dependent on temperature!

17 Heat vs. Temperature. Different time scales. Heat: no notion of spatial locality. Does architecture have a role? Temperature-aware computing: optimize performance subject to a temperature constraint.

18 Overview: 1. Motivation (Kevin); 2. Thermal issues (Kevin); 3. Power modeling (David); 4. Thermal management (David); 5. Optimal DTM (Lev); 6. Clustering (Antonio); 7. Power distribution (David); 8. What current chips do (Lev); 9. HotSpot and sensors (Kevin)

19 Thermal issues. Temperature affects: circuit performance; circuit power (leakage); IC reliability; IC and system packaging cost; environment.

20 Performance and leakage. Temperature affects: transistor threshold and mobility; subthreshold leakage and gate leakage; Ion, Ioff, Igate, delay. ITRS: 85°C for high-performance, 110°C for embedded! [Figure: NMOS Ion and Ioff.]

21 Temperature-aware circuits. Robustness constraint: sets the Ion/Ioff ratio. Robustness and reliability: Ion/Igate ratio. Idea: keep the ratios constant with T, trading leakage for performance! Refs: Ghoshal et al., “Refrigeration Technologies…”, ISSCC 2000; Garrett et al., “T3…”, ISCAS 2001

22 Resulting performance. 25%-30% extra performance (110°C to 0°C). [Figure legend: regular, TAC.]

23 Reliability. The Arrhenius equation: $\mathrm{MTF} = A \exp\left(\frac{E_a}{kT}\right)$, where MTF is the mean time to failure at T, A is an empirical constant, $E_a$ is the activation energy, k is Boltzmann’s constant, and T is the absolute temperature. Failure mechanisms: die metallization (corrosion, electromigration, contact spiking); oxide (charge trapping, gate oxide breakdown, hot electrons); device (ionic contamination, second breakdown, surface charge); die attach (fracture, thermal breakdown, adhesion fatigue); interconnect (wirebond failure, flip-chip joint failure); package (cracking, whisker and dendritic growth, lid seal failure). Most of the above increase with T (Arrhenius). Notable exception: hot electrons are worse at low temperatures. More on this later.
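As a rough illustration of the Arrhenius relation, the sketch below computes the ratio of mean times to failure at two temperatures; the empirical constant A cancels in the ratio. The activation energies and temperatures are assumed values chosen for illustration, not figures from the tutorial.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K

def mtf_ratio(t_cool_c, t_hot_c, ea_ev):
    """MTF(t_cool) / MTF(t_hot) from MTF = A * exp(Ea / (k * T)); A cancels."""
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_cool_k - 1.0 / t_hot_k))

# Hypothetical activation energies in eV; real values depend on the failure mechanism.
for ea_ev in (0.5, 0.7, 1.0):
    ratio = mtf_ratio(85.0, 95.0, ea_ev)
    print(f"Ea = {ea_ev} eV: lifetime at 85 C is {ratio:.2f}x the lifetime at 95 C")
```

The ratio depends on both the activation energy and the starting temperature, so within this model a 10° rise does not correspond to one fixed lifetime factor.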

24 Packaging cost. From Cray (local power generator and refrigeration)… Source: Gordon Bell, “A Seymour Cray perspective”, http://www.research.microsoft.com/users/gbell/craytalk/

25 Packaging cost. To today… Grid computing: power plants co-located near compute farms. IBM S/390: refrigeration. Source: R. R. Schmidt, B. D. Notohardjono, “High-end server low temperature cooling”, IBM Journal of R&D

26 IBM S/390 refrigeration. Complex and expensive. Source: R. R. Schmidt, B. D. Notohardjono, “High-end server low temperature cooling”, IBM Journal of R&D

27 IBM S/390 processor packaging. Processor subassembly: complex! C4: Controlled Collapse Chip Connection (flip-chip). Source: R. R. Schmidt, B. D. Notohardjono, “High-end server low temperature cooling”, IBM Journal of R&D

28 Intel Itanium packaging. Complex and expensive (note heatpipe). Source: H. Xie et al., “Packaging the Itanium Microprocessor”, Electronic Components and Technology Conference 2002

29 Intel Pentium 4 packaging. Simpler, but still… Source: Intel web site

30 Graphics Cards. Nvidia GeForce 5900 card. Source: Tech-Report.com

31 More Graphics Cards

32 Under/Overclocking. Some chips need to be underclocked, especially in constrained form factors. Try fitting this in a laptop or Gameboy! [Photo: Ultra model of Gigabyte's 3D Cooler Series.] Source: Tom’s Hardware Guide

33 Apple G5: liquid cooling. Details unknown. Many people in the thermal engineering community think liquid cooling is inevitable, especially for server rooms. But others say no: this introduces a whole new kind of leakage problem. Water and electronics don’t mix!

34 Environment. Environmental Protection Agency (EPA): computers account for 10% of commercial electricity consumption. This includes peripherals, possibly also manufacturing. A DOE report suggested this percentage is much lower. No consensus, but it’s still a lot. Air conditioning (AC) needs equivalent power (at only 30% efficiency) to remove the heat. CFCs are used for refrigeration. Lap burn. Fan noise.

35 Heat mechanisms: conduction; convection; radiation; phase change; heat storage.

36 Conduction. Similar to electrical conduction (e.g. metals are good conductors). Heat flows from high energy to low energy. Microscopic (vibration, adjacent molecules, electron transport). No major displacement of molecules. Needs a material: typically occurs in solids (fluids: distance between molecules). Typical examples: thermal “slug”, spreader, heatsink. Source: CRC Press, R. Remsburg Ed., “Thermal Design of Electronic Equipment”, 2001

37 Conduction. Thermal conductivity is not a strong function of temperature. But for the large temperature variations on high-performance chips (30+°), it matters. Note especially Si vs. Al, Cu. Source: CRC Press, R. Remsburg Ed., “Thermal Design of Electronic Equipment”, 2001

38 Convection. Macroscopic (bulk transport, mix of hot and cold, energy storage). Needs a material (typically a fluid: liquid or gas). Natural vs. forced (gas or liquid). Typical examples: heatsink (fan), liquid cooling. Note that convection is profoundly affected by board layout. Source: CRC Press, R. Remsburg Ed., “Thermal Design of Electronic Equipment”, 2001

39 Radiation. Electromagnetic waves (can occur in vacuum). Negligible in typical applications. Sometimes the only mechanism (e.g. in space). Source: CRC Press, R. Remsburg Ed., “Thermal Design of Electronic Equipment”, 2001

40 Carnot Efficiency. Note that in all cases, heat transfer is proportional to ΔT. This is also one of the reasons energy “harvesting” in computers is probably not cost-effective: ΔT w.r.t. ambient is << 100°. For example, with a 25W processor, the thermoelectric effect yields only ~50mW (Solbrekken et al., ITHERM ’04). This is also why Peltier coolers are not energy efficient: 10% efficiency, vs. 30% for a refrigerator.
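To put the small ΔT in perspective, the sketch below evaluates the textbook Carnot limit, η = 1 - T_cold/T_hot with temperatures in kelvin, for a 25 W processor. The 85°C hot side and 45°C ambient are assumed values for illustration; the slide gives only the 25 W and ~50 mW figures.

```python
def carnot_efficiency(t_hot_c, t_cold_c):
    """Upper bound on heat-to-work conversion between two reservoirs."""
    t_hot_k = t_hot_c + 273.15
    t_cold_k = t_cold_c + 273.15
    return 1.0 - t_cold_k / t_hot_k

PROCESSOR_POWER_W = 25.0   # from the slide
T_DIE_C = 85.0             # assumed hot-side temperature
T_AMBIENT_C = 45.0         # assumed in-case ambient temperature

eta = carnot_efficiency(T_DIE_C, T_AMBIENT_C)
print(f"Carnot limit: {eta:.1%}, at most {PROCESSOR_POWER_W * eta:.2f} W recoverable")
```

Even this ideal bound is only a small fraction of the dissipated power, and a practical thermoelectric device recovers far less, which is consistent with the ~50 mW cited on the slide.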

41 Surface-to-surface contacts. Not negligible: heat crowding. Thermal greases/epoxy (can “pump out”). Phase-change films (undergo a transition from solid to semi-solid with the application of heat). Source: CRC Press, R. Remsburg Ed., “Thermal Design of Electronic Equipment”, 2001

42 Phase change. Evolution of thermal solutions: natural air cooling; forced-air cooling; liquid cooling; phase change (e.g. heat pipe); refrigeration. Phase changes: a. solid changing to a liquid (fusion, or melting); b. liquid changing to a vapor (evaporation, also boiling); c. vapor changing to a liquid (condensation); d. liquid changing to a solid (crystallization, or freezing); e. solid changing to a vapor (sublimation); f. vapor changing to a solid (deposition).

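43 Thermal resistance: $\Theta = \rho \, t / A = t / (kA)$, where $\rho$ is the thermal resistivity, $k = 1/\rho$ the thermal conductivity, $t$ the thickness, and $A$ the cross-sectional area.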
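A minimal sketch of the slab formula Θ = t/(kA) follows. The conductivities are approximate room-temperature textbook values and the slab dimensions are hypothetical; neither comes from the slides.

```python
def thermal_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """Theta = t / (k * A) for one-dimensional conduction through a slab, in K/W."""
    return thickness_m / (conductivity_w_per_mk * area_m2)

# Approximate room-temperature conductivities in W/(m*K); assumed values.
CONDUCTIVITY = {"Si": 150.0, "Al": 237.0, "Cu": 400.0}

THICKNESS_M = 0.5e-3   # hypothetical 0.5 mm slab
AREA_M2 = 1.0e-4       # hypothetical 1 cm^2 cross-section

theta = {name: thermal_resistance(THICKNESS_M, k, AREA_M2)
         for name, k in CONDUCTIVITY.items()}
for name, value in theta.items():
    print(f"{name}: {value:.4f} K/W")
```

For the same geometry, resistance falls as conductivity rises, and halving the area doubles the resistance.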

44 Thermal capacitance: $C_{th} = V \cdot C_p \cdot \rho$, where $\rho$ is the density. Example: $\rho$ (aluminum) = 2,710 kg/m³; $C_p$ (aluminum) = 875 J/(kg·°C); $V = t \cdot A$ = 0.000025 m³; $C_{bulk} = V \cdot C_p \cdot \rho$ = 59.28 J/°C.
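The aluminum example can be reproduced directly from the values on the slide. The only addition here is the last line, which uses the result to estimate the energy needed for a 10°C rise; the 10°C figure is an assumed illustration.

```python
DENSITY_AL_KG_PER_M3 = 2710.0       # rho (aluminum), from the slide
SPECIFIC_HEAT_AL_J_PER_KG_C = 875.0 # C_p (aluminum), from the slide
VOLUME_M3 = 0.000025                # V = t * A, from the slide

# C_th = V * C_p * rho
c_bulk = VOLUME_M3 * SPECIFIC_HEAT_AL_J_PER_KG_C * DENSITY_AL_KG_PER_M3
print(f"C_bulk = {c_bulk:.2f} J/°C")

# Energy to raise the block by an assumed 10 °C: E = C_th * dT
energy_j = c_bulk * 10.0
print(f"energy for a 10 °C rise: {energy_j:.1f} J")
```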

45 Refrigeration. “Conventional” vs. thermoelectric (TEC). Can get T < T_amb (“negative” Rth!). TEC: Peltier effect (can be used for local cooling).

46 TEC electro-thermal model

47 Simplistic steady-state model. All thermal transfer: R = k/A for some constant k, so power density matters! Ohm’s law for thermals (steady state): $\Delta V = I \cdot R \rightarrow \Delta T = P \cdot R$, so $T_{hot} = P \cdot R_{th} + T_{amb}$. Ways to reduce T_hot: reduce P (power-aware); reduce Rth (packaging); reduce T_amb (Alaska?); maybe also take advantage of transients (Cth).
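The point that power density matters follows from combining R = k/A with T_hot = P · Rth + T_amb. The sketch below dissipates the same power over two different areas; the lumped constant, power, areas, and ambient temperature are all hypothetical values chosen to make the arithmetic easy.

```python
def hot_temperature(power_w, area_m2, k_lumped, t_amb_c):
    """T_hot = P * R_th + T_amb, with R_th = k / A (k is a lumped constant)."""
    r_th = k_lumped / area_m2
    return power_w * r_th + t_amb_c

K_LUMPED = 1.0e-4   # hypothetical constant in K*m^2/W, giving 1 K/W at 1 cm^2
T_AMB_C = 45.0      # hypothetical ambient temperature
POWER_W = 20.0      # hypothetical dissipated power

spread_out = hot_temperature(POWER_W, 1.0e-4, K_LUMPED, T_AMB_C)    # 1 cm^2
concentrated = hot_temperature(POWER_W, 0.25e-4, K_LUMPED, T_AMB_C) # 0.25 cm^2
print(f"20 W over 1 cm^2:    {spread_out:.1f} °C")
print(f"20 W over 0.25 cm^2: {concentrated:.1f} °C")
```

With these numbers, concentrating the same 20 W into a quarter of the area quadruples the temperature rise over ambient.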

48 Simplistic dynamic thermal model. Electrical-thermal duality: V ↔ temperature (T); I ↔ power (P); R ↔ thermal resistance (Rth); C ↔ thermal capacitance (Cth); RC ↔ time constant. KCL differential equation: $I = C \cdot dV/dt + V/R$. Difference equation: $\Delta V = \frac{I}{C} \Delta t - \frac{V}{RC} \Delta t$. Thermal domain: $\Delta T = \frac{P}{C} \Delta t - \frac{T}{RC} \Delta t$, where $T = T_{hot} - T_{amb}$. One can compute stepwise changes in temperature at any granularity at which one can get P, T, R, C.
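Rearranging the KCL equation I = C · dV/dt + V/R gives dV/dt = I/C - V/(RC), so in the thermal domain each time step changes the temperature rise by (P/C - T/(RC)) · Δt. The sketch below applies that update with forward Euler to a single RC node. The R, C, time step, and power trace are hypothetical values, and the function name is illustrative.

```python
def simulate_temperature(power_trace_w, r_th, c_th, dt_s, t_amb_c):
    """Step dT/dt = P/C - T/(R*C) forward in time, where T = T_hot - T_amb."""
    rise = 0.0  # start at ambient temperature
    temps_c = []
    for power_w in power_trace_w:
        rise += (power_w / c_th - rise / (r_th * c_th)) * dt_s
        temps_c.append(t_amb_c + rise)
    return temps_c

R_TH = 1.0     # K/W, hypothetical
C_TH = 60.0    # J/K, hypothetical (time constant RC = 60 s)
DT_S = 0.1     # time step, chosen much smaller than RC for stability
trace = [40.0] * 6000   # constant 40 W for 600 s

temps = simulate_temperature(trace, R_TH, C_TH, DT_S, t_amb_c=45.0)
print(f"after one step: {temps[0]:.3f} °C, after 600 s: {temps[-1]:.3f} °C")
```

After about ten time constants the temperature settles close to the steady-state value T_amb + P · Rth, so the dynamic model reduces to the steady-state one for constant power.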

49 Combined package model (steady state). Tj: junction temperature; Tc: case temperature; Ts: heatsink temperature; Ta: ambient temperature. Note: Θja is meaningless! What exactly is Ta? Θjc is better but still sketchy. [Figure label: guts of the component.] Source: CRC Press, R. Remsburg Ed., “Thermal Design of Electronic Equipment”, 2001
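In its crudest steady-state form, the package model is a series chain of resistances from junction to ambient, as sketched below. The names theta_cs (case to sink) and theta_sa (sink to ambient), and all numeric values, are assumptions for illustration; the slide names only Θja and Θjc.

```python
def package_temperatures(power_w, theta_jc, theta_cs, theta_sa, t_amb_c):
    """Walk the series chain ambient -> heatsink -> case -> junction."""
    t_sink_c = t_amb_c + power_w * theta_sa
    t_case_c = t_sink_c + power_w * theta_cs
    t_junction_c = t_case_c + power_w * theta_jc
    return t_junction_c, t_case_c, t_sink_c

# Hypothetical resistances in K/W and operating point.
t_j, t_c, t_s = package_temperatures(
    power_w=50.0, theta_jc=0.3, theta_cs=0.1, theta_sa=0.4, t_amb_c=40.0
)
print(f"Ts = {t_s:.1f} °C, Tc = {t_c:.1f} °C, Tj = {t_j:.1f} °C")
```

This chain yields a single junction temperature and a single ambient temperature, which is exactly the simplification the slide cautions about: it hides on-die hot spots and leaves open where Ta is measured.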

50 Reliability as f(T). Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions. But actual behavior is often not worst case, so aging occurs more slowly. This means the DTM design is over-engineered! We can exploit this, e.g. for DTM or frequency. [Figure labels: bank, spend.]

51 EM Model. Life consumption rate [equation shown on slide]: apply in a “lumped” fashion at the granularity of microarchitecture units, just like RAMP [Srinivasan et al.]

52 Reliability-Aware DTM

53 Temperature limits. Temperature limits for circuit performance can be measured. Temperature limits for reliability are at best an estimate: 150°C is a reasonable rule of thumb for when immediate damage might occur; chips are typically specified at lower temperatures, 100-125°C, for both performance and long-term reliability; the rule of thumb that every 10° halves circuit lifetime is false (it originates from a mil-spec that has been debunked).

54 Thermal issues summary. Temperature affects performance, power, and reliability. Architecture-level: conduction only. Very crude approximation of convection as an equivalent resistance. Convection: too complicated, needs CFD! Radiation: can be ignored. Use compact models for the package. Power density is key. Temporal and spatial variation are key. Hot spots drive thermal design.

55 Review of Thermal Issues. From ITHERM ’04 keynote by Ken Goodson, Stanford/Cooligy

