Published by Elwin Reynolds. Modified over 8 years ago.
1
1 ISCA 2004 Tutorial: Thermal Issues for Temperature-Aware Computer Systems. Saturday, June 19th, 8:00am - 5:00pm
2
2 Presenters:
- Kevin Skadron (skadron@cs.virginia.edu), CS Department, University of Virginia
- David Brooks (dbrooks@eecs.harvard.edu), CS Department, Harvard University
- Antonio Gonzalez (antonio@ac.upc.es), UPC-Barcelona and Intel Barcelona Research Center
- Lev Finkelstein (lev.finkelstein@intel.com), Intel Haifa
- Mircea Stan (mircea@virginia.edu), ECE Department, University of Virginia
3
3 Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin) [items 1-2: 1.5 hrs]
3. Power modeling (David)
4. Thermal management (David) [items 3-4: 1.5 hrs]
5. Optimal DTM (Lev): 0.5 hrs
6. Clustering (Antonio): 1 hr
7. Power distribution (David): 15 min
8. What current chips do (Lev): 45 min
9. HotSpot and sensors (Kevin): 1 hr
5
5 Motivation
- Power consumption is a first-order design constraint
  - Unconstrained power is a theoretical maximum
  - Peak (instantaneous) power limits power delivery
  - Sustained power limits thermal design and packaging
    - Maximum sustained power (a thermal “virus”) is the same as thermal design power
  - Average active power and idle power limit mobile battery life, etc.
- Common fallacy: treating instantaneous power as equivalent to temperature
- Power density is increasing exponentially
  - An unfortunate corollary of Moore’s Law: thermal effects become more problematic
- Need power/temperature-aware computing!
6
6 Power Dissipation Source: Microprocessor Report
7
7 Effects of Technology Scaling on Power Dissipation
- Feature size is scaling down: 30%
- Frequency is increasing: ~2x
- Area increases due to microarchitecture improvements: 25% (ideal scaling: decreases by 50%)
- Active capacitance increases: at least 30% (ideal scaling: decreases by 30%)
- Vdd is not scaled down at the same rate as feature size: 0-10% (ideal scaling: 30%)
- Ideal scaling: $P \propto C V^2 f$ → $0.7^2 \approx 0.5$ reduction
- Observed scaling → 2-2.5x increase
- Power density becomes a problem, especially since the power density is non-uniform
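The scaling arithmetic on this slide can be checked with a short sketch. It applies $P \propto C V^2 f$ to the per-generation factors listed above. The ideal-scaling frequency factor of 1/0.7 is an assumption not stated on the slide; it is chosen because it reproduces the slide's $0.7^2$ result. The function name `power_ratio` is illustrative only.

```python
def power_ratio(c_scale, vdd_scale, f_scale):
    """Return P_new / P_old for P proportional to C * Vdd^2 * f."""
    return c_scale * vdd_scale ** 2 * f_scale


# Ideal scaling: C and Vdd both shrink by 30%.
# Assumption (not on the slide): frequency rises by 1/0.7 under ideal scaling.
ideal = power_ratio(0.7, 0.7, 1 / 0.7)

# Observed scaling from the slide: C up 30%, f up ~2x, Vdd down 0-10%.
observed_low = power_ratio(1.3, 0.9, 2.0)   # Vdd reduced by 10%
observed_high = power_ratio(1.3, 1.0, 2.0)  # Vdd not reduced at all

print(f"ideal:    {ideal:.2f}x")
print(f"observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

Under these assumptions the ideal case comes out at 0.49x and the observed case at roughly 2.1x to 2.6x, which is in line with the slide's 0.5 and 2-2.5x figures.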
8
8 Trends in Power Density
[Figure: power density in Watts/cm² (log scale, 1 to 1000) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, compared against a hot plate, a nuclear reactor, and a rocket nozzle.]
Source: Fred Pollack, Intel Corp., “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Micro32 conference keynote, 1999.
9
9 ITRS Projections (ITRS 2001)
- These are targets
- The power-density problem is still getting worse
- Intel papers suggest that in the 45-75W range, cooling costs $1/W; beyond that the rate of increase goes up: $2/W, $3/W, probably more! (Borkar, IEEE Micro ’99; Gunther et al., ITJ ’01)
10
10 Leakage Power
- The fraction of leakage power is increasing exponentially with each generation
- Leakage is also exponentially dependent on temperature
- Increasing ratio across generations
Source: Sankaranarayanan et al., University of Virginia
11
11 Power-aware figures of merit
- Power (P), in 1/W: battery time (mobile), packaging (high-performance)
- Energy (PD), in MIPS/W: battery life (mobile), fundamental limits (kT)
- Energy-delay (PD²), in MIPS²/W: performance and low power
- Energy-delay² (PD³), in MIPS³/W: independent of Vdd, emphasis on performance
- Power-aware ≠ low power
- Similar to the “old” VLSI complexity measures (A, AD, AD²)
- None of these are appropriate for thermal design, and this is a problem
Refs:
- R. Gonzales et al., “Supply and threshold voltage scaling for low power CMOS”, JSSC, Aug. 1997
- A. Martin et al., “Design of an Asynchronous MIPS R3000”, ARVLSI’97
- J. Ullman, “Computational aspects of VLSI”, CS Press, 1984
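A small sketch can illustrate why PD³ is described as independent of Vdd. It assumes a first-order model that is not spelled out on the slide: clock frequency proportional to Vdd (so delay D is proportional to 1/Vdd) and dynamic power proportional to C·Vdd²·f. The function `figures_of_merit` and its parameters are hypothetical names used only for this illustration.

```python
def figures_of_merit(vdd, capacitance=1.0, freq_per_volt=1.0):
    """Return P, PD, PD^2, PD^3 under a first-order model.

    Assumptions (not from the slide): f = freq_per_volt * Vdd and
    P = C * Vdd^2 * f, in arbitrary normalized units.
    """
    freq = freq_per_volt * vdd
    delay = 1.0 / freq
    power = capacitance * vdd ** 2 * freq
    return {
        "P": power,
        "PD": power * delay,
        "PD2": power * delay ** 2,
        "PD3": power * delay ** 3,
    }


# Sweep a few supply voltages and watch which metric stays flat.
sweep = {vdd: figures_of_merit(vdd) for vdd in (0.8, 1.0, 1.2)}
for vdd, m in sweep.items():
    print(f"Vdd={vdd:.1f}  P={m['P']:.3f}  PD={m['PD']:.3f}  "
          f"PD2={m['PD2']:.3f}  PD3={m['PD3']:.3f}")
```

In this model P grows as Vdd³, PD as Vdd², and PD² as Vdd, while PD³ stays constant, so PD³ compares designs without rewarding a simple voltage change.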
12
12 Cooking-aware computing Some chips rated for 100°C+
13
13 Power and temperature are BAD and can be EVIL Source: Tom’s Hardware Guide http://www6.tomshardware.com/cpu/01q3/010917/heatvideo-01.html
14
14 Other Costs of High Heat Flux Some chips may already be underclocked due to thermal constraints! –(especially mobile and sealed systems)
15
15 Temporal, Spatial Variations
- Temperature variation of SPEC applu over time
- Hot spots increase cooling costs: the cooling solution must be sized for the hot spot
16
16 Application Variations Wide variation across applications Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT) –Leakage is an especially severe problem: exponentially dependent on temperature!
17
17 Heat vs. Temperature Different time scales Heat: no notion of spatial locality Does architecture have a role? Temperature-aware computing: Optimize performance subject to a temperature constraint
18
18 Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot and sensors (Kevin)
19
19 Thermal issues
Temperature affects:
- Circuit performance
- Circuit power (leakage)
- IC reliability
- IC and system packaging cost
- Environment
20
20 Performance and leakage
Temperature affects:
- Transistor threshold and mobility
- Subthreshold leakage, gate leakage
- Ion, Ioff, Igate, delay
ITRS: 85°C for high-performance, 110°C for embedded!
[Figure: NMOS Ion and Ioff]
21
21 Temperature-aware circuits
- Robustness constraint: sets the Ion/Ioff ratio
- Robustness and reliability: Ion/Igate ratio
- Idea: keep the ratios constant with T, trading leakage for performance!
Refs: Ghoshal et al., “Refrigeration Technologies…”, ISSCC 2000; Garrett et al., “T3…”, ISCAS 2001
22
22 Resulting performance
25%-30% extra performance (110°C to 0°C)
[Figure legend: regular, TAC]
23
23 Reliability
The Arrhenius equation: $MTF = A \cdot \exp\left(\frac{E_a}{K \cdot T}\right)$
- MTF: mean time to failure at T
- A: empirical constant
- $E_a$: activation energy
- K: Boltzmann’s constant
- T: absolute temperature
Failure mechanisms:
- Die metalization (corrosion, electromigration, contact spiking)
- Oxide (charge trapping, gate oxide breakdown, hot electrons)
- Device (ionic contamination, second breakdown, surface-charge)
- Die attach (fracture, thermal breakdown, adhesion fatigue)
- Interconnect (wirebond failure, flip-chip joint failure)
- Package (cracking, whisker and dendritic growth, lid seal failure)
Most of the above increase with T (Arrhenius). Notable exception: hot electrons are worse at low temperatures. More on this later.
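As a rough illustration of the Arrhenius equation above, the sketch below compares mean time to failure at two operating temperatures. The activation energy of 0.7 eV is an assumed example value, not one given in the tutorial, and the empirical constant A cancels out of the ratio. The two temperatures reuse the 85°C and 110°C ITRS figures quoted earlier in the deck.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def mtf(temp_c, activation_energy_ev, a=1.0):
    """Arrhenius mean time to failure: A * exp(Ea / (K * T)), T in kelvin."""
    temp_k = temp_c + 273.15
    return a * math.exp(activation_energy_ev / (BOLTZMANN_EV_PER_K * temp_k))


def lifetime_ratio(t_cool_c, t_hot_c, activation_energy_ev):
    """How many times longer the part lasts at t_cool_c than at t_hot_c."""
    return mtf(t_cool_c, activation_energy_ev) / mtf(t_hot_c, activation_energy_ev)


EA_EXAMPLE_EV = 0.7  # assumed activation energy, for illustration only

ratio = lifetime_ratio(85.0, 110.0, EA_EXAMPLE_EV)
print(f"MTF at 85 C is {ratio:.1f}x the MTF at 110 C (Ea = {EA_EXAMPLE_EV} eV)")
```

The ratio depends strongly on the assumed activation energy, which differs between failure mechanisms, so a single number should not be read as a general rule.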
24
24 Packaging cost From Cray (local power generator and refrigeration)… Source: Gordon Bell, “A Seymour Cray perspective” http://www.research.microsoft.com/users/gbell/craytalk/
25
25 Packaging cost To today… Grid computing: power plants co-located near compute farms IBM S/390: refrigeration Source: R. R. Schmidt, B. D. Notohardjono “High-end server low temperature cooling” IBM Journal of R&D
26
26 IBM S/390 refrigeration Complex and expensive Source: R. R. Schmidt, B. D. Notohardjono “High-end server low temperature cooling” IBM Journal of R&D
27
27 IBM S/390 processor packaging Processor subassembly: complex! C4: Controlled Collapse Chip Connection (flip-chip) Source: R. R. Schmidt, B. D. Notohardjono “High-end server low temperature cooling” IBM Journal of R&D
28
28 Intel Itanium packaging Complex and expensive (note heatpipe) Source: H. Xie et al. “Packaging the Itanium Microprocessor” Electronic Components and Technology Conference 2002
29
29 Intel Pentium 4 packaging Simpler, but still… Source: Intel web site
30
30 Graphics Cards Nvidia GeForce 5900 card Source: Tech-Report.com
31
31 More Graphics Cards
32
32 Under/Overclocking Some chips need to be underclocked –Especially true in constrained form factors Try fitting this in a laptop or Gameboy! Ultra model of Gigabyte's 3D Cooler Series Source: Tom’s Hardware Guide
33
33 Apple G5 – liquid cooling Don’t know details Lots of people in thermal engineering community think liquid is inevitable, especially for server rooms But others say no: –This introduces a whole new kind of leakage problem –Water and electronics don’t mix!
34
34 Environment
- Environmental Protection Agency (EPA): computers account for 10% of commercial electricity consumption
  - This includes peripherals, possibly also manufacturing
  - A DOE report suggested this percentage is much lower
  - No consensus, but it’s still a lot
- Equivalent power (with only 30% efficiency) for air conditioning
- CFCs used for refrigeration
- Lap burn
- Fan noise
35
35 Heat mechanisms Conduction Convection Radiation Phase change Heat storage
36
36 Conduction
- Similar to electrical conduction (e.g. metals are good conductors)
- Heat flows from high energy to low energy
- Microscopic (vibration, adjacent molecules, electron transport)
- No major displacement of molecules
- Needs a material: typically in solids (in fluids, molecules are farther apart)
- Typical example: thermal “slug”, spreader, heatsink
Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
37
37 Conduction
- Thermal conductivity is not a strong function of temperature
- But for the high temperature variations on high-performance chips (30+°), it matters
- Note especially Si vs. Al, Cu
Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
38
38 Convection Macroscopic (bulk transport, mix of hot and cold, energy storage) Need material (typically in fluids, liquid, gas) Natural vs. forced (gas or liquid) Typical example: heatsink (fan), liquid cooling Note that convection is profoundly affected by board layout Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
39
39 Radiation Electromagnetic waves (can occur in vacuum) Negligible in typical applications Sometimes the only mechanism (e.g. in space) Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
40
40 Carnot Efficiency
- Note that in all cases, heat transfer is proportional to ΔT
- This is also one of the reasons energy “harvesting” in computers is probably not cost-effective
  - ΔT w.r.t. ambient is << 100°
  - For example, with a 25W processor, the thermoelectric effect yields only ~50mW (Solbrekken et al., ITHERM’04)
- This is also why Peltier coolers are not energy efficient
  - 10% efficiency, vs. 30% for a refrigerator
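To make the Carnot argument concrete, the sketch below computes the textbook Carnot limit, 1 - T_cold/T_hot with temperatures in kelvin, for a small ΔT. The 85°C hot side and 45°C ambient are assumed example values; only the 25W processor power comes from the slide. The result is a theoretical upper bound on harvested power, not a prediction for a real thermoelectric device.

```python
def carnot_efficiency(t_hot_c, t_cold_c):
    """Carnot limit 1 - T_cold / T_hot, with temperatures converted to kelvin."""
    t_hot_k = t_hot_c + 273.15
    t_cold_k = t_cold_c + 273.15
    return 1.0 - t_cold_k / t_hot_k


PROCESSOR_POWER_W = 25.0  # from the slide
T_HOT_C = 85.0            # assumed chip temperature
T_AMBIENT_C = 45.0        # assumed in-case ambient temperature

eta = carnot_efficiency(T_HOT_C, T_AMBIENT_C)
upper_bound_w = PROCESSOR_POWER_W * eta
print(f"Carnot limit: {eta:.1%}, at most {upper_bound_w:.2f} W recoverable")
```

Even this ideal bound is only about a tenth of the dissipated power for a 40° difference, and the ~50mW reported on the slide shows how far practical thermoelectric harvesting falls below it.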
41
41 Surface-to-surface contacts Not negligible, heat crowding Thermal greases/epoxy (can “pump-out”) Phase Change Films (undergo a transition from solid to semi-solid with the application of heat) Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
42
42 Phase-change
Thermal solutions evolution:
- Natural air cooling
- Forced-air cooling
- Liquid cooling
- Phase change (e.g. heat pipe)
- Refrigeration
Phase change:
a. Solid changing to a liquid: fusion, or melting
b. Liquid changing to a vapor: evaporation, also boiling
c. Vapor changing to a liquid: condensation
d. Liquid changing to a solid: crystallization, or freezing
e. Solid changing to a vapor: sublimation
f. Vapor changing to a solid: deposition
43
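43 Thermal resistance
$\Theta = \rho \, t / A = t / (k A)$
where ρ is the thermal resistivity, k = 1/ρ is the thermal conductivity, t is the thickness, and A is the cross-sectional area.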
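A minimal sketch of the conduction formula Θ = t / (k·A) follows. The layer dimensions and the conductivity of 100 W/(m·K), a rough figure for silicon, are assumed example values rather than numbers from the tutorial.

```python
def thermal_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """Conduction resistance of a slab in K/W: thickness / (k * area)."""
    return thickness_m / (conductivity_w_per_mk * area_m2)


# Assumed example: a 0.5 mm thick layer, 1 cm x 1 cm, with k = 100 W/(m*K).
THICKNESS_M = 0.5e-3
AREA_M2 = 1e-4
CONDUCTIVITY = 100.0

theta = thermal_resistance(THICKNESS_M, CONDUCTIVITY, AREA_M2)
theta_quarter_area = thermal_resistance(THICKNESS_M, CONDUCTIVITY, AREA_M2 / 4)
print(f"full area:    {theta:.3f} K/W")
print(f"quarter area: {theta_quarter_area:.3f} K/W")
```

Shrinking the area by 4x raises the resistance by 4x, which is why the same power concentrated in a small hot spot produces a much larger temperature rise.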
44
44 Thermal capacitance
$C_{th} = V \cdot C_p \cdot \rho$
- ρ (Aluminum) = 2,710 kg/m³
- C_p (Aluminum) = 875 J/(kg·°C)
- V = t · A = 0.000025 m³
- C_bulk = V · C_p · ρ = 59.28 J/°C
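The aluminum example on this slide can be reproduced directly. The constants below are the slide's own values; only the variable names are introduced here.

```python
RHO_ALUMINUM = 2710.0     # density, kg/m^3 (from the slide)
CP_ALUMINUM = 875.0       # specific heat, J/(kg*degC) (from the slide)
VOLUME_M3 = 0.000025      # V = t * A, m^3 (from the slide)


def thermal_capacitance(volume_m3, specific_heat, density):
    """Bulk thermal capacitance in J/degC: V * Cp * rho."""
    return volume_m3 * specific_heat * density


c_bulk = thermal_capacitance(VOLUME_M3, CP_ALUMINUM, RHO_ALUMINUM)
print(f"C_bulk = {c_bulk:.2f} J/degC")
```

This evaluates to about 59.28 J/°C, matching the slide.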
45
45 Refrigeration “conventional” vs. thermo-electric (TEC) Can get T < T_amb (“negative” Rth!) TEC: Peltier effect (can use for local cooling)
46
46 TEC electro-thermal model
47
47 Simplistic steady-state model
- All thermal transfer: R = k/A
- Power density matters!
- Ohm’s law for thermals (steady-state): V = I · R -> T = P · R
- T_hot = P · Rth + T_amb
- Ways to reduce T_hot:
  - reduce P (power-aware)
  - reduce Rth (packaging)
  - reduce T_amb (Alaska?)
  - maybe also take advantage of transients (Cth)
[Figure: thermal circuit between T_hot and T_amb]
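The steady-state relation T_hot = P · Rth + T_amb, and the three ways of lowering T_hot listed on the slide, can be sketched as follows. All numeric values (50 W, 0.8 K/W, 45°C ambient and their variations) are assumed for illustration.

```python
def steady_state_temp(power_w, rth_k_per_w, t_amb_c):
    """Thermal Ohm's law: T_hot = P * Rth + T_amb."""
    return power_w * rth_k_per_w + t_amb_c


# Assumed baseline: 50 W through 0.8 K/W into a 45 degC ambient.
baseline = steady_state_temp(50.0, 0.8, 45.0)

# The three knobs from the slide, each changed on its own.
lower_power = steady_state_temp(40.0, 0.8, 45.0)     # reduce P
better_package = steady_state_temp(50.0, 0.6, 45.0)  # reduce Rth
cooler_ambient = steady_state_temp(50.0, 0.8, 25.0)  # reduce T_amb

for name, temp in [("baseline", baseline), ("lower power", lower_power),
                   ("better package", better_package),
                   ("cooler ambient", cooler_ambient)]:
    print(f"{name:15s} T_hot = {temp:.1f} degC")
```

The fourth option on the slide, exploiting transients through Cth, is not captured by a steady-state formula and needs a dynamic model.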
48
48 Simplistic dynamic thermal model
Electrical-thermal duality:
- V ↔ temperature (T)
- I ↔ power (P)
- R ↔ thermal resistance (Rth)
- C ↔ thermal capacitance (Cth)
- RC ↔ time constant
KCL:
- differential eq.: $I = C \cdot dV/dt + V/R$
- difference eq.: $\Delta V = \frac{I}{C} \cdot \Delta t - \frac{V}{RC} \cdot \Delta t$
- thermal domain: $\Delta T = \frac{P}{C} \cdot \Delta t - \frac{T}{RC} \cdot \Delta t$ (where $T = T_{hot} - T_{amb}$)
One can compute stepwise changes in temperature at any granularity at which one can get P, T, R, C.
[Figure: thermal RC circuit between T_hot and T_amb]
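The stepwise update described on this slide can be sketched as a forward-Euler loop. Rearranging the KCL equation I = C·dV/dt + V/R gives dT/dt = P/C - T/(RC) in the thermal domain, where T is the rise above ambient. The Rth, Cth, power, and time-step values below are assumed for illustration, and `simulate_rise` is a hypothetical helper name.

```python
def simulate_rise(power_w, rth, cth, dt_s, steps, initial_rise=0.0):
    """Forward-Euler trace of T = T_hot - T_amb for a single RC node.

    Each step applies delta_T = (P / C - T / (R * C)) * dt.
    """
    rise = initial_rise
    trace = []
    for _ in range(steps):
        rise += (power_w / cth - rise / (rth * cth)) * dt_s
        trace.append(rise)
    return trace


# Assumed example values: 50 W, 0.8 K/W, 60 J/K, so RC = 48 s.
POWER_W, RTH, CTH = 50.0, 0.8, 60.0
DT_S = 0.1

# Simulate 500 s, a little over ten time constants.
trace = simulate_rise(POWER_W, RTH, CTH, DT_S, steps=5000)
print(f"rise after 500 s: {trace[-1]:.2f} K "
      f"(steady state P*Rth = {POWER_W * RTH:.2f} K)")
```

The trace climbs toward the steady-state value P · Rth, which ties the dynamic model back to the steady-state one. Forward Euler is only one possible discretization, and it needs a time step well below RC to stay accurate.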
49
49 Combined package model (steady-state)
- Tj: junction temperature
- Tc: case temperature
- Ts: heatsink temperature
- Ta: ambient temperature
Note: Θja is meaningless! What exactly is Ta?
Θjc is better but still sketchy
[Figure label: guts of the component]
Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
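A lumped series-resistance chain is one simple way to relate the four temperatures named on this slide. The sketch below assumes three resistances (junction-to-case, case-to-sink, sink-to-ambient) with made-up values; the parameter names are hypothetical. It is exactly the kind of single-number simplification the slide warns about, so it is useful for intuition only.

```python
def package_temps(power_w, t_amb_c, theta_jc, theta_cs, theta_sa):
    """Return (Tj, Tc, Ts) for a series chain of thermal resistances.

    Assumes all power flows through one path: junction -> case ->
    heatsink -> ambient.
    """
    t_sink = t_amb_c + power_w * theta_sa
    t_case = t_sink + power_w * theta_cs
    t_junction = t_case + power_w * theta_jc
    return t_junction, t_case, t_sink


# Assumed example: 50 W into a 45 degC ambient.
tj, tc, ts = package_temps(50.0, 45.0, theta_jc=0.3, theta_cs=0.1, theta_sa=0.4)
print(f"Tj = {tj:.1f} degC, Tc = {tc:.1f} degC, Ts = {ts:.1f} degC")
```

Summing the three resistances gives an effective junction-to-ambient value, but that number hides where the ambient temperature is measured and ignores non-uniform power density.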
50
50 Reliability as f(T)
- Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions
- But actual behavior is often not worst case, so aging occurs more slowly
- This means the DTM design is over-engineered!
- We can exploit this, e.g. for DTM or frequency
[Figure labels: Bank, Spend]
51
51 EM Model Life Consumption Rate: Apply in a “lumped” fashion at the granularity of microarchitecture units, just like RAMP [Srinivasan et al.]
52
52 Reliability-Aware DTM
53
53 Temperature limits
- Temperature limits for circuit performance can be measured
- Temperature limits for reliability are at best an estimate
  - 150°C is a reasonable rule of thumb for when immediate damage might occur
  - Chips are typically specified at lower temperatures, 100-125°C, for both performance and long-term reliability
  - The rule of thumb that every 10° halves circuit lifetime is false; it originates from a mil-spec that has been debunked
54
54 Thermal issues summary
- Temperature affects performance, power, and reliability
- Architecture-level: conduction only
  - Very crude approximation of convection as equivalent resistance
  - Convection: too complicated, needs CFD!
  - Radiation: can be ignored
- Use compact models for package
- Power density is key
- Temporal, spatial variation are key
- Hot spots drive thermal design
55
55 Review of Thermal Issues From ITHERM’04 keynote by Ken Goodson, Stanford/Cooligy