Can we Save Energy if we allow Errors in Computing? Janak H. Patel Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign.

Slides:

Advertisements

Similar presentations

Final Project : Pipelined Microprocessor Joseph Kim.

Advertisements

Computer Abstractions and Technology

Lecture 11: Sequential Circuit Design. CMOS VLSI DesignCMOS VLSI Design 4th Ed. 11: Sequential Circuits2 Outline  Sequencing  Sequencing Element Design.

LOGIC GATES ADDERS FLIP-FLOPS REGISTERS Digital Electronics Mark Neil - Microprocessor Course 1.

Power Reduction Techniques For Microprocessor Systems

Microprocessors. Von Neumann architecture Data and instructions in single read/write memory Contents of memory addressable by location, independent of.

CENTRAL PROCESSING UNIT

Computer Organization and Architecture

Introduction to CMOS VLSI Design Lecture 21: Scaling and Economics

Synchronous Digital Design Methodology and Guidelines

Assume array size is 256 (mult: 4ns, add: 2ns)

Introduction to CMOS VLSI Design Lecture 18: Design for Low Power David Harris Harvey Mudd College Spring 2004.

May 14, ISVLSI 09 Algorithms for Estimating Number of Glitches and Dynamic Power in CMOS Circuits with Delay Variations Jins Davis Alexander Vishwani.

Embedded Systems Hardware:

Institute of Digital and Computer Systems 1 Fabio Garzia / Finding Peak Performance in a Process23/06/2015 Chapter 5 Finding Peak Performance in a Process.

Registers  Flip-flops are available in a variety of configurations. A simple one with two independent D flip-flops with clear and preset signals is illustrated.

University College Cork IRELAND Hardware Concepts An understanding of computer hardware is a vital prerequisite for the study of operating systems.

1 EECS Components and Design Techniques for Digital Systems Lec 21 – RTL Design Optimization 11/16/2004 David Culler Electrical Engineering and Computer.

Lecture 5 – Power Prof. Luke Theogarajan

Embedded Systems Hardware: Storage Elements; Finite State Machines; Sequential Logic.

Lecture 7: Power.

GCSE Computing - The CPU

The Processor Data Path & Control Chapter 5 Part 1 - Introduction and Single Clock Cycle Design N. Guydosh 2/29/04.

Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.

Practical Aspects of Logic Gates COE 202 Digital Logic Design Dr. Aiman El-Maleh College of Computer Sciences and Engineering King Fahd University of Petroleum.

Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations ‡ Computer Science and Engineering, UC San Diego variability.org.

EE466: VLSI Design Power Dissipation. Outline Motivation to estimate power dissipation Sources of power dissipation Dynamic power dissipation Static power.

Lecture#14. Last Lecture Summary Memory Address, size What memory stores OS, Application programs, Data, Instructions Types of Memory Non Volatile and.

Computing hardware CPU.

Rabie A. Ramadan Lecture 3

CS1Q Computer Systems Lecture 9 Simon Gay. Lecture 9CS1Q Computer Systems - Simon Gay2 Addition We want to be able to do arithmetic on computers and therefore.

Lecture 03: Fundamentals of Computer Design - Trends and Performance Kai Bu

Reconfigurable Computing - Verifying Circuits Performance! John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn.

Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.

1 EE 587 SoC Design & Test Partha Pande School of EECS Washington State University

C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.

ECE Advanced Digital Systems Design Lecture 12 – Timing Analysis Capt Michael Tanner Room 2F46A HQ U.S. Air Force Academy I n t e g r i.

Safe Overclocking Safe Overclocking of Tightly Coupled CGRAs and Processor Arrays using Razor © 2012 Guy Lemieux Alex Brant, Ameer Abdelhadi, Douglas Sim,

Low Power – High Speed MCML Circuits (II)

EEE2243 Digital System Design Chapter 7: Advanced Design Considerations by Muhazam Mustapha, extracted from Intel Training Slides, April 2012.

Computer Architecture Lecture 3 Combinational Circuits Ralph Grishman September 2015 NYU.

Morgan Kaufmann Publishers

DSP Architectures Additional Slides Professor S. Srinivasan Electrical Engineering Department I.I.T.-Madras, Chennai –

Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.

Patricia Gonzalez Divya Akella VLSI Class Project.

The Central Processing Unit (CPU)

Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.

Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 6.1 EE4800 CMOS Digital IC Design & Analysis Lecture 6 Power Zhuo Feng.

1 Chapter Seven. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value.

CS203 – Advanced Computer Architecture

Characterizing Processors for Energy and Performance Management Harshit Goyal and Vishwani D. Agrawal Department of Electrical and Computer Engineering,

Microprocessor Design Process

Unified Adaptivity Optimization of Clock and Logic Signals Shiyan Hu and Jiang Hu Dept of Electrical and Computer Engineering Texas A&M University.

Unit 2 Technology Systems

GCSE Computing - The CPU

Computer Maintenance Unit Subtitle: CPU’s Trade & Industrial Education

Exam 2 Review Two’s Complement Arithmetic Ripple carry ALU logic and performance Look-ahead techniques, performance and equations Basic multiplication.

Morgan Kaufmann Publishers

Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &

COSC 2021: Computer Organization Instructor: Dr. Amir Asif

The University of British Columbia

Fundamentals of Computer Science Part i2

Timing Analysis 11/21/2018.

Chapter 1 Introduction.

Post-Silicon Calibration for Large-Volume Products

Digital Circuits and Logic

GCSE Computing - The CPU

Computer Science. The CPU The CPU is made up of 3 main parts : Cache ALU Control Unit.

Presentation transcript:

Can we Save Energy if we allow Errors in Computing? Janak H. Patel Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign Auburn University February 18, 2016

2 Outline Why Energy versus Accurate Computing? An example of Energy versus Accuracy Tradeoff Challenge to the Accepted Tradeoff Model A Thought Experiment Two Real Experiments Potential exploits of the new hypothesis Power savings in large data-centers Predicting early failures in microprocessors

3 Why Energy vs. Accuracy? Manufacturing Process Variations unavoidable Gate Oxide thickness (T OX ) Random Doping Fluctuations (RDF) Device geometry, Lithography (W, L) Resulting Transistor Threshold Variations ( σ V T ) Designing for Worst Case for error-free computing Wastes Energy, Performance and Yield Designing for Typical Case may cause errors Tolerating occasional errors saves Energy, gives higher Performance and improves Yield In Future technologies, Errors are inevitable

4 Within-Die Variation Example “Teraflop Research Chip” Courtesy Intel, ISSCC Identical Cores But different Fmax Within a single chip

5 Designing with a Perfect Process S 63 S 62 S 61 S 60 S3S3 S2S2 S1S1 S0S0 15ps60ps45ps30ps 960ps Carry Propagations Slack time/ Guard band/ Safety margin 960ps 40ps A 1GHz 64-bit Adder 1000ps A 0 B 0 A 1 B 1 A 63 B 63 Full Adder S C 15ps

6 Modeling Process Variations Measured Data from Fabrication Analytical Process Models Die-to-Die and Within-Die Variations Full Adder S C 15ps±2ps   15ps 20ps 10ps

7 Worst Case Design S 63 S 62 S 61 S 60 S3S3 S2S2 S1S1 S0S0 ≈15ps≈60ps≈45ps≈30ps 960 ± 128ps Carry Propagations 830ps A 64-bit Adder with Process Variations 1090ps 260ps Increase Voltage to meet 1ns cycle time. Costs More Energy. Increase Cycle Time to 1.1ns. Costs in Performance. Throw away the Slow Chips! Loss in Yield and Revenue A 0 B 0 A 1 B 1 A 63 B 63 Full Adder S C 15 ± 2ps

8 Saving Energy and Performance S 63 S 62 S 61 S 60 S3S3 S2S2 S1S1 S0S0 ≈15ps≈60ps≈45ps≈30ps ≈960ps Carry Propagations ≈960ps A 64-bit Adder 1000ps Errors in most significant bits on some computations If we can live with these errors, we case save Energy, Performance and Yield. 830ps 1090ps A 0 B 0 A 1 B 1 A 63 B 63

9 Energy Savings with Errors Register ALU Register Clock 1 GHz 960 ps ±130 ps ALU Outputs 0ps 960ps 1 2  Errors Per Min/Sec  Voltage Reduction 830ps 1090ps

10 Basics of Power and Energy Energy to Switch a Logic Gate Output ½ CV 2 Joules  C is capacitance as seen by the Gate Output  V is the Supply Voltage Energy spent by the chip during one Clock Cycle kV 2 Joules  k is a constant reflecting average chip Activity and Total Gate Capacitance Power for the chip running at F cycles/sec kV 2 F Joules/sec or Watts

11 Power or Energy? Common Pitfalls Energy = kV 2 Joules per Clock Cycle Power = kV 2 F Joules/sec A Task that takes N Cycles to compute, consumes kV 2 N Joules Independent of Frequency!! Some Observations Modern Chips also have Static or DC power, S. Power = S + kV 2 F Joules/sec N-cycle Task, completes in N/F sec and consumes (S + kV 2 F)*(N/F) = SN/F + kV 2 N Joules Lowering Power by Frequency increases Energy!

12 Energy Savings with Errors Register ALU Register Clock 1 GHz 960 ps ±130 ps ALU Outputs 0ps 960ps 1 2  Errors Per Min/Sec  Voltage Reduction 830ps 1090ps

13 Protecting against Errors If the error rate from added delays remains relatively small, we can utilize some of the established techniques Circuit Level – iROC, Razor, etc. Error Coding – Parity codes, Arithmetic codes, Residue codes, Parity prediction, Algorithm-based fault-tolerance, TMR etc. Time Redundancy – RESO What if the error rate is massive? Are massive errors possible in a good chip?

14 Signal Propagations Combinational Logic Present State Next State FFs clock ps Calculate from timing analysis How many flip-flops on critical-paths? Threshold

15 Consider a 1-Ghz chip with a million flip-flops Let us divide the 1ns Clock period in to 1000 bins Put a FF in bin p if the longest path at its input has a delay of p picoseconds How many FFs are in bins 900ps to 950ps? Path Delay in picoseconds → How many flip-flops on critical-paths?

16 Consider a 1-Ghz chip with a million flip-flops Let us divide the 1ns Clock period in to 1000 bins Put a FF in bin p if the longest path at its input has a delay of p picoseconds How many FFs are in bins 900ps to 950ps? Path Delay in picoseconds → How many flip-flops on critical-paths?

17 Signal Propagations Combinational Logic Present State Next State FFs 1M FFs clock ps 200,000? 100,000? 50,000? How many flip-flops on critical-paths?

18 A Thought Experiment Let us conservatively assume 100,000 ffs are on critical paths (10% of total) Now Reduce the supply Voltage, thus increasing Delay Assume just 10% of critical ffs get its inputs late this cycle This implies 10,000 flip-flops produce errors in a single clock cycle! Massive number of errors result in a few clock cycles

19 Do your own Analysis Analytical Procedure Estimate number (n) of FFs on critical paths from timing analysis or synthesis report - Compute percent (p) of FFs that switch per cycle from functional simulation p X n = errors/cycle For example: Total Number of Flip-Flops: 400,000 5% of these are on critical paths: n = 20,000 FFs Only 5% of these switch every cycle: p = 5% Error Rate = p X n = 1000 errors/cycle In 10 consecutive clock cycles: 10,000 errors!

20 A new hypothesis  Errors/ Cycle NEW (large chips) 1 2 OLD (small chips)  Errors Per Min/sec Zero errors Massive errors Reduce Supply Voltage 

21 Hypothesis of Critical Operation Point In large CMOS circuits there exists a Critical Voltage V C for a fixed Frequency F and ambient temperature T, such that Any voltage below V C, massive errors result Any voltage above V C, no delay related errors occur No tradeoff is possible between Energy and Errors in large chips No practical ways exist to handle massive errors! No error handling needed above V C

22 Experiments to disprove the hypothesis Subject a large chip to slowly decreasing supply voltage, keeping the Frequency fixed. At each step, exercise the chip extensively and monitor continuously for any errors

23 Experiments on Microprocessors PowerPC V, 233MHz Voltage reduced in steps of 10 mV A C-program monitored the errors while stressing the processor Pentium-M (1.308V, 2GHz) down to (0.700V, 0.798GHz) For each of the 9 different frequencies, Voltage reduced in steps of 16 mV Third-party program to keep the cpu 100% busy and report errors (more like a power virus!)

24 Experiment to find which of these two?  Errors/ Cycle NEW Decrease Supply Voltage  1 2 OLD  Errors Per Min/sec Decrease Supply Voltage 

25 Experimental Set-Up for PowerPC A Single-Board-Computer with PowerPC MHz, 2.5V Power Supply A Hewlett-Packard E3631A Power Supply  Digital control in units of 10 miliVolts steps A Blow-Drier to raise the ambient temperature A Program written to stress all major functional blocks Tried to maximize execution rate (load) Tried to maximize logic switching rate Every operation was checked against known good values and instantly reported for any error

26 “Stressing” PowerPC 750 (233 MHz)

27 Results of Lowering Supply Voltage Power PC-750 μ P Nominal Supply Voltage of 2.5 V is reduced in steps of 1/100 th Volt with clock frequency constant at 233MHz No Data Error was ever Observed at user visible Registers!

28 Intel Pentium-M Processor “Speed step” technology with 9 different speeds Rated at 2GHz at core voltage of 1.308V Experiment Choose one of the nine frequencies F1, …, F9 While keeping cpu 100% busy reduce the voltage in steps of 16mV Monitor for errors No errors observed – only crashes!

29 Experiment on Pentium-M No errors! Catastrophic failure!

30 Some Remarks on Experiment Possible explanation for the observations A modern processor has a large number of flip- flops that are not user visible  e.g. Pre-fetch buffers, history tables, reservation stations, write buffers, and state controllers for everything from moving instructions and data to controlling a cache Control Logic fails simultaneously with ALU datapath Massive errors in control and data in a single cycle Instruction flow is completely disrupted. Therefore no error could be reported. Catastrophic failure!

31 Personal Remarks CMOS technology is robust now and will continue to be so for the foreseeable future Process Variation related errors if any, must be massive No industry can survive with massive failures Process variations must remain bounded within some reasonable limits Moore’s Law continues to hold! 45nm with (HiK+MG) has lower RDF and T OX variations than 65nm [Kelin J. Kuhn, Reducing Variation in Advanced Logic Technologies: Approaches to Process and Design for Manufacturability of Nanoscale CMOS, IEDM 2007.]

32 Exploiting Process Variations If the “critical operation point hypothesis” holds Above critical supply voltage V C no errors occur. In large data-centers with 1000’s of microprocessors, operating each with the lowest V C for a given frequency can save lots of power As the number of cores approach 100 or more, it would be imperative to use different voltage- frequency pair (F C, V C ) for each core on the same die.

33 Dynamic Power Savings in Pentium-M

34 Energy Saving Ideas for Microprocessors Simplify CPU Remove Floating Point Unit Remove Divider Remove Multiplier Reduce Word Width  From 32 bits to 16 or 8 bits Simplify Memory Hierarchy Remove Virtual Addresses Manage Memory Hierarchy explicitly

35 Accuracy vs. Quality Energy and Quality of Solution Tradeoff All Iterative Algorithms that improve the Quality of Solution with more iterations (more Energy)  Simulated Annealing, Genetic, Steepest Decent All Algorithms that use Grid of various dimensions, improve the Quality of Solution with finer Grid, or more Resolution (more Energy)  Image Processing, Weather Modeling, Device Physics Simulation Signal Processing Signal/Noise Ratio, Less Power, More Errors

36 Final Observations For Large Chips and Data-centers There is no possible Trade-off between Energy and Inaccurate Computing Must avoid confusion between Energy and Power Frequency can not affect the Energy of a Task  Lower Frequency implies Higher Energy! Critical Voltage is easy to determine for a chip Aging of Chips can be easily detected All future researchers must keep in mind the large number of flips-flops on Critical Paths!

37 Questions? Comments?

38

39

40