Safe Overclocking of Tightly Coupled CGRAs and Processor Arrays using Razor

Alex Brant, Ameer Abdelhadi, Douglas Sim, Shao Lin Tang, Michael Yue, Guy Lemieux
The University of British Columbia
© 2012 Guy Lemieux

Overclocking… is it safe?
Clock frequency is determined by two things:
– CAD timing analysis (timing margins)
– Speed binning of actual wafers + chips (variation)
Can you go faster?
– Yes, if your chips are fast
– Yes, if your data is not “worst-case”, e.g. carry propagation
– Yes, if you do not want “safe” timing-margin guardbands

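Overclocking… is it safe?
[Figure: overclocking examples with clock-frequency and voltage readouts, e.g. 3.8 GHz; the remaining labels are not recoverable]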

Overclocking… is it safe?
How fast is too fast?
– Blows up
– Fails to POST
– Fails to boot
– Blue screen of death
– Random crashes
– Data errors in documents and spreadsheets
When these problems go away, is it safe?

Overclocking… is it safe?
Root cause: timing errors
– Problem 1: can we detect them? Yes, e.g. using Razor
– Problem 2: can we correct them? Yes, using Razor with feed-forward pipelines
   – Pipeline must be ‘replayed’, input data ‘unfetched’
Not possible with general sequential logic
– Need ‘spare’ cycles to ‘unfetch’ input data
– Cyclic dependencies make this difficult

Overclocking… is it safe?
If not for general logic, what about…
– Traditional CPU pipelines? Yes: feed-forward, correctable by Razor replay
– Multi-core CPUs? Yes: each CPU is a traditional pipeline, and the CPUs are loosely coupled
   – Other CPUs tolerate race conditions

Overclocking… is it safe?
If not for general logic, what about…
– Ambric-style processor arrays? Yes: like multi-core CPUs, they are loosely coupled
   – Neighbour CPUs tolerate uncertainty of arrival time
– Tightly coupled processor arrays or CGRAs? No: neighbour CPUs cannot tolerate delays!

Main Contribution
Extends Razor error correction to…
– tightly coupled processor arrays, CGRAs
– time-multiplexed FPGAs/CGRAs
Tightly coupled means…
– Pre-scheduled communication: data must be present during cycle X
– No “data presence” indicators / handshakes

Razor FF
[Figure: Razor flip-flop]

Razor FF in Pipeline
[Figure: pipeline built from Razor flip-flops; clock labels: clk, clk + delay]
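The detection idea behind the Razor flip-flop can be sketched as a small behavioural model in Python. This is a simplified illustration, not the circuit from the slides: the function name `razor_sample`, the 10 ps shadow delay, and the arrival times are assumptions, and setup/hold and metastability effects are ignored. The main flip-flop samples at `clk`, the shadow element samples at `clk + delay`, and a mismatch flags a timing error.

```python
def razor_sample(arrival_ps, clock_ps, shadow_delay_ps, old_value, new_value):
    """Simplified Razor flip-flop model (illustrative assumption).

    arrival_ps      : time at which the new data value becomes stable
    clock_ps        : main clock edge (overclocked period)
    shadow_delay_ps : extra delay of the shadow clock (clk + delay)
    Returns (main, shadow, error_detected).
    """
    # The main flip-flop captures the new value only if it arrived in time.
    main = new_value if arrival_ps <= clock_ps else old_value
    # The shadow element samples later, so it tolerates a slower path.
    shadow = new_value if arrival_ps <= clock_ps + shadow_delay_ps else old_value
    # A mismatch means the main flip-flop latched stale data.
    return main, shadow, main != shadow


print(razor_sample(15, 17, 10, 0, 1))  # (1, 1, False): data met the 17 ps clock
print(razor_sample(20, 17, 10, 0, 1))  # (0, 1, True): late data, error detected
print(razor_sample(30, 17, 10, 0, 1))  # (0, 0, False): later than clk + delay, undetected
```

The last case shows why, in this model, the shadow delay has to cover the slowest path: data arriving after `clk + delay` is missed by both samplers.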

How can we overclock?
[Figure: pipeline timing diagram] = 27 ps clock

How can we overclock?
[Figure: pipeline timing diagram, repeated over several animation frames] = 17 ps clock, most of the time
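A short calculation shows how much stalling a 17 ps clock can absorb before it stops beating the 27 ps clock. The sketch below assumes that each stall wastes exactly one cycle and that useful throughput is the clock rate times the fraction of non-stalled cycles; both the function names and that cost model are assumptions for illustration.

```python
def useful_throughput(period_ps, stall_fraction):
    """Useful results per picosecond when a fraction of cycles is spent stalling."""
    return (1.0 - stall_fraction) / period_ps


def break_even_stall_fraction(safe_period_ps, fast_period_ps):
    """Stall fraction at which the fast clock only matches the safe clock."""
    return 1.0 - fast_period_ps / safe_period_ps


limit = break_even_stall_fraction(27, 17)
print(f"break-even stall fraction: {limit:.3f}")
print(useful_throughput(17, 0.02) > useful_throughput(27, 0.0))
```

Under these assumptions, the 17 ps clock wins as long as fewer than about 37% of cycles are stalls.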

Array Architecture
Tightly coupled communication:
– can be FIFO-based or ‘mailbox’-based
Fully bypassed:
– write on cycle X
– read on cycle X+1, i.e. address provided on cycle X

Add “Razor” to Block RAM
[Figure]

Processor Error Detection (for East direction only)
– Processor memory error: causes stall
– Incoming stall (from N, S, W): produces outgoing E stall
– Early warning signal: prevents incoming data

Processor Error Detection (all four directions)
[Figure]
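The stall rule for the East direction, generalised to all four directions, can be written as a small Boolean function. This is a sketch of the logic as described on the slide, not the actual circuit: the names `DIRECTIONS` and `outgoing_stalls` are assumptions, and the early warning signal is not modelled.

```python
DIRECTIONS = ("N", "E", "S", "W")


def outgoing_stalls(memory_error, incoming):
    """Compute the outgoing stall signal for each direction.

    memory_error : True if this processor detected a memory timing error
    incoming     : dict mapping direction -> True if a stall arrives from there
    A stall is sent towards direction d when the processor has a local error
    or when a stall arrived from any of the other three directions.
    """
    out = {}
    for d in DIRECTIONS:
        from_others = any(incoming[o] for o in DIRECTIONS if o != d)
        out[d] = memory_error or from_others
    return out


quiet = {"N": False, "E": False, "S": False, "W": False}
print(outgoing_stalls(True, quiet))                 # local error: stall in all directions
print(outgoing_stalls(False, {**quiet, "W": True})) # stall from W is forwarded to N, E, S
```

Note that a stall arriving from the West is not echoed back to the West in this sketch, which matches the slide's East-only rule of combining N, S, and W.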

Writes entering error region…
t+1: A → B || B → C; t+2: B → C

Razor Stall Propagation (1D)
[Animation: stall wavefronts spreading along a 1D processor array over time; X marks a detected error]
– 3 errors detected
– 2 stalls to correct
– # stalls < # errors
Might be scalable!
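The merging of stall wavefronts can be reproduced with a toy simulation. The model below is an assumption, not the mechanism shown in the animation: each processor keeps a count of the stalls it has taken, an error forces a local stall, and a processor also stalls whenever a neighbour was ahead of it in the previous cycle, so a wavefront moves one hop per cycle. The function name and the error schedule are hypothetical.

```python
def simulate_1d(num_procs, errors, cycles):
    """Toy 1D stall-propagation model.

    errors : set of (processor_index, cycle) pairs where an error is detected
    Returns the number of stall cycles each processor has taken.
    """
    stalls = [0] * num_procs
    for t in range(cycles):
        prev = stalls[:]  # neighbours are observed with one cycle of delay
        for q in range(num_procs):
            left = prev[q - 1] if q > 0 else prev[q]
            right = prev[q + 1] if q < num_procs - 1 else prev[q]
            # Stall on a local error, or to catch up with a neighbour that is ahead.
            if (q, t) in errors or max(left, right) > prev[q]:
                stalls[q] = prev[q] + 1
    return stalls


# Three errors: the first two overlap in time, so their wavefronts merge.
errors = {(1, 0), (5, 1), (3, 6)}
print(simulate_1d(8, errors, cycles=20))  # [2, 2, 2, 2, 2, 2, 2, 2]
```

With this schedule, three errors cost every processor two stall cycles, the same ratio as the slide's example.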

Razor Stall Propagation (2D)
[Animation: stall wavefronts spreading across a 2D processor array; X marks a detected error]
– 3 errors detected
– 2 stalls to correct
– # stalls < # errors
Might be scalable!

Manual experiment: # stalls vs # errors
[Plot; axes: # errors, # stalls]

Monte Carlo Simulations…

Stalls vs Errors (N x N array)
[Plot]
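A Monte Carlo experiment of this kind might be set up as follows. This is a minimal sketch under stated assumptions, not the simulator behind the plots: errors are injected independently with a fixed per-processor, per-cycle probability, and stalls spread using a hypothetical rule in which a processor stalls on a local error or when a grid neighbour has taken more stalls than it has. All names and parameters are illustrative.

```python
import random


def step(stalls, n, errors_now):
    """Advance the N x N array by one cycle under the toy stall rule."""
    prev = [row[:] for row in stalls]
    for r in range(n):
        for c in range(n):
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            ahead = any(
                0 <= rr < n and 0 <= cc < n and prev[rr][cc] > prev[r][c]
                for rr, cc in neighbours
            )
            if (r, c) in errors_now or ahead:
                stalls[r][c] = prev[r][c] + 1


def run_trial(n, cycles, error_prob, rng):
    """Return (# errors injected, # stall cycles each processor ends up taking)."""
    stalls = [[0] * n for _ in range(n)]
    num_errors = 0
    for _ in range(cycles):
        errors_now = {
            (r, c) for r in range(n) for c in range(n) if rng.random() < error_prob
        }
        num_errors += len(errors_now)
        step(stalls, n, errors_now)
    # Let the remaining wavefronts drain until every processor agrees.
    while len({s for row in stalls for s in row}) > 1:
        step(stalls, n, set())
    return num_errors, stalls[0][0]


rng = random.Random(2012)  # fixed seed so the experiment is repeatable
for n in (2, 4, 8):
    errors, stalls = run_trial(n, cycles=500, error_prob=0.005, rng=rng)
    print(f"{n} x {n}: {errors} errors -> {stalls} stalls")
```

In this model the stall count can never exceed the error count, because the largest counter in the array only grows in a cycle where an error occurs.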

Recovered Utilization (N x N)
[Plot]

Experimental Results…

Experimental Results
Build processor array on FPGA…
– 2 x 2 array in silicon (running)
– 3 x 3 array in simulation (verification)
Static critical path always through ALU + communication channels
– Typically the multiplier, but only if used (!)
– Depends upon values being multiplied
Overclock system
– Fmax depends upon multiplier use, data values

Place & Route Results
Area analysis for processor
– 2,958 ALMs Regs (baseline)
– 3,082 ALMs Regs (with Razor)
Static timing analysis for array
– 90 MHz (baseline)
– 88 MHz (with Razor)
Overhead is very low (4% ALMs)

System under Test
[Figure]

Methodology (baseline)
Run once: circuit at low speed
– Record correct output vectors
For increasingly higher clock speeds
– Run circuit with input test vectors
– Fail on first error
Remember highest clock speed
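The baseline sweep can be sketched as a simple loop. The `toy_circuit` function below is a hypothetical stand-in for the hardware under test (it corrupts its outputs above an arbitrary 120 MHz limit), and all names and frequencies are assumptions for illustration.

```python
def highest_error_free_clock(run_circuit, test_vectors, golden, clocks_mhz):
    """Sweep clock speeds upward and return the last one with correct outputs."""
    best = None
    for clock in sorted(clocks_mhz):
        if run_circuit(test_vectors, clock) != golden:
            break  # fail on first error
        best = clock  # remember highest clock speed that passed
    return best


def toy_circuit(vectors, clock_mhz, limit_mhz=120):
    """Hypothetical device: doubles each input, but corrupts results above limit_mhz."""
    return [2 * v if clock_mhz <= limit_mhz else 2 * v + 1 for v in vectors]


vectors = [1, 2, 3]
golden = toy_circuit(vectors, clock_mhz=50)  # run once at low speed
print(highest_error_free_clock(toy_circuit, vectors, golden, range(90, 160, 5)))  # 120
```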

Results (baseline)
Benchmark | Static Timing (CAD) | Overclocked (1st error)
Random | 90 MHz | 135 MHz
Mean | 90 MHz | 121 MHz
Wang | 90 MHz | 131 MHz
PR | 90 MHz | 136 MHz
average | 90 MHz | 130.4 MHz
Processor arrays can be overclocked
– Amount depends on application + data + chip
But is it safe?
– Our “test jig” tested results offline to find errors
– Unsafe! Baseline cannot detect errors

Methodology (Razor)
Run once: circuit at low speed
– Record correct output vectors
For increasingly higher clock speeds
– For increasingly higher shadow FF delay
   – Run circuit
   – Record # errors, # corrected errors, # stalls
Remember highest throughput
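The Razor sweep adds a second loop over the shadow flip-flop delay and ranks runs by throughput rather than by clock speed. The sketch below is an assumption-laden illustration: `toy_run` is a made-up error model, throughput is taken as clock rate times the fraction of non-stalled cycles, and runs with uncorrected errors are discarded.

```python
def best_razor_setting(run_circuit, clocks_mhz, shadow_delays_ns):
    """Return (throughput, clock, shadow delay) of the best fully corrected run."""
    best = None
    for clock in clocks_mhz:
        for delay in shadow_delays_ns:
            r = run_circuit(clock, delay)
            if r["corrected"] < r["errors"]:
                continue  # some errors escaped correction: not a safe setting
            throughput = clock * (1 - r["stalls"] / r["cycles"])
            if best is None or throughput > best[0]:
                best = (throughput, clock, delay)
    return best


def toy_run(clock_mhz, shadow_delay_ns, cycles=10_000):
    """Hypothetical device model: errors grow quadratically above 100 MHz."""
    errors = min(max(0, clock_mhz - 100) ** 2, cycles)
    # Assume a shadow delay below 2 ns is too short to catch every late signal.
    corrected = errors if shadow_delay_ns >= 2 else errors // 2
    # Assume the worst case of one stall per corrected error.
    return {"errors": errors, "corrected": corrected, "stalls": corrected, "cycles": cycles}


throughput, clock, delay = best_razor_setting(toy_run, range(100, 190, 10), (1, 2, 3))
print(f"best: {clock} MHz with {delay} ns shadow delay, {throughput:.1f} MHz effective")
```

In the toy model, throughput peaks at 130 MHz and falls beyond it, which is the point at which the sweep stops being worth extending.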

Results (Razor)
Benchmark | Static Timing (CAD) | Overclocked (runs past 1st error) | Stall Rate
Random | 88 MHz | 163 MHz | 5.0%
Mean | 88 MHz | 144 MHz | 1.3%
Wang | 88 MHz | 147 MHz | 0.7%
PR | 88 MHz | 145 MHz | 1.7%
average | 88 MHz | 149.4 MHz | 2.0%
Processor arrays can be overclocked
– Even higher rates past 1st error
– Errors require stalls to correct, which lowers throughput
– Stop increasing Fmax after throughput peaks

Results (new Razor)
Benchmark | Static Timing (CAD) | Overclocked (runs past 1st error) | Stall Rate
Random | 88 MHz | 163 MHz | 5.0%
Mean | 88 MHz | 144 MHz | 1.3%
Wang | 88 MHz | 147 MHz | 0.7%
PR | 88 MHz | 145 MHz | 1.7%
average | 88 MHz | 149.4 MHz | 2.0%
But is it safe?
– Safe! Razor detects and corrects errors
– Our “test jig” tested results offline to verify that the errors were corrected

Results (Comparison)
Benchmark | Baseline STA (safe) | Razor-Corrected Effective Throughput | Speedup
Random | 90 MHz | 155 MHz | 1.72x
Mean | 90 MHz | 142 MHz | 1.58x
Wang | 90 MHz | 146 MHz | 1.62x
PR | 90 MHz | 143 MHz | 1.59x
average | 90 MHz | 146.5 MHz | 1.63x
Processor arrays can be safely overclocked
– 63% higher throughput
Low area cost (+4% ALMs)
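The effective-throughput column appears to follow from the overclocked frequency and the stall rate reported for the Razor runs. The sketch below assumes the relation effective throughput = Fmax × (1 − stall rate); the slides do not state this formula, but it reproduces the Razor-corrected column to the nearest MHz. The dictionary and function names are illustrative.

```python
# Overclocked frequency (MHz) and stall rate per benchmark, from the Razor results.
RAZOR_RESULTS = {
    "Random": (163, 0.050),
    "Mean": (144, 0.013),
    "Wang": (147, 0.007),
    "PR": (145, 0.017),
}
BASELINE_MHZ = 90  # safe static-timing clock of the baseline design


def effective_throughput_mhz(fmax_mhz, stall_rate):
    """Assumed model: stalled cycles do no useful work."""
    return fmax_mhz * (1.0 - stall_rate)


for name, (fmax, stall_rate) in RAZOR_RESULTS.items():
    eff = effective_throughput_mhz(fmax, stall_rate)
    print(f"{name}: {eff:.1f} MHz effective, {eff / BASELINE_MHZ:.2f}x vs baseline")
```

Speedups computed this way can differ from the table in the last digit, because the table divides the rounded throughput by 90 MHz.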


Observations / Notes
Time-multiplexed CGRAs/FPGAs can also benefit
– Just reserve 1-2 clock cycles in the time-mux schedule for error recovery
Loosely coupled processor arrays can be overclocked locally
– Just add Razor to each processor
– No need to propagate stall signals; this is done automatically through data presence indicators

Summary / Conclusions
Processor arrays can be safely overclocked
– Even with very tightly scheduled communication
Processor arrays are scalable
– Errors produce stall wavefronts
– Several wavefronts merge into a single stall cycle
Throughput increased 63% on average
– Speedup depends upon benchmark, data values