Presented by Euiwoong Lee Accelerators/Specialization/ Emerging Architectures

Papers Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA 2014. St. Amant et al., "General-Purpose Code Acceleration with Limited-Precision Analog Computation," ISCA 2014. Madhavan, Sherwood, and Strukov, "Race Logic: A Hardware Acceleration for Dynamic Programming Algorithms," ISCA 2014.

Motivation The good days are over. "Iron Triangle" (from St. Amant et al.): Performance, Efficiency, Generality. We can choose any two at the expense of the third.

Application Specific Designs It is wasteful to run very different programs on the same general-purpose processor. One extreme: CPU. The other extreme: ASIC (application-specific integrated circuit). In between? Beyond the extremes?

Application Specific Designs GPU. FPGA: "A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence 'field-programmable'." [Wikipedia]

Another dimension How to represent numbers? Currently we use digital representations, even for real numbers. Analog? Many measurable quantities to choose from. How do we add, subtract, or apply complicated functions to them?

Spectrum Putnam et al. (general purpose?) St. Amant et al. (general purpose??) Madhavan et al. (specific) Image from Esmaeilzadeh et al. MICRO 2012

Papers Putnam et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, in ISCA 2014A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services St. Amant et al. General-purpose code acceleration with limited-precision analog computation, in ISCA αGeneral-purpose code acceleration with limited-precision analog computation Madhavan, Sherwood, Strukov. Race logic: a hardware acceleration for dynamic programming algorithms, in ISCA 2014Race logic: a hardware acceleration for dynamic programming algorithms

FPGA

FPGA Main challenge: the need to fit the accelerated function into the available reconfigurable area. Current reconfiguration times for standard FPGAs are too slow to make this approach practical. Multiple FPGAs provide scalable area, but cost more, consume more power, and are wasteful when unneeded. Using a single small FPGA per server restricts the workloads that may be accelerated, and may make the associated gains too small to justify the cost.

Large-Scale Datacenter Services [Putnam et al. 14] – 23 authors! A large-scale datacenter reduces the variance of load. While reliability is important, the scale of the datacenter permits sufficient redundancy that a small rate of faults and failures is tolerable.

Large-Scale Datacenter Services [Putnam et al. 14] Specialization of individual servers has issues: it loses homogeneity, and datacenter services evolve extremely rapidly, making non-programmable hardware features impractical.

Implementation Attach one FPGA to each server, and connect 48 servers as a 6×8 torus.

Experiments Attach one FPGA to each server, and connect 48 servers as a 6×8 torus. Replicate this 34 times, for 1,632 servers in total. Actually ran the Bing web search engine: improved the throughput of each server by 95%.
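The 6×8 torus above can be sketched in a few lines; the neighbor function below is purely illustrative (the paper's actual routing and link details are not reproduced here):

```python
def torus_neighbors(row, col, rows=6, cols=8):
    """The four neighbors of FPGA (row, col) on a rows x cols torus.
    Indices wrap around at the edges, so every node has degree 4."""
    return [
        ((row - 1) % rows, col),  # up
        ((row + 1) % rows, col),  # down
        (row, (col - 1) % cols),  # left
        (row, (col + 1) % cols),  # right
    ]

# A corner node wraps around to the opposite edges:
print(torus_neighbors(0, 0))  # [(5, 0), (1, 0), (0, 7), (0, 1)]
```

The wraparound links are what distinguish a torus from a plain grid: every FPGA has exactly four neighbors, so the routing logic has no edge cases.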

Papers Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA 2014. St. Amant et al., "General-Purpose Code Acceleration with Limited-Precision Analog Computation," ISCA 2014. Madhavan, Sherwood, and Strukov, "Race Logic: A Hardware Acceleration for Dynamic Programming Algorithms," ISCA 2014.

Motivation "Iron Triangle" again: Performance, Efficiency, Generality. Is there another component whose sacrifice can possibly improve all three? Precision.

Neural Processing Unit [Esmaeilzadeh et al. 12] Tolerance to approximation is a program characteristic that is growing increasingly important. Key idea: learn how a region of approximable code behaves and replace the original code with an efficient computation of the learned model.

Neural Processing Unit [Esmaeilzadeh et al. 12] The programmer marks "approximable" code. (1) Code observation: collects data. (2) Training: decides the topology of the neural network and its weights. (3) Code generation: generates a configuration for the NPU that implements the trained neural network, and replaces each call.

Mixed-signal implementation for NPU [St. Amant et al. 14] Numbers are represented by "analog" quantities: currents, voltages, resistances. Operations: addition via Kirchhoff's current law (I = I1 + I2); multiplication via Ohm's law (V = I * R); even non-linear functions are possible (transistors in saturation mode).
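A toy numerical model of those primitives (the component values and the tanh saturation are my assumptions, purely for illustration): treating inputs as voltages and weights as conductances 1/R, Ohm's law gives per-input currents and Kirchhoff's law sums them on a shared wire.

```python
import math

def analog_neuron(voltages, resistances):
    """Model of a mixed-signal weighted sum: each input contributes
    I = V / R (Ohm's law), the currents add on a shared node
    (Kirchhoff's current law), and a saturating device provides the
    nonlinearity (modeled here as tanh)."""
    total_current = sum(v / r for v, r in zip(voltages, resistances))
    return math.tanh(total_current)

# 0.5 V through 1 kOhm plus 0.25 V through 500 Ohm -> 1 mA total current.
print(analog_neuron([0.5, 0.25], [1000.0, 500.0]))  # tanh(0.001), ~0.001
```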

Issues for analog computation (1) Error (2) The amount of information (3) Good for only specific operations (4) Determining where the D/A boundaries lie (5) How to store?

Issues for analog computation (3) Good for only specific operations (4) Determining where the D/A boundaries lie (5) How to store? Their solution: the D/A interface is located at the single-neuron level.

NPU with analog computation (1) Error Errors are inherent, but the NPU is built for approximation anyway. Let the compiler do the "hard work" of estimating / limiting error once, at compile time.

Deciding the range of values (2) Amount of information: in theory, analog can represent arbitrarily many real values? Larger values → larger voltages and currents → more energy. Finer scale → more susceptible to noise. Their final answer: 8 bits.
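The 8-bit decision amounts to uniform quantization; the sketch below (the [-1, 1] input range is an assumed normalization, not a value from the paper) shows how 8 bits bounds the representation error:

```python
def quantize(x, lo=-1.0, hi=1.0, bits=8):
    """Clamp x to [lo, hi] and snap it to the nearest of 2**bits levels,
    mimicking a D/A conversion of a normalized signal."""
    levels = 2 ** bits - 1                       # 255 steps for 8 bits
    x = min(max(x, lo), hi)
    code = round((x - lo) / (hi - lo) * levels)  # integer code in 0..255
    return lo + code * (hi - lo) / levels

# Worst-case error is half a step: (hi - lo) / (2 * 255), about 0.004.
print(abs(quantize(0.1234) - 0.1234))  # < 0.004
```

More bits would shrink the step size but demand finer voltage resolution, which is exactly the noise-susceptibility tradeoff on this slide.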

Deciding the topology of the network Larger degree → more parallelism, but a similar problem as before (e.g., more current → more energy). Their decision: again, maximum number of inputs = 8.

One neuron

NPU

Papers Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA 2014. St. Amant et al., "General-Purpose Code Acceleration with Limited-Precision Analog Computation," ISCA 2014. Madhavan, Sherwood, and Strukov, "Race Logic: A Hardware Acceleration for Dynamic Programming Algorithms," ISCA 2014.

Beyond electricity "Real physics" or "real chemistry": exotic, fully customized systems exploiting novel physics and based on nontraditional technologies. The D-Wave computer utilizes quantum annealing phenomena to solve optimization problems. Reaction-diffusion systems made up of 2D chemical substrates can be used to solve 2D Voronoi diagram problems.

One natural(?) way to represent numbers: Time. Use the well-studied problem domain of sequence alignment to test the potential of this new logic.

Similarity between sequences Given two strings A and B, how many edits (insertions, deletions, substitutions) do we need to transform A into B? Example: s1 = ACGTGCA, s2 = CCTGCAA. 3 edits (A -> C, delete G, insert A) are enough.

Similarity between sequences Generalization: each operation has a different score, and even a "match" has a nonzero score. We can maximize or minimize the score (in maximization, insertion / deletion have lower scores than match). In the following example, score for match = insertion = deletion = 1, substitution = 2 (and we minimize).
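As a sanity check of this scoring scheme, the standard dynamic program can be written out directly (a plain-Python sketch, not code from the papers):

```python
def align_score(s1, s2, match=1, sub=2, indel=1):
    """Minimum alignment cost where even a match has a nonzero score,
    as in the slides' example (match = ins = del = 1, sub = 2)."""
    rows, cols = len(s2), len(s1)
    dp = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(1, cols + 1):
        dp[0][j] = j * indel
    for i in range(1, rows + 1):
        dp[i][0] = i * indel
        for j in range(1, cols + 1):
            diag = match if s2[i - 1] == s1[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j] + indel,     # deletion
                           dp[i][j - 1] + indel,     # insertion
                           dp[i - 1][j - 1] + diag)  # match / substitution
    return dp[rows][cols]

print(align_score("ACGTGCA", "CCTGCAA"))  # 9
```

The result 9 is consistent with the earlier 3-edit example: one substitution (2), one deletion (1), one insertion (1), and each of the five matched positions also pays its unit match cost.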

Dynamic Programming
(The original slides animate the table filling cell by cell; the last frame, at time 5, is reproduced below. Rows: s2 = CCTGCAA; columns: s1 = ACGTGCA.)

        A  C  G  T  G  C  A
    C   2  2  3  4  5
    C   3  3  4  5
    T   4  4  5  5
    G   5  5  5
    C
    A
    A

Race Logic Utilizes a new data representation to accelerate a broad class of optimization problems, such as those solved by dynamic programming algorithms. A score δ is represented as a delay time (synchronized). The core idea of Race Logic is to use race conditions set up in a circuit to perform useful computation. Minimization is performed by an OR gate (each cell starts to work when the earliest signal arrives).

Race Logic
(The original slides animate one cell of the grid: a 1 arrives at time t, and copies of it leave for the neighboring cells after delays δ deletion, δ insertion, and δ sub.)

Race Logic Cell (0,0) sends a 1. As soon as it receives a 1, each cell propagates a 1 in three directions after the corresponding delays. The best score is simply the time at which the last (bottom-right) cell receives a 1! For maximization, we can use AND instead of OR (a cell starts to work only after receiving a 1 from all incoming directions).
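The race itself can be mimicked with a tiny discrete-event simulation (my own sketch, not the paper's hardware): pulses propagate through the delay grid, and each cell's firing time is the arrival time of the first pulse to reach it, which is exactly the minimizing DP value.

```python
import heapq

def race_logic_time(s1, s2, d_match=1, d_sub=2, d_indel=1):
    """Cell (0, 0) fires at time 0; a firing cell schedules pulses to its
    right, down, and diagonal neighbors after the corresponding delays.
    An OR gate fires on the FIRST arriving pulse, so later pulses lose
    the race, and the bottom-right cell's firing time is the best score."""
    rows, cols = len(s2), len(s1)
    fired = {}
    events = [(0, 0, 0)]                 # (time, i, j)
    while events:
        time, i, j = heapq.heappop(events)
        if (i, j) in fired:
            continue                     # a faster pulse already won
        fired[(i, j)] = time
        if i < rows and j < cols:        # diagonal: match or substitution
            diag = d_match if s2[i] == s1[j] else d_sub
            heapq.heappush(events, (time + diag, i + 1, j + 1))
        if i < rows:                     # vertical: deletion
            heapq.heappush(events, (time + d_indel, i + 1, j))
        if j < cols:                     # horizontal: insertion
            heapq.heappush(events, (time + d_indel, i, j + 1))
    return fired[(rows, cols)]

print(race_logic_time("ACGTGCA", "CCTGCAA"))  # 9
```

This is Dijkstra's algorithm in disguise: "time" plays the role of path cost, which is why racing pulses through delay elements computes the same answer as the DP table.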

Performance Area: scales quadratically with N. Latency: scales linearly with N (assuming all scores are small). Energy (if the whole array is clocked every cycle): scales cubically with N.

Idea to Save Energy At time 5, two regions of the table do not need to be clocked: cells the wavefront has already passed, and cells it has not yet reached.

Idea to Save Energy Activate the clock for each region only when needed.

Idea to Save Energy Fine granularity: a large number of multi-cell regions that require clocking every cycle. Coarse granularity: one multi-cell region stays clocked for a very long time.

Results Compared to the Lipton and Lopresti (1985) systolic array, whose area scales only linearly: Race Logic is still 4× faster, with 3× higher throughput per area for sequence matching and 5× lower power density.

One weakness What if scores are large integers, or even real numbers? Convert them to (approximately) equivalent small integers.
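One hedged sketch of that conversion (the scaling policy here is an assumption for illustration, not taken from the paper): divide every score by a common factor and round to small positive integer delays.

```python
def rescale_scores(scores, target_max=8):
    """Map arbitrary positive scores to integer delays in 1..target_max,
    preserving their ratios approximately."""
    factor = max(scores) / target_max
    return [max(1, round(s / factor)) for s in scores]

print(rescale_scores([1.7, 42.0, 250.0, 900.0]))  # [1, 1, 2, 8]
```

Shorter delays mean lower latency and energy, at the cost of distorting score ratios, which is the approximation the slide mentions.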

Conclusion Efforts to run each program on the "right" chip, at many levels. FPGA vs. ASIC. How large a portion of the program should be specially accelerated? Precision becomes another dimension. How to represent data: natural / exotic operations based on science.