The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,

Presentation transcript:

The Raw Architecture: Signal Processing on a Scalable Composable Computation Fabric
David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman, Henry Hoffmann, Arvind Saraf, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal
MIT Laboratory for Computer Science

Outline
  Motivation
  Architecture
  Raw Prototype
  Networks
  Signal Processing Applications
  Status

Wire Delay and Tiled Architectures
  Problem: The number of gates we can reach in one cycle is staying constant, but our chips are getting bigger.
  Solutions:
    1. Hide wire-delay latency in the micro-architecture (clustering, hidden communication stalls).
    2. Expose the communication at the instruction-set level and allow the software to exploit locality.
  Fact 1: The number of transistors is growing.
  Fact 2: Wires are not getting proportionally faster.

Wire Delay and Tiled Architectures
  2. Expose the communication at the instruction-set level and allow the software to exploit locality.
  Make a tile as big as you can go in one clock cycle, and expose longer communication to the programmer.

What Are We Building? The Raw Prototype
  16 replicated tiles (processors)
  What is in a tile?
    8-stage pipelined, MIPS-like 32-bit processor
    Pipelined floating-point unit
    32 KB data cache
    32 KB instruction memory
    Interconnect routers

Raw’s Networking Resources
  2 dynamic networks
    Fire and forget
    Header encodes destination
    2-stage router pipeline
  2 static networks
    Software-configurable crossbar
    Interlocked and flow controlled
    5-stage static router pipeline
    3-cycle nearest-neighbor ALU-to-ALU communication latency
    No header overhead, but requires knowledge of communication patterns at compile time

Memory-Mapped Communication Is Not a First-Class Citizen
  [Figure: tile processor pipeline stages]
  Communication goes to other tiles through a memory system that happens to go over a network.

Raw’s First-Class Register-Mapped Communication
  [Figure: tile processor pipeline stages, with registers r24, r25, r26, and r27 mapped to the network input FIFOs and to the network output FIFOs]
  Ex: add r26, r25, r24
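The register-mapped interface on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the Raw hardware or its ISA: the `Tile` class, its method names, and the choice to raise an error on an empty FIFO (real hardware would presumably stall) are all assumptions. Only the register numbers r24 to r27 and the example instruction come from the slide.

```python
from collections import deque

# Register numbers that alias the network FIFOs (r24-r27, per the slide).
NETWORK_REGS = {24, 25, 26, 27}


class Tile:
    """Hypothetical model of one tile with register-mapped network ports."""

    def __init__(self):
        self.regs = [0] * 32
        self.net_in = {r: deque() for r in NETWORK_REGS}
        self.net_out = {r: deque() for r in NETWORK_REGS}

    def read(self, r):
        # Reading a network register pops the next word from its input FIFO.
        if r in NETWORK_REGS:
            if not self.net_in[r]:
                # Assumption: this model reports an error instead of stalling.
                raise RuntimeError(f"input FIFO for r{r} is empty")
            return self.net_in[r].popleft()
        return self.regs[r]

    def write(self, r, value):
        # Writing a network register pushes the word onto its output FIFO.
        if r in NETWORK_REGS:
            self.net_out[r].append(value)
        else:
            self.regs[r] = value

    def add(self, rd, rs, rt):
        # One instruction: receive two operands, compute, and send the result.
        self.write(rd, self.read(rs) + self.read(rt))


tile = Tile()
tile.net_in[25].append(3)   # word arriving on the port mapped to r25
tile.net_in[24].append(4)   # word arriving on the port mapped to r24
tile.add(26, 25, 24)        # models: add r26, r25, r24
result = tile.net_out[26].popleft()
```

In this model a single `add` both consumes the two incoming words and emits the sum, with no separate load or store instructions.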

Signal Processing Applications
  Problem: Increase the performance of signal processing in a scalable fashion.
  Solution: Exploit parallelism in signal processing applications at all levels.

Types of Parallelism in Signal Processing
  DSP filter style
  Fine-grain dataflow
  Instruction-level parallelism
  Data parallel
  Thread-level parallelism (MPI)
  [Figure labels: Current Architectures, Raw]

Instruction Level Parallelism
  RawCC maps dataflow graphs across tiles
  ILP across a multiprocessor
  Heavily latency sensitive
  Single-cycle reconfigurable communication

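Fine Grain Dataflow
  Ex: Pipelined FIR filter
  [Figure: FIR filter pipeline with inputs x_n, x_n-1, x_n-3, weights W0, W1, W2, W3, and a summation node]
  Computation: mul, add
  Input operands: x_i, Σ_l
  Output operands: Σ_k

  Cycle count by interface class:
    Class          First   Second
    Compute        2       2
    Communicate    0       3
    Overall        2       5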
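As a rough illustration of the pipelined FIR example, the following Python sketch models each tap as one multiply and one add, with a partial sum handed from tap to tap. It assumes one weight per tile and runs the taps sequentially, so it shows only the dataflow, not the timing or the parallel execution on Raw. The function names and the sample data are hypothetical.

```python
def tap(weight, sample, partial_in):
    """Work done at one tap per sample: a multiply followed by an add."""
    product = weight * sample       # mul
    return partial_in + product     # add


def pipelined_fir(samples, weights):
    """Chain the taps so that each one extends the partial sum it receives."""
    outputs = []
    for n in range(len(samples)):
        partial = 0
        for k, weight in enumerate(weights):
            # Samples before the start of the stream are treated as zero.
            sample = samples[n - k] if n - k >= 0 else 0
            partial = tap(weight, sample, partial)
        outputs.append(partial)
    return outputs


weights = [1, 2, 3, 4]          # stands in for W0..W3
samples = [1, 0, 0, 0, 2]       # an impulse, then a later sample
outputs = pipelined_fir(samples, weights)
```

Feeding in an impulse makes the weights appear one by one in the output, which is a quick way to check that the chain of taps is wired correctly.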

Fine Grain Dataflow
  Cycle count by interface class:
    Class          First   Second
    Compute        2       2
    Communicate    0       3
    Overall        2       5

  First Class Interface:
    mul $r3, W_x, NET_IN_1
    add NET_OUT_1, NET_IN_2, $r3

  Second Class Interface:
    ld  $r4, NET_IN_1_ADDR
    ld  $r5, NET_IN_2_ADDR
    mul $r3, W_x, $r4
    add $r6, $r5, $r3
    st  NET_OUT_1_ADDR, $r6
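The cycle counts in the table can be reproduced by counting instructions in the two listings. The sketch below assumes one cycle per instruction and treats `ld` and `st` as communication and everything else as computation. Both assumptions are consistent with the slide's numbers but are not stated on it, and the helper name `cycle_counts` is hypothetical.

```python
# Assumption: loads and stores are the only communication instructions.
COMMUNICATE_OPS = {"ld", "st"}


def cycle_counts(listing):
    """Count compute and communicate instructions, one cycle each (assumed)."""
    ops = [line.split()[0] for line in listing]
    communicate = sum(1 for op in ops if op in COMMUNICATE_OPS)
    compute = len(ops) - communicate
    return {"compute": compute, "communicate": communicate, "overall": len(ops)}


first_class = [
    "mul $r3, Wx, NET_IN_1",
    "add NET_OUT_1, NET_IN_2, $r3",
]

second_class = [
    "ld $r4, NET_IN_1_ADDR",
    "ld $r5, NET_IN_2_ADDR",
    "mul $r3, Wx, $r4",
    "add $r6, $r5, $r3",
    "st NET_OUT_1_ADDR, $r6",
]

first_counts = cycle_counts(first_class)
second_counts = cycle_counts(second_class)
```

Under these assumptions the register-mapped version spends nothing on communication, while the memory-mapped version adds two loads and one store around the same two arithmetic instructions.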

DSP Filter Style
  [Figure: filter pipeline mapped onto tiles; block labels: Off-chip (two instances), Down-Sample, FFT, Frequency Domain Filter, FFT^-1]
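One possible reading of the block labels is a chain of down-sampling, a forward transform, a frequency-domain filter, and an inverse transform. The Python sketch below follows that reading. The stage order, the function names, the naive O(n^2) DFT used in place of an FFT, and the bin-masking filter are all assumptions made for illustration, not details taken from the slide.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stands in for the FFT blocks)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def downsample(samples, factor):
    """Keep every factor-th sample."""
    return samples[::factor]


def frequency_filter(spectrum, keep_bins):
    """Zero every frequency bin that is not listed in keep_bins."""
    return [v if k in keep_bins else 0 for k, v in enumerate(spectrum)]


def filter_chain(samples, factor, keep_bins):
    """Down-sample, transform, filter, and transform back."""
    reduced = downsample(samples, factor)
    spectrum = dft(reduced)
    filtered = frequency_filter(spectrum, keep_bins)
    return dft(filtered, inverse=True)


samples = [2, 7, 0, 7, 2, 7, 0, 7]
result = filter_chain(samples, factor=2, keep_bins={0})
```

Here the down-sampled signal is 2, 0, 2, 0, and keeping only bin 0 leaves its mean, so every output value is close to 1. Each stage is a separate function, which mirrors how the stages could be placed on separate tiles.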

Raw is Composable
  Mix and match types of parallelism
  [Figure: tiles partitioned among applications; labels: 4-way threaded Java application, 2-way RawCC application, httpd, white balance (x2), aliasing filter, mem, Zzz]

Raw Status Stats
  IBM SA-27E, 0.15 µm, 6-layer copper
  18.2 mm x 18.2 mm die
  0.122 billion transistors
  2048 KB SRAM on-chip
  1657-pin CCGA package
  1080 HSTL signal I/O, operating at core speed
  225 MHz
  ~25 watts

The Raw Performance
  16 OPS/FLOPS per cycle (= 3.6 GFLOPS)
  230 Gb/s of on-chip “bisection bandwidth”
  201 Gb/s of off-chip I/O bandwidth
  115 Gb/s of on-chip memory bandwidth
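The 3.6 GFLOPS figure follows from the 16 operations per cycle on this slide and the 225 MHz clock on the status slide. The short Python calculation below reproduces it. The variable names are hypothetical, and splitting the 16 operations evenly as one per tile is an assumption.

```python
TILES = 16                     # from the prototype slide
OPS_PER_TILE_PER_CYCLE = 1     # assumption: 16 ops per cycle spread over 16 tiles
CLOCK_HZ = 225e6               # 225 MHz, from the status slide

ops_per_cycle = TILES * OPS_PER_TILE_PER_CYCLE
peak_gflops = ops_per_cycle * CLOCK_HZ / 1e9
```

This is a peak number: it assumes every tile issues a useful operation on every cycle.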

Raw Status
  Working:
    Cycle-accurate software simulator
    RTL simulation
    Emulation system
    RawCC ILP compiler
  Current:
    Verification
    Backend completion
  Tapeout: December 2001
  Chips back: Summer 2002

Summary
  Raw’s first-class communication facilitates the exploitation of new forms of parallelism in signal processing applications.

Extra Slides