The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,


1 The Raw Architecture: Signal Processing on a Scalable Composable Computation Fabric. David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman, Henry Hoffmann, Arvind Saraf, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. http://www.cag.lcs.mit.edu/raw MIT Laboratory for Computer Science

2 Outline: Motivation; Architecture; Raw Prototype; Networks; Signal Processing Applications; Status

3 Wire Delay and Tiled Architectures. Problem: The number of gates we can reach in one clock cycle is staying constant, but our chips are getting bigger. Solutions: 1. Hide wire-delay latency in the micro-architecture (clustering, hidden communication stalls). 2. Expose the communication at the instruction set level and allow the software to exploit locality. Fact 1: The number of transistors is growing. Fact 2: Wires are not getting proportionally faster.

4 Wire Delay and Tiled Architectures. 2. Expose the communication at the instruction set level and allow the software to exploit locality.

5 Wire Delay and Tiled Architectures. 2. Expose the communication at the instruction set level and allow the software to exploit locality. Make a tile as large as a signal can travel in one clock cycle, and expose longer communication to the programmer.

6 Wire Delay and Tiled Architectures. 2. Expose the communication at the instruction set level and allow the software to exploit locality. Make a tile as large as a signal can travel in one clock cycle, and expose longer communication to the programmer.

7 What Are We Building? The Raw Prototype: 16 replicated tiles (processors). What is in a tile? An 8-stage pipelined MIPS-like 32-bit processor; a pipelined floating-point unit; a 32 KB data cache; 32 KB of instruction memory; interconnect routers.

8 Raw’s Networking Resources. 2 Dynamic Networks: fire and forget; header encodes destination; 2-stage router pipeline. 2 Static Networks: software-configurable crossbar; interlocked and flow controlled; 5-stage static router pipeline; 3-cycle nearest-neighbor ALU-to-ALU communication latency; no header overhead, but requires knowledge of communication patterns at compile time.
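The "header encodes destination" point can be illustrated with a small sketch. The slide does not say how a dynamic router picks the next hop, so the code below assumes dimension-ordered (X first, then Y) routing on a 4x4 mesh of tiles. The coordinate scheme and the names `next_hop` and `route` are hypothetical, chosen only for illustration.

```python
# Minimal sketch, NOT taken from the slides: dimension-ordered routing is an
# assumption, as are the (x, y) tile coordinates and the 4x4 arrangement.
MESH_SIDE = 4  # assumed: 16 tiles laid out as a 4x4 grid


def next_hop(current, dest):
    """Return the neighboring tile to forward to, or None if we have arrived."""
    cx, cy = current
    dx, dy = dest
    if cx != dx:                       # correct the X coordinate first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                       # then correct the Y coordinate
        return (cx, cy + (1 if dy > cy else -1))
    return None                        # header destination matches this tile


def route(src, dest):
    """List every tile a message visits, using only the destination in its header."""
    for x, y in (src, dest):
        if not (0 <= x < MESH_SIDE and 0 <= y < MESH_SIDE):
            raise ValueError("tile coordinate outside the mesh")
    path = [src]
    while True:
        hop = next_hop(path[-1], dest)
        if hop is None:
            return path
        path.append(hop)
```

Each router in this sketch needs only the destination carried in the header, which is what lets the sender "fire and forget". A static network, by contrast, would have the equivalent of `path` fixed at compile time.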

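9 Memory Mapped Communication is Not a First Class Citizen. [Figure: processor pipeline diagram; stage labels as extracted: IF, RF, D, A, TL, M1, M2, FP, E, U, TV, F4, WB.] Communication goes to other tiles through a memory system that happens to go over a network.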

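10 Raw’s First Class Register-Mapped Communication. [Figure: processor pipeline diagram (stage labels as extracted: IF, RF, D, A, TL, M1, M2, FP, E, U, TV, F4, WB), with registers r24, r25, r26, r27 connected to the network input FIFOs and to the network output FIFOs.] Ex: add r26, r25, r24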
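The example instruction above can be mimicked with a toy model. This is a minimal sketch and not the Raw hardware: it assumes that reading r24-r27 pops a word from the matching network input FIFO and that writing one of them pushes a word onto the matching output FIFO. The `Tile` class and its method names are hypothetical.

```python
from collections import deque


class Tile:
    """Toy tile whose registers r24-r27 are mapped onto network FIFOs (assumed model)."""

    NET_REGS = (24, 25, 26, 27)

    def __init__(self):
        self.regs = [0] * 32
        self.net_in = {r: deque() for r in self.NET_REGS}
        self.net_out = {r: deque() for r in self.NET_REGS}

    def read(self, r):
        # Reading a network register consumes the next word from its input FIFO.
        # Real hardware would stall on an empty FIFO; this sketch raises IndexError.
        if r in self.net_in:
            return self.net_in[r].popleft()
        return self.regs[r]

    def write(self, r, value):
        # Writing a network register sends the word out; other registers just store it.
        if r in self.net_out:
            self.net_out[r].append(value & 0xFFFFFFFF)
        else:
            self.regs[r] = value & 0xFFFFFFFF

    def add(self, rd, rs, rt):
        self.write(rd, self.read(rs) + self.read(rt))


def demo():
    tile = Tile()
    tile.net_in[25].append(3)   # word arriving from one neighbor
    tile.net_in[24].append(4)   # word arriving from another neighbor
    tile.add(26, 25, 24)        # add r26, r25, r24
    return list(tile.net_out[26])
```

In this model a single `add` receives two operands from the network and sends the result back out, with no separate load or store instructions.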

11 Signal Processing Applications. Problem: Increase the performance of signal processing in a scalable fashion. Solution: Exploit parallelism in signal processing applications at all levels.

12 Types of Parallelism in Signal Processing: DSP Filter Style; Fine Grain Dataflow; Instruction Level Parallelism; Data Parallel; Thread Level Parallelism (MPI). [Figure labels: Current Architectures; Raw]

13 Instruction Level Parallelism. RawCC: maps dataflow graphs across tiles; ILP across a multiprocessor; heavily latency sensitive; single-cycle reconfigurable communication.

14 Fine Grain Dataflow. Ex: Pipelined FIR Filter. [Figure: filter pipeline; labels as extracted: x_n, x_n-1, x_n-3, W0, W1, W2, W3, Σ]
Computation: mul, add
Input Operands: x_i, Σ_l
Output Operands: Σ_k
Cycle count
Class        First  Second
Compute      2      2
Communicate  0      3
Overall      2      5
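The pipelined FIR structure can be sketched functionally. The code below is a minimal model and not the Raw implementation: it assumes each stage holds one weight, performs one multiply and one add per sample, and forwards both the sample stream (delayed by one sample) and the running partial sum to the next stage. The function names are hypothetical.

```python
def fir_stage(weight, xs, partial_sums):
    """One pipeline stage: a mul and an add per sample (inputs x_i and an incoming sum)."""
    out_sums = [s + weight * x for x, s in zip(xs, partial_sums)]
    delayed_xs = [0] + xs[:-1]        # pass x along, one sample later
    return delayed_xs, out_sums


def pipelined_fir(weights, xs):
    """Chain one stage per weight; the last stage's sums are the filter output."""
    xs = list(xs)
    sums = [0] * len(xs)
    for w in weights:
        xs, sums = fir_stage(w, xs, sums)
    return sums


def direct_fir(weights, xs):
    """Reference: y[n] = sum over k of w[k] * x[n-k], with x[n] = 0 for n < 0."""
    return [
        sum(w * xs[n - k] for k, w in enumerate(weights) if n - k >= 0)
        for n in range(len(xs))
    ]
```

Feeding an impulse through `pipelined_fir` returns the weights in order, which is a quick way to check that the chain of stages matches the direct sum.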

15 Fine Grain Dataflow
Cycle count
Class        First  Second
Compute      2      2
Communicate  0      3
Overall      2      5
First Class Interface:
    mul $r3, Wx, NET_IN_1
    add NET_OUT1, NET_IN_2, $r3
Second Class Interface:
    ld  $r4, NET_IN_1_ADDR
    ld  $r5, NET_IN_2_ADDR
    mul $r3, Wx, $r4
    add $r6, $r5, $r3
    st  NET_OUT_1_ADDR, $r6
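The cycle-count table can be reproduced by tallying the two listings. This is a minimal sketch that assumes one cycle per instruction and treats `ld` and `st` as communication and everything else as computation. That classification is an assumption made here because it matches the table; the slides do not state it.

```python
FIRST_CLASS = [
    "mul $r3, Wx, NET_IN_1",
    "add NET_OUT1, NET_IN_2, $r3",
]

SECOND_CLASS = [
    "ld $r4, NET_IN_1_ADDR",
    "ld $r5, NET_IN_2_ADDR",
    "mul $r3, Wx, $r4",
    "add $r6, $r5, $r3",
    "st NET_OUT_1_ADDR, $r6",
]

COMMUNICATE_OPS = {"ld", "st"}  # assumed: loads/stores exist only to move network data


def cycle_counts(program):
    """Count cycles per class, assuming each instruction takes exactly one cycle."""
    communicate = sum(1 for ins in program if ins.split()[0] in COMMUNICATE_OPS)
    compute = len(program) - communicate
    return {"compute": compute, "communicate": communicate, "overall": len(program)}
```

Under these assumptions the first-class listing spends nothing on communication, while the second-class listing spends three of its five cycles moving operands through memory-mapped addresses.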

16 DSP Filter Style. [Figure: block diagram; labels: Off-chip, Downsample, FFT, Frequency Domain Filter, inverse FFT (FFT^-1), Off-chip]

17 Raw is Composable. Mix and match types of parallelism. [Figure labels: 4-way threaded Java application; 2-way RawCC application; httpd; White balance; White balance; Aliasing filter; mem; Zzz.]

18 Raw Status: Stats. IBM SA-27E, 0.15 µm, 6-layer copper; 18.2 mm x 18.2 mm die; 0.122 billion transistors; 2048 KB SRAM on-chip; 1657-pin CCGA package; 1080 HSTL signal I/O operating at core speed; ≈ 225 MHz; ~25 Watts

19 The Raw Performance: 16 OPS/FLOPS per cycle (at 225 MHz = 3.6 GFLOPS); 230 Gb/s of on-chip “bisection bandwidth”; 201 Gb/s of off-chip I/O bandwidth; 115 Gb/s of on-chip memory bandwidth
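Some of these figures can be checked with simple arithmetic. The peak rate follows directly from the slide (16 operations per cycle at 225 MHz). The bandwidth formulas below are an assumed reconstruction and are not stated in the slides. They assume a 4x4 mesh, four 32-bit networks (the two dynamic and two static networks), links running in both directions at core speed, and one 32-bit memory word per tile per cycle. The off-chip figure is not modeled.

```python
CLOCK_HZ = 225e6      # from the slides: about 225 MHz
TILES = 16            # from the slides: 16 replicated tiles
WORD_BITS = 32        # from the slides: 32-bit processor
NETWORKS = 4          # from the slides: 2 dynamic + 2 static networks
MESH_SIDE = 4         # assumption: tiles arranged as a 4x4 mesh
DIRECTIONS = 2        # assumption: each link carries traffic both ways


def peak_gops():
    """One operation per tile per cycle."""
    return TILES * CLOCK_HZ / 1e9


def bisection_gbps():
    """Assumed model: links cut by splitting the mesh in half, across all networks."""
    return MESH_SIDE * NETWORKS * DIRECTIONS * WORD_BITS * CLOCK_HZ / 1e9


def memory_gbps():
    """Assumed model: every tile accesses one 32-bit word of local memory per cycle."""
    return TILES * WORD_BITS * CLOCK_HZ / 1e9
```

Under these assumptions the functions give 3.6, 230.4, and 115.2, which line up with the 3.6 GFLOPS, 230 Gb/s, and 115 Gb/s quoted on the slide.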

20 Raw Status. Working: cycle-accurate software simulator; RTL simulation; emulation system; RawCC ILP compiler. Current: verification; backend. Completion: tapeout December 2001; chips back Summer 2002.

21 Summary Raw’s First Class communication facilitates exploitation of new forms of parallelism in Signal Processing applications

22 Extra Slides

