The Imagine Stream Processor Flexibility with Performance March 30, 2001 William J. Dally Computer Systems Laboratory Stanford University


1 The Imagine Stream Processor: Flexibility with Performance
March 30, 2001
William J. Dally
Computer Systems Laboratory, Stanford University
billd@csl.stanford.edu

2 Outline (March 30, 2001, Convergence Workshop)
Motivation
  – We need low-power, programmable TeraOps
The problem is bandwidth
  – Growing gap between special-purpose and general-purpose hardware
  – It's easy to make ALUs, hard to keep them fed
A stream processor gives programmable bandwidth
  – Streams expose locality and concurrency in the application
  – A bandwidth hierarchy exploits this
Imagine is a 20GFLOPS prototype stream processor
Many opportunities to do better
  – Scaling up
  – Simplifying programming

3 Motivation
Some things I’d like to do with a few TeraOps
  – Have a realistic face-to-face meeting with someone in Boston without riding an airplane
      4-8 cameras, extract depth, fit model, compress, render to several screens
  – High-quality rendering at video rates
      Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s
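The rendering target above implies a concrete pixel rate. The following is a rough sketch of that arithmetic, not a figure from the talk: it assumes "2K x 4K" means 2048 x 4096 pixels, one primary ray per pixel, and a hypothetical budget of exactly 1 TeraOp/s.

```python
# Pixel throughput implied by the ray-tracing example on the slide.
WIDTH = 4 * 1024          # "4K" read as 4096 pixels (assumption)
HEIGHT = 2 * 1024         # "2K" read as 2048 pixels (assumption)
FRAMES_PER_SECOND = 60    # from the slide

pixels_per_frame = WIDTH * HEIGHT
# One primary ray per pixel is an assumption; real renderers cast more.
primary_rays_per_second = pixels_per_frame * FRAMES_PER_SECOND

# Hypothetical budget: how many operations each primary ray could
# spend if exactly 1 TeraOp/s were available.
OPS_PER_SECOND = 1e12
ops_per_ray = OPS_PER_SECOND / primary_rays_per_second

print(pixels_per_frame, primary_rays_per_second, round(ops_per_ray))
```

Under these assumptions each primary ray gets roughly two thousand operations to traverse a scene of 10^5 objects, which is why the slide asks for "a few" TeraOps rather than one.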

4 The good news – FLOPS are cheap, OPS are cheaper
32-bit FPU – 2GFLOPS/mm^2 – 400GFLOPS/chip
16-bit add – 40GOPS/mm^2 – 8TOPS/chip
[Figure: layout of a local RF and integer adder, 460 µm x 146.7 µm]
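The per-area and per-chip figures above can be cross-checked against each other. This small sketch only divides the numbers on the slide; the interpretation of the result as "silicon area devoted to arithmetic" is an assumption.

```python
# Density and per-chip figures as stated on the slide.
FPU_GFLOPS_PER_MM2 = 2.0
FPU_GFLOPS_PER_CHIP = 400.0
ADD_GOPS_PER_MM2 = 40.0
ADD_GOPS_PER_CHIP = 8000.0   # 8 TOPS

# Area implied by each pair of figures (assumed to be arithmetic area).
fpu_area_mm2 = FPU_GFLOPS_PER_CHIP / FPU_GFLOPS_PER_MM2
add_area_mm2 = ADD_GOPS_PER_CHIP / ADD_GOPS_PER_MM2

# How much denser 16-bit adds are than 32-bit floating point.
density_ratio = ADD_GOPS_PER_MM2 / FPU_GFLOPS_PER_MM2

print(fpu_area_mm2, add_area_mm2, density_ratio)
```

Both lines imply the same 200 mm^2 of arithmetic, and 16-bit adds come out 20 times denser than 32-bit floating point.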

5 The bad news – General purpose processors can’t harness this

6 Why do Special-Purpose Processors Perform Well?
Fed by dedicated wires/memories
Lots (100s) of ALUs

7 Care and Feeding of ALUs
[Figure labels: Data Bandwidth, Instruction Bandwidth, Regs, Instr. Cache, IR, IP]
‘Feeding’ Structure Dwarfs ALU

8 The problem is bandwidth
Can we solve this bandwidth problem without sacrificing programmability?

9 Streams expose locality and concurrency
[Figure: Image 0 and Image 1 each pass through a convolve kernel; a SAD kernel combines the two results into a Depth Map]
Operations within a kernel operate on local data
Streams expose data parallelism
Kernels can be partitioned across chips to exploit control parallelism
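The kernel-and-stream structure described above can be sketched in ordinary Python. This is an illustration of the dataflow only, not Imagine's programming interface: the names `convolve_kernel` and `sad_kernel` are hypothetical, and the 3-tap filter and per-pixel absolute difference are simple stand-ins for the real kernels.

```python
from typing import Iterable, Iterator, List

Row = List[int]

def convolve_kernel(rows: Iterable[Row]) -> Iterator[Row]:
    """Stand-in kernel: 3-tap horizontal box filter on each row."""
    for row in rows:
        padded = [row[0]] + row + [row[-1]]   # replicate the border pixels
        yield [(padded[i] + padded[i + 1] + padded[i + 2]) // 3
               for i in range(len(row))]

def sad_kernel(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Row]:
    """Stand-in kernel: per-pixel absolute difference of two row streams."""
    for l_row, r_row in zip(left, right):
        yield [abs(a - b) for a, b in zip(l_row, r_row)]

# Two tiny input "images", each a stream of rows.
image0 = [[10, 20, 30, 40], [40, 30, 20, 10]]
image1 = [[10, 20, 30, 40], [10, 20, 30, 40]]

# Kernels are chained through streams; each kernel touches only the
# row it is currently processing (local data), and rows are independent
# of one another (data parallelism).
result_stream = sad_kernel(convolve_kernel(image0), convolve_kernel(image1))
result = list(result_stream)
print(result)
```

Because each kernel only consumes and produces streams, the two convolve stages could in principle run on different chips from the SAD stage, which is the partitioning the slide refers to.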

10 A Bandwidth Hierarchy exploits locality and concurrency
VLIW clusters with shared control
41.2 32-bit operations per word of memory bandwidth
[Figure: SDRAM, Stream Register File, ALU Cluster; bandwidths 2GB/s, 32GB/s, 544GB/s]
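The three bandwidth figures can be turned into ratios between levels of the hierarchy. The sketch below assumes the labels map in order (2GB/s at the SDRAM, 32GB/s at the stream register file, 544GB/s inside the ALU clusters) and that a memory word is 4 bytes; neither is spelled out in the transcript.

```python
# Bandwidth figures from the slide, in GB/s. The assignment of each
# figure to a level is an assumption read off the flattened diagram.
SDRAM_GBPS = 2.0
SRF_GBPS = 32.0
CLUSTER_GBPS = 544.0

srf_over_sdram = SRF_GBPS / SDRAM_GBPS          # step from memory to SRF
cluster_over_srf = CLUSTER_GBPS / SRF_GBPS      # step from SRF to clusters
cluster_over_sdram = CLUSTER_GBPS / SDRAM_GBPS  # end-to-end ratio

# Arithmetic rate implied by "41.2 32-bit operations per word of
# memory bandwidth", assuming 4-byte words (assumption).
BYTES_PER_WORD = 4
OPS_PER_MEMORY_WORD = 41.2
memory_gwords_per_s = SDRAM_GBPS / BYTES_PER_WORD
implied_gops = memory_gwords_per_s * OPS_PER_MEMORY_WORD

print(srf_over_sdram, cluster_over_srf, cluster_over_sdram, implied_gops)
```

Under these assumptions the clusters see 272 times the memory bandwidth, and the operations-per-word figure works out to about 20.6 GOPS, in line with the roughly 20GFLOPS peak quoted elsewhere in the talk.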

11 Bandwidth Usage
[Figure: SDRAM, Stream Register File, ALU Cluster; bandwidths 2GB/s, 32GB/s, 544GB/s]

12 The Imagine Stream Processor

13 Arithmetic Clusters

14 Performance
[Figure: performance chart; groups: 16-bit kernels, 16-bit applications, floating-point application, floating-point kernel]

15 Power
GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9

16 A Look Inside an Application
Stereo Depth Extraction
320x240 8-bit grayscale images
30 disparity search
220 frames/second
12.7 GOPS
5.7 GOPS/W
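The figures on this slide imply a per-pixel operation count and a power draw. The sketch below only combines the stated numbers; treating all 12.7 GOPS as spent evenly per pixel and per disparity is a simplifying assumption.

```python
# Figures stated on the slide.
WIDTH, HEIGHT = 320, 240
FRAMES_PER_SECOND = 220
DISPARITIES = 30
GOPS = 12.7
GOPS_PER_WATT = 5.7

pixels_per_second = WIDTH * HEIGHT * FRAMES_PER_SECOND

# Average operations per output pixel, and per pixel per disparity
# (assumes work is spread evenly, which is a simplification).
ops_per_pixel = GOPS * 1e9 / pixels_per_second
ops_per_pixel_per_disparity = ops_per_pixel / DISPARITIES

# Power implied by the throughput and efficiency figures.
implied_watts = GOPS / GOPS_PER_WATT

print(pixels_per_second,
      round(ops_per_pixel),
      round(ops_per_pixel_per_disparity, 1),
      round(implied_watts, 2))
```

That comes to about 750 operations per pixel, roughly 25 per pixel per candidate disparity, at a little over 2W.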

17 Stereo Depth Extractor
Convolutions
  – Load original packed row
  – Unpack (8-bit -> 16-bit)
  – 7x7 Convolve
  – 3x3 Convolve
  – Store convolved row
Disparity Search
  – Load convolved rows
  – Calculate BlockSADs at different disparities
  – Store best disparity values
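The disparity-search half of this pipeline can be sketched as a sum-of-absolute-differences (SAD) search. This is a simplified, one-dimensional illustration, not the Imagine kernel: the function name `best_disparities`, the 3-pixel block, and the border clamping are all assumptions, and the real application works on 2D blocks over several rows.

```python
def best_disparities(left_row, right_row, max_disparity, block=3):
    """For each pixel of left_row, return the shift d in
    [0, max_disparity) whose block SAD against right_row is smallest."""
    half = block // 2
    n = len(left_row)
    result = []
    for x in range(n):
        best_d, best_sad = 0, None
        for d in range(max_disparity):
            sad = 0
            for k in range(-half, half + 1):
                xl = min(max(x + k, 0), n - 1)       # clamp at the borders
                xr = min(max(x + k - d, 0), n - 1)   # shifted position
                sad += abs(left_row[xl] - right_row[xr])
            # Keep the first disparity that achieves the lowest SAD.
            if best_sad is None or sad < best_sad:
                best_d, best_sad = d, sad
        result.append(best_d)
    return result

# The left row is the right row shifted by two pixels, so interior
# pixels with texture should report a disparity of 2.
right = [0, 0, 5, 9, 3, 7, 1, 0, 0, 0]
left = [0, 0, 0, 0, 5, 9, 3, 7, 1, 0]
disparities = best_disparities(left, right, max_disparity=4)
print(disparities[4:7])
```

Each candidate disparity is independent of the others, so the inner loops map naturally onto many ALUs working on the same loaded rows.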

18 7x7 Convolve Kernel
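The slide's kernel listing did not survive in the transcript. As a stand-in, here is a plain reference version of a 7x7 convolution so the amount of work per pixel is visible. It is a sketch under assumptions: the name `convolve2d` and the border clamping are choices made here, the kernel is applied without flipping (identical for symmetric kernels), and it says nothing about how Imagine schedules the operations.

```python
def convolve2d(image, kernel):
    """Apply a small square kernel to a 2D image (list of rows).
    Border pixels are handled by clamping coordinates (assumption)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    yy = min(max(y + j - kh // 2, 0), h - 1)
                    xx = min(max(x + i - kw // 2, 0), w - 1)
                    acc += kernel[j][i] * image[yy][xx]
            out[y][x] = acc
    return out

# A 7x7 box kernel: 49 multiply-adds per output pixel.
box7 = [[1] * 7 for _ in range(7)]
flat_image = [[2] * 8 for _ in range(8)]
smoothed = convolve2d(flat_image, box7)
print(smoothed[0][0], smoothed[4][4])
```

Every output pixel reuses the same 49 neighbouring inputs as its neighbours do, which is the kind of locality a kernel running out of local registers can exploit.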

19 Imagine gives high performance with low power and flexible programming
Matches capabilities of communication-limited technology to demands of signal and image processing applications
Performance
  – compound stream operations realize >10GOPS on key applications
  – can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
  – three-level register hierarchy gives 2-10GOPS/W
Flexibility
  – programmed in “C”
  – streaming model
  – conditional stream operations enable applications like sort

20 A look forward
Next steps
  – Build some Imagine prototypes
      Dual-processor 40GFLOPS systems, 64-processor TeraFLOPS systems
Longer term
  – ‘Industrial Strength’ Imagine – 100-200GFLOPS/chip
      Multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth
  – Graphics extensions
      Texture cache, raster unit – as SRF clients
  – A streaming supercomputer
      64-bit FP, high-bandwidth global memory, MIMD extensions
  – Simplified stream programming
      Automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, staging of data.
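The prototype system sizes follow from the per-chip peak rate. A minimal sketch, assuming the 20GFLOPS peak figure quoted for one Imagine chip and perfectly linear scaling across chips (an idealization):

```python
PEAK_GFLOPS_PER_CHIP = 20   # per-chip peak quoted in the talk

def system_peak_gflops(num_chips: int) -> int:
    """Aggregate peak rate, assuming ideal linear scaling (assumption)."""
    return num_chips * PEAK_GFLOPS_PER_CHIP

dual_processor = system_peak_gflops(2)
sixty_four_processor = system_peak_gflops(64)

print(dual_processor, sixty_four_processor)
```

Two chips give the 40GFLOPS system on the slide, and 64 chips give 1280GFLOPS of peak, which is where the "TeraFLOPS" label comes from.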

21 Take home message
VLSI technology enables us to put TeraOPS on a chip
Conventional general-purpose architecture cannot exploit this
  – The problem is bandwidth
Casting an application as kernels operating on streams exposes locality and concurrency
A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth
  – Bandwidth hierarchy, compound stream operations
Imagine is a prototype stream processor
  – One chip – 20GFLOPS peak, 10GFLOPS sustained, 4W
  – Systems scale to TeraFLOPS and more.
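The one-chip summary figures can be combined into an efficiency estimate. This sketch assumes the 4W figure applies while the chip delivers its sustained rate, and it counts only the processor chips when sizing a sustained-TeraFLOPS system; both are assumptions, not statements from the talk.

```python
import math

# One-chip figures from the slide.
PEAK_GFLOPS = 20.0
SUSTAINED_GFLOPS = 10.0
CHIP_WATTS = 4.0

sustained_fraction = SUSTAINED_GFLOPS / PEAK_GFLOPS
# Assumes the 4W figure holds at the sustained rate (assumption).
sustained_gflops_per_watt = SUSTAINED_GFLOPS / CHIP_WATTS

# Chips and chip power needed for 1 TFLOPS sustained, ignoring memory,
# interconnect, and any scaling losses (assumption).
chips_for_one_tflops = math.ceil(1000.0 / SUSTAINED_GFLOPS)
chip_power_watts = chips_for_one_tflops * CHIP_WATTS

print(sustained_fraction, sustained_gflops_per_watt,
      chips_for_one_tflops, chip_power_watts)
```

The result, 2.5 GFLOPS/W sustained, sits at the low end of the 2-10GOPS/W range quoted earlier in the talk.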

