The Imagine Stream Processor Flexibility with Performance March 30, 2001 William J. Dally Computer Systems Laboratory Stanford University

billd@csl.stanford.edu

March 30, 2001, Convergence Workshop

Outline
Motivation
–We need low-power, programmable TeraOps
The problem is bandwidth
–Growing gap between special-purpose and general-purpose hardware
–It's easy to make ALUs, hard to keep them fed
A stream processor gives programmable bandwidth
–Streams expose locality and concurrency in the application
–A bandwidth hierarchy exploits this
Imagine is a 20GFLOPS prototype stream processor
Many opportunities to do better
–Scaling up
–Simplifying programming

Motivation
Some things I’d like to do with a few TeraOps
–Have a realistic face-to-face meeting with someone in Boston without riding an airplane
  4-8 cameras, extract depth, fit model, compress, render to several screens
–High-quality rendering at video rates
  Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s

The good news – FLOPS are cheap, OPS are cheaper
32-bit FPU – 2GFLOPS/mm^2 – 400GFLOPS/chip
16-bit add – 40GOPS/mm^2 – 8TOPS/chip
[Figure: layout labeled Local RF and Integer Adder; dimensions 460 µm and 146.7 µm]
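The per-area and per-chip figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes each per-chip number is simply the per-area density multiplied by a usable die area. The slide does not state that area, so `implied_area_mm2` is a derived, hypothetical quantity rather than a quoted one.

```python
# Densities and per-chip totals as quoted on the slide.
FPU_GFLOPS_PER_MM2 = 2.0       # 32-bit FPU
FPU_GFLOPS_PER_CHIP = 400.0
ADD_GOPS_PER_MM2 = 40.0        # 16-bit add
ADD_GOPS_PER_CHIP = 8000.0     # 8 TOPS expressed in GOPS


def implied_area_mm2(per_chip: float, per_mm2: float) -> float:
    """Area that would have to be filled with units to reach the per-chip rate.

    Assumption: per-chip rate = density * area, with no other overheads.
    """
    return per_chip / per_mm2


fpu_area = implied_area_mm2(FPU_GFLOPS_PER_CHIP, FPU_GFLOPS_PER_MM2)
add_area = implied_area_mm2(ADD_GOPS_PER_CHIP, ADD_GOPS_PER_MM2)
print(f"FPU: {fpu_area} mm^2, 16-bit add: {add_area} mm^2")
```

Under that assumption both rows imply the same area, which suggests the two per-chip figures were computed for one chip size.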

The bad news – General purpose processors can’t harness this

Why do Special-Purpose Processors Perform Well?
Fed by dedicated wires/memories
Lots (100s) of ALUs

Care and Feeding of ALUs
[Figure labels: Data Bandwidth, Instruction Bandwidth, Regs, Instr. Cache, IR, IP]
‘Feeding’ Structure Dwarfs ALU

The problem is bandwidth
Can we solve this bandwidth problem without sacrificing programmability?

Streams expose locality and concurrency
[Figure: stream dataflow graph with labels Image 0, Image 1, convolve (twice), SAD, Depth Map]
Operations within a kernel operate on local data
Streams expose data parallelism
Kernels can be partitioned across chips to exploit control parallelism
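The kernels-on-streams idea can be illustrated with a toy pipeline. This is a minimal sketch, not Imagine's programming interface. It assumes a stream is a sequence of image rows and a kernel is a function that consumes one or more streams and produces a stream. The names `convolve_kernel` and `sad_kernel`, the 3-tap filter, and the test data are all hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence


def convolve_kernel(rows: Iterable[Sequence[int]],
                    taps: Sequence[int]) -> Iterator[List[int]]:
    """Kernel: 1-D convolution of every row in the input stream."""
    half = len(taps) // 2
    for row in rows:
        out = []
        for x in range(len(row)):
            acc = 0  # local accumulator: intermediate data never leaves the kernel
            for k, tap in enumerate(taps):
                xi = min(max(x + k - half, 0), len(row) - 1)  # clamp at row edges
                acc += tap * row[xi]
            out.append(acc)
        yield out


def sad_kernel(rows0: Iterable[Sequence[int]],
               rows1: Iterable[Sequence[int]]) -> Iterator[int]:
    """Kernel: sum of absolute differences of paired rows from two streams."""
    for r0, r1 in zip(rows0, rows1):
        yield sum(abs(a - b) for a, b in zip(r0, r1))


image0 = [[1, 2, 3, 4], [5, 6, 7, 8]]
image1 = [[1, 2, 4, 4], [5, 5, 7, 8]]
taps = [1, 2, 1]

# Kernels are chained by passing streams; each row is an independent record,
# which is the data parallelism the slide refers to.
sads = list(sad_kernel(convolve_kernel(image0, taps),
                       convolve_kernel(image1, taps)))
print(sads)
```

Because each kernel only touches its own inputs and local accumulators, the two `convolve_kernel` instances could in principle run on different processors, which mirrors the slide's point about partitioning kernels across chips.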

A Bandwidth Hierarchy exploits locality and concurrency
VLIW clusters with shared control
41.2 32-bit operations per word of memory bandwidth
[Figure: SDRAM, Stream Register File, ALU Cluster; bandwidths 2GB/s, 32GB/s, 544GB/s]
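The steepness of the hierarchy is easier to see as ratios. The sketch below assumes the three bandwidths on the slide label the three levels in the order shown (SDRAM, Stream Register File, ALU Cluster). That pairing is read off the figure labels rather than stated in the text.

```python
# Bandwidth at each level in GB/s, paired with levels under the assumption above.
BANDWIDTH_GB_PER_S = {
    "SDRAM": 2,
    "Stream Register File": 32,
    "ALU Cluster": 544,
}

base = BANDWIDTH_GB_PER_S["SDRAM"]
ratios = {level: bw / base for level, bw in BANDWIDTH_GB_PER_S.items()}

for level, ratio in ratios.items():
    print(f"{level}: {ratio:g}x memory bandwidth")
```

Read this way, each level supplies far more bandwidth than the one below it, so an application sustains high arithmetic rates only if most of its data references are satisfied close to the ALUs.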

Bandwidth Usage
[Figure: SDRAM, Stream Register File, ALU Cluster; bandwidths 2GB/s, 32GB/s, 544GB/s]

The Imagine Stream Processor

Arithmetic Clusters

Performance
[Chart groups: 16-bit kernels, 16-bit applications, floating-point application, floating-point kernel]

Power
GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9

A Look Inside an Application
Stereo Depth Extraction
320x240 8-bit grayscale images
30 disparity search
220 frames/second
12.7 GOPS
5.7 GOPS/W
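A few derived quantities help put these figures in context. This is a back-of-the-envelope sketch, not a measurement. It assumes "30 disparity search" means 30 candidate disparities per pixel, and it spreads all operations (convolutions included) evenly over pixel-disparity pairs.

```python
WIDTH, HEIGHT = 320, 240
DISPARITIES = 30        # assumed: candidate disparities evaluated per pixel
FRAMES_PER_S = 220
GOPS = 12.7
GOPS_PER_W = 5.7

# Power implied by the quoted throughput and efficiency.
implied_power_w = GOPS / GOPS_PER_W

# Operations available per frame at the quoted rate.
ops_per_frame = GOPS * 1e9 / FRAMES_PER_S

# Rough operation budget per pixel per candidate disparity.
ops_per_pixel_disparity = ops_per_frame / (WIDTH * HEIGHT * DISPARITIES)

print(f"implied power: {implied_power_w:.2f} W")
print(f"ops per frame: {ops_per_frame:.3g}")
print(f"ops per pixel-disparity: {ops_per_pixel_disparity:.1f}")
```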

Stereo Depth Extractor
Convolutions
–Load original packed row
–Unpack (8-bit -> 16-bit)
–7x7 Convolve
–3x3 Convolve
–Store convolved row
Disparity Search
–Load convolved rows
–Calculate BlockSADs at different disparities
–Store best disparity values
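The disparity-search stage can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Imagine kernel. It uses 1-D blocks on a single pair of rows, where the real application presumably uses 2-D blocks over several convolved rows. The function name `best_disparities` and its parameters are hypothetical.

```python
from typing import List, Sequence


def best_disparities(left: Sequence[int], right: Sequence[int],
                     max_disparity: int, block: int = 3) -> List[int]:
    """For each position x in `left` where a block fits, return the shift d in
    [0, max_disparity) whose block in `right` has the lowest SAD."""
    result = []
    for x in range(len(left) - block + 1):
        best_d, best_sad = 0, None
        for d in range(max_disparity):
            if x - d < 0:
                break  # shifted block would fall off the start of the row
            # Block SAD: sum of absolute differences over the block.
            sad = sum(abs(left[x + i] - right[x - d + i]) for i in range(block))
            if best_sad is None or sad < best_sad:
                best_d, best_sad = d, sad
        result.append(best_d)
    return result


right_row = [0, 0, 9, 5, 1, 0, 0, 0]
left_row = [0, 0, 0, 0, 9, 5, 1, 0]   # same pattern, shifted right by 2
disparities = best_disparities(left_row, right_row, max_disparity=4)
print(disparities)
```

In this toy input the block containing the pattern matches best at a shift of 2, the disparity the test rows were built with. Flat regions are ambiguous and default to the first shift tried.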

7x7 Convolve Kernel

Imagine gives high performance with low power and flexible programming
Matches capabilities of communication-limited technology to demands of signal and image processing applications
Performance
–compound stream operations realize >10GOPS on key applications
–can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
–three-level register hierarchy gives 2-10GOPS/W
Flexibility
–programmed in “C”
–streaming model
–conditional stream operations enable applications like sort

A look forward
Next steps
–Build some Imagine prototypes
  Dual-processor 40GFLOPS systems, 64-processor TeraFLOPS systems
Longer term
–‘Industrial Strength’ Imagine – 100-200GFLOPS/chip
  Multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth
–Graphics extensions
  Texture cache, raster unit – as SRF clients
–A streaming supercomputer
  64-bit FP, high-bandwidth global memory, MIMD extensions
–Simplified stream programming
  Automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, staging of data.

Take home message
VLSI technology enables us to put TeraOPS on a chip
Conventional general-purpose architecture cannot exploit this
–The problem is bandwidth
Casting an application as kernels operating on streams exposes locality and concurrency
A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth
–Bandwidth hierarchy, compound stream operations
Imagine is a prototype stream processor
–One chip – 20GFLOPS peak, 10GFLOPS sustained, 4W
–Systems scale to TeraFLOPS and more.
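The single-chip figures can be turned into rough system-level numbers. The sketch below assumes peak throughput scales linearly with chip count and ignores inter-chip communication and memory limits, so the results are upper bounds, not predictions. The 64-chip case is the system size mentioned under next steps.

```python
import math

PEAK_GFLOPS = 20.0
SUSTAINED_GFLOPS = 10.0
POWER_W = 4.0

# Efficiency of one chip at its sustained rate.
sustained_gflops_per_w = SUSTAINED_GFLOPS / POWER_W

# Chips needed to reach 1 TFLOPS peak, assuming ideal linear scaling.
chips_for_peak_teraflops = math.ceil(1000.0 / PEAK_GFLOPS)

# Peak rate of a 64-chip system under the same assumption.
peak_64_chip_gflops = 64 * PEAK_GFLOPS

print(sustained_gflops_per_w, chips_for_peak_teraflops, peak_64_chip_gflops)
```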
