
Stream Register Files with Indexed Access
Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally

HPCA-10NSJ2 Scaling Trends
ILP increasingly harder and more expensive to extract
Graphics processors exploit data parallelism
[Chart: CPU vs. GPU performance scaling over time, with GPU points NV10 and NV35; CPU data courtesy of Francois Labonte, Stanford University]

HPCA-10NSJ3 Renewed Interest in Data Parallelism
Data parallel application classes
– Media, signal, network processing, scientific simulations, encryption, etc.
High-end vector machines
– Have always been data parallel
Academic research
– Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.
“Main-stream” industry
– Sony Emotion Engine, Tarantula, etc.

HPCA-10NSJ4 Storage Hierarchy
Bandwidth taper
Only supports sequential streams/vectors
But many data parallel apps with
– Data reorderings
– Irregular data structures
– Conditional accesses
[Diagram: bandwidth taper from DRAM through the cache to the stream/vector storage and on to the compute units]

HPCA-10NSJ5 Sequential Streams/Vectors Inefficient
Evaluate arbitrary-order access to streams
[Diagram: a 4x4 tile produced in row-major order must be written back to memory/cache and reordered before it can be consumed in column-major order from the stream/vector storage by the compute units]
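
To make the reordering cost concrete, here is a minimal C++ sketch (illustrative only, not taken from the paper): a producer writes a 4x4 tile in row-major order, and because the consumer needs it column-major, a purely sequential stream organization forces a store/load round trip through memory whose only effect is to permute the data.

#include <cstddef>
#include <vector>

// Illustrative sketch: with sequential-only streams, a row-major producer and a
// column-major consumer can only be connected by reordering through memory.
int main() {
    const std::size_t N = 4;                       // 4x4 tile, as on the slide
    std::vector<float> produced(N * N);            // written in row-major order
    for (std::size_t i = 0; i < N * N; ++i)
        produced[i] = static_cast<float>(i);

    // "Reorder" step: a store/load round trip that only permutes data.
    std::vector<float> reordered(N * N);
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            reordered[c * N + r] = produced[r * N + c];   // transpose via memory

    // The consumer now reads 'reordered' sequentially (the column-major order of
    // the original), costing extra memory bandwidth although nothing was computed.
    float sum = 0.0f;
    for (float v : reordered) sum += v;
    return sum > 0 ? 0 : 1;
}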

HPCA-10NSJ6 Outline Stream processing overview Applications Implementation Results Conclusion

HPCA-10NSJ7 Stream Programming
Streams of records passing through compute kernels
Parallelism
– Across stream elements
– Across kernels
Locality
– Within kernels
– Between kernels
[Diagram: FFT_stage kernels consuming input streams in1 and in2 and producing an output stream]
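
A hedged sketch of the programming model in plain C++ (the names Record, Stream, and fft_stage, and the use of std::vector as a stand-in for a hardware stream, are illustrative assumptions, not the paper's API): a kernel consumes whole input streams and produces an output stream, which exposes parallelism across elements and keeps intermediate values local to the kernel.

#include <complex>
#include <vector>

using Record = std::complex<float>;
using Stream = std::vector<Record>;   // stand-in for a hardware stream

// One butterfly stage applied element-wise across two input streams of equal
// length (twiddle factors omitted). Every element can be processed independently
// (data parallelism), and all intermediate values stay inside the kernel
// (kernel locality).
Stream fft_stage(const Stream& in1, const Stream& in2) {
    Stream out(in1.size() * 2);
    for (std::size_t i = 0; i < in1.size(); ++i) {
        out[2 * i]     = in1[i] + in2[i];
        out[2 * i + 1] = in1[i] - in2[i];
    }
    return out;
}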

HPCA-10NSJ8 Bandwidth Hierarchy
Stream programming is well matched to the bandwidth hierarchy
[Diagram: FFT_stage kernels running on the compute units over time, fed from the stream register file (SRF), which is in turn fed from memory]

HPCA-10NSJ9 Stream Processors
Several lanes
– Execute in SIMD
– Operate on records
[Diagram: N lanes, each pairing a compute cluster with an SRF bank, connected by an inter-cluster network and through a memory switch to the memory system]

HPCA-10NSJ10 Outline Stream processing overview Applications Implementation Results Conclusion

HPCA-10NSJ11 Stream-Level Data Reuse
Sequential streams only capture in-order reuse
Arbitrary access patterns in the SRF capture more of the available temporal locality
Stream data reuse:
– Sequential (in-order) reuse, e.g. linear streams
– Non-sequential reuse
  – Reordered reuse, e.g. 2-D, 3-D accesses, multi-grid
  – Intra-stream reuse, e.g. irregular neighborhoods, table lookups
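
An illustrative C++ sketch (not from the paper) of the three leaves of this taxonomy: in-order reuse that a sequential stream can already capture, reordered reuse such as a second pass in column-major order, and intra-stream reuse such as data-dependent table lookups.

#include <cstddef>
#include <vector>

// 'data' holds rows*cols elements; 'indices' holds valid offsets into 'data'.
void reuse_patterns(const std::vector<float>& data,
                    const std::vector<int>& indices,
                    std::size_t rows, std::size_t cols) {
    float acc = 0.0f;

    // 1. Sequential (in-order) reuse: the whole stream is read again front to back.
    for (int pass = 0; pass < 2; ++pass)
        for (float v : data) acc += v;

    // 2. Reordered reuse: the same data re-read in a different (column-major) order.
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            acc += data[r * cols + c];

    // 3. Intra-stream reuse: data-dependent lookups hit some elements many times.
    for (int idx : indices) acc += data[idx];

    (void)acc;
}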

HPCA-10NSJ12 Reordered Reuse Indexed SRF access eliminates reordering through memory

HPCA-10NSJ13 Reordered Reuse
Indexed SRF access eliminates reordering through memory
[Diagram: the new access order is produced by indexed reads from the SRF, without the reorder pass through memory shown on the previous slide]
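
A minimal sketch of the idea under the same illustrative setup as the earlier tile example (ordinary C++, not the machine's ISA): once the row-major tile is resident in the SRF, the consumer simply computes the permuted index and reads it directly, so no reorder pass through memory is needed.

#include <cstddef>
#include <vector>

// Illustrative only: 'srf' models a rows*cols tile already resident in the SRF.
// With indexed access, the column-major order is produced by address computation
// in the compute cluster rather than a store/load round trip to DRAM.
float consume_column_major(const std::vector<float>& srf,
                           std::size_t rows, std::size_t cols) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            acc += srf[r * cols + c];   // indexed SRF read, no memory traffic
    return acc;
}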

HPCA-10NSJ14 Intra-stream Reuse
Indexed SRF access eliminates
– Replication in SRF
– Redundant memory transfers
[Diagram: with sequential streams, repeatedly referenced records are replicated in the SRF and re-fetched from memory/cache before the compute clusters can use them]

HPCA-10NSJ15 Intra-stream Reuse
Indexed SRF access eliminates
– Replication in SRF
– Redundant memory transfers
[Diagram: with indexed SRF access, each record is held once in the SRF and referenced in place by the compute clusters]
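
An illustrative contrast in plain C++ (assumed names, not the paper's code): with sequential-only streams a lookup table has to be expanded in memory into one element per reference before it can be streamed, replicating hot entries; with indexed SRF access the table is loaded once and referenced in place.

#include <vector>

// Sequential-only SRF: the table is gathered in memory into a stream with one
// copy per reference, so hot entries are replicated and re-fetched from DRAM.
std::vector<float> expand_for_sequential(const std::vector<float>& table,
                                         const std::vector<int>& refs) {
    std::vector<float> expanded;
    expanded.reserve(refs.size());
    for (int r : refs) expanded.push_back(table[r]);   // done by the memory system
    return expanded;                                    // streamed into the SRF
}

// Indexed SRF: the table is loaded into the SRF once; each reference becomes an
// indexed read inside the SRF, with no replication and no extra memory traffic.
float sum_with_indexed_srf(const std::vector<float>& table_in_srf,
                           const std::vector<int>& refs) {
    float acc = 0.0f;
    for (int r : refs) acc += table_in_srf[r];
    return acc;
}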

HPCA-10NSJ16 Conditional Accesses
Fine-grain conditional accesses
– Expensive in SIMD architectures
– Translate to conditional address computation
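
A hedged sketch of this transformation (illustrative C++, not the actual compiler output): a data-dependent branch around a stream append, which is awkward when SIMD lanes disagree, becomes an unconditional write whose SRF address only advances when the predicate is true.

#include <cstddef>
#include <vector>

// Both versions assume out.size() >= in.size(); they return the number kept.

// Branchy form: hard to execute across SIMD lanes that disagree on 'keep'.
std::size_t filter_branchy(const std::vector<float>& in, std::vector<float>& out) {
    std::size_t n = 0;
    for (float v : in)
        if (v > 0.0f) { out[n] = v; ++n; }
    return n;
}

// Conditional-address form: every lane issues the write, but the SRF address
// only advances when the predicate holds, so control flow becomes address math.
std::size_t filter_conditional_address(const std::vector<float>& in,
                                       std::vector<float>& out) {
    std::size_t addr = 0;
    for (float v : in) {
        bool keep = v > 0.0f;
        out[addr] = v;             // unconditional write to the current slot
        addr += keep ? 1 : 0;      // predicate folded into the address update
    }
    return addr;
}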

HPCA-10NSJ17 Outline Stream processing overview Applications Implementation Results Conclusion

HPCA-10NSJ18 Base Architecture
Each SRF bank accesses a block of b contiguous words
[Diagram: compute clusters 0 through N-1, each paired with its SRF bank over a b*W-bit interface, connected by the inter-cluster network]
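
A small sketch of how a word address could decompose under this organization (the striping and field layout are assumptions for illustration, not taken from the paper): with N lanes and b-word blocks, consecutive blocks are striped across the banks and each bank access moves one whole block.

#include <cstddef>
#include <cstdio>

// Assumed striping for illustration: block i of b contiguous words lives in
// bank (i mod N); within a bank, blocks occupy consecutive rows.
struct SrfLocation {
    std::size_t bank;     // which SRF bank / lane
    std::size_t row;      // block index within that bank
    std::size_t offset;   // word within the b-word block
};

SrfLocation locate(std::size_t word_addr, std::size_t N, std::size_t b) {
    const std::size_t block = word_addr / b;
    return {block % N, block / N, word_addr % b};
}

int main() {
    // Example: 8 lanes, 4-word blocks; word 37 -> block 9 -> bank 1, row 1, offset 1.
    SrfLocation loc = locate(37, 8, 4);
    std::printf("bank=%zu row=%zu offset=%zu\n", loc.bank, loc.row, loc.offset);
    return 0;
}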

HPCA-10NSJ19 Indexed SRF Architecture
Address path from clusters
Lower indexed access bandwidth
[Diagram: address FIFOs added between each compute cluster and its SRF bank; the inter-cluster network is unchanged]

HPCA-10NSJ20 Base SRF Bank
Several SRAM sub-arrays
Each access is to one sub-array
[Diagram: an SRF bank built from sub-arrays 0-3 with local word-line drivers, feeding its compute cluster]

HPCA-10NSJ21 Indexed SRF Bank
Extra 8:1 mux at each sub-array output
– Allows 4x 1-word accesses
[Diagram: sub-arrays 0-3, each with its own pre-decode and row decoder and an output mux, feeding the compute cluster]
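
A rough sketch of the address split this implies (the parameters and field layout are assumptions for illustration): an indexed request selects a sub-array, a row within it, and one word out of the row via the added mux, and four independent 1-word reads can proceed in parallel when they fall in different sub-arrays.

#include <array>
#include <cstddef>

// Illustrative parameters (assumptions): 4 sub-arrays per bank, rows of 8 words,
// so the added 8:1 mux picks one word out of each row that is read.
constexpr std::size_t kSubArrays   = 4;
constexpr std::size_t kWordsPerRow = 8;

struct BankAddress {
    std::size_t sub_array;   // which SRAM sub-array services the access
    std::size_t row;         // row within that sub-array
    std::size_t word;        // 8:1 mux select at the sub-array output
};

BankAddress decode(std::size_t word_addr, std::size_t rows_per_sub_array) {
    BankAddress a;
    a.word = word_addr % kWordsPerRow;
    const std::size_t row_global = word_addr / kWordsPerRow;
    a.row = row_global % rows_per_sub_array;
    a.sub_array = (row_global / rows_per_sub_array) % kSubArrays;
    return a;
}

// Up to four 1-word accesses per cycle, provided they target distinct sub-arrays.
bool conflict_free(const std::array<BankAddress, 4>& reqs) {
    for (std::size_t i = 0; i < reqs.size(); ++i)
        for (std::size_t j = i + 1; j < reqs.size(); ++j)
            if (reqs[i].sub_array == reqs[j].sub_array) return false;
    return true;
}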

HPCA-10NSJ22 Cross-lane Indexed SRF
Address switch added
Inter-cluster network used for cross-lane SRF data
[Diagram: an SRF address network routes addresses between the lanes' address FIFOs; the existing inter-cluster network carries the cross-lane data]

HPCA-10NSJ23 Overhead - Area
In-lane indexing overheads
– 11% over sequential SRF
– Per-sub-array independent addressing
Cross-lane indexing overheads
– 22% over sequential SRF
– Address switch
1.5% to 3% increase in die area (Imagine processor)

HPCA-10NSJ24 Overhead - Energy
0.1 nJ (0.13 µm) per indexed SRF access
– ~4x a sequential SRF access
– > order of magnitude lower than a DRAM access
0.25 nJ per cache access
Each indexed access replaces many SRF and DRAM/cache accesses
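
A back-of-envelope comparison built from the numbers on this slide (the DRAM energy value is an assumed placeholder, constrained only by the "more than an order of magnitude" statement above): reordering one word through memory pays for a DRAM write and read plus two sequential SRF accesses, while reordering in the SRF pays for a single indexed access.

#include <cstdio>

// Rough per-word energy arithmetic (values in nJ). kDramAccess is an assumed
// placeholder, required only to be well above 10x the indexed SRF access; the
// other values are the ones quoted on this slide.
constexpr double kIndexedSrfAccess    = 0.10;   // indexed SRF access, 0.13 um
constexpr double kSequentialSrfAccess = 0.025;  // ~1/4 of the indexed access
constexpr double kCacheAccess         = 0.25;   // cache access
constexpr double kDramAccess          = 2.0;    // assumption, not from the paper

// Reordering a word through memory: write it out, read it back, plus the
// sequential SRF accesses on each side of the round trip.
constexpr double kReorderViaMemory = 2 * kDramAccess + 2 * kSequentialSrfAccess;

// Reordering inside the SRF: one indexed access replaces the round trip.
constexpr double kReorderViaIndexedSrf = kIndexedSrfAccess;

static_assert(kReorderViaIndexedSrf < kReorderViaMemory,
              "indexed SRF reordering should cost far less energy per word");

int main() {
    std::printf("per word: %.2f nJ via memory vs %.2f nJ via indexed SRF\n",
                kReorderViaMemory, kReorderViaIndexedSrf);
    return 0;
}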

HPCA-10NSJ25 Outline Stream processing overview Applications Implementation Results Conclusion

HPCA-10NSJ26 Benchmarks
64x64 2D FFT
– 2D accesses
Rijndael (AES)
– Table lookups
Merge-sort
– Fine-grain conditionals
5x5 convolution filter
– Regular neighborhood
Irregular graph
– Irregular neighborhood access
– Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips

HPCA-10NSJ27 Machine Organizations
[Diagrams of the three configurations compared: Base (sequential SRF) with compute clusters, SRF banks, inter-cluster network, memory switch, and DRAM; Base + Cache, which adds a cache in the memory system; and Indexed SRF, which adds the SRF address network]

HPCA-10NSJ28 Machine Parameters
Technology: 0.13 µm, 1 GHz (all configurations)
Compute: 8 compute clusters, 32 GFLOPs peak (all configurations)
SRF:
– Base and Base + cache: 128 KB, 128 GB/s sequential
– Indexed SRF: 128 KB, 128 GB/s sequential, 128 GB/s in-lane indexed, 32 GB/s cross-lane indexed
Cache (Base + cache only): 128 KB, 16 GB/s
DRAM: 9.14 GB/s (all configurations)

HPCA-10NSJ29 Off-chip Memory Bandwidth

HPCA-10NSJ30 Off-chip Memory Bandwidth

HPCA-10NSJ31 Execution Time

HPCA-10NSJ32 Outline Stream processing overview Applications Implementation Results Conclusion

HPCA-10NSJ33 Conclusions
Data parallelism increasingly important
Current data parallel architectures inefficient for some application classes
– Irregular accesses
Indexed SRF accesses
– Reduce memory traffic
– Reduce SRF data replication
– Efficiently support complex/conditional stream accesses
Performance improvements
– 3% to 410% for target application classes
Low implementation overhead
– 1.5% to 3% die area

HPCA-10NSJ34 Backups

HPCA-10NSJ35 Indexed Access Instruction Overhead Excludes address issue instructions

HPCA-10NSJ36 Kernel C API

while(!eos(in)) {
  in >> a;
  LUT[a] >> b;      // indexed stream access
  c = foo(a, b);
  out << c;
}

The indexed access compiles to 2 separate instructions:
– Address issue: LUT.index << a;
– Data read (after independent instructions): LUT >> b;
Address-data separation
– May require loop unrolling, software pipelining, etc.
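
A hedged sketch of what the separation can look like after software pipelining by one iteration (ordinary C++ rather than KernelC, with placeholder names; the real scheduling is done by the kernel compiler): the lookup address for element i+1 is issued before the compute for element i, so independent instructions sit between address issue and data read.

#include <cstddef>
#include <vector>

float foo(int a, float b) { return static_cast<float>(a) * b; }   // placeholder kernel body

// Software-pipelined analogue of the slide's loop: the lookup address for the
// next element is issued one iteration early, so the compute for the current
// element separates address issue from the data read and hides its latency.
void kernel(const std::vector<int>& in, const std::vector<float>& lut,
            std::vector<float>& out) {
    if (in.empty()) return;
    int addr = in[0];                               // prologue: issue first address
    for (std::size_t i = 0; i < in.size(); ++i) {
        float b = lut[addr];                        // data read (address issued earlier)
        if (i + 1 < in.size()) addr = in[i + 1];    // issue address for element i+1
        out[i] = foo(in[i], b);                     // independent compute overlaps latency
    }
}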

HPCA-10NSJ37 Sensitivity to SRF Access Latency (1)

HPCA-10NSJ38 Sensitivity to SRF Access Latency (2)

HPCA-10NSJ39 Why Graphics Hardware?
Pentium 4 SSE theoretical*: 3 GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS
GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0: 20 GFLOPS
– equivalent to a 10 GHz P4
– and getting faster: 3x improvement over NV30 (6 months)
*from Intel P4 Optimization Manual
[Chart: GFLOPS over time for Pentium 4, NV30, and NV35]
Slide from Ian Buck, Stanford University

HPCA-10NSJ40 NVIDIA Graphics growth (225%/yr)
Essentially Moore's Law Cubed.

Season | Product   | Process | # Trans | Gflops | 32-bit AA Fill | Mpolys  | Notes
2H97   | Riva      | ?       | ?       | 5      | 20M            | 3M      | Integrated 2D/3D
1H98   | Riva ZX   | .25     | 5M      | 7      | 31M            | 3M      | AGP2x
2H98   | Riva TNT  | .25     | 7M      | 10     | 50M            | 6M      | 32-bit
1H99   | TNT2      | .22     | 9M      | 15     | 75M            | 9M      | AGP4x
2H99   | GeForce   | .22     | 23M     | 25     | 120M           | 15M     | HW T&L
1H00   | GF2 GTS   | .18     | 25M     | 35     | 200M (1)       | 25M     | Per-Pixel Shading
2H00   | GF2 Ultra | .18     | 25M     | 45     | 250M (1)       | 31M     | 230 MHz DDR
1H01   | GeForce3  | .15     | 57M     | 80     | 500M (1)       | 30M (2) | Programmable

(1) Dual textured  (2) Programmable
Slide from Pat Hanrahan, Kurt Akeley

HPCA-10NSJ41 NVIDIA Historicals
[Table: Season, Product, triangle rate (MT/s), fill rate (MF/s), and year-over-year growth rates for the Riva through GeForce generations, 2H97 to 1H02; the numeric values did not survive transcription]
Slide from Pat Hanrahan, Kurt Akeley

HPCA-10NSJ42 Base Architecture
Stream buffers match SRF bandwidth to compute needs
[Diagram: in each of the 8 lanes, a 128b interface connects the SRF bank to stream buffers and a 32b interface connects the stream buffers to the compute cluster; lanes are joined by the inter-cluster network]

HPCA-10NSJ43 Indexed SRF Architecture
Address path from clusters
Lower indexed access bandwidth
[Diagram: same organization as the previous slide, with address FIFOs added between each compute cluster and its SRF bank]

HPCA-10NSJ44 Base SRF Bank
Several SRAM sub-arrays
[Diagram: an SRF bank built from sub-arrays 0-3 with local word-line drivers, feeding its compute cluster]

HPCA-10NSJ45 Indexed SRF Bank
Extra 8:1 mux at each sub-array output
– Allows 4x 1-word accesses
[Diagram: sub-arrays 0-3, each with its own pre-decode and row decoder and an 8:1 output mux, feeding the compute cluster]

HPCA-10NSJ46 Cross-lane Indexed SRF
Address switch added
Inter-cluster network used for cross-lane SRF data
[Diagram: an SRF address network routes addresses between the lanes' address FIFOs; cross-lane data travels over the existing inter-cluster network]