A Massively Parallel, Hybrid Dataflow/von Neumann Architecture. Yoav Etsion, November 11, 2015.

Presentation transcript:

1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion

2 Massively Parallel Computing
CUDA/OpenCL are gaining traction in high-performance computing (HPC)
– Same code; different data
GPUs deliver better FLOPS per watt
– Available in mobile systems and supercomputers
But… GPGPUs still suffer from von Neumann inefficiencies
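As a concrete reminder of the "same code; different data" model, here is a minimal CUDA kernel (hypothetical, not from the talk): every thread executes the identical instruction stream on its own data element.

    // Minimal SPMD example: one thread per array element.
    __global__ void scale(const int *a, int *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
        if (tid < n)
            out[tid] = 5 * a[tid];  // same code, different data
    }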

3 November 11, 2015 von Neumann inefficiencies
Fetch/decode/issue each instruction
– Even though most instructions come from loops
Explicit storage is needed for communicating values between instructions
– Register file; stack
– Data travels between execution units and storage
Power breakdown [Understanding Sources of Inefficiency in General-Purpose Chips, Hameed et al., ISCA'10]:
– Inst. fetch: 33%
– Pipeline registers: 22%
– Data cache: 19%
– Register file: 10%
– Control: 6%
– ALU: …

4 November 11, 2015 Quantifying inefficiencies: instruction pipeline
Every instruction is fetched, decoded and issued – very wasteful
Most of the execution time is spent in (tight) loops
Avg. pipeline power consumption:
– NVIDIA Tesla: >10% of processor power [Hong and Kim, ISCA'10]
– NVIDIA Fermi: ~15% of processor power [Leng et al., ISCA'13]

5 November 11, 2015 Quantifying inefficiencies: register file
The register file acts as a bulletin board for inter-instruction communication
– 40% of values are read only once [Gebhart et al., ISCA'11]
Avg. register file power consumption:
– NVIDIA Tesla: 5-10% of processor power [Hong and Kim, ISCA'10]
– NVIDIA Fermi: >15% of processor power [Leng et al., ISCA'13]

6 November 11, 2015 Alternatives to von Neumann: dataflow/spatial computing
The processor is a grid of functional units
The computation graph is mapped onto the grid
– Statically, at compile time
No energy wasted on the pipeline
– Instructions are statically mapped to nodes
No energy wasted on the register file and data transfers
– No centralized register file needed
– Saves static power and area (the register file is 128KB on Fermi)

7 November 11, 2015 Spatial/Dataflow Computing
    int temp1 = a[threadId] * b[threadId];
    int temp2 = 5 * temp1;
    if (temp2 > 255) {
        temp2 = temp2 >> 3;
        result[threadId] = temp2;
    } else
        result[threadId] = temp2;
[Figure: the kernel's dataflow graph – entry, threadIdx, a and b feed S_LOAD1/S_LOAD2 and ALU1_mul; IMM_5 feeds ALU2_mul; IMM_256 feeds ALU3_icmp; IMM_3 feeds ALU4_ashl; if_then/if_else and JOIN1 steer the value to S_STORE3/S_STORE4 and result]
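For reference, a sketch completing the slide's fragment into a compilable CUDA kernel (the kernel name, signature, and index computation are assumptions, not from the talk); comments note which graph node each statement maps to.

    // Sketch only: the slide's fragment under assumed types and names.
    __global__ void saturate_scale(const int *a, const int *b, int *result) {
        int threadId = blockIdx.x * blockDim.x + threadIdx.x;
        int temp1 = a[threadId] * b[threadId];  // S_LOAD1/S_LOAD2 -> ALU1_mul
        int temp2 = 5 * temp1;                  // IMM_5 -> ALU2_mul
        if (temp2 > 255) {                      // IMM_256 -> ALU3_icmp
            temp2 = temp2 >> 3;                 // IMM_3 -> ALU4_ashl (if_then)
            result[threadId] = temp2;           // S_STORE on the taken path
        } else {
            result[threadId] = temp2;           // S_STORE on the other path
        }
    }

On a von Neumann GPU every instruction of this kernel is fetched and decoded per warp; on the dataflow grid each statement becomes a node and values flow directly between nodes.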

8 November 11, 2015 SGMF: A Massively Multithreaded Dataflow Architecture
Every thread is a flow through the dataflow graph
Many threads execute (flow) in parallel

9 November 11, 2015 Execution Overview: Dynamic Dataflow
Each flow/thread is associated with a token
An operation executes when its input tokens match
Parallelism is determined by the number of tokens in the system
[Figure: out-of-order LD/ST units feeding the token-matching logic]
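A small software sketch of the tagged-token matching idea (illustrative only; the hardware uses fixed-capacity token buffers rather than a hash map): a two-input node buffers the first operand that arrives for a given thread and fires when its mate arrives.

    #include <unordered_map>

    // Token carried by each flow (thread) through the graph.
    struct Token { int threadId; int value; };

    // A two-input node fires only when both operands tagged with the
    // same threadId are present.
    struct TwoInputNode {
        std::unordered_map<int, int> tokenBuffer;  // threadId -> waiting operand

        // Returns true and sets result when the arriving token finds its mate.
        bool accept(const Token &t, int &result) {
            auto it = tokenBuffer.find(t.threadId);
            if (it == tokenBuffer.end()) {
                tokenBuffer[t.threadId] = t.value;  // first operand: wait
                return false;
            }
            result = it->second * t.value;  // matched: fire (a multiply here)
            tokenBuffer.erase(it);          // free the slot for another thread
            return true;
        }
    };

The number of entries such a buffer can hold bounds how many threads can be in flight at once, which is the parallelism knob the slide describes.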

10 November 11, 2015 DESIGN ISSUES A Massively Multithreaded Dataflow Processor

11 November 11, 2015 Multithreading Design Issues: Preventing Deadlocks
Imbalanced out-of-order memory responses may trigger deadlocks due to limited buffer space
[Figure: out-of-order LD/ST responses filling a token buffer and deadlocking]
Solution: load/store units limit bypassing to the size of the token buffer
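A minimal sketch of that throttling rule (a simplification, not the actual RTL): a younger memory response may bypass older pending ones only while a token-buffer slot is still guaranteed for the oldest outstanding token.

    // Illustrative bypass throttle for a load/store unit (assumed model).
    struct LoadStoreUnit {
        int tokenBufferSize;   // slots in the consumer's token buffer
        int bypassedInFlight;  // younger responses already forwarded early

        // May this memory response be forwarded to the grid now?
        bool mayForward(bool isOldestPending) const {
            if (isOldestPending)
                return true;   // the oldest response is always safe to send
            // Younger responses bypass only while a slot remains reserved
            // for the oldest outstanding token.
            return bypassedInFlight < tokenBufferSize - 1;
        }
    };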

12 November 11, 2015 Design issues: Variable path lengths
Short paths must wait for long paths, creating pipeline bubbles
[Figure: a multiply/add dataflow graph in which the short path stalls with a bubble]
Solution: equalize path lengths

13 November 11, 2015 Design issues: Variable path lengths
Solution: inject buffers to equalize path lengths (see the sketch below)
Done in two phases:
– Before mapping & NoC configuration: all routes between any two connected nodes U and V are equalized by inserting buffers
– After mapping & NoC configuration: path lengths may have changed, so the buffer depths are recalibrated
[Figure: a buffer (B) inserted on the short path of the example graph]
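A sketch of the first phase under simple assumptions (unit-latency nodes and a DAG already in topological order; the helper below is hypothetical, not the actual tool flow): compute every node's longest-path depth from the entry, then give each edge enough buffers to absorb its slack.

    #include <vector>
    #include <algorithm>

    struct Edge { int from, to; };

    // Buffers to insert on each edge so all paths into a node have
    // equal length (unit-latency nodes assumed).
    std::vector<int> buffersPerEdge(int nodeCount,
                                    const std::vector<Edge> &edges,
                                    const std::vector<int> &topoOrder) {
        std::vector<int> depth(nodeCount, 0);  // longest path from entry
        for (int u : topoOrder)
            for (const Edge &e : edges)
                if (e.from == u)
                    depth[e.to] = std::max(depth[e.to], depth[u] + 1);

        std::vector<int> buffers(edges.size());
        for (size_t i = 0; i < edges.size(); ++i)  // slack = extra latency needed
            buffers[i] = depth[edges[i].to] - depth[edges[i].from] - 1;
        return buffers;
    }

The second phase would rerun the same calculation with post-route hop counts in place of unit latencies.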

14 November 11, 2015 ARCHITECTURE A Massively Multithreaded Dataflow Processor

15 November 11, 2015 Architecture overview
Heterogeneous grid of tiles:
1. Compute tiles: very similar to CUDA cores
2. LD/ST tiles: buffer and throttle data
3. Control tiles: pipeline buffering and join operations
4. Special tiles: deal with non-pipelined operations
Reference point:
– A single grid is the equivalent of a single NVIDIA Streaming Multiprocessor (SM)
– Total buffering capacity in SGMF is less than 30% of that of an NVIDIA Fermi register file
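To keep the taxonomy handy, a trivial sketch of the four tile kinds as an enumeration (the identifier names are illustrative, not from the talk):

    // Tile kinds in the heterogeneous SGMF grid (illustrative names).
    enum class TileKind {
        Compute,    // ALU tiles, very similar to CUDA cores
        LoadStore,  // buffer and throttle memory traffic
        Control,    // pipeline buffering and join operations
        Special     // non-pipelined operations
    };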

16 November 11, 2015 Architecture overview

17 November 11, 2015 Interconnect
Switches are connected using a folded hypercube [Properties and Performance of Folded Hypercubes, El-Amawy et al., IEEE TPDS 1991]
8 "almost-NN" (near nearest-neighbor) links
Static switching, determined at compile time
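For readers unfamiliar with the topology, a small sketch of folded-hypercube neighbor computation (illustrative only; how the slide's eight links map onto hypercube dimensions is not spelled out here): in a d-dimensional folded hypercube, each switch links to the d switches that differ in one address bit, plus the one that differs in all bits.

    #include <vector>

    // Neighbors of a switch in a d-dimensional folded hypercube.
    std::vector<unsigned> neighbors(unsigned node, unsigned d) {
        std::vector<unsigned> out;
        for (unsigned k = 0; k < d; ++k)
            out.push_back(node ^ (1u << k));    // flip one address bit
        out.push_back(node ^ ((1u << d) - 1));  // complement (folding) link
        return out;
    }

The complement link roughly halves the network diameter relative to a plain hypercube, which is what keeps most hops "almost nearest-neighbor".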

18 November 11, 2015 EVALUATION A Massively Multithreaded Dataflow Processor

19 November 11, 2015 Methodology
The main HW blocks were implemented in Verilog and synthesized for a 65nm process
– Validates timing and connectivity
– Provides area and power estimates
– One SGMF core synthesized at 65nm occupies 54.3mm²
– Scaled down to 40nm, each SGMF core would occupy 21.18mm²
– For reference, an NVIDIA Fermi GTX480 card (40nm) occupies 529mm²
Cycle-accurate simulations based on GPGPU-Sim
– We integrated the synthesis results into the GPGPU-Sim/Wattch power model
Benchmarks from the Rodinia suite
– CUDA kernels, compiled for SGMF
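As a sanity check on the quoted areas (this arithmetic is ours, not the slide's): ideal quadratic scaling of linear feature size from 65nm to 40nm gives

    A_{40\,\mathrm{nm}} \approx A_{65\,\mathrm{nm}} \times \left(\frac{40}{65}\right)^{2} = 54.3\,\mathrm{mm}^{2} \times 0.379 \approx 20.6\,\mathrm{mm}^{2}

which is in line with the quoted 21.18mm²; the small difference presumably reflects the specific process scaling factors used.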

20 November 11, 2015 Single-core system: SGMF vs. Fermi – Performance

21 November 11, 2015 Single-core system: Energy savings

22 November 11, 2015 Multi-core system: SGMF vs. Fermi – Performance

23 November 11, 2015 Multi-core system: Energy savings

24 November 11, 2015 Conclusions
von Neumann engines have inherent inefficiencies
– Throughput computing can benefit from dataflow/spatial computing
SGMF can potentially achieve much better performance/power than current GPGPUs
– Almost 2× speedup (average) and 50% energy savings
– The memory system still needs tuning
Greatly motivates further research
– Compilation, place & route, connectivity, …

25 November 11, 2015 Thank you! Questions?