A Massively Parallel, Hybrid Dataflow/von Neumann Architecture. Yoav Etsion, November 11, 2015.


1 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture, Yoav Etsion, November 11, 2015

2 Massively Parallel Computing
CUDA/OpenCL are gaining traction in high-performance computing (HPC)
  – Same code; different data
GPUs deliver better FLOPS per Watt
  – Available in mobile systems and supercomputers
But… GPGPUs still suffer from von Neumann inefficiencies

3 von Neumann inefficiencies
Fetch/decode/issue each instruction
  – Even though most instructions come from loops
Explicit storage needed for communicating values between instructions
  – Register file; stack
  – Data travels between execution units and storage

Component           Power [%]
Inst. fetch         33%
Pipeline registers  22%
Data cache          19%
Register file       10%
Control             (value missing in transcript)
ALU                 6%
[Understanding Sources of Inefficiency in General-Purpose Chips, Hameed et al., ISCA’10]

4 Quantifying inefficiencies: instruction pipeline
Every instruction is fetched, decoded and issued: very wasteful
Most of the execution time is spent in (tight) loops
Avg. pipeline power consumption:
  – NVIDIA Tesla: >10% of processor power [Hong and Kim, ISCA’10]
  – NVIDIA Fermi: ~15% of processor power [Leng et al., ISCA’13]

5 Quantifying inefficiencies: register file
Communication via bulletin board
  – 40% of values are read only once [Gebhart et al., ISCA’11]
Avg. register file power consumption:
  – NVIDIA Tesla: 5-10% of processor power [Hong and Kim, ISCA’10]
  – NVIDIA Fermi: >15% of processor power [Leng et al., ISCA’13]

6 Alternatives to von Neumann: dataflow/spatial computing
Processor is a grid of functional units
Computation graph is mapped to the grid
  – Statically, at compile time
No energy wasted on pipeline
  – Instructions are statically mapped to nodes
No energy wasted on RF and data transfers
  – No centralized register file needed
  – Save static power and area (128KB on Fermi)

7 Spatial/Dataflow Computing
  int temp1 = a[threadId] * b[threadId];
  int temp2 = 5 * temp1;
  if (temp2 > 255) {
      temp2 = temp2 >> 3;
      result[threadId] = temp2;
  } else
      result[threadId] = temp2;
[Figure: dataflow graph of the code above. Node labels: entry, threadIdx, a, b, S_LOAD1, S_LOAD2, IMM_5, ALU1_mul, ALU2_mul, JOIN1, IMM_3, ALU4_ashl, ALU3_icmp, IMM_256, if_then, if_else, S_STORE3, S_STORE4, result]
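To make the "code as a graph" idea concrete, here is a minimal Python sketch that encodes the kernel above as a table of nodes and pushes one thread at a time through it. The node names loosely follow the slide's figure labels, but the encoding, the `build_graph`/`run_flow` helpers, and the use of a single `select` node in place of the slide's if_then/if_else paths with two stores are all simplifying assumptions, not the SGMF design.

```python
def build_graph(a, b):
    """Return {node: (operation, producer names)} in topological order.

    Hypothetical encoding of the slide's kernel; immediates are folded
    into the operations instead of being separate IMM_* nodes.
    """
    return {
        "threadIdx": (lambda tid: tid, ["entry"]),
        "S_LOAD1": (lambda tid: a[tid], ["threadIdx"]),
        "S_LOAD2": (lambda tid: b[tid], ["threadIdx"]),
        "ALU1_mul": (lambda x, y: x * y, ["S_LOAD1", "S_LOAD2"]),
        "ALU2_mul": (lambda x: 5 * x, ["ALU1_mul"]),
        "ALU3_icmp": (lambda x: x > 255, ["ALU2_mul"]),
        "ALU4_shr": (lambda x: x >> 3, ["ALU2_mul"]),
        # Assumption: one select node stands in for the two predicated stores.
        "select": (lambda c, s, x: s if c else x,
                   ["ALU3_icmp", "ALU4_shr", "ALU2_mul"]),
    }


def run_flow(graph, tid):
    """Evaluate the graph for one thread; each dict entry is an edge value."""
    values = {"entry": tid}
    for name, (op, inputs) in graph.items():  # dicts keep insertion order
        values[name] = op(*(values[i] for i in inputs))
    return values["select"]


a = [10, 100, 1]
b = [2, 3, 4]
graph = build_graph(a, b)
result = [run_flow(graph, tid) for tid in range(len(a))]
```

No register file appears in the sketch: every intermediate value lives on the edge between its producer and its consumers, which is the property the slides argue saves energy.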

8 SGMF: A Massively Multithreaded Dataflow Architecture
Every thread is a flow through the dataflow graph
Many threads execute (flow) in parallel

9 Execution Overview: Dynamic Dataflow
Each flow/thread is associated with a token
Execute the operation when tokens match
Parallelism is determined by the number of tokens in the system
[Figure labels: OoO LD/ST units; token matching]
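The firing rule "execute the operation when tokens match" can be sketched as a two-input node that buffers the first operand of a thread and fires when the second operand carrying the same thread ID arrives. The class name `MatchingNode`, the port numbering, and the unbounded dictionary used as a token buffer are illustrative assumptions; the slides do not describe the hardware matching logic at this level.

```python
class MatchingNode:
    """Two-input dataflow node that fires when both tokens of a thread arrived."""

    def __init__(self, op):
        self.op = op
        self.waiting = {}  # thread id -> (port, value) of the first token seen

    def receive(self, tid, port, value):
        """Deliver one token on port 0 or 1.

        Returns (tid, result) if the node fires, otherwise None.
        """
        first = self.waiting.get(tid)
        if first is None:
            self.waiting[tid] = (port, value)  # no partner yet: buffer it
            return None
        if first[0] == port:
            raise ValueError("two tokens for the same port and thread")
        del self.waiting[tid]                  # partner found: free the slot
        operands = {port: value, first[0]: first[1]}
        return tid, self.op(operands[0], operands[1])


node = MatchingNode(lambda x, y: x - y)
fired = [
    node.receive(7, 0, 10),  # thread 7, left operand: buffered
    node.receive(3, 1, 4),   # thread 3, right operand: buffered
    node.receive(7, 1, 3),   # thread 7 complete: fires
    node.receive(3, 0, 9),   # thread 3 complete: fires
]
```

Threads 7 and 3 interleave freely here; the number of entries in `waiting` is the number of partially matched tokens the node is holding for in-flight threads.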

10 DESIGN ISSUES
A Massively Multithreaded Dataflow Processor

11 Multithreading Design Issues: Preventing Deadlocks
Imbalanced out-of-order memory responses may trigger deadlocks
  – Deadlock due to limited buffer space
Solution: load-store units limit bypassing to the size of the token buffer
[Figure label: OoO LD/ST units]
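One way to read "limit bypassing to the size of the token buffer" is as a sliding window over thread IDs: a memory response may overtake older threads only while it stays within `window` IDs of the oldest thread that has not been forwarded yet. The sketch below models that reading. The class `ThrottledLoadUnit`, the assumption that threads are numbered in issue order starting at 0, and the choice of `window` equal to the downstream token-buffer capacity are all hypothetical; the slides do not spell out the actual mechanism.

```python
class ThrottledLoadUnit:
    """Forwards out-of-order memory responses, but only within a bounded window."""

    def __init__(self, window):
        self.window = window  # assumed equal to the downstream token-buffer size
        self.oldest = 0       # oldest thread id not yet forwarded
        self.held = {}        # responses that arrived too far ahead
        self.done = set()     # forwarded thread ids at or above self.oldest

    def on_response(self, tid, value):
        """Accept one response; return the list of (tid, value) forwarded now."""
        self.held[tid] = value
        out = []
        while True:
            limit = self.oldest + self.window
            ready = [t for t in sorted(self.held) if t < limit]
            if not ready:
                return out
            for t in ready:
                out.append((t, self.held.pop(t)))
                self.done.add(t)
            while self.oldest in self.done:  # slide the window forward
                self.done.remove(self.oldest)
                self.oldest += 1


unit = ThrottledLoadUnit(window=2)
trace = [
    unit.on_response(3, "r3"),  # too far ahead of thread 0: held back
    unit.on_response(1, "r1"),  # within the window: may bypass thread 0
    unit.on_response(0, "r0"),  # oldest arrives, window slides, r3 released
    unit.on_response(2, "r2"),
]
```

Under this model a downstream join never sees tokens from more than `window` distinct threads at once, so a slot for the oldest thread's token is always available.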

12 Design issues: Variable path lengths
Short paths must wait for long paths
Solution: equalize paths’ lengths
[Figure labels: a, b, c; x, x, +, +, x; Bubble]

13 Design issues: Variable path lengths
Solution: inject buffers to equalize path lengths
Done in two phases:
  – Before mapping & NoC configuration: all routes between every two connected nodes U and V are equalized by inserting buffers
  – After mapping & NoC configuration: path lengths may have changed, so the buffer lengths need recalibration
[Figure labels: +, +, *, *, B, B, a, b, c, x; legend: B = Buffer]
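The first phase (equalizing routes in the graph before mapping) can be sketched as a longest-path computation: give every node a level equal to its longest distance from an input, then pad each edge with as many buffers as it has slack. The function `balance_paths`, the unit-latency assumption for every node, and the small example graph are illustrative, not taken from the slides; the second phase, which depends on the placed-and-routed NoC, is not modeled.

```python
def balance_paths(edges):
    """Return {(producer, consumer): buffers} for a DAG given as edge pairs.

    Assumes every node takes one cycle, so a consumer at level L expects
    all of its inputs to have been produced at level L - 1.
    """
    nodes = {n for edge in edges for n in edge}
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)

    depth = {}

    def level(n):
        # Longest path from any input; inputs themselves sit at level 0.
        if n not in depth:
            depth[n] = 1 + max((level(p) for p in preds[n]), default=-1)
        return depth[n]

    return {(u, v): level(v) - level(u) - 1 for u, v in edges}


# Hypothetical graph computing (a * b) + c: input c reaches the adder one
# stage earlier than the product, so its edge needs one buffer.
edges = [("a", "mul"), ("b", "mul"), ("mul", "add"), ("c", "add")]
buffers = balance_paths(edges)
```

With the buffer in place, tokens of the same thread arrive at the adder in the same cycle, so the short path no longer stalls waiting for the long one.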

14 ARCHITECTURE
A Massively Multithreaded Dataflow Processor

15 Architecture overview
Heterogeneous grid of tiles
  1. Compute tiles: very similar to CUDA cores
  2. LD/ST tiles: buffer and throttle data
  3. Control tiles: pipeline buffering and join ops.
  4. Special tiles: deal with non-pipelined operations
Reference point:
  – A single grid is the equivalent of a single NVIDIA Streaming Multiprocessor (SM)
  – Total buffering capacity in SGMF is less than 30% of that of an NVIDIA Fermi register file

16 Architecture overview

17 Interconnect
Switches are connected using a folded cube [Properties and performance of folded hypercubes, El-Amawy et al., IEEE TPDS 1991]
  – 8 “almost-NN”
Static switching
  – Determined at compile time
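For readers unfamiliar with the cited topology, the sketch below lists the neighbors of a node in a generic folded hypercube: the usual hypercube links (flip one address bit) plus one extra link to the bitwise complement. This is the textbook definition only; how SGMF lays its switches out on the tile grid, and what "8 almost-NN" refers to, is not specified in the slides, and the function name is an assumption.

```python
def folded_hypercube_neighbors(node, dims):
    """Neighbors of `node` in a folded hypercube with 2**dims nodes (dims >= 2)."""
    mask = (1 << dims) - 1
    neighbors = [node ^ (1 << k) for k in range(dims)]  # ordinary hypercube links
    neighbors.append(node ^ mask)                       # link to the complement node
    return neighbors


# 3-dimensional example: 8 nodes, each with 3 + 1 = 4 links.
links_of_0 = folded_hypercube_neighbors(0, 3)
links_of_5 = folded_hypercube_neighbors(5, 3)
```

The extra complement link is what shortens the worst-case distance relative to a plain hypercube, at the cost of one more port per switch.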

18 EVALUATION
A Massively Multithreaded Dataflow Processor

19 Methodology
The main HW blocks were implemented in Verilog
Synthesized to a 65nm process
  – Validate timing and connectivity
  – Estimate area and power consumption
  – One SGMF core synthesized in a 65nm process occupies 54.3 mm²
  – When scaled down to 40nm, each SGMF core would occupy 21.18 mm²
  – The NVIDIA Fermi GTX480 card (40nm) occupies 529 mm²
Cycle-accurate simulations based on GPGPUSim
  – We integrated synthesis results into the GPGPUSim/Wattch power model
Benchmarks from the Rodinia suite
  – CUDA kernels, compiled for SGMF
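As a rough sanity check on the area figures, the snippet below compares the slide's 40nm number with ideal quadratic scaling (area proportional to the square of the feature size) and computes the ratio to the quoted GTX480 area. Ideal scaling is an assumption made here for illustration; the slides do not say how the 21.18 mm² figure was derived.

```python
AREA_65NM = 54.3         # mm^2, one SGMF core at 65nm (from the slide)
AREA_40NM_SLIDE = 21.18  # mm^2, scaled figure quoted on the slide
FERMI_AREA = 529.0       # mm^2, GTX480 at 40nm (from the slide)

# Assumption: area shrinks with the square of the feature size.
ideal_40nm = AREA_65NM * (40 / 65) ** 2

# How many SGMF-core areas fit in the quoted Fermi area.
area_ratio = FERMI_AREA / AREA_40NM_SLIDE

print(f"ideal 40nm area: {ideal_40nm:.2f} mm^2")
print(f"Fermi area / SGMF core area: {area_ratio:.1f}")
```

The quoted 21.18 mm² is slightly larger than the ideal-scaling estimate, so the slide's figure is the more conservative of the two. The ratio sets one core against a whole chip, so it is an area bound rather than a like-for-like comparison.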

20 Single core system: SGMF vs. Fermi – Performance

21 Single core system: Energy savings

22 16-core system: SGMF vs. Fermi – Performance

23 16-core system: Energy savings

24 Conclusions
von Neumann engines have inherent inefficiencies
  – Throughput computing can benefit from dataflow/spatial computing
SGMF can potentially achieve much better performance/power than current GPGPUs
  – Almost 2x speedup (average) and 50% energy saving
  – Need to tune the memory system
Greatly motivates further research
  – Compilation, place & route, connectivity, …

25 Thank you! Questions?

