The MachSuite Benchmark


1 The MachSuite Benchmark
Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks

2 Who Cares about Accelerators
Please do not distribute 4/22/2017
Architecture. Cause: transistor scaling. Effect: specialization & SoCs.
Notes: Cool tool.. Actually work!! GYW

3 Who Cares about Accelerators
Architecture. Cause: transistor scaling. Effect: specialization & SoCs.
CAD. Cause: RTL design costs. Effect: C-to-RTL tools.
Notes: Cool tool.. Actually work!!

4 Who Cares about Accelerators
Architecture. Cause: transistor scaling. Effect: specialization & SoCs.
CAD. Cause: RTL design costs. Effect: C-to-RTL tools.
ASICs. Cause: performance needs. Effect: build tuned ICs.
Notes: Keep doing what they do (H.265, speech recognition).

5 What’s Next
Architecture: system integration, composability, flexibility.
CAD. Cause: RTL design costs. Effect: C-to-RTL tools.
ASICs. Cause: performance needs. Effect: build tuned ICs.
Notes: Keep doing what they do (H.265, speech recognition).

6 What’s Next
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs. Cause: performance needs. Effect: build tuned ICs.
Notes: Keep doing what they do (H.265, speech recognition).

7 What’s Next
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs: not much change, need high-performance ICs, H.266.
Notes: Keep doing what they do (H.265, speech recognition).

8 What’s Missing
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs: not much change, need high-performance ICs, H.266.
Well-defined specs
Notes: Keep doing what they do (H.265, speech recognition).

9 What’s Missing
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger app space, complex designs.
ASICs: not much change, need high-performance ICs, H.266.
Workload definition, common baseline
Well-defined specs
Notes: Keep doing what they do (H.265, speech recognition).

10 Tower of Babel Effect
Big problem.
Intro: number of benchmarks that occur in 25 recent architecture and CAD papers.
FFT: of 25 papers, only 1 used across all 8 (come back later).
Problem: 64 benchmarks were used only ONCE.
We want general mechanisms/solutions, so we need standards to measure contributions.

11 MachSuite is/has
19 application-specific accelerator workloads
HLS and Aladdin compatible
Workloads researchers are using today
Diverse workloads for app space coverage
Establishes standards without stifling creativity

12 Outline
Why MachSuite
Existing benchmarks are not applicable/sufficient
Works with accelerator simulators and CAD tools
Representative applications covering a wide space
MachSuite Design
Kernel selection
Algorithm choice
Implementation details

13 Why MachSuite: Comparing benchmarks

14 Existing Benchmarks are Insufficient
High-Level Synthesis is good at:
Crypto { AES, DES, SHA }
Image/Multimedia { Stencils, JPEG, SAD }
Scientific Codes { GEMM, FFT }
3 of 13 Berkeley Dwarves [CHStone, ISCAS]

15 Existing Benchmarks are Insufficient
High-Level Synthesis is good at:
Crypto { AES, DES, SHA }
Image/Multimedia { Stencils, JPEG, SAD }
Scientific Codes { GEMM, FFT }
3 of 13 Berkeley Dwarves [CHStone, ISCAS]
High-Level Synthesis needs improvement:
Irregular Behavior { BFS, SPMV CRS }
Complex App Codes { BackProp, MD }
Application Space Coverage
12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC]

16 Existing Benchmarks not Applicable
Many existing GPU benchmarks: Rodinia, Parboil, SHOC..
GPU and accelerator design spaces differ
Tuned for GPU architecture
Implemented in CUDA/OpenCL
GPU workloads are a subset of accelerator workloads

17 Why MachSuite: Simulator/HLS friendly

18 Works with Accelerator CAD Tools
Flow: C code + directives -> High-Level Synthesis (Vivado HLS) -> RTL (hardware description language)
Directives: functional units, resource sharing, loop pipelining, memory bandwidth
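
As a rough illustration of how directives annotate C code for an HLS tool, the sketch below applies Vivado HLS pragmas for loop pipelining and cyclic array partitioning to a hypothetical element-wise add kernel. The kernel, its name `vec_add`, the size `N`, and the partition factor of 8 are assumptions for illustration and are not taken from MachSuite. Directives can also be supplied separately from the source, for example in a Tcl script. A standard C compiler ignores the pragmas, so the function still runs as plain C.

```c
#include <assert.h>

#define N 64 /* hypothetical problem size, chosen for illustration */

/* Hypothetical kernel: element-wise add of two arrays.
 * The pragmas are hints to the HLS tool and do not change the C semantics. */
void vec_add(const int a[N], const int b[N], int out[N])
{
/* Memory bandwidth: split each array cyclically across 8 banks (assumed factor). */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=out cyclic factor=8 dim=1
    for (int i = 0; i < N; i++) {
/* Loop pipelining: ask for a new iteration every cycle (target II of 1). */
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}

/* Small self-check that runs the kernel as ordinary C. */
void vec_add_example(void)
{
    int a[N], b[N], out[N];
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }
    vec_add(a, b, out);
    for (int i = 0; i < N; i++) {
        assert(out[i] == a[i] + b[i]);
    }
}
```

The tool reports whether the requested II was met; the achieved II can be higher than the target when memory ports or dependences limit the schedule.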

19 Works with Simulators
MachSuite

20 Works with Simulators
MachSuite
Directives: functional unit selection, loop pipelining, memory bandwidth
Trade-off: power/performance

21 Why MachSuite: Workload diversity and coverage

22 Incorporates Applications of Interest

23 Covers Application Space
FFT, GEMM, STENCIL
12 of 13 Dwarves

24 Outline
Why MachSuite
Existing benchmarks are not applicable/sufficient
Works with accelerator simulators and CAD tools
Representative applications covering a wide space
MachSuite Design
Kernel selection
Algorithm choice
Implementation details

25 MachSuite Design: Kernel selection

26 Kernel Selection
Kernel = a specific problem
E.g.: SORT

27 Kernel Selection
Kernel = a specific problem
E.g.: SORT
The problem:
Not everyone is using the same kernels
Comparing similar-sounding kernels doesn’t work
Let’s just pick one

28 MachSuite Design: Algorithm choice

29 Algorithm Choice
Algorithm = a specific solution
A type of kernel
E.g.: Merge or Radix SORT

30 Algorithm Choice
Algorithm = a specific solution
A type of kernel
E.g.: Merge or Radix SORT
The problem:
Reporting only the kernel is too high level
Ideal algorithms differ across SoCs
Standardization without limitation

31 MachSuite Design: Implementation details

32 Implementation Details
Implementation = specific code for an algorithm
E.g.: Stencil in Rodinia vs Parboil

33 Implementation Details
Implementation = specific code for an algorithm
E.g.: Stencil in Rodinia vs Parboil
The problem:
Can cause misleading results
Performance depends on tuning
Separate signal from noise

34 Performance Variance due to Implementation Details
1 Kernel, 1 Algorithm, 1 Implementation
Notes: Shows the space of possible hardware designs. This is a subset; there are THOUSANDS.

35 Performance Variance due to Implementation Details
1 Kernel, 1 Algorithm, 2 Implementations
~10x performance, same power
Notes: So we need to pick one [MS gives you], or at least have a way to talk about changes [MS gives you] with marked-up code. It’s like a simulator: if you change something, you should report your configuration in your evaluation section. [details in tutorial]

36 Root Causing Inefficiency
Same directives:
- Single-port SRAMs
- 8-way partition
- Same loops pipelined
Different implementations for parallel SCAN

37 What Happened
“Unoptimized C Code” pipelining result: Target II: 1, Final II: 30
“Optimized C Code” pipelining result: Target II: 1, Final II: 8
3.75x
Notes: Pareto points from design space search. Each dot has the same directives. The only difference is the C code implementation.
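
The 3.75x figure appears to match the ratio of the two final initiation intervals (II) quoted on this slide. As a worked check, assuming the throughput of a pipelined loop scales inversely with its II:

```latex
\[
\text{speedup} \approx \frac{\mathrm{II}_{\text{unoptimized}}}{\mathrm{II}_{\text{optimized}}} = \frac{30}{8} = 3.75
\]
```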

38 What Happened
Unoptimized C Code:
    for i = 1 : Block
        for radixID : Radix
            bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning
Still performing local scans serially
All targeting the same “bank”
Inner loop unrolled!!!

39 What Happened
Optimized C Code:
    for radixID : Radix
        for i = 1 : Block
            bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning
Now, when you pipeline the loop, you utilize the bandwidth.
Each “mini scan” pipeline gets its own bank
Inner loop unrolled
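
The two loop orders on the slides above can be sketched in plain C as follows. This is a minimal illustration, not the MachSuite source: the names `NUM_ROWS` and `ROW_LEN`, their values, and the choice to start the scanned index at 1 are assumptions. It shows that interchanging the loops leaves the result unchanged while making the inner-loop iterations independent of one another. How the accesses land on SRAM banks depends on the array layout and the partitioning directives, which this sketch does not model.

```c
#include <assert.h>
#include <string.h>

#define NUM_ROWS 4 /* hypothetical: number of independent "mini scans" */
#define ROW_LEN 8  /* hypothetical: elements per mini scan */

/* Original order: each mini scan runs to completion before the next starts.
 * Consecutive inner iterations are dependent, because bucket[idx] needs the
 * value just written to bucket[idx - 1]. */
void scan_original_order(int bucket[NUM_ROWS * ROW_LEN])
{
    for (int row = 0; row < NUM_ROWS; row++) {
        for (int col = 1; col < ROW_LEN; col++) {
            int idx = row * ROW_LEN + col;
            bucket[idx] += bucket[idx - 1];
        }
    }
}

/* Interchanged order: the inner loop steps across the mini scans.
 * Inner iterations touch different rows, so none depends on another. */
void scan_interchanged_order(int bucket[NUM_ROWS * ROW_LEN])
{
    for (int col = 1; col < ROW_LEN; col++) {
        for (int row = 0; row < NUM_ROWS; row++) {
            int idx = row * ROW_LEN + col;
            bucket[idx] += bucket[idx - 1];
        }
    }
}

/* Runs both orders on the same input and checks that they agree. */
void scan_example(void)
{
    int a[NUM_ROWS * ROW_LEN];
    int b[NUM_ROWS * ROW_LEN];
    for (int k = 0; k < NUM_ROWS * ROW_LEN; k++) {
        a[k] = k % 5 + 1;
    }
    memcpy(b, a, sizeof a);
    scan_original_order(a);
    scan_interchanged_order(b);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

The interchange is safe here because each row's scan only reads values from its own row, so the order in which rows are visited does not matter.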

40 Solution
[Diagram: MEMORY and SCAN Accelerator blocks]
Now, sequential accesses map to different SRAMs, and you can utilize the available bandwidth.

43 MachSuite
19 application-specific accelerator workloads
Benchmarks work with HLS and Aladdin
Represents workloads researchers are using
Diverse workloads, broad application space
Standards with limited restrictions

44 MachSuite
Available on GitHub: http://breagen.github.io/MachSuite/
Publications: Aladdin [ISCA’14]; MachSuite [IISWC’14]; Quantifying Acceleration [ISLPED’13]

