The MachSuite Benchmark

Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks

Who Cares about Accelerators
Please do not distribute. 4/22/2017
Architecture. Cause: transistor scaling. Effect: specialization and SoCs.
CAD. Cause: RTL design costs. Effect: C-to-RTL tools.
ASICs. Cause: performance needs. Effect: build tuned ICs.
Notes: Cool tools that actually work! Keep doing what they do: H.265, speech recognition.

What's Next
Architecture: system integration, composability, flexibility.
CAD: faster turnaround, larger application space, complex designs.
ASICs: not much change; need high-performance ICs (H.266).
Notes: Keep doing what they do: H.265, speech recognition.

What's Missing
The Architecture, CAD, and ASICs outlook is unchanged from What's Next. Added on this slide:
Workload definition, common baseline.
Well-defined specs.

Tower of Babel Effect
A big problem.
Intro: number of benchmarks that occur in 25 recent architecture and CAD papers.
FFT: of 25 papers, only 1 used across all 8 (come back to this later).
Problem: 64 benchmarks were used only ONCE.
We want general mechanisms and solutions, so we need standards to measure contributions.

MachSuite is/has
19 application-specific accelerator workloads
HLS and Aladdin compatible
Workloads researchers are using today
Diverse workloads for application space coverage
Establishes standards without stifling creativity

Why MachSuite: existing benchmarks are not applicable/sufficient; works with accelerator simulators and CAD tools; representative applications covering a wide space.
MachSuite Design: kernel selection; algorithm choice; implementation details.

Why MachSuite: Comparing Benchmarks

Existing Benchmarks are Insufficient
High-level synthesis is good at:
  Crypto { AES, DES, SHA }
  Image/Multimedia { Stencils, JPEG, SAD }
  Scientific Codes { GEMM, FFT }
High-level synthesis needs improvement on:
  Irregular Behavior { BFS, SPMV CRS }
  Complex App Codes { BackProp, MD }
  Application space coverage
Coverage of the 13 Berkeley Dwarves: 3 of 13 [CHStone, ISCAS] versus 12 of 13 [MachSuite, IISWC/BARC].

Existing Benchmarks not Applicable
Many existing GPU benchmarks: Rodinia, Parboil, SHOC, ...
GPU and accelerator design spaces differ:
  Tuned for GPU architecture
  Implemented in CUDA/OpenCL
  GPU workloads are a subset of accelerator workloads

Why MachSuite: Simulator/HLS Friendly

Works with Accelerator CAD Tools
Flow: C code plus directives go through high-level synthesis (Vivado HLS) to produce RTL (hardware description language).
Directives: functional units, resource sharing, loop pipelining, memory bandwidth.
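As a hedged illustration of how directives like these attach to C code, the sketch below marks up a small kernel with Vivado HLS-style pragmas. The kernel `vec_scale_add`, the size `N`, the partition factor of 8, and the self-test helper are assumptions made for illustration and are not taken from MachSuite. An ordinary C compiler ignores the `#pragma HLS` lines, so the same file also runs in software.

```c
#include <assert.h>

#define N 64  /* assumed problem size */

/* Hypothetical kernel (not from MachSuite): y[i] = a * x[i] + y[i]. */
void vec_scale_add(int a, const int x[N], int y[N]) {
    int i;
    /* Memory bandwidth: spread each array over 8 banks (assumed factor). */
#pragma HLS array_partition variable=x cyclic factor=8 dim=1
#pragma HLS array_partition variable=y cyclic factor=8 dim=1
    for (i = 0; i < N; i++) {
        /* Loop pipelining: request a new iteration every cycle (II = 1). */
#pragma HLS pipeline II=1
        y[i] = a * x[i] + y[i];
    }
}

/* Software check, runnable with any C compiler (the pragmas are ignored). */
void vec_scale_add_selftest(void) {
    int x[N], y[N], i;
    for (i = 0; i < N; i++) { x[i] = i; y[i] = 1; }
    vec_scale_add(3, x, y);
    for (i = 0; i < N; i++) assert(y[i] == 3 * i + 1);
}
```

Changing only the pragmas (for example the partition factor or the requested II) explores different hardware designs without touching the C statements, which is the kind of trade-off the directive list refers to.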

Works with Simulators
MachSuite
Directives: functional unit selection, loop pipelining, memory bandwidth
Trade off power/performance

Why MachSuite: Workload Diversity and Coverage

Incorporates Applications of Interest

Covers Application Space
Examples shown: FFT, GEMM, STENCIL. Coverage: 12 of 13 Dwarves.

MachSuite Design
Recap of Why MachSuite: existing benchmarks are not applicable/sufficient; works with accelerator simulators and CAD tools; representative applications covering a wide space.
Design topics: kernel selection; algorithm choice; implementation details.

MachSuite Design: Kernel Selection

Kernel Selection
Kernel = a specific problem. E.g.: SORT.
The problem:
  Not everyone is using the same kernels.
  Comparing similar-sounding kernels doesn't work.
  Let's just pick one.

MachSuite Design: Algorithm Choice

Algorithm Choice
Algorithm = a specific solution; a type of kernel. E.g.: merge or radix SORT.
The problem:
  Reporting only the kernel is too high level.
  Ideal algorithms differ across SoCs.
  Standardization without limitation.

MachSuite Design: Implementation Details

Implementation Details
Implementation = specific code for an algorithm. E.g.: Stencil in Rodinia vs. Parboil.
The problem:
  Can cause misleading results.
  Performance depends on tuning.
  Separate signal from noise.

Performance Variance due to Implementation Details
1 kernel, 1 algorithm, 1 implementation.
Notes: This shows the space of possible hardware designs. This is a subset; there are THOUSANDS.

Performance Variance due to Implementation Details
1 kernel, 1 algorithm, 2 implementations: ~10x performance at the same power.
Notes: So we need to pick one (MachSuite gives you that), or at least have a way to talk about changes (MachSuite gives you that too, with marked-up code). It's like a simulator: if you change something, you should report your configuration in your evaluation section. [details in tutorial]

Root Causing Inefficiency
Different implementations for parallel SCAN.
Same directives:
- Single-port SRAMs
- 8-way partition
- Same loops pipelined

What Happened
Pipelining result for the “unoptimized” C code: target II 1, final II 30.
Pipelining result for the “optimized” C code: target II 1, final II 8.
That is a 3.75x difference in final II.
Pareto points from design space search: each dot has the same directives; the only difference is the C code implementation.
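The 3.75x figure matches the ratio of the two final initiation intervals (30 / 8). As a rough sketch, assuming the usual first-order model in which a pipelined loop takes about depth + II × (trips − 1) cycles, the helpers below show why the final II dominates for long loops. The function names, the trip count, and the pipeline depth are illustrative assumptions, not values from the slides.

```c
#include <assert.h>

/* First-order cycle estimate for a pipelined loop (an assumed model):
 * the first iteration takes `depth` cycles, and each later iteration
 * starts `ii` cycles after the previous one. */
long pipelined_cycles(long trips, long ii, long depth) {
    assert(trips > 0 && ii > 0 && depth > 0);
    return depth + ii * (trips - 1);
}

/* Ratio of estimated cycle counts for two final IIs, with the same
 * trip count and pipeline depth assumed for both versions. */
double ii_speedup(long trips, long slow_ii, long fast_ii, long depth) {
    return (double)pipelined_cycles(trips, slow_ii, depth) /
           (double)pipelined_cycles(trips, fast_ii, depth);
}
```

For large trip counts the ratio approaches slow_ii / fast_ii, so under this model a final II of 30 versus 8 costs close to 3.75x in cycles.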

What Happened: Unoptimized C Code
    for i = 1 : Block
        for radixID : Radix
            bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning. Inner loop unrolled!!!
Still performing local scans serially, all targeting the same “bank”.

What Happened: Optimized C Code
    for radixID : Radix
        for i = 1 : Block
            bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
Cyclic partitioning. Inner loop unrolled.
Now, when you pipeline the loop, you utilize bandwidth: each “mini scan” pipeline gets its own bank.
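The two versions differ only in loop order. The sketch below is a minimal software illustration of that idea, assuming a layout in which each of `RADIX` independent mini scans owns a contiguous block of `BLOCK` entries. The layout, the sizes, and the function names are assumptions for illustration and do not reproduce the slide pseudocode or MachSuite source. It only shows that interchanging the loops leaves the results unchanged; which order a given HLS tool pipelines better depends on the partitioning and the tool.

```c
#include <assert.h>

#define RADIX 4   /* assumed number of independent mini scans */
#define BLOCK 8   /* assumed entries per mini scan */
#define SIZE (RADIX * BLOCK)

/* Order A: finish one mini scan before starting the next.
 * The inner loop carries the dependence bucket[k] += bucket[k - 1]. */
void scan_one_at_a_time(int bucket[SIZE]) {
    int r, i;
    for (r = 0; r < RADIX; r++)
        for (i = 1; i < BLOCK; i++)
            bucket[r * BLOCK + i] += bucket[r * BLOCK + i - 1];
}

/* Order B: loops interchanged. Each inner iteration advances a different
 * mini scan by one step, so inner iterations do not depend on each other. */
void scan_interleaved(int bucket[SIZE]) {
    int r, i;
    for (i = 1; i < BLOCK; i++)
        for (r = 0; r < RADIX; r++)
            bucket[r * BLOCK + i] += bucket[r * BLOCK + i - 1];
}
```

Because the mini scans never touch each other's entries, both functions produce the same array; only the order of memory accesses changes, and that order is what the memory banking sees.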

Solution
Diagram: MEMORY and SCAN Accelerator, before and after (✔).
Now sequential accesses map to different SRAMs, and you can utilize the available bandwidth.
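To make the bank mapping concrete, here is a small sketch assuming cyclic partitioning places element k in bank k mod 8, with 8 taken from the 8-way partition mentioned earlier in the deck. The helper names `cyclic_bank` and `banks_touched` are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define NUM_BANKS 8  /* matches the 8-way partition mentioned in the slides */

/* Cyclic partitioning interleaves elements across banks:
 * element k is assumed to live in bank k mod NUM_BANKS. */
int cyclic_bank(int index) {
    assert(index >= 0);
    return index % NUM_BANKS;
}

/* Count how many distinct banks `count` accesses touch, starting at
 * `start` and stepping by `stride`. On single-port SRAMs, accesses to
 * distinct banks can proceed in the same cycle. */
int banks_touched(int start, int stride, int count) {
    int used[NUM_BANKS] = {0};
    int k, distinct = 0;
    for (k = 0; k < count; k++) {
        int b = cyclic_bank(start + k * stride);
        if (!used[b]) { used[b] = 1; distinct++; }
    }
    return distinct;
}
```

Under this assumption, eight sequential accesses spread over all eight banks, while eight accesses with a stride of 8 all land in one bank and have to be serialized.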

MachSuite
19 application-specific accelerator workloads
Benchmarks work with HLS and Aladdin
Represents workloads researchers are using
Diverse workloads, broad application space
Standards with limited restrictions

MachSuite: available on GitHub
http://breagen.github.io/MachSuite/
Publications: Aladdin [ISCA’14]; MachSuite [IISWC’14]; Quantifying Acceleration [ISLPED’13]
