PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration Zhenman Fang, Michael Gill Jason Cong,


1 PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration
Zhenman Fang, Michael Gill, Jason Cong, Glenn Reinman
Computer Science Department, UCLA
Center for Domain-Specific Computing
Center for Future Architectures Research
[ICCAD 2015]

2 The Power Wall and Customized Computing
Parallelization
Customization: adapt the architecture to the application domain
Source: Shekhar Borkar, Intel

3 The Trend of Accelerator-Rich Architecture (ARA)
From ARC [DAC 12] & CHARM [DAC 14]
Global Accelerator Manager (GAM) with shared TLB

4 Our Motivation and Goal
A stack of research tools for accelerator-rich architecture, spanning early-stage to late-stage design:
  Standalone accelerator simulation: Aladdin
  Standalone accelerator generation: HLS
  System-level HLS-based ARA simulation: PARADE
  System-level pre-RTL SoC simulation: gem5 + Aladdin
  ARA FPGA prototyping: ARAPrototyper
Spare the community the difficulties we have encountered
Accelerate the adoption of accelerator-rich architecture (ARA)

5 PARADE: Platform for Accelerator-Rich Architectural Design & Exploration [ICCAD 15]
Extended gem5 (McPAT) for X86 CPU, with OS
Auto-generated accelerators based on HLS (AutoPilot)
Added SPM, DMA, GAM & TLB model
Extended Garnet (DSENT) for NoC
Extended Ruby (CACTI) for coherent cache hierarchy
gem5 memory model [ISPASS 14]

6 HLS-based Automatic Accelerator Generation
[Flow diagram; legend: Input, Tool, Output. Labels in the figure:]
  Source code: C function to accelerate
  High-Level Synthesis
  RTL model
  RTL Synthesis
  Timing info, e.g., II, clk
  Simulation Module Generator
  Simulation Module
  Simulation module info
  Application dataflow: accelerators chaining info
  Program Generator: handles accelerator communication, task buffer, interrupts, …
  Generated Program

7 Tutorial Agenda
Building the PARADE simulator
Creating a benchmark
Running a benchmark on PARADE
Performance and energy analysis using PARADE
  Performance breakdown
  Energy breakdown
Simulation speed of PARADE
Summary of PARADE features

8 Building PARADE
Use an existing accelerator module: VectorAddSample
  #include "VectorAddSample.hh" in LCAccOperatingModeInclude.hh
    VectorAddSample.hh contains accelerator modeling details
  REGISTER_OPMODE(OperatingMode_VectorAddSample) in LCAccOperatingModeListing.hh
  g_LCAccInterface->AddOperatingMode(g_LCAccDeviceHandle[0], "VectorAddSample"); in startup() in mem/ruby/system/System.cc
The same compiling command as gem5:
  scons PROTOCOL=MESI_Two_Level_Trace build/X86/gem5.opt

9 Adding an Accelerator Simulation Module
User high-level description: VectorAddSample.type
Specify a unique accelerator module name and opcode
  1390 uW, 52806 um2, 10, 2 (1GHz), 1
  Replace timing by our HLS/RTL tool: ./run.autopilot.sh VectorAddSample.type
Specify accelerator inputs and outputs
Specify accelerator computation body: one iteration in the loop

10 Auto-Generated Accelerator Simulation Module
Auto-generated accelerator module: VectorAddSample.hh
  Each accelerator inherits the LCAccOperationMode class
  Accelerator module name and opcode
  Accelerator timing info from HLS/RTL
  Auto-generated SPM address mapping model
  SPM/DMA timing, and computation latency
mono AccGen.exe src/modules/LCAcc VectorAddSample.type

11 Creating a Benchmark
To use existing accelerators, just call the accelerator API
  Base BenchmarkNode class
  Include the accelerator API
  Inherit BenchmarkNode
  Call the CreateBuffer API to init the accelerator and write the program description to a memory buffer
  Call the run_buf API to read the accelerator, run it, and handle the communication with CPU and GAM
  Create a BenchmarkNode, call Initialize() and Run()

12 Within the Accelerator Library (for Benchmarks)
[Diagram: CPU, GAM, Acc, and Mem (task description), with numbered arrows for the steps below]
New ISA:
  lcacc-req type
  lcacc-rsrv id, time
  lcacc-cmd id, cmd, addr
  lcacc-free id
Steps:
  Request available accelerators (lcacc-req)
  Respond with available ones & waiting time
  Request reservation (lcacc-rsrv) and wait
  Reserve accelerator, send it the core ID
  The core shares a task description and starts the accelerator (lcacc-cmd)
  Read task & start work
  Work done, notify the core
  Free accelerators (lcacc-free)
Users don't have to worry about these: we provide a dataflow language and tool to automatically generate the library
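The request, reserve, command, free sequence above can be pictured with a small toy model. This is only an illustrative sketch, not PARADE code: the class `ToyGAM`, the helper `run_on_accelerator`, and the Python method names are hypothetical stand-ins for the four instructions, the waiting-time response is left out, and the "accelerator" is just an element-wise add.

```python
class ToyGAM:
    """Toy model of the request/reserve/command/free handshake (not PARADE code)."""

    def __init__(self, accelerators):
        # accelerators: dict mapping accelerator id -> accelerator type name
        self.accelerators = dict(accelerators)
        self.owner = {}  # accelerator id -> core id currently holding it

    def lcacc_req(self, acc_type):
        """Return ids of free accelerators of the requested type."""
        return sorted(
            acc_id
            for acc_id, kind in self.accelerators.items()
            if kind == acc_type and acc_id not in self.owner
        )

    def lcacc_rsrv(self, acc_id, core_id):
        """Reserve an accelerator for a core; False if it is unknown or taken."""
        if acc_id not in self.accelerators or acc_id in self.owner:
            return False
        self.owner[acc_id] = core_id
        return True

    def lcacc_cmd(self, acc_id, core_id, task):
        """Run a task (here: element-wise add) on a reserved accelerator."""
        if self.owner.get(acc_id) != core_id:
            raise PermissionError("accelerator not reserved by this core")
        a, b = task
        return [x + y for x, y in zip(a, b)]

    def lcacc_free(self, acc_id, core_id):
        """Release the accelerator so other cores can reserve it."""
        if self.owner.get(acc_id) == core_id:
            del self.owner[acc_id]


def run_on_accelerator(gam, core_id, acc_type, task):
    """Walk the whole handshake once for a single core."""
    candidates = gam.lcacc_req(acc_type)
    if not candidates:
        return None  # nothing available; a real core would wait and retry
    acc_id = candidates[0]
    if not gam.lcacc_rsrv(acc_id, core_id):
        return None
    try:
        return gam.lcacc_cmd(acc_id, core_id, task)
    finally:
        gam.lcacc_free(acc_id, core_id)
```

For example, `run_on_accelerator(ToyGAM({0: "VectorAddSample"}), 7, "VectorAddSample", ([1, 2], [3, 4]))` reserves accelerator 0 for core 7, adds the two vectors, and frees the accelerator again.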

13 Creating an Accelerator Library (for Benchmarks)
To create an accelerator library, specify the accelerator dataflow (mono ApGen.exe VectorAddSample.txt VectorAddSampleLCacc.h)
  Input and output
  Entire data size and tile size
  Task based on tile size (chunk)
  Use double SPM buffer
  Declare the accelerator
  Create SPM for input/output
  Data transfer based on tile: input LLC/DRAM -> SPM & output SPM -> LLC/DRAM
  Trigger accelerator within tile: input SPM -> Register & output Register -> SPM
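The tile-based transfer with a double SPM buffer can be sketched in a few lines. This is a functional illustration under stated assumptions, not the generated library: `tiled_vector_add` and its parameters are hypothetical names, plain Python lists stand in for LLC/DRAM and the SPM, and the prefetch into the second buffer is sequential here, whereas the hardware would overlap it with computation.

```python
def tiled_vector_add(a, b, tile_size):
    """Add two vectors tile by tile, alternating between two SPM buffers."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")

    out = [0] * len(a)
    spm = [None, None]  # two buffer slots model the double SPM buffer
    starts = list(range(0, len(a), tile_size))

    # Load the first tile (LLC/DRAM -> SPM) before any computation starts.
    if starts:
        s = starts[0]
        spm[0] = (a[s:s + tile_size], b[s:s + tile_size])

    for k, s in enumerate(starts):
        cur, nxt = k % 2, (k + 1) % 2

        # Prefetch the next tile into the other buffer slot.
        if k + 1 < len(starts):
            t = starts[k + 1]
            spm[nxt] = (a[t:t + tile_size], b[t:t + tile_size])

        # Compute on the current tile (SPM -> register -> SPM).
        tile_a, tile_b = spm[cur]
        result = [x + y for x, y in zip(tile_a, tile_b)]

        # Write the result tile back (SPM -> LLC/DRAM).
        out[s:s + len(result)] = result

    return out
```

Calling `tiled_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 2)` processes three tiles (the last one partial) and returns the element-wise sum.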

14 Running a Benchmark on PARADE
Similar to the gem5 command for running benchmarks:
  ./gem5.opt --outdir=./TDLCA_BlackScholes/ configs/example/fs.py   (full-system config)
  --checkpoint-dir=./ckpt-1core/ --restore-with-cpu=timing -r 1 -n   (restore checkpoint)
  -s W   (timing warmup initialization, then switch to OoO)
  --ruby --l2_size=64kB --num-l2caches=   (banked 2MB LLC with Ruby)
  --mem-size=2GB --num-dirs=   (GB memory with 4 DDR3 controllers)
  --garnet=fixed --topology=Mesh --mesh-rows=   (*8 mesh with Garnet)
  --lcacc --accelerators=   (copy of accelerator)
  --script=./configs/boot/BlackScholes.td.rcS   (BlackScholes boot script)
  >& TDLCA_BlackScholes/result.txt   (redirect output to result.txt)
Output statistics of PARADE: stats.txt, result.txt, and visual.txt (visualization trace)

15 Execution Cycles for BlackScholes

16 Execution Cycles for BlackScholes (cont.)
Total cycles: check stats.txt
  system.switch_cpus_1.numCycles
CPU/SW computation time (assume perfect cache)
  Change configuration to use 20MB L2 cache with 1 cycle latency
ARA computation/communication time
  result.txt contains the start and end time for each task computation and data transfer
  Postprocessing: “ResultParser result.txt”, it will generate
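Pulling a single counter such as system.switch_cpus_1.numCycles out of stats.txt is easy to script. The sketch below assumes the usual gem5 layout of one statistic per line (name, value, then an optional `#` comment); `read_stat` is a hypothetical helper, and the numbers in the usage note are made up for illustration.

```python
def read_stat(stats_text, name):
    """Return the last value recorded for `name` in gem5-style stats text.

    Assumes each statistic sits on its own line as
    '<name> <value> # description'. Returns None if the name is absent.
    """
    value = None
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == name:
            try:
                value = float(fields[1])
            except ValueError:
                continue  # skip lines whose second field is not numeric
    return value
```

For instance, given the text `"system.switch_cpus_1.numCycles 123456 # cycles"`, `read_stat(text, "system.switch_cpus_1.numCycles")` returns `123456.0`. If the file holds several statistic dumps, this sketch keeps the last one.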

17 Issuing IPC Breakdown for BlackScholes
No issuing instruction
For the accelerator version, it’s customized to a fully-utilized 234-stage deep pipeline
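To get a feel for what a fully utilized deep pipeline means for execution time, the standard pipeline timing model can be applied: the first result appears after the pipeline depth, and each later iteration adds one initiation interval (II). This is a generic back-of-the-envelope sketch, not PARADE's timing model; the function name is hypothetical, and II = 1 is an assumption read into "fully-utilized".

```python
def pipeline_cycles(iterations, depth, initiation_interval=1):
    """Cycles for a pipelined loop: depth + (iterations - 1) * II."""
    if iterations <= 0:
        return 0
    return depth + (iterations - 1) * initiation_interval
```

With a 234-stage pipeline and an assumed II of 1, `pipeline_cycles(1000, 234)` gives 1233 cycles, so for long loops the cost per iteration approaches one cycle.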

18 Issuing IPC Breakdown for BlackScholes (cont.)
Check stats.txt
ARA version based on HLS: ./run.autopilot.sh BlackScholes.type

19 # of Cache/Memory Access for BlackScholes
42X L1D reduction by removing register spilling
removed instructions
no notable reduction

20 # of Cache/Memory Access for BlackScholes (cont.)
Check stats.txt

21 Cache/Memory Bandwidth for BlackScholes
Bandwidth (MB/s) = Access * 64 B * 10^{-6} / (Cycles / 2 GHz)
Improved LLC/DRAM effective bandwidth
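The bandwidth formula translates directly into a few lines of code. This is a minimal sketch: the function and parameter names are hypothetical, and the 64-byte line size and 2 GHz clock are taken from the formula as defaults.

```python
def bandwidth_mb_per_s(accesses, cycles, line_bytes=64, freq_hz=2e9):
    """Effective bandwidth in MB/s: accesses * line size * 1e-6 / seconds."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    seconds = cycles / freq_hz  # execution time at the given clock
    return accesses * line_bytes * 1e-6 / seconds
```

As a sanity check with made-up numbers, one million accesses over two billion cycles (one second at 2 GHz) works out to about 64 MB/s.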

22 Energy Breakdown for Deblur

23 Energy Breakdown for Deblur (cont.)
Energy = power * time; we measure power directly
Accelerator power: given by run.autopilot.sh
DRAM power (gem5 integrated Micron model, ISPASS 14)
  system.mem_ctrls0.averagePower:: in stats.txt
Core power (McPAT)
  generate.mcpat.xml.sh: convert gem5 statistics to mcpat xml
  generate.mcpat.energy.sh: generate power numbers

24 Energy Breakdown for Deblur (cont.)
LLC power (CACTI, integrated with McPAT)
NoC power (DSENT)
  run generate.dsent.sh
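Once the per-component power numbers have been collected, the breakdown is simply Energy = power * time applied to each component. The sketch below is illustrative only: `energy_breakdown` and the component keys are hypothetical, the power values must come from the tools listed above, and the 2 GHz clock is an assumption carried over from the bandwidth formula.

```python
def energy_breakdown(power_watts, cycles, freq_hz=2e9):
    """Return per-component energy in joules and the total.

    power_watts: dict mapping a component name (e.g. "core", "dram")
    to its average power in watts over the measured region.
    """
    seconds = cycles / freq_hz  # execution time of the measured region
    energy = {name: watts * seconds for name, watts in power_watts.items()}
    energy["total"] = sum(energy.values())
    return energy
```

With placeholder inputs such as `energy_breakdown({"core": 2.0, "dram": 1.0}, 2_000_000_000)`, the region lasts one second at the assumed clock, so each component's energy in joules equals its power in watts.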

25 Simulation Speed of PARADE
KIPS: Kilo Instructions simulated Per Second
15X

26 PARADE: Platform for ARA Design & Exploration
Based on the widely used gem5, with full-system X86 support
HLS support to model customized accelerators
Global Accelerator Management (GAM)
Coherent cache/SPM with shared memory (Ruby)
Customizable Network-on-Chip simulation (Garnet)
Power/area simulation
  McPAT for CPUs; high-level synthesis (AutoPilot) for accelerators
  CACTI for caches; DSENT for NoC; Micron model for DRAM
New feature: address translation support (HPCA 17)

27 Supporting Address Translation for ARA
The source code for address translation support will be added to PARADE soon!
On average: 7.6X speedup over a naïve IOMMU design, with only a 6.4% gap to ideal translation
[HPCA 17] Jason Cong, Zhenman Fang, Yuchen Hao, and Glenn Reinman

28 Thank You! Zhenman is on the academic job market, please check his website:

