Implementation of Efficient Check-pointing and Restart on CPU - GPU

2 Implementation of Efficient Check-pointing and Restart on CPU - GPU
Sumanth Suraneni, Sharath Prasad, Harsha Sutaone
9/17/2018

3 Introduction
GPU
CPU-GPU systems
Checkpoint the GPU on a CPU-GPU system
Restart from the checkpoint on the CPU

4 Motivation
CPU-GPU systems for general-purpose workloads
Dependability is an issue in future GPUs
GPU fault tolerance is a nascent field
Checkpointing implementations on GPUs are at the application level
We explore micro-architectural changes to GPUs

5 Background
OpenCL Programming Model
Southern Islands Architecture
Multi2Sim CPU-GPU Simulator

6 OpenCL Programming Model

7 OpenCL Programming Model
Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing

8 OpenCL Programming Model
Work-item Grouping into Work-groups and Wavefronts

9 Southern Islands Architecture

10 Southern Islands Architecture
Compute Unit

11 Southern Islands Architecture
Kernel State

12 Multi2Sim CPU-GPU Simulator
Software entities defined in the OpenCL programming model: an ND-Range is formed of work-groups, which are, in turn, sets of work-items executing the same OpenCL C kernel code.
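The hierarchy above can be made concrete with a little index arithmetic. The following is a minimal sketch for a one-dimensional ND-Range, assuming the 64-work-item wavefront of Southern Islands hardware and a global size that is a multiple of the work-group size; the function names `decompose` and `locate` are hypothetical and are not taken from the slides or from Multi2Sim.

```python
import math

# Work-items per wavefront on AMD Southern Islands hardware.
WAVEFRONT_SIZE = 64


def decompose(global_size, local_size, wavefront_size=WAVEFRONT_SIZE):
    """Return (work_groups, wavefronts_per_group) for a 1-D ND-Range.

    Assumes the global size is a multiple of the work-group (local) size.
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    work_groups = global_size // local_size
    # A partially filled last wavefront still occupies a whole wavefront.
    wavefronts_per_group = math.ceil(local_size / wavefront_size)
    return work_groups, wavefronts_per_group


def locate(global_id, local_size, wavefront_size=WAVEFRONT_SIZE):
    """Map a 1-D global work-item ID to (work-group, wavefront, lane)."""
    group_id, local_id = divmod(global_id, local_size)
    wavefront_id, lane = divmod(local_id, wavefront_size)
    return group_id, wavefront_id, lane
```

For example, an ND-Range of 1024 work-items with work-groups of 256 yields 4 work-groups of 4 wavefronts each, and global work-item 300 is lane 44 of wavefront 0 in work-group 1.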

13 Multi2Sim CPU-GPU Simulator
Interaction between user code, OS-code, and hardware, comparing native and simulated environments

14 Multi2Sim CPU-GPU Simulator
Running an OpenCL Kernel on a Southern Islands GPU
Block Diagram of a Compute Unit

15 Implementation
SIEmuCreate()
  Assign global memory
  List running and waiting work-groups
SIEmuRun()
  Dequeue and enqueue running and waiting work-groups
  Work-group create
si_wavefront_execute()
  Instruction dump
  Next PC = Current PC + Instruction Size
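The two mechanisms named on this slide, moving work-groups between the waiting and running lists and advancing the program counter by the instruction size, can be sketched as follows. This is an illustration only, not Multi2Sim code: the function names, the `max_running` limit, and the placeholder mnemonics are assumptions, and the loop covers only sequential (non-branching) execution.

```python
from collections import deque


def advance_pc(pc, inst_size):
    """Next PC = Current PC + Instruction Size (sequential flow only)."""
    return pc + inst_size


def execute_wavefront(program, pc=0):
    """Walk a program and produce an instruction dump.

    `program` maps a byte address to (mnemonic, size_in_bytes).
    Execution stops at the first address with no instruction.
    """
    dump = []
    while pc in program:
        mnemonic, size = program[pc]
        dump.append(f"{pc:#06x} {mnemonic}")
        pc = advance_pc(pc, size)
    return dump, pc


def start_waiting_work_groups(waiting, running, max_running):
    """Dequeue waiting work-groups and enqueue them as running.

    `max_running` is a hypothetical cap on concurrently running groups.
    """
    while waiting and len(running) < max_running:
        running.append(waiting.popleft())
```

A usage sketch with made-up instruction names and 4- and 8-byte encodings:

```python
program = {0: ("inst_a", 4), 4: ("inst_b", 8), 12: ("inst_c", 4)}
dump, final_pc = execute_wavefront(program)  # final_pc is 16

waiting, running = deque([0, 1, 2, 3]), deque()
start_waiting_work_groups(waiting, running, max_running=2)
```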

16 Implementation
Checkpoint Implementation
  ND-Range: ID, work dimension, number of VGPRs and SGPRs used
  Work-group: ID, work-groups finished, wavefronts completed and at barrier, wavefront count
  Wavefront: ID, SREGs, execution state of wavefront, instruction count
  Work-item: ID, VREGs, global memory access size and address
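One way to picture the saved state is as four nested records, one per level of the hierarchy listed above. The sketch below is an assumption about layout, not the authors' file format: the field names, the integer encoding of the wavefront execution state, the boolean `finished` flag, and the use of JSON are all hypothetical choices made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class WorkItemState:
    id: int
    vregs: List[int]                  # vector registers of this work-item
    global_mem_access_size: int = 0
    global_mem_access_addr: int = 0


@dataclass
class WavefrontState:
    id: int
    sregs: List[int]                  # scalar registers shared by the wavefront
    exec_state: int                   # placeholder encoding of execution state
    inst_count: int
    work_items: List[WorkItemState] = field(default_factory=list)


@dataclass
class WorkGroupState:
    id: int
    finished: bool                    # assumed meaning of "work-groups finished"
    wavefronts_completed: int
    wavefronts_at_barrier: int
    wavefront_count: int
    wavefronts: List[WavefrontState] = field(default_factory=list)


@dataclass
class NDRangeState:
    id: int
    work_dim: int
    num_vgprs_used: int
    num_sgprs_used: int
    work_groups: List[WorkGroupState] = field(default_factory=list)


def save(nd_range: NDRangeState) -> str:
    """Serialize the whole hierarchy to a JSON string."""
    return json.dumps(asdict(nd_range))


def load(text: str) -> NDRangeState:
    """Rebuild the nested records from the JSON produced by save()."""
    d = json.loads(text)
    work_groups = []
    for wg in d.pop("work_groups"):
        wavefronts = []
        for wf in wg.pop("wavefronts"):
            items = [WorkItemState(**wi) for wi in wf.pop("work_items")]
            wavefronts.append(WavefrontState(work_items=items, **wf))
        work_groups.append(WorkGroupState(wavefronts=wavefronts, **wg))
    return NDRangeState(work_groups=work_groups, **d)
```

A checkpoint is usable only if `load(save(state)) == state` holds for every captured state, which is a cheap property to check before trusting a restart.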

17 Implementation
LDS (Local Data Share)
  LDS module of executing work-group. All pages are stored.
Global memory
  Stored until global memory top.
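The memory part of the checkpoint can be sketched as two copies: the whole LDS of the executing work-group, and global memory only up to its top. The sketch assumes "global memory top" means the highest address allocated so far, so bytes above it need not be saved; the function names and the dictionary layout are hypothetical.

```python
def snapshot_memory(lds, global_mem, global_top):
    """Copy all of LDS and the used prefix of global memory.

    `global_top` is assumed to be the first unused global address.
    """
    if not 0 <= global_top <= len(global_mem):
        raise ValueError("global_top is outside global memory")
    return {
        "lds": bytes(lds),
        "global": bytes(global_mem[:global_top]),
        "global_top": global_top,
    }


def restore_memory(snapshot, lds, global_mem):
    """Write a snapshot back into existing LDS and global buffers."""
    lds[:len(snapshot["lds"])] = snapshot["lds"]
    top = snapshot["global_top"]
    global_mem[:top] = snapshot["global"]
    return top
```

Stopping at the top keeps the checkpoint proportional to the memory the kernel has actually allocated instead of the full simulated address space.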

18 Implementation
Completed Work-groups
  Store the list of finished work-groups in a file.
Unexecuted Wavefronts during checkpoint
  Store into a separate file while writing the checkpoint file.
  Read from the file to start execution during restart.
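The bookkeeping on this slide amounts to two small files: one listing finished work-groups and one listing wavefronts that had not executed when the checkpoint was taken. The sketch below shows one possible shape for it; the file names, the JSON encoding, and the (work-group, wavefront) pair representation are assumptions, not details from the presentation.

```python
import json
import os

# Hypothetical file names, chosen for illustration only.
FINISHED_FILE = "finished_work_groups.json"
PENDING_FILE = "pending_wavefronts.json"


def write_lists(directory, finished_work_groups, pending_wavefronts):
    """Write the finished work-group IDs and the unexecuted wavefronts.

    `pending_wavefronts` is an iterable of (work_group_id, wavefront_id).
    """
    with open(os.path.join(directory, FINISHED_FILE), "w") as f:
        json.dump(sorted(finished_work_groups), f)
    with open(os.path.join(directory, PENDING_FILE), "w") as f:
        json.dump([list(pair) for pair in pending_wavefronts], f)


def work_groups_to_run(directory, total_work_groups):
    """On restart, return the work-group IDs that still need to execute."""
    with open(os.path.join(directory, FINISHED_FILE)) as f:
        finished = set(json.load(f))
    return [wg for wg in range(total_work_groups) if wg not in finished]


def read_pending_wavefronts(directory):
    """On restart, return the wavefronts to start executing first."""
    with open(os.path.join(directory, PENDING_FILE)) as f:
        return [tuple(pair) for pair in json.load(f)]
```

Keeping the pending wavefronts in their own file means the restart path can read them without parsing the full register and memory snapshot first.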

19 Implementation
Checkpoint
Checkpoint Trace

20 Implementation
Restart
Restart Trace

21 Implementation
Verification Strategy

22 Evaluation
Workgroups

23 Evaluation
Instruction Count

24 Evaluation
Checkpoint Size

25 Evaluation
LDS Comparison

26 Bugs Encountered
LDS misalignment.

27 Bugs Encountered
Unexecuted wavefront during checkpoint

28 Future Scope
Further minimization of LDS snapshot
  Keeping track of pages modified and storing only those
Implementing a driver call to checkpoint
Hardware complexity of the implementation
Compression algorithms during multiple checkpoints
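The first future-work item, tracking modified pages and storing only those, is a standard incremental-checkpoint technique and can be sketched in a few lines. This is not part of the presented implementation: the `TrackedMemory` class, the 4096-byte page size, and the `apply_pages` helper are hypothetical.

```python
PAGE_SIZE = 4096  # hypothetical page size


class TrackedMemory:
    """Byte-addressable memory that records which pages were written."""

    def __init__(self, size, page_size=PAGE_SIZE):
        self.data = bytearray(size)
        self.page_size = page_size
        self.dirty = set()

    def write(self, addr, payload):
        if addr < 0 or addr + len(payload) > len(self.data):
            raise IndexError("write outside memory")
        self.data[addr:addr + len(payload)] = payload
        if payload:
            first = addr // self.page_size
            last = (addr + len(payload) - 1) // self.page_size
            self.dirty.update(range(first, last + 1))

    def checkpoint(self):
        """Return only the pages modified since the previous checkpoint."""
        ps = self.page_size
        pages = {p: bytes(self.data[p * ps:(p + 1) * ps])
                 for p in sorted(self.dirty)}
        self.dirty.clear()
        return pages


def apply_pages(base, pages, page_size=PAGE_SIZE):
    """Replay saved pages onto a base image during restart."""
    for page, content in pages.items():
        start = page * page_size
        base[start:start + len(content)] = content
```

A write that straddles a page boundary marks both pages dirty, and a checkpoint taken with no intervening writes is empty, which is where the size reduction over storing all pages would come from.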

29 THANK YOU

