Pressure-Driven Hardware Managed Thread Concurrency for Irregular Applications
John D. Leidel¹,², Xi Wang¹, Yong Chen²
¹Texas Tech University, ²Tactical Computing Laboratories
IA3 2017: Seventh Workshop on Irregular Applications: Architectures and Algorithms
Overview
- Introduction
- Context Switching/Management
- GC64 Microarchitecture
- Research Results
- Conclusions & Future Research
Introduction
Data-Intensive Computing
- Data-intensive algorithms & applications
  - Sparse data structures: sparse matrix/SpMV
  - Graph computations: network theory, machine learning
- Driving characteristics
  - Non-unit-stride memory access: scatters/gathers
  - Memory intensive; often cache unfriendly
  - Non-deterministic
- D. Bader, D. Ediger, K. Jiang and J. Riedy. Characterizing and Analyzing Massive Spatio-Temporal Graphs
GoblinCore-64
- Open architecture for programmable data-intensive computing
  - HW/SW is BSD licensed!
- RISC-V (RV64G) ISA
- Support for Micron HMC memory
  - Dynamic memory coalescing
- Support for PGAS
  - Partitioned global memory space in physical memory
- Microarchitecture support for latency hiding
Previous Approaches: Context Switching/Management
Context Switching/Management
- Tera MTA/Cray XMT
  - Thread streams; stream queuing via a barrel mechanism
  - One cycle per stream; single-cycle context switch
- Convey CHOMP
  - FPGA-based multithreaded ISA
  - Time-division multiplexed thread execution
  - Single-cycle context switching; context switching mechanism connected to register hazarding
- Sun UltraSPARC Niagara
  - SPARC RISC pipeline; four threads per core
  - Similar barrel context management to the MTA
- IBM Cyclops64
  - BlueGene/C design using the Power ISA
  - 80 cores/socket; 2 thread units/core
  - Non-preemptive thread execution: threads sleep on wait states rather than context switching
GoblinCore-64 Architecture and ISA
Microarchitecture
RISC-V ISA Requirements
- Rocket Core: 5-stage, in-order pipelined design
- ISA
  - RV64I: 64-bit integer arithmetic and addressing support
  - M-extension: 64-bit integer multiplication and division
  - A-extension: atomic instructions
- Optional support
  - F-extension: single-precision floating point arithmetic support and storage
  - D-extension: double-precision floating point arithmetic support and storage
  - RV128I: extended (scalable) 128-bit addressing support
GC64 Task Processor
- Integer arithmetic unit
- Floating point arithmetic unit
- Thread control unit
  - Spawning new tasks
  - Joining tasks
  - Incrementing task execution pressure (gcount register)
  - Enforcing context switch events
- Multiple "Task Units", each representing an individual thread/task
  - Attempt to always keep the pipeline full
  - Replicated, unique register files
  - One thread permitted to inject instructions into the pipeline at a time
GC64 Hierarchy (figure: Task Group, GC64 SoC)
GC64 Task Processor Context Switching
- The GC64 context-switch method couples the compiler's notion of instruction "cost" to the hardware's ability to detect and enforce context switch events
  - Instruction cost is derived from the compiler's cost table
  - The compiler can now optimize execution of task/thread contexts in parallel apps!
- We implement our method as an extension to the current 5-stage pipeline control path
  - Carries the Task Unit (CTX.ID) identifier through the various pipeline stages
  - Permits instructions from multiple Task Units to be in flight
GC64 Context Switching cont.
- During the instruction crack/decode phase of the pipeline, we perform a table lookup of the relative instruction cost
  - We condense the table to hold only opcodes rather than the entire ISA encoding set, minimizing space
- The cost value for the respective instruction is accumulated into the respective Task Unit's GCOUNT register
  - This value represents the relative pressure that the task or thread is inducing on the pipeline
- When the GCOUNT value exceeds a predetermined threshold, the Task Control Unit initiates a context switch
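The accumulate-and-compare loop above can be sketched in software as follows; the opcode names, cost values, and threshold here are illustrative stand-ins, not the actual GC64 cost table:

```python
# Sketch of GCOUNT pressure accumulation: each decoded instruction's
# cost is looked up and added to the task's gcount; a context switch
# fires when the accumulated pressure exceeds the threshold.
# COST and MAX_PRESSURE are assumed values for illustration only.
COST = {"ld": 30, "st": 30, "add": 20, "mul": 20}
MAX_PRESSURE = 50

def find_switch_points(instructions, max_pressure=MAX_PRESSURE):
    """Return the indices at which a context switch would be forced."""
    gcount = 0
    switches = []
    for i, op in enumerate(instructions):
        gcount += COST[op]           # decode-stage cost-table lookup
        if gcount > max_pressure:    # pressure exceeds the threshold
            switches.append(i)       # Task Control Unit forces a switch
            gcount = 0               # the incoming task starts fresh
    return switches

print(find_switch_points(["ld", "add", "mul", "st"]))  # -> [2]
```

With these placeholder costs, the third instruction pushes gcount to 70, exceeding the 50-unit threshold and triggering the switch.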
GC64 Concurrency Instructions
- IWAIT Rd, Rs1, Rs2
  - Pends execution of the next instruction until the register hazard on the register index at Rd has been cleared
  - The pend is upheld while: rs2 < rs1
- CTXSW
  - Sets the gcount register to an overflow state, thus forcing the current task to context switch
- Full architecture spec w/ ISA extensions:
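A toy software model of these two instruction semantics (the overflow value, default threshold, and method names are assumptions, not the GC64 encoding):

```python
# Models the IWAIT pend condition and the CTXSW forced switch.
# GCOUNT_OVERFLOW assumes a 64-bit gcount register.
GCOUNT_OVERFLOW = 2**64 - 1

class TaskUnit:
    def __init__(self, max_pressure=50):
        self.gcount = 0
        self.max_pressure = max_pressure

    def iwait_pends(self, rs1, rs2):
        """IWAIT: the next instruction pends while rs2 < rs1,
        i.e., until the hazard on Rd has been cleared."""
        return rs2 < rs1

    def ctxsw(self):
        """CTXSW: push gcount into an overflow state so the Task
        Control Unit context-switches this task at the next check."""
        self.gcount = GCOUNT_OVERFLOW

    def must_switch(self):
        """True when accumulated pressure exceeds the threshold."""
        return self.gcount > self.max_pressure

tu = TaskUnit()
tu.ctxsw()
print(tu.must_switch())  # -> True
```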
GC64 Context Switching Example
Two threads: y = a*x + b
- Cost table
  - Register to memory = 30 units
  - Arithmetic = 20 units
- Max pressure = 50 units
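A hedged reconstruction of how this example plays out, modeling y = a*x + b as a load/mul/add/store sequence with the slide's costs; the exact instruction sequence and the round-robin switch policy are assumptions:

```python
# Two threads computing y = a*x + b, each switched out once its
# accumulated pressure (gcount) exceeds MAX_PRESSURE.
# Costs per the slide: register<->memory = 30, arithmetic = 20.
COSTS = {"load": 30, "mul": 20, "add": 20, "store": 30}
MAX_PRESSURE = 50

def interleave(program, n_threads=2):
    """Round-robin the threads; each runs until its gcount exceeds
    the threshold, then yields the pipeline to the next thread."""
    pcs = [0] * n_threads      # per-thread program counters
    trace = []                 # (thread, opcode) execution order
    t = 0
    while any(pc < len(program) for pc in pcs):
        gcount = 0
        while pcs[t] < len(program) and gcount <= MAX_PRESSURE:
            op = program[pcs[t]]
            trace.append((t, op))
            gcount += COSTS[op]
            pcs[t] += 1
        t = (t + 1) % n_threads
    return trace

trace = interleave(["load", "mul", "add", "store"])
print([t for t, _ in trace])  # -> [0, 0, 0, 1, 1, 1, 0, 1]
```

With these numbers each thread executes three instructions (30 + 20 + 20 = 70 > 50) before its first forced switch, then later returns to finish its store.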
GC64 Cost Table
GC64 Context Switch Performance
Research Results
Simulation Infrastructure
- The GC64 simulator is a mixture of multiple environments
  - The RISC-V simulator "Spike"
  - Our GC64-specific modifications to Spike
  - GC64-specific low-latency tracing tools
- We boot and execute our environment using a real Linux kernel environment
  - Mimics a real user environment with system calls, I/O, etc.
- All of our benchmark applications are compiled using GCC for RISC-V
  - None are hand-tuned or utilize GC64-specific runtime mechanisms
- Test metrics
  - Frequency/distribution of context switch events by opcode
  - Application runtime
- Configuration
  - Executed using 64 and 128 as the max thread pressure
  - 2 and 8 threads per core (due to limitations in the RISC-V Linux kernel)
Benchmark Workloads
- Barcelona OpenMP Task Suite (BOTS)
  - Alignment: protein alignment application
  - FFT: 1D Cooley-Tukey FFT
  - Fib: Fibonacci sequence
  - Health: N-body streaming health informatics
  - Sort: recursive, parallel vector sort
  - SparseLU: sparse LU factorization
  - Strassen: Strassen multiplication of square matrices
- Graph Analytics Benchmark Suite
  - BC: betweenness centrality
  - BFS: breadth-first search
  - CC: connected components
  - PR: PageRank
  - SSSP: single-source shortest path
  - TC: triangle counting
Benchmark Workloads cont.
- NAS Parallel Benchmarks (C version)
  - CG: conjugate gradient solver
  - EP: embarrassingly parallel
  - FT: 3D forward and inverse FFT
  - MG: multigrid solver
  - IS: integer sorting
  - LU: lower-upper Gauss-Seidel solver
  - SP: scalar penta-diagonal solver
- Misc.
  - HPCG: High Performance Conjugate Gradient solver
  - Lulesh: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics proxy app
  - STREAM: John McCalpin's classic memory bandwidth benchmark
  - Matmul: performs a 512x512 blocked, dense matrix multiplication (DP float)
Context Switch Opcode Distribution
- Significant number of events on LUI (a benign operation)
- I/O operations
- Memory operations are dominant!
- Conclusion: the GC64 method induces a more balanced set of executing threads, especially with respect to memory operations
Context Switch Application Speedup
- Best application speedup: BOTS.Fib with 8 threads/core and a max pressure of 64 (14X speedup)
- Worst application speedup: BOTS.Health with 2 threads/core and a max pressure of 128 (~1X speedup)
- Average speedup: 3.2X!
- Conclusion
  - Utilizing the GC64 context switching mechanism increases performance across nearly all cases
  - It does NOT inhibit performance!
Conclusion
- GoblinCore-64
  - Designed to support scalable execution of commodity programming models for data-intensive computing
  - Simple RISC ISA and microarchitecture
  - Low-latency context switching and ISA mechanisms to support scalable concurrency
  - High-bandwidth memory subsystem
  - Scalable physical memory layer
- Performance
  - Scalability to 2^64 (18,446,744,073,709,551,616) threads
  - Up to 14X core-per-core performance improvement over a standard RISC-V environment
  - Average performance increase of 3.2X per core!
- Simple, hardware-software coupled methods that increase performance without a monumental effort to redesign applications/algorithms
Future Research
- Additional testing with an updated RISC-V Linux kernel to determine the optimal Task Unit to pipeline ratio
- Additional testing with different scheduling pressure thresholds
  - Especially against multiple compiler implementations: optimized GCC + LLVM, un-optimized GCC + LLVM
- Implement our approach using BOOM, the out-of-order RISC-V implementation
- Research the use of additional task queuing/priority mechanisms
  - Similar in design to the MTA barrel
- Research additional debugging/system software mechanisms
  - What does our method do to a stable debugging environment?
  - What about performance analysis mechanisms? Perf counters?
Questions/Comments
- John Leidel (jleidel@tactcomplabs.com)
- Xi Wang
- Yong Chen
Additional technical information
Background material
GC64 Task Control Instructions
- SPAWN Rd, Rs1: spawn a new task using the context at Rs1
- JOIN Rd, Rs1: join a task using the task context at Rs1
- GETTASK RD, TCTX: retrieve the task context value for the encountering task unit ("Where do I get my new tasks from?")
- SETTASK TCTX, RD: set the task context value for the encountering task unit ("This is where I get my new tasks from!")
- GETTID RD, GTID: "What is my {task,thread,etc} id?"
- Modifying task queue values
  - GETTQ RD, TQ
  - SETTQ TQ, RS1
  - GETTE RD, TE
  - SETTE TE, RS1
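An illustrative software model of the SPAWN/JOIN semantics above; the dict-based task context and the Python deque standing in for the queue addressed by the TQ register are assumptions, not the GC64 encoding:

```python
# Toy model of SPAWN/JOIN over a shared task queue.
from collections import deque

task_queue = deque()  # stands in for the queue addressed by TQ

def spawn(ctx):
    """SPAWN: enqueue a new task context for an idle Task Unit."""
    task_queue.append(ctx)

def join():
    """JOIN: pull the next pending task context, if any."""
    return task_queue.popleft() if task_queue else None

spawn({"pc": 0x1000, "tid": 1})
spawn({"pc": 0x2000, "tid": 2})
print(join()["tid"])  # -> 1
```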
Task Unit
- Smallest unit of divisible concurrency
- Contains
  - RISC-V integer registers
  - RISC-V floating point registers
  - GC64 machine state registers
L2: Task Group
Task Unit Registers
- TCTX: 64-bit register that holds the address of the current task context
- TID: 64-bit register that holds the current task ID
- TQ: 64-bit register that holds the address of the task queue
- TE: task exception register; exceptions specific to task operations
- GCONST: constant register defining the locality of the given task unit
- GARCH: architecture description register
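The six per-Task-Unit registers above can be summarized as a simple record; this is a descriptive sketch only, with Python ints standing in for the 64-bit hardware fields:

```python
# Descriptive model of per-Task-Unit machine state registers.
from dataclasses import dataclass

@dataclass
class TaskUnitRegs:
    tctx: int = 0    # address of the current task context
    tid: int = 0     # current task ID
    tq: int = 0      # address of the task queue
    te: int = 0      # task exception register
    gconst: int = 0  # locality constant for this task unit
    garch: int = 0   # architecture description register
```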
Context switch benchmark timing
NASPB Context switch benchmark timing