Compiler and Architecture Design for Coarse-Grained Programmable Accelerators
  Mahdi Hamzeh
  Ph.D. in Computer Science, School of Computing, Informatics, and Decision Systems Engineering
  June 26, 2015
Trends in Silicon Computing
  [Figure: layered timeline in which each era adds a layer on top of the previous ones: Technology, μ-architecture, Multi-threading, Multi-cores, Heterogeneity]
6/26/15 Compiler and Architecture Design for CGRAs
Why Heterogeneous Computing?
  Efficient resource allocation based on run-time information
  Each type of processing element has features suited to one class of computation
  Applications execute in phases; each phase is a different class of computation
  A significant fraction of the silicon area will be dark [1]
  [Figure: power vs. performance for LP Core, HP Core, DSP, GPU, FPGA, and HW ACC]
  LP Core: low-power in-order general-purpose core
  HP Core: high-performance out-of-order general-purpose core
  HW ACC: hardware accelerator
HW Accelerators are Expensive!
  High design, test, and verification cost (HW ACC and FPGA): engineering cost, time to market
  Building a specialized HW ACC is expensive and time-consuming
  [Figure: system design cost vs. performance for LP Core, HP Core, DSP, GPU, FPGA, and HW ACC]
HW Accelerators: Low Utilization, Limited Programmability
  Specialized for one application: HW ACC
  Specialized for a class of computation: DSP, GPU
  Run-time configuration overhead: FPGA
  A HW ACC performs well for only one application; it cannot be reused for another application, even one whose phases belong to a similar class of computation
  [Figure: flexibility vs. performance for LP Core, HP Core, DSP, GPU, FPGA, and HW ACC]
Software Programmable Accelerators: Opportunities and Challenges
  Programmability
  Compiler support drives down costs
  A software-programmable accelerator (SW ACC) aims to close the cost gap
  [Figures: flexibility vs. performance and system design cost vs. performance for LP Core, HP Core, DSP, GPU, FPGA, HW ACC, and SW ACC]
Coarse-Grained Reconfigurable Architectures
CGRA Designs in Literature: ADRES, 60 GOPS/W
CGRA Designs in Literature: TilePro64, 192 GOPS at 23 W
Problems Addressed in This Dissertation
  Contributions (what I did in this dissertation):
  CGRA compiler problems: problem definition, complexity analysis
  CGRA design
  CGRA system integration
CGRA Accelerates Loops Using Modulo Scheduling
  [Figure: the target application, specified in C, consists of a serial region, a repetitive region (loop), and a serial region; the execution trace of the loop includes a prolog and an epilog]
Modulo Scheduling
  II (initiation interval) is the performance metric
  [Figure: modulo schedule over time of operations a through g for loop iterations 1 through 4]
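To make the role of II concrete, here is a minimal sketch, not the dissertation's scheduler. It assumes single-cycle operations and a hypothetical one-iteration schedule for operations a through g; the start cycles and the value II = 2 are illustrative and are not taken from the figure. An operation that starts at cycle t occupies kernel slot t mod II, and a new iteration starts every II cycles.

```python
def modulo_slot(start_cycle, ii):
    """Kernel slot occupied by an operation that starts at start_cycle."""
    return start_cycle % ii


def total_cycles(iterations, ii, schedule_length):
    """Cycles needed when a new iteration starts every ii cycles."""
    return (iterations - 1) * ii + schedule_length


# Hypothetical one-iteration schedule (operation -> start cycle).
schedule = {"a": 0, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3, "g": 3}
II = 2
# Single-cycle operations are assumed, so the length is last start + 1.
schedule_length = max(schedule.values()) + 1

# Group operations by the kernel slot they occupy.
kernel = {}
for op, cycle in sorted(schedule.items()):
    kernel.setdefault(modulo_slot(cycle, II), []).append(op)

print(kernel)
print(total_cycles(4, II, schedule_length))
```

For a large iteration count the total is dominated by the `(iterations - 1) * ii` term, which is why a smaller II means higher throughput.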
CGRA Modulo Scheduling: Problem Definition
  Define what a valid mapping is
  Map operations to a subset of the resources
  Map every data dependency to a path, under certain conditions
  Minimize II
Problem Definition
  Important characteristics: routing, re-computing, or both
  Epimorphism between the computation graph and the resource graph
  Identified the necessary conditions that a scheduled computation graph must satisfy
  Mapping is NP-complete (reduction from the 3-partition problem)
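The notion of a valid mapping can be illustrated with a small checker. This is a simplified sketch under stated assumptions, not the formal definition from the dissertation: PEs sit on a mesh, every operation takes one cycle, a consumer must run exactly one cycle after its producer on the same or a neighboring PE, and no two operations may share a PE in the same slot modulo II. Routing through extra nodes and re-computation are left out. All names (`adjacent`, `is_valid_mapping`, `placement`) are hypothetical.

```python
def adjacent(pe_a, pe_b):
    """Mesh adjacency: the same PE or a direct horizontal/vertical neighbor."""
    (r1, c1), (r2, c2) = pe_a, pe_b
    return abs(r1 - r2) + abs(c1 - c2) <= 1


def is_valid_mapping(edges, placement, ii):
    """Check a placement {op: ((row, col), cycle)} against the assumed rules."""
    # Rule 1: each (PE, cycle mod II) slot holds at most one operation.
    used = set()
    for pe, cycle in placement.values():
        slot = (pe, cycle % ii)
        if slot in used:
            return False
        used.add(slot)
    # Rule 2: each dependency connects adjacent PEs in consecutive cycles.
    for producer, consumer in edges:
        pe_p, t_p = placement[producer]
        pe_c, t_c = placement[consumer]
        if t_c - t_p != 1 or not adjacent(pe_p, pe_c):
            return False
    return True


# Hypothetical example: c consumes the results of a and b.
edges = [("a", "c"), ("b", "c")]
good = {"a": ((0, 0), 0), "b": ((1, 1), 0), "c": ((0, 1), 1)}
late = {"a": ((0, 0), 0), "b": ((1, 1), 0), "c": ((0, 1), 2)}
clash = {"a": ((0, 0), 0), "b": ((1, 0), 0), "c": ((0, 0), 1)}

print(is_valid_mapping(edges, good, ii=2))
print(is_valid_mapping(edges, late, ii=2))
print(is_valid_mapping(edges, clash, ii=1))
```

A real mapper must also search for such a placement while minimizing II; the checker only states the constraints that search has to satisfy.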
Problems Addressed in This Dissertation
  Contributions (what I did in this dissertation):
  CGRA compiler problems: problem definition, complexity analysis, mapping algorithm
  CGRA design
  CGRA system integration
CGRA Modulo Scheduling Policies
  Existing literature addresses this problem using the following policies:
  Integrated methods: brute force, edge-centric, node-centric, nature-inspired
  Decomposition methods: partitioning, nature-inspired
Assumptions and Limitations
  A memory miss stops execution
  A load/store queue resolves memory dependencies
  Only single-assignment instructions are supported
  No system calls
  No function calls
  Single exit condition
EPIMap
  Decomposition: scheduling, placement
  Constructive
  Evolves the computation graph based on the resource graph
  Adjusts the resource graph (MII)
  Efficient placement
  How we address the problem, and why we do it better
EPIMap notable features and policies
Re-Scheduling
Resource Allocation Problem
Resource Allocation: Supporting Multi-cycle Operation
Resource Allocation: Supporting Pipelined Resources
Register Allocation
Rotating and Non-Rotating Register Files
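As a rough illustration of the distinction named in the slide title above, the sketch below models a rotating register file in the usual software-pipelining sense: a logical register index is offset by a base that changes once per iteration, so a value written as register 0 in one iteration is read as register 1 in the next. A non-rotating file would use the logical index directly. The class name, the rotation direction, and the file size are assumptions for illustration, not the dissertation's hardware design.

```python
class RotatingRegisterFile:
    """Toy model: physical index = (logical index + base) mod size."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0

    def _physical(self, logical):
        return (logical + self.base) % len(self.regs)

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def rotate(self):
        """Called once per iteration; shifts every value up one logical index."""
        self.base = (self.base - 1) % len(self.regs)


rf = RotatingRegisterFile(4)
rf.write(0, 10)   # iteration 1 produces a value in logical register 0
rf.rotate()
rf.write(0, 20)   # iteration 2 reuses logical register 0 without overwriting
rf.rotate()
print(rf.read(1), rf.read(2))
```

The point of rotation is that overlapping iterations can all use the same register name in their code while their live values stay in distinct physical registers.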
Problems Addressed in This Dissertation
  Contributions (what I did in this dissertation):
  CGRA compiler problems: problem definition, complexity analysis, mapping algorithm
  CGRA design
  Control flow acceleration
  CGRA system integration
Control Flow Acceleration
Partial Predication
  [Figure: data-flow graph with nodes a, b, c, e, f, h, et, and ef, and its mapping onto the CGRA; the mapping is annotated with the number 3]
Full Predication
  [Figure: data-flow graph with nodes a, b, c, e, f, and h, and its mapping onto the CGRA; the mapping is annotated with the number 4]
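The difference between the two predication schemes named above can be sketched in software terms. This is an illustration of the general techniques, not the dissertation's implementation. In partial predication both paths execute unconditionally and a select operation picks the result; in full predication each operation is guarded by the predicate and only the enabled one writes the destination. The function names and the example computation are hypothetical, and reading `e_t` and `e_f` as true-path and false-path values is an assumption.

```python
def partial_predication(a, b):
    """Both paths execute; a select chooses which result is kept."""
    pred = a > b
    e_t = a - b                       # true-path value, always computed
    e_f = b - a                       # false-path value, always computed
    return e_t if pred else e_f       # select operation


def guarded(pred, operation, old_value):
    """A predicated operation: it writes its result only when pred is true."""
    return operation() if pred else old_value


def full_predication(a, b):
    """Each operation carries a predicate; both target the same destination."""
    pred = a > b
    dest = None
    dest = guarded(pred, lambda: a - b, dest)        # enabled when pred is true
    dest = guarded(not pred, lambda: b - a, dest)    # enabled when pred is false
    return dest


for a, b in [(7, 3), (3, 7), (5, 5)]:
    print(partial_predication(a, b), full_predication(a, b))
```

Both versions compute the same result; what changes on a CGRA is the resource cost. Partial predication spends an extra operation on the select, while full predication needs the two guarded operations to share one destination.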
Dual-Issue
  [Figure: data-flow graph with nodes a, b, c, e, f, h, et, and ef, and the corresponding dual-issue graph with nodes a, b, c, e, f, and h]
Mapping with Dual-Issue
  [Figure: mapping of nodes a, b, c, e, f, and h onto the CGRA; the mapping is annotated with the number 2]
Hardware Support
CGRA Compiler Flow
State of the Art Before EPIMap/REGIMap
  DRESC: a simulated-annealing-based mapping algorithm
  Integrated mapping policy
  Supports multi-cycle operations
  Supports pipelined PEs
  Extended with register allocation
  Has been shown to generate better mappings than other mapping algorithms
EPIMap
  DRESC: simulated-annealing-based
  MII = max(ResMII, RecMII)
  4 × 4 CGRA, mesh interconnect, 1-cycle latency
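The minimum initiation interval (MII) is the lower bound that a mapper starts from: the larger of the resource bound (ResMII) and the recurrence bound (RecMII), as is standard in modulo scheduling. The sketch below uses the textbook definitions and assumes a homogeneous array in which any PE can execute any operation. The operation counts and recurrence cycles are hypothetical inputs, not data from the experiments.

```python
import math


def res_mii(num_ops, num_pes):
    """Resource bound: every operation needs one PE slot per iteration."""
    return math.ceil(num_ops / num_pes)


def rec_mii(recurrences):
    """Recurrence bound over (total latency, iteration distance) cycles.

    With no loop-carried dependence the bound is 1, the smallest legal II.
    """
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)


def mii(num_ops, num_pes, recurrences):
    """Lower bound on II: the tighter of the two constraints wins."""
    return max(res_mii(num_ops, num_pes), rec_mii(recurrences))


# Hypothetical loops on a 4 x 4 CGRA (16 PEs).
print(mii(num_ops=40, num_pes=16, recurrences=[]))        # resource-bound
print(mii(num_ops=20, num_pes=16, recurrences=[(3, 1)]))  # recurrence-bound
```

A mapper typically tries II = MII first and increases II only when no valid mapping is found at the current value.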
Mapping and Register Allocation-Single Cycle
Mapping and Register Allocation-Pipelined PEs
Summary of EPIMap/REGIMap vs. DRESC
  Configuration            Performance Ratio   Compilation Time Ratio
  Single cycle (NO-RA)     1.31X               138X
  Single cycle, 2 Regs     1.73X               240X
  Single cycle, 4 Regs     1.6X                209X
  Single cycle, 8 Regs     1.5X                163X
  Pipelined (NO-RA)        1.45X               192X
  Pipelined, 2 Regs        1.83X               317X
  Pipelined, 4 Regs        1.81X               289X
  Pipelined, 8 Regs        1.68X               227X
Mapping Loops With Conditional Instructions
CGRA Research Framework
Summary
  Problem definition: supports routing and re-computation
  Complexity analysis: reduction from the 3-partition problem
  Counterintuitive discovery: re-computation can improve performance
  Computation graph and necessary conditions
  EPIMap: approximates II progressively; effective iterative scheduling algorithm
Summary
  Placement problem formulation: support of multi-cycle operations, support of pipelined resources, constructive method
  REGIMap: integrated placement and register allocation
  Support of conditionals: full predication, partial predication, dual-issue
  Integration with the LLVM compiler framework
Summary
  CGRA design: ISA, rotating and non-rotating register files, dual-issue support, RTL implementation and synthesis
  CGRA simulation framework: CGRA model in gem5
Future Directions
  Support of system calls
  Mapping with memory optimization
  Software prefetching in mapping
  Just-in-time compilation of kernels
  Offload decisions at run time
  Speculative execution support for CGRAs
Backup
Backup-Scheduling Success
Clique-Resource Allocation Attempts
Step by Step Example