Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures
Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan

Coarse-Grained Reconfigurable Architecture (CGRA)
[Figure: a processing element (PE) containing a functional unit (FU) and a local register file (LRF)]
Array of PEs connected by a mesh-like interconnect
Characterized by array size, node functionality, interconnect, and register file configuration
Executes compute-intensive kernels in multimedia applications

CGRA : Attractive Alternative to ASICs
Suitable for running multimedia applications on embedded systems
  High computation throughput
  Low power consumption and scalability
  High flexibility with fast configuration
MorphoSys: 8x8 array with a RISC processor
  SIMD-style execution of loops
PipeRench: 1-D reconfigurable hardware
  Virtualizes the hardware pipeline
ADRES: 8x8 array with a tightly coupled VLIW
  Modulo scheduling with simulated annealing

Scheduling in CGRA
Different from conventional VLIW
  Sparse interconnect and distributed register files
  No dedicated routing resources
Need a good compiler to exploit the abundance of computing resources
[Figure: conventional VLIW (FU0 to FU3 sharing a central RF) versus CGRA (FU0 to FU3, each with its own LRF)]

Objectives of This Work
Modulo scheduling technique for CGRAs
  Exploit loop-level parallelism by overlapping execution of iterations
Targeting low-cost CGRAs
  Achieve a quality schedule under hardware restrictions
  Fast compilation time

Modulo Scheduling Basics
Expose loop-level parallelism by overlapping execution of iterations
Initiation interval (II)
  A new iteration is started every II cycles
[Figure: overlapped execution of loop iterations, each consisting of operations A, B, C, with successive iterations started II cycles apart]
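To make the overlap concrete, here is a minimal sketch that starts a new iteration every II cycles and lists what runs in each cycle. It assumes a three-operation loop body (A, B, C) issuing one operation per cycle; the function name `overlapped_schedule` and the choice of II = 1 are illustrative assumptions, not taken from the slides.

```python
def overlapped_schedule(ops, ii, iterations):
    """Return {cycle: [(op, iteration), ...]} for a loop body that issues
    one op per cycle, with a new iteration started every `ii` cycles."""
    timeline = {}
    for it in range(iterations):
        for offset, op in enumerate(ops):
            cycle = it * ii + offset  # iteration `it` starts at cycle it * ii
            timeline.setdefault(cycle, []).append((op, it))
    return timeline

# Hypothetical example: body A, B, C with II = 1 and four iterations.
timeline = overlapped_schedule(["A", "B", "C"], ii=1, iterations=4)
for cycle in sorted(timeline):
    print(cycle, timeline[cycle])
```

With these assumptions the four iterations finish in 6 cycles instead of the 12 that back-to-back execution would need, and in steady state one A, one B, and one C from different iterations run in the same cycle.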

Modulo Scheduling for CGRA
Mapping a DFG onto a 3-D scheduling space
  Limited number of scheduling slots: (number of PEs) x II
Minimize routing cost (number of slots used for routing)
Sparse interconnect and distributed register files
Ensure routability of operands
[Figure: a DFG mapped onto the scheduling space of a 4x4 CGRA, with time (II cycles) as the third dimension]
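The slot limit can be sketched as a small modulo reservation table: a slot is identified by a PE position and the cycle taken modulo II, so the table holds (number of PEs) x II entries. The class name `ModuloReservationTable` and its methods are hypothetical, chosen only for illustration.

```python
class ModuloReservationTable:
    """Scheduling slots of a rows x cols PE array under initiation interval ii."""

    def __init__(self, rows, cols, ii):
        self.rows, self.cols, self.ii = rows, cols, ii
        self.slots = {}  # (row, col, time % ii) -> operation name

    @property
    def capacity(self):
        # Total number of scheduling slots: (number of PEs) x II.
        return self.rows * self.cols * self.ii

    def reserve(self, op, row, col, time):
        """Claim the slot for `op`; return False if it is already taken."""
        key = (row, col, time % self.ii)
        if key in self.slots:
            return False
        self.slots[key] = op
        return True

mrt = ModuloReservationTable(rows=4, cols=4, ii=2)
print(mrt.capacity)                    # 4 * 4 * 2 slots
print(mrt.reserve("a", 0, 0, time=1))
print(mrt.reserve("b", 0, 0, time=3))  # same PE and 3 % 2 == 1 % 2: conflict
```

Because cycles 1 and 3 fold onto the same modulo slot when II = 2, the second reservation is rejected even though the two operations are two cycles apart.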

Our Approach
Systematic approach to generate a good schedule in reasonable time
Minimize routing cost
  Convert the scheduling problem into graph embedding
  Leverage a graph embedding algorithm
Ensure routability of operands
  Skewed scheduling space
  Create a narrow but tall scheduling space

1 : Minimize Routing Cost
Routing cost: number of PEs used for routing
  Determined by the positions of producer and consumer
  Minimize the distance between producers and consumers
Height-based list scheduling
  Schedule operations in order of dependence height
  Place consumers close to producers
  Need to carefully place operations of the same height
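The ordering step of height-based list scheduling can be sketched as follows. The sketch assumes an acyclic DFG, assigns height 1 to operations without consumers, and breaks ties by operation id; the DFG and the function names are hypothetical.

```python
from functools import lru_cache

def dependence_heights(consumers):
    """consumers: {op: [ops that consume its result]} for an acyclic DFG.
    Assumption: an op with no consumers has height 1."""
    @lru_cache(maxsize=None)
    def height(op):
        return 1 + max((height(c) for c in consumers.get(op, [])), default=0)
    return {op: height(op) for op in consumers}

def schedule_order(consumers):
    """Ops sorted by decreasing dependence height (ties broken by op id)."""
    heights = dependence_heights(consumers)
    return sorted(heights, key=lambda op: (-heights[op], op))

# Hypothetical six-operation DFG: 1, 2 -> 4; 3 -> 5; 4, 5 -> 6.
dfg = {1: [4], 2: [4], 3: [5], 4: [6], 5: [6], 6: []}
heights = dependence_heights(dfg)
order = schedule_order(dfg)
print(heights)
print(order)
```

Ops 1, 2, and 3 all have the same height here, which is exactly the case where the order alone does not say how to place them relative to each other.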

Scheduling Example – Routing Cost
[Figure: a six-operation DFG (ops 1 to 6) scheduled on a 1x4 CGRA (PE 0 to PE 3) in two ways. The first mapping, which includes the extra entries 4' and 5', has routing cost = 2; the second has routing cost = 0.]
Common consumer information is important!
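A rough way to see why placement changes routing cost is to count the extra hops each operand needs on a 1-D row of PEs. The sketch below is not the example from the slide: the DFG fragment, both placements, and the cost model (a consumer reads directly from its own or an adjacent PE, and each further hop occupies one PE as a routing slot) are assumptions made for illustration.

```python
def routing_cost(edges, pe_of):
    """Count routing slots on a 1-D row of PEs.

    Assumption: a consumer can read directly from its own PE or an adjacent
    one, and every additional hop occupies one PE as a routing slot.
    """
    return sum(max(0, abs(pe_of[p] - pe_of[c]) - 1) for p, c in edges)

# Hypothetical DFG fragment: ops 1 and 3 share consumer 4; op 2 feeds op 5.
# Ops mapped to the same PE are assumed to run in different cycles.
edges = [(1, 4), (3, 4), (2, 5)]
spread = {1: 0, 2: 1, 3: 3, 4: 0, 5: 3}  # op 3 sits far from its consumer
packed = {1: 0, 2: 2, 3: 1, 4: 0, 5: 2}  # ops with a common consumer are adjacent
print(routing_cost(edges, spread), routing_cost(edges, packed))
```

Placing ops 1 and 3 next to each other, because they share consumer 4, removes all routing slots in this toy case.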

Affinity Graph Heuristic
Consider the placement of operations with the same height together
  Use common consumer information
Affinity value between operations
  Measured by the distance to common consumers in the DFG
Construct an affinity graph
  Nodes: operations; edges: affinity values
Place operations connected by affinity edges close to each other
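One possible way to turn "distance to common consumers" into a number is sketched below. The weighting (a shared consumer at distance d contributes 2^(max_depth - d)) and the names `consumers_at` and `affinity` are assumptions for illustration; the slides do not give the exact formula.

```python
def consumers_at(consumers, op, depth):
    """Ops reached from `op` by following exactly `depth` consumer edges."""
    frontier = {op}
    for _ in range(depth):
        frontier = {c for p in frontier for c in consumers.get(p, [])}
    return frontier

def affinity(consumers, a, b, max_depth=2):
    """Assumed weighting: closer common consumers count more."""
    total = 0
    for d in range(1, max_depth + 1):
        shared = consumers_at(consumers, a, d) & consumers_at(consumers, b, d)
        total += len(shared) * 2 ** (max_depth - d)
    return total

# Hypothetical DFG: 1, 2 -> 4; 3 -> 5; 4, 5 -> 6.
dfg = {1: [4], 2: [4], 3: [5], 4: [6], 5: [6], 6: []}
print(affinity(dfg, 1, 2))  # share consumer 4 (distance 1) and 6 (distance 2)
print(affinity(dfg, 1, 3))  # share only consumer 6 (distance 2)
```

Under this weighting ops 1 and 2 get a stronger affinity edge than ops 1 and 3, so a placer would try harder to keep 1 and 2 adjacent.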

Affinity Graph Example
[Figure: a DFG with operations 1 to 5 arranged at heights 3, 2, and 1, the affinity graph derived from it, and two mappings onto a 2x4 CGRA, one bad and one good]
Mapping onto the CGRA: drawing the affinity graph onto the scheduling space

Leveraging Graph Embedding
Graph embedding: drawing a graph onto a target space
Grid layout algorithm by Li & Kurata
  Embeds complicated biochemical networks onto a 2-D grid space
  Simulated annealing
Our scheduling problem is a graph embedding problem
  Draw the affinity graph onto the scheduling space, minimizing edge length
[Figure: process flow of grid layout [Li 2005]]
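The idea of drawing a weighted graph onto a grid while minimizing edge length can be sketched with a generic simulated-annealing loop. This is not the Li & Kurata algorithm or the authors' scheduler; the cost function (weight times Manhattan distance), the cooling schedule, and all names are assumptions for illustration.

```python
import math
import random

def layout_cost(placement, edges):
    """Sum of weight * Manhattan distance over all weighted edges."""
    return sum(w * (abs(placement[a][0] - placement[b][0]) +
                    abs(placement[a][1] - placement[b][1]))
               for a, b, w in edges)

def anneal_layout(nodes, edges, rows, cols, steps=2000, seed=0):
    """Place nodes on distinct grid cells; requires len(nodes) <= rows * cols."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    occupant = list(nodes) + [None] * (len(cells) - len(nodes))
    rng.shuffle(occupant)

    def placement_of(occ):
        return {n: cells[i] for i, n in enumerate(occ) if n is not None}

    cost = layout_cost(placement_of(occupant), edges)
    best, best_cost = placement_of(occupant), cost
    temp = 2.0
    for _ in range(steps):
        # Candidate move: swap the contents of two cells (either may be empty).
        i, j = rng.randrange(len(cells)), rng.randrange(len(cells))
        occupant[i], occupant[j] = occupant[j], occupant[i]
        new_cost = layout_cost(placement_of(occupant), edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = placement_of(occupant), cost
        else:
            occupant[i], occupant[j] = occupant[j], occupant[i]  # reject: undo
        temp = max(temp * 0.995, 0.01)  # geometric cooling with a floor
    return best, best_cost

# Hypothetical affinity graph with five operations on a 2x4 grid.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2, 3), (1, 3, 1), (2, 3, 1), (4, 5, 2)]
best, best_cost = anneal_layout(nodes, edges, rows=2, cols=4)
print(best, best_cost)
```

Heavier affinity edges pull their endpoints together because each unit of distance on them costs more, which is the behavior the affinity graph is meant to induce.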

2 : Ensure Routability of Operands
Resources are reused every II cycles
  Routing can fail due to previously scheduled operations
Backtracking: hard to make forward progress for a CGRA
Take a preventative approach
[Figure: a seven-operation DFG scheduled on a 1x3 CGRA (PE 0 to PE 2) under initiation interval II; routing fails for op 7]

Skewed Scheduling Space
Routing failures should be prevented in advance
Skew the scheduling space
  Stagger it down to the right
  Create a narrow but tall scheduling space
  Operations can be routed to the right
Dynamically adjust the scheduling space
[Figure: ops 1 to 7 scheduled on PE 0 to PE 2 in a skewed scheduling space]
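One simple way to picture a space that staggers down to the right is to let each PE column open a fixed number of cycles after its left neighbor. The rule below (column c opens at cycle c * skew) and the name `open_pes` are assumptions for illustration; the slides do not specify how the skew is computed or adjusted.

```python
def open_pes(cycle, num_pes, skew=1):
    """PEs of a 1-D array that accept operations at `cycle`.

    Assumption: column c opens at cycle c * skew, so the usable space
    staggers down to the right.
    """
    return [pe for pe in range(num_pes) if cycle >= pe * skew]

for cycle in range(4):
    print(cycle, open_pes(cycle, num_pes=3))
```

Early operations are forced onto the leftmost PEs, which leaves the PEs to the right free for later consumers and for routing.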

System Flow

Experimental Setup
Twelve innermost-loop kernels from various domains
Three designs with different RF configurations
  Evaluate the impact of register file sharing
  Dedicated RF, shared RF, central RF

Evaluation of Affinity Heuristic
Results of acyclic scheduling
Average reduction of 59% in routing cost

Modulo Graph Embedding vs. Simulated Annealing
Utilization = (# slots used for computation) / (# total slots)
Compilation time: about 5 s (modulo graph embedding) vs. 5 min to 3 hours (simulated annealing)
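The utilization metric is a direct ratio; a small worked example follows. The array size, II, and operation count are hypothetical numbers, not results from the evaluation.

```python
def utilization(compute_ops, rows, cols, ii):
    """(# slots used for computation) / (# total slots), where the total
    number of slots is (number of PEs) x II."""
    total_slots = rows * cols * ii
    return compute_ops / total_slots

# Hypothetical case: 30 compute operations on a 4x4 array with II = 3.
print(utilization(30, rows=4, cols=4, ii=3))  # 30 / 48
```

Slots spent on routing count toward the total but not toward the numerator, so lower routing cost translates directly into higher utilization at the same II.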

Impact of Register File Configurations

Conclusions
Modulo scheduler targeting low-cost CGRAs
  Provides high computation throughput, scalability, and power efficiency
Two heuristics to generate a good schedule
  Affinity graph heuristic
  Skewed scheduling space
Average utilization of 56-68% across the three designs
Systematic approach allows fast compilation
  All benchmarks finished within 5 s

Questions?
