1 Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures
Hyunchul Park†, Kevin Fan†, Scott Mahlke†, Taewook Oh‡, Heeseok Kim‡, Hong-seok Kim‡
† University of Michigan
‡ Samsung Advanced Institute of Technology
October 28, 2008

2 Coarse-Grained Reconfigurable Architecture (CGRA)
Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration

3 CGRA : Attractive Alternative to ASICs
Suitable for running multimedia applications for future embedded systems
High throughput, low power consumption, high flexibility
Morphosys : 8x8 array with RISC processor
SiliconHive : hierarchical systolic array
ADRES : 4x4 array with tightly coupled VLIW
[Figure: Morphosys, SiliconHive, and ADRES, with the annotations "viterbi at 80Mbps", "h.264 at 30fps", and "50-60 MOps/mW"]

4 Scheduling in CGRA
Sparse interconnect and distributed register files
No dedicated routing resources : FUs are used for routing
Need explicit routing of operands by compiler
[Figure: conventional VLIW (FUs with a central RF) vs. CGRA (FUs with local RFs)]

5 Scheduling Difficulties
VLIW : routing is guaranteed by the central RF
CGRA : multiple possible routes
Compiler is responsible for finding routes
Routing can easily be blocked by other operations
[Figure: routing over time, VLIW vs. CGRA]

6 Objective of This Work
Modulo scheduling technique for CGRAs
Exploit loop-level parallelism by overlapping execution of iterations
Customized approach based on characteristics of CGRAs
Achieve fast compile time and good performance
Huge scheduling space, distributed resources
Naïve approach can result in either poor solution or long compile time
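
To make the overlap of iterations concrete: in modulo scheduling a new iteration starts every II cycles (the initiation interval), so an operation scheduled at cycle t occupies its FU at every cycle congruent to t modulo II. The following is a minimal Python sketch of the standard modulo reservation table; the class and method names are illustrative assumptions and do not come from the presentation.

```python
class ModuloReservationTable:
    """Tracks FU usage per (cycle mod II), as in standard modulo scheduling."""

    def __init__(self, num_fus, ii):
        self.ii = ii
        # One row per cycle of the steady-state kernel, one column per FU.
        self.table = [[None] * num_fus for _ in range(ii)]

    def is_free(self, fu, time):
        # Cycles t and t + k*II share the same row, because iterations overlap.
        return self.table[time % self.ii][fu] is None

    def reserve(self, fu, time, op):
        """Reserve the slot for `op`; return False if another op holds it."""
        if not self.is_free(fu, time):
            return False
        self.table[time % self.ii][fu] = op
        return True


if __name__ == "__main__":
    mrt = ModuloReservationTable(num_fus=4, ii=2)
    print(mrt.reserve(0, 0, "a"))  # slot (FU 0, row 0) is taken by "a"
    print(mrt.reserve(0, 2, "b"))  # cycle 2 maps to row 0 again, so this conflicts
```

A smaller II means more overlap and higher throughput, but also fewer rows in the table, which is why routing conflicts become more likely as the schedule tightens.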

7 Traditional Approach : Node-centric
[Figure: schedule grid of time vs. FU 0 to FU 4, with producers P1 and P2 placed and candidate slots 1 to 10 for consumer C]
Operations are placed first, then routing is performed
Visit all candidate slots to find the solution

8 Node-centric Inefficiency 1
[Figure: schedule grid of time vs. FU 0 to FU 4, with producers P1 and P2 and candidate slots 1 to 10 for consumer C]
Routing for edge P1 → C is attempted even to slots that are not reachable

9 Node-centric Inefficiency 2
[Figure: schedule grid of time vs. FU 0 to FU 4, with producers P1 and P2 and candidate slots 1 to 10 for consumer C]
Routing that was already performed is repeated for each candidate slot

10 Our Approach : Edge-centric
[Figure: node-centric vs. edge-centric scheduling on a grid of time vs. FU 0 to FU 4, with producers P1 and P2 and consumer C]
Start routing without placing the operation
Placement occurs during routing
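
The idea that placement occurs during routing can be sketched as a shortest-path search that starts at the producer's slot and stops at the first slot able to host the consumer. This is only an illustration under stated assumptions, not the authors' implementation: the 1-D array where each FU reaches itself and its neighbors in the next cycle, and the names `neighbors`, `route_edge`, `can_place`, and `slot_cost`, are all hypothetical.

```python
import heapq


def neighbors(fu, num_fus):
    """FUs reachable in one cycle on a hypothetical 1-D array: self and adjacent FUs."""
    return [f for f in (fu - 1, fu, fu + 1) if 0 <= f < num_fus]


def route_edge(src, num_fus, max_time, occupied, can_place, slot_cost=lambda s: 1):
    """Dijkstra search from the producer slot `src` = (fu, time).

    Returns (routing_slots, placement), where `placement` is the slot chosen
    for the consumer, or None if no suitable slot is reachable by `max_time`.
    """
    heap = [(0, src, [])]
    seen = set()
    while heap:
        cost, (fu, t), path = heapq.heappop(heap)
        if (fu, t) in seen:
            continue
        seen.add((fu, t))
        if (fu, t) != src and can_place((fu, t)):
            # The consumer is placed as a side effect of the routing search.
            return path, (fu, t)
        if t + 1 > max_time:
            continue
        for nf in neighbors(fu, num_fus):
            nxt = (nf, t + 1)
            if nxt in occupied or nxt in seen:
                continue
            # Slots passed through on the way become routing slots.
            new_path = path if (fu, t) == src else path + [(fu, t)]
            heapq.heappush(heap, (cost + slot_cost(nxt), nxt, new_path))
    return None


if __name__ == "__main__":
    # Producer at FU 0, cycle 0; the consumer can only run on FU 2 (assumed).
    print(route_edge((0, 0), 5, 4, {(1, 1)}, lambda s: s[0] == 2))
```

Because the search only expands slots that are actually reachable from the producer, it never attempts a route to an unreachable candidate, and a single search replaces one routing attempt per candidate slot. Raising `slot_cost` for a slot steers the search away from it.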

11 Benefit 1 : Fewer Routing Calls
[Figure: node-centric vs. edge-centric scheduling on a grid of time vs. FU 0 to FU 4, with producers P1 and P2 and consumer C]
Node-centric : 11 routing calls for P1 → C
Edge-centric : 1 routing call for P1 → C
Reduce compile time with fewer routing calls

12 Benefit 2 : Global View
[Figure: node-centric vs. edge-centric routing from producer P to consumer C on a grid of time vs. FU 0 to FU 4, with slots annotated 10 and 1]
Assume slot 0 is a precious resource (better to save it for later use)
Node-centric greedily picks slot 1
Edge-centric can avoid slot 0 by simply assigning a high cost

13 Edge-centric Modulo Scheduling
It’s all about edges
  Scheduling is constructed by routing ‘edges’
  Placement is integrated into routing process
  Global perspective for EMS
Scheduling order of edges
  Prioritize edges to determine scheduling order
Routing optimization
  Develop contention model for routing resources

14 1: Edge Prioritization
Focus on # consumers : simple edges / high-fanout edges
Height-based priority
Give high priority to high-fanout edges
Edges scheduled later will likely use extra resources
Extra resources in simple edges are just being wasted
Extra resources in high-fanout edges can be helpful
  Other consumers can make use of those

15 Fanout Clustering
Our approach : the opposite
Give priority to simple edges
Operations connected in simple edges form a cluster
Schedule simple edges within a cluster
Schedule high-fanout edges when consumers are visited
17 of 81 loops in H.264 show better throughput
Only 1 shows worse throughput
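
A minimal sketch of how such clusters could be formed, assuming a "simple edge" is one whose producer has exactly one consumer and a "high-fanout edge" is one whose producer has several. The function name `fanout_clusters` and the dictionary representation of the DFG are illustrative assumptions, not the authors' data structures.

```python
from collections import defaultdict


def fanout_clusters(dfg):
    """Split DFG edges into clusters of simple edges and a list of high-fanout edges.

    `dfg` maps each producer operation to the list of its consumers.
    """
    simple = [(p, cs[0]) for p, cs in dfg.items() if len(cs) == 1]
    fanout = [(p, c) for p, cs in dfg.items() if len(cs) > 1 for c in cs]

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Operations joined by simple edges end up in the same cluster.
    for p, c in simple:
        parent[find(p)] = find(c)

    clusters = defaultdict(list)
    for p, c in simple:
        clusters[find(p)].append((p, c))
    return list(clusters.values()), fanout


if __name__ == "__main__":
    dfg = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": ["f"]}
    clusters, fanout = fanout_clusters(dfg)
    print(clusters)
    print(fanout)
```

A scheduler following the slide would then route the simple edges cluster by cluster and defer each high-fanout edge until its consumer is visited.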

16 2: Routing Optimization
Routing is guided by cost associated with each routing slot
Intelligent routing cost metrics are important
Minimize # routing resources for current edge
  Static cost : fixed positive cost for each resource
Minimize # routing resources for other edges to producer/consumer
  Affinity cost : use common consumer information
Avoid routing failures for other edges
  Probabilistic cost : predict future resource usage
routing cost = F(static cost, affinity cost, probabilistic cost)
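
The slide leaves the combining function F unspecified. As one possible reading, the sketch below assumes F is a weighted sum; the weights, the `SlotCost` type, and the function names are hypothetical and serve only to show how the three components could steer slot selection.

```python
from dataclasses import dataclass


@dataclass
class SlotCost:
    static: float         # fixed positive cost of using the resource
    affinity: float       # penalty related to operations sharing a consumer
    probabilistic: float  # estimated chance that another edge needs this slot


def routing_cost(c, w_affinity=1.0, w_prob=1.0):
    """Assumed form of F: a weighted sum of the three components."""
    return c.static + w_affinity * c.affinity + w_prob * c.probabilistic


def cheapest_slot(candidates, **weights):
    """Return the name of the candidate slot with the lowest routing cost."""
    return min(candidates, key=lambda name: routing_cost(candidates[name], **weights))


if __name__ == "__main__":
    candidates = {
        "fu0": SlotCost(static=1, affinity=0, probabilistic=1.0),
        "fu1": SlotCost(static=1, affinity=2, probabilistic=0.0),
        "fu2": SlotCost(static=1, affinity=0, probabilistic=0.33),
    }
    print(cheapest_slot(candidates))
```

With the probabilistic term switched off (`w_prob=0`), the contested slot `fu0` looks as cheap as `fu2`, which shows why the slide adds a term that predicts future resource usage.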

17 Affinity Cost Heuristic
[Figure: two placements of operations A and B and their common consumer C on a grid of time vs. FU 0 to FU 3; one is labeled "Routing Cost = 2", the other "Routing Cost = 0"]
Affinity cost : utilize common consumer information
Affinity value : how close common consumer is in DFG
Place operations with high affinity close to each other
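
The slide does not give the formula for the affinity value. The sketch below assumes one simple definition: the inverse of the distance to the nearest common consumer, so that two operations feeding the same consumer directly get the highest affinity. Both the formula and the function names are assumptions made for illustration.

```python
from collections import deque


def consumer_distances(dfg, op):
    """BFS distance from `op` to every operation reachable through consumer edges."""
    dist = {}
    queue = deque([(op, 0)])
    while queue:
        node, d = queue.popleft()
        for c in dfg.get(node, []):
            if c not in dist:
                dist[c] = d + 1
                queue.append((c, d + 1))
    return dist


def affinity(dfg, a, b):
    """Assumed affinity: 1 / distance to the nearest common consumer, or 0.0 if none."""
    da, db = consumer_distances(dfg, a), consumer_distances(dfg, b)
    common = da.keys() & db.keys()
    if not common:
        return 0.0
    return 1.0 / min(max(da[c], db[c]) for c in common)


if __name__ == "__main__":
    dfg = {"A": ["C"], "B": ["C"], "D": ["X"], "X": ["C"], "E": ["Y"]}
    print(affinity(dfg, "A", "B"))  # both feed C directly
    print(affinity(dfg, "A", "D"))  # D reaches C through X
    print(affinity(dfg, "A", "E"))  # no common consumer
```

A router could turn this value into a cost by penalizing slots that move an operation farther from its high-affinity partners.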

18 Probabilistic Cost Heuristic
[Figure: grid of time vs. FU 0 to FU 4 with producers P1 and P2, consumers C1 and C2, and a store ST]
Three possible routes, all using same # routing slots

19 Probabilistic Cost Heuristic
[Figure: grid of time vs. FU 0 to FU 4 with producers P1 and P2, consumers C1 and C2, and a store ST]
Need to consider other unplaced edges/operations
Slots that might be used for routing P2 → C2
Slots that might be used for placing ST

20 Probabilistic Cost Heuristic
[Figure: grid of time vs. FU 0 to FU 4 with slots annotated with usage probabilities 0.33, 1.0, and 0.5]
Probabilities on future usage of slots are calculated and guide routing of P1 → C1
Route in the middle is selected
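
How the probabilities are derived is not spelled out on the slide. One simple model, assumed here, treats every candidate route of a not-yet-scheduled edge as equally likely, so a slot's usage probability is the fraction of those routes that pass through it. The function names and the example routes are hypothetical.

```python
from collections import Counter


def usage_probability(candidate_routes):
    """Fraction of an unscheduled edge's candidate routes that use each slot."""
    counts = Counter(slot for route in candidate_routes for slot in set(route))
    n = len(candidate_routes)
    return {slot: counts[slot] / n for slot in counts}


def pick_route(routes, prob):
    """Choose the route for the current edge that is least likely to block others."""
    return min(routes, key=lambda r: sum(prob.get(s, 0.0) for s in r))


if __name__ == "__main__":
    # Hypothetical candidate routes for a future edge; slots are (fu, time).
    future = [[(3, 1), (3, 2)], [(3, 1), (4, 2)], [(3, 1), (2, 2)]]
    prob = usage_probability(future)
    # Candidate routes for the edge being scheduled now, all of equal length.
    routes = [[(1, 1), (1, 2)], [(2, 1), (2, 2)], [(3, 1), (3, 2)]]
    print(pick_route(routes, prob))
```

Under this model a slot shared by all three future routes gets probability 1.0 and a slot used by only one gets about 0.33, so routes of equal length are no longer equally attractive.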

21 EMS System Flow
[Figure: EMS flow diagram. Inputs: DFG and CGRA. Stages labeled "Preprocessing" and "Schedule". Boxes: fanout clustering, prioritize edges, cost calculation, select target edge, perform routing, place operations, route to others. Output: final schedule]

22 Experimental Setup
214 loops from highly optimized media applications
  H.264, 3D graphics, AAC, MP3
Target architecture
  4x4 heterogeneous CGRA (6 memory, 4 multiply)
  Local RF for each PE
  Mesh-plus interconnect : mesh + 2 hop connections
Compared to 3 other solutions
  IMS : iterative modulo scheduling, no routing optimization
  NMS : same heuristics as EMS, but in a node-centric way
  DRESC : IMEC’s simulated annealing

23 Results
Performance : normalized throughput of loops
Max throughput is determined by # ops in a loop and # resources
Compile time : for all 214 loops
[Chart annotations, as extracted: +10%, 0.5x, +24%, 2x, -2%, -18x]
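
The bound on throughput from # ops and # resources corresponds to the standard resource-constrained minimum initiation interval: for each resource class, the number of operations divided by the number of units, rounded up, with the maximum over all classes taken. A minimal sketch follows; the unit counts match the 4x4 array with 6 memory and 4 multiply units from the experimental setup, while the operation counts and the function name are made up for illustration.

```python
import math


def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound on the initiation interval (II)."""
    return max(math.ceil(op_counts[k] / unit_counts[k]) for k in op_counts)


if __name__ == "__main__":
    units = {"total": 16, "memory": 6, "multiply": 4}
    # Hypothetical loop: 40 operations, of which 14 access memory and 5 multiply.
    ops = {"total": 40, "memory": 14, "multiply": 5}
    print(res_mii(ops, units))
```

On a CGRA this is only a lower bound, because FUs that forward operands are unavailable for computation in those cycles.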

24 Conclusion
EMS is a good match for scheduling in CGRA
  Routing is more important than placement
Edge-centric approach allows fast compile time
  18x speedup over simulated annealing
Intelligent routing cost metrics allow good performance
  24% improvement over IMS, 98% performance of existing solution

25 Questions?

