Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System


1 Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan, Electrical Engineering and Computer Science

2 Introduction
Emerging applications have high performance, cost, energy demands
–H.264, wireless, software radio, signal processing
–10-100 Gops required
–200 mW power budget
Applications dominated by tight loops processing large amounts of streaming data
[Figure: handset features and connectivity: 3.5G (HSDPA), WiMax, stereo headset, TV out, PC / Mac, memory card, 20 GB HD. Source: ARM 2005]

3 Loop Accelerators
Order-of-magnitude performance and efficiency wins
–Viterbi: 100x speedup vs. ARM9
Automated C → gates solution
Correct by construction
Close designer productivity gap
Achieve short time-to-market

4 Loop Accelerator Template
Parameterized execution resources, storage, connectivity
Hardware realization of modulo scheduled loop

5 Loop Accelerator Design Flow
[Figure: design flow. Inputs: C code (.c) and performance (throughput). Stages, in order: FU allocation (abstract architecture); modulo schedule (scheduled ops laid out over time and FUs); build datapath (concrete architecture with FUs and register files); instantiate architecture; synthesize. Outputs: Verilog (.v), control signals, loop accelerator.]

6 Modulo Scheduling and Datapath Derivation
Schedule to abstract architecture (FUs)
Determine register and interconnect requirements from schedule
[Figure: source code "r1 = Mem[r2]; r3 = r1 + 12"; a schedule on FU1 and FU2 with LOAD at time 1 and ADD at time 4; the derived datapath with a MEM unit, an adder (+), and the constant 12.]

7 Cost Sensitive Scheduling
Different scheduling alternatives not equal
Traditional scheduling is hardware unaware
Intelligent scheduling needed to reduce hardware cost
[Figure: two alternative schedules of the ops +1, +2, LD1, LD2 on FU1, FU2, FU3 over times 0 to 2, each with its resulting datapath.]

8 Scheduling to Reduce Cost
Hardware cost is function of final schedule
Increased hardware sharing = reduced cost
Reusing hardware is “free”
Traditional metrics (register pressure) not sufficient
No additional cost for longer lifetime
[Figure: FU diagrams with values labeled 1 to 4.]

9 Initial Approach: Greedy
Standard iterative modulo scheduler, augmented with hardware cost model
Choose alternative which increases cost the least
while unscheduled ops remain {
    get valid alternatives for op
    for each alternative {
        get hardware cost
    }
    schedule op using min-cost alternative
    update hardware cost model
}
Hardware cost = FU cost + Storage cost + Wire cost
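The loop on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' scheduler: it places ops once in the given order without the backtracking of an iterative modulo scheduler, it models FU cost only (storage and wire cost are omitted), and the `OP_COST` values, the function name `greedy_schedule`, and the four-op example loop are all hypothetical.

```python
# Hypothetical per-opcode FU costs in arbitrary units (not from the slides).
OP_COST = {"add": 1, "ld": 4}

def greedy_schedule(ops, deps, num_fus, ii, horizon):
    """Place ops one at a time on the cheapest free (FU, time) alternative.

    ops:  {name: opcode}, listed in topological order
    deps: {name: [producer names]}, unit latency assumed
    Returns (placement, fu_cost), placement being {name: (fu, time)}.
    """
    placed = {}
    busy = set()                                  # taken (fu, time mod II) slots
    fu_ops = {f: set() for f in range(num_fus)}   # opcodes each FU must support
    for name, kind in ops.items():
        earliest = max((placed[p][1] + 1 for p in deps.get(name, [])), default=0)
        best = None
        for t in range(earliest, horizon):
            for f in range(num_fus):
                if (f, t % ii) in busy:
                    continue                      # modulo resource conflict
                added = 0 if kind in fu_ops[f] else OP_COST[kind]
                candidate = (added, t, f)         # cost first, then earliest time
                if best is None or candidate < best:
                    best = candidate
        if best is None:
            raise RuntimeError(f"no free slot for {name}")
        _, t, f = best
        placed[name] = (f, t)
        busy.add((f, t % ii))
        fu_ops[f].add(kind)                       # update the cost model
    total = sum(OP_COST[k] for kinds in fu_ops.values() for k in kinds)
    return placed, total

# Two visiting orders for the same hypothetical loop (two load -> add chains).
deps = {"add1": ["ld1"], "add2": ["ld2"]}
chain_order = {"ld1": "ld", "add1": "add", "ld2": "ld", "add2": "add"}
loads_first = {"ld1": "ld", "ld2": "ld", "add1": "add", "add2": "add"}
_, cost_chain = greedy_schedule(chain_order, deps, num_fus=3, ii=2, horizon=6)
_, cost_loads = greedy_schedule(loads_first, deps, num_fus=3, ii=2, horizon=6)
```

With these toy numbers the chain order ends at FU cost 10 (two FUs that each support both a load and an add), while the loads-first order ends at 5 (one load FU, one add FU). Each step took the locally cheapest alternative in both runs, which is the kind of order dependence a purely local cost view produces.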

10 Results – Greedy Scheduling
5% average cost savings
Local scope → local minima
Much more cost savings possible
[Chart: cost broken down into FU, storage, and MUX.]

11 Optimal Modulo Scheduling
Optimal modulo scheduling extends [Eichenberger ’97]
$\text{Storage cost} = \sum_i \text{width}_i \times \text{depth}_i$
$\text{FU cost} = \sum_i \text{cost}(\text{FU}_i)$
[Figure: a loop with ops +1, +2, LD3, +4, −5, and the search space in which each op (Op1, Op2, Op3, ...) is assigned one (FU #, time) pair from (1,0), (1,1), (2,0), (2,1), (3,0), (3,1).]
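The two cost terms on this slide can be evaluated directly once a schedule fixes the storage elements and FUs. The sketch below assumes that width is measured in bits and depth in entries, and that the two terms share a unit so they can be added; the `Storage` type, the function names, and the cost table passed in are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Storage:
    width: int  # bits per entry (assumed unit)
    depth: int  # number of entries

def storage_cost(storages):
    """Sum of width_i * depth_i over all storage elements."""
    return sum(s.width * s.depth for s in storages)

def fu_cost(fus, cost_of):
    """Sum of cost(FU_i), looked up in a caller-supplied table."""
    return sum(cost_of[fu] for fu in fus)

def hardware_cost(storages, fus, cost_of):
    """Objective value for one candidate schedule (storage + FU terms only)."""
    return storage_cost(storages) + fu_cost(fus, cost_of)
```

For example, `hardware_cost([Storage(32, 2), Storage(16, 4)], ["add", "mul"], {"add": 100, "mul": 800})` adds a storage term of 128 to an FU term of 900.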

12 Results – Optimal Scheduling
27% average cost savings
[Chart: cost broken down into FU, storage, and MUX.]

13 Problem Decomposition
Exact solutions are not practical
–$(\#\text{FU} \times \text{II} \times \text{stages})^{\#\text{ops}}$ possible schedules
–20 lines of C code → 100 hours
–Excessive runtimes even for modest-size loops
Decompose into more manageable sub-problems
–Partitioned scheduling
–Time-space decomposition
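To get a feel for the size of the search space quoted on this slide, the count can be computed for a concrete loop. The parameter values below are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
def schedule_space_size(num_fus, ii, stages, num_ops):
    """Each op independently picks one of num_fus * ii * stages (FU, time) slots."""
    return (num_fus * ii * stages) ** num_ops

# Hypothetical loop: 20 ops, 4 FUs, II = 4, 2 stages.
size = schedule_space_size(num_fus=4, ii=4, stages=2, num_ops=20)
print(f"{size:.2e} candidate schedules")
```

Here each op has 32 possible slots, so the count is 32^20 = 2^100, far beyond what exhaustive enumeration can cover.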

14 Partitioned Scheduling
Partition the operations into small groups
Schedule groups of operations sequentially
–Account for hardware contribution of previously scheduled groups
–Backtrack if infeasible state reached
[Figure: a five-op graph (ops 1 to 5) split into groups; each group is passed in turn to the optimal modulo scheduler.]
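A minimal sketch of the sequential, group-by-group idea follows. It makes several assumptions that are not in the slides: brute-force enumeration stands in for the optimal modulo scheduler, only FU cost is modeled, groups are given in topological order, and an infeasible group raises an error where a full implementation would backtrack. `OP_COST` and all function names are hypothetical.

```python
from itertools import product

# Hypothetical per-opcode FU costs in arbitrary units (not from the slides).
OP_COST = {"add": 1, "ld": 4}

def total_fu_cost(placement, ops):
    """Each FU pays once for every distinct opcode placed on it."""
    kinds = {(f, ops[n]) for n, (f, _) in placement.items()}
    return sum(OP_COST[k] for _, k in kinds)

def place_group(group, ops, deps, placed, num_fus, ii, horizon):
    """Exhaustively place one group on top of the already-placed ops."""
    slots = [(f, t) for t in range(horizon) for f in range(num_fus)]
    best_cost, best_trial = None, None
    for choice in product(slots, repeat=len(group)):
        trial = {**placed, **dict(zip(group, choice))}
        # Resource check: at most one op per (FU, time mod II) slot.
        mrt = [(f, t % ii) for f, t in trial.values()]
        if len(set(mrt)) != len(mrt):
            continue
        # Dependence check with unit latency; producers are already in trial.
        if any(trial[p][1] + 1 > trial[n][1]
               for n in group for p in deps.get(n, [])):
            continue
        cost = total_fu_cost(trial, ops)   # includes earlier groups' hardware
        if best_cost is None or cost < best_cost:
            best_cost, best_trial = cost, trial
    return best_trial, best_cost

def partitioned_schedule(groups, ops, deps, num_fus, ii, horizon):
    placed, cost = {}, 0
    for group in groups:
        placed, cost = place_group(group, ops, deps, placed, num_fus, ii, horizon)
        if placed is None:
            raise RuntimeError("infeasible state; a full scheduler would backtrack")
    return placed, cost

ops = {"ld1": "ld", "ld2": "ld", "add1": "add", "add2": "add"}
deps = {"add1": ["ld1"], "add2": ["ld2"]}
_, cost_by_kind = partitioned_schedule(
    [["ld1", "ld2"], ["add1", "add2"]], ops, deps, num_fus=3, ii=2, horizon=4)
_, cost_by_chain = partitioned_schedule(
    [["ld1", "add1"], ["ld2", "add2"]], ops, deps, num_fus=3, ii=2, horizon=4)
```

On this toy loop, grouping ops that can share an FU (both loads, then both adds) reaches FU cost 5, while grouping along dependence chains reaches 10, so the choice of partition affects the final hardware cost.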

15 Operation Partitioning
Traditional partitioning: minimize edge cuts
–Does not necessarily lead to good cost
Goal: maximize hardware sharing opportunities within a group
[Figure: dataflow graph of +, LD, <<, and * operations divided into groups.]

16 Results – Partitioned Scheduling
8% average cost savings
With large number of partitions, similar to greedy
[Chart: cost broken down into FU, storage, and MUX.]

17 Partition Size for Sharp
Improve cost by considering more ops at a time

18 Time-Space Decomposition
Reduce scheduling complexity
View all operations together
Optimize for register depth during time assignment, register width and FU cost during space assignment
[Figure: a five-op graph (ops 1 to 5) scheduled two ways. "Time, space": ops are first assigned to time 0 or time 1, then to FU1, FU2, FU3. "Space, time": ops are first assigned to FU 1, FU 2, FU 3, then to times.]
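The "time, space" ordering can be sketched as two small phases. This is an illustration under assumptions that go beyond the slide: the time phase uses plain ASAP placement as a stand-in for minimizing register depth, the space phase enumerates FU assignments and minimizes FU cost only (register width is not modeled), and `OP_COST` and the function names are hypothetical.

```python
from itertools import product

# Hypothetical per-opcode FU costs in arbitrary units (not from the slides).
OP_COST = {"add": 1, "ld": 4}

def assign_times(ops, deps, num_fus, ii):
    """Time phase: ASAP times, with at most num_fus ops per modulo slot."""
    if len(ops) > num_fus * ii:
        raise ValueError("more ops than (FU, slot) pairs at this II")
    times, load = {}, [0] * ii
    for name in ops:                       # ops assumed in topological order
        t = max((times[p] + 1 for p in deps.get(name, [])), default=0)
        while load[t % ii] >= num_fus:     # slot full, try the next cycle
            t += 1
        times[name] = t
        load[t % ii] += 1
    return times

def assign_fus(ops, times, num_fus, ii):
    """Space phase: with times fixed, pick the FU per op that minimizes FU cost."""
    names = list(ops)
    best = None
    for fus in product(range(num_fus), repeat=len(names)):
        slots = {(f, times[n] % ii) for n, f in zip(names, fus)}
        if len(slots) != len(names):
            continue                       # two ops share an (FU, slot) pair
        kinds = {(f, ops[n]) for n, f in zip(names, fus)}
        cost = sum(OP_COST[k] for _, k in kinds)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, fus)))
    return best

ops = {"ld1": "ld", "add1": "add", "ld2": "ld", "add2": "add"}
deps = {"add1": ["ld1"], "add2": ["ld2"]}
times = assign_times(ops, deps, num_fus=3, ii=2)
cost, fu_of = assign_fus(ops, times, num_fus=3, ii=2)
```

Each phase searches a much smaller space than the joint (FU, time) problem, at the price that decisions made in the first phase constrain what the second phase can share.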

19 Results – Time-Space Scheduling
Time, space: 19% average cost savings
Space, time: 20% average cost savings
[Chart: cost broken down into FU, storage, and MUX.]

20 Real Cost Savings
Viterbi, naïve scheduler: 0.66 mm²
Viterbi, space-time decomposed scheduler: 0.37 mm²
43.2% overall area savings

21 Conclusion
Automated C → loop accelerator synthesis system
Modulo scheduler must be cost aware
Decomposition methods make problem tractable
–20% average cost savings with space-time decomposition
–Importance of global view of all operations
Individual savings up to 43%
Compile times of 1 minute – 30 minutes

22 Questions?
For more information: http://cccp.eecs.umich.edu

