University of Michigan Electrical Engineering and Computer Science FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized.


1 University of Michigan Electrical Engineering and Computer Science
FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized Datapaths
Manjunath Kudlur, Kevin Fan, Michael Chu, Rajiv Ravindran, Nathan Clark, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan

2 Introduction
Bypass network: an important component of the datapath
– Allows data forwarding to reduce pipeline stalls
Full bypass: any FU can bypass from any other FU and from any pipeline stage
Cost of full bypass increases quadratically with the number of FUs
# paths = (# FUs)^2 × bypassable stages × input ports per FU × output ports per FU
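
The cost formula on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `bypass_paths` and the port and stage counts in the loop are illustrative assumptions, not figures from the presentation.

```python
def bypass_paths(num_fus, bypass_stages, in_ports, out_ports):
    """Number of paths in a full bypass network, per the slide's formula."""
    return num_fus ** 2 * bypass_stages * in_ports * out_ports


# Assumed example parameters: 2 bypassable stages, 2 input ports and
# 1 output port per FU. Doubling the FU count quadruples the path count.
for num_fus in (2, 4, 8):
    print(num_fus, bypass_paths(num_fus, 2, 2, 1))
```

With these assumed parameters, each doubling of the FU count multiplies the number of paths by four, which is the quadratic growth the slide refers to.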

3 Case for Bypass Customization
Only a few bypasses are heavily utilized
The heavily utilized bypasses vary widely from application to application
Customize the bypass network in an application-specific processor by removing under-utilized paths

4 Implications of Bypass Customization
[Figure: datapath showing execute stage, pipeline latch, memory stage, pipeline latch, and register file]

5 Implications of Bypass Customization
[Figure: the same datapath, with a DFG containing operations A and B]

6 Implications of Bypass Customization
Latency depends on
– Which FU the operation is scheduled on
– Which FU the operation’s consumer is scheduled on
[Figure: A and B placed on the datapath; annotation: 1 cycle]

7 Implications of Bypass Customization
Latency depends on
– Which FU the operation is scheduled on
– Which FU the operation’s consumer is scheduled on
[Figure: A and B placed on the datapath; annotation: 2 cycles]

8 Implications of Bypass Customization
Latency depends on
– Which FU the operation is scheduled on
– Which FU the operation’s consumer is scheduled on
Latency of an operation is no longer constant
– Varies per consumer
[Figure: A and B placed on the datapath; annotation: 3 cycles]
Bypass customization introduces non-uniform operation latencies
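
One way to picture non-uniform latency is a lookup keyed by the producer FU and the consumer FU. The sketch below is a hypothetical model, not the authors' machine description: the name `operand_latency`, the `bypasses` table, and its entries are made up for illustration. The 3-cycle register-file fallback matches the assumption stated on later slides.

```python
# Cycles to pass a value through the register file when no bypass exists
# (the value assumed in the presentation's later examples).
REGFILE_LATENCY = 3


def operand_latency(bypasses, producer_fu, consumer_fu):
    """Cycles before a consumer on consumer_fu can read a value from producer_fu."""
    return bypasses.get((producer_fu, consumer_fu), REGFILE_LATENCY)


# Made-up partial bypass network: (producer FU, consumer FU) -> latency.
bypasses = {("A", "A"): 1, ("A", "B"): 1, ("B", "C"): 2}

for consumer_fu in ("A", "B", "C"):
    print("A ->", consumer_fu, operand_latency(bypasses, "A", consumer_fu))
```

The same producer has a different latency for each consumer placement, so the scheduler can no longer treat an operation's latency as a single constant.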

9 Effects on List Scheduler (LS)
Used widely in many compilation systems
Assign each operation to a free FU at the earliest time (Greedy!)
When more than one free FU is available, pick one arbitrarily
WHILE (Readylist is non-empty) DO
  op ← next unscheduled operation in priority order ;
  stime ← earliest time when op can be scheduled ;
  WHILE (no free resource available to execute op at stime) DO
    stime ← stime + 1 ;
  END
  res ← free resource capable of executing op ;
  schedule (op, res, stime) ;
END
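
The pseudocode above can be sketched as runnable Python. This is a simplified illustration under stated assumptions: every FU can execute every operation, latency is uniform, and `ops` is already in priority order with producers listed before consumers. The function name `list_schedule` and the DFG edges in `preds` are hypothetical, since the transcript does not preserve the edges of the example DFG.

```python
def list_schedule(ops, preds, fus, latency=1):
    """Greedy list scheduling: earliest cycle, first free FU."""
    busy = set()   # (fu, cycle) slots already occupied
    placed = {}    # op -> (fu, cycle)
    for op in ops:
        # Earliest cycle at which all operands are available.
        stime = max((placed[p][1] + latency for p in preds.get(op, [])), default=0)
        # Advance while no FU is free at stime.
        while all((fu, stime) in busy for fu in fus):
            stime += 1
        # Arbitrary choice: take the first free FU.
        res = next(fu for fu in fus if (fu, stime) not in busy)
        busy.add((res, stime))
        placed[op] = (res, stime)
    return placed


# Assumed edges for a 6-op DFG: op 1 feeds 2, 3, 4; those feed 5 and 6.
preds = {2: [1], 3: [1], 4: [1], 5: [2, 3], 6: [3, 4]}
schedule = list_schedule([1, 2, 3, 4, 5, 6], preds, ["A", "B", "C"])
length = max(cycle for _, cycle in schedule.values()) + 1
print(schedule, length)
```

With uniform latency the arbitrary FU choice is harmless. The later slides show why it stops being harmless once latency depends on the FU pair.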

10 LS on Full Bypass Machine
Operations have 1-cycle latency. Machine with full bypass network.
[Figure: FUs A, B, and C; DFG with ops 1 to 6 drawn in rows 1 / 2 3 4 / 5 6]

11 LS on Full Bypass Machine
Operations have 1-cycle latency. Machine with full bypass network.
Schedule so far (ops issued per cycle on FUs A, B, C):
Cycle 0: 1
Cycle 1:
Cycle 2:

12 LS on Full Bypass Machine
Operations have 1-cycle latency. Machine with full bypass network.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle 0: 1
Cycle 1: 2 3 4
Cycle 2: 5 6
Schedule length = 3 cycles

15 LS on Full Bypass Machine
Operations have 1-cycle latency. Machine with full bypass network.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle 0: 1
Cycle 1: 2 3 4
Cycle 2: 5 6
Schedule length = 3 cycles
Choice of FU does not affect schedule length in a machine with full bypass.

16 LS on Partial Bypass Machine
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
[Figure: FUs A, B, and C; DFG with ops 1 to 6 drawn in rows 1 / 2 3 4 / 5 6]
Schedule (ops issued per cycle on FUs A, B, C):
Cycle 0: 1
Cycle 1: 2
Cycle 2:
Cycle 3: 3 4
Cycle 4: 6 5
Schedule length = 5 cycles

17 LS on Partial Bypass Machine
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle 0: 1
Cycle 1: 2 3 4
Cycle 2: 6 5
Schedule length = 3 cycles

18 LS on Partial Bypass Machine
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle 0: 1
Cycle 1: 2 3
Cycle 2:
Cycle 3: 4
Cycle 4: 5 6
Schedule length = 5 cycles

19 LS on Partial Bypass Machine
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle 0: 1
Cycle 1: 2 3
Cycle 2:
Cycle 3: 4
Cycle 4: 5 6
Schedule length = 5 cycles
Choice of FU affects schedule length drastically in a machine with partial bypass. Arbitrary choice is no good!

20 Greediness of LS
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
[Figure: FUs A, B, and C; partial DFG with ops 1 to 4 drawn in rows 1 / 2 3 4; empty schedule for cycles i to i+4]
Consider scheduling op 1 at the earliest time

21 Greediness of LS
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle i:
Cycle i+1: 1
Cycle i+2: 2
Cycle i+3:
Cycle i+4: 3 4
Greedily scheduling op 1 at cycle i+1 delays ops 3 and 4

22 Greediness of LS
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle i:
Cycle i+1: 1
Cycle i+2: 2 3
Cycle i+3:
Cycle i+4: 4
Greedily scheduling op 1 at cycle i+1 delays op 4

23 Greediness of LS
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
Schedule (ops issued per cycle on FUs A, B, C):
Cycle i:
Cycle i+1:
Cycle i+2: 1 (delayed 1 cycle)
Cycle i+3: 2 3 4
Delaying ops could improve the schedule. Being greedy is no good!

24 FLASH: Goals
Keep the List Scheduling framework; it is fast and widely used
Effectively deal with non-uniform latencies
– Intelligently select from among multiple co-equal choices
– Avoid greedy choices by delaying schedule slots

25 Observation I
Consider FU choices for operation A:
[Figure: operations A and B]

26 Observation I
Consider FU choices for operation A:
[Figure: operations A and B; annotations: No Good!, 3 cycle delay]

27 Observation I
Consider FU choices for operation A:
An FU with a low latency path to a consumer FU is good
– Thus, the consumer operation won’t be delayed
[Figure: operations A and B; annotations: Good!, No delay]

28 Observation I
Consider FU choices for operation A:
An FU with a low latency path to a consumer FU is good
– Thus, the consumer operation won’t be delayed
[Figure: operations A, B, and C; annotations: Good ???, No delay, 3 cycle delay]

29 Observation I
Consider FU choices for operation A:
An FU with a low latency path to a consumer FU is good
– Thus, the consumer operation won’t be delayed
Same observation extends to the consumer’s consumer, and so on
[Figure: operations A, B, and C; annotations: Better!, No delay]
An FU which does not delay the consumers is a good choice

30 Observation II
Consider FU choices for operation A:
[Figure: DFG with operations A, B, C, and D; chains labeled Slack 1 and Slack 0]

31 Observation II
Consider FU choices for operation A:
Not all consumers are equal
[Figure: operations A, B, C, and D; annotations: Good ???, No delay, 3 cycle delay]

32 Observation II
Consider FU choices for operation A:
Not all consumers are equal
It’s better to delay a non-critical consumer
Criticality ∝ 1 / Slack
[Figure: operations A, B, C, and D; annotations: Better!, No delay, 3 cycle delay]
An FU which does not delay a critical chain of consumers is a good choice

33 The FLASH Technique
Compute the merit (FLASH_RANK) of each FU choice for an operation
FLASH_RANK: weighted estimate of schedule lengths of the dependence chains of an operation
Schedule the operation on the FU with the best FLASH_RANK
Avoid greediness by delaying schedule slot, if necessary
FLASH_RANK(op, FU) = MAX over c of [ 1 / (Slack(c) + 1) × Estimated schedule length of c ], where c is a dependence chain of op

34 FLASH_RANK Example
[Figure: DFG with operations A, B, C, and D; chains labeled Slack 1 and Slack 0]
FLASH_RANK(op, FU) = MAX over c of [ 1 / (Slack(c) + 1) × Estimated schedule length of c ], where c is a dependence chain of op

35 FLASH_RANK Example
[Figure: DFG with operations A, B, C, and D; chains labeled Slack 1 and Slack 0]
FLASH_RANK(A, Green FU) = ?

36 FLASH_RANK Example
[Figure: operations A, B, C, and D; annotation: Cycle 1]
FLASH_RANK(A, Green FU) = MAX( 1 / (1 + 1) × 1, ... )

37 FLASH_RANK Example
[Figure: operations A, B, C, and D; annotations: Cycle 1, Cycle 4]
FLASH_RANK(A, Green FU) = MAX( 0.5, 1 / (0 + 1) × 4 )

38 FLASH_RANK Example
FLASH_RANK(A, Green FU) = MAX( 0.5, 4 ) = 4

39 FLASH_RANK Example
[Figure: DFG with operations A, B, C, and D; chains labeled Slack 1 and Slack 0]
FLASH_RANK(A, Yellow FU) = ?

40 FLASH_RANK Example
[Figure: operations A, B, C, and D; annotation: Cycle 1]
FLASH_RANK(A, Yellow FU) = MAX( 1 / (1 + 1) × 1, ... )

41 FLASH_RANK Example
[Figure: operations A, B, C, and D; annotations: Cycle 1, Cycle 2]
FLASH_RANK(A, Yellow FU) = MAX( 0.5, 1 / (0 + 1) × 2 )

42 FLASH_RANK Example
FLASH_RANK(A, Yellow FU) = MAX( 0.5, 2 ) = 2
Choose Yellow FU for op A
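
The worked example can be reproduced in a few lines. This is a minimal sketch of the formula as stated on the slides, assuming each dependence chain is summarized as a `(slack, estimated_schedule_length)` pair. The function name `flash_rank` and that pair representation are illustrative choices, and the lower rank is treated as better because the slides pick the Yellow FU with rank 2 over the Green FU with rank 4.

```python
def flash_rank(chains):
    """MAX over chains of estimated schedule length weighted by 1 / (slack + 1)."""
    return max(est_len / (slack + 1) for slack, est_len in chains)


# (slack, estimated schedule length) per dependence chain of op A,
# taken from the example slides.
ranks = {
    "Green": flash_rank([(1, 1), (0, 4)]),   # MAX(0.5, 4)
    "Yellow": flash_rank([(1, 1), (0, 2)]),  # MAX(0.5, 2)
}

# Pick the FU with the smallest rank.
best_fu = min(ranks, key=ranks.get)
print(ranks, best_fu)
```

The slack weighting discounts the non-critical chain, so the choice is driven by the zero-slack chain, which the Yellow FU completes sooner.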

43 Some Practical Considerations
Impractical to estimate the schedule length of an entire dependence chain (a few tens of operations)
– Truncate dependence chains to manageable depths, say 2 or 3 (look-ahead depth)
Impractical to calculate schedule lengths of all dependence chains together
– Many dependence chains originate from an operation
– Consider dependence chains independently
– Ignore resource constraints between dependence chains
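
Truncating dependence chains to a look-ahead depth could be sketched as below. This is an assumed illustration rather than the authors' implementation: the function name `truncated_chains`, the successor-map representation, and the example DFG are hypothetical. Each chain is enumerated independently, in line with the simplification listed above.

```python
def truncated_chains(succs, op, depth):
    """All dependence chains starting at op, cut off after `depth` consumers."""
    if depth == 0 or not succs.get(op):
        return [[op]]
    return [
        [op] + rest
        for consumer in succs[op]
        for rest in truncated_chains(succs, consumer, depth - 1)
    ]


# Made-up DFG: A feeds B and C, both feed D, and D feeds E.
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

print(truncated_chains(succs, "A", 1))
print(truncated_chains(succs, "A", 2))
```

A small depth keeps the number of chains per operation manageable, at the price of ignoring anything beyond the look-ahead window.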

44 Experiments
Implemented in the TRIMARAN compiler framework
Evaluated on MediaBench and SPECint2000
Machine is a 9-wide VLIW (4I, 2F, 2M, 1B)
Application-specific bypass network [Fan ’03]
– 30% of the cost of a full bypass network

45 Comparisons
Baseline is the performance achieved by the traditional list scheduler
Global Resource Preference (GRP) algorithm [Fan ’03]
– Global pre-scheduling phase assigns FU preferences to operations based on Bottom-Up Greedy (BUG) schedule estimates
– List scheduler uses these preferences as hints while scheduling

46 FLASH vs. GRP

47 Bypass Utilization

48 Conclusion
Developed an effective scheduling heuristic for machines with customized bypass interconnect
– Intelligent FU choice
– Avoid greediness
Average performance improvement of 25% over baseline
– Bypass paths utilized better
Could be applied to other cases of non-uniform latencies

49 Questions

50 Backup


