

1 Probabilistic Predicate-Aware Modulo Scheduling
Mikhail Smelyanskiy¹, Scott Mahlke, Edward Davidson
Department of EECS, University of Michigan
¹ Currently with the System Technology Lab at Intel Corporation

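2 Introduction to Deterministic Predicate-aware Scheduling (DPAS) [Smelyanskiy03]
• Predication eliminates branch instructions but increases resource requirements
• Predicate-aware scheduling oversubscribes resources
  ◦ reduces resource requirements
  ◦ reduces schedule length
[Figure: control-flow graph in which block A ends with "br cond", blocks B and C lie on the F and T edges, and both paths rejoin at D]
Predicated schedule without resource sharing:
Time  FU
0     A
1     p1,p2=cmpp(cond)
2     B if p1
3     C if p2
4     D
Predicate-aware schedule (B and C share one slot):
Time  FU
0     A
1     p1,p2=cmpp(cond)
2     B if p1, C if p2
3     D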

3 Motivation for Probabilistic Predicate-aware Scheduling (PPAS)
• DPAS can only combine A5 with A2, A3 and A4
• What about combining A2 with A3? A3 with A4? A2 with A6?
• PPAS allows much more aggressive sharing than DPAS, but can incur delays due to resource conflicts
[Figure: control-flow graph containing operations A1 to A6, M1, M2 and a branch (br)]

4 Characteristics of Predicated Code
• 52% of time is spent in cyclic regions
• Cyclic PPAS might eliminate up to 38% of all dynamic operations from cyclic regions

5 Outline
• Motivation
• Resource Pressure Problem in Predicated Code
• Probabilistic Predicate Aware Architecture
• Probabilistic Predicate-aware Modulo Scheduling
• Performance Results
• Conclusions

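6 Modulo Scheduling Example
[Figure: dependence graph of the loop body with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; edge labels 2, 1, 0, 1, 1; the predicated path is labeled freq=0.3]
• This control path is taken 30% of the time
• Assumed machine: 1 ALU, 1 MEMORY and 1 BRANCH unit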

7 Traditional Modulo Schedule (Rau 94)
Modulo Schedule:
Time  Iteration i     Iteration i+1
0     +1
1
2     p1=cmpp
3     +2 if p1, br
4                     +1
5     +3 if p1
6     st              p1=cmpp
7                     +2 if p1, br
8
9                     +3 if p1
10                    st
Modulo Scheduled Loop Kernel:
      ALU         MEM   BR
I0    +1
I1    +3 if p1
I2    p1=cmpp     st
I3    +2 if p1          br
Figure labels: II=4, II=5
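The kernel on this slide follows from the usual modulo-scheduling rule that an operation issued at time t occupies row t mod II of the modulo reservation table. The sketch below folds the issue times of iteration i into kernel rows for II = 4. The function name `kernel_rows` and the tuple layout are illustrative choices, and the unit assignments (for example p1=cmpp on the ALU) are read off the kernel table rather than stated in the text.

```python
from collections import defaultdict

def kernel_rows(schedule, ii):
    """Fold a one-iteration schedule into modulo reservation table rows.

    schedule: list of (op_name, unit, issue_time) for a single iteration.
    Returns {row: {unit: [op_name, ...]}} where row = issue_time % ii.
    """
    table = defaultdict(lambda: defaultdict(list))
    for op, unit, time in schedule:
        table[time % ii][unit].append(op)
    return {row: dict(units) for row, units in sorted(table.items())}

# Issue times of iteration i as read from the slide (unit names assumed).
ITERATION = [
    ("+1", "ALU", 0),
    ("p1=cmpp", "ALU", 2),
    ("+2 if p1", "ALU", 3),
    ("br", "BR", 3),
    ("+3 if p1", "ALU", 5),
    ("st", "MEM", 6),
]

kernel = kernel_rows(ITERATION, ii=4)
for row, units in kernel.items():
    print(f"I{row}", units)
```

With II = 4 every unit holds at most one operation per row, which is the condition a traditional (non-predicate-aware) modulo schedule has to satisfy.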

8 Probabilistic Predicate-Aware Modulo Scheduling
[Figure: dependence graph of the example loop (+1, p1=cmpp, +2 if p1, +3 if p1, st, br; freq=0.3)]
Baseline Modulo Schedule (II = 4):
Time  A           M    B
0     +1
1     +3 if p1
2     p1=cmpp     st
3     +2 if p1         br
Deterministic Predicate-Aware Modulo Schedule (II = 4):
Time  A           M    B
0     +1
1     +3 if p1
2     p1=cmpp     st
3     +2 if p1         br
Probabilistic Predicate-Aware Modulo Schedule (II = 3.18):
Time  A                     M    B
0     +1
1     +2 if p1, +3 if p1
2     p1=cmpp               st   br
0.18 expected delay due to conflicts

9 Baseline Architecture Model
• Predicate Register File is only accessed in EXECUTE stage
• Resources from FETCH to EXECUTE are unconditionally reserved
[Figure: pipeline with boxes labeled FETCH, DECODE, DISPATCH, REGISTER READ, PRED READ & EXECUTE and WRITE BACK, plus the Predicate Register File; resources are marked as Must-use or May-use]

10 Extended Predicate-Aware Architecture
[Figure: pipeline with boxes labeled FETCH, DECODE, PRED READ & DISPATCH, REGISTER READ, EXECUTE and WRITE BACK, the Predicate Register File (PRF), and a Resource Conflict Detection and Recovery Unit (stall, conflict detection, conflict recovery); resources are marked as Must-use or May-use]
• Conflict Detection and Recovery Latency (CDRL) can be 0 or 1 cycles

11 Expected Delay Model
• ev is an execution vector
  delay_cycles(ev) = CDRL + dispatch_cycles(ev) - 1
  P(ev) is the probability of occurrence of ev
• P(ev) is computed using disjointness and implication, and assuming independence otherwise
• Example (assume 3 operations, one FU and CDRL=1)
  ED_cfl(op1 if p1, op2 if p2, op3 if p3)
    = (1 + 3 - 1) × P(p1=T, p2=T, p3=T)
    + (1 + 2 - 1) × P(p1=T, p2=T, p3=F)
    + (1 + 2 - 1) × P(p1=F, p2=T, p3=T)
    + (1 + 2 - 1) × P(p1=T, p2=F, p3=T)
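The expected delay model above can be sketched in a few lines of Python for the single-FU case used in the example. This is a minimal sketch, not the authors' implementation: the function name `expected_conflict_delay` is made up for illustration, every predicate is treated as independent (the disjointness and implication rules are left out), and an execution vector with at most one enabled operation is assumed to cause no delay, as in the slide's example where only vectors with two or three true predicates contribute.

```python
from itertools import product

def expected_conflict_delay(probs, cdrl=1):
    """Expected conflict delay for operations sharing ONE functional unit.

    probs: probability that each operation's guarding predicate is true.
    Assumption: predicates are mutually independent.
    """
    expected = 0.0
    for ev in product([True, False], repeat=len(probs)):
        dispatch_cycles = sum(ev)      # one enabled operation per cycle
        if dispatch_cycles <= 1:
            continue                   # at most one user of the unit: no conflict
        p_ev = 1.0
        for is_true, p in zip(ev, probs):
            p_ev *= p if is_true else 1.0 - p
        expected += (cdrl + dispatch_cycles - 1) * p_ev
    return expected

# Two operations, each enabled 30% of the time, CDRL = 1.
print(round(expected_conflict_delay([0.3, 0.3], cdrl=1), 2))  # 0.18
# An unconditional operation sharing with a 30% operation, CDRL = 1.
print(round(expected_conflict_delay([1.0, 0.3], cdrl=1), 2))  # 0.6
```

The first call gives 2 × 0.3 × 0.3 = 0.18, which appears to correspond to the 0.18 expected delay quoted on the slides when the two operations guarded by p1 come from different loop iterations and are treated as independent.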

12 Modulo Scheduling using Expected Delay Model (scheduling operation +3 if p1)
[Figure: dependence graph of the example loop (freq=0.3) next to the SRT and MRT tables, with columns "A may", "MEM may" and "BR may" indexed by time]
Expected Delay due to Conflicts (CDRL = 1)
Conflict terms shown: P conf(+2, +3), P conf(+1, +3), P conf(p1=pred, +3)
Products shown: 2 × 0.3 × 0.3 = 0.18; 2 × 1.0 × 0.3 = 0.60; 2 × 1.0 × 0.3 = 0.60
Total expected delay due to conflicts: 0.18

13 Modulo Scheduling using Expected Delay Model (Finding Expected Initiation Interval, II_exp)
• More than one way to achieve the same II_exp (e.g. 3.2)
Schedule with total expected conflict delay < 0.2:
Time  ALU                   MEM   BR
0     +1
1     +2 if p1, +3 if p1
2     p1=cmpp               st    br
Schedule with total expected conflict delay < 1.2:
Time  ALU                   MEM   BR
0     +1, +2 if p1          st
1     p1=cmpp, +3 if p1           br
• Use binary search to find II_exp
  ◦ start with … and increase till … or a schedule is found
  ◦ … of the schedule found becomes the new upper bound
  ◦ … becomes the new lower bound if no schedule is found
  ◦ upper bound = …, lower bound = …
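The bound-updating rules on this slide can be illustrated with a generic binary search. The slide's own start values and bounds did not survive extraction, so this is only a sketch under assumptions: `try_schedule` is a hypothetical scheduler hook that returns the expected II of a schedule fitting the given budget (or None), the tolerance `tol` is arbitrary, and the toy scheduler simply knows two feasible values taken from the earlier example (3.18 for PPAS, 4.0 for the baseline).

```python
def find_expected_ii(try_schedule, lower, upper, tol=0.05):
    """Binary search for a small expected initiation interval.

    try_schedule(budget): hypothetical hook returning the expected II of a
    schedule whose expected II does not exceed `budget`, or None if the
    scheduler finds no such schedule.
    """
    best = None
    while upper - lower > tol:
        mid = (lower + upper) / 2.0
        achieved = try_schedule(mid)
        if achieved is None:
            lower = mid        # no schedule found: raise the lower bound
        else:
            best = achieved
            upper = achieved   # schedule found: its II is the new upper bound
    return best

def toy_scheduler(budget, feasible=(3.18, 4.0)):
    """Stand-in scheduler: returns the largest known feasible II within budget."""
    fitting = [ii for ii in feasible if ii <= budget]
    return max(fitting) if fitting else None

print(find_expected_ii(toy_scheduler, lower=3.0, upper=5.0))
```

The search assumes feasibility is monotone in the budget: if a schedule exists for some expected II, one also exists for any larger budget.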

14 Performance Results
• Compare the performance of baseline (BASE), deterministic (DPAS) and probabilistic (PPAS) predicate-aware modulo scheduling
• Compiler support: Trimaran and ELCOR [Trimaran99]
• Mediabench [Lee97] benchmark suite was evaluated
• Processor Models (BA = base, PA = predicate-aware)
Machine  Model  Fetch Width  Int ALU  cmpp latency  Memory  CDRL
4-wide   BASE   4            2        1             1       -
4-wide   DPAS   4            2        3             1       -
4-wide   PPAS   4            2        3             1       0 and 1
6-wide   BASE   6            4        1             2       -
6-wide   DPAS   6            4        3             2       -
6-wide   PPAS   6            4        3             2       0 and 1

15 Cyclic PPAS Speedup over BASE (4-wide machine)
• 4-wide cyclic PPAS with CDRL=0 is 20% better than base and 10% better than cyclic DPAS
• Increasing CDRL degrades performance

16 Various Scheduling Measurements (4-wide machine, CDRL = 0)
• Cyclic PPAS reduces II by 32% compared with BASE and by 12% compared with cyclic DPAS
• Expected delay model accurately predicts delay due to conflicts
• Predicate-aware scheduling increases the epilogue size and requires more rotating registers than BASE
Model  II compile  II runtime  Absolute Error  Epilogue Size  # Rotating Registers
BASE   27.6        27.6        0.0%            1.2            14.8
DPAS   23.5        23.5        0.0%            2.2            18.4
PPAS   21.0        20.8        1.6%            4.7            29.9

17 Overall Speedup over BASE with Cyclic PPAS
• Only 52% of regions are scheduled with cyclic PPAS
• Overall, 4-wide cyclic PPAS is 10% better than base and 6-wide cyclic PPAS is 4% better than base

18 Summary of PPAS
• PPAS significantly reduces resource requirements in predicated cyclic code but causes conflicts
  ◦ compiler maximizes sharing in view of expected conflicts
  ◦ PPAS architecture detects and recovers from conflicts
• PPAS improves performance by (cmpplat=3, CDRL=0; column labels as extracted: Overall, Cyclic, vs. Base, vs. DPAS)
  4-wide: 20%, 10%, 6%
  6-wide: 8%, 4%, 3%
• For further discussion, see
  http://www.eecs.umich.edu/~msmelyan/publications.html
  Mikhail Smelyanskiy. Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors. Ph.D. Dissertation, University of Michigan, 2004

19 Questions?

20 Backup Foils

21 Resource Conflict Detection and Recovery Unit
Issued operations in every example (resolved predicate values shown): A1 if 1, A2 if 1, A3 if 1, A4 if 0, A5 if 1
• Design alternatives to dispatch conflicting operations
One operation per assigned FU:
Cycle  ALU1  ALU2
0      A1    A5
1      A2
2      A3
One operation per any FU (not evaluated):
Cycle  ALU1  ALU2
0      A1    A5
1      A2    A3
• Conflict Detection and Recovery Latency (CDRL)
CDRL = 0:
Cycle  ALU1  ALU2
0      A1    A5
1      A2
2      A3
CDRL = 1:
Cycle  ALU1  ALU2
0      conflict detected (dispatch bubble)
1      A1    A5
2      A2
3      A3
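The "one operation per assigned FU" alternative can be mimicked with a small dispatch simulation. This is a sketch under assumptions rather than a description of the actual hardware: the function name `dispatch` is hypothetical, A1 to A4 are assumed to be assigned to ALU1 and A5 to ALU2 (the slide only shows where the enabled operations end up), and CDRL is modeled as that many empty dispatch cycles inserted when any unit has more than one enabled operation.

```python
def dispatch(ops, cdrl=0):
    """Serialize the enabled operations of one oversubscribed issue slot.

    ops: list of (name, assigned_unit, predicate_value) in program order.
    Returns a list of cycles; each cycle maps unit -> dispatched operation.
    """
    queues = {}
    for name, unit, pred in ops:
        if pred:                       # nullified operations need no unit
            queues.setdefault(unit, []).append(name)
    depth = max((len(q) for q in queues.values()), default=0)
    conflict = depth > 1
    # Model CDRL as bubble cycles spent detecting and recovering.
    cycles = [{} for _ in range(cdrl)] if conflict else []
    for i in range(depth):
        cycles.append({u: q[i] for u, q in queues.items() if i < len(q)})
    return cycles

# Unit assignment is an assumption; predicate values are from the slide.
OPS = [
    ("A1", "ALU1", 1),
    ("A2", "ALU1", 1),
    ("A3", "ALU1", 1),
    ("A4", "ALU1", 0),
    ("A5", "ALU2", 1),
]

for cdrl in (0, 1):
    print("CDRL =", cdrl)
    for cycle, issued in enumerate(dispatch(OPS, cdrl)):
        print(" ", cycle, issued or "conflict detected (dispatch bubble)")
```

In this model the extra cycles beyond the first equal CDRL + dispatch_cycles - 1, which is the delay_cycles expression of the expected delay model.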

22 Cyclic PPAS Speedup for Training and Reference Input Sets (4-wide, CDRL=1)

