Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis Brian Fields 1 Rastislav Bodik 1 Mark Hill 2 Chris Newburn 3 1 UC-Berkeley, 2.

1 Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis. Brian Fields (1), Rastislav Bodik (1), Mark Hill (2), Chris Newburn (3). (1) UC-Berkeley, (2) UW-Madison, (3) Intel

2 Outline. Interaction Cost: Bottleneck analysis complicated by parallelism; Parallelism causes interactions; Qualitative: parallel and serial interactions; Quantitative: interaction cost (icost); Icost case study: designing a deep pipeline. Hardware profiler: Icost “shotgun” profiler; Replace current performance counters

3 Bottleneck analysis is hard. Why? Microarchitectural parallelism complicates performance understanding. Example: a branch mispredict and a full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing. Other cases: two parallel cache misses; a multiply and a window stall.

4 What we want from bottleneck analysis: performance cost (or reward) = speedup when the bottleneck is removed. Q: What if two bottlenecks interact?

5 Our solution: measure interactions. Two parallel cache misses (each 100 cycles): miss #1 (100), miss #2 (100). Cost(miss #1) = 0, Cost(miss #2) = 0, Cost({miss #1, miss #2}) = 100. Aggregate cost > sum of individual costs (100 > 0 + 0) ⇒ parallel interaction. icost = aggregate cost - sum of individual costs = 100 - 0 - 0 = 100
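
The arithmetic on this slide can be written as a small helper. This is only a minimal sketch: the function name `icost` and its parameter names are illustrative and not taken from the talk, while the numbers are the ones on the slide.

```python
def icost(aggregate_cost, individual_costs):
    """Interaction cost = aggregate cost minus the sum of the individual costs."""
    return aggregate_cost - sum(individual_costs)

# Two parallel 100-cycle cache misses (values from the slide):
# removing either miss alone saves nothing, removing both saves 100 cycles.
parallel = icost(aggregate_cost=100, individual_costs=[0, 0])
print(parallel)  # 100: positive, so the two misses interact in parallel
```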

6 Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost ⇒ parallel interaction (miss #1, miss #2). 2. Zero icost ⇒ ?

7 Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost ⇒ parallel interaction (miss #1, miss #2). 2. Zero icost ⇒ independent (miss #1 ... miss #2). 3. Negative icost ⇒ ?

8 Negative icost. Two serial cache misses (data dependent): miss #1 (100), miss #2 (100), in parallel with ALU latency (110 cycles). Cost(miss #1) = ?

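9 Negative icost. Two serial cache misses (data dependent): miss #1 (100), miss #2 (100), in parallel with ALU latency (110 cycles). The two misses take 200 cycles back to back; removing either one, or both, leaves the 110-cycle ALU chain as the limit, so each cost is 200 - 110 = 90. Cost(miss #1) = 90, Cost(miss #2) = 90, Cost({miss #1, miss #2}) = 90. icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90. Negative icost ⇒ serial interaction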
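
The costs on this slide can be reproduced with a toy timing model. This is a sketch under stated assumptions: `exec_time` and `cost` are hypothetical helpers, and the model assumes the two dependent misses run back to back while the ALU work proceeds independently in parallel.

```python
def exec_time(miss1, miss2, alu):
    """Toy model: two dependent misses in series, in parallel with ALU work."""
    return max(miss1 + miss2, alu)

def cost(base, idealized):
    """Cost of a set of events = cycles saved when those events are idealized."""
    return base - idealized

MISS, ALU = 100, 110                              # latencies from the slide
base = exec_time(MISS, MISS, ALU)                 # 200 cycles
cost_m1 = cost(base, exec_time(0, MISS, ALU))     # idealize miss #1 only
cost_m2 = cost(base, exec_time(MISS, 0, ALU))     # idealize miss #2 only
cost_both = cost(base, exec_time(0, 0, ALU))      # idealize both misses
serial_icost = cost_both - cost_m1 - cost_m2
print(cost_m1, cost_m2, cost_both, serial_icost)  # 90 90 90 -90
```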

10 Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost ⇒ parallel interaction (miss #1, miss #2). 2. Zero icost ⇒ independent (miss #1 ... miss #2). 3. Negative icost ⇒ serial interaction. Diagram labels: ALU latency, miss #1, miss #2; Branch mispredict, Fetch BW; Load-Replay Trap, LSQ stall

11 Why care about serial interactions? (miss #1 (100) and miss #2 (100), ALU latency 110 cycles.) Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us). Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.

12 Icost Case Study: Deep pipelines. Deep pipelines cause long-latency loops: level-one (DL1) cache access, issue-wakeup, branch misprediction, ... But we can often mitigate them indirectly. Assume a 4-cycle DL1 access; how to mitigate? Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses? Really, we are looking for serial interactions!

13-19 Icost Case Study: Deep pipelines [Figure, built up over seven slides: dependence graph with F, E, and C nodes for instructions i1 to i6 and latency-weighted edges; labeled edges: DL1 access, window edge]
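
One way to read costs off a dependence graph like this is to compare critical-path lengths before and after idealizing some edges. The sketch below is illustrative only: the three-edge graph, its latencies, and the helper names are hypothetical (the edge weights of the slide's graph did not survive in this transcript), and "idealizing" is simplified to setting an edge latency to zero.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def critical_path_length(edges):
    """Longest path through a DAG given as {(src, dst): latency}."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    dist = {}
    for node in TopologicalSorter(preds).static_order():
        dist[node] = max((dist[p] + edges[(p, node)] for p in preds[node]), default=0)
    return max(dist.values())

def cost(edges, idealized):
    """Cycles saved when every edge in `idealized` gets zero latency."""
    ideal = {e: (0 if e in idealized else lat) for e, lat in edges.items()}
    return critical_path_length(edges) - critical_path_length(ideal)

def icost(edges, a, b):
    """Interaction cost of two edge sets: cost(a and b) - cost(a) - cost(b)."""
    return cost(edges, a | b) - cost(edges, a) - cost(edges, b)

# Hypothetical toy graph: a load followed by a window stall on one path,
# with unrelated work on a second, parallel path.
edges = {
    ("start", "load_done"): 4,  # assumed DL1 access latency
    ("load_done", "end"): 3,    # assumed window-stall latency
    ("start", "end"): 5,        # assumed parallel work
}
dl1 = {("start", "load_done")}
window = {("load_done", "end")}
print(cost(edges, dl1), cost(edges, window), icost(edges, dl1, window))  # 2 2 -2
```

In this toy graph the DL1 and window edges lie on the same path, so their icost is negative (a serial interaction): shortening either one exposes the same parallel path.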

20-23 Icost Breakdown (6 wide, 64-entry window)

              gcc       gzip      vortex
DL1           18.3 %    30.5 %    25.8 %
DL1+window    -4.2      -15.3     -24.5
DL1+bw        10.0      6.0       15.5
DL1+bmisp     -7.0      -3.4      -0.3
DL1+dmiss     -1.4      -0.4      -1.4
DL1+alu       -1.6      -8.2      -4.7
DL1+imiss     0.1       0.0       0.4
...
Total: 100.0

24-25 Vortex Breakdowns, enlarging the window

Window size   64        128       256
DL1           25.8      8.9       3.9
DL1+window    -24.5     -7.7      -2.6
DL1+bw        15.5      16.7      13.2
DL1+bmisp     -0.3      -0.6      -0.8
DL1+dmiss     -1.4      -2.1      -2.8
DL1+alu       -4.7      -2.5      -0.4
DL1+imiss     0.4       0.5       0.3
...
Total         100.0     80.8      75.0

26 Outline. Interaction Cost: Bottleneck analysis complicated by parallelism; Parallelism causes interactions; Qualitative: parallel and serial interactions; Quantitative: interaction cost (icost); Icost case study: designing a deep pipeline; Exploiting serial interactions. Hardware profiler: Icost “shotgun” profiler; Overcome the limitations of performance counters

27 Profiling goal. Goal: construct the graph (many dynamic instructions). Constraint: can only sample sparsely.

28 Profiling goal. Goal: construct the graph. Constraint: can only sample sparsely. Analogy: genome sequencing of a DNA strand.

32 “Shotgun” genome sequencing: find overlaps among samples. [Figure: DNA]

33 Mapping “shotgun” to our situation: many dynamic instructions. Legend: Icache miss, Dcache miss, Branch misp., No event

34-35 Profiler hardware requirements [Figure not preserved in the transcript; slide 35 adds the label “Match!”]

36 Conclusion. Bottleneck analysis is complicated by parallelism. Interaction cost: parallelism is interpreted with interaction cost (icost); three possibilities: independent, parallel, or serial; applies to all instructions, resources, events. Enabled by the “shotgun” profiler: overcomes limitations of counters.

37 Icost Case Study: Deep pipelines [Figure: dependence graph with F, E, and C nodes for instructions i1 to i6; edge labels: DL1 access; window edge; Decode, rename; Multiply + pipe latency; Icache miss]

38 Profiler software requirements. Software puts the graph together from a skeleton sample and detailed samples (with matching PC).
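
The stitching step named on this slide might look roughly like the following. This is a hedged sketch and not the profiler's actual algorithm: the function `assemble`, the record fields (`exec_latency`, `event`), and the rule of taking the first detailed sample with a matching PC are all assumptions made for illustration.

```python
def assemble(skeleton_pcs, detailed_by_pc):
    """Attach one detailed sample to each PC in a skeleton sample.

    skeleton_pcs:   list of program counters, in dynamic order (assumed format)
    detailed_by_pc: dict mapping a PC to a list of detailed records (assumed format)
    Returns a list of (pc, record) pairs, or None if some PC has no match.
    """
    fragment = []
    for pc in skeleton_pcs:
        candidates = detailed_by_pc.get(pc)
        if not candidates:
            return None  # no detailed sample with a matching PC
        fragment.append((pc, candidates[0]))  # assumption: take the first match
    return fragment

# Hypothetical samples; addresses and fields are made up.
detailed = {
    0x400: [{"exec_latency": 1, "event": None}],
    0x404: [{"exec_latency": 100, "event": "dcache_miss"}],
    0x408: [{"exec_latency": 1, "event": None}],
}
fragment = assemble([0x400, 0x404, 0x408], detailed)
print(fragment)
```

A graph fragment built this way could then be analyzed for costs, provided enough skeleton samples find matching detailed samples.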

39 Compare Icost and Sensitivity Study. Corollary to the DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases. [Figure: dependence graph with F, E, and C nodes for instructions i1 to i6; labeled edge: DL1 access]

40 Compare Icost and Sensitivity Study

41 Sensitivity Study Advantages: more information, e.g., concave or convex curves. Interaction Cost Advantages: easy (automatic) interpretation, since sign and magnitude have well-defined meanings; concise communication, e.g., “DL1 and ROB interact serially”.
