Www.compaq.com Using Interpretation for Profiling the Alpha 21264a Kip Walker Mike Burrows, Úlfar Erlingsson, Mark Vandevoorde, Carl Waldspurger, William.


1 www.compaq.com Using Interpretation for Profiling the Alpha 21264a Kip Walker Mike Burrows, Úlfar Erlingsson, Mark Vandevoorde, Carl Waldspurger, William E. Weihl and many more.

2 2 / 30 Introduction
[Slide background: a wall of annotated Alpha disassembly (sample counts, addresses, instructions such as ldl, cmpule, bne, cmptlt, adds, muls).]
Changing this ONE instruction will make my Java programs run 2.3% faster! HOW CAN I FIND IT? HOW DO I FIX IT??

3 3 / 30 The Options
• Read the source: not always useful
• Read the assembly: hard, not always useful
• Simulation: very slow, infeasible
• Instrumentation: slow, interference
• Sample-based profiling: not enough detail
• Or use periodic interpretation

4 4 / 30 It’s Not Easy
• A true story
  - Sometimes program X runs twice as long as usual
  - Variance due to # of bytes in environment vars!
    - Base address of main()’s stack had dramatic effect
  - Simulation eventually revealed the problem
• Information requirements
  - Detailed instruction behavior profile
  - Contents of registers
  - Correlated data for nearby instructions

5 5 / 30 Outline
• Out-of-order Processors
• Performance Problems
• Why Interpretation?
• Profiling Infrastructure
• An Example
• Evaluation
• Future Work
• Summary

6 6 / 30 Out-of-order Processors
• Try to exploit instruction-level parallelism
  - Fetch, issue 4 instructions at a time
  - Many function units
  - Retire up to 11 instructions in a cycle
• Fetch in-order
• Execute out of order
• Retire in-order

7 7 / 30 Enemies of Performance
• Bad cache utilization
• Static stalls / dependences
• Branch misprediction
• Illegal re-ordering
} Pipeline traps!

8 8 / 30 Traps
• Processor detects that it let “bad things” happen
  - wrong instructions executed
  - instructions may have seen incorrect data
  => up to 80 in-flight instructions thrown out!
• Branch mispredict:
[Diagram: beq is fetched and predicted !taken; following instructions are fetched and executed; beq executes: TAKEN!; the following instructions are ABORTED!]

9 9 / 30 Memory Order Traps
• Memory operations are freely reordered
• Must enforce consistent view of memory
• Problems are detected dynamically
(a) reordered operations to overlapping bytes - “order” trap
[Diagram: program order vs. execute order for the sequence: load from X ... store to X ... load from X]

10 10 / 30 Troll Traps
(b) accesses resulting in contention for a cache line - “troll” trap
  - not allowed to have more than one outstanding fill request
  - unspecified ordering of responses from L2 cache
  - replay the load until the fill happens
[Diagram labels: L1 data cache, L2 cache, Load from X, Load from Y, X, Y, Miss!, ??]

11 11 / 30 Wrong Size Traps
(c) wide load follows narrow store - “size” trap
[Diagram labels: Store-long mem(x), Load-quad mem(x), Store queue, L1 data cache]
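The address conditions named on the trap slides ("overlapping bytes" for an order trap, "wide load follows narrow store" for a size trap) can be sketched as a small check. This is a minimal Python sketch, not the tool's actual logic: the function name, the `(address, width)` tuples, and the rule that a wider load is reported as "size" before "order" are all assumptions made for illustration. It can only flag a possible cause, since whether the hardware really reordered the accesses is not visible from addresses alone.

```python
def classify_replay_candidate(store, load):
    """Label a (store, load) pair as a possible replay-trap cause.

    store, load: hypothetical (effective_address, width_in_bytes) tuples,
    where the store comes earlier in program order than the load.
    Returns "size", "order", or None.
    """
    s_addr, s_width = store
    l_addr, l_width = load

    # Two byte ranges overlap when each one starts before the other ends.
    shares_bytes = s_addr < l_addr + l_width and l_addr < s_addr + s_width
    if not shares_bytes:
        return None

    # Wide load following a narrow store: the "size" case.
    if l_width > s_width:
        return "size"

    # Otherwise the pair only conflicts if it was reordered: the "order" case.
    return "order"


if __name__ == "__main__":
    # 4-byte store followed by an 8-byte load of the same address.
    print(classify_replay_candidate((0x1000, 4), (0x1000, 8)))
```

A 4-byte store followed by an 8-byte load at the same address is the pattern the talk's MTRT example ends up diagnosing.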

12 12 / 30 A Better Way
• Need a runtime solution
  - Notice when two instructions in a trace “match”
  - Observe effective addresses of memory ops
• Interpret instruction traces
  - Emulate (most) operations
  - Apply statistically to cover whole system
• Extends the power of sample-based profiling

13 13 / 30 Available Information
• Control flow: edge frequencies
  - Return address (in register or on stack)
  - Branch taken direction
• Computed values
  - Function arguments, results
  - Load/store addresses
• Possible replay trap culprits

14 14 / 30 ProfileMe on Alpha 21264a
[Pipeline diagram: stages fetch, map, issue, exec, retire, fed by the icache and branch predictor. On fetch counter overflow, random selection gives an instruction a ProfileMe tag (tagged?). Internal processor registers capture: pc, imiss?, map stall?, taken?, mispredict?, replay?, dtbmiss?, notrap?, retired?, … An interrupt follows.]

15 15 / 30 ProfileMe Interrupt
[Diagram: program instruction stream]
1. execute instructions
2. ProfileMe interrupt (!)
3. Read counters; get PID/PC
4. Log event in hash table
5. interrupt returns
6. execute instructions
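The "log event in hash table" step can be pictured as a table keyed by (PID, PC) that accumulates per-event counts. The sketch below is a hypothetical Python illustration only: the class name `SampleLog`, its layout, and the event-name strings are assumptions. The real handler runs in the kernel, and its data structures are not described on the slide.

```python
from collections import defaultdict


class SampleLog:
    """Aggregate profiling samples per (pid, pc) pair."""

    def __init__(self):
        # (pid, pc) -> {event name -> count}; missing entries start at zero.
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, pid, pc, events):
        """Log one sample: bump the sample count and every event flag that was set."""
        entry = self.counts[(pid, pc)]
        entry["samples"] += 1
        for name in events:
            entry[name] += 1


if __name__ == "__main__":
    log = SampleLog()
    log.record(7, 0x2002D2CC, ["replay"])
    log.record(7, 0x2002D2CC, ["replay", "retired"])
    print(dict(log.counts[(7, 0x2002D2CC)]))
```

Keeping one small record per sampled PC, rather than a full trace, is what lets sampling run continuously with bounded memory.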

16 16 / 30 Interpretation - Value Profiling
[Diagram: program instruction stream]
1. execute native
2. ProfileMe interrupt (!): register contents are passed to the handler
3. interpret in interrupt handler: update regs, memory; log register contents with profile data; partial CFG
4. interrupt returns with new register values
5. execute native

17 17 / 30 Interpreter Details
• Initial register values delivered with interrupt
• Interpret n instructions or until bail
  - PALcode (OS support)
  - Page fault
• Branches and jumps are interpreted
  - can’t detect mispredicts
• Memory accesses are performed
  - can’t detect cache misses
• Final register state updated
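The loop described above (start from the interrupted register state, interpret up to n instructions, bail on PALcode or a page fault, hand back the final registers) can be sketched with a toy interpreter. Everything concrete here is an assumption for illustration: the tuple encoding of instructions, the dict-based memory in which a missing address stands in for a page fault, and the four opcodes handled. The real interpreter decodes Alpha machine code inside the interrupt handler.

```python
def interpret(code, regs, memory, pc, max_insns):
    """Interpret up to max_insns toy instructions starting at pc.

    code:   dict pc -> instruction tuple (hypothetical encoding)
    regs:   dict register name -> int, as delivered with the interrupt
    memory: dict address -> int; a missing address models a page fault
    Returns (next_pc, final_regs, samples, reason), reason in {"bail", "limit"}.
    """
    regs = dict(regs)   # work on a copy; the caller installs the result on return
    samples = []        # (pc, captured value) pairs
    for _ in range(max_insns):
        insn = code.get(pc)
        if insn is None or insn[0] == "call_pal":
            return pc, regs, samples, "bail"          # PALcode or unknown code
        op = insn[0]
        if op == "addq":                              # ("addq", dst, src, imm)
            _, dst, src, imm = insn
            regs[dst] = regs[src] + imm
            samples.append((pc, regs[dst]))           # arithmetic: result value
            pc += 4
        elif op in ("ldq", "stq"):                    # (op, reg, offset, base)
            _, reg, offset, base = insn
            addr = regs[base] + offset
            if addr not in memory:
                return pc, regs, samples, "bail"      # page fault
            if op == "ldq":
                regs[reg] = memory[addr]              # the access is really performed
            else:
                memory[addr] = regs[reg]
            samples.append((pc, addr))                # memory op: effective address
            pc += 4
        elif op == "bgt":                             # ("bgt", reg, target)
            _, reg, target = insn
            pc = target if regs[reg] > 0 else pc + 4  # branch is followed, not predicted
        else:
            return pc, regs, samples, "bail"          # opcode this sketch does not emulate
    return pc, regs, samples, "limit"


if __name__ == "__main__":
    loop = {
        0x210: ("addq", "t5", "t5", 0x10),
        0x214: ("ldq", "t0", 0, "t5"),
        0x218: ("addq", "t4", "t4", -1),
        0x21C: ("bgt", "t4", 0x210),
    }
    state = {"t5": 0x1000, "t4": 2, "t0": 0}
    print(interpret(loop, state, {0x1010: 11, 0x1020: 22}, 0x210, 16))
```

Because the sketch simply follows branches and performs loads directly, it has no notion of prediction or caches, which mirrors the slide's point that mispredicts and cache misses cannot be detected this way.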

18 18 / 30 Values Captured
Arithmetic: result value
Memory op: effective address
Indirect jump: destination address
… and current return address in all cases

19 19 / 30 Interpretation - Replay Traps
[Diagram: program instruction stream]
1. execute native
2. ProfileMe interrupt (!)
3. interpret: track register dependence and effective addresses
4. analyze: report possible culprits as value samples
5. interrupt returns
6. execute native
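The "analyze" step can be sketched as a scan over the memory operations seen during one interpretation window: for each load, every earlier store touching overlapping bytes is recorded as a possible culprit, keyed by the store's PC. This is a hypothetical Python sketch. The tuple layout, the function name, and the restriction to store-then-load pairs are assumptions, and the addresses in the usage example are made up.

```python
from collections import Counter, defaultdict


def possible_culprits(window):
    """Find earlier stores whose bytes overlap each load in the window.

    window: list of (pc, kind, addr, size) tuples in interpreted order,
            with kind being "load" or "store" (hypothetical layout).
    Returns {load_pc: Counter({store_pc: hits})}, i.e. value samples per load.
    """
    culprits = defaultdict(Counter)
    for i, (pc, kind, addr, size) in enumerate(window):
        if kind != "load":
            continue
        for s_pc, s_kind, s_addr, s_size in window[:i]:
            overlaps = s_addr < addr + size and addr < s_addr + s_size
            if s_kind == "store" and overlaps:
                culprits[pc][s_pc] += 1   # the store's PC is the "value" sampled
    return culprits


if __name__ == "__main__":
    window = [
        (0x203F10D0, "store", 0x5000, 4),
        (0x203F10D8, "store", 0x6040, 8),
        (0x2002D2CC, "load", 0x5000, 8),
    ]
    for load_pc, stores in possible_culprits(window).items():
        print(hex(load_pc), [(hex(s), n) for s, n in stores.items()])
```

Accumulating these counters over many windows would yield, for each load, a distribution of candidate culprit PCs, which is the kind of per-instruction value profile that can then be listed next to the replay counts.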

20 20 / 30 Example Profile - MTRT
> dcpiprof $labels $db -pm replays mtrtbase.exe
Column           Total   Period (for events)
------           -----   ------
replays:count      397   126976
===========================================================
replays
 :count       %  procedure                             image
    100  25.19%  ...OctNode.Intersect(...)             mtrtbase.exe
     51  12.85%  java.io.BufferedInputStream.read()    mtrtbase.exe
     48  12.09%  ...Vector.Dot(...)                    mtrtbase.exe

21 21 / 30 Replays in OctNode.Intersect
> dcpilist $labels $db -pm replays \
    '...OctNode.Intersect(...)' mtrtbase.exe
...OctNode.Intersect(...):
replays
 :count
         code elided
      0  0x2002d2a0 stt $f8, 104(sp)
      0  0x2002d2a4 bis a0, a0, s5
      0  0x2002d2a8 bis a1, a1, s6
      0  0x2002d2ac bis a2, a2, s4
      0  0x2002d2b0 stt $f19, 8(sp)
      0  0x2002d2b4 bsr ra, 0x20022250
      0  0x2002d2b8 bis v0, v0, a0
      0  0x2002d2bc cpys $f31,$f31,$f17
      0  0x2002d2c0 cpys $f31,$f31,$f18
      0  0x2002d2c4 cpys $f31,$f31,$f19
      0  0x2002d2c8 bis v0, v0, s2
     43  0x2002d2cc ldq at, 0(a0)
      0  0x2002d2d0 bsr ra, 0x20027a50
Order? Wrong Size? Troll? Queue Full?

22 22 / 30 Replay Trap Value Profile
> dcpilist $labels $db -pm replays -vreplay \
    '...OctNode.Intersect(...)' mtrtbase.exe
...OctNode.Intersect(...):
replays
 :count                                     vtot thld  nv
         code elided
      0  0x2002d2a0 stt $f8, 104(sp)           5  1.0   1 (100.0% 0x2002b4f8)
      0  0x2002d2a4 bis a0, a0, s5             0  0.0   0
      0  0x2002d2a8 bis a1, a1, s6             0  0.0   0
      0  0x2002d2ac bis a2, a2, s4             0  0.0   0
      0  0x2002d2b0 stt $f19, 8(sp)            0  0.0   0
      0  0x2002d2b4 bsr ra, 0x20022250         0  0.0   0
      0  0x2002d2b8 bis v0, v0, a0             0  0.0   0
      0  0x2002d2bc cpys $f31,$f31,$f17        0  0.0   0
      0  0x2002d2c0 cpys $f31,$f31,$f18        0  0.0   0
      0  0x2002d2c4 cpys $f31,$f31,$f19        0  0.0   0
      0  0x2002d2c8 bis v0, v0, s2             0  0.0   0
     43  0x2002d2cc ldq at, 0(a0)             25  1.0   1 (100.0% 0x203f10d0)
      0  0x2002d2d0 bsr ra, 0x20027a50         0  0.0   0
[Callout on the value 0x203f10d0] Possible Conflicting Instruction (accesses overlapping bytes)

23 23 / 30 Conflicting Instruction
> dcpilist -vreplay -vshow 1 $labels $db -pm repl '0x203f10d0' \
    mtrtbase.exe
comp_alloc_fast:
replays
 :count                                     vtot thld  nv
      0  0x203f10c0 ldq t1, 64(s0)            88  1.0   4 (48.9% 0x203f10d8)
      0  0x203f10c4 ldq v0, 56(s0)            98  1.0  12 (43.9% 0x203f10dc)
      0  0x203f10c8 subq t1, a2, t1            0  0.0   0
      0  0x203f10cc blt t1, 0x203f1134         0  0.0   0
      1  0x203f10d0 stl a1, 0(v0)             16  1.0  16 (6.2% T 0x2002b464)
      0  0x203f10d4 addq v0, a2, t2            0  0.0   0
      0  0x203f10d8 stq t1, 64(s0)            43  1.0   2 (97.7% 0x203f10d8)
      1  0x203f10dc stq t2, 56(s0)            46  1.0   6 (89.1% 0x203f10dc)
      0  0x203f10e0 ret zero, (ra), 1          0  0.0   0
4-byte method pointer write in code for JVM’s new; 8-byte object header read for null check => wrong_size replay trap for every allocation.
Fix with 4-byte reads for null check!
2.3% speedup across SPECjvm98 (yes it matters!!)

24 24 / 30 Avoiding Traps
• “Build a better …” {program,compiler,processor}
• Change access widths
• Try to get loads/stores further apart
• Correct unfortunate data alignment
• Avoid filling load/store queues
• Improve instruction slotting

25 25 / 30 Interpretation Parameters
• Frequency
  - don’t need to interpret on every interrupt
• Duration
  - longer runs find more possible traps... (interacting instructions can be > 80 apart!)
  - ...but they are more expensive
    - we are running at highest priority
    - more time interpreting
    - more culprit data to collect
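The two knobs, frequency and duration, can be sketched as a small policy object consulted on each profiling interrupt. This is a hypothetical Python illustration: the class and method names are invented, and picking exactly every N-th interrupt (rather than, say, a randomized choice) is an assumption. The numbers in the usage example follow the setting the talk reports for its evaluation, paths of 128 every 128 interrupts.

```python
class InterpretationPolicy:
    """Decide, per profiling interrupt, whether to interpret and for how long."""

    def __init__(self, every, length):
        self.every = every      # frequency: interpret on one interrupt out of `every`
        self.length = length    # duration: instructions interpreted per window
        self._seen = 0

    def on_interrupt(self):
        """Return how many instructions to interpret now (0 means: just sample)."""
        self._seen += 1
        if self._seen % self.every == 0:
            return self.length
        return 0

    def average_per_interrupt(self):
        """Interpreted instructions per interrupt, averaged over many interrupts."""
        return self.length / self.every


if __name__ == "__main__":
    policy = InterpretationPolicy(every=128, length=128)
    interpreted = sum(policy.on_interrupt() for _ in range(256))
    print(interpreted, policy.average_per_interrupt())
```

With these values the average cost works out to one interpreted instruction per interrupt, which shows how a long window can be paid for by interpreting rarely.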

26 26 / 30 Evaluation - Overhead
• Single runs of 11 early cpu2000 int benchmarks
• Dual 667 MHz Alpha 21264a
• Paths of 128 every 128 interrupts
• 225/sec ? ?

27 27 / 30 Future Work
• Measure overhead for other frequencies/lengths
• Evaluate ability to actually find culprits
• Optimize data flow
• Sample unbiasing
  - more likely to discover culprits nearby
  - more interpretation windows will cover both instrs.
• Try to filter more unlikely culprits

28 28 / 30 Summary
• Low-impact way to get trace information
  - No special requirements for processor
  - Benefits of statistical sampling
• Manageable overhead
• Useful applications
  - Value profiling: code specialization, online optim.
  - Path profiling: edge counts
  - Pipeline trap explanation: replay trap culprits

29 www.compaq.com

