Master/Slave Speculative Parallelization Craig Zilles (U. Illinois) Guri Sohi (U. Wisconsin)


1 Master/Slave Speculative Parallelization Craig Zilles (U. Illinois) Guri Sohi (U. Wisconsin)

2 The Basics

- #1: a well-known problem: on-chip communication
- #2: a well-known opportunity: program predictability
- #3: our novel approach to #1 using #2

3 Problem: Communication

- Cores are becoming “communication limited”
  - Rather than “capacity limited”
- Many, many transistors on a chip, but…
- Can’t bring them all to bear on one thread
  - Control/data dependences = frequent communication

4 Best core << chip size

- There is a sweet spot for core size
  - Further size increases hurt either MHz or IPC
- How can we maximize the core’s efficiency?

[Figure: the core occupies only a small fraction of the chip]

5 Opportunity: Predictability

- Many program behaviors are predictable
  - Control flow, dependences, values, stalls, etc.
- Widely exploited by processors/compilers
  - But not to help increase effective core size
  - Core resources are used to make and validate predictions
- Example: a perfectly biased branch (bne taken 100% of the time, not taken 0%)
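The perfectly biased branch is the canonical case of this predictability. As a rough sketch (hypothetical function name and numbers, not anything from the talk), the bias can be read straight off profile counters:

```python
# Sketch with hypothetical names/numbers: measuring how biased a branch is
# from profile counts. A bias of 1.0 means the branch is perfectly
# predictable, so the core gains nothing by re-deciding it every time.

def branch_bias(taken: int, not_taken: int) -> float:
    """Fraction of executions that follow the branch's dominant direction."""
    total = taken + not_taken
    if total == 0:
        return 0.0
    return max(taken, not_taken) / total

# A bne that was taken on every one of 100,000 profiled executions
# has bias 1.0; a 50/50 branch has bias 0.5.
bias = branch_bias(100_000, 0)
```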

6 Speculative Execution

- Execute code before/after the branch in parallel
- The branch is fetched, predicted, executed, retired
- All of this occurs in the core:
  - The branch predictor
  - Space in the I-cache
  - Execution resources
  - Not just for the branch, but for its backward slice

7 Trace/Superblock Formation

- Optimize code assuming the predicted path
  - Reduces the cost of the branch and surrounding code
  - The prediction is implicitly encoded in the executable
- The code still verifies the prediction
  - Branch & slice are still fetched, executed, committed, etc.
  - All of this occurs on the core

8 Why waste core resources?

- The branch is perfectly predictable!
- The core should only execute instructions that are not statically predictable!

9 If not in the core, where?

- Anywhere else on chip!
- Because it is predictable:
  - It doesn’t prevent forward progress
  - We can tolerate the latency of verifying the prediction

[Figure: prediction / instruction storage / verify prediction]

10 A concrete example: Master/Slave Speculative Parallelization

- Execute a “distilled program” on one processor
  - A version of the program with predictable instructions removed
  - Faster than the original, but not guaranteed to be correct
- Verify predictions by executing the original program
  - Parallelize verification by splitting it into “tasks”
- Master core: executes the distilled program; slave cores: parallel execution of the original program
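This division of labor can be sketched as follows. All names are hypothetical, and the real system operates on distilled and original binaries, not Python functions; the slave verification loop here runs sequentially, whereas in MSSP it is parallelized across cores:

```python
# Minimal sketch of MSSP's division of labor (hypothetical names).

def original_task(state):
    # One "task" of the original, always-correct program.
    return state * 2 + 1

def distilled_task(state):
    # Distilled version: faster, usually right, not guaranteed correct.
    # Here it happens to be exact, so the run below commits every task.
    return state * 2 + 1

def mssp_run(n_tasks, start):
    arch = start                  # architected (committed) state
    spec = start                  # master's speculative state
    checkpoints = []
    for _ in range(n_tasks):      # master races ahead of the slaves
        checkpoints.append(spec)  # checkpoint = live-ins for the next task
        spec = distilled_task(spec)
    for ckpt in checkpoints:      # slaves verify (in parallel in hardware)
        assert ckpt == arch       # do the task's live-ins match architected state?
        arch = original_task(arch)  # the original execution is what commits
    return arch
```

Only the original program’s results ever commit; the distilled program exists purely to generate checkpoints far ahead of the verifying slaves.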

11 Talk Outline

- Removing predictability from programs
  - “Approximation”
- Externally verifying distilled programs
  - Master/Slave Speculative Parallelization (MSSP)
- Results Summary
- Summary

12 Approximation Transformations

- Pretend you’ve proven the common case
  - Preserve correctness in the common case
  - Break correctness in the uncommon case
  - Use a profile to know the common case

[Figure: a bne branch biased 99%/1%; the rarely taken path is approximated away]
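A hypothetical illustration of this transformation (the functions and the 1% figure are invented for the example): the profile says the guard almost never fires, so the distilled version simply assumes the common case and drops the branch along with its backward slice.

```python
# Hypothetical example of an approximation transformation: the distilled
# version "pretends the common case is proven" and removes a rarely taken
# branch. It is wrong in the uncommon case; verification against the
# original program catches that.

def original(x):
    if x < 0:          # uncommon case in the profile (~1%)
        x = -x
    return x * 3       # common-case work

def distilled(x):
    return x * 3       # branch (and its backward slice) approximated away
```

For a common-case input the two agree; for a rare negative input they disagree, which is exactly the infrequent incorrectness the original program exists to catch.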

13 Not just for branches

- Values: ld r13, 0(X). The load is highly invariant (usually gets 7), so replace it with addi $zero, 7, r13.
- Memory dependences: st r12, 0(A) followed by ld r11, 0(B), where A and B may alias.
  - If they never (or only rarely) alias in practice? Treat them as independent.
  - If they almost always alias? Forward the stored value: mv r12, r11.
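The same two rewrites can be modeled in (hypothetical) Python form, with memory as a dict standing in for the address space; the constant 7 and the always-alias assumption mirror the slide’s profile claims:

```python
# Hypothetical model of the slide's two rewrites, with memory as a dict.

def original_code(mem, A, B, r12):
    mem[A] = r12          # st r12, 0(A)
    r11 = mem[B]          # ld r11, 0(B)   (A and B may alias)
    r13 = mem["X"]        # ld r13, 0(X)   (usually holds 7)
    return r11, r13

def distilled_code(mem, A, B, r12):
    mem[A] = r12
    r11 = r12             # profile: A and B always alias -> mv r12, r11
    r13 = 7               # profile: load is invariant -> addi $zero, 7, r13
    return r11, r13
```

When the profile holds (A == B and location X holds 7), the two versions agree; when it does not, the mismatch is caught by re-executing the original.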

14 Enables Traditional Optimizations

- From bzip2: many static paths, but two dominant paths
- Approximate away the unimportant paths

[Figure: bzip2 control flow with calls to printf, fprintf, and fwrite around a csEOF check and exit]

16 Enables Traditional Optimizations (cont.)

- With the unimportant paths approximated away, the bzip2 code has a very straightforward structure
- Easy for the compiler to optimize

18 Effect of Approximation

- Equivalent to the original most of the time, with better execution characteristics:
  - Fewer dynamic instructions: ~1/3 of the original code
  - Smaller static size: ~2/5 of the original code
  - Fewer taken branches: ~1/4 of the original code
  - A smaller fraction of loads/stores
- Shorter than the best non-speculative code
  - Removing checks: the code is incorrect 0.001% of the time

19 Talk Outline

- Removing predictability from programs
  - “Approximation”
- Externally verifying distilled programs
  - Master/Slave Speculative Parallelization (MSSP)
- Results Summary
- Summary

20 Goal

- Achieve the performance of the distilled program
- Retain the correctness of the original program
- Approach: use the distilled code to speed up the original program

21 Checkpoint Parallelization

- Cut the original program into “tasks”
  - Assign tasks to processors
- Provide each task a checkpoint of registers & memory
  - Completely decouples task execution
  - Tasks retrieve all live-ins from the checkpoint
- Checkpoints are taken from the distilled program
  - Captured in hardware
  - Stored as a “diff” from architected state
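A checkpoint “stored as a diff from architected state” can be sketched like this (hypothetical representation; real checkpoints are register values and memory words captured in hardware):

```python
# Hypothetical sketch of a checkpoint kept as a diff from architected
# state: only the locations the master has speculatively changed are
# stored, and a task reconstructs its full live-in state on demand.

def make_checkpoint(arch, spec):
    """Diff: just the locations where speculative state differs."""
    return {k: v for k, v in spec.items() if arch.get(k) != v}

def task_live_ins(arch, diff):
    """Reconstruct a task's starting state: architected state + diff."""
    state = dict(arch)
    state.update(diff)    # speculative values override architected ones
    return state

arch = {"r1": 0, "r2": 3, "mem[100]": 42}
spec = {"r1": 8, "r2": 3, "mem[100]": 42}
diff = make_checkpoint(arch, spec)    # only r1 changed: small, cheap to ship
```

Keeping only the changed locations is what makes the storage cost modest; most of a task’s live-ins fall through to unchanged architected state.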

22 Master core: executes the distilled program. Slave cores: parallel execution of the original program.

23 Example Execution

[Timeline: Master, Slave1, Slave2, Slave3]

- Start the Master (A') and a Slave (A) from architected state
- Take a checkpoint; use it to start the next task (B', B)
- Verify B’s inputs with A’s outputs; commit state
- A bad checkpoint (C') is detected at the end of B: squash, and restart from architected state
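The commit/squash decision in this timeline can be sketched as (hypothetical model; in hardware the comparison is over the checkpoint’s registers and memory words):

```python
# Hypothetical model of the verification step from this timeline: a
# task's checkpoint (produced by the master) is compared against the
# previous task's actual outputs (produced by a slave running the
# original program).

def verify(checkpoint, prev_task_outputs):
    # Match -> commit state and keep speculating.
    # Mismatch -> squash, restart the master from architected state.
    if checkpoint == prev_task_outputs:
        return "commit"
    return "squash"
```

For example, verify({"r1": 8}, {"r1": 8}) commits, while verify({"r1": 8}, {"r1": 9}) squashes.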

24 MSSP Critical Path

[Timeline: Master, Slave1, Slave2, Slave3]

- If checkpoints are correct:
  - The critical path runs through the distilled program
  - No communication latency
  - Verification happens in the background
- If bad checkpoints are rare:
  - Performance of the distilled program
  - Tolerant of communication latency
- On bad checkpoints:
  - The critical path runs through the original program
  - Interprocessor communication

25 Talk Outline

- Removing predictability from programs
  - “Approximation”
- Externally verifying distilled programs
  - Master/Slave Speculative Parallelization (MSSP)
- Results Summary
- Summary

26 Methodology

- First-cut distiller
  - Static binary-to-binary translator
  - Simple control-flow approximations
  - DCE, inlining, register re-allocation, save/restore elimination, code layout…
- HW model: 8-way CMP of 21264s
  - 10-cycle interconnect latency to the shared L2
- SPEC2000 integer benchmarks on Alpha

27 Results Summary

- Distilled programs can be accurate
  - 1 task misspeculation per 10,000 instructions
- Speedup depends on distillation
  - 1.25 harmonic mean; ranges from 1.0 to 1.7 (gcc, vortex)
  - (relative to uniprocessor execution)
- Modest storage requirements
  - Tens of kB at the L2 for speculation buffering
- Decent latency tolerance
  - Raising latency from 5 to 20 cycles: 10% slowdown

28 Distilled Program Accuracy

[Figure: per-benchmark accuracy; y-axis from 1,000 to 100,000]

- Average distance between task misspeculations: > 10,000 original program instructions

29 Distillation Effectiveness

[Figure: instructions retired by the Master (distilled program) vs. instructions retired by the Slaves (original program, not counting nops); y-axis 0% to 100%]

- Up to a two-thirds reduction

30 Performance

[Figure: speedup alongside distillation effectiveness and accuracy]

- Performance scales with distillation effectiveness

31 Related Work

- Slipstream
- Speculative multithreading
- Pre-execution
- Feedback-directed optimization
- Dynamic optimizers

32 Summary

- Don’t waste the core on predictable things
  - “Distill” the predictability out of programs
- Verify predictions with the original program
  - Split into tasks: parallel validation
  - Achieves the throughput to keep up
- Has some nice attributes (ask offline)
  - Supports legacy binaries, is latency tolerant, has low verification cost, and complements explicit parallelism

