Slide 1: rePLay: A Hardware Framework for Dynamic Optimization
* Slides adapted from a talk by Sanjay Patel, CSA
Slide 2: rePLay as an evolution
Trace Caches [high-bandwidth fetch]: effective at capturing short dynamic traces
Branch Characterization [correlation, predictability]: dynamically, few "unpredictable" branches
Dynamic Optimization [Dynamo and HW-based]: good potential, but watch out for overhead
rePLay draws on all three.
Slide 3: The rePLay Framework: supporting dynamic optimization in the microarchitecture
[diagram: Fetch Engine, Execution Engine, Frame Constructor, Optimization Engine, Frame Cache, and Sequencer; completing instructions feed the frame constructor, and the execution engine provides recovery support]
The optimization engine provides for concurrent optimization.
HW recovery support enables speculative optimization without requiring recovery code.
The frame constructor builds atomic optimization regions.
The frame cache and sequencer provide for quick dispatch.
Slide 4: rePLay Key Innovations
Atomic optimization regions (frames)
Assertion architecture
Use of hardware rollback for dynamic, speculative optimization
Slide 5: Frames: rePLay optimization regions
[diagram: a sequence of basic blocks A, B, C, D, E is collapsed into a single frame]
Slide 6: Dealing with side exits
[diagram: block C ends with "BEQ R3, D", with a side exit to X; inside the frame, the branch is replaced by "ASSERT R3==0" and C flows directly into D]
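The side-exit transformation on this slide can be sketched in a few lines. This is an illustrative model, not the actual hardware: instructions are tuples, and the biased branch "BEQ R3, D" (hot direction: taken) becomes an assertion on that condition, so the frame proceeds straight into D and a mispredicted direction triggers a rollback instead of a control-flow exit.

```python
def promote_side_exit(inst):
    """Rewrite a biased conditional branch into an assertion on its
    hot direction.  inst is ("beq", reg, target): branch if reg == 0,
    and the taken path is assumed to be the frequently executed one."""
    op, reg, target = inst
    if op != "beq":
        raise ValueError("sketch only handles beq")
    # Assert the hot condition holds; if reg != 0 at runtime, the
    # whole frame rolls back and fetch resumes from the original code.
    return ("assert_eq", reg, 0)

frame = [("beq", "r3", "D")]
frame[0] = promote_side_exit(frame[0])
```

After promotion the frame contains no control-flow instruction at that point, only an assertion, which is what makes the region atomic.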
Slide 7: Frames become building blocks
[diagram: the frame built from blocks A through E is itself treated as a single block A in later construction]
Slide 8: Why are atomic blocks desirable?
Because they contain no intermediate branches, they permit sustained fetch.
Because the code is control-independent, they permit code reordering without requiring recovery code.
Also, they require less effort to optimize and schedule.
Slide 9: Mechanisms
Slide 10: rePLay Microarchitecture
Frame Construction
Frame Sequencer
Frame Cache
Optimizer
Slide 11: Frame Constructor
[framework diagram, focusing on the Frame Constructor]
Slide 12: Frame Construction
Objective 1: Frames should be long.
Objective 2: Frames should completely execute.
Objective 3: Frames should cover a significant fraction of the instruction stream.
Approach: use path-history-based branch promotion to convert highly biased branches into assertions.
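The promotion approach above can be sketched as a small simulation. The run-length threshold of 32 is an assumption for illustration, not a figure from the talk: a branch whose outcome repeats in the same direction enough consecutive times is deemed highly biased and promoted to an assertion.

```python
from collections import defaultdict

PROMOTE_AT = 32   # assumed threshold: consecutive same-direction outcomes

class BranchBiasTable:
    """Toy model of the branch bias table used during frame construction."""
    def __init__(self):
        # pc -> (last direction, length of current same-direction run)
        self.runs = defaultdict(lambda: (None, 0))

    def update(self, pc, taken):
        """Record one resolved branch outcome; return True when the
        branch should be promoted to an assertion."""
        last, n = self.runs[pc]
        n = n + 1 if last == taken else 1
        self.runs[pc] = (taken, n)
        return n >= PROMOTE_AT

bbt = BranchBiasTable()
decisions = [bbt.update(0x400, True) for _ in range(PROMOTE_AT)]
```

On the 32nd consecutive taken outcome the branch is promoted; any earlier flip in direction resets the run, which is how strong bias is enforced before an assertion replaces a branch.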
Slide 13: The rePLay Frame Constructor
[diagram: incoming blocks from the execution engine enter a pending frame construction buffer; a branch bias table, indexed with branch history, decides whether to promote each branch to an assertion, and the frame keeps growing as long as branches are promoted]
Slide 14: Branch History forms Frame Context
[diagram: the same block sequence A B C D yields different frames depending on the preceding path, X or Y]
Slide 15: Frame Construction Strategies
[chart: average instructions per frame vs. history length, averaged across SPEC2000 integer benchmarks]
Slide 16: Fraction of Dynamic Assertions
[chart: fraction of dynamic assertions vs. path history, averaged across SPEC2000 integer benchmarks]
Slide 17: Frame Length Distribution
[chart: percent of frames by length, using a 9-element path history, averaged across SPEC2000 integer benchmarks]
Slide 18: Frame Construction Summary (with a 9-element path history)
Frame size: 102 insts
Completion rate: 97%
Coverage: 82%
Assertions/frame: 9
Unique frames: 6191
Slide 19: Frame Sequencer
[framework diagram, focusing on the Sequencer]
Slide 20: The rePLay Frame Sequencer
[diagram: in cycle N, a pipelined next-fetch-address generator combines the branch predictor with the path history; in cycle N+1, the resulting address and tag index the frame cache (falling back to the I-cache on a miss) and initiate the frame fetch]
Pipelined: generates one target per cycle.
We can start fetching the next frame before the current one is done.
The tag allows fetch by frame context.
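The "fetch by frame context" tag can be illustrated with a hypothetical packing scheme (the real tag format is not given on the slide): the frame cache index combines the frame's start PC with the branch path history, one bit per history element, so the same start address can map to different frames under different paths.

```python
def frame_tag(start_pc, path_history):
    """Hypothetical tag packing: shift the start PC left by the history
    length and fill the low bits with the taken/not-taken path bits."""
    tag = start_pc
    for taken in path_history:
        tag = (tag << 1) | int(taken)
    return tag

# Same start PC, different preceding paths -> distinct frame cache entries:
t1 = frame_tag(0x400, [True, False, True])
t2 = frame_tag(0x400, [True, True, True])
```

This mirrors slide 14: branch history forms the frame context, so two frames starting at the same PC do not alias in the cache.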
Slide 21: Frame Cache
[framework diagram, focusing on the Frame Cache]
Slide 22: Issues with caching frames
Need a cache where each entry is atomic
–eviction must remove the entire frame
–but frames should span multiple cache lines
Reads are frequent, writes infrequent
Evaluated several schemes [Crum 2000]
–including a trace cache patent by Intel (P4)
Slide 23: The rePLay Frame Cache
[diagram: a head partition, tagged by startPC and path history, holds the head line of each frame; a body partition holds the remaining lines through the tail, each entry storing instructions plus a next-line pointer]
[M. Crum, MS Thesis, UIUC 2001]
Slide 24: Optimization Engine
[framework diagram, focusing on the Optimization Engine]
Slide 25: The rePLay Optimization Engine
Programmable, with local memory
Hardware blocks to assist with dead code detection, etc.
Possible implementation: the optimizer running as a thread on an MT processor
[diagram: an unoptimized frame buffer feeds an optimization datapath with a dead code detector and local OptEngine memory, producing an optimized frame]
Slide 26: rePLay Optimizations
Removal of partially dead code
Reassociation
Constant propagation
Common subexpression elimination
Subroutine in-lining
Fetch scheduling
Slide 27: Example of Optimization [from gcc]
[figure: an Alpha code sequence with biased branches (beq, bne), the frame built by the frame constructor with those branches converted to assertions (asst_i), and the optimizer's output, in which redundant sll/srl extension sequences and dead code have been removed]
Slide 28: [figure, continued: the same frame after reassociation (operands rewritten to read r1 directly), after common subexpression elimination (the repeated sll/srl extension chains collapsed), and after miscellaneous cleanup, leaving a single "and r1,255,r30" feeding the compare/assert sequence]
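The dead-code step underlying these examples is simple to sketch. This is a hedged illustration of why atomicity matters, not the hardware algorithm: because a frame has no side exits, a single backward liveness pass can delete any instruction whose result is never read later and is not live at the frame's one exit, with no recovery code.

```python
def remove_dead_code(frame, live_out):
    """Backward liveness over an atomic frame.
    Instructions are (dest, srcs) tuples; dest=None marks instructions
    with side effects (stores, assertions) that must always be kept."""
    live = set(live_out)
    kept = []
    for dest, srcs in reversed(frame):
        if dest is None or dest in live:
            live.discard(dest)        # this instruction defines dest
            live.update(srcs)         # ...and makes its sources live
            kept.append((dest, srcs))
    return kept[::-1]

frame = [("r1", ["r2"]),   # r1 = f(r2)
         ("r3", ["r1"]),   # r3 = f(r1)   (r3 is live at frame exit)
         ("r4", ["r2"])]   # r4 = f(r2)   (dead: never read again)
optimized = remove_dead_code(frame, live_out={"r3"})
```

In a conventional trace with side exits, deleting the r4 write would require compensation code on every exit path; the frame's single exit is what removes that cost.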
Slide 29: Experimental Results
Slide 30: Processor Configurations
ICache (IC)
Simple Trace Cache (TC)
rePLay without Optimizations (RP)
rePLay with Optimizations (RPO)
Fetch: 8-wide, 64KB of cache space for instructions
Core: 8-wide OOO, 64KB DCache, 1 execution cluster, 7 cycles (min) for branch resolution
Compaq Alpha compiler used for static optimizations (-O4)
Slide 31: Performance on statically optimized binaries [chart]
Slide 32: Performance on unoptimized binaries [chart]
Slide 33: Some observations
Optimizations boost performance on statically optimized code by 14% on average, and by 17% over IC and TC. In terms of effective IPC: 16% and 21%.
On unoptimized code, the gain is over 20% (25% in IPC) over IC and TC.
In some cases the rePLay optimizer does better than the static optimizer (gap, mcf, perlbmk when comparing RP vs. RPO).
Slide 34: Effect of Individual Optimizations
[chart: dead code removal, constant propagation, reassociation, common subexpression elimination, fetch scheduling]
Slide 35: Effects of Frame Length [chart]
Slide 36: A note about fragile code
Often caused by invalid accesses to memory
–overwrites, uninitialized memory
–a barrier to optimization of production code
–our sampling of x86 code reveals little static optimization
Because rePLay optimizations preserve architectural equivalence at frame boundaries, fragile bugs are not exposed.
There is a tradeoff between exploiting semantic information (and exposing bugs) and aggressive optimization.
Slide 37: Current Work
Generalized assertion model
x86-based rePLay
Machine idioms
Slide 38: Generalized Assertion Model
[diagram: control-flow assertions enable simple optimizations; general assertions, including memory assertions, enable speculative optimizations]
Slide 39: Continuous Optimization
[diagram: a frame A is re-optimized repeatedly through successive versions W, X, Y, Z]
Slide 40: The potential of rePLay on x86 code may be very significant
Production code appears not to be highly optimized.
The larger instruction granularity and small register set of the x86 ISA provide ample opportunities for uop optimization.
Slide 41: [figure: an x86 fragment from production code shown alongside its straightforward uop translation and the optimized uop frame; conditional jumps (jnz, jge) become assertions (asst), redundant loads and stack push/pop traffic are eliminated, and stack-pointer adjustments are coalesced]
Slide 42: Idioms: A Motivation
The frame cache contains many redundant basic blocks, which reduces its effective size. Traditional approaches to reducing this redundancy don't work here. Our solution: idioms. Extract and encode common segments of dataflow into smaller codewords.
Slide 43: What is an idiom?
An idiom is a connected dataflow graph of instructions from the dynamic instruction stream.
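A deliberately simple idiom finder can make the definition concrete. The real idiom selector is more elaborate; this sketch only counts the smallest connected dataflow fragments, producer-to-consumer opcode pairs, and treats frequently repeating signatures as candidate idioms to encode as one codeword.

```python
from collections import Counter

def idiom_candidates(stream):
    """Count producer->consumer opcode pairs in a dynamic instruction
    stream.  Each instruction is (opcode, dest_reg, src_regs)."""
    producer = {}          # register -> opcode that last wrote it
    counts = Counter()
    for op, dest, srcs in stream:
        for s in srcs:
            if s in producer:
                counts[(producer[s], op)] += 1   # a 2-node dataflow fragment
        producer[dest] = op
    return counts

# The sll-feeding-srl extension pattern from the gcc example appears twice:
stream = [("sll", "t", ["a"]), ("srl", "b", ["t"]),
          ("sll", "t", ["c"]), ("srl", "d", ["t"])]
counts = idiom_candidates(stream)
```

Extending the fragments beyond two nodes (hashing whole connected subgraphs) would match the slide's definition more fully, at the cost of a more involved canonicalization step.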
Slide 44: Idiom Formation
[diagram: the dynamic instruction stream feeds the frame builder and a frame analyzer; an idiom selector chooses idioms by their dynamic coverage of the I-stream, and an idiom mapper rewrites the instruction stream using them]
Slide 45: Example Idioms [figure]
Slide 46: Results and Analysis
Slide 47: Fetch Compression
Idiom decoding logic expands the idiom's codeword into the constituent operations.
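The decode step can be sketched as a lookup table. The codeword value and the uop encoding below are made up for illustration; the point is only that one fetched slot expands into several operations, which is where the fetch-bandwidth compression comes from.

```python
# Hypothetical idiom decode table: codeword -> constituent uops.
IDIOM_TABLE = {
    0x01: [("sll", 48), ("srl", 48)],   # the sll/srl extension idiom
}

def expand_fetch_group(fetch_group):
    """Replace each recognized codeword in a fetch group with its uops;
    pass ordinary instructions through unchanged."""
    uops = []
    for slot in fetch_group:
        uops.extend(IDIOM_TABLE.get(slot, [slot]))
    return uops

expanded = expand_fetch_group([0x01, ("add", 0)])
```

One cached slot thus yields two uops at decode, so frames stored with idioms occupy less frame cache space and less fetch bandwidth for the same work.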
Slide 48:
The optimizer provides a 15% reduction in cycles over basic rePLay, and 20% over the other configurations.
Dead code removal, reassociation, and fetch scheduling are the most important optimizations; dead code removal alone eliminates almost 1 in 5 instructions.
There is high potential for x86 uop optimization using rePLay.
A small number of code idioms cover over a quarter of the I-stream, and many potential opportunities remain.