2/15/2006 "Software-Hardware Cooperative Memory Disambiguation", Alok Garg, HPCA

Software-Hardware Cooperative Memory Disambiguation
Ruke Huang, Alok Garg, and Michael Huang
Department of Electrical & Computer Engineering
University of Rochester

Motivation
- Hiding long latencies
  - Scaling up of many structures
    - Complex, hard to design
    - Consumes more energy
    - Slower
- Inefficiency in hardware
  - Meticulously keeps track of all instructions
  - No prior knowledge of out-of-order execution
  - Simply cross-compares all loads and stores
[Figure: effect of LQ size. Configuration labels: ROB size: 320, SQ size: 48, LQ size: 48. Callout: 16%]

Software Assistance
- Global information
  - Statically identify non-conflicting memory accesses
- Advantages
  - Reduced resource pressure
  - Energy savings
- Loads not requiring memory disambiguation
  - On average, 43% of dynamic loads in SPEC FP applications

Recent Research
- Chrysos and Emer (ISCA'98)
- Sethumadhavan et al. (MICRO'03)
- Park et al. (MICRO'03)
- Baugh and Zilles (PACC'04)
- Akkary et al. (MICRO'03)
- Gandhi et al. (ISCA'05), etc.
Hardware-only: provisioning, recurring overhead
Cooperative: consumption, one-time overhead

Outline
- Cooperative Memory Disambiguation Framework
- Evaluation
- Conclusion

Cooperative Memory Disambiguation - Resource-Effective Approach
- 90% of dynamic loads do not communicate with in-flight stores
- Many loads do not require memory disambiguation resources
- Safe loads:
  - Software analyzer can identify them
  - Can exploit hardware-specific information
- Hardware resources only for non-safe loads

    int A[1000], B[1000];

    void VecAdd()
    {
        for (int i = 0; i < 1000; i++)
            A[i] = A[i] + B[i];
    }

Cooperative Memory Disambiguation Framework
- Software-hardware interface
  - Decoupled ISA (no compatibility obligations)
- Software support
  - Binary-to-binary translator: alto (Muth et al.)
  - Binary analyzer
    - Identify read-only data loads
    - Identify other general safe loads
- Architectural support
  - Light-weight
[Figure: translation flow. Labels: Source, Compilation (compiler), Original binary, ISA, Hardware-specific translator, Extended instruction set, Hardware-specific internal binary, Hardware]

General Safe Loads
- Scope of parser analysis
  - Steady-state loop
  - No internal control flow
- Limited in-flight instructions
  - ROB size, store queue size
[Figure: a simple loop body (… Load … Store, Branch) and steady-state loop execution, with the stores and loads of iterations i, i-1, i-2 inside the instruction window]

General Safe Loads (Cont.)
- Real example from a SPEC FP application: one loop from galgel

  0x  : ldl   r31, 256(r3)    ; prefetch
  0x  : ldt   f21, 0(r3)      ; Ld1
  0x  : lda   r27, -2(r27)    ; r27 = r27-2
  0x c: lda   r3, 16(r3)      ; r3 = r3+16
  0x  : ldt   f22, -8(r3)     ; Ld2
  0x  : ldt   f23, 0(r11)     ; Ld3
  0x  : cmple r27, 0x1, r1    ;
  0x c: lda   r11, 16(r11)    ; r11 = r11+16
  0x  : ldt   f24, -8(r11)    ; Ld4
  0x  : lds   f31, 240(r11)   ; prefetch
  0x  : mult  f20, f21, f21   ;
  0x c: mult  f20, f22, f22   ;
  0x  : addt  f23, f21, f21   ;
  0x  : addt  f24, f22, f22   ;
  0x  : stt   f21, -16(r11)   ; St1
  0x c: stt   f22, -8(r11)    ; St2
  0x  : beq   r1, 0x          ;

  (The slide repeats this listing with the load from 0(r11) relabeled Ld2; the analysis below uses that labeling.)

- Address analysis for iteration i
  AddrLd1 = _R3 + 16*i
  AddrLd2 = _R11 + 16*i
  AddrSt1 = _R11 + 16*i
  AddrSt2 = _R11 + 16*i + 8
- Analysis window: 16 iterations
  Address range of in-flight stores = _R11 + (i-16)*16 to _R11 + (i-1)*16 + 8
- Ld2 is statically determined to be safe
- Ld1 needs run-time evaluation

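The static part of this analysis can be sketched in C. This is only an illustrative model, not the analyzer from the talk: the names `Access` and `statically_safe` are hypothetical, and it assumes that the load and the store use the same base register and the same per-iteration stride, so the base and the iteration number cancel and only constant offsets remain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: in iteration i an access covers the bytes
 * [base + stride*i + offset, base + stride*i + offset + size). */
typedef struct {
    int64_t offset;  /* constant byte offset from the base register */
    int64_t size;    /* access width in bytes */
} Access;

/* Returns true if the load of iteration i cannot overlap the store issued in
 * any of the `window` preceding iterations (i-1 down to i-window).
 * Assumption: both accesses share the base register and the stride. */
bool statically_safe(Access ld, Access st, int64_t stride, int window)
{
    assert(window > 0 && ld.size > 0 && st.size > 0);

    int64_t ld_lo = ld.offset;
    int64_t ld_hi = ld.offset + ld.size;      /* exclusive upper bound */

    for (int k = 1; k <= window; k++) {
        /* Store from k iterations earlier, relative to the load's iteration. */
        int64_t st_lo = st.offset - stride * k;
        int64_t st_hi = st_lo + st.size;      /* exclusive upper bound */

        if (ld_lo < st_hi && st_lo < ld_hi)
            return false;                     /* byte ranges intersect */
    }
    return true;
}
```

With the slide's values (16-byte stride, 8-byte accesses, 16-iteration window), the load at _R11+16*i does not overlap St1 or St2 of any earlier in-flight iteration, which matches the conclusion that it is statically safe. A load whose base register differs from the stores' (Ld1, based on r3) cannot be resolved this way and needs a run-time check.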
General Safe Loads (Cont.)
- Real example from a SPEC FP application: modified code

  New_entry:
        mark_sq
        if (r3-r11+8 > 0) or (r3-r11+264 < 0) then cset CR0, 1
  0x  : sldt  f21, 0(r3), [CR0]       ; Ld1 (safe)
  0x c: lda   r3, 16(r3)              ; r3 = r3+16
  0x  : sldt  f23, 0(r11), [CR_TRUE]  ; Ld2 (safe)
  0x  : cmple r27, 0x1, r1            ;
  0x c: lda   r11, 16(r11)            ; r11 = r11+16
  0x  : addt  f24, f22, f22           ;
  0x  : stt   f21, -16(r11)           ; St1
  0x c: stt   f22, -8(r11)            ; St2

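The guard evaluated at loop entry can be written out as a small C function. This is a sketch of the slide's condition only; the function name `r3_loads_safe` is hypothetical, and the alignment assertion reflects the assumption that the 8-byte ldt/stt accesses are naturally aligned, which is what lets a strict inequality on start addresses rule out overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Evaluates the condition that sets CR0 in the modified code:
 *   (r3 - r11 + 8 > 0) or (r3 - r11 + 264 < 0)
 * r3 and r11 are the register values at loop entry. */
bool r3_loads_safe(int64_t r3, int64_t r11)
{
    /* Assumption: both arrays are 8-byte aligned. */
    assert(r3 % 8 == 0 && r11 % 8 == 0);

    int64_t d = r3 - r11;

    /* First term: the r3-based loads start above the last in-flight store.
     * Second term: they end below the first store of the 16-iteration window. */
    return (d + 8 > 0) || (d + 264 < 0);
}
```

The first term holds when the r3-based loads lie above the highest in-flight store address, _R11+(i-1)*16+8; the second holds when they lie entirely below the lowest, _R11+(i-16)*16. If neither holds, CR0 stays clear and the loads are treated as ordinary loads.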
Safe Stores
- A store is safe if it does not communicate with future loads
- Indirectly discover safe loads
  - Un-analyzable store
  - Load is safe if all stores in SQ are safe
- Summary of safe load detection
  - Simple loop body
  - All stores must be analyzable
  - Address range calculation
[Figure: loop body "… Load (A) … Store1 (UA) … Store2 (A) … Branch" and the resulting in-flight instructions across iterations; A and UA presumably mark analyzable and un-analyzable accesses]

Architectural Support
- Safe loads
  - Boolean condition registers
  - cset (instruction)
- Safe stores
  - Scope marker
- Indirect jumps
  - Flash-reset all condition registers

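A toy C model of the condition-register support may help make the mechanism concrete. Everything here is an assumption made for illustration: the register count, the names `CondRegs`, `flash_reset`, and `sld_is_safe`, the encoding of `CR_TRUE` as an extra index, and the choice that a flash-reset clears every register to false so that loads fall back to normal disambiguation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_CR  4        /* hypothetical number of condition registers */
#define CR_TRUE NUM_CR   /* hypothetical index of the always-true condition */

typedef struct {
    bool cr[NUM_CR];
} CondRegs;

/* cset: write a Boolean condition computed by software at loop entry. */
void cset(CondRegs *regs, int idx, bool value)
{
    assert(idx >= 0 && idx < NUM_CR);
    regs->cr[idx] = value;
}

/* Flash-reset on an indirect jump: assumed to clear every register, so that
 * no stale condition can mark a load as safe. */
void flash_reset(CondRegs *regs)
{
    memset(regs->cr, 0, sizeof regs->cr);
}

/* A safe-load instruction names a condition register; the load skips
 * disambiguation resources only if that condition is true. */
bool sld_is_safe(const CondRegs *regs, int idx)
{
    if (idx == CR_TRUE)
        return true;
    assert(idx >= 0 && idx < NUM_CR);
    return regs->cr[idx];
}
```

In this model, a load tagged `[CR_TRUE]` is always treated as safe, while a load tagged `[CR0]` is safe only between a successful `cset` and the next flash-reset.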
Experimental Setup
- Modified SimpleScalar 3.0b simulator
- Wattch to estimate dynamic energy consumption
- SPEC CPU2000 benchmark suite

Breakdown of Safe Loads (FP)
[Figure: breakdown of safe loads for floating-point applications. Callouts: 97%, 43%]

Performance Improvement (FP)
[Figure: performance improvement for floating-point applications. Callout: 40/48%]

Breakdown of Safe Loads (INT)
[Figure: breakdown of safe loads for integer applications]

Performance Improvement (INT)
[Figure: performance improvement for integer applications]

Energy Savings
[Figure: energy savings. Panels: Floating-point applications, Integer applications]

Conclusions
- Software assistance improves LSQ efficiency
  - Identifies 43% of loads as safe on average
  - Average 10% performance gain
- Compiler techniques for optimization of micro-architecture resources
- Future work
  - More powerful static analyzer
  - Manage other micro-architecture resources, e.g., register file

Thank you! Questions?

Support for Coherency
- Hash table: 2-bit
- Total entries: 512
- Details: Table 1, Table 2
  - Access bit
  - Invalidation bit

Read-Only Data Loads
- Alpha COFF binary header
  - Global pointer (GP)
  - Read-only sections
- Access address calculation
  - Algorithm: extended constant propagation
[Figure: gp=0x; Read-Only Section, Start: 0x, End: 0x]
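Once constant propagation has resolved a load's address, the final classification step reduces to a range check, sketched below in C. This is an illustrative assumption about that last step only, not the analyzer's algorithm: the function name `is_read_only_load` is hypothetical, the access is assumed to be a GP-relative load with a signed 16-bit displacement, and the section end is treated as exclusive. The actual gp and section addresses are elided on the slide, so any values passed in are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a load of `size` bytes at gp + disp falls entirely inside
 * the read-only section [ro_start, ro_end). Assumption: ro_end is exclusive. */
bool is_read_only_load(uint64_t gp, int16_t disp, uint64_t size,
                       uint64_t ro_start, uint64_t ro_end)
{
    assert(ro_start <= ro_end && size > 0);

    /* Sign-extend the displacement before adding it to the global pointer. */
    uint64_t addr = gp + (uint64_t)(int64_t)disp;

    return addr >= ro_start && addr <= ro_end && size <= ro_end - addr;
}
```

A load that passes this check reads data that no in-flight store can modify, so it can be marked safe without any loop analysis.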