2/15/2006 "Software-Hardware Cooperative Memory Disambiguation", Alok Garg, HPCA 2006

Presentation transcript:

Software-Hardware Cooperative Memory Disambiguation
Ruke Huang, Alok Garg, and Michael Huang
Department of Electrical & Computer Engineering, University of Rochester

Motivation
- Hiding long latencies
  - Scaling up of many structures
    - Complex, hard to design
    - Consumes more energy
    - Slower
- Inefficiency in hardware
  - Meticulously keeps track of all instructions
  - No prior knowledge of out-of-order execution
  - Simply cross-compares all loads and stores
[Figure; configuration: ROB size 320, SQ size 48, LQ size 48; axis label: LQ size; annotated value: 16%]

Software Assistance
- Global information
  - Statically identify non-conflicting memory accesses
- Advantages
  - Reduced resource pressure
  - Energy savings
- Loads not requiring memory disambiguation
  - On average, 43% of dynamic loads in SPEC FP applications

Recent Research
- Chrysos and Emer (ISCA’98)
- Sethumadhavan et al. (MICRO’03)
- Park et al. (MICRO’03)
- Baugh and Zilles (PACC’04)
- Akkary et al. (MICRO’03)
- Gandhi et al. (ISCA’05), etc.
- Hardware-only: provisioning, recurring overhead
- Cooperative: consumption, one-time overhead

Outline
- Cooperative Memory Disambiguation Framework
- Evaluation
- Conclusion

Cooperative Memory Disambiguation - Resource-Effective Approach
- 90% of dynamic loads do not communicate with in-flight stores
- Many loads do not require memory disambiguation resources
- Safe loads:
  - Software analyzer can identify them
  - Can exploit hardware-specific information
- Hardware resources only for non-safe loads

    int A[1000], B[1000];
    void VecAdd() {
        for (int i = 0; i < 1000; i++)
            A[i] = A[i] + B[i];
    }

Cooperative Memory Disambiguation Framework
- Software-hardware interface
  - Decoupled ISA (no compatibility obligations)
- Software support
  - Binary-to-binary translator: alto (Muth et al.)
  - Binary analyzer
    - Identify read-only data loads
    - Identify other general safe loads
- Architectural support
  - Lightweight
[Diagram labels: Source, Compiler, Compilation, Original binary, ISA, Hardware-specific translator, Translator, Extended instruction set, Hardware-specific internal binary, Hardware]

General Safe Loads
- Scope of parser analysis
  - Steady-state loop
  - No internal control flow
- Limited in-flight instructions
  - ROB size, store queue size
[Diagram: a simple loop body (Load, Store, Branch) and the instruction window during steady-state loop execution, holding loads and stores from iterations i, i-1, i-2]

General Safe Loads (Cont.)
- Real example from a SPEC FP application: one loop from galgel

    0x :  ldl   r31, 256(r3)     ; prefetch
    0x :  ldt   f21, 0(r3)       ; Ld1
    0x :  lda   r27, -2(r27)     ; r27 = r27-2
    0x c: lda   r3, 16(r3)       ; r3 = r3+16
    0x :  ldt   f22, -8(r3)      ; Ld2
    0x :  ldt   f23, 0(r11)      ; Ld3
    0x :  cmple r27, 0x1, r1     ;
    0x c: lda   r11, 16(r11)     ; r11 = r11+16
    0x :  ldt   f24, -8(r11)     ; Ld4
    0x :  lds   f31, 240(r11)    ; prefetch
    0x :  mult  f20, f21, f21    ;
    0x c: mult  f20, f22, f22    ;
    0x :  addt  f23, f21, f21    ;
    0x :  addt  f24, f22, f22    ;
    0x :  stt   f21, -16(r11)    ; St1
    0x c: stt   f22, -8(r11)     ; St2
    0x :  beq   r1, 0x           ;

- In the analysis, Ld1 is the r3-based load `ldt f21, 0(r3)` and Ld2 is the r11-based load `ldt f23, 0(r11)`; a second overlay of the listing on the slide labels the latter Ld2.
- Symbolic addresses at iteration i:
    AddrLd1 = _R3 + 16*i
    AddrLd2 = _R11 + 16*i
    AddrSt1 = _R11 + 16*i
    AddrSt2 = _R11 + 16*i + 8
- Analysis window: 16 iterations
- Address range of in-flight stores: _R11 + (i-16)*16 to _R11 + (i-1)*16 + 8
- Ld2 is statically determined to be safe
- Ld1 needs run-time evaluation
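The static check on this slide can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the analyzer from the talk: the type `ByteRange` and the functions `inflight_store_range` and `load_is_statically_safe` are hypothetical names, offsets are taken relative to `_R11 + 16*i`, every access is assumed to be 8 bytes wide, and the store range is treated conservatively as one contiguous hull.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open byte range [lo, hi), relative to base + stride*i (hypothetical type). */
typedef struct { int64_t lo, hi; } ByteRange;

/* Hull of the bytes written by stores of the previous `window` iterations.
 * st_off[k] is the offset of store k within its own iteration. */
static ByteRange inflight_store_range(const int64_t *st_off, int n_st,
                                      int64_t stride, int window, int64_t width)
{
    assert(n_st > 0 && stride > 0 && window > 0 && width > 0);
    int64_t min_off = st_off[0], max_off = st_off[0];
    for (int k = 1; k < n_st; k++) {
        if (st_off[k] < min_off) min_off = st_off[k];
        if (st_off[k] > max_off) max_off = st_off[k];
    }
    /* Oldest iteration is i-window, youngest older iteration is i-1. */
    ByteRange r = { min_off - stride * window, max_off - stride + width };
    return r;
}

/* A load is safe if its bytes lie entirely above or below the hull. */
static bool load_is_statically_safe(int64_t ld_off, int64_t width, ByteRange stores)
{
    return ld_off >= stores.hi || ld_off + width <= stores.lo;
}
```

With the slide's numbers (store offsets 0 and 8, stride 16, window 16), the hull is [-256, 0) relative to `_R11 + 16*i`, so the r11-based load at offset 0 falls outside it, which matches the slide's conclusion that Ld2 is safe.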

General Safe Loads (Cont.)
- Real example from a SPEC FP application: modified code

    New_entry:
          mark_sq
          if (r3-r11+8 > 0) or (r3-r11+264 < 0) then cset CR0, 1
    0x :  sldt  f21, 0(r3), [CR0]       ; Ld1 (safe)
    0x c: lda   r3, 16(r3)              ; r3 = r3+16
    0x :  sldt  f23, 0(r11), [CR_TRUE]  ; Ld2 (safe)
    0x :  cmple r27, 0x1, r1            ;
    0x c: lda   r11, 16(r11)            ; r11 = r11+16
    0x :  addt  f24, f22, f22           ;
    0x :  stt   f21, -16(r11)           ; St1
    0x c: stt   f22, -8(r11)            ; St2
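One way to read the run-time guard in the modified code is as a test that the r3-based loads cannot touch any byte written by stores of the previous 16 iterations. The C sketch below compares the guard against a brute-force overlap check. It is an interpretation, not the authors' derivation: the function names are hypothetical, all accesses are assumed to be 8 bytes wide and 8-byte aligned, and `r3` and `r11` stand for the register values at the start of the current iteration (their difference does not change across iterations because both advance by 16).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { STRIDE = 16, WINDOW = 16, WIDTH = 8 };  /* values taken from the slides */

/* Guard exactly as written on the slide. */
static bool guard_cr0(int64_t r3, int64_t r11)
{
    return (r3 - r11 + 8 > 0) || (r3 - r11 + 264 < 0);
}

/* Do two WIDTH-byte accesses starting at a and b share any byte? */
static bool accesses_overlap(int64_t a, int64_t b)
{
    return a < b + WIDTH && b < a + WIDTH;
}

/* Brute force: does either r3-based load of this iteration overlap a store
 * (St1 or St2) issued by one of the previous WINDOW iterations? */
static bool r3_loads_conflict(int64_t r3, int64_t r11)
{
    assert((r3 - r11) % 8 == 0);           /* alignment assumption */
    const int64_t ld[2] = { r3, r3 + 8 };  /* 0(r3), then -8(r3) after r3 += 16 */
    for (int back = 1; back <= WINDOW; back++) {
        const int64_t st[2] = { r11 - STRIDE * back, r11 - STRIDE * back + 8 };
        for (int l = 0; l < 2; l++)
            for (int s = 0; s < 2; s++)
                if (accesses_overlap(ld[l], st[s]))
                    return true;
    }
    return false;
}
```

Under these assumptions the guard is true exactly when the brute-force check finds no conflict, which suggests the constants 8 and 264 come from the 16-iteration window and the 16-byte footprint of the two r3-based loads.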

Safe stores
- A store is safe if it does not communicate with future loads
- Indirectly discover safe loads
  - Un-analyzable store
  - Load is safe if all stores in SQ are safe
Summary of safe load detection
- Simple loop body
- All stores must be analyzable
- Address range calculation
[Diagram: loop body with Load (A), Store1 (UA), Store2 (A), Branch, repeated across the in-flight instructions]

Architectural Support
- Safe loads
  - Boolean condition registers
  - cset (instruction)
- Safe stores
  - Scope marker
- Indirect jumps
  - Flash-reset all condition registers
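The condition-register mechanism listed here can be modeled with a few lines of C. This is a toy model under loud assumptions, not the hardware described in the talk: the register count `NUM_CR`, the type `CondRegs`, and all function names are hypothetical, `CR_TRUE` is modeled as a hard-wired always-true register, and the idea that a safe load skips load-queue allocation when its condition register is set is a reading of "hardware resources only for non-safe loads".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { NUM_CR = 8, CR_TRUE = NUM_CR };  /* NUM_CR is an assumed size */

typedef struct { bool cr[NUM_CR]; } CondRegs;

/* Indirect jump: clear every condition register at once. */
static void cr_flash_reset(CondRegs *c)
{
    memset(c->cr, 0, sizeof c->cr);
}

/* cset CRn, v: record the outcome of a software-generated safety test. */
static void cr_cset(CondRegs *c, int idx, bool v)
{
    assert(idx >= 0 && idx < NUM_CR);
    c->cr[idx] = v;
}

static bool cr_read(const CondRegs *c, int idx)
{
    if (idx == CR_TRUE)
        return true;                    /* statically safe loads use CR_TRUE */
    assert(idx >= 0 && idx < NUM_CR);
    return c->cr[idx];
}

/* A safe-load instruction tagged with [CRn] needs a load-queue entry
 * only when CRn is not set. */
static bool sload_needs_lq_entry(const CondRegs *c, int idx)
{
    return !cr_read(c, idx);
}
```

In this model a flash reset after an indirect jump makes every conditionally safe load fall back to normal disambiguation until software sets its condition register again, while loads tagged `CR_TRUE` are unaffected.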

Outline
- Cooperative Memory Disambiguation Framework
- Evaluation
- Conclusion

Experimental Setup
- Modified SimpleScalar 3.0b simulator
- Wattch to estimate dynamic energy consumption
- SPEC CPU2000 benchmark suite

Breakdown of Safe Loads (FP)
[Figure; annotated values: 97%, 43%]

Performance Improvement (FP)
[Figure; annotated value: 40/48%]

Breakdown of Safe Loads (INT)

Performance Improvement (INT)

Energy Savings
[Figure panels: Floating-point applications; Integer applications]

Conclusions
- Software assistance improves LSQ efficiency
  - Detects 43% of loads as safe on average
  - Average 10% performance gain
- Compiler techniques for optimization of microarchitecture resources
- Future work
  - More powerful static analyzer
  - Manage other microarchitecture resources, e.g., register file

Thank you! Questions?

Support for Coherency
- Hash table: 2-bit
- Total entries: 512
- Details: Table 1, Table 2; access bit, invalidation bit

Read-Only Data Loads
- Alpha COFF binary header
  - Global pointer (GP)
  - Read-only sections
- Access address calculation
  - Algorithm: extended constant propagation
[Figure: gp=0x ; Read-Only Section Start: 0x , End: 0x ]
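The read-only test outlined on this slide can be sketched in C. This is a minimal sketch under assumptions, not the analyzer's code: the type `Section` and both functions are hypothetical, section bounds are treated as half-open ranges, and the sketch assumes constant propagation has already resolved the load to a GP-relative access with a known 16-bit signed displacement (the displacement width of Alpha memory-format instructions). The concrete `gp` and section values are elided in the transcript, so any numbers used with this code are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end) of one read-only section (hypothetical type). */
typedef struct { uint64_t start, end; } Section;

/* Address of a GP-relative access once the GP value is known statically. */
static uint64_t gp_relative_addr(uint64_t gp, int16_t disp)
{
    return gp + (uint64_t)(int64_t)disp;  /* sign-extend, then wrap-around add */
}

/* True if the whole access [addr, addr + width) lies inside some read-only section. */
static bool in_read_only_section(uint64_t addr, uint64_t width,
                                 const Section *ro, int n)
{
    assert(n >= 0);
    for (int k = 0; k < n; k++)
        if (addr >= ro[k].start && addr + width <= ro[k].end)
            return true;
    return false;
}
```

A load whose resolved address passes this test reads data that no store writes, so it could be marked safe without any run-time condition.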