Characterization of Silent Stores
Gordon B. Bell, Kevin M. Lepak, Mikko H. Lipasti
University of Wisconsin—Madison


PACT 2000, G. Bell, K. Lepak and M. Lipasti

Background
- Lepak and Lipasti, "On the Value Locality of Store Instructions," ISCA 2000
- Introduced silent stores: a memory write that does not change the system state
- Silent stores are real and non-trivial: 20%-60% of all dynamic stores are silent in SPECINT-95 and MP benchmarks (32% average)

Why Do We Care?
- Reducing cache writebacks
- Reducing writeback buffering
- Reducing true and false sharing
- Write operations are generally more expensive than reads

Code Size / Efficiency

Example from mgrid (SPECFP-95):

R(I1,I2,I3) = V(I1,I2,I3)
  - A(0)*( U(I1,I2,I3) )
  - A(1)*( U(I1-1,I2,I3) + U(I1+1,I2,I3) + U(I1,I2-1,I3) + U(I1,I2+1,I3)
         + U(I1,I2,I3-1) + U(I1,I2,I3+1) )
  - A(2)*( U(I1-1,I2-1,I3) + U(I1+1,I2-1,I3) + U(I1-1,I2+1,I3) + U(I1+1,I2+1,I3)
         + U(I1,I2-1,I3-1) + U(I1,I2+1,I3-1) + U(I1,I2-1,I3+1) + U(I1,I2+1,I3+1)
         + U(I1-1,I2,I3-1) + U(I1-1,I2,I3+1) + U(I1+1,I2,I3-1) + U(I1+1,I2,I3+1) )
  - A(3)*( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1) + U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
         + U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1) + U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )

Eliminating this expression (when silent) removes over 100 static instructions (2.4% of the total dynamic instructions).

This Talk
- Characterize silent stores:
  - Why do they occur?
  - Source code case studies
  - Silent store statistics
  - Critical silent stores
- Goal: provide insight into silent stores that can lead to novel innovations in detecting and exploiting them

Terminology
- Silent store: a memory write that does not change the system state
- Store verify: a load, compare, and conditional store (performed only if non-silent)
- Store squashing: removal of a silent store from program execution
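The store verify operation can be sketched in C as a software analogy. The paper describes a microarchitectural load/compare/conditional-store sequence; the helper name below is hypothetical:

```c
#include <stdint.h>

/* Sketch of a store verify: load the old value, compare it against the
 * value to be stored, and perform the store only if it would actually
 * change memory. Returns 1 if the store was silent and squashed. */
static int store_verify(int32_t *addr, int32_t value)
{
    if (*addr == value)      /* load + compare: the store would be silent */
        return 1;            /* squash: suppress the write */
    *addr = value;           /* non-silent: perform the store */
    return 0;
}
```

Returning whether the store was squashed is convenient for a simulator that counts silent stores; real hardware would instead suppress the cache write and any dirty-bit update.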

An Example

for (i = 0; i < 32; i++) {
    time_left[i] -= MIN(time_left[i], time_to_kill);
}

Example from m88ksim. The store to time_left[i] is silent in over 95% of the dynamic executions of this loop. It is difficult for the compiler to eliminate, because how often the store is silent may depend on program inputs.

Value Distribution
Stores of many different values, to many different addresses, are likely to be silent.

Frequency of Execution
A few static store instructions contribute most of the dynamic silent stores.

Stack / Heap
- Stack silent stores are uniform across benchmarks (25%-50%)
- Heap silent stores vary widely by benchmark

Stores Likely to be Silent
Four categories, based on the previous execution of the same static store.

Same Location, Same Value
- A silent store stores the same value to the same location as the last time it was executed
- Common in loops

Same Location, Same Value

for (anum = 1; anum <= maxarg; anum++) {
    argflags = arg[anum].arg_flags;
    ...
}

Example from perl. argflags is a stack-allocated temporary variable (same location); arg_flags is often zero (same value). Silent 71% of the time.

Stores Likely to be Silent: Different Location, Same Value
- A silent store stores the same value to a different location than the last time it was executed
- Common in instructions that store to an array indexed by a loop induction variable

Different Location, Same Value

for (x = xmin; x <= xmax; ++x)
    for (y = ymin; y <= ymax; ++y) {
        s = y*boardsize + x;
        ...
        ltrscr -= ltr2[s];
        ltr2[s] = 0;
        ltr1[s] = 0;
        ltrgd[s] = FALSE;
    }

Example from go. This loop clears the game board array, and the board is likely to be mostly zero in subsequent clearings. The three clearing stores are silent 86%, 43%, and 77% of the time, respectively.

Stores Likely to be Silent: Same Location, Different Value
- A silent store stores a different value to the same location as the last time it was executed
- Rare, but can be caused by:
  - Intervening static stores to the same address
  - Stack frame manipulations

Same Location, Different Value

for (x = xmin; x <= xmax; ++x)
    for (y = ymin; y <= ymax; ++y) {
        s = y*boardsize + x;
        ...
        ltrscr -= ltr2[s];
        ltr2[s] = 0;
        ltr1[s] = 0;
        ltrgd[s] = FALSE;
    }

Example from go. ltrscr is a global variable (same location); ltr2 is indexed by the loop induction variable (different value). The store to ltrscr is silent 86% of the time, but of that, 98% is Same Location, Same Value.

Callee-Saved Registers

call foo()
call bar()
…
call foo()

*** void foo() ***
sw $17,28($fp)
...

*** void bar() ***
sw $17,28($fp)
...

$17 is callee-saved, so both functions spill it to the same frame offset. When foo() is called again after bar() with the same frame pointer, foo()'s spill may store a different value than it did on its previous execution, yet still be silent, because the intervening spill in bar() already wrote that value to the slot: same location, different value, silent.

Stores Likely to be Silent: Different Location, Different Value
- A static silent store stores a different value to a different location than the last time it was executed
- Example: nested loops

Different Location, Different Value

NODE ***xlsave(NODE **nptr, ...) {
    ...
    for (; nptr != (NODE **) NULL; nptr = va_arg(pvar, NODE **)) {
        ...
        *--xlstack = nptr;
        ...
    }
}

Example from li. xlstack is continually decremented (different location); nptr is set to the next function argument (different value). The store is silent if subsequent calls to xlsave store the same set of nodes starting at the same stack address.

Likelihood of Being Silent
Silence can be accurately predicted based on category.

Silent Store Breakdown
Stores that can be predicted silent (the Same Value categories) are a large portion of all silent stores.
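Because the Same Value categories dominate, a natural way to predict silence is a small last-value table indexed by the static store's PC, attempting a store verify only on a predicted-silent store. This is a minimal sketch under assumed table size and indexing, not a design from the paper:

```c
#include <stdint.h>

#define PRED_ENTRIES 1024   /* assumed table size */

/* One entry per (hashed) static store PC: the value it stored last time. */
static struct { int32_t last_value; int valid; } pred[PRED_ENTRIES];

/* Predict whether this execution of the store at 'pc' is silent, i.e.
 * whether it is about to store the same value as its previous execution
 * (the Same Location/Same Value and Different Location/Same Value cases),
 * then train the table on the new value. */
static int predict_silent(uint32_t pc, int32_t value)
{
    uint32_t i = (pc >> 2) % PRED_ENTRIES;  /* word-aligned PC hash */
    int hit = pred[i].valid && pred[i].last_value == value;
    pred[i].last_value = value;             /* train on the newly stored value */
    pred[i].valid = 1;
    return hit;
}
```

A prediction only gates whether the (cheap) store verify is attempted; correctness never depends on the predictor being right.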

Critical Silent Stores
Critical silent store: a specific dynamic silent store that, if not squashed, will cause a cache line to be marked as dirty and hence require a writeback.

Critical Silent Stores
[Figure: a cache line holding words A-F, with a dirty bit.]

Critical Silent Stores
Suppose the line already holds C = 0 and E = 2, and the program executes sw 0, C and sw 2, E. Both silent stores are critical, because the dirty bit would not have been set if the silent stores were squashed.

Non-Critical Silent Stores
Now add a non-silent store sw 4, D between them (D does not already hold 4). No silent store is critical, because the dirty bit is set by the non-silent store regardless of squashing.
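The dirty-bit behavior in these two examples can be sketched with a toy cache-line model (hypothetical types; word[0..5] plays the role of words A-F):

```c
#include <stdint.h>

struct cache_line {
    int32_t word[6];   /* words A..F */
    int dirty;         /* set on any performed (non-squashed) write */
};

/* Perform a store to the line, optionally squashing it when silent.
 * A line is written back only if its dirty bit is set, so squashing
 * every silent store to a line keeps it clean: those silent stores
 * were critical. One non-silent store makes the rest non-critical. */
static void do_store(struct cache_line *l, int idx, int32_t value,
                     int squash_silent)
{
    if (squash_silent && l->word[idx] == value)
        return;            /* silent store squashed: dirty bit untouched */
    l->word[idx] = value;  /* performed store marks the line dirty */
    l->dirty = 1;
}
```

Running the two slide scenarios through this model shows the line staying clean when only silent stores hit it, and going dirty as soon as sw 4, D is performed.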

Critical Silent Stores: Who Cares?
- It is sufficient to squash only critical silent stores to obtain the maximal writeback reduction
- Squashing non-critical silent stores:
  - Incurs store verify overhead with no reduction in writebacks
  - Can cause additional address bus transactions in multiprocessors

Critical Silent Stores: Example

do {
    *(htab_p-16) = -1;
    *(htab_p-15) = -1;
    *(htab_p-14) = -1;
    *(htab_p-13) = -1;
    ...
    *(htab_p-2) = -1;
    *(htab_p-1) = -1;
} while ((i -= 16) >= 0);

Example from compress. These 16 stores fill entire cache lines; if all stores to a line are silent, then they are all critical as well. 19% of all writebacks can be eliminated.

Writeback Reduction
Squashing only a subset of silent stores yields a significant writeback reduction.

Conclusion
- Silent stores occur for a variety of values and execution frequencies
- Causes of silent stores:
  - Algorithmic (bad programming?)
  - Architecture / compiler conventions
- Squashing only critical silent stores is sufficient to obtain the maximal writeback reduction

Future Work
- Silence prediction: perform a store verify only when there is reason to believe the store is both silent and critical
- Multiprocessor silent stores: extend the notion of criticality to include silent stores that cause sharing misses as well as writebacks