Evaluating Indirect Branch Handling Mechanisms in Software Dynamic Translation Systems
Jason D. Hiser, Daniel Williams, Wei Hu, Jack W. Davidson, Jason Mars (Department of Computer Science, University of Virginia)
Bruce Childers (Department of Computer Science, University of Pittsburgh)

2 What is SDT?
Software dynamic translation (SDT) is the programmatic modification of a running program's binary instructions: a software layer mediates program execution by modifying (translating) instructions before they execute on the host CPU.
[Figure: layered stack showing the Application Binary on top of the Dynamic Translator, which runs on the Operating System and CPU.]
Uses include:
- Dynamic optimization (e.g., Dynamo, JITs)
- Code security (e.g., diversity, shepherding)
- Software migration (e.g., Apple Rosetta)
- Dynamic instrumentation (e.g., Insop)
- Dynamic patching & debugging (bug fixes)
- And many more!

3 SDT Overhead
More pervasive use is desirable; high overhead can limit pervasive use
- Execution time, memory, disk size, network traffic
Many techniques to minimize overhead
- Traces, large code regions, branch linking, etc.
How branches are handled is especially important
- Indirect branches are problematic
Several IB schemes exist in different translators and architectures.
Goal: understand how translation mechanisms for indirect branches impact overhead, given architecture capabilities.

4 Overview
- Introduction
- SDT and branch handling
- Indirect branch mechanisms
- Evaluation
- Summary

5 Software Dynamic Translation
[Figure: the SDT execution loop. The dynamic translator captures application context, takes the next PC, and checks the fragment cache (Cached?). On a miss it fetches, decodes, and translates instructions from the application binary into a new fragment until the fragment is finished (it ends in a direct or indirect branch), then context-switches into the fragment cache. When a fragment exits with a new PC, control returns to the translator.]
A toy C rendering of this loop is sketched below.
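To make the loop in the figure concrete, here is a minimal, runnable toy model of it in C. The application PCs, the fragment cache, and the fragments themselves are simplified stand-ins (small integers and arrays) chosen for illustration; none of this is Strata's actual API.

    #include <stdio.h>

    /* Toy model of the SDT loop: look up the next application PC in the
     * fragment cache, translate it on a miss, then execute the fragment,
     * which returns the next application PC. */

    #define NUM_PCS 4

    static int cached[NUM_PCS];                            /* fragment cache: has this PC been translated? */
    static const int next_pc[NUM_PCS] = { 1, 2, 3, -1 };   /* toy control flow; -1 ends the run */

    static void translate_fragment(int pc)                 /* Fetch -> Decode -> Translate -> New Fragment */
    {
        printf("translating fragment for pc %d\n", pc);
        cached[pc] = 1;
    }

    static int run_fragment(int pc)                        /* context switch into the fragment cache */
    {
        printf("executing fragment for pc %d\n", pc);
        return next_pc[pc];                                /* the fragment exits with the next application PC */
    }

    int main(void)
    {
        int pc = 0;
        while (pc >= 0) {
            if (!cached[pc])                               /* the "Cached?" decision in the diagram */
                translate_fragment(pc);
            pc = run_fragment(pc);                         /* a direct or indirect branch picks the next PC */
        }
        return 0;
    }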

6 Handling Direct Branches
[Figure: the same SDT execution loop, highlighting the direct-branch exit from a fragment.]
Fragment linking: change the branch to jump to the already translated target fragment.
A small C sketch of the idea follows.
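An illustrative sketch of fragment linking, assuming a hypothetical fragment record whose exit pointer initially aims back at the translator. On real hardware the link is a patch of the branch instruction in the fragment cache, not a field assignment.

    #include <stdio.h>

    typedef struct fragment fragment;
    struct fragment {
        const char *name;
        fragment   *exit_target;   /* NULL means "trampoline back to the translator" */
    };

    /* Fragment linking: once the branch target has been translated, rewrite the
     * exit so control flows fragment-to-fragment without re-entering the translator. */
    static void link_fragments(fragment *from, fragment *to)
    {
        from->exit_target = to;
    }

    int main(void)
    {
        fragment a = { "frag_A", NULL };
        fragment b = { "frag_B", NULL };

        printf("before linking: %s exits to %s\n", a.name,
               a.exit_target ? a.exit_target->name : "translator");
        link_fragments(&a, &b);
        printf("after linking:  %s exits to %s\n", a.name, a.exit_target->name);
        return 0;
    }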

7 Handling Indirect Branches
[Figure: the same SDT execution loop, highlighting the indirect-branch exit from a fragment.]
A fragment ending with an indirect branch can transfer to one of several target addresses, so the branch cannot be linked to its targets.

8 Indirect branches are rare, right?

9 Reduce Overhead due to IBs
[Figure: the SDT execution loop with the fragment cache highlighted; a fragment ends with an indirect branch that can transfer to one of several target addresses.]
Embed the lookup and mapping of the application address into the fragment cache:
- Minimize the amount of context to save and restore
- Can be specialized to each indirect branch
Map the application address to a fragment address:
- Typically uses a hash table
- Implemented as data or as an instruction sequence
- Interacts with the target machine
IB mapping implementations:
- Data-cache hashing: IBTC [Strata; Bruening; Kim & Smith]
- Instruction-cache hashing: Sieve [HDTrans]
- Combined: inline entries [Dynamo, DAISY, Pin, Strata]

10 Indirect Branch Translation Cache
Mapping is done with a table in memory (memory accesses). The table is indexed by the application address; each table entry maps an application address (AppAddr) to a fragment address (FragAddr).

Application Binary:
    ...
    r1 = ...
    ...
    jmp r1
    ...

Fragment Cache (translated indirect branch):
L0: ...
    r1 = ...
    ...
    save t0, t1
    t0 = hash(r1)
    if (IBTC[t0].AppAddr == r1)
        t1 = IBTC[t0].FragAddr
        jmp t1
        restore t0, t1
    else
        jmp translator

11 Indirect Branch Translation Cache
Table in memory
- Advantage: small code footprint and minimal branches
- Disadvantage: memory accesses and D-cache pressure
- Other considerations: uses two temporary registers and a comparison
Many options:
- Sharing (one table for all branches or one per branch)
- Appropriate size (number of entries)
- Resizing (dynamically adjust the size)
- Reprobing (where to look on a collision)
- Lookup code placement: inline in the fragment or in a separate "function"
A C sketch of the lookup follows.
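A minimal, runnable C sketch of the IBTC idea, assuming a single shared, direct-mapped table; the entry layout, table size, and hash function are illustrative choices, not Strata's exact implementation.

    #include <stdio.h>
    #include <stdint.h>

    #define IBTC_ENTRIES 4096u                    /* table size is one of the tuning knobs */

    typedef uint64_t app_addr_t;
    typedef void    *frag_addr_t;

    struct ibtc_entry {
        app_addr_t  app_addr;                     /* untranslated target the indirect branch took */
        frag_addr_t frag_addr;                    /* where its translation lives in the fragment cache */
    };

    static struct ibtc_entry ibtc[IBTC_ENTRIES];

    static unsigned ibtc_hash(app_addr_t a)
    {
        return (unsigned)(a >> 2) & (IBTC_ENTRIES - 1);   /* power-of-two masking hash */
    }

    /* The work a translated indirect branch performs: one load and one compare.
     * A hit yields the fragment-cache target; a miss means "back to the translator". */
    static frag_addr_t ibtc_lookup(app_addr_t target)
    {
        struct ibtc_entry *e = &ibtc[ibtc_hash(target)];
        return (e->app_addr == target) ? e->frag_addr : NULL;
    }

    static void ibtc_insert(app_addr_t target, frag_addr_t frag)
    {
        struct ibtc_entry *e = &ibtc[ibtc_hash(target)];  /* direct-mapped: a collision just overwrites */
        e->app_addr  = target;
        e->frag_addr = frag;
    }

    int main(void)
    {
        int dummy_fragment;                               /* stands in for translated code */
        ibtc_insert(0x400a40, &dummy_fragment);
        printf("hit:  %p\n", ibtc_lookup(0x400a40));
        printf("miss: %p\n", ibtc_lookup(0x123456));
        return 0;
    }

The options on the slide map directly onto this sketch: sharing decides whether ibtc is one global table or one per branch site, reprobing would add extra probe slots to ibtc_lookup, and resizing would grow IBTC_ENTRIES at run time.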

12 Sieve
Mapping is done by executing an instruction sequence.
[Figure: the fragment cache contains a Sieve Table of dispatch jumps (Jmp Bucket1, Jmp Bucket4, ..., Return To Translator). Each bucket is a chain of address-compare stubs (Bucket1: Addr4; Bucket2: Addr8; Bucket3: Addr12; Bucket4: Addr10; Bucket5: Addr16) that jump to matching fragments (Frag10, Frag99, Frag111, Frag16, Frag204) or fall back to the translator.]

13 Sieve
Table implemented as an instruction sequence
- Advantage: fewer memory accesses
- Disadvantage: more branches and possible I-cache pressure
- Other considerations: uses one temporary register and compares an address-sized constant against a register
Options:
- Table size
- Others possible, but they seem not to matter
A C rendering of the sieve's behavior follows.
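Below is a runnable C rendering of what sieve code does, assuming a tiny 8-slot dispatch table; in a real sieve the dispatch table and compare chains are emitted as instructions in the fragment cache, so the mapping is executed rather than loaded from a data table. Bucket names, addresses, and fragment names are invented for illustration.

    #include <stdio.h>
    #include <stdint.h>

    typedef uint64_t app_addr_t;

    /* Stand-ins for fragment-cache entry points; in a real sieve these are
     * code-cache labels reached by direct jumps, not C functions. */
    static void frag_10(void)  { puts("enter Frag10");  }
    static void frag_99(void)  { puts("enter Frag99");  }
    static void frag_204(void) { puts("enter Frag204"); }
    static void return_to_translator(void) { puts("sieve miss: return to translator"); }

    /* One bucket: a chain of compare-and-jump stubs, each using one temporary
     * register and one address-sized constant. */
    static void bucket0(app_addr_t t)
    {
        if (t == 0x400a40) { frag_10(); return; }
        if (t == 0x401200) { frag_99(); return; }
        return_to_translator();                           /* chain exhausted */
    }

    static void bucket6(app_addr_t t)
    {
        if (t == 0x4017f8) { frag_204(); return; }
        return_to_translator();
    }

    /* The sieve dispatch: the translated indirect branch jumps into a jump
     * table indexed by a cheap hash of the target and lands in one bucket chain. */
    static void sieve_dispatch(app_addr_t t)
    {
        switch ((t >> 2) & 0x7) {
        case 0:  bucket0(t); break;
        case 6:  bucket6(t); break;
        default: return_to_translator(); break;           /* empty bucket */
        }
    }

    int main(void)
    {
        sieve_dispatch(0x401200);     /* hashes to bucket 0, hits the second stub */
        sieve_dispatch(0x4017f8);     /* hashes to bucket 6, hits its only stub */
        sieve_dispatch(0xdeadbeef);   /* no stub matches: back to the translator */
        return 0;
    }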

14 Combined: Inline Mapping
Instructions are emitted at each branch to perform the translation. No hashing: the application address is compared against inlined addresses.

Application Binary:
    ...
    r1 = ...
    ...
    jmp r1
    ...

Fragment Cache (inlined translation):
L0: ...
    r1 = ...
    ...
    save t0
    t0 = APPADDR_1
    if (r1 == t0) jmp FRAGADDR_100
    restore t0
    t0 = APPADDR_2
    if (r1 == t0) jmp FRAGADDR_120
    restore t0

15 Combined: Inline Mapping
Inlining mappings at the indirect branch
- Advantage: avoids hashing, no memory accesses, minimal branches
- Disadvantage: code growth, and the hit cost depends on which entry hits
- Other considerations: possibly one register and a constant address compared against a register
Options:
- Number of inline entries (should the translator decide the amount of inlining?)
- Which target to inline
- Execution point at which that target is selected
- Backing mechanism to use (what to do on a miss)
A C sketch of an inlined translation follows.
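A runnable C sketch of what the inlined translation at one indirect-branch site does, assuming two inlined entries and a stub backing mechanism; APPADDR_1, APPADDR_2, the fragment names, and the fallback are illustrative, matching the shape of the code on slide 14 rather than any particular translator's emitted sequence.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    typedef uint64_t app_addr_t;

    /* Stand-ins for translated fragments and for the backing mechanism
     * (which would be an IBTC or sieve lookup in practice). */
    static void frag_100(void) { puts("enter FRAGADDR_100"); }
    static void frag_120(void) { puts("enter FRAGADDR_120"); }
    static void backing_lookup(app_addr_t t)
    {
        printf("miss on %#" PRIx64 ": fall back to the backing mechanism\n", t);
    }

    #define APPADDR_1 0x400a40u        /* illustrative inlined targets */
    #define APPADDR_2 0x400b10u

    /* The inlined translation emitted at one branch site: no hashing and no
     * memory loads, just compares against the targets chosen for inlining. */
    static void inline_translation(app_addr_t target)
    {
        if (target == APPADDR_1) { frag_100(); return; }
        if (target == APPADDR_2) { frag_120(); return; }
        backing_lookup(target);        /* neither inlined entry matched */
    }

    int main(void)
    {
        inline_translation(APPADDR_2);
        inline_translation(0x123456);
        return 0;
    }

Which targets get inlined, how many, and at what execution point they are chosen are exactly the options listed on the slide above.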

16 Evaluation
Common SDT platform used to study indirect branch translation implementations across architectures:
- Strata: retargetable framework [CGO'03, IJPP'05, VEE'06]
Three machine/OS/compiler combinations:
- UltraSparc-IIi / Solaris / SunSWPRO
- Pentium IV Xeon / Linux / gcc 3.4
- Opteron 244 / Linux / gcc 4.0
Benchmarks: SPEC 2000 mesa, gcc, crafty, eon, perlbmk, gap, and vortex
Returns are handled separately (they are predictable)
Metric: slowdown compared to native execution (no translation)

17 IBTC Size (P4)
Conflicts are reduced by a larger table size; the benefit levels off, and cost increases, beyond 32K entries. Opteron and SPARC had similar results.

18 IBTC Reprobing (P4)
Reprobing reduces conflicts for a 1K-entry table, but the increased cost is not worthwhile with a 32K-entry table. Opteron and SPARC had similar results.

19 Sieve Size (P4)
Conflicts are reduced by a larger table, but ISA effects restrict the benefit beyond 16K entries. Opteron had similar results; SPARC levels off at 1K entries.

20 Inlining (Opteron)
Inlining helps the branch predictor in some cases. The P4 and SPARC have worse performance with inlining (code complexity and I-cache pressure).

21 Summary
SDT is widely used and performance is important
- Good performance requires good IB handling
Evaluated IB handling techniques in an apples-to-apples comparison across three architectures.
Details of the hardware dictate the best method:
- IBTC on SPARC, due to its limited constant size (3.5% avg SPEC overhead)
- 16K-entry Sieve on the Intel P4, to avoid the eflags save (4.5% avg SPEC overhead)
- Inlining on the Opteron, to help the branch predictor (2.2% avg SPEC overhead)

Evaluating Indirect Branch Handling Mechanisms in Software Dynamic Translation Systems Questions? Contact us: