Umbra: Efficient and Scalable Memory Shadowing
Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT)
CGO 2010, Toronto, Canada, April 26, 2010



Shadow Memory
Meta-data
– Track properties of application memory
Synchronized Update
– Application data and meta-data
[Figure: application memory (a.out, heap, libc, stack) alongside its shadow memory]

Examples
Memory Error Detection
– MemCheck [VEE’07]
– Purify [USENIX’92]
– Dr. Memory
– MemTracker [HPCA’07]
Dynamic Information Flow Tracking
– LIFT [MICRO’39]
– TaintTrace [ISCC’06]
Multi-threaded Debugging
– Eraser [TOCS’97]
– Helgrind
Others
– Redux [TCS’03]
– Software Watchpoint [CC’08]

Issues
Performance
– Runtime overhead
    Example: MemCheck 25x [VEE’07]
Scalability
– 64-bit architecture
Dependence
– OS
– Hardware
Development
– Implemented with specific analysis
– Lack of a general framework

Memory Shadowing System
Dynamic Instrumentation
– Context switch (application ↔ shadow)
– Address calculation
– Updating meta-data
Memory Management
– Memory allocation / free
    Monitor application memory management
    Manage shadow memory
– Mapping translation scheme (addr_A → addr_S)
    DMS: Direct Mapping Scheme
    SMS: Segmented Mapping Scheme

Direct Mapping Scheme (DMS)
Single memory region for entire address space.
Translation:
    lea [addr] → %r1
    add %r1, disp → %r1
Issue: address conflict between mem_A and mem_S
[Figure: application and shadow regions; chart of slowdown relative to native execution]
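The two-instruction translation above can be sketched in C. This is only an illustration: the constant `DMS_DISP` and the function name `dms_translate` are hypothetical and do not come from the slides, and a real tool would pick the displacement from the actual memory layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical displacement between application and shadow memory. */
#define DMS_DISP ((uintptr_t)0x20000000u)

/* Direct mapping: addr_S = addr_A + disp (mirrors the lea/add pair). */
static uintptr_t dms_translate(uintptr_t app_addr)
{
    /* Guard against wrap-around only; DMS also requires that the result
     * never lands inside application memory, which is not checked here. */
    assert(app_addr <= UINTPTR_MAX - DMS_DISP);
    return app_addr + DMS_DISP;
}
```

With this displacement, `dms_translate(0x1000)` yields `0x20001000`. The translation is cheap because it needs no table, which is also why it breaks down once application and shadow regions collide.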

Segmented Mapping Scheme (SMS)
Shadow segment per application segment
Translation:
– Segment lookup (address indexing)
– Address translation
    lea [addr] → %r1
    mov %r1 → %r2
    shr %r2, 16 → %r2
    add %r1, disp[%r2] → %r1
[Figure: segment table mapping addr_A to addr_S (App 1 → Shd 1, App 2 → Shd 2); chart of slowdown relative to native execution]
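A minimal C sketch of the segment lookup, assuming a 32-bit address space split into 64 KB segments (suggested by the `shr 16` in the translation sequence). The names `seg_disp`, `sms_map_segment`, and `sms_translate` are illustrative, not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SHIFT 16u            /* assumed 64 KB segments (shr 16) */
#define SEG_COUNT (1u << 16)     /* 2^32 / 2^16 entries for 32-bit addresses */

/* Segment table: one displacement per application segment. */
static int64_t seg_disp[SEG_COUNT];

/* Record that the segment at app_base is shadowed at shd_base. */
static void sms_map_segment(uint32_t app_base, uint32_t shd_base)
{
    assert((app_base & 0xFFFFu) == 0 && (shd_base & 0xFFFFu) == 0);
    seg_disp[app_base >> SEG_SHIFT] = (int64_t)shd_base - (int64_t)app_base;
}

/* addr_S = addr_A + disp[addr_A >> 16] */
static uint32_t sms_translate(uint32_t app_addr)
{
    uint32_t seg = app_addr >> SEG_SHIFT;
    return (uint32_t)((int64_t)app_addr + seg_disp[seg]);
}
```

Because each segment carries its own displacement, shadow segments can be placed wherever free space exists, at the cost of one extra table load per translation.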

Umbra
Mapping Scheme
– Segmented mapping
– Scale with actual memory usage
Implementation
– DynamoRIO
Optimization
– Translation optimization
– Instrumentation optimization
Client API
Experimental Results
– Performance evaluation
– Statistics collection

Shadow Memory Mapping
Scaling to 64-bit Architecture
– DMS
    Infeasible due to memory layout
– Single-Level SMS
    Too big (~4 billion entries)
– Multi-Level SMS
    Even more expensive
    Fast path on lower 32G (MemCheck)
[Figure: 64-bit address space layout: user space (a.out, stack), unusable space, kernel space, vsyscall; chart of slowdown relative to native execution]

Shadow Memory Mapping
Scaling to 64-bit Architecture
– DMS is infeasible
– Single-Level SMS is too sparse
– Multi-Level SMS is too expensive
Umbra Solution
– Eliminate empty entries
– Compact table
– Walk the table to find the entry
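The sketch below illustrates both points under stated assumptions: a 48-bit user address space with 64 KB units (which gives the roughly 4 billion single-level entries quoted on the slides), and a compact table holding one entry per region actually in use. The struct layout and function names are hypothetical, not Umbra's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per application region in use (hypothetical layout). */
typedef struct {
    uint64_t app_base;   /* start of the application region */
    uint64_t app_end;    /* one past the end of the region */
    int64_t  disp;       /* shadow address = application address + disp */
} map_entry_t;

/* Entries needed by a single-level table: one per unit of address space. */
static uint64_t single_level_entries(unsigned addr_bits, unsigned unit_bits)
{
    assert(addr_bits > unit_bits && addr_bits - unit_bits < 64);
    return (uint64_t)1 << (addr_bits - unit_bits);
}

/* Walk the compact table; returns NULL when the address is not mapped. */
static const map_entry_t *table_walk(const map_entry_t *tbl, size_t n,
                                     uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tbl[i].app_base && addr < tbl[i].app_end)
            return &tbl[i];
    return NULL;
}
```

Here `single_level_entries(48, 16)` is 2^32 = 4,294,967,296, while the compact table grows only with the number of mapped regions. The linear walk is the price, which is what the later translation optimizations target.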


Implementation
Memory Manager
– Monitor and control application memory allocation
    brk, mmap, munmap, mremap
– Allocate shadow memory
– Maintain translation table
Instrumenter
– Instrument every memory reference
    Context save
    Address calculation
    Address translation
    Shadow memory update
    Context restore
[Figure: App 1 → Shd 1, App 2 → Shd 2]


Unoptimized System
Small overhead from DynamoRIO
Slower than SMS-64
– Need to walk the global translation table
Why so slow?
– 41.79% of instructions are memory references
– For each of these instructions
    Full context switch
    Table lookup
    Call-out instrumentation
[Figure: global translation table, annotated “~100”]

Optimization
Translation Optimization
– Thread-local translation cache
– Hashtable lookup
– Memoization mini-cache
– Reference uni-cache
Instrumentation Optimization
– Context switch reduction
– Reference grouping
– 3-stage code layout
[Figure: global translation table, annotated “~100”]

1. Thread-Local Translation Cache
Local translation table per thread
– Synchronize with global translation table when necessary
– Avoid lock contention
– Walk table to find matching entry
    Walk global table if not found in thread-local cache
Inlined instrumentation
[Figure: Thread 1 and Thread 2, each with a thread-local translation cache backed by the global translation table]

2. Hashtable Lookup
Hashtable per thread
Fixed number of slots
Hash(addr_A) → entry in thread-local cache
– If match, found
– If no match, walk the local cache
[Figure: per-thread hashtable in front of the thread-local translation cache and the global translation table]
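A hedged sketch of this lookup path in C. The slot count, the 64 KB unit size, the hash function (low bits of the unit number), and all identifiers are assumptions made for illustration; the slides only state that the table is per thread and has a fixed number of slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16      /* assumed unit size: 64 KB */
#define HASH_SLOTS 64      /* assumed fixed slot count (power of two) */
#define CACHE_MAX  32      /* assumed capacity of the local cache */

typedef struct { uint64_t unit; int64_t disp; } entry_t;

typedef struct {
    entry_t  cache[CACHE_MAX];    /* thread-local translation cache */
    size_t   count;               /* entries in use */
    entry_t *slots[HASH_SLOTS];   /* hashtable pointing into cache[] */
} thread_tbl_t;

/* Hash first; on a miss, walk the local cache and refill the slot. */
static entry_t *local_lookup(thread_tbl_t *t, uint64_t addr)
{
    assert(t != NULL && t->count <= CACHE_MAX);
    uint64_t unit = addr >> UNIT_SHIFT;
    size_t h = (size_t)(unit & (HASH_SLOTS - 1));

    if (t->slots[h] != NULL && t->slots[h]->unit == unit)
        return t->slots[h];                 /* hashtable hit */

    for (size_t i = 0; i < t->count; i++) {
        if (t->cache[i].unit == unit) {
            t->slots[h] = &t->cache[i];     /* remember for next time */
            return t->slots[h];
        }
    }
    return NULL;   /* caller would fall back to the global table */
}
```

Two units that hash to the same slot simply evict each other; correctness is preserved because a slot mismatch always falls through to the walk.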

3. Memoization Mini-Cache
Four-entry table per thread
– Stack
– Heap
– Application (a.out)
– Units found in last table lookup
If no match, hashtable lookup
– 68.93% hit ratio
[Figure: per-thread memoization mini-cache in front of the hashtable, thread-local translation cache, and global translation table]
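The four-entry table might look like the following C sketch. The slot names, the range-based match, and the helper `mini_remember` are assumptions; the slides do not specify how entries are compared or replaced.

```c
#include <assert.h>
#include <stdint.h>

/* Slot roles follow the slide: stack, heap, a.out, last lookup result. */
enum { MC_STACK, MC_HEAP, MC_APP, MC_LAST, MC_SIZE };

typedef struct { uint64_t base, end; int64_t disp; } unit_t;
typedef struct { unit_t e[MC_SIZE]; } mini_cache_t;

/* Returns 1 and stores the displacement on a hit, 0 on a miss.
 * Zero-initialized slots have base == end and therefore never match. */
static int mini_lookup(const mini_cache_t *mc, uint64_t addr, int64_t *disp)
{
    assert(mc != 0 && disp != 0);
    for (int i = 0; i < MC_SIZE; i++) {
        if (addr >= mc->e[i].base && addr < mc->e[i].end) {
            *disp = mc->e[i].disp;
            return 1;
        }
    }
    return 0;   /* caller continues with the hashtable lookup */
}

/* After a slower lookup succeeds, remember its unit in the last slot. */
static void mini_remember(mini_cache_t *mc, unit_t u)
{
    mc->e[MC_LAST] = u;
}
```

Pinning three slots to the stack, heap, and executable reflects the idea that most references fall into a handful of regions, so a tiny fixed table catches many of them before any hashing is needed.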

4. Reference Uni-Cache
Software uni-cache per instr per thread
– Last reference unit tag
– Last translation displacement
If no match, memoization mini-cache check
– 99.93% hit ratio
Example instructions:
    ADD $1, (%RAX)
    MOV %RBX, 48(%RAX)
    PUSH %RAX
    ADD 40(%RAX), %RBX
[Figure: per-instruction reference uni-cache in front of the memoization mini-cache, hashtable, thread-local translation cache, and global translation table]
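As a rough C illustration of the tag-and-displacement check, assuming 64 KB units: `demo_slow_lookup` is a hypothetical stand-in for the mini-cache and hashtable path (it returns a fixed displacement and counts calls), and the other names are likewise invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT  16            /* assumed unit size: 64 KB */
#define UNI_INVALID UINT64_MAX    /* tag value that never matches a unit */

/* One of these per instrumented instruction, per thread. */
typedef struct { uint64_t tag; int64_t disp; } uni_cache_t;

static unsigned slow_calls;       /* counts trips down the slow path */

/* Hypothetical stand-in for the mini-cache / hashtable lookup. */
static int64_t demo_slow_lookup(uint64_t unit)
{
    (void)unit;
    slow_calls++;
    return INT64_C(0x100000000);
}

static uint64_t uni_translate(uni_cache_t *uc, uint64_t addr)
{
    assert(uc != NULL);
    uint64_t unit = addr >> UNIT_SHIFT;
    if (uc->tag != unit) {                 /* miss: refill from slow path */
        uc->disp = demo_slow_lookup(unit);
        uc->tag = unit;
    }
    return addr + (uint64_t)uc->disp;      /* hit: one compare, one add */
}
```

Starting from `{ UNI_INVALID, 0 }`, repeated references to the same unit take the slow path once and then hit, which matches the intuition behind the very high hit ratio: a given instruction tends to touch the same region each time it runs.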

5. Context Switch Reduction
Register liveness analysis
– Use dead register
– Avoid flags save/restore
#/#Instr            SPEC2006
Memory Reference    41.79%
Eflag Steal          2.55%
Register Steal       8.20%
[Figure: translation lookup hierarchy with the example instructions, annotated “~100”]

6. Reference Grouping
One reference cache for multiple references
– Stack local variables
– Different members of the same object
#/#Instr                SPEC2006
Memory Reference        41.79%
Ref Uni-Cache Checks    22.76%
[Figure: translation lookup hierarchy with the example instructions, annotated “~100”]

3-stage Code Layout
Inline stub (<10 instructions)
– Quick inline check code with minimal context switch
Lean procedure (~50 instructions)
– Simple assembly procedure with partial context switch
Callout (C function)
– C function with complete context switch
[Figure: app instruction flowing through inline stub, lean procedure, and callout; lookup order: uni-cache check, memoization check, hashtable lookup, local cache lookup, then c_function() { /* global table lookup */ }]


Client API
Event Hook              Description
client_init             Process initialization
client_exit             Process exit
client_thread_init      Thread initialization
client_thread_exit      Thread exit
shadow_memory_create    Shadow memory creation
shadow_memory_delete    Shadow memory deletion
instrument_update       Insert meta-data update code

Umbra Client: Shared Memory Detection
Meta-data maintains a bitmap recording which threads access the associated memory.
    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info,
                      mem_ref_t *ref, instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map → [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }
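The effect of the inserted `lock or` can be modelled in plain C11, outside of any instrumentation framework. This is a sketch of the meta-data update only, not of the Umbra client: the helpers `tid_bit`, `record_access`, and `is_shared` are hypothetical, and a 32-bit shadow word limits the model to 32 thread IDs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mapping from a small thread index to its bit in the map. */
static uint32_t tid_bit(unsigned tid)
{
    assert(tid < 32);
    return UINT32_C(1) << tid;
}

/* Equivalent of "lock or [shadow], tid_map": atomically set this
 * thread's bit in the shadow word of the accessed memory. */
static void record_access(_Atomic uint32_t *shadow, uint32_t tid_map)
{
    atomic_fetch_or(shadow, tid_map);
}

/* Memory is shared once more than one bit is set in its shadow word. */
static int is_shared(_Atomic uint32_t *shadow)
{
    uint32_t m = atomic_load(shadow);
    return (m & (m - 1)) != 0;
}
```

The atomic OR matters because several threads may update the same shadow word concurrently; a plain read-modify-write could lose one thread's bit.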


Performance Evaluation
[Figure: slowdown relative to native execution]

EMS64: Efficient Memory Shadowing for 64-bit
Translation
– Reference uni-cache hit rate: 99.93%
– Still need a costly check to catch the 0.07%
    Reg steal; save flags; compare & jump; restore
EMS64 (ISMM’10)
– Speculatively use a disp without check
– Notified by memory access violation fault for incorrect disp

EMS64 Preliminary Result
[Figure: slowdown relative to native execution]

Thanks
Download –
Q & A