Umbra: Efficient and Scalable Memory Shadowing
Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT)
CGO 2010, Toronto, Canada, April 26, 2010
Shadow Memory
–Meta-data
  –Track properties of application memory
–Synchronized update
  –Application data and meta-data
[Figure: application memory (a.out, heap, libc, stack) and the corresponding shadow memory]
Examples
–Memory error detection
  –MemCheck [VEE’07]
  –Purify [USENIX’92]
  –Dr. Memory
  –MemTracker [HPCA’07]
–Dynamic information flow tracking
  –LIFT [MICRO’39]
  –TaintTrace [ISCC’06]
–Multi-threaded debugging
  –Eraser [TCS’97]
  –Helgrind
–Others
  –Redux [TCS’03]
  –Software Watchpoint [CC’08]
Issues
–Performance
  –Runtime overhead, for example MemCheck: 25x [VEE’07]
–Scalability
  –64-bit architecture
–Dependence
  –OS
  –Hardware
–Development
  –Implemented with specific analysis
  –Lack of a general framework
Memory Shadowing System
–Dynamic instrumentation
  –Context switch (application ↔ shadow)
  –Address calculation
  –Updating meta-data
–Memory management
  –Memory allocation / free: monitor application memory management, manage shadow memory
  –Mapping translation scheme (addr_A -> addr_S)
    –DMS: Direct Mapping Scheme
    –SMS: Segmented Mapping Scheme
Direct Mapping Scheme (DMS)
–Single memory region for the entire address space
–Translation:
    lea [addr] -> %r1
    add %r1, disp -> %r1
–Issue: address conflict between mem_A and mem_S
[Figure: application and shadow memory layout; chart of slowdown relative to native execution]
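The two-instruction translation above amounts to adding one constant displacement to every application address. A minimal C sketch, assuming a hypothetical displacement `DMS_DISP` (the slides give no concrete value) and assuming the application is meant to live below that displacement:

```c
#include <stdint.h>

/* Hypothetical displacement between application and shadow memory;
 * the slides do not give a concrete value. */
#define DMS_DISP UINT64_C(0x200000000000)

/* DMS translation: addr_S = addr_A + disp (mirrors the lea/add pair). */
static uint64_t dms_translate(uint64_t app_addr)
{
    return app_addr + DMS_DISP;
}

/* Under the assumption that application memory occupies [0, DMS_DISP),
 * its shadow occupies [DMS_DISP, 2 * DMS_DISP). An application address
 * inside that range collides with shadow memory. */
static int dms_conflicts(uint64_t app_addr)
{
    return app_addr >= DMS_DISP && app_addr < 2 * DMS_DISP;
}
```

The translation is cheap, but `dms_conflicts` shows why the scheme needs a layout in which the application never touches the shadow range.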
Segmented Mapping Scheme (SMS)
–Shadow segment per application segment
–Translation:
  –Segment lookup (address indexing)
  –Address translation
    lea [addr] -> %r1
    mov %r1 -> %r2
    shr %r2, 16 -> %r2
    add %r1, disp[%r2] -> %r1
[Figure: segment table mapping addr_A to addr_S (App 1 -> Shd 1, App 2 -> Shd 2); chart of slowdown relative to native execution]
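The four-instruction sequence above indexes a table with the high bits of the address and adds the displacement stored there. A minimal C sketch, assuming a 32-bit address space with 64 KB segments (taken from the `shr 16`); the names `sms_disp`, `sms_map_segment`, and `sms_translate` are illustrative and not from the slides:

```c
#include <stdint.h>

#define SMS_UNIT_BITS 16                         /* 64 KB segments, as in shr 16 */
#define SMS_ENTRIES   (1u << (32 - SMS_UNIT_BITS))

/* One displacement per 64 KB segment of an assumed 32-bit address space. */
static uint32_t sms_disp[SMS_ENTRIES];

/* Register a shadow segment for the application segment containing app_base.
 * The displacement is stored modulo 2^32, so it may "point backwards". */
static void sms_map_segment(uint32_t app_base, uint32_t shadow_base)
{
    sms_disp[app_base >> SMS_UNIT_BITS] = shadow_base - app_base;
}

/* SMS translation: index the table with the high bits, then add the disp. */
static uint32_t sms_translate(uint32_t app_addr)
{
    return app_addr + sms_disp[app_addr >> SMS_UNIT_BITS];
}
```

With 16-bit segment indexing this table has 65,536 entries for a 32-bit address space, which is what makes a single-level table practical there.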
Umbra
–Mapping scheme
  –Segmented mapping
  –Scale with actual memory usage
–Implementation
  –DynamoRIO
–Optimization
  –Translation optimization
  –Instrumentation optimization
–Client API
–Experimental results
  –Performance evaluation
  –Statistics collection
Shadow Memory Mapping
–Scaling to 64-bit architecture
  –DMS: infeasible due to memory layout
[Figure: 64-bit address space layout showing user space (a.out, stack), unusable space, kernel space, and vsyscall]
Shadow Memory Mapping
–Scaling to 64-bit architecture
  –DMS: infeasible due to memory layout
  –Single-level SMS: too big (~4 billion entries)
[Figure: single-level table indexed by addr_A]
Shadow Memory Mapping
–Scaling to 64-bit architecture
  –DMS: infeasible due to memory layout
  –Single-level SMS: too big (~4 billion entries)
  –Multi-level SMS: even more expensive
    –Fast path on lower 32 GB (MemCheck)
[Figure: multi-level table lookup for addr_A; chart of slowdown relative to native execution]
Shadow Memory Mapping
–Scaling to 64-bit architecture
  –DMS is infeasible
  –Single-level SMS is too sparse
  –Multi-level SMS is too expensive
–Umbra solution
  –Eliminate empty entries
  –Compact table
  –Walk the table to find the entry
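The Umbra solution can be pictured as a small array that holds only the address ranges actually in use, searched linearly. The following C sketch is an illustration under assumptions: the `map_entry_t` layout and the function names are hypothetical, and the slides do not describe the real table format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one per application memory unit in use. */
typedef struct {
    uint64_t app_base;  /* first application address covered */
    uint64_t app_end;   /* one past the last address covered */
    int64_t  disp;      /* shadow address = application address + disp */
} map_entry_t;

/* Walk the compact table; return the matching entry or NULL. */
static const map_entry_t *table_walk(const map_entry_t *table, size_t count,
                                     uint64_t app_addr)
{
    for (size_t i = 0; i < count; i++) {
        if (app_addr >= table[i].app_base && app_addr < table[i].app_end)
            return &table[i];
    }
    return NULL;
}

/* Translate app_addr; returns 0 on success, -1 if no entry covers it. */
static int umbra_translate(const map_entry_t *table, size_t count,
                           uint64_t app_addr, uint64_t *shadow_addr)
{
    const map_entry_t *e = table_walk(table, count, app_addr);
    if (e == NULL)
        return -1;
    *shadow_addr = app_addr + (uint64_t)e->disp;
    return 0;
}
```

The table size tracks the number of mapped regions rather than the size of the address space, at the cost of a walk on each lookup.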
Umbra
–Mapping scheme √
  –Segmented mapping
  –Scale with actual memory usage
–Implementation
  –DynamoRIO
–Optimization
  –Translation optimization
  –Instrumentation optimization
–Client API
–Experimental results
  –Performance evaluation
  –Statistics collection
Implementation
–Memory manager
  –Monitor and control application memory allocation (brk, mmap, munmap, mremap)
  –Allocate shadow memory
  –Maintain translation table
–Instrumenter
  –Instrument every memory reference: context save, address calculation, address translation, shadow memory update, context restore
[Figure: App 1 -> Shd 1, App 2 -> Shd 2]
Umbra
–Mapping scheme √
  –Segmented mapping
  –Scale with actual memory usage
–Implementation √
  –DynamoRIO
–Optimization
  –Translation optimization
  –Instrumentation optimization
–Client API
–Experimental results
  –Performance evaluation
  –Statistics collection
Unoptimized System
–Small overhead from DynamoRIO
–Slower than SMS-64
  –Need to walk the global translation table
–Why so slow?
  –41.79% of instructions are memory references
  –For each of these instructions: full context switch, table lookup, call-out instrumentation
[Figure: slowdown chart with one bar annotated "~100"; global translation table]
Optimization
–Translation optimization
  –Thread-local translation cache
  –Hashtable lookup
  –Memoization mini-cache
  –Reference uni-cache
–Instrumentation optimization
  –Context switch reduction
  –Reference grouping
  –3-stage code layout
[Figure: slowdown chart with one bar annotated "~100"; global translation table]
1. Thread-Local Translation Cache
–Local translation table per thread
  –Synchronize with the global translation table when necessary
  –Avoid lock contention
  –Walk the table to find a matching entry
–Walk the global table if no match is found in the thread-local cache
–Inlined instrumentation
[Figure: Thread 1 and Thread 2, each with a thread-local translation cache backed by the global translation table]
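A rough C sketch of the two-level lookup described on this slide: each thread searches its own copy first and touches the shared table only on a miss. All names here are hypothetical, the copy-on-miss policy is an assumption, and the locking that a real implementation would need around the global table is reduced to a comment.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t base, end; int64_t disp; } map_entry_t;

#define MAX_ENTRIES 64

/* Global table shared by all threads (a real system would guard it with a lock). */
static map_entry_t global_table[MAX_ENTRIES];
static size_t global_count;

/* Per-thread copy of the entries this thread has already used. */
static _Thread_local map_entry_t local_cache[MAX_ENTRIES];
static _Thread_local size_t local_count;

static const map_entry_t *walk(const map_entry_t *t, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= t[i].base && addr < t[i].end)
            return &t[i];
    return NULL;
}

/* Search the thread-local cache first; on a miss, walk the global table
 * and copy the entry into the local cache (assumed policy). */
static const map_entry_t *lookup(uint64_t addr)
{
    const map_entry_t *e = walk(local_cache, local_count, addr);
    if (e != NULL)
        return e;                                  /* fast path, no lock */
    e = walk(global_table, global_count, addr);    /* slow path */
    if (e != NULL && local_count < MAX_ENTRIES) {
        local_cache[local_count] = *e;
        e = &local_cache[local_count++];
    }
    return e;
}
```

Because the common case reads only thread-private data, threads do not contend on a lock for most translations.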
2. Hashtable Lookup
–Hashtable per thread
–Fixed number of slots
–Hash(addr_A) -> entry in thread-local cache
  –If match, found
  –If no match, walk the local cache
[Figure: Thread 1 and Thread 2, each with a hashtable in front of its thread-local translation cache; global translation table]
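One way to read this slide in C: a fixed-size array of pointers into the thread-local cache, indexed by a hash of the address. The slot count, the hash function, the 64 KB unit size, and the step that refills a slot after a walk are all assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t base, end; int64_t disp; } map_entry_t;

#define UNIT_BITS  16    /* assumed 64 KB unit granularity */
#define HASH_SLOTS 256   /* fixed number of slots (value is an assumption) */

typedef struct {
    map_entry_t entries[32];               /* thread-local translation cache */
    size_t count;
    const map_entry_t *slots[HASH_SLOTS];  /* hashtable pointing into entries[] */
} thread_cache_t;

static size_t hash_addr(uint64_t addr)
{
    return (size_t)((addr >> UNIT_BITS) % HASH_SLOTS);
}

static const map_entry_t *hash_lookup(thread_cache_t *tc, uint64_t addr)
{
    size_t h = hash_addr(addr);
    const map_entry_t *e = tc->slots[h];
    if (e != NULL && addr >= e->base && addr < e->end)
        return e;                               /* hash hit */
    for (size_t i = 0; i < tc->count; i++) {    /* miss: walk the local cache */
        e = &tc->entries[i];
        if (addr >= e->base && addr < e->end) {
            tc->slots[h] = e;                   /* assumed: refill the slot */
            return e;
        }
    }
    return NULL;
}
```

A hit costs one index computation and one range check instead of a walk over every cached entry.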
3. Memoization Mini-Cache
–Four-entry table per thread
  –Stack
  –Heap
  –Application (a.out)
  –Units found in last table lookup
–If no match, hashtable lookup
–Hit ratio: 68.93%
[Figure: Thread 1 and Thread 2, each with a memoization mini-cache in front of the hashtable and thread-local translation cache; global translation table]
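The mini-cache can be sketched as four fixed slots that are checked before the hashtable. The slot roles follow the slide; the struct layout and the function names are hypothetical.

```c
#include <stdint.h>

typedef struct { uint64_t base, end; int64_t disp; } map_entry_t;

/* Slot roles from the slide: stack, heap, application image, last lookup. */
enum { MINI_STACK, MINI_HEAP, MINI_APP, MINI_LAST, MINI_SIZE };

typedef struct { map_entry_t slot[MINI_SIZE]; } mini_cache_t;

/* Returns 1 and sets *shadow on a hit; 0 means fall back to the hashtable.
 * An all-zero slot covers the empty range [0, 0) and never matches. */
static int mini_cache_lookup(const mini_cache_t *mc, uint64_t addr,
                             uint64_t *shadow)
{
    for (int i = 0; i < MINI_SIZE; i++) {
        const map_entry_t *e = &mc->slot[i];
        if (addr >= e->base && addr < e->end) {
            *shadow = addr + (uint64_t)e->disp;
            return 1;
        }
    }
    return 0;
}

/* After a slower lookup succeeds, remember its result in the fourth slot. */
static void mini_cache_remember(mini_cache_t *mc, const map_entry_t *found)
{
    mc->slot[MINI_LAST] = *found;
}
```

Four comparisons are cheap enough to run before any hashing, and the stack, heap, and image regions account for most references.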
4. Reference Uni-Cache
–Software uni-cache per instruction per thread
  –Last reference unit tag
  –Last translation displacement
–If no match, memoization mini-cache check
–Hit ratio: 99.93%
[Figure: lookup chain per thread (reference uni-cache, memoization mini-cache, hashtable, thread-local translation cache, global translation table); sample code: ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX]
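A uni-cache holds exactly one remembered translation per instruction: the tag of the last unit it referenced and the displacement that worked. The C sketch below assumes a 64 KB unit (tag = address >> 16) and adds a hypothetical `valid` flag for the empty state; neither detail is stated on the slides.

```c
#include <stdint.h>

#define UNIT_BITS 16   /* assumed unit granularity */

/* One of these per instrumented instruction, per thread. */
typedef struct {
    uint64_t tag;    /* unit tag of the last address this instruction referenced */
    int64_t  disp;   /* displacement that translated it */
    int      valid;  /* hypothetical flag: 0 until the first fill */
} uni_cache_t;

/* Inline check: returns 1 and sets *shadow on a hit, 0 on a miss. */
static int uni_cache_check(const uni_cache_t *uc, uint64_t addr,
                           uint64_t *shadow)
{
    if (uc->valid && uc->tag == (addr >> UNIT_BITS)) {
        *shadow = addr + (uint64_t)uc->disp;
        return 1;
    }
    return 0;
}

/* Called after a slower lookup resolves the translation. */
static void uni_cache_update(uni_cache_t *uc, uint64_t addr, int64_t disp)
{
    uc->tag = addr >> UNIT_BITS;
    uc->disp = disp;
    uc->valid = 1;
}
```

The check is a shift, a compare, and an add, which is why it is a natural candidate for inlining next to the application instruction.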
5. Context Switch Reduction
–Register liveness analysis
  –Use dead registers
  –Avoid flags save/restore
#/#Instr            SPEC2006
Memory Reference    41.79%
Eflag Steal         2.55%
Register Steal      8.20%
[Figure: lookup chain per thread; sample code: ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX; slowdown chart with one bar annotated "~100"]
6. Reference Grouping
–One reference cache for multiple references
  –Stack local variables
  –Different members of the same object
#/#Instr                SPEC2006
Memory Reference        41.79%
Ref Uni-Cache Checks    22.76%
[Figure: lookup chain per thread; sample code: ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX; slowdown chart with one bar annotated "~100"]
7. 3-Stage Code Layout
–Inline stub (<10 instructions)
  –Quick inline check code with minimal context switch
–Lean procedure (~50 instructions)
  –Simple assembly procedure with partial context switch
–Callout (C function)
  –C function with complete context switch
[Figure: app instruction, inline stub, lean procedure, callout; lookup order: uni-cache check, memoization check, hashtable lookup, local cache lookup, then c_function() { // global table lookup ... }]
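The layering can be sketched as a fast path that falls through to progressively more expensive code. In this C illustration the assignment of checks to stages is an assumption (the slide does not say where each boundary lies), `callout_global_lookup` is a stand-in that returns a fixed hypothetical displacement, and a single `local` line stands in for all thread-local structures.

```c
#include <stdint.h>

#define UNIT_BITS 16   /* assumed unit granularity */

typedef struct { uint64_t tag; int64_t disp; int valid; } cache_line_t;

typedef struct {
    cache_line_t uni;    /* checked by the inline stub */
    cache_line_t local;  /* stands in for the thread-local structures */
} thread_state_t;

/* Stage 3 stand-in for the C callout that walks the global table.
 * Hypothetical rule: every unit maps at one fixed displacement. */
static int64_t callout_global_lookup(uint64_t addr)
{
    (void)addr;
    return 0x100000000;
}

static int hit(const cache_line_t *c, uint64_t addr)
{
    return c->valid && c->tag == (addr >> UNIT_BITS);
}

static void fill(cache_line_t *c, uint64_t addr, int64_t disp)
{
    c->tag = addr >> UNIT_BITS;
    c->disp = disp;
    c->valid = 1;
}

/* Returns the stage (1, 2, or 3) that resolved the translation. */
static int translate(thread_state_t *ts, uint64_t addr, uint64_t *shadow)
{
    int stage = 1;
    if (!hit(&ts->uni, addr)) {               /* stage 1: inline stub */
        stage = 2;
        if (!hit(&ts->local, addr)) {         /* stage 2: lean procedure */
            stage = 3;                        /* stage 3: callout */
            fill(&ts->local, addr, callout_global_lookup(addr));
        }
        fill(&ts->uni, addr, ts->local.disp);
    }
    *shadow = addr + (uint64_t)ts->uni.disp;
    return stage;
}
```

Each stage refills the one before it, so a repeated reference to the same unit resolves in stage 1.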
Umbra
–Mapping scheme √
  –Segmented mapping
  –Scale with actual memory usage
–Implementation √
  –DynamoRIO
–Optimization √
  –Translation optimization
  –Instrumentation optimization
–Client API
–Experimental results
  –Performance evaluation
  –Statistics collection
Client API
Event Hook              Description
client_init             Process initialization
client_exit             Process exit
client_thread_init      Thread initialization
client_thread_exit      Thread exit
shadow_memory_create    Shadow memory creation
shadow_memory_delete    Shadow memory deletion
instrument_update       Insert meta-data update code
Umbra Client: Shared Memory Detection
–Meta-data maintains a bitmap recording which threads access the associated memory
    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info, mem_ref_t *ref,
                      instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map -> [%r1] */
        /* destination: the shadow word addressed by the translated register */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        /* source: this thread's bit in the thread-ID bitmap */
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);    /* make the update atomic across threads */
        instrlist_meta_preinsert(ilist, label, instr);
    }
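The effect of the inserted `lock or` can be illustrated in plain C with C11 atomics. This is a stand-alone sketch, not the client's code: the 32-bit shadow word, the bit assignment in `tid_bit`, and the `is_shared` test are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One 32-bit shadow word per monitored application unit (size is an assumption). */
typedef _Atomic uint32_t shadow_word_t;

/* Bit assigned to a thread; this sketch distinguishes at most 32 threads. */
static uint32_t tid_bit(unsigned thread_index)
{
    return UINT32_C(1) << (thread_index % 32);
}

/* Equivalent of `lock or [shadow], tid_map`: atomically set this thread's bit. */
static void record_access(shadow_word_t *shadow, unsigned thread_index)
{
    atomic_fetch_or(shadow, tid_bit(thread_index));
}

/* Memory counts as shared once more than one thread bit is set. */
static int is_shared(shadow_word_t *shadow)
{
    uint32_t bits = atomic_load(shadow);
    return (bits & (bits - 1)) != 0;
}
```

The atomic OR is what lets several threads update the same shadow word without losing each other's bits.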
Umbra
–Mapping scheme √
  –Segmented mapping
  –Scale with actual memory usage
–Implementation √
  –DynamoRIO
–Optimization √
  –Translation optimization
  –Instrumentation optimization
–Client API √
–Experimental results
  –Performance evaluation
  –Statistics collection
Performance Evaluation
[Figure: slowdown relative to native execution]
EMS64: Efficient Memory Shadowing for 64-bit
–Translation
  –Reference uni-cache hit rate: 99.93%
  –Still need a costly check to catch the remaining 0.07%: register steal; save flags; compare & jump; restore
–EMS64 (ISMM’10)
  –Speculatively use a disp without a check
  –Notified by a memory access violation fault when the disp is incorrect
EMS64 Preliminary Result
[Figure: slowdown relative to native execution]
Thanks
Download –
Q & A