Qin Zhao (MIT) Derek Bruening (VMware) Saman Amarasinghe (MIT) Umbra: Efficient and Scalable Memory Shadowing CGO 2010, Toronto, Canada April 26, 2010.

Presentation transcript:

1 Qin Zhao (MIT) Derek Bruening (VMware) Saman Amarasinghe (MIT) Umbra: Efficient and Scalable Memory Shadowing CGO 2010, Toronto, Canada April 26, 2010

2 Shadow Memory
Meta-data
  – Track properties of application memory
Synchronized Update
  – Application data and meta-data
[Figure: application memory (a.out, heap, libc, stack) alongside its shadow memory]

3 Examples
Memory Error Detection
  – MemCheck [VEE’07]
  – Purify [USENIX’92]
  – Dr. Memory
  – MemTracker [HPCA’07]
Dynamic Information Flow Tracking
  – LIFT [MICRO’39]
  – TaintTrace [ISCC’06]
Multi-threaded Debugging
  – Eraser [TCS’97]
  – Helgrind
Others
  – Redux [TCS’03]
  – Software Watchpoint [CC’08]

4 Issues
Performance
  – Runtime overhead
      Example: MemCheck 25x [VEE’07]
Scalability
  – 64-bit architecture
Dependence
  – OS
  – Hardware
Development
  – Implemented with specific analysis
  – Lack of a general framework

5 Memory Shadowing System
Dynamic Instrumentation
  – Context switch (application ↔ shadow)
  – Address calculation
  – Updating meta-data
Memory Management
  – Memory allocation / free
      Monitor application memory management
      Manage shadow memory
  – Mapping translation scheme (addr_A → addr_S)
      DMS: Direct Mapping Scheme
      SMS: Segmented Mapping Scheme

6 Direct Mapping Scheme (DMS)
Single memory region for the entire address space.
Translation:
    lea [addr] → %r1
    add %r1, disp → %r1
Issue: address conflict between mem_A and mem_S
[Figure: application and shadow regions; chart of slowdown relative to native execution]
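The translation and the conflict noted on this slide can be sketched in C. This is only an illustration: the function names are hypothetical, the displacement is a caller-supplied parameter (the slides give no value), and the conflict check is one plausible reading of "address conflict", namely the shadow of one region landing on another application region.

```c
#include <assert.h>
#include <stdint.h>

/* Direct mapping: shadow address = application address + one fixed
 * displacement, mirroring "lea [addr] -> %r1; add %r1, disp -> %r1". */
static uintptr_t dms_translate(uintptr_t app_addr, uintptr_t disp) {
    return app_addr + disp;
}

/* Returns 1 when the shadow of [app_start, app_end) overlaps another
 * application region [other_start, other_end), i.e. an address conflict. */
static int dms_conflict(uintptr_t app_start, uintptr_t app_end,
                        uintptr_t other_start, uintptr_t other_end,
                        uintptr_t disp) {
    assert(app_start < app_end && other_start < other_end);
    uintptr_t shd_start = dms_translate(app_start, disp);
    uintptr_t shd_end = dms_translate(app_end, disp);
    return shd_start < other_end && other_start < shd_end;
}
```

With a single displacement, every application region must find its shadow free at the same offset, which is why the scheme depends on the memory layout.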

7 Segmented Mapping Scheme (SMS)
Shadow segment per application segment
Translation:
  – Segment lookup (address indexing)
  – Address translation
    lea [addr] → %r1
    mov %r1 → %r2
    shr %r2, 16 → %r2
    add %r1, disp[%r2] → %r1
[Figure: segment table mapping App 1 and App 2 (addr_A) to Shd 1 and Shd 2 (addr_S); chart of slowdown relative to native execution]
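The four-instruction sequence above corresponds roughly to the following C sketch. It assumes a 32-bit address space and 64 KB segments (taken from the shift by 16); the table layout and the names sms_set_segment and sms_translate are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_SHIFT 16                 /* from "shr %r2, 16": 64 KB segments */
#define NUM_SEGS  ((size_t)1 << 16)  /* 2^32 / 2^16 slots for 32-bit addresses */

/* One displacement per segment; unsigned so that "negative" displacements
 * work through wraparound arithmetic. */
static uint32_t seg_disp[NUM_SEGS];

static void sms_set_segment(size_t idx, uint32_t disp) {
    assert(idx < NUM_SEGS);
    seg_disp[idx] = disp;
}

static uint32_t sms_translate(uint32_t app_addr) {
    uint32_t idx = app_addr >> SEG_SHIFT;   /* mov + shr: segment lookup */
    return app_addr + seg_disp[idx];        /* add disp[idx]: translation */
}
```

Because each segment carries its own displacement, a shadow segment can be placed wherever free space exists, at the cost of one extra memory load per translation.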

8 Umbra
Mapping Scheme
  – Segmented mapping
  – Scale with actual memory usage
Implementation
  – DynamoRIO
Optimization
  – Translation optimization
  – Instrumentation optimization
Client API
Experimental Results
  – Performance evaluation
  – Statistics collection

9 Shadow Memory Mapping
Scaling to 64-bit Architecture
  – DMS
      Infeasible due to memory layout
[Figure: 64-bit address space layout with labels a.out, stack, user space, unusable space, kernel space, vsyscall, 2^47, 2^64]

10 Shadow Memory Mapping
Scaling to 64-bit Architecture
  – DMS
      Infeasible due to memory layout
  – Single-Level SMS
      Too big (~4 billion entries)
[Figure: addr_A indexing a single-level table]
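One way to arrive at the "~4 billion entries" figure, assuming 48 usable address bits and the 64 KB segments implied by the shift of 16 on the SMS slide (both are assumptions; the slide does not state them), is 2^48 / 2^16 = 2^32. A small C helper with a hypothetical name makes the arithmetic explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries in a flat, single-level segment table covering
 * addr_bits of address space with segments of 2^seg_bits bytes. */
static uint64_t sms_table_entries(unsigned addr_bits, unsigned seg_bits) {
    assert(seg_bits <= addr_bits && addr_bits - seg_bits < 64);
    return (uint64_t)1 << (addr_bits - seg_bits);
}
```

Under the same assumptions a 32-bit address space needs only 2^16 entries, which is why the flat table is practical there and not on 64-bit.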

11 Shadow Memory Mapping
Scaling to 64-bit Architecture
  – DMS
      Infeasible due to memory layout
  – Single-Level SMS
      Too big (~4 billion entries)
  – Multi-Level SMS
      Even more expensive
      Fast path on lower 32G (MemCheck)
[Figure: addr_A indexing a multi-level table; chart of slowdown relative to native execution]

12 Shadow Memory Mapping
Scaling to 64-bit Architecture
  – DMS is infeasible
  – Single-Level SMS is too sparse
  – Multi-Level SMS is too expensive
Umbra Solution
  – Eliminate empty entries
  – Compact table
  – Walk the table to find the entry
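A compact table that holds only mapped regions and is searched by walking could look like the C sketch below. The entry layout (base, end, displacement), the fixed capacity, and the function names are assumptions made for illustration and are not taken from the Umbra source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t base;   /* first application address of the region */
    uintptr_t end;    /* one past the last application address */
    uintptr_t disp;   /* shadow address = application address + disp */
} map_entry_t;

#define MAX_ENTRIES 32
static map_entry_t table[MAX_ENTRIES];
static size_t table_count;

/* Append an entry for a newly mapped region; returns 0 if the table is full. */
static int table_add(uintptr_t base, uintptr_t end, uintptr_t disp) {
    assert(base < end);
    if (table_count == MAX_ENTRIES) return 0;
    table[table_count++] = (map_entry_t){ base, end, disp };
    return 1;
}

/* Walk the table; returns the matching entry or NULL if addr is unmapped. */
static const map_entry_t *table_walk(uintptr_t addr) {
    for (size_t i = 0; i < table_count; i++)
        if (addr >= table[i].base && addr < table[i].end)
            return &table[i];
    return NULL;
}
```

The table grows with the number of mapped regions and not with the size of the address space, but each lookup is now a linear walk.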

13 Umbra
Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
Implementation
  – DynamoRIO
Optimization
  – Translation optimization
  – Instrumentation optimization
Client API
Experimental Results
  – Performance evaluation
  – Statistics collection

14 Implementation
Memory Manager
  – Monitor and control application memory allocation
      brk, mmap, munmap, mremap
  – Allocate shadow memory
  – Maintain translation table
Instrumenter
  – Instrument every memory reference
      Context save
      Address calculation
      Address translation
      Shadow memory update
      Context restore
[Figure: application regions App 1 and App 2 with shadow regions Shd 1 and Shd 2]

15 Umbra
Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
Implementation √
  – DynamoRIO
Optimization
  – Translation optimization
  – Instrumentation optimization
Client API
Experimental Results
  – Performance evaluation
  – Statistics collection

16 Unoptimized System
Small overhead from DynamoRIO
Slower than SMS-64
  – Need to walk the global translation table
Why so slow?
  – 41.79% of instructions are memory references
  – For each of these instructions
      Full context switch
      Table lookup
      Call-out instrumentation
[Figure: global translation table; chart annotated "~100"]

17 Optimization
Translation Optimization
  – Thread-local translation cache
  – Hashtable lookup
  – Memoization mini-cache
  – Reference uni-cache
Instrumentation Optimization
  – Context switch reduction
  – Reference grouping
  – 3-stage code layout
[Figure: global translation table; chart annotated "~100"]

18 1. Thread-Local Translation Cache
Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention
  – Walk table to find the matching entry
      Walk global table if not found in thread-local cache
Inlined instrumentation
[Figure: Thread 1 and Thread 2, each with a thread-local translation cache backed by the global translation table]
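The fallback from the thread-local cache to the global table might be sketched as follows in C11. The entry layout, capacities, and function names are hypothetical, and locking of the global table is only indicated in a comment since the slides do not describe the synchronization mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base, end, disp; } map_entry_t;

#define MAX_ENTRIES 64
static map_entry_t global_table[MAX_ENTRIES];
static size_t global_count;
static _Thread_local map_entry_t local_cache[MAX_ENTRIES];  /* one per thread */
static _Thread_local size_t local_count;

static const map_entry_t *walk(const map_entry_t *t, size_t n, uintptr_t a) {
    for (size_t i = 0; i < n; i++)
        if (a >= t[i].base && a < t[i].end) return &t[i];
    return NULL;
}

static void global_add(uintptr_t base, uintptr_t end, uintptr_t disp) {
    assert(base < end && global_count < MAX_ENTRIES);
    global_table[global_count++] = (map_entry_t){ base, end, disp };
}

/* Returns 1 and stores the shadow address, or 0 if addr is unmapped. */
static int translate(uintptr_t addr, uintptr_t *shadow) {
    const map_entry_t *e = walk(local_cache, local_count, addr);
    if (e == NULL) {
        /* Slow path: a real system would synchronize on the global table here. */
        e = walk(global_table, global_count, addr);
        if (e == NULL) return 0;
        if (local_count < MAX_ENTRIES) local_cache[local_count++] = *e;
    }
    *shadow = addr + e->disp;
    return 1;
}
```

After the first miss, later references to the same region are served from the thread's own copy, so the shared table is touched only when a thread sees a region for the first time.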

19 2. Hashtable Lookup
Hashtable per thread
Fixed number of slots
Hash(addr_A) → entry in thread-local cache
  – If match, found
  – If no match, walk the local cache
[Figure: Thread 1 and Thread 2, each with a hashtable in front of its thread-local translation cache and the global translation table; chart annotated "~100"]
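A minimal C sketch of the hashed lookup follows, shown for a single thread for brevity. The hash function (64 KB unit number modulo the slot count), the slot count, and all names are assumptions; the slide only states that the table has a fixed number of slots and that a mismatch falls back to walking the local cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base, end, disp; } map_entry_t;

#define MAX_ENTRIES 32
#define HASH_SLOTS  64   /* fixed number of slots (value is an assumption) */
#define UNIT_SHIFT  16   /* hash on the 64 KB unit number (assumption) */

static map_entry_t local_cache[MAX_ENTRIES];
static size_t local_count;
static const map_entry_t *hash_slot[HASH_SLOTS];
static unsigned walk_count;   /* how often the fallback walk ran */

static int contains(const map_entry_t *e, uintptr_t a) {
    return a >= e->base && a < e->end;
}

static void cache_add(uintptr_t base, uintptr_t end, uintptr_t disp) {
    assert(base < end && local_count < MAX_ENTRIES);
    local_cache[local_count++] = (map_entry_t){ base, end, disp };
}

static const map_entry_t *hash_lookup(uintptr_t addr) {
    size_t h = (size_t)(addr >> UNIT_SHIFT) % HASH_SLOTS;
    const map_entry_t *e = hash_slot[h];
    if (e != NULL && contains(e, addr)) return e;   /* match: found */
    walk_count++;                                   /* no match: walk the cache */
    for (size_t i = 0; i < local_count; i++) {
        if (contains(&local_cache[i], addr)) {
            hash_slot[h] = &local_cache[i];         /* remember for next time */
            return hash_slot[h];
        }
    }
    return NULL;
}
```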

20 3. Memoization Mini-Cache
Four-entry table per thread
  – Stack
  – Heap
  – Application (a.out)
  – Unit found in last table lookup
If no match, hashtable lookup
  – 68.93% hit ratio
[Figure: Thread 1 and Thread 2, each with a memoization mini-cache in front of the hashtable, thread-local translation cache, and global translation table; chart annotated "~100"]
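The four-entry table can be sketched in C as below. The slot names, the entry layout, and the stand-in slow_lookup (which here just walks a small backing table in place of the hashtable lookup) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base, end, disp; } map_entry_t;

enum { MC_STACK, MC_HEAP, MC_APP, MC_LAST, MC_SIZE };   /* four slots */

static map_entry_t mini_cache[MC_SIZE];   /* empty slots have base == end == 0 */
static map_entry_t backing[8];            /* stand-in for the slower tables */
static size_t backing_count;
static unsigned slow_lookups;

static int contains(const map_entry_t *e, uintptr_t a) {
    return a >= e->base && a < e->end;
}

static void mini_set(int slot, uintptr_t base, uintptr_t end, uintptr_t disp) {
    assert(slot >= 0 && slot < MC_SIZE && base < end);
    mini_cache[slot] = (map_entry_t){ base, end, disp };
}

static void backing_add(uintptr_t base, uintptr_t end, uintptr_t disp) {
    assert(base < end && backing_count < 8);
    backing[backing_count++] = (map_entry_t){ base, end, disp };
}

static const map_entry_t *slow_lookup(uintptr_t addr) {
    slow_lookups++;
    for (size_t i = 0; i < backing_count; i++)
        if (contains(&backing[i], addr)) return &backing[i];
    return NULL;
}

static const map_entry_t *mini_lookup(uintptr_t addr) {
    for (int i = 0; i < MC_SIZE; i++)
        if (contains(&mini_cache[i], addr)) return &mini_cache[i];
    const map_entry_t *e = slow_lookup(addr);    /* no match: slower lookup */
    if (e != NULL) mini_cache[MC_LAST] = *e;     /* memoize the unit just found */
    return e;
}
```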

21 4. Reference Uni-Cache
Software uni-cache per instruction per thread
  – Last reference unit tag
  – Last translation displacement
If no match, memoization mini-cache check
  – 99.93% hit ratio
[Figure: Thread 1 and Thread 2 with a reference uni-cache for each of the instructions ADD $1, (%RAX); MOV %RBX 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX, in front of the memoization mini-cache, hashtable, thread-local translation cache, and global translation table; chart annotated "~100"]
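A one-entry cache per instruction reduces the common case to a compare and an add, as in this C sketch. The tag is assumed to be the address shifted right by 16 (64 KB units), and slow_path_disp is a hypothetical stand-in for the mini-cache and hashtable path that returns a fixed displacement purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16   /* unit granularity is an assumption */

typedef struct {
    uintptr_t tag;    /* unit of the last reference made by this instruction */
    uintptr_t disp;   /* displacement used for that unit */
} uni_cache_t;

/* A shifted address can never equal UINTPTR_MAX, so this marks "empty". */
#define UNI_CACHE_EMPTY { UINTPTR_MAX, 0 }

static unsigned slow_calls;

/* Stand-in for the slower lookup path; always yields the same displacement. */
static uintptr_t slow_path_disp(uintptr_t addr) {
    (void)addr;
    slow_calls++;
    return 0x1000;
}

static uintptr_t uni_translate(uni_cache_t *uc, uintptr_t addr) {
    assert(uc != NULL);
    uintptr_t tag = addr >> UNIT_SHIFT;
    if (uc->tag != tag) {                 /* miss: refill from the slower path */
        uc->disp = slow_path_disp(addr);
        uc->tag = tag;
    }
    return addr + uc->disp;               /* hit: reuse the last displacement */
}
```

The high hit ratio reported on the slide is consistent with the observation that a given instruction tends to keep referencing the same region, such as the stack or one heap object.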

22 5. Context Switch Reduction
Register liveness analysis
  – Use dead register
  – Avoid flags save/restore

#/#Instr            SPEC2006
Memory Reference    41.79%
Eflag Steal         2.55%
Register Steal      8.20%

[Figure: Thread 1 and Thread 2 with reference uni-caches for the instructions ADD $1, (%RAX); MOV %RBX 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX, in front of the memoization mini-cache, hashtable, thread-local translation cache, and global translation table; chart annotated "~100"]

23 6. Reference Grouping
One reference cache for multiple references
  – Stack local variables
  – Different members of the same object

#/#Instr               SPEC2006
Memory Reference       41.79%
Ref Uni-Cache Checks   22.76%

[Figure: Thread 1 and Thread 2 with reference uni-caches for the instructions ADD $1, (%RAX); MOV %RBX 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX, in front of the memoization mini-cache, hashtable, thread-local translation cache, and global translation table; chart annotated "~100"]

24 3-stage Code Layout
Inline stub (<10 instructions)
  – Quick inline check code with minimal context switch
Lean procedure (~50 instructions)
  – Simple assembly procedure with partial context switch
Callout (C function)
  – C function with complete context switch
[Figure: an app instruction followed by the inline stub, lean procedure, and callout stages; labels: uni-cache check, memoization check, hashtable lookup, local cache lookup, c_function() { // global table lookup... }]

25 Umbra
Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
Implementation √
  – DynamoRIO
Optimization √
  – Translation optimization
  – Instrumentation optimization
Client API
Experimental Results
  – Performance evaluation
  – Statistics collection

26 Client API

Event Hooks             Description
client_init             Process initialization
client_exit             Process exit
client_thread_init      Thread initialization
client_thread_exit      Thread exit
shadow_memory_create    Shadow memory creation
shadow_memory_delete    Shadow memory deletion
instrument_update       Insert meta-data update code

27 Umbra Client: Shared Memory Detection
Meta-data maintains a bitmap recording which threads have accessed the associated memory

    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info,
                      mem_ref_t *ref, instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map → [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }
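The effect of the inserted "lock or" instruction on the meta-data can be modeled in plain C11 atomics. This is an illustration of the bitmap idea only, not Umbra client code: the one-bit-per-thread assignment, the 32-bit word (suggested by OPSZ_4 and INT32 above), and the function names are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically OR the accessing thread's bit into the shadow word,
 * the C-level equivalent of "lock or [shadow], tid_map". */
static void record_access(_Atomic uint32_t *shadow, unsigned thread_index) {
    assert(thread_index < 32);
    atomic_fetch_or(shadow, (uint32_t)1 << thread_index);
}

/* Memory is shared once more than one thread bit is set. */
static int is_shared(_Atomic uint32_t *shadow) {
    uint32_t bits = atomic_load(shadow);
    return (bits & (bits - 1)) != 0;
}
```

Repeated accesses by the same thread leave the word unchanged, so the update is idempotent and needs no prior read of the meta-data.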

28 Umbra
Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
Implementation √
  – DynamoRIO
Optimization √
  – Translation optimization
  – Instrumentation optimization
Client API √
Experimental Results
  – Performance evaluation
  – Statistics collection

29 Performance Evaluation
[Figure: slowdown relative to native execution]

30 EMS64: Efficient Memory Shadowing for 64-bit
Translation
  – Reference uni-cache hit rate: 99.93%
  – Still need a costly check to catch the 0.07%
      Reg steal; save flags; compare & jump; restore
EMS64 (ISMM’10)
  – Speculatively use a disp without check
  – Notified by memory access violation fault for incorrect disp

31 EMS64 Preliminary Result
[Figure: slowdown relative to native execution]

32 Thanks
Download
  – http://people.csail.mit.edu/qin_zhao/umbra/
Q & A

