Qin Zhao1, Joon Edward Sim2, Weng-Fai Wong1,2. 1 Singapore-MIT Alliance; 2 Department of Computer Science, National University of Singapore


DEP: Detailed Execution Profile. Qin Zhao1, Joon Edward Sim2, Weng-Fai Wong1,2 (1 Singapore-MIT Alliance; 2 Department of Computer Science, National University of Singapore). Larry Rudolph (Singapore-MIT Alliance; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology). Chine-Cheng Wu (PAS Lab, CSIE, NTU).

Introduction: Previous work on profiling requires large amounts of memory and incurs large slowdowns. DEP (Detailed Execution Profile) captures the complete dynamic control flow, data dependencies, and memory references at the same time, and the profile size is significantly reduced. DEP is collected with Adept (A Dynamic Execution Profiling Tool), an infrastructure that uses the DynamoRIO binary instrumentation framework.

DEP Advantages: DEP covers the complete program, including shared libraries. Multi-threaded applications can be profiled with independent DEPs. Collection is efficient, incurring a 5 times slowdown. The profile contains both memory reference and control flow information.

Control Flow Profile: DEPc. The traditional approach records each basic block entry using 4 bytes. DEP uses 2 bytes per entry, plus an extra 2 bytes when needed: the H-tag holds the high 2 bytes of the address and the L-tag holds the low 2 bytes. This compaction does not guarantee an optimal profile size.
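The H-tag/L-tag split can be illustrated with a minimal sketch. It assumes that an H-tag is emitted only when the high 2 bytes of the basic block address change, and it labels each record explicitly as "H" or "L"; how the real profile distinguishes the two in its byte stream is not described on the slide, so the function names and record layout here are hypothetical.

```python
def encode_bb_trace(addresses):
    """Split 32-bit basic block addresses into H-tag and L-tag records.

    Assumption: an H-tag record is emitted only when the high 16 bits
    differ from those of the previous address.
    """
    records = []
    current_high = None
    for addr in addresses:
        high, low = addr >> 16, addr & 0xFFFF
        if high != current_high:
            records.append(("H", high))  # the extra 2 bytes, only when needed
            current_high = high
        records.append(("L", low))       # always 2 bytes per entry
    return records


def decode_bb_trace(records):
    """Rebuild the full 32-bit addresses from H-tag and L-tag records."""
    addresses, high = [], None
    for kind, value in records:
        if kind == "H":
            high = value
        else:
            addresses.append((high << 16) | value)
    return addresses


def profile_bytes(records):
    """Every record in this sketch occupies 2 bytes."""
    return 2 * len(records)


trace = [0x0804FFA4, 0x0804FFB0, 0x08050010, 0x0804FFA4]
records = encode_bb_trace(trace)
print(profile_bytes(records), "bytes versus", 4 * len(trace), "bytes")
```

In this small trace the high half changes three times, so the saving over 4 bytes per entry is small. A trace that stays within one 64 KB region would approach 2 bytes per entry, while one that alternates regions on every block would save nothing.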

Memory References Profile: DEPm. A memory reference is the tuple {pc, addr, size, type}: the PC of the memory reference instruction, the address of the memory reference, the size of the data being accessed, and whether the access is a read or a write. Storing only the necessary values that

Memory Reference: The example code contains three memory references: push ebp; mov 0 -> [esp+4]; mov 0 -> [esp+8].
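As a rough illustration of the three references in this example, the sketch below builds the full {pc, addr, size, type} records from a single value of esp. The instruction addresses, the 4-byte operand size of the mov instructions, and the helper name are assumptions made for the example; they are not taken from the slides. The point is that all three addresses follow from one register value, which suggests why a profile may not need to store every address.

```python
from collections import namedtuple

# Full memory reference record as listed on the slide.
MemRef = namedtuple("MemRef", "pc addr size type")


def references_for_block(esp):
    """Derive the three memory references from the value of esp on entry.

    Hypothetical PCs: the block is assumed to start at 0x1000, with a
    1-byte push followed by two 8-byte mov instructions.
    """
    refs = []
    esp -= 4                                   # push decrements esp first
    refs.append(MemRef(0x1000, esp, 4, "write"))       # push ebp
    refs.append(MemRef(0x1001, esp + 4, 4, "write"))   # mov 0 -> [esp+4]
    refs.append(MemRef(0x1009, esp + 8, 4, "write"))   # mov 0 -> [esp+8]
    return refs


refs = references_for_block(0xBFFFF000)
for ref in refs:
    print(hex(ref.pc), hex(ref.addr), ref.size, ref.type)
```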

BB_pc+Mem_addr Compared to DEP: DEP triggers fewer analyzer calls than BB_pc+Mem_addr because its smaller profile data fills the buffer, and therefore signals the analyzer, less often. The instrumentation penalty includes stealing and restoring registers, address calculation, storage of the address, and updating the profile counter. DEP adds extra overhead for checking H-tag changes and for checking and updating register status.

DynamoRIO: Runs on IA-32 under both Linux and Windows. DynamoRIO executes an application by copying its user code into a code cache and then executing it from there. The cached code is the same as the original except that control transfer operations return to DynamoRIO. A trace cache holds code for indirect branch lookup.

ADEPT : A Dynamic Execution Profiling Tool

Control Flow : Obtaining DEPc If the L-tag is 0x0000

Memory References: Obtaining DEPm. Each register variable has two states: UPDATED and RECORDED.
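The slide names only the two states, so the following is a hedged sketch of one way such tracking could work, not Adept's actual logic. It assumes a register becomes UPDATED whenever it is written, and that its value is stored in the profile only the first time it is used for an address after a write, after which it becomes RECORDED. The class and method names are hypothetical.

```python
UPDATED, RECORDED = "UPDATED", "RECORDED"


class RegisterTracker:
    """Record a register's value only when it changed since the last record."""

    def __init__(self):
        self.state = {}    # register name -> UPDATED or RECORDED
        self.profile = []  # (register, value) pairs stored in the profile

    def on_write(self, reg):
        # Assumption: any write invalidates the previously recorded value.
        self.state[reg] = UPDATED

    def on_address_use(self, reg, value):
        # Unknown registers are treated as UPDATED so they get recorded once.
        if self.state.get(reg, UPDATED) == UPDATED:
            self.profile.append((reg, value))
            self.state[reg] = RECORDED


tracker = RegisterTracker()
tracker.on_address_use("esp", 0xBFFFEFFC)  # first use: recorded
tracker.on_address_use("esp", 0xBFFFEFFC)  # unchanged: skipped
tracker.on_write("esp")                    # register modified
tracker.on_address_use("esp", 0xBFFFEFF0)  # recorded again
print(tracker.profile)
```

Under these assumptions, repeated references through an unchanged register add nothing to the profile, at the cost of the status check on each use.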

Profile Buffer: Stores the collected profile for future analysis, with one buffer per thread. A larger buffer reduces the number of analyzer invocations. The buffer has two separate parts, one for DEPc and one for DEPm; a split of 20% for DEPc and 80% for DEPm works well. The analyzer is triggered when the buffer is full, using the OS signal for a page segmentation fault.
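The buffer layout can be sketched as follows. This is a simplification under stated assumptions: capacity is counted in records rather than bytes, and fullness is detected with an explicit check, whereas the slide says the real tool relies on a fault signal from the OS. The class name, parameters, and analyzer callback are hypothetical.

```python
class ProfileBuffer:
    """Per-thread buffer split into a DEPc part and a DEPm part."""

    def __init__(self, capacity, analyzer, control_percent=20):
        # 20% for DEPc and 80% for DEPm, as reported to work well.
        self.control_capacity = capacity * control_percent // 100
        self.memory_capacity = capacity - self.control_capacity
        self.analyzer = analyzer
        self.control, self.memory = [], []

    def _flush(self):
        # Hand both parts to the analyzer, then start with empty parts.
        self.analyzer(self.control, self.memory)
        self.control, self.memory = [], []

    def add_control(self, record):
        if len(self.control) >= self.control_capacity:
            self._flush()
        self.control.append(record)

    def add_memory(self, record):
        if len(self.memory) >= self.memory_capacity:
            self._flush()
        self.memory.append(record)


calls = []
buf = ProfileBuffer(10, lambda c, m: calls.append((len(c), len(m))))
for i in range(9):
    buf.add_memory(i)  # the ninth record overflows the 8-record DEPm part
print(calls)
```

Doubling the capacity in this sketch halves the number of analyzer calls for the same record stream, which matches the slide's point about large buffers.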

Optimizing DEPc Basic block 0x0804ffa4 branch to 0x

Optimizing DEP m Optimized

Evaluation: Platform: dual-core 3.2 GHz Intel Pentium D 840 with 2 GB of RAM. OS: Linux Fedora Core 4 and Windows XP SP2. Benchmarks: SPEC CPU2000 integer benchmarks on Linux; SysMark 2004 SE on Windows (running Access, PowerPoint, and Word). Compiler: gcc with the -O3 flag.

Execution Time

Relative slowdown

Profile Frameworks: Pin: counts the number of basic blocks executed and the number of memory references. Valgrind: Cachegrind, a cache profiler, is used to capture basic block counts and memory reference counts. eWPP (Extended Whole Program Paths): records control flow and dependence information using a two-phase profiling approach; the first phase identifies all memory dependences and the second phase is the collection phase.

Profile Size and Compressibility * CF_bit uses bits and 4-byte target addresses for indirect branches

Normalized by uncompressed BB_pc size. Normalized by uncompressed Mem_addr size. CF_bit does not compress well.

Related Work: Whole Execution Traces (WET): simulation environment. Extended Whole Program Paths (eWPP): encodes trace information in WPP. Whole Program Paths (WPP). These approaches have difficulty supporting multi-threaded applications.

Conclusion: DEP captures the major aspects of program execution: control flow and memory references. DEP is collected by Adept, which can perform on-line or off-line analysis. Adept builds the mapping between the collected information and the original application. Experimental results show a 5 times slowdown and a 40% space saving compared to traditional profiles. A complete trace that recovers the whole program execution is not always necessary; a particular segment can be reproduced for simulation or replay.

Back-up Slides

Recovering the memory reference trace: using a naive approach to recover the memory reference trace from a DEP

Recovering Memory References: Scenario 1: complete memory reference profile {pc, addr, size, type}. Scenario 2: DEP collected by Adept. Scenario 2 takes almost triple the native execution time. Tradeoff