

SimuBoost: Scalable Parallelization of Functional System Simulation
GI Fachgruppentreffen Betriebssysteme (BS) 2013
Marc Rittinghaus, Konrad Miller, Marius Hillenbrand, Frank Bellosa
System Architecture Group, Department of Computer Science
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Slide 2: Motivation
Operating system performance analysis:
- Application and kernel interaction
- Memory access patterns
- Cache efficiency
Measurements distort results
Approach: Functional System Simulation/Emulation
- Simulate physical machine at functional level (instructions)
- Monitor states/operations non-intrusively

Slide 3: Functional Simulation is Slow
Average slowdowns for kernel build, SPECint_base2006, and LAMMPS:
  Category        System  Slowdown
  Virtualization  KVM     ~1x
  Simulation      QEMU    ~100x
  Simulation      Simics  ~1000x
Example: Analyze memory duplication in kernel build
- Memory access patterns on shareable pages
- Operations that lead to breaking merged pages
- Our experience with Simics: 30 min -> 10 months
Problem: Not practical for long-running workloads

Slide 4: Accelerating Simulation: Sampling
Simulate representative samples and extrapolate (SimPoints [2])
There may be no representative intervals:
- Not all applications show phase behavior (gcc [3])
- Even less probable for a whole system (i.e., a mix of multiple applications)
Chicken-and-egg problem:
- How do you find representative intervals without analyzing first?

Slide 5: Accelerating Simulation: Parallel vCPUs
Simulate vCPUs in parallel (e.g., PQEMU [1])
- Scales with the number of vCPUs (e.g., 4x → still 2.5 months)
- Does not accelerate single-CPU simulation
Goal: Scale out single-core simulation

Slide 6: Basic Approach
(1) Split simulation into time intervals
(2) Simulate intervals simultaneously
- Scales with run-time of workload
- Applicable to single-CPU simulations
Problem: How do we bootstrap the simulation of i[2..n]?

Slide 7: SimuBoost
- Leverage fast virtualization
- Create checkpoints at interval boundaries
- Checkpoints bootstrap simulations: memory, device states, etc.
- Run simulations in parallel
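
The checkpoint idea can be illustrated with a toy model. The sketch below is not the SimuBoost implementation: `step`, `virtualize`, `simulate_interval`, and `simuboost` are hypothetical names, the "machine" is a single integer, and threads merely stand in for simulation nodes. It only shows that simulations started from checkpoints taken by a fast run reproduce what one long sequential simulation would see.

```python
from concurrent.futures import ThreadPoolExecutor


def step(state: int) -> int:
    """One 'instruction' of a toy deterministic machine (hypothetical stand-in)."""
    return (state * 1103515245 + 12345) % 2**31


def virtualize(initial_state: int, total_steps: int, interval_steps: int):
    """Fast pass: run the whole workload, checkpointing at each interval start."""
    checkpoints = []
    state = initial_state
    for i in range(total_steps):
        if i % interval_steps == 0:
            checkpoints.append((i, state))  # (start step, machine state)
        state = step(state)
    return checkpoints, state


def simulate_interval(state: int, num_steps: int):
    """Slow pass: re-execute one interval with instrumentation enabled.

    The 'analysis' here just counts odd states; a real simulator would
    observe memory accesses, cache behavior, and so on.
    """
    odd_states = 0
    for _ in range(num_steps):
        state = step(state)
        odd_states += state & 1
    return odd_states, state


def simuboost(initial_state: int, total_steps: int, interval_steps: int, nodes: int = 4):
    """Bootstrap one simulation per checkpoint and run them concurrently."""
    checkpoints, vm_final_state = virtualize(initial_state, total_steps, interval_steps)

    def job(checkpoint):
        start, state = checkpoint
        length = min(interval_steps, total_steps - start)
        return simulate_interval(state, length)

    # Threads stand in for simulation nodes; they show the structure only.
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        results = list(pool.map(job, checkpoints))
    return sum(count for count, _ in results), vm_final_state
```

Because the toy machine is fully deterministic, the per-interval results add up to exactly what `simulate_interval(initial_state, total_steps)` returns. Real hardware is not deterministic in this way, which is the problem the following slides address.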

Slide 8: State Deviation
- Devices work asynchronously to the CPU
- Different I/O data and completion timing
- Virtualization and simulation drift apart
Problem: Machine states differ at interval boundaries

Slide 9: Coping with State Deviation
(1) Trap and log non-deterministic events in the hypervisor
(2) Precisely replay events in the simulation
Non-deterministic events (e.g., interrupts, timing instructions)
- ...appear at equal points in the instruction stream
- ...produce the same data output
Virtualization and simulation stay synchronized
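
A minimal sketch of record and replay, under strong simplifying assumptions: the machine state is one integer, a non-deterministic event is a random byte injected at some instruction count, and the names `execute`, `record_run`, and `replay_run` are made up for illustration. It is not how a hypervisor logs events; it only shows why replaying each event at the same point in the instruction stream, with the same data, keeps both runs in the same state.

```python
import random

MOD = 2**32


def execute(state, event):
    """Run one toy instruction; an event (if any) perturbs the state."""
    state = (state * 31 + 7) % MOD
    if event is not None:
        state = (state + event) % MOD
    return state


def record_run(num_instructions, rng):
    """'Virtualized' run: events arrive at random and are logged with
    the instruction count at which they occurred."""
    log = []
    state = 0
    for icount in range(num_instructions):
        event = None
        if rng.random() < 0.05:           # an interrupt fires now
            event = rng.randrange(256)    # its data payload
            log.append((icount, event))
        state = execute(state, event)
    return state, log


def replay_run(num_instructions, log):
    """'Simulated' run: inject each logged event at the same
    instruction count with the same data."""
    events = dict(log)
    state = 0
    for icount in range(num_instructions):
        state = execute(state, events.get(icount))
    return state


if __name__ == "__main__":
    final_state, log = record_run(1000, random.Random())
    print(replay_run(1000, log) == final_state)
```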

Slide 10: Implementation
Implementation is work in progress; components available:
- Fast virtualization (e.g., KVM [4])
- Fast logging of non-deterministic events (e.g., ReVirt [5], Retrace [6])
- Lightweight checkpointing (e.g., Remus [7])
- Functional simulation (e.g., QEMU [8], Simics [9])
  Component       Cost
  Checkpointing   100 ms [7]
  Virtualization  ≈ 1x
  Logging         0-8% [5], 5% [6]
  Simulation      100x slowdown

Slide 11: Speedup and Scalability
Right interval length is crucial
- Too short (a): Checkpoint time dominates
- Too long (c): Little parallelization, long simulation of final interval
Example scenario:
- Basis: Performance of available components
- Optimal interval length: 2 s
- Best possible speedup for 1h workload: 90 nodes (94% parallel efficiency)
Near linear speedup possible
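
The interval-length trade-off can be explored with a back-of-the-envelope model. The slides do not give the model behind the example scenario, so everything here is an assumption: virtualization runs at native speed and pays a fixed cost per checkpoint, logging overhead is ignored, every interval gets its own simulation node, and the run ends when the last interval has been simulated. The function names are hypothetical.

```python
import math


def parallel_time(workload_s, interval_s, checkpoint_s, slowdown):
    """Wall-clock time of the assumed pipeline, in seconds."""
    n = math.ceil(workload_s / interval_s)           # number of intervals
    virtualization = workload_s + n * checkpoint_s   # fast run plus checkpoints
    last_interval = workload_s - (n - 1) * interval_s
    return virtualization + slowdown * last_interval  # plus final simulation


def speedup(workload_s, interval_s, checkpoint_s, slowdown):
    """Speedup over simulating the whole workload on one node."""
    serial = slowdown * workload_s
    return serial / parallel_time(workload_s, interval_s, checkpoint_s, slowdown)


def optimal_interval(workload_s, checkpoint_s, slowdown):
    """Minimizer of workload*(1 + c/L) + s*L, the continuous form of the model."""
    return math.sqrt(workload_s * checkpoint_s / slowdown)


if __name__ == "__main__":
    # 1 h workload, 100 ms checkpoints, 100x simulation slowdown
    print(optimal_interval(3600, 0.1, 100))
    for length in (0.5, 2.0, 60.0):
        print(length, speedup(3600, length, 0.1, 100))
```

With 100 ms per checkpoint and a 100x slowdown, this model puts the optimum for a 1 h workload near 1.9 s, which is in line with the 2 s quoted on the slide. Shorter intervals lose time to checkpointing, and longer ones lose it to the simulation of the final interval.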

Slide 12: Open Questions
Can we reach theoretical speedups in real simulations?
Can multi-core/multi-socket simulations benefit from SimuBoost?
- Capturing non-deterministic events is challenging (shared memory)
Can we simulate a machine with different hardware characteristics than the host?
- SimuBoost replicates behavior of virtual machine

Slide 13: Conclusion
Slowdown of functional full system simulation: >100x
SimuBoost: Accelerate simulation
- Run workload with fast virtualization
- Take checkpoints at regular intervals
- Start parallel simulations on checkpoints
- Logging and replay of non-deterministic events
Advantages:
- Scales with workload run-time (also for single-core simulations)
- High scalability and parallel efficiency possible (90 nodes, 94%)
SimuBoost: Detailed Full System Simulation made practical


Slide 15: Functional Simulation Slowdown
Functional system simulation is slow
Time to completion [h] (slowdown):
  Workload              Native  Hw.-Virt. (KVM)  QEMU (1)   QEMU (2)     Simics (1)
  Linux Kernel Build            (1.08x)          47 (33x)   238 (165x)   1080 (771x)
  SPECint_base                  (1.05x)          133 (22x)  1243 (207x)  6216 (1036x)
  LAMMPS Lennard Jones          (0.91x)          69 (38x)   204 (113x)   1123 (624x)
  Average                       1x               31x        162x         810x
(1) Empty memory hooks
(2) Counting unique accessed physical pages per second
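
The average row can be checked against the per-workload slowdown factors given in parentheses. The snippet below copies those factors from the slide; the dictionary layout and names are only for illustration.

```python
# Slowdown factors per workload, in the order:
# Linux kernel build, SPECint_base, LAMMPS Lennard Jones
slowdowns = {
    "KVM": [1.08, 1.05, 0.91],
    "QEMU, empty memory hooks": [33, 22, 38],
    "QEMU, counting unique pages": [165, 207, 113],
    "Simics, empty memory hooks": [771, 1036, 624],
}

# Arithmetic mean per platform, rounded to whole factors
averages = {name: round(sum(v) / len(v)) for name, v in slowdowns.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}x")
```

Rounded to whole numbers, the means come out as 1x, 31x, 162x, and 810x, matching the averages on the slide.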

Slide 16: SimuBoost: Job Distribution
Intervals are independent jobs
Distribute jobs across nodes:
- One virtualization node
- Many simulation nodes
- (One controller node)
Can we simulate a single core on a multi-core node in its native execution time?
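
One way to picture the job distribution is a producer/consumer queue. The sketch below is an assumption about structure, not the SimuBoost design: one thread plays the virtualization node and emits a job per checkpoint, several threads play simulation nodes, and the shared queue plays the controller. All names and the job format are hypothetical.

```python
import queue
import threading


def virtualization_node(jobs, num_intervals, num_sim_nodes):
    """Emit one job per interval as its checkpoint becomes available."""
    for i in range(num_intervals):
        jobs.put({"interval": i, "checkpoint": f"state-{i}"})
    for _ in range(num_sim_nodes):
        jobs.put(None)  # one stop marker per simulation node


def simulation_node(node_id, jobs, results):
    """Pull jobs until a stop marker arrives; 'simulate' each one."""
    while True:
        job = jobs.get()
        if job is None:
            break
        # A real node would load the checkpoint and run the simulator here.
        results.put((job["interval"], node_id))


def run_cluster(num_intervals, num_sim_nodes):
    """Return (interval, node_id) pairs sorted by interval."""
    jobs, results = queue.Queue(), queue.Queue()
    nodes = [
        threading.Thread(target=simulation_node, args=(n, jobs, results))
        for n in range(num_sim_nodes)
    ]
    for node in nodes:
        node.start()
    virtualization_node(jobs, num_intervals, num_sim_nodes)
    for node in nodes:
        node.join()
    done = []
    while not results.empty():
        done.append(results.get())
    return sorted(done)


if __name__ == "__main__":
    for interval, node_id in run_cluster(10, 3):
        print(f"interval {interval} simulated on node {node_id}")
```

Because the intervals are independent, any free node can take the next job, and results can be reassembled in interval order afterwards.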

Slide 17: References
[1] Ding et al. PQEMU: A parallel system emulator based on QEMU. 2011
[2] T. Sherwood et al. Automatically characterizing large scale program behavior. Volume 30. ACM
[3] V. Weaver et al. Using dynamic binary instrumentation to generate multi-platform simpoints: Methodology and accuracy. HiPEAC
[4] A. Kivity et al. kvm: the Linux virtual machine monitor. Volume 1. Linux Symposium, 2007
[5] G. Dunlap et al. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay
[6] M. Sheldon et al. Retrace: Collecting execution trace with virtual machine deterministic replay
[7] M. Sun et al. Fast, lightweight virtual machine checkpointing
[8] F. Bellard. QEMU: A fast and portable dynamic translator. USENIX, 2005
[9] P. Magnusson et al. Simics: A full system simulation platform. Computer, 35(2), 2002