Early Experiences with KTAU on the IBM Blue Gene/L. A. Nataraj, A. Malony, A. Morris, S. Shende, Performance Research Lab, University of Oregon.

Presentation transcript:

Early Experiences with KTAU on the IBM Blue Gene/L
A. Nataraj, A. Malony, A. Morris, S. Shende
Performance Research Lab, University of Oregon

Early Experiences with KTAU on the IBM BG/L (EuroPar 2006)

Outline
- Motivations
- Objectives
- ZeptoOS project
- KTAU Architecture
- KTAU on Blue Gene/L
- Experiments and experience
- KTAU improvements
- Future work
- Acknowledgements

Motivation
- Application performance is a consequence of
  - User-level execution
  - OS-level operation
- Good tools exist for observing user-level performance
  - User-level events
  - Communication events
  - Execution time measures
  - Hardware performance
- Fewer tools exist to observe OS-level aspects
- Ideally would like to do both simultaneously
  - OS-level influences on application performance

Scale and Performance Sensitivity
- HPC systems continue to scale to larger processor counts
  - Applications become more performance sensitive
  - OS factors can lead to performance bottlenecks [Petrini’03, Jones’03, …]
  - System/application performance effects are complex
  - Isolating system-level factors is non-trivial
- Require comprehensive performance understanding
  - Observation of all performance factors
  - Relative contributions and interrelationships
  - Can we correlate OS and application performance?

Phase Performance Effects
- Waiting time due to OS
- Overhead accumulates

Program - OS Interactions
- Direct
  - Applications invoke the OS for certain services
  - Syscalls and internal OS routines called from syscalls
- Indirect
  - OS operations without explicit invocation by the application
  - Preemptive scheduling (other processes)
  - (HW) interrupt handling
  - OS-background activity
    - Keeping track of time and timers, bottom-half handling, …
    - Can occur at any OS entry

Program - OS Interactions (continued)
- Direct interactions easier to handle
  - Synchronous with user code
  - In process context
- Indirect interactions more difficult
  - Usually asynchronous
  - Usually in interrupt context
  - Harder to measure: where are the boundaries?
  - Harder to correlate and integrate with application measurements

Performance Perspectives
- Kernel-wide
  - Aggregate kernel activity of all active processes
  - Understand overall OS behavior
  - Identify and remove kernel hot spots
  - Cannot show application-specific OS actions
- Process-centric
  - OS performance in specific application context
  - Virtualization and mapping of performance to process
  - Interactions among programs, daemons, and system services
  - Expose sources of performance problems
  - Tune the OS for a specific workload, and the application for the OS
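The two perspectives can be illustrated with a minimal sketch. It assumes a hypothetical per-process profile layout (pid mapped to routine name mapped to call count and inclusive time); neither the layout nor the numbers come from KTAU or from the talk. The kernel-wide view is the sum over all processes, while the process-centric view keeps the attribution to one process.

```python
from collections import defaultdict

# Hypothetical per-process kernel profile:
#   pid -> {kernel routine: (call count, inclusive microseconds)}
# All values are invented for illustration.
per_process = {
    101: {"sys_read": (40, 1200.0), "schedule": (15, 300.0)},
    102: {"sys_write": (25, 2500.0), "schedule": (30, 900.0)},
}

def kernel_wide(profiles):
    """Aggregate every process's counters into one kernel-wide view."""
    totals = defaultdict(lambda: [0, 0.0])
    for routines in profiles.values():
        for name, (calls, usec) in routines.items():
            totals[name][0] += calls
            totals[name][1] += usec
    return {name: (calls, usec) for name, (calls, usec) in totals.items()}

def process_centric(profiles, pid):
    """Return only the kernel activity attributed to one process."""
    return dict(profiles.get(pid, {}))

if __name__ == "__main__":
    print("kernel-wide:", kernel_wide(per_process))
    print("pid 101:", process_centric(per_process, 101))
```

The aggregate shows that `schedule` is the most frequently called routine overall, but only the per-process view shows which process experienced which share of it.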

Existing Approaches
- User-space only measurement tools
  - Many tools only work at user level
  - Cannot observe system-level performance influences
- Kernel-level only measurement tools
  - Most only provide the kernel-wide perspective
    - Lack proper mapping/virtualization
  - Some provide process-centric views
    - Cannot integrate OS and user-level measurements

Existing Approaches (continued)
- Combined or integrated user/kernel measurement tools
  - A few tools allow fine-grained measurement
  - Can correlate kernel and user-level performance
  - Typically focus only on direct OS interactions
  - Indirect interactions not normally merged
  - Do not explicitly recognize parallel workloads
    - MPI ranks, OpenMP threads, …
- Need an integrated approach to parallel performance observation and analysis that supports both perspectives

High-Level Objectives
- Support low-overhead OS performance measurement at multiple levels of function and detail
- Provide both kernel-wide and process-centric perspectives of OS performance
- Merge user-level and kernel-level performance information across all program-OS interactions
- Provide online information and the ability to function without a daemon where possible
- Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
- Leverage existing parallel performance analysis tools
- Support for observing, collecting and analyzing parallel data

ZeptoOS
- DOE OS/RTS for Extreme Scale Scientific Computation
  - Effective OS/Runtime for petascale systems
  - Funded ZeptoOS project
- Argonne National Lab and University of Oregon
- What are the fundamental limits and advanced designs required for petascale Operating System Suites?
  - Behaviour at large scales
  - Management and optimization of OS suites
  - Collective operations
  - Fault tolerance
  - OS performance analysis

ZeptoOS and TAU/KTAU
- Lots of fine-grained OS measurement is required for each component of the ZeptoOS work
- How and why do the various OS source and configuration changes affect parallel applications?
- How do we correlate performance data between
  - OS components
  - Parallel application and OS
- Solution: TAU/KTAU
  - An integrated methodology and framework to measure performance of applications and OS kernel

ZeptoOS Strategy
- “Small Linux on big computers”
  - IBM BG/L and other systems (e.g., Cray XT3)
- Argonne
  - Modified Linux on BG/L I/O nodes (ION)
  - Modified Linux for BG/L compute nodes (TBD)
  - Specialized I/O daemon on I/O node (ZOID) (TBD)
- Oregon
  - KTAU
    - Integration of TAU infrastructure in Linux kernel
    - Integration with ZeptoOS and installation on BG/L ION
    - Port to other 32-bit and 64-bit Linux platforms

KTAU Architecture

KTAU On BG/L’s ZeptoOS
- I/O Node
  - Open source modified Linux kernel (2.4, 2.6)
  - Control I/O Daemon (CIOD) handles I/O syscalls from compute nodes in process set
- Compute Node
  - IBM proprietary (closed-source) light-weight kernel
  - No scheduling or virtual memory support
  - Forwards I/O syscalls to CIOD on I/O node
- KTAU on I/O Node
  - Integrated into ZeptoOS configuration and build system
  - Requires KTAU-D (daemon), since CIOD is closed-source
  - KTAU-D periodically monitors KTAU measurements
    - System-wide or for an individual process

KTAU On BG/L (current version)

Early Experiences on BG/L
- Validate and verify KTAU system
  - Show kernel-wide and process-specific perspectives
  - Run benchmark experiments
- Argonne iotest benchmark
  - MPI-based benchmark (open/write/read/close)
  - Aggregate bandwidth numbers
  - Varying block sizes, number of nodes, and iterations
  - Observe functional and performance behavior
- Apply KTAU to ZeptoOS problems
  - Accurate identification of “noise” sources
- Argonne Selfish benchmark
  - Identify “detours” (noise events) in user space

Experiment Setup (Parameters)
- KTAU:
  - Enable all instrumentation points
  - Number of kernel trace entries per process = 10K
- KTAU-D:
  - System-wide tracing
  - Accesses the trace every 1 second and dumps trace output to a file in the user’s home directory through NFS
- IOTEST:
  - Running with default parameters (blocksize = 16MB)
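The combination of a bounded per-process trace (10K entries) and a daemon that reads it once per second can be sketched as a fixed-capacity buffer with a drain operation. This is an illustration only: the class and method names are hypothetical, and the overwrite-oldest policy is an assumption, since the slides do not say what happens when the buffer fills between reads.

```python
from collections import deque

class TraceBuffer:
    """Fixed-capacity trace buffer (hypothetical, not KTAU's implementation).

    Assumption: when the buffer is full, the oldest entry is overwritten.
    """

    def __init__(self, capacity=10_000):
        self.entries = deque(maxlen=capacity)
        self.written = 0   # total events recorded so far
        self.drained = 0   # value of `written` at the last drain

    def record(self, timestamp, event):
        """Append one trace entry; silently evicts the oldest when full."""
        self.entries.append((timestamp, event))
        self.written += 1

    def drain(self):
        """Return buffered entries and how many were lost since the last drain."""
        batch = list(self.entries)
        self.entries.clear()
        lost = self.written - self.drained - len(batch)
        self.drained = self.written
        return batch, lost

if __name__ == "__main__":
    buf = TraceBuffer(capacity=4)
    for t in range(6):
        buf.record(t, "sys_write")
    print(buf.drain())
```

A daemon in the role of KTAU-D would call `drain()` on a timer and write the batch to a file. A nonzero `lost` value signals that the read interval is too long, or the buffer too small, for the event rate.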

CIOD Kernel Profile on I/O Nodes
- All instrumentation points enabled except schedule()
- Numbers shown are function call counts (profile data)
- Compute node running “hello world” sample job
- Visualize using TAU’s ParaProf

CIOD Kernel Trace (iotest)
- 8 compute nodes
- Zoomed view

sys_read / sys_write
- KTAU trace of CIOD running 2, 4, 8, 16, 32 nodes
- As the number of compute nodes increases, CIOD has to handle a larger number of forwarded syscalls
- sys_write counts shown: 1,769; 3,142; 5,838; 10,980; 37,985
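A short calculation makes the trend in these counts easier to read. It assumes the five sys_write counts correspond, in order, to the 2, 4, 8, 16, and 32 node runs; the slide lists both sequences but does not pair them explicitly.

```python
# Assumed pairing: counts listed in the same order as the node counts.
nodes = [2, 4, 8, 16, 32]
sys_write_calls = [1769, 3142, 5838, 10980, 37985]

def per_node(node_counts, calls):
    """sys_write calls handled by CIOD per compute node."""
    return [c / n for n, c in zip(node_counts, calls)]

def growth(calls):
    """Ratio of each count to the previous one (node count doubles each step)."""
    return [b / a for a, b in zip(calls, calls[1:])]

if __name__ == "__main__":
    for n, c, p in zip(nodes, sys_write_calls, per_node(nodes, sys_write_calls)):
        print(f"{n:2d} nodes: {c:6d} sys_write, {p:8.1f} per node")
    print("growth per doubling:", [round(g, 2) for g in growth(sys_write_calls)])
```

Under that pairing, each doubling up to 16 nodes increases the count by a factor below 2, while the step from 16 to 32 nodes increases it by roughly 3.5.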

Correlated CIOD Activity with RPCIOD
- Switching from CIOD to RPCIOD during a “sys_write” call
- RPCIOD performs “socket_send” for NFS read/write and IRQ
- Trace rows shown: RPCIOD, CIOD
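Correlating the activity of two processes amounts to merging their traces on a shared timeline and looking at what the second process did while the first was inside a call. The sketch below assumes a hypothetical record layout of (timestamp, process, event) with "enter"/"exit" suffixes; the timestamps are invented and this is not the KTAU trace format.

```python
import heapq

# Hypothetical per-process traces, each already sorted by timestamp.
ciod = [(10, "CIOD", "sys_write enter"), (55, "CIOD", "sys_write exit")]
rpciod = [(20, "RPCIOD", "socket_send enter"), (40, "RPCIOD", "socket_send exit")]

def merge_traces(*traces):
    """Merge sorted per-process traces into one timeline."""
    return list(heapq.merge(*traces))

def events_during(merged, proc, call):
    """Events from other processes that occur while `proc` is inside `call`."""
    inside, found = False, []
    for ts, p, ev in merged:
        if p == proc and ev == f"{call} enter":
            inside = True
        elif p == proc and ev == f"{call} exit":
            inside = False
        elif inside and p != proc:
            found.append((ts, p, ev))
    return found

if __name__ == "__main__":
    timeline = merge_traces(ciod, rpciod)
    print(events_during(timeline, "CIOD", "sys_write"))
```

With these sample records, the RPCIOD `socket_send` activity falls entirely inside the CIOD `sys_write` interval, which is the kind of relationship the slide describes.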

Recent Work on ZeptoOS Project
- Accurate identification of “noise” sources
  - Modified Linux on BG/L should be efficient
  - Effect of OS “noise” on synchronization / collectives
  - Which OS aspects induce which types of interference
    - Code paths
    - Configurations
    - Devices attached
  - Requires user-level and OS measurement
- If noise sources can be identified, the interference can be removed or alleviated

Approach
- ANL Selfish benchmark to identify “detours”
  - Noise events in user space
  - Shows durations and frequencies of events
  - Does NOT show cause or source
  - Runs a tight loop with an expected (ideal) duration
    - Logs times and durations of detours
- Use KTAU OS-tracing to record OS activity
  - Correlate time of occurrence
    - Uses same time source as Selfish benchmark
  - Infer type of OS activity (if any) causing the “detour”
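The approach above can be sketched in two steps: detect gaps in a tight timing loop, then look up kernel trace events that overlap each gap on the same clock. This is a simplified illustration, not the ANL Selfish benchmark or KTAU code; the function names, the threshold, and the kernel event names in the usage example are all hypothetical.

```python
import time
from bisect import bisect_left

def sample_timestamps(n):
    """Tight loop that only reads the clock; returns n timestamps in ns."""
    clock = time.perf_counter_ns
    return [clock() for _ in range(n)]

def find_detours(stamps, threshold_ns):
    """Return (start, duration) for each gap between samples above threshold."""
    return [(a, b - a) for a, b in zip(stamps, stamps[1:]) if b - a > threshold_ns]

def correlate(detours, kernel_events):
    """Pair each detour with the kernel events that overlap it in time.

    kernel_events: list of (start_ns, end_ns, name), sorted by start_ns and
    taken from the same time source as the detours (an assumption here).
    """
    starts = [e[0] for e in kernel_events]
    result = []
    for d_start, d_len in detours:
        d_end = d_start + d_len
        hi = bisect_left(starts, d_end)  # events that start before the detour ends
        names = [name for s, e, name in kernel_events[:hi] if e > d_start]
        result.append(((d_start, d_len), names))
    return result

if __name__ == "__main__":
    # Synthetic data: one 5000 ns gap and three made-up kernel events.
    stamps = [0, 100, 200, 5200, 5300]
    events = [(50, 60, "timer_interrupt"), (300, 4900, "schedule"),
              (6000, 6100, "do_IRQ")]
    print(correlate(find_detours(stamps, 1000), events))
```

A detour with an empty list of overlapping events would indicate noise that the kernel trace did not capture, which is why the shared time source matters.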

OS/User Performance View of Scheduling
- Preemptive scheduling

OS/User View of OS Background Activity

OS/User View of OS Background Activity

KTAU On BG/L (future version)
- Replace with: ZOID + TAU
- Replace with: Linux + KTAU

Future Work
- Dynamic measurement control
- Improve performance data sources
- Improve integration with TAU’s user-space capabilities
  - Better correlation of user and kernel performance
  - Full callpaths and phase-based profiling
  - Merged user/kernel traces (already available)
- Integration of TAU and KTAU with Supermon
- Porting efforts to IA-64, PPC-64, and AMD Opteron
- ZeptoOS characterization efforts
  - BGL I/O node
  - Dynamically adaptive kernels

Acknowledgements
- Department of Energy’s Office of Science
- National Science Foundation
- University of Oregon (UO) Core Team
  - Aroon Nataraj, PhD Student
  - Prof. Allen D Malony
  - Dr. Sameer Shende, Senior Scientist
  - Alan Morris, Senior Software Engineer
  - Suravee Suthikulpanit, MS Student (Graduated)
- Argonne National Lab (ANL) Contributors
  - Pete Beckman
  - Kamil Iskra
  - Kazutomo Yoshii