Performance Engineering and Debugging HPC Applications David Skinner


Today: Tools for Performance and Debugging
Principles
– Topics in performance scalability
– Examples of areas where tools can help
Practice
– Where to find tools
– Specifics to NERSC and Hopper

Big Picture of Scalability and Performance

Performance is Relative
To your goals
– Time to solution, T_queue + T_run
– Your research agenda
– Efficient use of allocation
To the
– application code
– input deck
– machine type/state
Suggestion: Focus on specific use cases as opposed to making everything perform well. Bottlenecks can shift.

Performance is Hierarchical
Levels: Registers, Caches, Local Memory, Remote Memory, Disk / Filesystem
Units moved between levels: instructions & operands, lines, pages, messages, blocks and files

Tools are Hierarchical
Levels: Registers, Caches, Local Memory, Remote Memory, Disk / Filesystem
Tools: PAPI, valgrind, Craypat, IPM, Tau, SAR, PMPI

Using the right tool
Tools can add overhead to code execution
– What level can you tolerate?
Tools can add overhead to scientists
– What level can you tolerate?
Scenarios:
– Debugging code that ~isn’t working
– Performance debugging
– Performance monitoring in production

One tool example: IPM on XE
1) Do “module load ipm”, link with $IPM, then run normally
2) Upon completion you get

##IPM2v0.xx########################################################
#
# command   : ./fish -n
# start     : Tue Feb 08 11:05:      host      : nid06027
# stop      : Tue Feb 08 11:08:      wallclock :
# mpi_tasks : 25 on 2 nodes          %comm     : 1.62
# mem [GB]  : 0.24                   gflop/sec : 5.06
…

Maybe that’s enough. If so you’re done. Have a nice day
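If the banner is all you need, the numbers in it can also be pulled out with a script. The following is a minimal sketch, not part of IPM: `parse_banner` is a hypothetical helper, only the banner lines that appear complete on the slide are used, and the two-column spacing of the sample text is an assumption about the layout.

```python
import re

# Banner lines copied from the slide. Only lines with complete values are
# included; the two-column spacing is an assumed layout.
BANNER = """\
# mpi_tasks : 25 on 2 nodes          %comm     : 1.62
# mem [GB]  : 0.24                   gflop/sec : 5.06
"""

# A field is "key : value"; a value ends at a run of 2+ spaces or end of line.
FIELD = re.compile(r"([%\w\[\]/ ]+?)\s*:\s+(.*?)(?=\s{2,}|$)")


def parse_banner(text):
    """Return a dict mapping banner keys to their (string) values."""
    fields = {}
    for line in text.splitlines():
        body = line.lstrip("#").strip()
        for key, value in FIELD.findall(body):
            fields[key.strip()] = value.strip()
    return fields


fields = parse_banner(BANNER)
comm_percent = float(fields["%comm"])
print(f"{comm_percent}% of wall time in MPI, {fields['gflop/sec']} gflop/sec")
```

Values are kept as strings so that fields such as `mpi_tasks` ("25 on 2 nodes") survive unchanged; convert only the ones you need.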

HPC Tool Topics
CPU and memory usage
– FLOP rate
– Memory high water mark
OpenMP
– OMP overhead
– OMP scalability (finding right # threads)
MPI
– % wall time in communication
– Detecting load imbalance
– Analyzing message sizes
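As a small illustration of one topic above, the memory high water mark of a single process can be read from the operating system without any HPC tool. This is a sketch using Python's standard `resource` module on a Unix-like system; it is not how IPM or the other tools named in this talk collect the number, and `peak_rss_bytes` is a hypothetical helper name.

```python
import resource
import sys


def peak_rss_bytes():
    """Return this process's peak resident set size in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


before = peak_rss_bytes()
buffer = bytearray(64 * 1024 * 1024)  # allocate 64 MiB so the peak is likely to move
after = peak_rss_bytes()
print(f"peak RSS before: {before / 2**20:.1f} MiB, after: {after / 2**20:.1f} MiB")
```

The value is a high water mark: it never decreases, even after `buffer` is freed.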

Examples of HPC tool usage

Scaling: definitions
Scaling studies involve changing the degree of parallelism. Will we change the problem also?
Strong scaling
– Fixed problem size
Weak scaling
– Problem size grows with additional resources
Speed up = T_s / T_p(n)
Efficiency = T_s / (n * T_p(n))
Be aware there are multiple definitions for these terms
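The two formulas on this slide can be written directly as code. The sketch below reads T_s as the serial run time and T_p(n) as the run time on n tasks, which is the usual interpretation but is not spelled out on the slide. The timings are made-up numbers for illustration, not measurements.

```python
def speedup(t_serial, t_parallel):
    """Speed up = T_s / T_p(n)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n):
    """Efficiency = T_s / (n * T_p(n))."""
    return t_serial / (n * t_parallel)


# Hypothetical strong-scaling timings in seconds, keyed by task count n.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
t_serial = timings[1]

for n, t_parallel in sorted(timings.items()):
    s = speedup(t_serial, t_parallel)
    e = efficiency(t_serial, t_parallel, n)
    print(f"n={n:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

Efficiency is simply speedup divided by n, so a value near 1 means the extra tasks are being used well.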

Conducting a scaling study
With a particular goal in mind, we systematically vary concurrency and/or problem size
Example: How large a 3D (n^3) FFT can I efficiently run on 1024 cpus?
Looks good?
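One way to vary concurrency and problem size systematically is to write down the run configurations before submitting anything. The sketch below builds such a list for an n^3 grid: the strong-scaling plan holds n fixed, and the weak-scaling plan holds the grid points per task (n^3 / p) roughly constant. The function names, base grid size, and task counts are hypothetical choices, not values from the talk.

```python
def strong_scaling_plan(n, task_counts):
    """Fixed problem size: the same n^3 grid at every task count."""
    return [(p, n) for p in task_counts]


def weak_scaling_plan(n_base, p_base, task_counts):
    """Grow n so that n^3 / p stays close to n_base^3 / p_base."""
    plan = []
    for p in task_counts:
        n = round(n_base * (p / p_base) ** (1.0 / 3.0))
        plan.append((p, n))
    return plan


task_counts = [16, 128, 1024]  # hypothetical concurrencies, ending at 1024
for p, n in weak_scaling_plan(n_base=64, p_base=16, task_counts=task_counts):
    print(f"tasks={p:5d}  grid={n}^3  points/task={n**3 / p:.0f}")
```

Each (tasks, grid) pair is then run and timed; the plan only fixes what to measure, not the answer.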

Let’s look a little deeper…

The scalability landscape
Whoa! Why so bumpy?
– Algorithm complexity or switching
– Communication protocol switching
– Inter-job contention
– ~bugs in vendor software

Not always so tricky
Main loop in jacobi_omp.f90; ngrid=6144 and maxiter=20

Load (Im)balance
MPI ranks sorted by total communication time
Communication Time: 64 tasks show 200s, 960 tasks show 230s
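The pattern on this slide can be found mechanically from per-rank communication times. The sketch below rebuilds the 1024 values from the two groups quoted on the slide (64 tasks at 200 s, 960 tasks at 230 s) and summarizes them; `comm_time_summary` is a hypothetical helper, not an IPM function. A common reading, which the slide does not state, is that the ranks with less communication time are computing longer while the rest wait for them.

```python
from collections import Counter
from statistics import mean


def comm_time_summary(comm_times):
    """Summarize per-rank communication times given in seconds."""
    return {
        "min": min(comm_times),
        "max": max(comm_times),
        "mean": mean(comm_times),
        "groups": dict(Counter(comm_times)),  # time -> number of ranks
    }


# Per-rank values reconstructed from the two groups quoted on the slide.
comm_times = [200.0] * 64 + [230.0] * 960
summary = comm_time_summary(comm_times)
spread = summary["max"] - summary["min"]

print(f"ranks={len(comm_times)}  mean={summary['mean']:.3f}s  spread={spread:.0f}s")
for seconds, count in sorted(summary["groups"].items()):
    print(f"  {count:4d} ranks at {seconds:.0f}s")
```

With real data the times will not fall into exact groups, so sorting the ranks by time, as the slide does, is usually the first step.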

Load Balance: cartoon
[Figure: “Universal App” timelines, unbalanced vs. balanced, showing the time saved by load balance]
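The cartoon can be turned into a back-of-the-envelope estimate. Under the simplifying assumption that every task must wait for the slowest one, the unbalanced run takes as long as the largest per-task work, while a perfectly balanced run takes the average. The function name and the work values below are hypothetical.

```python
def time_saved_by_balancing(work):
    """Upper bound on time saved if work were spread evenly across tasks."""
    unbalanced = max(work)            # everyone waits for the slowest task
    balanced = sum(work) / len(work)  # ideal: identical work per task
    return unbalanced - balanced


# Hypothetical per-task work in seconds; one task carries extra load.
work = [10.0, 10.0, 10.0, 18.0]
saved = time_saved_by_balancing(work)
print(f"unbalanced={max(work):.1f}s  saved={saved:.1f}s")
```

This is an upper bound: redistributing work has its own cost, and real work rarely divides perfectly evenly.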

Too much communication

Simple Stuff: What’s wrong here?

Not so simple: Comm. topology
[Figure panels: MILC, PARATEC, IMPACT-T, CAM, MAESTRO, GTC]

The state of HW counters
– The transition to many-core has brought complexity to the once orderly space of hardware performance counters.
– NERSC, UCB, and UTK are all working on improving things.
– IPM on XE: currently just the banner is in place.
– We think PAPI is working (recently worked with Cray on bug fixes).

Next up…Richard.