A Dynamic Tracing Mechanism For Performance Analysis of OpenMP Applications - Caubet, Gimenez, Labarta, DeRose, Vetter (WOMPAT 2001) - Presented by Anita Nagarajan

Presentation transcript:

A Dynamic Tracing Mechanism For Performance Analysis of OpenMP Applications - Caubet, Gimenez, Labarta, DeRose, Vetter (WOMPAT 2001) - Presented by Anita Nagarajan

Introduction
OpenMP
– Standard for shared-memory parallel programming
– Set of directives and library routines for Fortran and C/C++
Performance Tools
– Need: analyse parallel behaviour and determine the causes of performance problems in OpenMP applications
– Properties: minimize intrusion cost, maximize the performance data captured

Introduction (Contd.)
Dynamic Instrumentation
– Instruments the application while it is executing; no recompilation is required
Dynamic Probe Class Library (DPCL)
– Developed at IBM, built on top of the Dyninst API
– Using DPCL, a performance tool “attaches” to the application, “inserts code patches” into the binary, and “starts/continues” its execution
– Program instrumentation can be done at function “entry points”, “exit points” and “call sites”
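The attach-and-patch idea can be illustrated with a rough Python analogy. This is not DPCL or Dyninst code: `insert_probes`, `app`, `solve` and the `events` list are hypothetical names, and the sketch patches a Python function object rather than a binary. It only shows probes being added at a function's entry and exit points while the program is already running, without touching the function's source.

```python
import time
import types

events = []  # (function name, "entry" or "exit", timestamp) records


def insert_probes(namespace, func_name):
    """Replace namespace.func_name with a wrapper that logs entry and exit."""
    original = getattr(namespace, func_name)

    def probed(*args, **kwargs):
        events.append((func_name, "entry", time.perf_counter()))
        try:
            return original(*args, **kwargs)
        finally:
            # The exit probe fires even if the function raises.
            events.append((func_name, "exit", time.perf_counter()))

    setattr(namespace, func_name, probed)
    return original  # returned so the probe could be removed again


# A stand-in "application" held in a namespace that can be patched at run time.
app = types.SimpleNamespace(solve=lambda n: sum(i * i for i in range(n)))

before = app.solve(10)        # runs uninstrumented
insert_probes(app, "solve")   # "attach" and patch while the program is running
after = app.solve(10)         # same result, now traced
```

The two calls return the same value; only the second one leaves an entry and an exit record in `events`.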

DPCL
DPCL consists of
– Client library
– Runtime library
– Daemon
– Super-daemon

OMPtrace
– Built on top of DPCL
– The IBM compiler translates OpenMP directives into function calls
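Why the translation into function calls matters for a tool that instruments function entry and exit points can be sketched as follows. The slides do not show the generated code, so this assumes the common "outlining" scheme, in which the body of a parallel region becomes a separate function that a runtime routine invokes on every thread. `runtime_parallel_region` and `region_body` are hypothetical names, not IBM runtime routines.

```python
import threading


def runtime_parallel_region(outlined_body, num_threads):
    """Hypothetical runtime entry point: run outlined_body once per thread."""
    threads = [
        threading.Thread(target=outlined_body, args=(tid, num_threads))
        for tid in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # implicit barrier at the end of the region


data = list(range(1, 101))
partial = [0] * 4  # one slot per thread, so no two threads write the same slot


def region_body(tid, num_threads):
    """Outlined body of a parallel loop: each thread sums a strided share."""
    partial[tid] = sum(data[tid::num_threads])


# What a directive-annotated loop would become: a plain call into the runtime.
runtime_parallel_region(region_body, 4)
total = sum(partial)
```

Under this assumption, the start and end of a parallel region coincide with the entry and exit of ordinary functions, which are the points where probes can be inserted.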

Translation of OpenMP Directives

OMPtrace

OMPtrace(Contd.)…

Hardware Counters
– OMPtrace can access hardware counters and provide statistics on hardware events, e.g. L1/L2 hits, L1/L2 misses, number of instructions
Paraver
– Computes “derived metrics” from hardware events, e.g. L1 misses per second
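A derived metric such as L1 misses per second is a rate computed from raw counter readings. The sketch below is not Paraver code: `CounterSample` and `misses_per_second` are hypothetical names, the sample values are made up for illustration, and it assumes cumulative counter readings with strictly increasing timestamps.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    time_s: float    # timestamp of the reading, in seconds
    l1_misses: int   # cumulative L1 miss count at that time


def misses_per_second(samples):
    """Rate between consecutive readings: delta(count) / delta(time)."""
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.time_s - prev.time_s  # assumed > 0
        rates.append((cur.l1_misses - prev.l1_misses) / dt)
    return rates


# Made-up readings, for illustration only.
samples = [
    CounterSample(0.0, 0),
    CounterSample(0.5, 1000),
    CounterSample(1.5, 4000),
]
rates = misses_per_second(samples)
```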

Case Study: Sweep3D
– Multidimensional wavefront algorithm for “discrete ordinates” deterministic particle transport simulation

Sweep3D (Contd.)
– diag: original version of Sweep3D
– mkj: “do idiag” and “do jkm” loops replaced by a triple nested loop (“do m”, “do k”, “do j”)
– ccrit: based on “mkj”, outer loop parallelized, synchronization implemented using the “CRITICAL” directive
– cpipe: based on “mkj”, outer loop parallelized, synchronization implemented using shared arrays and busy waiting
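The two synchronization styles named for ccrit and cpipe can be contrasted with a small Python analogy. This is not the Sweep3D code, which the slides do not show. A `threading.Lock` stands in for an OpenMP CRITICAL section, and a shared progress array with spinning stands in for the shared-array, busy-waiting scheme. All names (`critical_worker`, `pipe_worker`, `done`, `order`) are hypothetical.

```python
import threading
import time

NUM_THREADS, NUM_BLOCKS = 3, 4

# "ccrit" style: a lock plays the role of an OpenMP CRITICAL section.
critical = threading.Lock()
shared_sum = 0


def critical_worker(tid):
    global shared_sum
    for b in range(NUM_BLOCKS):
        with critical:  # only one thread at a time updates shared_sum
            shared_sum += tid * NUM_BLOCKS + b


# "cpipe" style: shared progress array plus busy waiting.
done = [0] * NUM_THREADS  # done[t] = number of blocks finished by thread t
order = []                # (tid, block) pairs in completion order


def pipe_worker(tid):
    for b in range(NUM_BLOCKS):
        if tid > 0:
            while done[tid - 1] <= b:  # spin until upstream finished block b
                time.sleep(0)          # yield so other threads can run
        order.append((tid, b))
        done[tid] = b + 1              # publish progress to the next thread


def run(worker):
    threads = [
        threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


run(critical_worker)
run(pipe_worker)
```

In the lock version every thread competes for the same lock on every block. In the pipelined version a thread waits only on its upstream neighbour, so block `b` of thread `t` always completes after block `b` of thread `t - 1`.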

Results from Experiments
Table: elapsed time in seconds for the different OpenMP versions (ccrit, cpipe, diag)

Analysis of Results using Paraver
Ccrit
– Not scalable
– Overhead of mutex lock and unlock, plus contention
– Trace colours: red = trying to obtain the lock, blue = using the lock, green = releasing the lock, light blue = execution outside the critical section

Cpipe
– Better performance than ccrit
– Poor locality because the “m” loop has an iteration count of 6

Diag
– Limited scalability due to a high number of L2 misses
– Trace colours: blue = large values, green = low values

Optimization
kjmi
– Loops interchanged
– Good performance, better scalability
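How loop interchange can improve locality is shown below on a generic example. It is not the Sweep3D loop nest, and the slides do not state which loops were interchanged. The sketch assumes a row-major 2D array stored in a flat list and counts how many consecutive accesses touch adjacent memory locations. `traverse` and `unit_stride_steps` are hypothetical helper names.

```python
ROWS, COLS = 4, 6
flat = list(range(ROWS * COLS))  # row-major: element (r, c) lives at r*COLS + c


def traverse(outer, inner, index):
    """Run a two-level loop nest; return the sum and the offsets visited."""
    total, offsets = 0, []
    for a in range(outer):
        for b in range(inner):
            off = index(a, b)
            total += flat[off]
            offsets.append(off)
    return total, offsets


def unit_stride_steps(offsets):
    """Count consecutive accesses that land on the next memory location."""
    return sum(1 for p, q in zip(offsets, offsets[1:]) if q - p == 1)


# Column index in the outer loop: consecutive accesses are COLS elements apart.
sum_before, off_before = traverse(COLS, ROWS, lambda c, r: r * COLS + c)

# Interchanged: column index innermost, so memory is walked contiguously.
sum_after, off_after = traverse(ROWS, COLS, lambda r, c: r * COLS + c)
```

Both orders compute the same sum, but only the interchanged nest visits memory with unit stride, which is the access pattern caches reward.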

Conclusions
– OMPtrace and Paraver together form a useful toolset for performance analysis and optimization of OpenMP applications