1 The VAMPIR and PARAVER performance analysis tools applied to a wet chemical etching parallel algorithm S. Boeriu 1 and J.C. Bruch, Jr. 2 1 Center for.

Slides:



Advertisements
Similar presentations
WHAT IS DRS? Thermoflow, Inc.
Advertisements

DETAILED DESIGN, IMPLEMENTATIONA AND TESTING Instructor: Dr. Hany H. Ammar Dept. of Computer Science and Electrical Engineering, WVU.
Program Analysis and Tuning The German High Performance Computing Centre for Climate and Earth System Research Panagiotis Adamidis.
Chapter 2 Operating System Overview Operating Systems: Internals and Design Principles, 6/E William Stallings.
Dynamic Load Balancing for VORPAL Viktor Przebinda Center for Integrated Plasma Studies.
Extensibility, Safety and Performance in the SPIN Operating System Presented by Allen Kerr.
1 Advancing Supercomputer Performance Through Interconnection Topology Synthesis Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng Department.
Parallel Computation of the 2D Laminar Axisymmetric Coflow Nonpremixed Flames Qingan Andy Zhang PhD Candidate Department of Mechanical and Industrial Engineering.
Using DSVM to Implement a Distributed File System Ramon Lawrence Dept. of Computer Science
CS 345 Computer System Overview
Operating System (O.S.) Objectives & Functions
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.
Parallelization of FFT in AFNI Huang, Jingshan Xi, Hong Department of Computer Science and Engineering University of South Carolina.
Page 1 CS Department Parallel Design of JPEG2000 Image Compression Xiuzhen Huang CS Department UC Santa Barbara April 30th, 2003.
Figure 1.1 Interaction between applications and the operating system.
The new The new MONARC Simulation Framework Iosif Legrand  California Institute of Technology.
Operating Systems CS208. What is Operating System? It is a program. It is the first piece of software to run after the system boots. It coordinates the.
By Mr. Abdalla A. Shaame.  An operating system is a software component that acts as the core of a computer system.  It performs various functions and.
2017/4/21 Towards Full Virtualization of Heterogeneous Noc-based Multicore Embedded Architecture 2012 IEEE 15th International Conference on Computational.
UPC/SHMEM PAT High-level Design v.1.1 Hung-Hsun Su UPC Group, HCS lab 6/21/2005.
1 Performance Analysis with Vampir DKRZ Tutorial – 7 August, Hamburg Matthias Weber, Frank Winkler, Andreas Knüpfer ZIH, Technische Universität.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Adventures in Mastering the Use of Performance Evaluation Tools Manuel Ríos Morales ICOM 5995 December 4, 2002.
Computational issues in Carbon nanotube simulation Ashok Srinivasan Department of Computer Science Florida State University.
Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.
Parallelization: Area Under a Curve. AUC: An important task in science Neuroscience – Endocrine levels in the body over time Economics – Discounting:
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.
The Vampir Performance Analysis Tool Hans–Christian Hoppe Gesellschaft für Parallele Anwendungen und Systeme mbH Pallas GmbH Hermülheimer Straße 10 D
Martin Schulz Center for Applied Scientific Computing Lawrence Livermore National Laboratory Lawrence Livermore National Laboratory, P. O. Box 808, Livermore,
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
1 Performance Analysis with Vampir ZIH, Technische Universität Dresden.
Domain Decomposed Parallel Heat Distribution Problem in Two Dimensions Yana Kortsarts Jeff Rufinus Widener University Computer Science Department.
CE Operating Systems Lecture 3 Overview of OS functions and structure.
Profiling, Tracing, Debugging and Monitoring Frameworks Sathish Vadhiyar Courtesy: Dr. Shirley Moore (University of Tennessee)
Parameter Range Study of Numerically-Simulated Isolated Multicellular Convection Z. DuFran, B. Baranowski, C. Doswell III, and D. Weber This work is supported.
Distributed System Concepts and Architectures 2.3 Services Fall 2011 Student: Fan Bai
HPCMP Benchmarking Update Cray Henry April 2008 Department of Defense High Performance Computing Modernization Program.
Belgrade, 25 September 2014 George S. Markomanolis, Oriol Jorba, Kim Serradell Performance analysis Tools: a case study of NMMB on Marenostrum.
Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.
1 Typical performance bottlenecks and how they can be found Bert Wesarg ZIH, Technische Universität Dresden.
ATmospheric, Meteorological, and Environmental Technologies RAMS Parallel Processing Techniques.
Scientific Workflow systems: Summary and Opportunities for SEEK and e-Science.
Silberschatz, Galvin and Gagne  Operating System Concepts UNIT II Operating System Services.
Belgrade, 26 September 2014 George S. Markomanolis, Oriol Jorba, Kim Serradell Overview of on-going work on NMMB HPC performance at BSC.
© 2006, National Research Council Canada © 2006, IBM Corporation Solving performance issues in OTS-based systems Erik Putrycz Software Engineering Group.
Outline Introduction Research Project Findings / Results
Parallelization Strategies Laxmikant Kale. Overview OpenMP Strategies Need for adaptive strategies –Object migration based dynamic load balancing –Minimal.
Introduction to Operating Systems Prepared by: Dhason Operating Systems.
Chapter 8 System Management Semester 2. Objectives  Evaluating an operating system  Cooperation among components  The role of memory, processor,
1 Rocket Science using Charm++ at CSAR Orion Sky Lawlor 2003/10/21.
Motivation: dynamic apps Rocket center applications: –exhibit irregular structure, dynamic behavior, and need adaptive control strategies. Geometries are.
Other Tools HPC Code Development Tools July 29, 2010 Sue Kelly Sandia is a multiprogram laboratory operated by Sandia Corporation, a.
April 24, 2002 Parallel Port Example. April 24, 2002 Introduction The objective of this lecture is to go over a simple problem that illustrates the use.
LIOProf: Exposing Lustre File System Behavior for I/O Middleware
Tuning Threaded Code with Intel® Parallel Amplifier.
Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,
1.3 Operating system services An operating system provide services to programs and to the users of the program. It provides an environment for the execution.
Roy Taragan Shaham Kenat
Introduction to Parallel Computing: MPI, OpenMP and Hybrid Programming
For Massively Parallel Computation The Chaotic State of the Art
Parallel Programming By J. H. Wang May 2, 2017.
Characterization of Parallel Scientific Simulations
Performance Analysis, Tools and Optimization
Parallel Programming in C with MPI and OpenMP
Allen D. Malony Computer & Information Science Department
Introduction to Operating Systems
Outline Chapter 2 (cont) OS Design OS structure
System calls….. C-program->POSIX call
Parallel Programming in C with MPI and OpenMP
Presentation transcript:

1 The VAMPIR and PARAVER performance analysis tools applied to a wet chemical etching parallel algorithm S. Boeriu 1 and J.C. Bruch, Jr. 2 1 Center for Computational Science and Engineering 2 Department of Mechanical and Environmental Engineering and Department of Mathematics University of California, Santa Barbara

2 Acknowledgements This material is based upon work supported by the National Science Foundation under Grant # This research was conducted using the resources of the San Diego Supercomputer Center.

3 Outline of Presentation 1.Introduction (Physical problem) 2.Problem formulation 3.Fixed domain formulation 4.Numerical algorithm 5.Test case 6.Performance tools and considerations a. VAMPIR b. PARAVER 7.Diagnostic example 8.Conclusions

4 Physical problem A gap of width “2a” and length “L” is to be etched in a flat plate. The remainder of the plate is covered with a protective (photoresist) layer. Since it is assumed that L >>2a, the problem can be considered as two-dimensional. Figure 1. Physical problem

5 Simplifying assumptions i. There is no convection in the etching medium ii. The etching process is isotropic iii. The thickness of the photoresist layer is infinitely small iv. Only one component of the etching liquid determines the process

6 Problem formulation Mathematical model: The etching fluid  (t) is bounded by the outer boundary  1 ; the photoresist layer   (t); and the moving boundary S(t). D\  (t) denotes part of the solid. Figure 2. Side view of physical problem showing mathematical problem setup.

7 Fixed domain formulation Figure 3. Fixed domain mathematical formulation.

8 Numerical Algorithm The basic numerical algorithm is: with inand

9 Numerical algorithm (cont.) with in (the rectangular region of the plate’s cross section)

10 Test case Maxrow1 (# of rows in the top region) = 280 Maxcol1 (# of columns in the top region) = 321 Maxrow2 (# of rows in the bottom region) = 80 Maxcol2 (# of columns in the bottom region) = 161 Maxtime (# of time steps) = 5  t (size of time steps) = 1  (successive over-relaxation factor) = B (non-dimensional number) = 10.0

11 Domain decomposition Figure 5. Domain decomposition of mathematical problem into sixteen subregions showing the flow of computations.

12 Load balancing information for the test case Processors Bottom Processors Bottom Points Top Processors Top Points Diff Points

13 Figure 4. Ideal versus obtained speedup

14 Figure 6. Moving boundaries at various times.

15 Performance tools and considerations The parallel program is monitored while it is executed. Monitoring produces performance data that is interpreted in order to reveal areas of poor performance. The program is then altered and the process is repeated until an acceptable level of performance is reached.

16 VAMPIR (Visualization and Analysis of MPI Resources – 2.0) VAMPIR 2.0 is a post-mortem trace visualization tool from Pallas GmbH It uses the profile extensions to MPI and permits analysis of the message events where data is transmitted between processors during execution of a parallel program. It has a convenient user-interface and an excellent zooming and filtering. Global displays show all selected processes.

17 Global Timeline: detailed application execution over time axis Activity Chart: presents per-process profiling information Summaric Chart: aggregated profiling information Communication Statistics: message statistics for each process pair Global Communication Statistics: collective operations statistics I/O Statistics: MPI I/O operation statistics Calling Tree: global dynamic calling tree

18

19

20

21

22

23

24

25

26

27 PARAVER (Parallel Program Visualization and Analysis Tool) PARAVER is a flexible parallel program visualization and analysis tool based on an easy-to-use Motif GUI (graphical user interface) PARAVER was developed to respond to the basic need to have a qualitative perception of the application behavior by visual inspection and then to be able to focus on the detailed quantitative analysis of the problems.

28 Paraver (Parallel Program Visualization and Analysis Tool) Powerful flexible parallel program visualization tool based on an easy-to-use Motif GUI (graphical user interface) Developed by : European Center for Parallelism of Barcelona (CEPBA) Universitat Politecnica de Catalunya

29 Paraver is designed to visualize and analyze - Communication and load balance - Combining OpenMP and MPI - Hardware performance and counters Usage - Compile programs with special libraries - Run programs to produce trace files - View and analyze traces - Designed to help in program understanding and optimization

30

31

32

33

34

35

36

37

38 Inefficient programming example Load imbalance (inefficient memory use) Cache misses and page faults Stride minimization (efficient memory use)

39 Load imbalance - VAMPIR Figure 7. Load imbalance in the plate region.

40 Load imbalance - PARAVER Figure 8. Load imbalance in the plate region.

41 The memory hierarchy

42 Array Allocation

43 Example of coding Figure 9. A piece of the etching code (non-optimized on the left and optimized on the right). c- c- calculate error in the top region and update u1old do 370 i = iam*numrows1 + 1,lastrow do 380 j = 2,maxcol1 if (abs(u1new(i,j) - u1old(i,j)).gt.err) then err = abs(u1new(i,j) - u1old(i,j)) endif u1old(i,j) = u1new(i,j) 380 continue 370 continue c- do 380 j = 2,maxcol1 do 370 i = iam*numrows1 + 1,lastrow if ( abs(u1new(i,j) - u1old(i,j)).gt. err) then err = abs(u1new(i,j) - u1old(i,j)) endif u1old(i,j) = u1new(i,j) 370 continue 380 continue c-

44 Load balance - VAMPIR Figure 10. Approximate load balance.

45 Load balance - PARAVER Figure 11. Approximate load balance.

46 Conclusions A significant factor that affects the performance of a parallel application is the balance between communication and workload. The challenge of the message passing model is in reducing message traffic over the interconnection network. To fully understand the performance behavior of such applications, analysis and visualization tools are needed. Two such tools, VAMPIR and PARAVER, were used to analyze the performance of the etching application. It was seen that optimization of the parallel code can be carried out in an iterative process involving these tools to investigate performance issues.

47 Web Sites Project site San Diego Supercomputer Center html VAMPIR PARAVER