MPI Program Performance Self Test with solution

Matching

1. Amdahl's Law
2. Profiles
3. Relative efficiency
4. Load imbalances
5. Timers
6. Asymptotic analysis
7. Execution time
8. Cache effects
9. Event traces
10. Absolute speedup

a) The time elapsed from when the first processor starts executing a problem to when the last processor completes execution.
b) T1/(P*Tp), where T1 is the execution time on one processor and Tp is the execution time on P processors.
c) The execution time on one processor of the fastest sequential program divided by the execution time on P processors.
d) When the sequential component of an algorithm accounts for 1/s of the program's execution time, the maximum possible speedup that can be achieved on a parallel computer is s.
e) Characterizing performance in a large limit.
f) When an algorithm suffers from computation or communication imbalances among processors.
g) When the fast memory on a processor gets used more often in a parallel implementation, causing an unexpected decrease in the computation time.
h) A performance tool that shows the amount of time a program spends on different program components.
i) A performance tool that determines the length of time spent executing a particular piece of code.
j) The most detailed performance tool; it generates a file that records the significant events in the running of a program.

Answer 1.D 2.H 3.B 4.F 5.I 6.E 7.A 8.G 9.J 10.C
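
As a quick check of definition (d), Amdahl's law can be evaluated directly: if a fraction f = 1/s of the program is sequential, the speedup on P processors is bounded by 1/(f + (1 - f)/P), which approaches s as P grows. Below is a minimal sketch in C (not part of the tutorial); the 10% sequential fraction and the processor counts are illustrative values only.

#include <stdio.h>

/* Amdahl's law: speedup(P) = 1 / (f + (1 - f)/P), where f is the
   sequential fraction of the work.  As P grows, the speedup
   approaches 1/f = s. */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.10;                    /* assume 10% of the work is sequential, so s = 10 */
    int procs[] = {1, 4, 16, 64, 1024};
    int i;

    for (i = 0; i < 5; i++)
        printf("P = %4d  speedup <= %.2f\n", procs[i], amdahl_speedup(f, procs[i]));

    return 0;                           /* the printed bound never exceeds 1/f = 10 */
}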

Self Test

2. The following is not a performance metric:
   a) speedup
   b) efficiency
   c) problem size

Answer

a) Incorrect. Speedup is a performance metric. Relative speedup is defined as T1/Tp, where T1 is the execution time on one processor and Tp is the execution time on P processors. Absolute speedup is obtained by replacing the execution time on one processor with the execution time of the fastest sequential algorithm.
b) Incorrect. Efficiency is a performance metric. Relative efficiency is defined as T1/(P*Tp), where T1 is the execution time on one processor and Tp is the execution time on P processors. Absolute efficiency is obtained by replacing the execution time on one processor with the execution time of the fastest sequential algorithm.
c) Correct. Problem size is a factor that affects a program's execution time, but it is not a metric for analyzing performance.
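
These two formulas are straightforward to apply once you have wall-clock measurements. The following minimal sketch is not taken from the tutorial; the times t1 and tp stand for assumed measurements (obtained, for example, with MPI_Wtime), and the numbers are made up for illustration.

#include <stdio.h>

/* Relative speedup and efficiency from two measured wall-clock times:
   t1 = execution time on one processor, tp = execution time on p processors. */
int main(void)
{
    double t1 = 42.0;                  /* assumed one-processor time, in seconds   */
    double tp = 12.5;                  /* assumed time on p processors, in seconds */
    int    p  = 4;

    double speedup    = t1 / tp;       /* relative speedup    T1/Tp      */
    double efficiency = t1 / (p * tp); /* relative efficiency T1/(P*Tp)  */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}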

Self Test

3. A good question to ask in scalability analysis is:
   a) How can one overlap computation and communications tasks in an efficient manner?
   b) How can a single performance measure give an accurate picture of an algorithm's overall performance?
   c) How does efficiency vary with increasing problem size?
   d) In what parameter regime can I apply Amdahl's law?

Answer

a) Incorrect.
b) Incorrect.
c) Correct.
d) Incorrect.

Self Test

4. If an implementation has unaccounted-for overhead, a possible reason is:
   a) an algorithm may suffer from computation or communication imbalances among processors.
   b) the cache, or fast memory, on a processor may get used more often in a parallel implementation, causing an unexpected decrease in the computation time.
   c) you failed to employ a domain decomposition.
   d) there is not enough communication between processors.

Answer

a) Correct.
b) Incorrect.
c) Incorrect.
d) Incorrect.
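
One practical way to check whether a computation imbalance is the source of such overhead is to time each rank's compute phase and compare the slowest and fastest ranks. The sketch below is not part of the tutorial's solution code; it uses only standard MPI calls (MPI_Wtime, MPI_Reduce), and do_local_work is a placeholder for whatever each rank actually computes.

#include <stdio.h>
#include "mpi.h"

/* Placeholder for each rank's share of the computation; the work here is
   deliberately unequal so that an imbalance shows up in the timings. */
static void do_local_work(int rank)
{
    volatile double x = 0.0;
    long i;
    for (i = 0; i < 1000000L * (rank + 1); i++)
        x += 1.0;
}

int main(int argc, char **argv)
{
    int rank;
    double t_local, t_max, t_min;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t_local = MPI_Wtime();
    do_local_work(rank);
    t_local = MPI_Wtime() - t_local;         /* this rank's compute time */

    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_local, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("compute time: max = %f s, min = %f s, ratio = %f\n",
               t_max, t_min, t_max / t_min);

    MPI_Finalize();
    return 0;
}

A max/min ratio well above 1 suggests a load imbalance; a ratio near 1 means the unaccounted-for overhead probably comes from somewhere else.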

Self Test

5. Which one of the following is not a data collection technique used to gather performance data:
   a) counters
   b) profiles
   c) abstraction
   d) event traces

Answer

a) Incorrect. Counters are data collection subroutines that increment whenever a specified event occurs.
b) Incorrect. Profiles show the amount of time a program spends on different program components.
c) Correct. Abstraction is not a data collection technique. A good performance tool allows data to be examined at a level of abstraction appropriate for the programming model of the parallel program.
d) Incorrect. Event traces contain the most detailed program performance information; a trace-based system generates a file that records the significant events in the running of a program.
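
The simplest of these techniques, a counter combined with a timer, is easy to roll by hand. The sketch below is not from the tutorial; compute_step is a hypothetical stand-in for real application code, the counter records how often an event of interest occurs, and MPI_Wtime measures the length of time spent in the region.

#include <stdio.h>
#include "mpi.h"

/* Hypothetical region of interest; stands in for real application code. */
static int compute_step(int i)
{
    return (i % 7 == 0);                  /* pretend the "event of interest" is a multiple of 7 */
}

int main(int argc, char **argv)
{
    int rank, i;
    long event_count = 0;                 /* counter: how often the event occurred    */
    double t0, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();                     /* timer: start of the measured region      */
    for (i = 0; i < 100000; i++)
        if (compute_step(i))
            event_count++;
    elapsed = MPI_Wtime() - t0;           /* timer: time spent executing this region  */

    printf("rank %d: %ld events in %f seconds\n", rank, event_count, elapsed);

    MPI_Finalize();
    return 0;
}

Profiles and event traces are usually produced by external tools rather than hand-written code, but they record the same kinds of raw data (time per program component, and a log of significant events) at a finer level of detail.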

Course Problem

In this chapter, the broad subject of parallel code performance is discussed both in terms of theoretical concepts and some specific tools for measuring performance metrics that work on certain parallel machines. Put in its simplest terms, improving code performance boils down to speeding up your parallel code and/or improving how your code uses memory.

Course Problem

As you have learned new features of MPI in this course, you have also improved the performance of the code. Here is a list of performance improvements so far:

– Using Derived Datatypes instead of sending and receiving the separate pieces of data
– Using Collective Communication routines instead of repeating/looping individual sends and receives
– Using a Virtual Topology and its utility routines to avoid extraneous calculations
– Changing the original master-slave algorithm so that the master also searches part of the global array (The slave rebellion: Spartacus!)
– Using "true" parallel I/O so that all processors write to the output file simultaneously instead of just one (the master)

Course Problem

But more remains to be done, especially in terms of how the program uses memory. That is the last exercise for this course. The problem description is the same as the one given in Chapter 9, but you will modify the code you wrote using what you learned in this chapter.

Course Problem

Description:
The initial problem implements a parallel search of an extremely large (several thousand elements) integer array. The program finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

Exercise:
Modify your code from Chapter 9 so that it uses dynamic memory allocation to use only the amount of memory it needs and only for as long as it needs it. Make both the arrays a and b ALLOCATED DYNAMICALLY and connect them to memory properly. You may also assume that the input data file "b.data" now has on its first line the number of elements in the global array b. The second line now has the target value. The remaining lines are the contents of the global array b.

Solution

Note: The new code is that in which the arrays a and b are declared dynamically, actually allocated, and deallocated when they are no longer needed. Further, note that only processor 0 is concerned with allocating/deallocating the global array b, while all processors dynamically create and destroy their individual subarrays a.

Solution

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  int i, target;              /* local variables */
  /* Arrays a and b have no memory assigned to them when they are declared */
  int *b, *a;                 /* a is the name of the array each slave searches */
  int length = 0;
  int rank, size, err;
  MPI_Status status;
  int end_cnt, gi;
  FILE *sourceFile;

  /* Variables needed to prepare the file */
  int amode, intsize;
  MPI_Datatype etype, filetype;
  MPI_Info info;
  MPI_File fh;
  MPI_Offset disp;

Solution

  err = MPI_Init(&argc, &argv);
  err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  err = MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (size != 4) {
    printf("Error: You must use 4 processes to run this program.\n");
    return 1;
  }

  /* Create the file and make it write only */
  amode = (MPI_MODE_CREATE | MPI_MODE_WRONLY);
  info = MPI_INFO_NULL;

  /* Name the file and open it to all processors */
  err = MPI_File_open(MPI_COMM_WORLD, "found.dat", amode, info, &fh);

  intsize = sizeof(int);      /* size in bytes of one integer in the output file */
  disp = rank * intsize;
  etype = MPI_INT;
  filetype = MPI_INT;
  err = MPI_File_set_view(fh, disp, etype, filetype, "native", info);
  /* This and the preceding four lines prepare the "view" each processor has of the
     output file. This view tells where in the file each processor should put the
     target locations it finds. In our case, P0 will start putting data at the
     beginning of the file, P1 will start putting data one integer's length from the
     beginning of the file, and so on. */

Solution

  if (rank == 0) {
    /* File b1.data has the length value on the first line, then the target value. */
    /* The remaining lines of b1.data have the values for the b array. */
    sourceFile = fopen("b1.data", "r");
    if (sourceFile == NULL) {
      printf("Error: can't access file b1.data.\n");
      return 1;
    } else {
      /* Read in the length and the target */
      fscanf(sourceFile, "%d", &length);
      fscanf(sourceFile, "%d", &target);
    }
  }

  /* Notice the broadcasts are outside of the if; all processors must call them */
  err = MPI_Bcast(&target, 1, MPI_INT, 0, MPI_COMM_WORLD);
  err = MPI_Bcast(&length, 1, MPI_INT, 0, MPI_COMM_WORLD);

Solution

  if (rank == 0) {
    /* Only at this point is b connected to exactly the correct amount of memory */
    b = (int *)malloc(length * sizeof(int));

    /* Read in b array */
    for (i = 0; i < length; i++) {
      fscanf(sourceFile, "%d", &b[i]);
    }
    fclose(sourceFile);
  }

  /* Only at this point is each processor's array a connected to a smaller amount of memory */
  a = (int *)malloc((length / size) * sizeof(int));

  /* Again, the scatter is after the if; all processors must call it */
  err = MPI_Scatter(b, (length / size), MPI_INT, a, (length / size), MPI_INT, 0, MPI_COMM_WORLD);

  /* Processor 0 no longer needs b */
  if (rank == 0)
    free(b);

Solution

  /* The loop covers every element of this processor's subarray; gi is the
     1-based global index of a match in the full array b. */
  for (i = 0; i < length / size; i++) {
    if (a[i] == target) {
      gi = rank * (length / size) + i + 1;
      /* Each processor writes to the file */
      err = MPI_File_write(fh, &gi, 1, MPI_INT, &status);
    }
  }

  free(a);    /* All the processors are through with a */

  err = MPI_File_close(&fh);
  err = MPI_Finalize();
  return 0;
}

Solution

The results obtained from running this code are written to the file "found.dat". As before, it must be viewed with the special octal dump (od) command:

od -d found.dat

This shows exactly the target locations you obtained in Chapter 9, where the arrays a and b were statically allocated.

Solution

Many improvements in the performance of any code are possible. This is left as your final exercise: come up with some other ideas and try them! Good luck and happy MPI programming!