An Introduction to Parallel Programming and MPICH
Nikolaos Hatzopoulos

What is Serial Computing? Traditionally, software has been written for serial computation:
– To be run on a single computer having a single Central Processing Unit (CPU).
– A problem is broken into a discrete series of instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in time.

Serial Computing:

What is Parallel Computing? In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
– To be run using multiple CPUs.
– A problem is broken into discrete parts that can be solved concurrently.
– Each part is further broken down into a series of instructions.
– Instructions from each part execute simultaneously on different CPUs.

Parallel Computing

Computer Architecture (von Neumann): comprised of four main components:
– Memory
– Control Unit
– Arithmetic Logic Unit
– Input/Output
Read/write, random-access memory is used to store both program instructions and data:
– Program instructions are coded data which tell the computer to do something.
– Data is simply information to be used by the program.
The Control Unit fetches instructions and data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task. The Arithmetic Logic Unit performs basic arithmetic operations. Input/Output is the interface to the human operator.

UMA, or Uniform Memory Access. In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect).

UMA, or Uniform Memory Access UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory. That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.

NUMA (Non-Uniform Memory Access). In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly and with a distinctive performance advantage. At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect).

NUMA (Non-Uniform Memory Access). What gives NUMA its name is that memory access time varies with the location of the data to be accessed. If data resides in local memory, access is fast. If data resides in remote memory, access is slower. The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve average-case access time through the introduction of fast, local memory.

Modern multiprocessor systems. In this complex hierarchical scheme, processors are grouped by their physical location on a multi-core CPU package, or "node". Processors within a node share access to memory modules as per the UMA shared memory architecture. At the same time, they may also access memory on a remote node using a shared interconnect, but with slower performance, as per the NUMA shared memory architecture.

Distributed computing A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.

Parallel algorithm for Distributed Memory Computing. Assume we have the numbers [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] and we want to add them with a parallel algorithm on 4 CPUs.
Solution: CPU0 gets 1, 2, 3; CPU1 gets 4, 5, 6; CPU2 gets 7, 8, 9; CPU3 gets 10, 11, 12. Each CPU computes its partial sum, and CPU0 then collects the partial sums and adds them: 6 + 15 + 24 + 33 = 78.
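
Before looking at the MPI version later in these slides, the decomposition can be sketched in plain C. This serial sketch is not from the slides; the outer loop over cpu simply stands in for the work each processor would do independently:

/* Serial sketch of the block decomposition described above (no MPI yet):
   split the array into p equal blocks, sum each block separately, then
   combine the partial sums, exactly as CPU0 does in the slides. */
#include <stdio.h>

int main(void)
{
    int data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int p = 4;                       /* number of "CPUs" */
    int block = 12 / p;              /* 3 numbers per CPU */
    int partial[4], total = 0;

    for (int cpu = 0; cpu < p; cpu++) {      /* each block would run on its own CPU */
        partial[cpu] = 0;
        for (int k = 0; k < block; k++)
            partial[cpu] += data[cpu * block + k];
        printf("partial sum of CPU%d: %d\n", cpu, partial[cpu]);
    }
    for (int cpu = 0; cpu < p; cpu++)        /* the combine step done by CPU0 */
        total += partial[cpu];
    printf("total: %d\n", total);            /* prints 78 */
    return 0;
}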

What are the benefits of a parallel program? Assume that Tn is the time to pass a message through the network and To is the time to execute one operation. In our example the parallel version needs: Tn + 3To + Tn + 4To = 2Tn + 7To (distribute the data, sum 3 numbers locally, gather the partial results, combine them). The serial program needs 12To. If To = 1 and Tn = 10: Parallel = 2x10 + 7x1 = 27, Serial = 12x1 = 12. Conclusion: for this small problem, serial is faster than parallel.

What are the benefits of a parallel program? Now assume we have 12,000 numbers to add (3,000 per CPU). Parallel: Tn + 3,000To + Tn + 4To = 2Tn + 3,004To = 2x10 + 3,004x1 = 3,024. Serial: 12,000To = 12,000x1 = 12,000. Conclusion: parallel is about 4 times faster than serial. Parallel computing becomes beneficial for large-scale computational problems.
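
The same cost model can be evaluated for any problem size N and CPU count p. A small sketch (not from the slides) following the slides' accounting: two messages, N/p local additions, and p additions to combine the partial results on CPU0:

/* Evaluate the simple cost model used above:
   parallel = 2*Tn + (N/p)*To + p*To,  serial = N*To */
#include <stdio.h>

double parallel_cost(double N, double p, double Tn, double To)
{
    return 2.0 * Tn + (N / p) * To + p * To;
}

double serial_cost(double N, double To)
{
    return N * To;
}

int main(void)
{
    double Tn = 10.0, To = 1.0;
    printf("N = 12:    parallel = %.0f, serial = %.0f\n",
           parallel_cost(12, 4, Tn, To), serial_cost(12, To));        /* 27 vs 12 */
    printf("N = 12000: parallel = %.0f, serial = %.0f\n",
           parallel_cost(12000, 4, Tn, To), serial_cost(12000, To));  /* 3024 vs 12000 */
    return 0;
}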

MPICH. MPICH is a freely available, portable implementation of MPI, a standard for message passing in distributed-memory applications used in parallel computing. The Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another. MPICH is a library with bindings for C/C++ and Fortran.

Installation of MPICH for Linux
Web page:
Download for Linux: ftp://ftp.mcs.anl.gov/pub/mpi/mpich.tar.gz
Untar: tar xvfz mpich.tar.gz
Configure:
as root: ./configure --prefix=/usr/local --rsh=ssh
as user: ./configure --prefix=/home/username --rsh=ssh
Compile: make
Install: make install

Testing MPICH
$ which mpicc
It should give the path of mpicc where we installed it, e.g. ~/bin/mpicc, and the same for mpirun.
To run a test, from the MPICH installation directory:
$ cd examples/basic
$ make
$ mpirun -np 2 cpi
Result:
Process 0 of 2 on localhost.localdomain
pi is approximately ..., Error is ...
wall clock time = ...
Process 1 of 2 on localhost.localdomain

Possible Errors
If the path of mpicc is not found:
$ cd ~
$ gedit .bashrc
Add the following line at the bottom:
export PATH=$PATH:/path_of_mpich/bin
Save and re-login.
When we run: mpirun -np 2 cpi
p0_29223: p4_error: Could not gethostbyname for host buster.localdomain; may be invalid name
This means it cannot resolve buster.localdomain, which is our hostname.
As root: gedit /etc/hosts
Locate the localhost line and add the hostname at the end of it, for example:
127.0.0.1 localhost.localdomain localhost buster.localdomain

SSH login without a password
To avoid typing our password as many times as the np value, we can set up a password-less login:
$ ssh-keygen
Finishing this process creates two key files.
$ cd ~/.ssh
$ ls
id_rsa  id_rsa.pub  known_hosts
$ cp id_rsa.pub authorized_keys2
So when we do $ ssh localhost it will log in without asking for a password.

hello.c MPICH program:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_rank;
    int size;
    int namelen;
    char proc_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(proc_name, &namelen);

    if (my_rank == 2)
        printf("Hello - I am process 2\n");
    else
        printf("Hello from process %d of %d on %s\n", my_rank, size, proc_name);

    MPI_Finalize();
    return 0;
}

Run hello.c
$ mpicc hello.c
$ mpirun -np 4 a.out
Result:
Hello from process 0 of 4 on localhost.localdomain
Hello from process 1 of 4 on localhost.localdomain
Hello from process 3 of 4 on localhost.localdomain
Hello - I am process 2
NOTE: the results may be displayed in a different order; that depends on how the operating system schedules the processes.

From the MPICH documentation:
– MPI_MAX_PROCESSOR_NAME: maximum length of the name returned by MPI_GET_PROCESSOR_NAME
– MPI_Init: initializes the MPI execution environment
– MPI_Comm_rank: determines the rank of the calling process in the communicator
– MPI_Comm_size: determines the size of the group associated with a communicator
– MPI_Get_processor_name: gets the name of the processor
– MPI_Finalize: terminates the MPI execution environment

Prepare data for the parallel sum:

if (my_rank == 0) {                      // ON CPU0
    array_size = 12;
    for (i = 0; i < array_size; i++)
        data[i] = i + 1;                 // fill the data array with 1, 2, 3, ..., 12
    for (target = 1; target < p; target++)
        MPI_Send(&array_size, 1, MPI_INT, target, tag1, MPI_COMM_WORLD);   // send array size to the other CPUs
    loc_array_size = array_size / p;     // calculate local array size
    k = loc_array_size;
    for (target = 1; target < p; target++) {
        MPI_Send(&data[k], loc_array_size, MPI_INT, target, tag2, MPI_COMM_WORLD);  // send data to the other CPUs
        k += loc_array_size;             // k = 3, 6, 9, 12
    }
    for (k = 0; k < loc_array_size; k++)
        data_loc[k] = data[k];           // initialize CPU0's local array
}
else {
    MPI_Recv(&array_size, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD, &status);   // receive array size from CPU0
    loc_array_size = array_size / p;
    MPI_Recv(&data_loc[0], loc_array_size, MPI_INT, 0, tag2, MPI_COMM_WORLD, &status);  // receive local array from CPU0
}
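
This snippet (and the "Parallel sum" snippet on the next slide) omits the declarations and MPI setup around it. A minimal skeleton, assuming variable names that match the snippets and arbitrary tag values, might look like this:

/* A minimal skeleton (not from the original slides) with the declarations and
   MPI setup that the "Prepare data" and "Parallel sum" snippets assume.
   The variable names match the snippets; the tag values are arbitrary. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_rank, p, i, k, target, source;
    int array_size, loc_array_size;
    int data[12], data_loc[12];          /* 12 numbers, 3 per CPU when p = 4 */
    int res, finres;
    int tag1 = 1, tag2 = 2, tag3 = 3;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* ... insert the "Prepare data" code above and the "Parallel sum" code
       from the next slide here (omitting the snippet's own MPI_Finalize()
       call, since the skeleton already calls it below) ... */

    MPI_Finalize();
    return 0;
}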

Parallel sum:

res = 0;                                 // parallel sum
for (k = 0; k < loc_array_size; k++)
    res = res + data_loc[k];

if (my_rank != 0) {
    MPI_Send(&res, 1, MPI_INT, 0, tag3, MPI_COMM_WORLD);    // send result to CPU0
}
else {
    finres = res;                        // res of CPU0
    printf("\n Result of process %d: %d\n", my_rank, res);
    for (source = 1; source < p; source++) {
        MPI_Recv(&res, 1, MPI_INT, source, tag3, MPI_COMM_WORLD, &status);  // receive results from the other CPUs
        finres = finres + res;
        printf("\n Result of process %d: %d\n", source, res);
    }
    printf("\n\n\n Final Result: %d\n", finres);
}
MPI_Finalize();
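
The slides gather the partial results by hand with MPI_Send/MPI_Recv. As a side note, MPI also provides a collective operation that performs the same combination in one call; a minimal sketch of the final step rewritten with MPI_Reduce (not part of the original program):

/* Alternative (not in the slides): MPI_Reduce combines the per-process
   partial sums into finres on rank 0 in a single collective call. */
res = 0;
for (k = 0; k < loc_array_size; k++)
    res = res + data_loc[k];             /* local partial sum, as before */

MPI_Reduce(&res, &finres, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (my_rank == 0)
    printf("\n Final Result: %d\n", finres);
MPI_Finalize();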

Parallel Sum Output
$ mpirun -np 4 a.out
Result of process 0: 6
Result of process 1: 15
Result of process 2: 24
Result of process 3: 33
Final Result: 78

MPI_Send: performs a basic send.
Synopsis:
#include "mpi.h"
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Input parameters:
– buf: initial address of send buffer (choice)
– count: number of elements in send buffer (nonnegative integer)
– datatype: datatype of each send buffer element (handle)
– dest: rank of destination (integer)
– tag: message tag (integer)
– comm: communicator (handle)

MPI_Recv: basic receive.
Synopsis:
#include "mpi.h"
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Output parameters:
– buf: initial address of receive buffer (choice)
– status: status object (Status)
Input parameters:
– count: maximum number of elements in receive buffer (integer)
– datatype: datatype of each receive buffer element (handle)
– source: rank of source (integer)
– tag: message tag (integer)
– comm: communicator (handle)
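
To see the two calls paired outside the sum example, here is a minimal self-contained sketch (not from the slides): rank 0 sends a single integer to rank 1, which receives and prints it. It needs at least two processes, e.g. mpirun -np 2 a.out.

/* Minimal illustrative pairing of MPI_Send and MPI_Recv (not from the slides):
   rank 0 sends one integer to rank 1, which receives and prints it. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_rank, value, tag = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 0) {
        value = 42;                                             /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);   /* dest = rank 1 */
    }
    else if (my_rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);  /* source = rank 0 */
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}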