Lecture 29: Parallel Programming Overview (Fall 2006)


Parallelization in Everyday Life
- Example 0: organizations consisting of many people
  - each person acts sequentially
  - all people are acting in parallel
- Example 1: building a house (functional decomposition)
  - Some tasks must be performed before others: dig hole, pour foundation, frame walls, roof, etc.
  - Some tasks can be done in parallel: install kitchen cabinets, lay the tile in the bathroom, etc.
- Example 2: digging post holes (“data” parallel decomposition)
  - If it takes one person an hour to dig a post hole, how long will it take 30 men to dig one post hole?
  - How long would it take 30 men to dig 30 post holes?

Parallelization in Everyday Life
- Example 3: car assembly line (pipelining)

Parallel Programming Paradigms: Various Methods
- There are many methods of programming parallel computers. Two of the most common are message passing and data parallel.
  - Message Passing - the user makes calls to libraries to explicitly share information between processors.
  - Data Parallel - data partitioning determines the parallelism.
  - Shared Memory - multiple processes share a common memory space.
  - Remote Memory Operation - a set of processes in which a process can access the memory of another process without its participation.
  - Threads - a single process having multiple (concurrent) execution paths.
  - Combined Models - composed of two or more of the above.
- Note: these models are machine/architecture independent; any of the models can be implemented on any hardware given appropriate operating system support. An effective implementation is one which closely matches its target hardware and provides the user ease in programming.
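
As a concrete illustration of the threads model listed above (a single process with multiple concurrent execution paths), here is a minimal POSIX threads sketch in C; the worker function and the thread count are arbitrary choices for illustration, not part of the original slides.

/* Minimal threads-model sketch: one process, several concurrent
   execution paths sharing the same address space.
   Build with: gcc threads_demo.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4            /* arbitrary thread count for illustration */

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   /* wait for every execution path to finish */

    return 0;
}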

Parallel Programming Paradigms: Message Passing
- The message passing model is defined as:
  - a set of processes using only local memory
  - processes communicate by sending and receiving messages
  - data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive)
- Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
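
To make the matched send/receive requirement concrete, here is a minimal hedged C sketch of one cooperative exchange between two MPI processes; the tag and the message contents are arbitrary, and the program assumes it is launched with at least two processes.

/* One matched send/receive pair: data moves from rank 0's local
   memory to rank 1's local memory only because both sides participate. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                           /* exists only in rank 0's memory  */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send ...               */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* ... needs this matching receive */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}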

Parallel Programming Paradigms: Data Parallel
- The data parallel model is defined as:
  - Each process works on a different part of the same data structure.
  - Commonly a Single Program Multiple Data (SPMD) approach.
  - Data is distributed across processors.
  - All message passing is done invisibly to the programmer.
  - Commonly built "on top of" one of the common message passing libraries.
- Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.
- The compiler converts the program into standard code and calls to a message passing library to distribute the data to all the processes.

Implementation of Message Passing: MPI
- Message Passing Interface, often called MPI.
- A standard, portable message-passing library definition developed in 1993 by a group of parallel computer vendors, software writers, and application scientists.
- Available to both Fortran and C programs.
- Available on a wide variety of parallel machines.
- Target platform is a distributed memory system.
- All inter-task communication is by message passing.
- All parallelism is explicit: the programmer is responsible for parallelizing the program and implementing the MPI constructs.
- Programming model is SPMD (Single Program Multiple Data).
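
Because MPI programs follow the SPMD model, every process runs the same executable and differentiates its behavior by rank. A minimal skeleton in C using the core calls MPI_Init, MPI_Comm_rank, MPI_Comm_size, and MPI_Finalize might look like the sketch below; the file name and launch command are illustrative.

/* SPMD skeleton: the same program runs on every process.
   Typical build/run: mpicc hello.c -o hello && mpirun -np 4 ./hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?   */

    if (rank == 0)
        printf("running on %d processes\n", size);
    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}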

Implementations: F90 / High Performance Fortran (HPF)
- Fortran 90 (F90): ISO/ANSI standard extensions to Fortran 77.
- High Performance Fortran (HPF): extensions to F90 to support data parallel programming.
- Compiler directives allow programmer specification of data distribution and alignment.
- New compiler constructs and intrinsics allow the programmer to do computations and manipulations on data with different distributions.

Steps for Creating a Parallel Program
1. If you are starting with an existing serial program, debug the serial code completely.
2. Identify the parts of the program that can be executed concurrently:
   - Requires a thorough understanding of the algorithm.
   - Exploit any inherent parallelism which may exist.
   - May require restructuring of the program and/or algorithm. May require an entirely new algorithm.
3. Decompose the program:
   - Functional parallelism
   - Data parallelism
   - Combination of both
4. Code development:
   - Code may be influenced/determined by machine architecture.
   - Choose a programming paradigm.
   - Determine communication.
   - Add code to accomplish task control and communications.
5. Compile, test, debug.
6. Optimization:
   - Measure performance.
   - Locate problem areas.
   - Improve them.

Recall Amdahl’s Law
- Speedup due to enhancement E is:

  Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)

- Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S, and the remainder of the task is unaffected:

  ExTime w/ E = ExTime w/o E × ((1 - F) + F/S)

  Speedup w/ E = 1 / ((1 - F) + F/S)

Examples: Amdahl’s Law
- Amdahl’s Law tells us that to achieve linear speedup with 100 processors (e.g., a speedup of 100), none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.
- What speedup could we achieve from 100 processors if 30% of the original program is scalar?

  Speedup w/ E = 1 / ((1 - F) + F/S) = 1 / (0.30 + 0.70/100) ≈ 3.26

- A serial program/algorithm might need to be restructured to allow for efficient parallelization.
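
The arithmetic above is easy to check with a few lines of C; the helper below simply evaluates Speedup = 1 / ((1 - F) + F/S) for the two cases mentioned on this slide (the function name is made up for illustration).

#include <stdio.h>

/* Amdahl's Law: f = fraction of the task that is enhanced (parallelizable),
   s = speedup of that fraction (here, the number of processors). */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    printf("70%% parallel,    100 processors: %.2f\n", amdahl_speedup(0.70,   100.0));  /* ~3.26 */
    printf("99.99%% parallel, 100 processors: %.2f\n", amdahl_speedup(0.9999, 100.0));  /* ~99   */
    return 0;
}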

Decomposing the Program
- There are three methods for decomposing a problem into smaller tasks to be performed in parallel: functional decomposition, domain decomposition, or a combination of both.
- Functional Decomposition (Functional Parallelism)
  - Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution.
  - Good to use when there is no static structure or fixed determination of the number of calculations to be performed.
- Domain Decomposition (Data Parallelism)
  - Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution.
  - Good to use for problems where:
    - data is static (factoring and solving a large matrix or finite difference calculations)
    - a dynamic data structure is tied to a single entity where the entity can be subsetted (large multi-body problems)
    - the domain is fixed but computation within various regions of the domain is dynamic (fluid vortices models)
  - There are many ways to decompose data into partitions to be distributed (see the sketch after this slide):
    - One-Dimensional Data Distribution
      - Block Distribution
      - Cyclic Distribution
    - Two-Dimensional Data Distribution
      - Block-Block Distribution
      - Block-Cyclic Distribution
      - Cyclic-Block Distribution
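
As a sketch of the one-dimensional distributions listed above, the following C fragment prints which indices of an N-element array each process would own under a block distribution versus a cyclic distribution; N and the process count are placeholder values chosen for illustration.

#include <stdio.h>

#define N 16   /* total number of elements (illustrative) */

int main(void)
{
    int size = 4;   /* number of processes (illustrative) */

    for (int rank = 0; rank < size; rank++) {
        /* Block distribution: each process owns one contiguous chunk. */
        int chunk = (N + size - 1) / size;              /* ceiling division      */
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;     /* exclusive upper bound */
        printf("rank %d block : [%d, %d)\n", rank, lo, hi);

        /* Cyclic distribution: elements are dealt out round-robin. */
        printf("rank %d cyclic:", rank);
        for (int i = rank; i < N; i += size)
            printf(" %d", i);
        printf("\n");
    }
    return 0;
}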

Functional Decomposition of a Program
- Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution.
- Good to use when there is no static structure or fixed determination of the number of calculations to be performed.
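
In an SPMD message-passing program, functional decomposition often amounts to dispatching different task functions to different ranks. The sketch below is only an illustration; the task names (atmosphere, ocean, land) are hypothetical and not taken from the slides.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical tasks: each is a distinct piece of the overall job. */
static void model_atmosphere(void) { printf("atmosphere task\n"); }
static void model_ocean(void)      { printf("ocean task\n"); }
static void model_land(void)       { printf("land surface task\n"); }

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Functional parallelism: assign a different task to each process. */
    switch (rank) {
    case 0:  model_atmosphere(); break;
    case 1:  model_ocean();      break;
    default: model_land();       break;
    }

    MPI_Finalize();
    return 0;
}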

Functional Decomposition of a Program (figure)

Domain Decomposition (Data Parallelism)
- Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution.
- There are many ways to decompose data into partitions to be distributed.

Cannon's Matrix Multiplication