Designing Parallel Programs David Rodriguez-Velazquez CS-6260 Spring-2009 Dr. Elise de Doncker.


Manual vs. Automatic Parallelization Designing and developing parallel programs has historically been a very manual process. The programmer is responsible for both identifying and implementing the parallelism. Manually developing parallel code is a – Time-consuming – Complex – Error-prone – Iterative process

Outline Parallelization Partitioning Communication Efficiency Synchronization Data Dependency Load Balancing Granularity I/O Amdahl’s Law Complexity Portability Resource Requirements Scalability MPI demo – Matrix shared memory – Matrix multiplication – Alltoall – Heat equation

Parallelizing Compiler (Pre-Processor) The most common type of tool used to automatically convert a serial program into a parallel program. A parallelizing compiler works in two different ways: – Fully automatic – Programmer directed

Parallelizing Compiler (Fully Automatic) The compiler analyzes the source code and identifies opportunities for parallelism. The analysis includes: – Identifying inhibitors to parallelism – Possibly a cost weighting on whether or not the parallelism would actually improve performance – Loops (do, for) are the most frequent target for automatic parallelization

Parallelizing Compiler (Programmer Directed) Using “compiler directives” or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code. May be used in conjunction with some degree of automatic parallelization.
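As an illustration (not part of the original slides), a minimal programmer-directed sketch in C using an OpenMP directive; the array name and size are hypothetical:

    #include <omp.h>
    #define N 1000000

    /* a must point to at least N doubles. */
    void scale(double *a, double s) {
        /* The directive tells the compiler to split the loop iterations
           across the threads of a parallel team. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] * s;
    }

Compiled with an OpenMP-aware compiler (e.g. with a flag such as -fopenmp), the loop runs in parallel; without it, the pragma is ignored and the code stays serial.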

Automatic Parallelization (Caveats) Wrong results may be produced. Performance may actually degrade. Much less flexible than manual parallelization. Limited to a subset (mostly loops) of the code. May not parallelize the code at all if the analysis suggests there are inhibitors or the code is too complex.

Understand the Problem & the Program The first step in developing parallel software is to: – Understand the problem that you wish to solve in parallel (if starting from a serial program, you also need to understand the existing code) – Before spending time: determine whether or not the problem is one that can actually be parallelized – Identify the program’s hotspots (know where the real work is being done; performance analysis tools can help here) – Identify bottlenecks (I/O is usually something that slows a program down; change algorithms to reduce or eliminate unnecessary slow areas) – Investigate other algorithms – Investigate inhibitors to parallelism. One common class of inhibitor is data dependence

Examples (Parallelizable?) – Example of a parallelizable problem: Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation. Each molecular conformation is independently determinable, and the calculation of the minimum energy conformation is also a parallelizable problem. – Example of a non-parallelizable problem: Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by F(k + 2) = F(k + 1) + F(k). The calculation of the Fibonacci sequence as shown entails dependent calculations rather than independent ones: the F(k + 2) value uses those of both F(k + 1) and F(k), so these terms cannot be calculated independently and therefore not in parallel.
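A minimal C sketch (added for illustration, not from the slides) of this loop-carried dependence: each Fibonacci term needs the results of the two previous iterations, so the iterations cannot be distributed to independent tasks.

    /* Iterative Fibonacci: every iteration depends on the previous two,
       which inhibits parallel execution of the loop. */
    long fib(int n) {
        long prev = 1, curr = 1;           /* F(1), F(2) */
        for (int k = 3; k <= n; k++) {
            long next = prev + curr;       /* needs both earlier results */
            prev = curr;
            curr = next;
        }
        return curr;
    }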

Partitioning – Break the problem into discrete “chunks” of work that can be distributed to multiple tasks – Domain decomposition & Functional decomposition

Partition Domain Decomposition: the data associated with the problem is decomposed. Each parallel task then works on a portion of the data.
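A minimal sketch in C (an assumed example, not from the slides) of a 1-D block domain decomposition: each of ntasks tasks computes the contiguous range of array indices it owns.

    /* Split N elements into contiguous blocks, one per task.
       Tasks with rank < rem get one extra element. */
    void block_range(int N, int ntasks, int rank, int *start, int *end) {
        int base = N / ntasks;
        int rem  = N % ntasks;
        *start = rank * base + (rank < rem ? rank : rem);
        *end   = *start + base + (rank < rem ? 1 : 0);   /* exclusive */
    }

Each task then loops only over its own [start, end) range; cyclic or block-cyclic distributions are common alternatives.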

Partition Functional Decomposition: In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

Partition (Functional Decomposition)
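To make the idea concrete, a small hypothetical C sketch (the component functions are invented for illustration, loosely following a climate-model style example): each task performs a different part of the computation rather than working on a slice of the same data.

    /* Hypothetical model components, stubs for illustration only. */
    void run_atmosphere(void) { /* ... */ }
    void run_ocean(void)      { /* ... */ }
    void run_land(void)       { /* ... */ }

    /* Functional decomposition: each task performs different work. */
    void step(int rank) {
        if      (rank == 0) run_atmosphere();
        else if (rank == 1) run_ocean();
        else if (rank == 2) run_land();
        /* The components would then exchange boundary data,
           e.g. via message passing. */
    }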

Communications Who needs communications: – You don’t: some types of problems can be decomposed and executed in parallel with very little inter-task communication; these are called embarrassingly parallel. E.g. an image processing operation where every pixel in a black-and-white image has its color reversed. – You do: most parallel applications do require tasks to share data with each other (e.g. an ecosystem simulation).

Communications (Factors to Consider) There are a number of important factors to consider when designing your program’s inter-task communications: – Cost of communications – Latency vs. bandwidth – Visibility of communications – Synchronous vs. asynchronous communication – Scope of communications – Efficiency of communications

Communications (Cost) Inter-task communication virtually always implies overhead. Machine cycles and resources that could be used for computation are instead used to package and transmit data. Communications frequently require some type of synchronization between tasks, which can result in tasks spending time “waiting” instead of doing work. Competing communication traffic can saturate the available network bandwidth, further aggravating performance problems.

Communications (Latency vs. Bandwidth) Latency is the time it takes to send a minimal (0 byte) message from point A to point B, commonly expressed in microseconds. Bandwidth is the amount of data that can be communicated per unit of time, commonly expressed in megabytes/sec or gigabytes/sec. Sending many small messages can cause latency to dominate communication overheads. Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth.
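A short C sketch (assuming a simple linear cost model, not part of the slides) that estimates transfer time as latency + size/bandwidth and shows why one large message usually beats many small ones; the latency and bandwidth values are made up:

    #include <stdio.h>

    /* Estimated time to send `bytes` given latency (s) and bandwidth (bytes/s). */
    double msg_time(double bytes, double latency, double bandwidth) {
        return latency + bytes / bandwidth;
    }

    int main(void) {
        double latency = 1e-6;   /* assumed 1 microsecond */
        double bw      = 1e9;    /* assumed 1 GB/s */
        printf("1000 x 1 KB: %g s\n", 1000 * msg_time(1e3, latency, bw));
        printf("1 x 1 MB:    %g s\n", msg_time(1e6, latency, bw));
        return 0;
    }

With these numbers the thousand small messages take roughly twice as long as the single large one, because each small message pays the full latency.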

Communications (Visibility) Message Passing Model: communications are explicit (under the control of the programmer). Data Parallel Model: communications occur transparently to the programmer, usually on distributed memory architectures.

Communications (Synchronous vs. Asynchronous) Synchronous communication requires some type of “handshaking” between the tasks that are sharing data. Synchronous: blocking communications. Asynchronous communication allows tasks to transfer data independently from one another. Asynchronous: non-blocking communications. The greatest benefit of asynchronous communication is the ability to interleave computation with communication.
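A minimal MPI sketch in C (illustrative only, not the course’s demo code) contrasting the two styles: a blocking send versus a non-blocking send that lets the task compute while the message is in flight.

    #include <mpi.h>

    void send_blocking(double *buf, int n, int dest) {
        /* Blocking: does not return until buf is safe to reuse. */
        MPI_Send(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    void send_nonblocking(double *buf, int n, int dest) {
        /* Non-blocking: returns immediately, so computation can overlap
           the transfer; buf must not be modified until the wait completes. */
        MPI_Request req;
        MPI_Isend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
        /* ... do independent computation here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }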

Communications (Scope) Knowing which tasks must communicate with each other is critical during the design stage of a parallel code. Both scopes can be implemented synchronously or asynchronously: – Point-to-point: two tasks, a sender/producer of data and a receiver/consumer – Collective: data sharing among more than two tasks

Communications (Scope-Collective)
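Below, a minimal C sketch of the two scopes using MPI (an assumed illustration, not from the slides): a point-to-point send/receive pair involving exactly two tasks, and a collective broadcast involving every task in the communicator.

    #include <mpi.h>

    void scope_examples(int rank, double *data, int n) {
        /* Point-to-point: only ranks 0 and 1 participate. */
        if (rank == 0)
            MPI_Send(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(data, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Collective: every task in MPI_COMM_WORLD participates. */
        MPI_Bcast(data, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }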

Efficiency of Communications Very often, the programmer has a choice with regard to factors that affect communications performance. Which implementation of a given model should be used? (e.g. one MPI implementation may be faster on a given hardware platform than another) What type of communication operations should be used? (e.g. asynchronous communication operations can improve overall program performance) Network media: some platforms may offer more than one network for communications. Which one is best?

Synchronization (Types) Barrier – All tasks are involved – Each task performs its work; when the last task reaches the barrier, all tasks are synchronized Lock / semaphore – Typically used to serialize access to global data or a section of code; a task must wait to use the code Synchronous communication operations – Involve only those tasks executing a communication operation (handshaking)
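An illustrative C/OpenMP sketch (not from the slides) of two of these mechanisms: a critical section that serializes access to shared data, and a barrier that no thread passes until all threads have arrived.

    #include <omp.h>

    int counter = 0;   /* shared global data */

    void synchronized_update(void) {
        #pragma omp parallel
        {
            /* Lock-like serialization: one thread at a time updates counter. */
            #pragma omp critical
            counter++;

            /* Barrier: all threads wait here until every thread arrives. */
            #pragma omp barrier

            /* All threads now see the final value of counter. */
        }
    }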

Data Dependencies A dependence exists between program statements when the order of statement execution affects the results of the program. A data dependence results from multiple uses of the same location(s) in storage by different tasks. Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.

Data Dependencies Loop-carried data dependence (the most important case):

    DO J = MYSTART, MYEND
       A(J) = A(J-1) * 2.0
    END DO

The value of A(J-1) must be computed before the value of A(J); therefore A(J) exhibits a data dependence on A(J-1), and parallelism is inhibited. If task 2 has A(J) and task 1 has A(J-1), computing the correct value of A(J) necessitates that (1) task 1 calculate A(J-1) and (2) task 2 then get that value. Loop-independent data dependence:

    task 1        task 2
    ------        ------
    X = 2         X = 4
    Y = X**2      Y = X**3

As with the previous example, parallelism is inhibited: the value of Y depends on when (or whether) the value of X is communicated between the tasks on a distributed memory architecture, or on which task stores X last on a shared memory architecture.

Data Dependencies How to handle data dependencies: – Distributed memory architectures: communicate required data at synchronization points – Shared memory architectures: synchronize read/write operations between tasks

Load Balancing Refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time. It can be considered a minimization of task idle time. Important for performance reasons.

Load Balancing How to achieve – Equally partition the work each task receives – Use dynamic work assignment

How to Achieve (Load Balancing) Equally partition the work each task receives – For array/matrix operations where each task performs similar work, evenly distribute the data set among the tasks. – For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks.

How to Achieve (Load Balancing) Use dynamic work assignment – When the amount of work each task will perform is intentionally variable, or is unable to be predicted, it may be helpful to use a scheduler / task pool approach: as each task finishes its work, it queues to get a new piece of work. – It may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code. Sparse arrays: some tasks have mostly zeros to work on. Adaptive grid methods: some tasks need to refine their mesh while others do not.
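A minimal C/OpenMP sketch of dynamic work assignment (illustrative; process_row is a made-up function whose cost varies per row): idle threads pull small chunks of iterations from a shared pool instead of receiving a fixed block up front.

    #include <omp.h>

    /* Hypothetical per-row work whose cost varies from row to row. */
    double process_row(int i) {
        double s = 0.0;
        for (int k = 0; k < (i % 100) * 1000; k++)
            s += k * 1e-9;
        return s;
    }

    double total_work(int nrows) {
        double sum = 0.0;
        /* schedule(dynamic, 8): finished threads grab the next 8 iterations,
           balancing the uneven per-row costs automatically. */
        #pragma omp parallel for schedule(dynamic, 8) reduction(+:sum)
        for (int i = 0; i < nrows; i++)
            sum += process_row(i);
        return sum;
    }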

How to Achieve (Load Balancing)

Granularity (Computation / Communication Ratio) Granularity is a qualitative measure of the ratio of computation to communication Periods of computation are typically separated from periods of communication by synchronization events Two types – Fine-grain Parallelism – Coarse-grain Parallelism

Granularity (Fine-grain Parallelism) Relatively small amounts of computational work are done between communication events Low computation to communication ratio Implies high communication overhead If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation

Granularity (Coarse-grain Parallelism) Relatively large amounts of computational work are done between communication/synchronization events High computation to communication ratio Implies more opportunity for performance increase Harder to load balance efficiently

Granularity (What Is Best?) The most efficient granularity depends on the algorithm and the hardware environment in which it runs. In most cases the overhead associated with communication and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity. Fine-grain parallelism can help reduce overheads due to load imbalance, since it facilitates load balancing.

I/O I/O operations are inhibitors to parallelism. Parallel I/O systems may be immature or not available for all platforms. If all of the tasks see the same file space, write operations can result in file overwriting. Read operations can be affected by the file server’s ability to handle multiple read requests at the same time. I/O over networks can cause bottlenecks or even crash file servers.

Amdahl’s Law States that “potential program speedup is defined by the fraction of code (P) that can be parallelized”: Speedup = 1 / (1 – P) If P = 0, the speedup is 1 (no code parallelized). If P = 1, the speedup is in theory infinite (all code parallelized). If P = 0.5, the speedup is 2 (50% of the code parallelized), meaning the code will run twice as fast.

Amdahl’s Law Introducing the number of processors performing the parallel fraction of the work: Speedup = 1 / ((P / N) + S) where P = parallel fraction, N = number of processors, and S = serial fraction (S = 1 – P).
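A short C sketch (illustrative values only, not from the slides) that evaluates the formula and shows how the speedup saturates toward 1/S as N grows:

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (P/N + S), with S = 1 - P. */
    double amdahl(double P, int N) {
        return 1.0 / (P / N + (1.0 - P));
    }

    int main(void) {
        double P = 0.95;                 /* assume 95% of the work parallelizes */
        int procs[] = {2, 8, 64, 1024};
        for (int i = 0; i < 4; i++)
            printf("N = %4d  speedup = %.2f\n", procs[i], amdahl(P, procs[i]));
        /* As N grows, the speedup approaches 1/S = 20 for P = 0.95. */
        return 0;
    }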

Complexity Parallel applications are much more complex than corresponding serial applications. Cost of complexity is measured in programmer time in every aspect of the software development cycle – Design, Coding, Debugging, Tuning, Maintenance

Portability There is standardization of some APIs, such as MPI, but even so: – Implementations differ in a number of details, sometimes requiring code modifications – Hardware architectures can affect portability – Operating systems can play a key role in code portability issues – All of the portability issues associated with serial programs also apply to parallel programs

Resource Requirements The main goal of parallel programming is to decrease execution wall-clock time, but more CPU time is required; e.g. a parallel code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time. The amount of memory required can be greater for parallel codes. For short-running parallel programs, performance can actually decrease, because of the overhead of setting up the parallel environment, task creation/termination, and communication.

Scalability The ability of a parallel program to scale is the result of a number of interrelated factors; simply adding more machines is rarely the answer. At some point, adding more resources causes performance to decrease. Hardware factors play a significant role in scalability: – Communications network bandwidth – Amount of memory available on any machine Parallel support libraries and subsystems can also limit scalability.

References Author: Blaise Barney, Livermore Computing. A search on the WWW for “parallel programming” or “parallel computing” will yield a wide variety of information. “Designing and Building Parallel Programs”, Ian Foster. “Introduction to Parallel Computing”, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar.

Question Mention 5 communication factors to be considered when you are designing a parallel program: – Cost of communications – Latency vs. bandwidth – Visibility – Synchronous vs. asynchronous – Scope