DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 3: Programming and Performance Analysis of Parallel Computers Dr. Nor Asilah Wati Abdul Hamid Room 2.15 Ext : 6532 FSKTM, UPM

Programming Models for Parallel Computing First look at an example of parallelizing a real-world task, taken from Solving Problems on Concurrent Processors by Fox et al. Hadrian's Wall was built by the ancient Romans to keep the marauding Scottish barbarians out of Roman England. It was originally 120 km long and 2 meters high. How would you build such a huge structure in the shortest possible time? Clearly the sequential approach (a single bricklayer building the entire wall) would be too slow. We can get modest task parallelism by having different specialist workers work concurrently:  making the bricks  delivering the bricks  laying the bricks But unless we can replicate these workers, this only gives us a threefold speedup of the process.

Data Parallelism To really speed up completion of the task, need many workers laying the bricks concurrently. In general there are two ways to do this:  pipelining (vectorization)  replication (parallelization) Concurrent execution of a task requires assigning different sections of the problem (processes and data) to different processors. We will concentrate on data parallelism, where the processors work concurrently on their own section of the whole data set, or domain. Splitting up the domain between processors is known as domain decomposition. Each processor has its own sub-domain, or grain, to work on. Parallelism may be described as fine-grained (lots of very small domains) or coarse-grained (a smaller number of larger domains). Deciding how the domain decomposition is done is a key issue in implementing efficient parallel processing. For the most efficient (least time) execution of the task, need to:  minimize communication of data between processors  distribute workload equally among the processors (known as load balancing)
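As an illustration of domain decomposition in code, here is a minimal sketch (not part of the original slides) in C with MPI: each rank computes the bounds of its own contiguous sub-domain under a 1D block decomposition. The problem size N and the remainder-handling scheme are illustrative assumptions.

/* Minimal sketch: 1D block domain decomposition with MPI.
   Each rank computes the bounds of its own contiguous sub-domain. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const long N = 1000000;          /* total number of grid points (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Block decomposition: the first (N % nprocs) ranks get one extra point. */
    long base  = N / nprocs;
    long extra = N % nprocs;
    long lo    = rank * base + (rank < extra ? rank : extra);
    long count = base + (rank < extra ? 1 : 0);

    printf("rank %d owns indices [%ld, %ld)\n", rank, lo, lo + count);

    MPI_Finalize();
    return 0;
}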

Vectorization (Pipelining) One possible approach is to allocate a different bricklayer to each row of the wall, i.e. a horizontal decomposition of the problem domain. This is a pipelined approach - each bricklayer has to wait until the row underneath them has been started, so there is some inherent inefficiency. Once all rows have been started (the pipeline is full) all the bricklayers (processors) are working efficiently at the same time, until the end of the task, when there is some overhead (idle workers) while the upper levels are completed (the pipeline is flushed).
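The fill and drain overhead of the pipeline can be estimated with a simple model. The sketch below (illustrative numbers, not from the slides) assumes each bricklayer lays one brick per time step and each row must stay a fixed offset behind the row below it; it compares the pipelined completion time with the single-worker time.

/* Illustrative pipeline model: R rows of C bricks, one bricklayer per row.
   A row may start only `offset` bricks behind the row below it.
   The numbers are for illustration only. */
#include <stdio.h>

int main(void)
{
    int R = 4, C = 1000, offset = 2;

    long sequential = (long)R * C;                 /* one worker lays every brick        */
    long pipelined  = (long)(R - 1) * offset + C;  /* last row starts after (R-1)*offset */
    double speedup  = (double)sequential / pipelined;

    printf("sequential steps = %ld, pipelined steps = %ld, speedup = %.2f (ideal %d)\n",
           sequential, pipelined, speedup, R);
    return 0;
}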

Parallelization (Replication) Another approach is to do a vertical decomposition of the problem domain, so each bricklayer gets a vertical section of the wall to complete. In this case, the workers must communicate and synchronize their actions at the edges where the sub-domains meet. In general this communication and synchronization will incur some overhead, so there is some inefficiency. However each worker has an inner section of wall within their sub-domain that is completely independent of the others, which they can build just as efficiently as if there were no other workers. As long as the time taken to build this inner section is much longer than the time taken up by the communication and synchronization overhead, then the parallelization will be efficient and give good speedup over using a single worker.
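A sketch of that edge communication in C with MPI, assuming a one-dimensional (vertical) decomposition in which each rank exchanges a boundary strip with its left and right neighbours; the buffer sizes and neighbour layout are illustrative.

/* Minimal halo-exchange sketch for a 1D (vertical) decomposition.
   Each rank swaps its boundary strip with its left/right neighbours,
   then works on the independent interior. Sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;                      /* local boundary strip size (illustrative) */
    double *left_edge  = calloc(n, sizeof(double));
    double *right_edge = calloc(n, sizeof(double));
    double *left_halo  = calloc(n, sizeof(double));
    double *right_halo = calloc(n, sizeof(double));

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange boundary data with both neighbours (synchronization point). */
    MPI_Sendrecv(left_edge,  n, MPI_DOUBLE, left,  0,
                 right_halo, n, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(right_edge, n, MPI_DOUBLE, right, 1,
                 left_halo,  n, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... update interior points, which need no communication ... */

    free(left_edge); free(right_edge); free(left_halo); free(right_halo);
    MPI_Finalize();
    return 0;
}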

Parallel I/O For large tasks with lots of data, need efficient means to pass the appropriate data to each processor. In building a wall, need to keep each bricklayer supplied with bricks. Simple approach has a single host processor connected to outside network and handling all I/O. Passes data to other processors through internal comms network of the machine. Host processor is a sequential bottleneck for I/O. Bandwidth is limited to a single network connection. Better approach is to do I/O in parallel, so each node (or group of nodes) has a direct I/O channel to a disk array.
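One common way to avoid funnelling all data through a host processor is MPI-IO, where every rank writes its own block of a shared file directly. The sketch below uses standard MPI-IO calls, but the file name, element counts and offsets are illustrative assumptions.

/* Sketch: each rank writes its own block of a shared file with MPI-IO,
   avoiding a single host processor as an I/O bottleneck. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1000;                 /* local elements per rank (illustrative) */
    double buf[1000];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < n; i++) buf[i] = rank;   /* dummy local data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset, so I/O proceeds in parallel. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}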

Domain Decompositions Some standard domain decompositions of a regular 2D grid (array) include: BLOCK - contiguous chunks of rows or columns of data on each processor. BLOCK-BLOCK - block decomposition in both dimensions. CYCLIC - data is assigned to processors like cards dealt to poker players, so neighbouring points are on different processors. This can be good for load balancing applications with varying workloads that have certain types of communication, e.g. very little, or a lot (global sums or all-to-all), or strided. BLOCK-CYCLIC - a block decomposition in one dimension, cyclic in the other. SCATTERED - points are scattered randomly across processors. This can be good for load balancing applications with little (or lots of) communication. The human brain seems to work this way - neighboring sections may control widely separated parts of the body.
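These mappings boil down to a simple ownership function from array index to processor. The sketch below (illustrative sizes, not from the slides) prints which processor owns each row of a 1D array under BLOCK, CYCLIC and BLOCK-CYCLIC decompositions.

/* Sketch: which processor owns row i of an N-row array under three
   common 1D decompositions. P, N and the block size b are illustrative. */
#include <stdio.h>

int owner_block(int i, int N, int P)        { return i / ((N + P - 1) / P); }
int owner_cyclic(int i, int P)              { return i % P; }
int owner_block_cyclic(int i, int b, int P) { return (i / b) % P; }

int main(void)
{
    int N = 16, P = 4, b = 2;
    printf(" i  block  cyclic  block-cyclic(b=%d)\n", b);
    for (int i = 0; i < N; i++)
        printf("%2d  %5d  %6d  %12d\n",
               i, owner_block(i, N, P), owner_cyclic(i, P), owner_block_cyclic(i, b, P));
    return 0;
}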

Domain Decompositions Illustrated The squares represent the 2D data array, the colours represent the processors where the data elements are stored. [Figure: four panels showing the BLOCK, BLOCK-BLOCK, CYCLIC and BLOCK-CYCLIC decompositions.]

Static Load Balancing For maximum efficiency, domain decomposition should give equal work to each processor. In building the wall, can just give each bricklayer an equal length segment. But things can become much more complicated:  What if some bricklayers are faster than others? (this is like an inhomogeneous cluster of different workstations)  What if there are guard towers every few hundred meters, which require more work to construct? (in some applications, more work is required in certain parts of the domain) If we know in advance  1. the relative speed of the processors, and  2. the relative amount of processing required for each part of the problem then we can do a domain decomposition that takes this into account, so that different processors may have different sized domains, but the time to process them will be about the same. This is static load balancing, and can be done at compile-time. For some applications, maintaining load balance while simultaneously minimising communication can be a very difficult optimisation problem.
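A minimal sketch of such a speed-weighted static decomposition (the relative speeds and problem size are assumed for illustration): each processor's share of the work is made proportional to its speed, so all processors should finish at about the same time.

/* Sketch: static load balancing for heterogeneous processors.
   Each processor gets a share of the N work items proportional to its
   relative speed. The speeds below are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double speed[] = {1.0, 1.0, 2.0, 0.5};   /* relative speeds (assumed) */
    int P = 4, N = 900;

    double total = 0.0;
    for (int p = 0; p < P; p++) total += speed[p];

    int assigned = 0;
    for (int p = 0; p < P; p++) {
        /* the last processor takes the remainder so all N items are covered */
        int share = (p == P - 1) ? N - assigned : (int)(N * speed[p] / total + 0.5);
        printf("processor %d gets %d items\n", p, share);
        assigned += share;
    }
    return 0;
}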

Irregular Domain Decomposition In this figure the airflow over an aeroplane wing is modeled on an irregular triangulated mesh. The grid is finer in areas where there is the most change in the airflow (e.g. turbulent regions) and coarsest where the flow is more regular (laminar). The domain is distributed among processors given by different colours.

Dynamic Load Balancing In some cases we do not know in advance one (or both) of: the effective performance of the processors - may be sharing the processors with other applications, so the load and available CPU may vary the amount of work required for each part of the domain - many applications are adaptive or dynamic, and the workload is only known at runtime (e.g. dynamic irregular mesh for CFD, or varying convergence rates for PDE solvers in different sections of a regular grid) In this case we need to dynamically change the domain decomposition by periodically repartitioning the data between processors. This is dynamic load balancing, and it can involve substantial overheads in: Figuring out how to best repartition the data - may need to use a fast method that gives a good (but not optimal) domain decomposition, to reduce computation. Moving the data between processors - could restrict this to local (neighbouring processor) moves to reduce communication. Usually repartition as infrequently as possible (e.g. every few iterations instead of every iteration). There is a tradeoff between performance improvement and repartitioning overhead.
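A hedged sketch of the repartitioning decision (the check interval, the imbalance threshold and the repartition() routine are hypothetical placeholders): each rank measures its compute time, the imbalance is checked every few iterations with a reduction, and data is migrated only when the slowest rank falls well behind the average.

/* Sketch: deciding when to repartition. Every few iterations each rank
   reports its measured compute time; if the imbalance exceeds a threshold,
   an application-specific repartitioning routine is invoked. */
#include <mpi.h>

void repartition(void) { /* application-specific data migration (hypothetical) */ }

void maybe_rebalance(double my_time, int iter)
{
    const int    interval  = 10;     /* check every 10 iterations (assumed) */
    const double threshold = 1.2;    /* tolerate 20% imbalance (assumed)    */

    if (iter % interval != 0) return;

    double t_max, t_avg;
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Allreduce(&my_time, &t_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce(&my_time, &t_avg, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t_avg /= nprocs;

    /* Repartition only when the slowest rank is far behind the average,
       trading repartitioning overhead against the expected improvement. */
    if (t_max > threshold * t_avg)
        repartition();
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int iter = 1; iter <= 30; iter++) {
        double my_time = 1.0;        /* stand-in for a measured iteration time */
        maybe_rebalance(my_time, iter);
    }
    MPI_Finalize();
    return 0;
}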

Causes of Inefficiency in Parallel Programs Communication Overhead In some cases communication can be overlapped with computation, i.e. needed data can be prefetched at the same time as useful computation is being done. But often processors must wait for data, causing inefficiency (like a cache miss on a sequential machine, but worse). Load Imbalance Overhead Any load imbalance will cause some processors to be idle some of the time. In some problems the domain will change during the computation, requiring expensive dynamic load balancing. Algorithmic Overhead The best algorithm to solve a problem in parallel is often slightly different (or in some cases very different) to the best sequential algorithm. Any excess cycles caused by the difference in algorithm (more operations or more iterations) gives an algorithmic overhead. Note that to calculate speedup honestly, should not use the time for the parallel algorithm on 1 processor, but the time for the best sequential algorithm. Sequential Overhead The program may have parts that are not parallelizable, so that each processor replicates the same calculation.
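As a sketch of overlapping communication with computation (buffer sizes and neighbour layout are illustrative): nonblocking MPI calls are posted first, the interior work that does not need the incoming data is done while the messages are in flight, and the wait happens only before the boundary work.

/* Sketch: hiding communication overhead by overlapping it with computation.
   Nonblocking sends/receives are posted first, interior work (which does not
   need the halo) is done while the messages are in flight, and MPI_Waitall
   is called only before the boundary work. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;                          /* boundary strip size (illustrative) */
    double *send_edge = calloc(n, sizeof(double));
    double *recv_halo = calloc(n, sizeof(double));
    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    MPI_Request req[2];
    MPI_Irecv(recv_halo, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(send_edge, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* ... compute on interior points that do not depend on recv_halo ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* ... now compute on boundary points that need recv_halo ... */

    free(send_edge);
    free(recv_halo);
    MPI_Finalize();
    return 0;
}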

Parallel Speedup If N workers or processors are used, expect to be able to finish the task N times faster. Speedup = (Time on 1 processor) / (Time on N processors). Speedup will (usually) be at most N on N processors. (Question: How can it be more?) N.B. Usually better to plot speedup vs number of processors, rather than time taken (1/speedup) vs #procs, since it is much harder to judge deviation from a 1/x curve than from a straight line. Any problem of fixed size will have a maximum speedup, beyond which adding more processors will not reduce (and will usually increase) the time taken. [Figure: speedup vs number of processors, showing linear (perfect) speedup and actual speedup.]
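For example (illustrative numbers): if the best sequential program takes 100 s and the parallel version takes 16 s on 8 processors, the speedup is 100 / 16 = 6.25 and the parallel efficiency is 6.25 / 8, or about 78%, well short of the linear speedup of 8.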

Superlinear Speedup Answer to the previous question: It is possible to get speedups greater than N on N processors if the parallel algorithm is very efficient (no load imbalance or algorithmic overhead, and any communication is overlapped with computation), and if splitting the data among processors allows a greater proportion of the data to fit into cache memory on each processor, resulting in faster sequential performance of the program on each processor. The same idea applies to programs that would be out-of-core on 1 processor (i.e. require costly paging to disk) but can have all data in core memory if they are spread over multiple processors. Some parallel implementations of tree search applications (e.g. branch and bound optimisation algorithms) can also have superlinear speedup, by exploring multiple branches in parallel, allowing better pruning and thus finding the solution faster. However the same effect can be achieved by running multiple processes (or threads) on a single processor.