Parallel Programming Sathish S. Vadhiyar Course Web Page:

Motivation for Parallel Programming
- Faster execution time, exploiting non-dependencies between regions of code
- Presents a level of modularity
- Resource constraints: large databases
- Certain classes of algorithms lend themselves naturally to parallelism
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade: ~40%; memory access time improvement in the past decade: ~10%
- Grand challenge problems (more later)

Challenges / Problems in Parallel Algorithms
- Building efficient algorithms
- Avoiding overheads: communication delay, idling, synchronization

Challenges
[Figure: timelines of two processes P0 and P1 showing computation, communication, synchronization, and idle time]

How do we evaluate a parallel program?
- Execution time, T_p
- Speedup, S: S(p, n) = T(1, n) / T(p, n)
  - Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
- Efficiency, E: E(p, n) = S(p, n) / p
  - Usually E(p, n) < 1; sometimes greater than 1
- Scalability: limitations in parallel computing, relation to n and p
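As a worked illustration (the timings are assumed for the example, not taken from the course):

```latex
T(1, n) = 100\,\mathrm{s},\quad T(8, n) = 16\,\mathrm{s}
\;\Rightarrow\;
S(8, n) = \frac{100}{16} = 6.25,\qquad
E(8, n) = \frac{6.25}{8} \approx 0.78
```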

Speedups and efficiency
[Figure: speedup S vs. number of processors p (ideal vs. practical), and efficiency E vs. p]

Limitations on speedup – Amdahl’s law
- Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- The overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the improvement due to the enhancement.
- It places a limit on the speedup due to parallelism.
- Speedup = 1 / (f_s + f_p/P)

Amdahl’s law illustration (Courtesy:)
S = 1 / (s + (1 - s)/p), where s is the serial fraction of the computation and p the number of processors.

Amdahl’s law analysis
[Table: speedup as a function of the fraction f for P = 1, 4, 8, 16, ... processors]
For the same fraction, the speedup numbers fall increasingly short of the processor count. Thus Amdahl’s law is a bit depressing for parallel programming. In practice, the parallel portion of the work has to be large enough to match a given number of processors.
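As a numerical illustration (the serial fraction 0.05 is assumed for the example, not taken from the table):

```latex
f_s = 0.05:\qquad
S(16) = \frac{1}{0.05 + \frac{0.95}{16}} \approx 9.1,
\qquad
\lim_{P\to\infty} S(P) = \frac{1}{f_s} = 20
```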

Gustafson’s Law
- Amdahl’s law: keep the parallel work fixed
- Gustafson’s law: keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel/sequential work) to match that computation time
- For a particular number of processors, find the problem size for which the parallel time equals the constant time
- For that problem size, find the sequential time and the corresponding speedup
- The resulting speedup is called scaled speedup
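A standard statement of the scaled speedup (not spelled out on the slide), with s the serial fraction of the time measured on the P-processor run:

```latex
S_{\text{scaled}}(P) \;=\; s + (1 - s)\,P \;=\; P - (P - 1)\,s
```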

Metrics (contd.)
[Table 5.1: Efficiency as a function of n and p, for P = 1, 4, 8, 16, ...]

Scalability
- Efficiency decreases with increasing P and increases with increasing N
- Scalability: how effectively the parallel algorithm can use an increasing number of processors
- How the amount of computation performed must scale with P to keep E constant
- This function of computation in terms of P is called the isoefficiency function
- An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable

Scalability Analysis – Finite Difference algorithm with 1D decomposition
- For constant efficiency, a function of P, when substituted for N, must satisfy the relation below for increasing P and constant E.
- Hence the isoefficiency function is O(P^2), since the computation is O(N^2).
- This can be satisfied with N = P, except for small P.
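A sketch of that relation in its usual form, assuming an N x N grid with O(N^2) computation per step and O(N) halo communication per process (so O(NP) total overhead) under the 1-D decomposition:

```latex
% Constant efficiency E requires the useful work to grow with the total overhead:
T_{\text{comp}} \;\ge\; \frac{E}{1-E}\, T_{o}(N, P)
\;\;\Longrightarrow\;\;
N^{2} \;\propto\; N P
\;\;\Longrightarrow\;\;
N \propto P,
\qquad \text{isoefficiency} = O(P^{2})
```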

Scalability Analysis – Finite Difference algorithm with 2D decomposition
- Here the isoefficiency function is O(P).
- This can be satisfied with N = sqrt(P).
- The 2D algorithm is therefore more scalable than the 1D algorithm.

Parallel Algorithm Design

Steps
- Decomposition: splitting the problem into tasks or modules
- Mapping: assigning tasks to processors
- Mapping has contradictory objectives: to minimize idle times and to reduce communication

Mapping
- Static mapping
  - Mapping based on data partitioning
    - Applicable to dense matrix computations
    - Block distribution
    - Block-cyclic distribution (see the sketch after this list)
  - Graph-partitioning-based mapping
    - Applicable to sparse matrix computations
  - Mapping based on task partitioning
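A minimal sketch of the two data distributions, assuming a 1-D array of n elements distributed over p processes (the function names and block size are illustrative, not from the course):

```c
#include <stdio.h>

/* Block distribution: owner of element i when n elements are split
   into p contiguous blocks of size ceil(n/p). */
int block_owner(int i, int n, int p) {
    int b = (n + p - 1) / p;          /* block size */
    return i / b;
}

/* Block-cyclic distribution: blocks of size bs are dealt out to
   processes in round-robin fashion. */
int block_cyclic_owner(int i, int bs, int p) {
    return (i / bs) % p;
}

int main(void) {
    int n = 16, p = 4, bs = 2;
    for (int i = 0; i < n; i++)
        printf("elem %2d: block -> P%d, block-cyclic -> P%d\n",
               i, block_owner(i, n, p), block_cyclic_owner(i, bs, p));
    return 0;
}
```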

Based on Task Partitioning
- Based on the task dependency graph
- In general, the problem is NP-complete

Mapping
- Dynamic mapping
  - A process (or global memory) holds a pool of tasks
  - Some tasks are distributed to all processes initially
  - Once a process completes its tasks, it asks the coordinator process for more tasks
  - Referred to as self-scheduling or work-stealing (a minimal sketch follows below)
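A minimal self-scheduling (master-worker) sketch in MPI, assuming independent tasks identified by an index and a hypothetical do_task() body; it illustrates the idea only and is not code from the course:

```c
#include <mpi.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

/* Hypothetical task body: stands in for whatever an independent task does. */
static double do_task(int id) { return (double)id * id; }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* coordinator holding the task pool */
        int next = 0, active = 0;
        double result;
        MPI_Status st;
        for (int w = 1; w < size; w++) {  /* distribute some tasks initially */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {              /* hand out more work on request */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                              /* worker: ask-and-compute loop */
        int task;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = do_task(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```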

Interaction Overheads
- In spite of the best efforts in mapping, there can be interaction overheads
- Caused by frequent communication, exchange of large volumes of data, interaction with the farthest processors, etc.
- Some techniques can be used to minimize these interactions

Parallel Algorithm Design - Containing Interaction Overheads
- Maximizing data locality
  - Minimizing the volume of data exchange
    - Using higher-dimensional mapping
    - Not communicating intermediate results
  - Minimizing the frequency of interactions
- Minimizing contention and hot spots
  - Do not use the same communication pattern with the other processes in all the processes

Parallel Algorithm Design - Containing Interaction Overheads
- Overlapping computations with interactions (a sketch follows below)
  - Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  - Initiate the communication needed for type 1; while it is in progress, perform the type 2 computations
- Overlapping interactions with other interactions
- Replicating data or computations
  - Balancing the extra computation or storage cost against the gain from reduced communication
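A minimal sketch of the overlap idea using non-blocking MPI calls for a 1-D halo exchange with neighbouring ranks (array size and the placeholder computations are illustrative assumptions):

```c
#include <mpi.h>

#define N 1024

/* Placeholder computations: interior cells need no remote data (type 2),
   boundary cells depend on the received halo values (type 1). */
static void compute_interior(double *u, int n) {
    for (int i = 1; i < n - 1; i++) u[i] *= 0.5;
}
static void compute_boundary(double *u, int n, double lh, double rh) {
    u[0]     = 0.5 * (u[0] + lh);
    u[n - 1] = 0.5 * (u[n - 1] + rh);
}

static void overlapped_step(double *u, int n, int rank, int size) {
    double left_halo = 0.0, right_halo = 0.0;
    MPI_Request reqs[4];
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* 1. Initiate the communication that the type-1 work depends on. */
    MPI_Irecv(&left_halo,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&right_halo, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[0],       1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[n - 1],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* 2. While the messages are in flight, do the type-2 work. */
    compute_interior(u, n);

    /* 3. Wait for the halos, then finish the dependent type-1 work. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, n, left_halo, right_halo);
}

int main(int argc, char **argv) {
    int rank, size;
    static double u[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int i = 0; i < N; i++) u[i] = rank + i;
    overlapped_step(u, N, rank, size);
    MPI_Finalize();
    return 0;
}
```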

Parallel Algorithm Classification
- Types
- Models

Parallel Algorithm Types
- Divide and conquer
- Data partitioning / decomposition
- Pipelining

Divide-and-Conquer
- Recursive in structure
  - Divide the problem into sub-problems that are similar to the original but smaller in size
  - Conquer the sub-problems by solving them recursively; if they are small enough, solve them in a straightforward manner
  - Combine the solutions to create a solution to the original problem

Divide-and-Conquer Example: Merge Sort
- Problem: sort a sequence of n elements
- Divide the sequence into two subsequences of n/2 elements each
- Conquer: sort the two subsequences recursively using merge sort
- Combine: merge the two sorted subsequences to produce the sorted answer
(A task-parallel sketch follows below.)
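A compact sketch of how the recursive divide step maps onto parallel tasks, here using OpenMP tasks (the cut-off value and structure are illustrative, not from the course):

```c
#include <stdlib.h>
#include <string.h>
#include <omp.h>

/* Merge two sorted halves a[lo..mid) and a[mid..hi) using scratch space. */
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

/* Divide: spawn a task per half; Conquer: recurse; Combine: merge. */
static void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (hi - lo < 1024) {                 /* cut-off: sort small ranges serially */
        merge_sort(a, tmp, lo, mid);
        merge_sort(a, tmp, mid, hi);
    } else {
        #pragma omp task shared(a, tmp)
        merge_sort(a, tmp, lo, mid);
        #pragma omp task shared(a, tmp)
        merge_sort(a, tmp, mid, hi);
        #pragma omp taskwait              /* both halves must finish before merging */
    }
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    int n = 1 << 20;
    int *a = malloc(n * sizeof(int)), *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = rand();
    #pragma omp parallel
    #pragma omp single                    /* one thread starts the recursion */
    merge_sort(a, tmp, 0, n);
    free(a); free(tmp);
    return 0;
}
```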

Partitioning
1. Break the given problem into p independent subproblems of almost equal size
2. Solve the p subproblems concurrently
- Mostly involves splitting the input or output into non-overlapping pieces
- Example: matrix multiplication, where either the inputs (A or B) or the output (C) can be partitioned (see the sketch below)
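A minimal sketch of output (C) partitioning for matrix multiplication, where each thread computes a block of rows of C (OpenMP used for illustration; the sizes are arbitrary):

```c
#include <stdlib.h>
#include <omp.h>

#define N 512

/* Each thread is responsible for a contiguous block of rows of C,
   i.e. the output matrix is partitioned into non-overlapping pieces. */
void matmul(const double *A, const double *B, double *C, int n) {
    #pragma omp parallel for schedule(static)   /* static = block row partition */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

int main(void) {
    double *A = malloc(N * N * sizeof(double));
    double *B = malloc(N * N * sizeof(double));
    double *C = malloc(N * N * sizeof(double));
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    matmul(A, B, C, N);
    free(A); free(B); free(C);
    return 0;
}
```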

Pipelining
- Occurs, for example, in image processing applications where a number of images undergo a sequence of transformations.

Parallel Algorithm Models
- Data-parallel model: processes perform identical tasks on different data
- Task-parallel model: different processes perform different tasks on the same or different data, based on a task dependency graph
- Work-pool model: any task can be performed by any process; tasks are added to a work pool dynamically
- Pipeline model: a stream of data passes through a chain of processes (stream parallelism)

Parallel Program Classification
- Models
- Structure
- Paradigms

Parallel Program Models
- Single Program Multiple Data (SPMD): every process runs the same program (a minimal sketch follows below)
- Multiple Program Multiple Data (MPMD): different processes may run different programs
Courtesy:
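In the SPMD style each process runs the same executable and branches on its rank; a minimal MPI sketch (not taken from the course material):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same program on every process; behaviour differs by rank. */
    if (rank == 0)
        printf("coordinator: %d processes in total\n", size);
    else
        printf("worker %d reporting\n", rank);

    MPI_Finalize();
    return 0;
}
```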

Parallel Program Structure Types
- Master-worker / parameter sweep / task farming
- Embarrassingly/pleasingly parallel
- Pipeline / systolic / wavefront
- Tightly coupled
- Workflow
[Figures: process diagrams with P0-P4 illustrating these structures]

Programming Paradigms
- Shared memory model: threads, OpenMP
- Message passing model: MPI
- Data parallel model: HPF
Courtesy:

Parallel Architectures
- Classification
- Cache coherence in shared memory platforms
- Interconnection networks

Classification of Architectures – Flynn’s classification
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD): vector processors and processor arrays
  - Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600
Courtesy:

Classification of Architectures – Flynn’s classification
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD): most popular
  - IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy:

Classification of Architectures – Based on Memory
- Shared memory
  - Two types: UMA and NUMA
  [Figures: UMA and NUMA memory organizations]
  - Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy:

Classification of Architectures – Based on Memory
- Distributed memory
  [Figure: distributed-memory organization] Courtesy:
- Recently, multi-cores
- Yet another classification: MPPs, NOW (Berkeley), COW, computational Grids

Cache Coherence - for details, read the corresponding section of the book
Interconnection Networks - for details, read the corresponding section of the book

Cache Coherence in SMPs
[Figure: main memory and CPU0-CPU3, each with its own cache (cache0-cache3) holding a copy of cache line 'a']
- All processors read a variable 'x' residing in cache line 'a'
- Each processor updates 'x' at different points in time
- Challenge: maintaining a consistent view of the data
- Protocols: write-update, write-invalidate

Cache Coherence Protocols and Implementations
- Write-update: propagate the cache line to the other processors on every write by a processor
- Write-invalidate: each processor gets the updated cache line whenever it reads stale data
- Which is better?

Caches – False Sharing
[Figure: main memory holding array elements A0-A15; cache0 on CPU0 holds A0, A2, A4, ... and cache1 on CPU1 holds A1, A3, A5, ... from the same cache lines]
- Different processors update different parts of the same cache line
- This leads to ping-pong of cache lines between processors
- Remedy: modify the algorithm to change the stride (see the sketch below)
- The situation is better with update protocols than with invalidate protocols. Why?
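A minimal sketch of how false sharing arises and how padding (i.e. changing the effective stride) avoids it; the 64-byte line size and counter layout are assumptions for illustration:

```c
#include <stdio.h>
#include <omp.h>

#define NTHREADS 4
#define ITERS    10000000
#define LINE     64               /* assumed cache-line size in bytes */

/* Version 1: counters packed into one cache line -> false sharing.    */
long packed[NTHREADS];

/* Version 2: each counter padded to its own cache line -> no sharing. */
struct padded { long value; char pad[LINE - sizeof(long)]; };
struct padded separate[NTHREADS];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            packed[id]++;         /* neighbours' updates invalidate this line */
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            separate[id].value++; /* each thread owns a private cache line */
    }
    double t2 = omp_get_wtime();
    printf("packed: %.3f s, padded: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}
```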

Cache Coherence using Invalidate Protocols
- Three states associated with data items
  - Shared: a variable shared by two (or more) caches
  - Invalid: the state of a copy after another processor (say P0) has updated the data item
  - Dirty: the state of the data item in P0's cache
- Implementations
  - Snoopy
    - For bus-based architectures
    - Memory operations are propagated over the bus and snooped
  - Directory-based
    - Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
    - A central directory maintains the states of cache blocks and the associated processors
    - Implemented with presence bits

Interconnection Networks
- An interconnection network is defined by switches, links, and interfaces
  - Switches: provide mapping between input and output ports, buffering, routing, etc.
  - Interfaces: connect nodes to the network
- Network topologies
  - Static: point-to-point communication links among processing nodes
  - Dynamic: communication links are formed dynamically by switches

Interconnection Networks
- Static
  - Bus: SGI Challenge
  - Completely connected
  - Star
  - Linear array, ring (1-D torus)
  - Mesh: Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  - k-d mesh: d dimensions with k nodes in each dimension
  - Hypercube: a k-d mesh with k = 2 and d = log p; e.g., many MIMD machines
  - Tree: e.g., our campus network
- Dynamic: communication links are formed dynamically by switches
  - Crossbar: Cray X series; a non-blocking network
  - Multistage: SP2; a blocking network
- For more details and an evaluation of topologies, refer to the book

Evaluating Interconnection Topologies
- Diameter: the maximum distance between any two processing nodes
  - Fully connected: 1
  - Star: 2
  - Ring: p/2
  - Hypercube: log p
- Connectivity: the multiplicity of paths between two nodes; the minimum number of arcs that must be removed to break the network into two disconnected networks
  - Linear array: 1
  - Ring: 2
  - 2-D mesh: 2
  - 2-D mesh with wraparound: 4
  - d-dimensional hypercube: d

Evaluating Interconnection Topologies
- Bisection width: the minimum number of links that must be removed to partition the network into two equal halves
  - Ring: 2
  - p-node 2-D mesh: sqrt(p)
  - Tree: 1
  - Star: 1
  - Completely connected: p^2/4
  - Hypercube: p/2

Evaluating Interconnection Topologies
- Channel width: the number of bits that can be communicated simultaneously over a link, i.e., the number of physical wires between two nodes
- Channel rate: the performance (peak rate) of a single physical wire
- Channel bandwidth: channel rate times channel width
- Bisection bandwidth: the maximum volume of communication between the two halves of the network, i.e., bisection width times channel bandwidth

END