Presentation transcript:

San Diego Supercomputer Center / National Partnership for Advanced Computational Infrastructure (NPACI)
NPACI Parallel Computing Seminars, San Diego Supercomputer Center
Scalable Parallel Architectures and their Software

Slide 2 – Introduction
Overview of RISC CPUs, memory hierarchy
Parallel systems – general hardware layout (SMP, distributed, hybrid)
Communications networks for parallel systems
Parallel I/O
Operating systems concepts
Overview of parallel programming methodologies
– Distributed-memory
– Shared-memory
Hardware specifics of NPACI parallel machines
– IBM SP Blue Horizon
– New CPU architectures: IBM POWER4, Intel IA-64

Slide 3 – What is Parallel Computing?
Parallel computing: the use of multiple computers, processors, or processes working together on a common task.
– Each processor works on its section of the problem
– Processors are allowed to exchange information (data in local memory) with other processors
[Figure: grid of the problem to be solved in the x-y plane, divided into four areas; CPU #1 through CPU #4 each work on one area and exchange data at the boundaries]

Slide 4 – Why Parallel Computing?
Limits of single-CPU computing
– Available memory
– Performance, usually "time to solution"
Limits of vector computers, the main HPC alternative
– System cost, including maintenance
– Cost/MFlop
Parallel computing allows:
– Solving problems that don't fit on a single CPU
– Solving problems that can't be solved in a reasonable time on one CPU
We can run larger problems, at finer resolution, faster, and more cases

Slide 5 – Scalable Parallel Computer Systems
(Scalable) [ (CPUs) + (Memory) + (I/O) + (Interconnect) + (OS) ] = Scalable Parallel Computer System

Slide 6 – Scalable Parallel Computer Systems
Scalability: a parallel system is scalable if it is capable of providing enhanced resources to accommodate increasing performance and/or functionality
Resource scalability: scalability achieved by increasing machine size (# CPUs, memory, I/O, network, etc.)
Application scalability
– machine size
– problem size

Slide 7 – Shared and Distributed Memory Systems
Multiprocessor (shared memory): single address space; all processors have access to a pool of shared memory. Examples: SUN HPC, CRAY T90, NEC SX-6
– Methods of memory access: bus, crossbar
Multicomputer (distributed memory): each processor has its own local memory. Examples: CRAY T3E, IBM SP2, PC cluster
[Figures: shared-memory system with CPUs connected to MEMORY over a bus/crossbar; distributed-memory system with CPU+memory (M) pairs connected by a NETWORK]

Slide 8 – Hybrid (SMP Clusters) Systems
Hybrid architecture: processes share memory on-node, may/must use message-passing off-node, and may share off-node memory
Examples: IBM SP Blue Horizon, SGI Origin, Compaq AlphaServer
[Figure: several SMP nodes, each with CPUs and MEMORY joined by an on-node interconnect, connected to each other by a network]

Slide 9 – RISC-Based Computer Hardware Concepts
RISC CPUs are the most common CPUs in HPC; many design concepts transferred from vector CPUs to RISC to CISC
Multiple functional units
Pipelined instructions
Memory hierarchy
Instructions typically take one to several CPU clock cycles
– Clock cycles provide the time scale for measurement
Data transfers: memory-to-CPU, network, I/O, etc.

Slide 10 – Processor Related Terms
RISC: Reduced Instruction Set Computer
PIPELINE: technique in which multiple instructions are overlapped in execution
SUPERSCALAR: computer design feature in which multiple instructions can be executed per clock period
(Presenter note: the instruction set is just how each operation is processed; e.g., for x = y + 1: load y and 1, add them, put the result in x)

Slide 11 – 'Typical' RISC CPU
[Figure: CPU chip containing registers (r0, r1, r2, ...) and floating-point functional units (FP Add, FP Multiply, FP Divide, FP Multiply & Add); loads and stores move data between the registers and memory/cache]

Slide 12 – Functional Unit
[Figure: a fully segmented multiply pipeline computing A(I) = C(I)*D(I); operands C(I) and D(I) stream through the stages of the pipeline. Analogy: a chair-building "function unit" staffed by carpenters 1 through 5, each performing one stage]

Slide 13 – Dual Hardware Pipes
[Figure: two hardware pipelines computing A(I) = C(I)*D(I); the odd-indexed operands go through one pipe and the even-indexed operands C(I+1), D(I+1) through the other, producing A(I) and A(I+1) together]

Slide 14 – RISC Memory/Cache Related Terms
ICACHE: instruction cache
DCACHE (Level 1): data cache closest to the registers
SCACHE (Level 2): secondary data cache
– Data from SCACHE has to go through DCACHE to the registers
– SCACHE is larger than DCACHE
– Not all processors have an SCACHE
CACHE LINE: minimum transfer unit (usually in bytes) for moving data between different levels of the memory hierarchy
TLB: translation look-aside buffer; keeps addresses of pages (blocks of memory) in main memory that have been recently accessed
MEMORY BANDWIDTH: transfer rate (in MBytes/sec) between different levels of memory
MEMORY ACCESS TIME: time required (often measured in clock cycles) to bring data items from one level in memory to another
CACHE COHERENCY: mechanism for ensuring data consistency of shared variables across the memory hierarchy

Slide 15 – RISC CPU, Cache, and Memory: Basic Layout
[Figure: registers inside the CPU, then Level 1 cache, Level 2 cache, and main memory, arranged as a hierarchy]

Slide 17 – RISC Memory/Cache Related Terms (cont.)
Direct mapped cache: a block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.
[Figure: main memory blocks, each mapping to a single cache location]

Slide 18 – RISC Memory/Cache Related Terms (cont.)
Fully associative cache: a block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.
[Figure: main memory blocks mapping to any cache location]

Slide 19 – RISC Memory/Cache Related Terms (cont.)
Set-associative cache: the middle range of designs between direct mapped and fully associative caches is called set-associative. In an n-way set-associative cache, a block from main memory can go into n (n at least 2) locations in the cache.
[Figure: 2-way set-associative cache and main memory]
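As an illustration of the set-associative idea (not from the slides), the following C sketch shows how an address might be split into tag, set index, and line offset; the line size, set count, and way count are made-up parameters chosen only for the example.

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical cache geometry, for illustration only:
     64-byte lines, 128 sets, 4 ways (a 32 KB, 4-way set-associative cache). */
  #define LINE_SIZE  64u
  #define NUM_SETS  128u
  #define NUM_WAYS    4u

  /* For an n-way set-associative cache, an address selects one set;
     the block may then live in any of the n ways of that set. */
  static void decompose(uint64_t addr, uint64_t *tag, uint64_t *set, uint64_t *offset)
  {
      *offset = addr % LINE_SIZE;               /* byte within the cache line      */
      *set    = (addr / LINE_SIZE) % NUM_SETS;  /* which set the block maps to     */
      *tag    = addr / (LINE_SIZE * NUM_SETS);  /* identifies the block in the set */
  }

  int main(void)
  {
      uint64_t tag, set, offset;
      decompose(0x12345678u, &tag, &set, &offset);
      printf("tag=%llu set=%llu offset=%llu (block may occupy any of %u ways)\n",
             (unsigned long long)tag, (unsigned long long)set,
             (unsigned long long)offset, NUM_WAYS);
      return 0;
  }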

Slide 20 – RISC Memory/Cache Related Terms
The data cache was designed to allow programmers to take advantage of common data access patterns:
Spatial locality
– When an array element is referenced, its neighbors are likely to be referenced
– Cache lines are fetched together
– Work on consecutive data elements in the same cache line
Temporal locality
– When an array element is referenced, it is likely to be referenced again soon
– Arrange code so that data in cache is reused as often as possible
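To make the locality advice concrete, here is a small C sketch (not from the slides) contrasting a cache-friendly traversal with a cache-unfriendly one; the array size is an arbitrary illustration value.

  #include <stdio.h>
  #define N 1024

  static double a[N][N];

  /* Cache-friendly: C stores a[i][j] row by row, so walking j in the inner
     loop touches consecutive elements of the same cache line (spatial locality). */
  static double sum_row_major(void)
  {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];
      return s;
  }

  /* Cache-unfriendly: walking i in the inner loop strides through memory by
     N*sizeof(double) bytes, so nearly every access starts a new cache line. */
  static double sum_col_major(void)
  {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];
      return s;
  }

  int main(void)
  {
      printf("%f %f\n", sum_row_major(), sum_col_major());
      return 0;
  }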

Slide 21 – Typical RISC Floating-Point Operation Times: IBM POWER3-II
CPU clock speed: 375 MHz (cycle time ~ 3 ns)
[Table of instruction latencies in cycles, 32-bit and 64-bit: FP multiply or add: 3-4; FP multiply-add: 3-4; FP square root and FP divide: values not preserved in this transcript]

Slide 22 – Typical RISC Memory Access Times: IBM POWER3-II
[Table of access bandwidth (GB/s) and time (cycles) for: load register from L1, store register to L1, load/store L1 from/to L2, and load/store L1 from/to RAM; the numeric values were not preserved in this transcript]

Slide 23 – Single CPU Optimization
Optimization of the serial (single-CPU) version is very important
Want to parallelize the best serial version, where appropriate

Slide 24 – New CPUs in HPC
New CPU designs with new features
IBM POWER4
– U. Texas Regatta nodes – covered on Wednesday
Intel IA-64
– SDSC DTF TeraGrid PC Linux cluster

Slide 25 – Parallel Networks
The network's function is to transfer data from source to destination in support of the network transactions used to realize the supported programming model(s). Data transfer can be for message-passing and/or shared-memory operations.
Network terminology
Common parallel networks

Slide 26 – System Interconnect Topologies
Send information among CPUs through a network. The best choice would be a fully connected network, in which each processor has a direct link to every other processor. Such a network would be very expensive and difficult to scale (cost grows ~ N*N). Instead, processors are arranged in some variation of a mesh, torus, hypercube, etc.
[Figures: 3-D hypercube, 2-D mesh, 2-D torus]

Slide 27 – Network Terminology
Network latency: time taken to begin sending a message. Units are microseconds, milliseconds, etc. Smaller is better.
Network bandwidth: rate at which data is transferred from one point to another. Units are bytes/sec, Mbytes/sec, etc. Larger is better.
– May vary with data size
[The slide's table of latency/bandwidth figures for the IBM Blue Horizon was not preserved in this transcript]

Slide 28 – Network Terminology
Bus
– Shared data path
– Data requests require exclusive access
– Complexity ~ O(N)
– Not scalable: bandwidth ~ O(1)
Crossbar switch
– Non-blocking switching grid among network elements
– Bandwidth ~ O(N)
– Complexity ~ O(N*N)
Multistage interconnection network (MIN)
– Hierarchy of switching networks, e.g., an Omega network
– For N CPUs and N memory banks: complexity ~ O(ln(N))

Slide 29 – Network Terminology (Continued)
Diameter: maximum distance (in nodes) between any two processors
Connectivity: number of distinct paths between any two processors
Channel width: maximum number of bits that can be sent simultaneously over the link connecting two processors = number of physical wires in each link
Channel rate: peak rate at which data can be sent over a single physical wire
Channel bandwidth: peak rate at which data can be sent over a link = (channel rate) * (channel width)
Bisection width: minimum number of links that have to be removed to partition the network into two equal halves
Bisection bandwidth: maximum amount of data between any two halves of the network connecting equal numbers of CPUs = (bisection width) * (channel bandwidth)
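A small worked example with made-up numbers (not from the slides): if each link has channel width 8 bits and channel rate 500 Mbit/s per wire, the channel bandwidth is 8 * 500 Mbit/s = 4 Gbit/s = 0.5 GB/s; if the topology's bisection width is 16 links, the bisection bandwidth is 16 * 0.5 GB/s = 8 GB/s.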

Slide 30 – Communication Overhead
Time to send a message of M bytes, in simple form:
Tcomm = TL + M*Td + Tcontention
– TL = message latency
– Td = time per byte = 1 byte / bandwidth
– Tcontention takes into account other network traffic
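A minimal sketch of this cost model in C (not from the slides); the latency and bandwidth numbers are placeholders, not measurements of any particular machine.

  #include <stdio.h>

  /* Simple communication-cost model: Tcomm = TL + M*Td + Tcontention.
     TL is the startup latency (s), Td the per-byte time (s/byte, i.e. 1/bandwidth),
     and Tcontention an estimate of delay from other traffic. */
  static double t_comm(double latency_s, double bandwidth_bytes_per_s,
                       double msg_bytes, double contention_s)
  {
      double td = 1.0 / bandwidth_bytes_per_s;   /* time per byte */
      return latency_s + msg_bytes * td + contention_s;
  }

  int main(void)
  {
      /* Placeholder numbers: 20 microseconds latency, 350 MB/s bandwidth. */
      double t = t_comm(20e-6, 350e6, 1 << 20, 0.0);
      printf("Model time for a 1 MB message: %g s\n", t);
      return 0;
  }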

Slide 32 – Parallel I/O
I/O can be the limiting factor in a parallel application
I/O system properties: capacity, bandwidth, access time
Need support for parallel I/O in the programming system
Need underlying hardware and system support for parallel I/O
– IBM GPFS: low-level API for developing high-level parallel I/O functionality (MPI-IO, HDF5, etc.)

Slide 33 – Unix OS Concepts for Parallel Programming
Most operating systems used by parallel computers are Unix-based
Unix process (task)
– Executable code
– Instruction pointer
– Stack
– Logical registers
– Heap
– Private address space
– Task forking to create dependent processes costs thousands of clock cycles
Thread: a "lightweight process"
– Logical registers
– Stack
– Shared address space
– Hundreds of clock cycles to create/destroy/synchronize threads
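A small POSIX sketch in C (not from the slides) contrasting process creation with thread creation; error handling is omitted for brevity, and it assumes a Unix system with pthreads (compile with -pthread).

  #include <pthread.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Thread body: runs in the same address space as the creator. */
  static void *thread_body(void *arg)
  {
      (void)arg;
      printf("thread: shares the parent's address space\n");
      return NULL;
  }

  int main(void)
  {
      /* Process creation: fork() duplicates the process; the child gets its
         own private address space (relatively expensive). */
      pid_t pid = fork();
      if (pid == 0) {
          printf("child process: private address space\n");
          return 0;
      }
      waitpid(pid, NULL, 0);

      /* Thread creation: much cheaper; the thread shares the address space. */
      pthread_t tid;
      pthread_create(&tid, NULL, thread_body, NULL);
      pthread_join(tid, NULL);
      return 0;
  }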

Slide 34 – Parallel Computer Architectures (Flynn Taxonomy)
Control mechanism: SIMD vs. MIMD
Memory model: shared-memory, distributed-memory, or hybrid (SMP cluster)

Slide 35 – Hardware Architecture Models for Design of Parallel Programs
Sequential computers: the von Neumann model (RAM) is the universal computational model
Parallel computers: no single model exists
The model must be sufficiently general to encapsulate the hardware features of parallel systems
Programs designed from the model must execute efficiently on real parallel systems

Slide 36 – Designing and Building Parallel Applications
Donald Frederick, San Diego Supercomputer Center

Slide 37 – What is Parallel Computing?
Parallel computing: the use of multiple computers, processors, or processes concurrently working together on a common task.
– Each processor/process works on its section of the problem
– Processors/processes are allowed to exchange information (data in local memory) with other processors/processes
[Figure: the same problem-grid decomposition as slide 3, with CPU #1 through CPU #4 exchanging boundary data]

Slide 38 – Shared and Distributed Memory Systems
Multiprocessor (shared memory): single address space; processes have access to a pool of shared memory; single OS.
Multicomputer (distributed memory): each processor has its own local memory; processes usually do message passing to exchange data among processors; usually multiple copies of the OS.
[Figures: CPUs sharing MEMORY through an interconnect; CPU+memory (M) pairs connected by a NETWORK]

Slide 39 – Hybrid (SMP Clusters) System
Must/may use message-passing; single or multiple OS copies
Node-local operations are less costly than off-node operations
[Figure: SMP nodes (CPUs + MEMORY + on-node interconnect) connected by a network]

Slide 40 – Unix OS Concepts for Parallel Programming
(Recap of slide 33) Most operating systems used are Unix-based
Unix process (task): executable code, instruction pointer, stack, logical registers, heap, private address space; task forking to create dependent processes costs thousands of clock cycles
Thread ("lightweight process"): logical registers, stack, shared address space; hundreds of clock cycles to create/destroy/synchronize threads

Slide 41 – Generic Parallel Programming Models
Single Program Multiple Data stream (SPMD)
– Each CPU accesses the same object code
– Same application run on different data
– Data exchange may be handled explicitly/implicitly (message-passing or shared-memory)
– "Natural" model for SIMD machines
– Most commonly used generic parallel programming model
– Usually uses the process/task ID to differentiate work
– Focus of the remainder of this section
Multiple Program Multiple Data stream (MPMD)
– Each CPU accesses different object code
– Each CPU has only the data/instructions it needs
– "Natural" model for MIMD machines
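A minimal SPMD sketch in C with MPI (not from the slides): every rank runs the same executable and uses its rank ID to pick its portion of the work; the problem size is an arbitrary illustration value.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* SPMD: same program everywhere; the rank selects the data to work on.
         Here each rank sums its contiguous block of 0..N-1. */
      const long N = 1000000;
      long lo = rank * (N / nprocs);
      long hi = (rank == nprocs - 1) ? N : lo + N / nprocs;

      double local = 0.0, total = 0.0;
      for (long i = lo; i < hi; i++)
          local += (double)i;

      /* Explicit data exchange: combine the partial sums on rank 0. */
      MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("sum = %.0f\n", total);

      MPI_Finalize();
      return 0;
  }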

Slide 42 – Parallel "Architectures": Mapping Hardware Models to Programming Models
Control mechanism: SIMD vs. MIMD
Memory model: shared-memory, distributed-memory, or hybrid (SMP cluster)
Programming model: SPMD vs. MPMD

Slide 43 – Methods of Problem Decomposition for Parallel Programming
Want to map (Problem + Algorithms + Data) onto the Architecture
Conceptualize the mapping via, e.g., pseudocode
Realize the mapping via a programming language
Data decomposition – data-parallel program
– Each processor performs the same task on different data
– Example: grid problems
Task (functional) decomposition – task-parallel program
– Each processor performs a different task
– Example: signal processing – adding/subtracting frequencies from a spectrum
Other decomposition methods exist

Slide 44 – Designing and Building Parallel Applications
Generic problem architectures
Design and construction principles
Incorporate computer science algorithms
Use parallel numerical libraries where possible

Slide 45 – Designing and Building Parallel Applications
Knowing when (not) to parallelize is very important
Cherri Pancake's "rules", summarized:
– Frequency of use
– Execution time
– Resolution needs
– Problem size

Slide 46 – Categories of Parallel Problems
Generic parallel problem "architectures" (after G. Fox)
Ideally parallel (embarrassingly parallel, "job-level parallel")
– Same application run on different data
– Could be run on separate machines
– Example: parameter studies
Almost ideally parallel
– Similar to the ideal case, but with "minimum" coordination required
– Example: linear Monte Carlo calculations, integrals
Pipeline parallelism
– Problem divided into tasks that have to be completed sequentially
– Can be transformed into partially sequential tasks
– Example: DSP filtering
Synchronous parallelism
– Each operation performed on all/most of the data
– Operations depend on results of prior operations
– All processes must be synchronized at regular points
– Example: modeling atmospheric dynamics
Loosely synchronous parallelism
– Similar to the synchronous case, but with "minimum" intermittent data sharing
– Example: modeling diffusion of contaminants through groundwater

Slide 47 – Designing and Building Parallel Applications
Attributes of parallel algorithms
– Concurrency: many actions performed "simultaneously"
– Modularity: decomposition of complex entities into simpler components
– Locality: want a high ratio of local memory access to remote memory access
– Usually want to minimize the communication/computation ratio
Performance: measures of algorithmic "efficiency"
– Execution time
– Complexity, usually ~ execution time
– Scalability

Slide 48 – Designing and Building Parallel Applications
Partitioning: break down the main task into smaller ones, either identical or "disjoint"
Communication phase: determine the communication patterns for task coordination and the communication algorithms
Agglomeration: evaluate the task and/or communication structures with respect to performance and implementation costs; tasks may be combined to improve performance or reduce communication costs
Mapping: tasks are assigned to processors; maximize processor utilization, minimize communication costs; mapping may be either static or dynamic
May have to iterate the whole process until satisfied with the expected performance
– Consider writing the application in parallel, using either SPMD message-passing or shared-memory
– Implementation (software & hardware) may require a revisit, additional refinement, or re-design

Slide 49 – Designing and Building Parallel Applications
Partitioning
– Geometric or physical decomposition (domain decomposition): partition the data associated with the problem
– Functional (task) decomposition: partition into disjoint tasks associated with the problem
– Divide and conquer: partition the problem into two simpler problems of approximately equivalent "size"; iterate to produce a set of indivisible sub-problems

Slide 50 – Generic Parallel Programming Software Systems
Message-passing
– Local tasks, each encapsulating local data
– Explicit data exchange
– Supports both SPMD and MPMD
– Supports both task and data decomposition
– Most commonly used
– Process-based, but for performance, processes should be running on separate CPUs
– Example APIs: the MPI and PVM message-passing libraries
– Message-passing systems, in particular MPI, will be the focus of the remainder of the workshop
Data parallel
– Usually SPMD
– Supports data decomposition
– Data mapping to CPUs may be either implicit or explicit
– Example: HPF compiler
Shared-memory
– Tasks share a common address space
– No explicit transfer of data; supports both task and data decomposition
– Can be SPMD or MPMD
– Thread-based, but for performance, threads should be running on separate CPUs
– Example APIs: OpenMP, Pthreads
Hybrid: combination of message-passing and shared-memory; supports both task and data decomposition
– Example: OpenMP + MPI
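A minimal shared-memory sketch in C with OpenMP (not from the slides): the iterations of a loop are divided among threads that share the array, and a reduction combines the per-thread partial sums; compile with an OpenMP-enabled compiler (e.g., -fopenmp).

  #include <omp.h>
  #include <stdio.h>
  #define N 1000000

  int main(void)
  {
      static double x[N];
      double sum = 0.0;

      /* Shared-memory model: threads share x; the loop iterations are divided
         among them, and the reduction clause combines the per-thread sums. */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++) {
          x[i] = 0.5 * i;
          sum += x[i];
      }

      printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
      return 0;
  }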

Slide 51 – Programming Methodologies: Practical Aspects
The bulk of parallel programs are written in Fortran, C, or C++
– Generally the best compiler and tool support for parallel program development
The bulk of parallel programs use message-passing with MPI
– Performance, portability, mature compilers, libraries for parallel program development
Data and/or tasks are split up onto different processors by:
– Distributing the data/tasks onto different CPUs, each with local memory (MPPs, MPI)
– Distributing the work of each loop to different CPUs (SMPs, OpenMP, Pthreads)
– Hybrid: distribute data onto SMP nodes, then within each SMP distribute the work of each loop (or task) to different CPUs within the box (SMP cluster, MPI & OpenMP)

Slide 52 – Typical Data Decomposition for Parallelism
Example: solve the 2-D wave equation
[The slide shows the original partial differential equation and its finite-difference approximation; the equation images were not preserved in this transcript]
[Figure: the x-y grid is split into vertical strips, one per processor, PE #0 through PE #7]
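For reference, here is a textbook form of the 2-D wave equation and a second-order centered-difference update; this is a standard discretization, not necessarily the exact one shown on the original slide.

\[
\frac{\partial^2 u}{\partial t^2} = c^2\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)
\]
\[
u^{n+1}_{i,j} = 2u^{n}_{i,j} - u^{n-1}_{i,j}
 + \frac{c^2\,\Delta t^2}{\Delta x^2}\left(u^{n}_{i+1,j} - 2u^{n}_{i,j} + u^{n}_{i-1,j}\right)
 + \frac{c^2\,\Delta t^2}{\Delta y^2}\left(u^{n}_{i,j+1} - 2u^{n}_{i,j} + u^{n}_{i,j-1}\right)
\]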

Slide 53 – Sending Data Between CPUs
[Figure: the 50 x 50 grid is split into four 25 x 25 blocks, one per processor; before updating the finite-difference approximation, each PE must send its boundary row/column (e.g., i=25, j=1-25 or i=1-25, j=26) to its neighbor]

Sample pseudocode:
  if (taskid == 0) then
     li = 1;  ui = 25
     lj = 1;  uj = 25
     send(1:25) = f(25, 1:25)
  elseif (taskid == 1) then
     ...
  elseif (taskid == 2) then
     ...
  elseif (taskid == 3) then
     ...
  end if
  do j = lj, uj
     do i = li, ui
        work on f(i,j)
     end do
  end do
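A hedged sketch of how such a boundary exchange might look with MPI in C (not the original code); for brevity it uses a 1-D strip decomposition in which each rank exchanges one ghost column with each neighbor, and the local array sizes are illustration values.

  #include <mpi.h>
  #include <stdio.h>

  #define NROWS 50          /* local rows, illustration value   */
  #define NCOLS 12          /* local columns owned by this rank */

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* f has two extra "ghost" columns (0 and NCOLS+1) holding neighbor data. */
      double f[NROWS][NCOLS + 2];
      for (int i = 0; i < NROWS; i++)
          for (int j = 0; j < NCOLS + 2; j++)
              f[i][j] = rank;

      int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
      int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

      /* Exchange boundary columns with neighbors before updating the stencil.
         Column entries are strided in memory, so use a vector datatype. */
      MPI_Datatype col;
      MPI_Type_vector(NROWS, 1, NCOLS + 2, MPI_DOUBLE, &col);
      MPI_Type_commit(&col);

      MPI_Sendrecv(&f[0][NCOLS], 1, col, right, 0,
                   &f[0][0],     1, col, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Sendrecv(&f[0][1],         1, col, left,  1,
                   &f[0][NCOLS + 1], 1, col, right, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Type_free(&col);
      if (rank == 0) printf("halo exchange complete\n");
      MPI_Finalize();
      return 0;
  }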

Slide 54 – Typical Task Parallel Decomposition
Signal processing example
Use one processor for each independent task
Can use more processors if one is overloaded
[Figure: SPECTRUM IN -> subtract frequency f1 (process 0) -> subtract frequency f2 (process 1) -> subtract frequency f3 (process 2) -> SPECTRUM OUT]

Slide 55 – Basics of Task Parallel Decomposition: SPMD
The same program will run on 2 different CPUs
Task decomposition analysis has defined 2 tasks (a and b) to be done by 2 CPUs

  program.f:
     ...
     initialize
     ...
     if (TaskID == A) then
        do task a
     elseif (TaskID == B) then
        do task b
     end if
     ...
  end program

Task A execution stream: initialize, do task a, end
Task B execution stream: initialize, do task b, end

Slide 56 – Multi-Level Task Parallelism
Implementation: MPI and OpenMP
[Figure: four processor sets (proc set #1 through #4), each running the same program tskpar; within each set, OpenMP threads execute the parallel blocks of loop #1 and loop #2, with serial work in between, while MPI connects the processor sets over the network]

  program tskpar
  implicit none
  (declarations)
  ! do loop #1: parallel block, then serial work
  ! do loop #2: parallel block, then serial work
  end program

Slide 57 – Parallel Application Performance Concepts
Parallel speedup
Parallel efficiency
Parallel overhead
Limits on parallel performance

Slide 58 – Parallel Application Performance Concepts
Parallel speedup: ratio of best sequential time to parallel execution time
– S(n) = ts/tp
Parallel efficiency: fraction of time the processors are in use
– E(n) = ts/(tp*n) = S(n)/n
Parallel overhead
– Communication time (message-passing)
– Process creation/synchronization (message-passing)
– Extra code to support parallelism, such as load balancing
– Thread creation/coordination time (shared memory)
Limits on parallel performance
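A small worked example with made-up timings (not from the slides): if the best serial time is ts = 100 s and the parallel time on n = 8 CPUs is tp = 16 s, then S(8) = 100/16 = 6.25 and E(8) = 6.25/8 = 0.78, i.e., the processors are busy about 78% of the time.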

Slide 60 – Limits of Parallel Computing
Theoretical upper limits
– Amdahl's Law
– Gustafson's Law
Practical limits
– Communication overhead
– Synchronization overhead
– Extra operations necessary for the parallel version
Other considerations
– Time used to re-write (existing) code

Slide 61 – Parallel Computing: Theoretical Performance Upper Limits
All parallel programs contain:
– Parallel sections
– Serial sections
Serial sections limit the parallel performance
Amdahl's Law provides a theoretical upper limit on parallel performance for size-constant problems

Slide 62 – Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors
– Effect of multiple processors on run time for size-constant problems: t(N) = t1 * (fs + fp/N)
– Effect of multiple processors on parallel speedup S:
      S = 1 / (fs + fp/N)
where
– fs = serial fraction of code
– fp = parallel fraction of code
– N = number of processors
– t1 = sequential execution time
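A quick worked example (not from the slides): with fs = 0.05, fp = 0.95 and N = 16 processors, S = 1/(0.05 + 0.95/16) = 1/0.109375, about 9.1; even as N grows without bound the speedup is capped at 1/fs = 20.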

Slide 63 – Amdahl's Law
[The slide's figure was not preserved in this transcript]

Slide 64 – Amdahl's Law (Continued)
[The slide's figure was not preserved in this transcript]

Slide 65 – Gustafson's Law
Consider scaling the problem size as the processor count is increased
– Ts = serial execution time
– Tp(N,W) = parallel execution time for the same problem, size W, on N CPUs
– S(N,W) = speedup on problem size W with N CPUs
S(N,W) = (Ts + Tp(1,W)) / (Ts + Tp(N,W))
Consider the case where Tp(N,W) ~ W*W/N:
S(N,W) -> (N*Ts + N*W*W) / (N*Ts + W*W) -> N (as W grows)
Gustafson's Law provides some hope for parallel applications to deliver on the promise.

Slide 66 – Parallel Programming Analysis: Example
Consider solving the 2-D Poisson equation by an iterative method on a regular grid with M points
[The remainder of this slide's analysis was not preserved in this transcript]

Slide 67 – Parallel Programming Concepts
The program must be correct and terminate for some input data set(s)
Race condition: the result depends on the order in which processes/threads finish a calculation. May or may not be a problem, depending on the results
Deadlock: a process/thread requests a resource it will never get. To be avoided; a common problem in message-passing parallel programs
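A minimal illustration in C with OpenMP (not from the slides): several threads update a shared counter with no coordination, so the final value depends on the interleaving; the reduction version shows one way to remove the race.

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
      long racy = 0, safe = 0;
      const long iters = 1000000;

      /* Race condition: all threads read-modify-write `racy` with no
         synchronization, so updates can be lost and the result varies run to run. */
      #pragma omp parallel for
      for (long i = 0; i < iters; i++)
          racy++;

      /* One fix: give each thread a private partial count and combine them. */
      #pragma omp parallel for reduction(+:safe)
      for (long i = 0; i < iters; i++)
          safe++;

      printf("racy = %ld (unpredictable), safe = %ld (always %ld)\n",
             racy, safe, iters);
      return 0;
  }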

Slide 68 – Other Considerations
Writing efficient parallel applications is usually more difficult than writing serial applications
– The serial version may (or may not) provide a good starting point for the parallel version
– Communication, synchronization, etc., can limit parallel performance
Usually want to overlap communication and computation to minimize the ratio of communication to computation time
– Serial time can dominate
– CPU computational load balance is important
Is it worth your time to rewrite an existing application, or create a new one? Recall Cherri Pancake's rules (simplified version):
– Do the CPU and/or memory requirements justify parallelization?
– Will the code be used "enough" times to justify parallelization?

Slide 69 – Parallel Programming: Real Life
These are the main models in use today (circa 2002)
New approaches – languages, hardware, etc. – are likely to arise as technology advances
Other combinations of these models are possible
Large applications will probably use more than one model
The shared-memory model is closest to the mathematical model of the application
– Scaling to large numbers of CPUs is the major issue

Slide 70 – Parallel Computing References
NPACI PCOMP web page: selected HPC link collection, categorized and updated
Online tutorials and books
– Designing and Building Parallel Programs, Ian Foster
– NCSA Intro to MPI Tutorial
– HLRS Parallel Programming Workshop
Books
– Parallel Programming, B. Wilkinson and M. Allen
– Computer Organization and Design, D. Patterson and J. L. Hennessy
– Scalable Parallel Computing, K. Hwang and Z. Xu