1 Parallel Scientific Computing: Algorithms and Tools Lecture #3 APMA 2821A, Spring 2008 Instructors: George Em Karniadakis Leopold Grinberg

2 Levels of Parallelism  Job-level parallelism: capacity computing  Goal: run as many jobs as possible on a system in a given time period. Concerned with throughput; an individual user's job may not run faster.  Of interest to system administrators  Program/task-level parallelism: capability computing  Use multiple processors to solve a single problem.  Controlled by users.  Instruction-level parallelism:  Pipelining, multiple functional units, multiple cores.  Invisible to users.  Bit-level parallelism:  Of concern to hardware designers of arithmetic-logic units

3 Granularity of Parallel Tasks  Large/coarse-grain parallelism:  The amount of work that runs in parallel is fairly large  e.g., on the order of an entire program  Small/fine-grain parallelism:  The amount of work that runs in parallel is relatively small  e.g., on the order of a single loop.  Coarse-grain parallelism usually results in more favorable parallel performance

4 Flynn’s Taxonomy of Computers  SISD: Single instruction stream, single data stream  MISD: Multiple instruction streams, single data stream  SIMD: Single instruction stream, multiple data streams  MIMD: Multiple instruction streams, multiple data streams

5 Classification of Computers  SISD: single instruction, single data  Conventional computers  The CPU fetches from one instruction stream and works on one data stream.  Instructions may still run in parallel internally (superscalar).  MISD: multiple instruction, single data  No real-world implementation.

6 Classification of Computers  SIMD: single instruction, multiple data  Controller + processing elements (PEs)  The controller dispatches an instruction to the PEs; all PEs execute the same instruction, but on different data  e.g., MasPar MP-1, Thinking Machines CM-1, vector computers (?)  MIMD: multiple instruction, multiple data  Processors execute their own instructions on different data streams  Processors communicate with one another directly, or through shared memory.  Typical parallel computers, clusters of workstations
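The SIMD idea survives today in the vector units of ordinary CPUs. A minimal sketch, assuming an x86 target with SSE intrinsics (an illustration, not the machines named above): a single add instruction operates on four floats at once.

#include <immintrin.h>  /* x86 SSE intrinsics -- assumed target, for illustration only */

/* Add two float arrays: one SIMD instruction processes 4 elements per step. */
void add_arrays(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);          /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* one instruction, 4 data elements */
    }
    for (; i < n; ++i)                            /* scalar remainder loop */
        c[i] = a[i] + b[i];
}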

7 Flynn’s Taxonomy

8 Programming Model  SPMD: Single program, multiple data  MPMD: Multiple programs, multiple data

9 Programming Model  SPMD: Single program, multiple data  The usual parallel programming model  All processors execute the same program, on different data sets (domain decomposition)  Each processor knows its own ID and can branch on it, e.g., if (my_cpu_id == N) {} else {} (see the sketch below)
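A minimal SPMD sketch in MPI: every rank runs the same executable and uses its ID to pick its share of the data. The problem size and the summed quantity are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

#define N 1000000  /* illustrative global problem size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process computes only the slice of [0, N) that it owns. */
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;

    double local_sum = 0.0, global_sum;
    for (int i = lo; i < hi; ++i)
        local_sum += 1.0 / (i + 1.0);

    /* Combine the partial results on rank 0. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}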

10 Programming Model  MPMD: Multiple programs, multiple data  Different processors execute different programs, on different data  Usually a master-slave model is used: the master CPU spawns and dispatches computations to slave CPUs running a different program.  Can be converted into the SPMD model by branching on the processor ID, e.g., if (my_cpu_id == 0) function_containing_program_1(); else function_containing_program_2(); (see the sketch below)
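A hedged sketch of that conversion: a single MPI program whose behavior branches on the rank, so a conceptually MPMD master/worker job runs as one SPMD executable. The function names are placeholders standing in for the two logically different "programs".

#include <mpi.h>

/* Placeholder routines standing in for two different "programs" (bodies omitted). */
void run_master(void) { /* dispatch work to workers, collect results */ }
void run_worker(void) { /* receive work, compute, send results back   */ }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        run_master();   /* "program 1" */
    else
        run_worker();   /* "program 2" */

    MPI_Finalize();
    return 0;
}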

11 Classification of Parallel Computers  Flynn's MIMD class contains a wide variety of parallel computers  Based on memory organization (address space):  Shared-memory parallel computers: processors can access all memory  Distributed-memory parallel computers: each processor can directly access only its local memory; remote memory is accessed through explicit communication

12 Shared-Memory Parallel Computer  Superscalar processors with L2 cache connected to memory modules through a bus or crossbar  All processors have access to all machine resources, including memory and I/O devices  SMP (symmetric multiprocessor): the processors are all the same and have equal access to machine resources, i.e., the machine is symmetric.  SMPs are UMA (Uniform Memory Access) machines  e.g., a node of an IBM SP machine; SUN Ultraenterprise  [Figure: prototype shared-memory parallel computer -- processors (P) with caches (C) connected through a bus or crossbar to memory modules (M1 ... Mn)]

13 Shared-Memory Parallel Computer  If the interconnect is a bus:  Only one processor can access memory at a time.  Processors contend for the bus to access memory  If it is a crossbar:  Multiple processors can access memory through independent paths  Contention occurs only when different processors access the same memory module  A crossbar can be very expensive.  Processor count is limited by memory contention and bandwidth  Maximum is usually 64 or 128  [Figures: bus-based and crossbar-based shared-memory organizations -- processors (P) with caches (C) connected to memory modules (M1 ... Mn)]

14 Shared-Memory Parallel Computer  Data flows from memory to cache, then to processors  Performance depends dramatically on reuse of data in cache  Fetching data from memory, with potential memory contention, can be expensive  The L2 cache plays the role of local fast memory; shared memory is analogous to extended memory accessed in blocks
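A small generic C illustration of why cache reuse matters (not specific to any machine on the slide): traversing a matrix along rows matches C's row-major layout and reuses each fetched cache line, while traversing along columns touches a new line on nearly every access.

#define N 1024
double a[N][N];

/* Cache-friendly: consecutive j accesses fall in the same cache line. */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

/* Cache-unfriendly: each access jumps N*sizeof(double) bytes ahead in memory. */
double sum_col_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}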

15 Cache Coherency  If a piece of data in one processor's cache is modified, then every other processor's cache that contains that data must be updated.  Cache coherency: the state achieved by maintaining consistent values of the same data in all processors' caches.  Cache coherency is usually maintained by hardware; system software can also do this, but it is more difficult.

16 Programming Shared-Memory Parallel Computers  All memory modules share a single global address space.  Closest to a single-processor computer  Relatively easy to program.  Multi-threaded programming:  Auto-parallelizing compilers can extract fine-grain (loop-level) parallelism automatically;  Or use OpenMP;  Or use explicit POSIX (Portable Operating System Interface) threads or other thread libraries.  Message passing:  MPI (Message Passing Interface) also works on shared memory.
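A minimal OpenMP sketch of the loop-level parallelism mentioned above, compiled with an OpenMP-capable compiler (e.g., gcc -fopenmp); the array contents and sizes are illustrative assumptions.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double dot = 0.0;

    /* Threads share x, y, and dot; the reduction clause keeps dot consistent. */
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; ++i) {
        x[i] = 1.0;               /* illustrative initialization */
        y[i] = 2.0;
        dot += x[i] * y[i];
    }

    printf("dot = %f (max threads: %d)\n", dot, omp_get_max_threads());
    return 0;
}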

17 Distributed-Memory Parallel Computer  Superscalar processors with local memory connected through a communication network.  Each processor can work directly only on data in its local memory  Access to remote memory requires explicit communication.  Present-day large supercomputers are all some form of distributed-memory machine  [Figure: prototype distributed-memory computer -- processors (P) with local memory (M) connected by a communication network; e.g., IBM SP, BlueGene, Cray XT3/XT4]

18 Distributed-Memory Parallel Computer  High scalability  No memory contention of the kind seen in shared-memory machines  Now scaled to more than 100,000 processors.  Performance of the interconnection network is crucial to application performance.  Ideal: low latency, high bandwidth.  Communication is much slower than local memory reads/writes, so data locality is important: keep frequently used data in local memory.

19 Programming Distributed-Memory Parallel Computers  "Owner computes" rule  The problem needs to be broken up into independent tasks with independent memory  Each task is assigned to a processor  Naturally matches data-based decompositions such as domain decomposition  Message passing: tasks explicitly exchange data by sending and receiving messages.  All data transfers use explicit send/receive instructions  The user must optimize communication  Usually MPI (formerly PVM): portable, high performance  Parallelization is mostly at a coarse granularity controlled by the user  Difficult for compilers/auto-parallelization tools
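A hedged sketch of explicit message passing under the "owner computes" rule: each rank owns one block of a 1D array and exchanges boundary (ghost) values with its neighbors before updating the points it owns. The local size, tags, and initialization are illustrative assumptions.

#include <mpi.h>

#define NLOC 100   /* points owned by each rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double u[NLOC + 2];            /* local data plus two ghost cells */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int i = 1; i <= NLOC; ++i) u[i] = rank;   /* owner initializes its block */

    /* Exchange boundary values with neighbors: explicit send/receive. */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 0,
                 &u[0],        1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  1,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... each rank now updates only the points it owns ... */

    MPI_Finalize();
    return 0;
}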

20 Programming Distributed-Memory Parallel Computers  A global address space is provided on some distributed-memory machines  Memory is physically distributed, but globally addressable; such a machine can be treated as a "shared-memory" machine: so-called distributed shared memory.  Cray T3E; SGI Altix, Origin.  Multi-threaded programs (OpenMP, POSIX threads) can also be used on such machines  The user accesses remote memory as if it were local; the OS/compiler translates such accesses into fetches/stores over the communication network.  But it is difficult to control data locality, so performance may suffer.  NUMA (non-uniform memory access); ccNUMA (cache-coherent non-uniform memory access); remote accesses incur extra overhead.

21 Hybrid Parallel Computer  Distributed memory overall, with SMP nodes  Most modern supercomputers and workstation clusters are of this type  Programmed with message passing, or hybrid message passing/threading.  [Figure: hybrid parallel computer -- SMP nodes (processors and memory on a bus or crossbar) connected by a communication network; e.g., IBM SP, Cray XT3]
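A minimal hybrid sketch matching this architecture: MPI between nodes, OpenMP threads within each SMP node. The requested thread-support level and the loop body are illustrative assumptions.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI between nodes; FUNNELED means only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, total;

    /* OpenMP threads share the node's memory. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0;

    /* Message passing combines the per-node results. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}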

22 Interconnection Network/Topology  A network consists of nodes and links  Neighbors: nodes with a link between them  Degree of a node: the number of neighbors it has  Scalability: the increase in complexity when more nodes are added.  [Figures: ring; fully connected network]

23 Topology  [Figure: hypercube]

24 Topology  [Figures: 1D/2D mesh/torus; 3D mesh/torus]

25 Topology  [Figures: tree; star]

26 Topology  Bisection width: the minimum number of links that must be cut in order to divide the topology into two independent networks of the same size (plus or minus one node)  Bisection bandwidth: the communication bandwidth across the links that are cut in defining the bisection width  Larger bisection bandwidth is better  e.g., for p nodes, a ring has bisection width 2, while a hypercube has bisection width p/2