Presentation transcript:

Multiprocessors - Flynn's Taxonomy (1966)
Single Instruction stream, Single Data stream (SISD)
–Conventional uniprocessor
–Although ILP is exploited: a single program counter -> a single instruction stream, and the data is not "streaming"
Single Instruction stream, Multiple Data stream (SIMD)
–Popular for some applications, like image processing
–One can construe vector processors to be of the SIMD type
–MMX extensions to the ISA reflect the SIMD philosophy; also apparent in "multimedia" processors (Equator Map-1000)
–"Data parallel" programming paradigm

Flynn's Taxonomy (cont'd)
Multiple Instruction stream, Single Data stream (MISD)
–Until recently, no processor really fit this category
–"Streaming" processors: each processor executes a kernel on a stream of data
–Maybe VLIW?
Multiple Instruction stream, Multiple Data stream (MIMD)
–The most general category
–Covers: shared-memory multiprocessors, and message-passing multicomputers (including networks of workstations cooperating on the same problem; grid computing)

Shared-memory Multiprocessors
Shared memory = a single shared address space (an extension of the uniprocessor model; communication via Load/Store, as in the sketch below)
Uniform Memory Access (UMA)
–With a shared bus, it is the basis for SMPs (Symmetric MultiProcessing)
–Cache coherence enforced by "snoopy" protocols
–Number of processors limited by: electrical constraints on the load of the bus, and contention for the bus
–SMPs form the basis for clusters (but in clusters, access to the memory of other nodes is not UMA)
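A minimal sketch of what "communication via Load/Store" means in practice: two POSIX threads exchange a value through the shared address space with an ordinary store and an ordinary load. The variable names and the value are illustrative, and a mutex stands in for whatever synchronization a real program would use.

```c
/* Sketch: shared-address-space communication via plain loads and stores. */
#include <pthread.h>
#include <stdio.h>

static int shared_value;                      /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                        /* a plain store IS the communication */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);
    pthread_mutex_lock(&lock);
    printf("consumer loaded %d\n", shared_value);  /* a plain load reads it back */
    pthread_mutex_unlock(&lock);
    return 0;
}
```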

SMP (Symmetric MultiProcessors, aka "Multis")
[Figure: a single shared-bus system; processors with their caches, an I/O adapter, and interleaved memory all attached to the shared bus]

Shared-memory Multiprocessors (cont'd)
Non-Uniform Memory Access (NUMA)
–NUMA-CC: cache coherent (directory-based protocols or SCI)
–NUMA without cache coherence (coherence enforced by software)
–COMA: Cache-Only Memory Architecture
–Clusters
Distributed Shared Memory (DSM)
–Most often a network of workstations
–The shared address space is the virtual address space
–The O.S. enforces coherence on a page-by-page basis

UMA Dance-Hall Architectures and NUMA
Replace the bus by an interconnection network
–Cross-bar
–Mesh
–Perfect shuffle and variants
Better to improve locality with NUMA; each processing element (PE) consists of:
–A processor
–A cache hierarchy
–Memory for local (private) data and shared data
Cache coherence via directory schemes

UMA Dance-Hall Schematic
[Figure: processors and their caches on one side of the interconnect, main memory modules on the other]

NUMA
[Figure: processors with their caches and local memories, connected by the interconnect]

Shared-bus
Number of devices is limited (length, electrical constraints)
A longer bus accommodates more devices but also becomes slower because of:
–Length
–Contention
Ultra-simplified analysis of contention (computed in the sketch below):
–Let Q = processor time between L2 misses, and T = bus transaction time
–For 1 processor, the bus utilization is P = T / (T + Q)
–For n processors sharing the bus, the probability that the bus is busy is B(n) = 1 - (1 - P)^n
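A short sketch that evaluates the model above. The values of T and Q are made-up illustrative numbers, not measurements from any system.

```c
/* Sketch: the slide's ultra-simplified bus-contention model. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double T = 50.0;    /* bus transaction time (cycles), assumed value */
    double Q = 450.0;   /* processor time between L2 misses (cycles), assumed */
    double P = T / (T + Q);                   /* single-processor bus utilization */

    for (int n = 1; n <= 32; n *= 2) {
        double busy = 1.0 - pow(1.0 - P, n);  /* B(n) = 1 - (1 - P)^n */
        printf("n = %2d  ->  bus busy %.1f%%\n", n, 100.0 * busy);
    }
    return 0;
}
```

Even with the bus busy only 10% of the time for one processor, the model shows the bus saturating quickly as processors are added, which is why shared-bus SMPs stay small.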

Cross-bars and Direct Interconnection Networks
Maximum concurrency between n processors and m banks of memory (or cache)
Complexity grows as O(n²)
Logically a set of n multiplexers
But queuing is also needed for contending requests
Small cross-bars are the building blocks for direct interconnection networks
–Each node is at the same distance from every other node

An O(n log n) network: the Butterfly
To go from processor i (xyz in binary) to processor j (uvw), start at i and at each stage k follow the high link if the kth bit of the destination address is 0, or the low link if it is 1. For example, the path from processor 4 (100) to processor 6 (110) is marked in bold lines [in the original figure]. A routing sketch follows below.
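A minimal sketch of the destination-tag routing rule just described. The bit order (most significant bit first) and the printout format are my assumptions, not part of the slide; note that the path depends only on the destination bits, not on the source.

```c
/* Sketch: destination-tag routing on a butterfly with 2^stages processors.
 * At stage k the switch examines one bit of the destination address:
 * 0 -> take the high link, 1 -> take the low link. */
#include <stdio.h>

static void butterfly_route(unsigned dst, int stages) {
    for (int k = stages - 1; k >= 0; k--) {   /* assumed: MSB first */
        int bit = (dst >> k) & 1;
        printf("stage %d: take %s link (bit = %d)\n",
               stages - k, bit == 0 ? "high" : "low", bit);
    }
}

int main(void) {
    /* The slide's example: from processor 4 (100) to processor 6 (110). */
    butterfly_route(6, 3);                    /* bits 1,1,0 -> low, low, high */
    return 0;
}
```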

Indirect Interconnection Networks
Nodes are at various distances from each other
Characterized by their dimension
–Various routing mechanisms, for example "higher dimension first" (see the mesh-routing sketch below)
2D meshes and tori
3D cubes
More dimensions: hypercubes
Fat trees
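A minimal sketch of dimension-order routing on a 2D mesh, one plausible instance of the "higher dimension first" idea: the offset in one dimension is resolved completely before the other. Treating Y as the higher dimension, and the example endpoints, are my assumptions.

```c
/* Sketch: dimension-order routing on a 2D mesh. */
#include <stdio.h>

static void mesh_route(int sx, int sy, int dx, int dy) {
    int x = sx, y = sy;
    while (y != dy) {                /* higher dimension (Y) first */
        y += (dy > y) ? 1 : -1;
        printf("hop to (%d,%d)\n", x, y);
    }
    while (x != dx) {                /* then the lower dimension (X) */
        x += (dx > x) ? 1 : -1;
        printf("hop to (%d,%d)\n", x, y);
    }
}

int main(void) {
    mesh_route(0, 0, 2, 3);          /* route from node (0,0) to node (2,3) */
    return 0;
}
```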

Mesh
[Figure: 2D mesh topology]

Message-passing Systems
Processors communicate by messages
–Primitives are of the form "send" and "receive"
–The user (programmer) has to insert the messages
–Message-passing libraries (e.g., MPI; see the example below)
Communication can be:
–Synchronous: the sender must wait for an ack from the receiver (e.g., in RPC)
–Asynchronous: the sender does not wait for a reply to continue
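A minimal send/receive example using standard MPI point-to-point calls; the payload value and the choice of ranks are arbitrary illustrative choices.

```c
/* Sketch: explicit message passing with MPI. Rank 0 sends one integer to
 * rank 1. Run with at least 2 processes, e.g., "mpirun -np 2 ./a.out". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest=1, tag=0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* src=0, tag=0 */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```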

Shared-memory vs. Message-passing
An old debate that no longer matters much
Many systems are built to support a mixture of both paradigms
–"send/receive" can be supported by the O.S. in shared-memory systems
–"load/store" in a virtual address space can be used in a message-passing system (the message-passing library can use "small" messages to that effect, e.g., passing a pointer to a memory area on another computer)

The Pros and Cons
Shared-memory pros
–Ease of programming (SPMD: Single Program Multiple Data paradigm)
–Good for communication of small items
–Less O.S. overhead
–Hardware-based cache coherence
Message-passing pros
–Simpler hardware (more scalable)
–Explicit communication (both good and bad; some programming languages have primitives for it), easier for long messages
–Use of message-passing libraries

Caveat about Parallel Processing
Multiprocessors are used to:
–Speed up computations
–Solve larger problems
Speedup (computed in the sketch below)
–Time to execute on 1 processor / time to execute on N processors
Speedup is limited by the communication/computation ratio and by synchronization
Efficiency
–Speedup / number of processors
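A minimal sketch of the two definitions above; T1 and TN are assumed illustrative timings, not measurements.

```c
/* Sketch: speedup and efficiency from two execution times. */
#include <stdio.h>

int main(void) {
    double T1 = 100.0;   /* time on 1 processor (s), assumed value */
    double TN = 16.0;    /* time on N processors (s), assumed value */
    int N = 8;

    double speedup    = T1 / TN;        /* 6.25: well short of linear */
    double efficiency = speedup / N;    /* 0.78: each processor 78% "useful" */
    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```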

Amdahl's Law for Parallel Processing
Recall Amdahl's Law
–If a fraction x of your program is sequential, speedup is bounded by 1/x
At best linear speedup (if there is no sequential section)
What about superlinear speedup?
–Theoretically impossible
–It "occurs" because adding a processor might mean adding more overall memory and cache capacity (e.g., fewer page faults!)
–Be careful about the sequential fraction x: it might become lower as the data set increases
Speedup and efficiency should have the number of processors and the size of the input set as parameters (see the sketch below)
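A numeric sketch of the law: with sequential fraction x, the speedup on n processors is 1 / (x + (1 - x)/n), which approaches the 1/x bound as n grows. The value x = 0.05 is an assumed illustrative choice.

```c
/* Sketch: Amdahl's Law speedup as the processor count grows. */
#include <stdio.h>

int main(void) {
    double x = 0.05;                    /* sequential fraction, assumed value */
    for (int n = 1; n <= 1024; n *= 4) {
        double speedup = 1.0 / (x + (1.0 - x) / n);
        printf("n = %4d  ->  speedup %.2f\n", n, speedup);
    }
    printf("bound as n -> infinity: %.1f\n", 1.0 / x);   /* 1/x = 20 */
    return 0;
}
```

Even with only 5% sequential code, 1024 processors yield a speedup below 20, which is the point of the slide's caveat.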

Chip MultiProcessors (CMPs)
Multiprocessors vs. multicores
–Multiprocessors have private cache hierarchies (on chip)
–Multicores have a shared L2 (on chip)
How many processors?
–Typically today: 2 to 4
–Tomorrow: 8 to 16
–Next decade: ???
Interconnection
–Today: cross-bar
–Tomorrow: ???
Biggest problems
–Programming (parallel programming languages)
–Applications that require parallelism