1 Advanced computer systems (Chapter 12)



TU-Delft TI1400/11-PDS 2 Large-Scale Computer Systems Today
Low-energy defibrillation
-Saves lives
-Affects >2M people/year
-Studies involving both laboratory experiments and computational simulation
Source: TeraGrid science highlights 2010

Large-Scale Computer Systems Today
Genome sequencing
-May save lives
-The $1,000 barrier
-Large-scale molecular dynamics simulations
Tectonic plate movement
-May save lives
-Adaptive fine mesh simulations
-Using 200,000 processors
Source: TeraGrid science highlights 2010

Large-Scale Computer Systems Today
Public content generation
-Wikipedia
-Affects how we think about collaborations
"The distribution of effort has increasingly become more uneven, unequal" - Sorin Adam Matei, Purdue University
Source: TeraGrid science highlights 2010

Large-Scale Computer Systems Today
Online gaming
-World of Warcraft, Zynga
-Affects >250M people
"As an organization, World of Warcraft utilizes 20,000 computer systems, 1.3 petabytes of storage, and more than 4,600 people."
-75,000 cores
-Upkeep: >$135,000/day (?)

Why parallelism (1/4)
Fundamental laws of nature:
-example: channel widths are becoming so small that quantum properties are going to determine device behaviour
-signal propagation time increases when channel widths shrink

Why parallelism (2/4)
Engineering constraints:
-The phase transition time of a component is a good measure for the maximum obtainable computing speed
 example: optical or superconducting devices can switch in about 10^-12 seconds
 optimistic suggestion: 1 TIPS (Tera Instructions Per Second, 10^12) is possible
-However, we must calculate something
 assume an instruction needs 10 phase transitions: 0.1 TIPS

Why parallelism (3/4)
But what about memory?
-It takes light approximately 16 picoseconds to cross 0.5 cm, yielding a possible execution rate of 60 GIPS
-However, in silicon, signals are about 10 times slower, resulting in 6 GIPS
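The memory-distance arithmetic above can be checked with a few lines. This is just a back-of-the-envelope verification; the 0.5 cm distance and the factor-10 silicon slowdown are the slide's own assumptions:

```python
# Verify the slide's light-speed memory argument.
C = 3.0e8           # speed of light in vacuum, m/s
DISTANCE = 0.005    # 0.5 cm processor-to-memory distance, in meters

t_light = DISTANCE / C              # one-way signal travel time, seconds
rate_light = 1.0 / t_light          # instructions/s if each needs one trip
rate_silicon = rate_light / 10      # signals in silicon are ~10x slower

print(f"travel time:   {t_light * 1e12:.1f} ps")          # ~16.7 ps
print(f"light limit:   {rate_light / 1e9:.0f} GIPS")      # ~60 GIPS
print(f"silicon limit: {rate_silicon / 1e9:.0f} GIPS")    # ~6 GIPS
```

The numbers reproduce the slide: roughly 16 ps per trip, hence about 60 GIPS at light speed and 6 GIPS in silicon.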

Why parallelism (4/4)
The speed of sequential computers is limited to a few GIPS
Improvements by using parallelism:
-multiple functional units (instruction-level parallelism)
-multiple CPUs (parallel processing)

Quantum Computing?
"Qubits are quantum bits that can be in an 'on', 'off', or 'both' state due to fuzzy physics at the atomic level."
Does surrounding noise matter?
-Wim van Dam, Nature Physics 2007
May 25: Lockheed Martin ($10M) buys a D-Wave One, a 128-qubit machine

Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers
8. A Programmer's View
9. Performance Considerations

Classification of computers (Flynn taxonomy)
Single Instruction, Single Data (SISD)
-conventional uniprocessor system
Single Instruction, Multiple Data (SIMD)
-one instruction applied to multiple data objects
Multiple Instruction, Multiple Data (MIMD)
-multiple instruction streams on multiple data streams
Multiple Instruction, Single Data (MISD)
-no clear practical examples; the category is largely empty


SIMD (Array) Processors
An instruction issuing unit broadcasts each instruction to an array of processing elements (PEs)
[Diagram: instruction issuing unit driving a row of PEs]
Examples: Connection Machine CM-2 ('87), CM-5 ('91)
-Peak: 28 GFLOPS
-Sustainable: 5-10% of peak

MIMD: Uniform Memory Access (UMA) architecture
[Diagram: processors P1..Pm and memory modules M1..Mk connected by an interconnection network]
Any processor can directly access any memory module.

MIMD: Non-Uniform Memory Access (NUMA) architecture
[Diagram: processors P1..Pm, each with a local memory module M1..Mm, connected by an interconnection network]
-realization in hardware or in software (distributed shared memory)
Any processor can directly access any memory module, but at non-uniform cost.

MIMD: Distributed memory architecture
[Diagram: processor-memory pairs (P1,M1)..(Pm,Mm) connected by an interconnection network]
Any processor can access any memory, but sometimes only through another processor (via messages).

Example 1: Graphics Processing Units (GPUs)
CPU versus GPU:
-CPU: much cache and control logic
-GPU: much compute logic

GPU Architecture
-Multiple SIMD units, SIMD pipelining
-Simple processors
-High branch penalty
-Efficient operation on parallel data and regular streaming (SIMD architecture)

Example 2: Cell B.E.
Distributed memory architecture
-one PowerPC control core
-8 identical compute cores

Example 3: Intel Quad-core
-Shared-memory MIMD

Example 4: Large MIMD Clusters
-BlueGene/L

Supercomputers Over Time


Interconnection networks (I/O between processors)
Difficulty in building systems with many processors: the interconnections
Important parameters:
1. Diameter: maximal distance between any two processors
2. Degree: maximal number of connections per processor
3. Total number of connections (cost)
4. Bisection width: the minimum number of links that must be cut to split the network into two equal halves; it bounds the number of simultaneous messages between the halves
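The first two parameters are easy to compute for any concrete topology. A minimal sketch (the `diameter_and_degree` helper and the adjacency-dict encoding are my own illustration, not from the slides), checked against an 8-node ring and a 3-dimensional hypercube:

```python
from collections import deque

def diameter_and_degree(adj):
    """Diameter (max shortest-path distance) and max degree of an
    undirected graph given as {node: [neighbours]}."""
    def eccentricity(src):
        dist = {src: 0}
        q = deque([src])
        while q:                      # plain BFS from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    diameter = max(eccentricity(s) for s in adj)
    degree = max(len(ns) for ns in adj.values())
    return diameter, degree

# 8-node ring: each node linked to its two neighbours
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
# 3-cube: neighbours differ in exactly one address bit
cube = {i: [i ^ (1 << b) for b in range(3)] for i in range(8)}

print(diameter_and_degree(ring))   # (4, 2)
print(diameter_and_degree(cube))   # (3, 3)
```

For the same node count the hypercube trades a higher degree (3 instead of 2) for a smaller diameter (3 instead of 4), which is exactly the kind of trade-off these parameters make visible.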

(Multiple) bus structures
[Diagram: processors sharing Bus 1 and Bus 2]

Cross-bar interconnection network
-An N x N crossbar requires N^2 switches
-Example: Sun E10000

Multi-stage networks (1/4)
[Diagram: 8 processor modules with 3-bit identifiers connected through 3 stages of switches; the path from P5 to P3 is highlighted]

Multi-stage networks (2/4)
Shuffle network: like shuffling a card deck, split into 2 half-decks and interleave
[Diagram: the connections P4-P0 and P5-P3 both need the same switch output, so they cannot be realized simultaneously]

Multi-stage networks (3/4)
Multi-stage networks route in multiple steps. Example: Shuffle or Omega network
-Every processor is identified by a three-bit number (in general, an n-bit number)
-A message from one processor to another contains the identifier of the destination
Routing algorithm: in every stage,
-inspect one bit of the destination
-if 0: use the upper output
-if 1: use the lower output
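The per-stage routing rule can be sketched in a few lines. This is an illustration of the bit-inspection idea only (it assumes stage i inspects bit i of the destination, most significant bit first, as in the 8-processor example above):

```python
def omega_route(dst, n_stages):
    """Return the switch output ('up'/'down') chosen at each stage when
    routing a message to destination dst in an Omega network.
    Stage i inspects destination bit i, most significant first."""
    choices = []
    for stage in range(n_stages):
        bit = (dst >> (n_stages - 1 - stage)) & 1
        choices.append("down" if bit else "up")   # 0 -> upper, 1 -> lower
    return choices

# Routing to P3 (destination 011) in an 8-processor, 3-stage network:
print(omega_route(3, 3))  # ['up', 'down', 'down']
```

Note the choices depend only on the destination, not the source: that is what makes the routing self-routing, and also why two messages to nearby destinations can collide (blocking, next slide).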

Multi-stage networks (4/4)
Properties:
-Let N = 2^n be the number of processing elements
-Number of stages: n = log2(N)
-Number of switches per stage: N/2
-Total number of (2x2) switches: N(log2(N))/2
Not every pair of connections can be simultaneously realized
-Blocking

Hypercubes (1/3)
An n-dimensional hypercube (n = 2, n = 3, ...) has 2^n nodes, n*2^(n-1) connections, and a maximum distance of n hops. Connected PEs differ in exactly 1 bit of their address.
Routing (e.g., 000 -> 111):
-scan the address bits from right to left
-if a bit differs from the destination's, send to the neighbour that differs in that bit
-repeat until the end
Delay is non-uniform, so hypercubes suit NUMA architectures.
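The right-to-left routing scan above (often called e-cube routing) is easy to sketch; the function below is an illustration of that rule, returning the sequence of nodes visited:

```python
def hypercube_route(src, dst, n):
    """E-cube routing in an n-cube: scan address bits from right to
    left; whenever the current node differs from the destination in a
    bit, hop to the neighbour that flips that bit."""
    path = [src]
    node = src
    for b in range(n):                 # right-to-left bit scan
        if (node ^ dst) & (1 << b):
            node ^= 1 << b             # flip one differing bit = one hop
            path.append(node)
    return path

# The slide's example, 000 -> 111 in a 3-cube:
print([format(x, "03b") for x in hypercube_route(0b000, 0b111, 3)])
# ['000', '001', '011', '111']
```

Each hop fixes one differing bit, so the route length equals the Hamming distance between source and destination, never more than n hops.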

Hypercubes (2/3)
Question: what is the average distance between two nodes in a hypercube?

Hypercubes (3/3)
Answer:
-take a specific node (the situation is the same for all of them)
-number of nodes at distance 0: 1; number of nodes at distance n: 1 -> average n/2
-number of nodes at distance 1 (one bit different): n; number of nodes at distance n-1 (all but one bit different): n -> again average n/2
-similarly for distances k and n-k
-so the overall average distance is n/2
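The pairing argument can be double-checked by brute force: the distance from node 0 to node v is just the number of 1-bits in v, so averaging popcounts over all 2^n nodes should give n/2. A small verification sketch:

```python
def average_distance(n):
    """Average Hamming distance from node 0 to all 2^n nodes of an
    n-cube, computed by brute force (node 0 is representative, since
    the hypercube looks the same from every node)."""
    total = sum(bin(v).count("1") for v in range(2 ** n))
    return total / 2 ** n

# The counting argument on the slide predicts exactly n/2:
for n in (2, 3, 4, 10):
    assert average_distance(n) == n / 2
print("average distance is n/2, as derived")
```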

Mesh
-Constant number of connections per node

Torus
-A mesh with wrap-around connections

Tree

Fat tree
-Nodes have multiple parents …

Local networks
Ethernet
-based on collision detection
-upon collision, back off and randomly try later
-speeds up to 100 Gb/s (Terabit Ethernet?)
Token ring
-based on a token circulating on a ring
-possession of the token allows putting a message on the ring


Memory organization (1/2)
[Diagram: each node contains a processor and secondary cache with a network interface; all memory is reached over the network]
UMA architectures.

Memory organization (2/2)
[Diagram: each node contains a processor, secondary cache, local memory, and a network interface to the network]
NUMA architectures.

Cache coherence
Problem: caches in multiprocessors may hold copies of the same variable
-Copies must be kept identical
Cache coherence: all copies of a shared variable have the same value
Solutions:
-write through to shared memory and all caches
-invalidate the cache entries in all other caches
Snoopy caches:
-Processing elements sense writes on the bus and update or invalidate their own copies
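The write-through + invalidate combination can be illustrated with a toy model. The `ToyCache` class below is purely my own sketch of the idea (the "bus" is just a list of caches that every write is snooped on), not a real protocol such as MESI:

```python
class ToyCache:
    """Minimal model of one per-processor cache: write-through to
    memory, with snooping caches invalidating their stale copies."""
    def __init__(self, bus):
        self.data = {}           # this cache's copies
        self.bus = bus           # all caches sharing the bus
        bus.append(self)

    def read(self, memory, addr):
        if addr not in self.data:        # miss: fetch from shared memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, memory, addr, value):
        memory[addr] = value             # write through to shared memory
        self.data[addr] = value
        for other in self.bus:           # snoop: invalidate other copies
            if other is not self:
                other.data.pop(addr, None)

bus, memory = [], {"x": 0}
c1, c2 = ToyCache(bus), ToyCache(bus)
c1.read(memory, "x"); c2.read(memory, "x")   # both caches now hold x = 0
c1.write(memory, "x", 42)                    # c2's stale copy is invalidated
print(c2.read(memory, "x"))                  # 42: c2 misses and re-fetches
```

The key point the model shows: after the invalidation, processor 2's next read misses and picks up the new value, so no stale copy survives.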


Parallelism
Language construct:
PARBEGIN
  task_1;
  task_2;
  ....
  task_n;
PAREND
[Diagram: PARBEGIN forks task 1 .. task n, which join again at PAREND]

Shared variables (1/4)
Tasks T1 and T2 both update the variable SUM in shared memory:
Task_1: ..... STW R2, SUM(0) .....
Task_2: ..... STW R2, SUM(0) .....

Shared variables (2/4)
Suppose processors 1 and 2 both execute:
LW   A,R0     /* A is a variable in main memory */
ADD  R1,R0
STW  R0,A
Initially:
-A = 100
-R1 in processor 1 is 20
-R1 in processor 2 is 40
What is the final value of A? 120, 140, or 160?
Now suppose the final value of A is your bank account balance.
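Why all three outcomes are possible becomes clear by simulating the interleavings explicitly. The sketch below models the three-instruction sequence for both processors and replays a chosen interleaving deterministically (the `run` helper and its encoding are my own illustration):

```python
def run(interleaving, A=100, r1=(20, 40)):
    """Simulate two processors each executing LW A,R0 / ADD R1,R0 /
    STW R0,A. `interleaving` is the order in which processor ids
    (0 or 1) execute their next instruction. Returns the final A."""
    regs = [{"R0": None, "R1": r1[0]}, {"R0": None, "R1": r1[1]}]
    pc = [0, 0]                       # per-processor instruction counter
    mem = {"A": A}
    for p in interleaving:
        step, r = pc[p], regs[p]
        if step == 0:
            r["R0"] = mem["A"]        # LW  A,R0
        elif step == 1:
            r["R0"] += r["R1"]        # ADD R1,R0
        elif step == 2:
            mem["A"] = r["R0"]        # STW R0,A
        pc[p] += 1
    return mem["A"]

print(run([0, 0, 0, 1, 1, 1]))  # 160: fully serial, both updates kept
print(run([0, 1, 0, 1, 0, 1]))  # 140: processor 1's update is lost
print(run([0, 0, 1, 1, 1, 0]))  # 120: processor 2's update is lost
```

The lost-update outcomes (120 and 140) arise whenever one processor loads A before the other has stored its result, which is exactly the hazard the bank-account remark is warning about.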

Shared variables (3/4)
So there is a need for mutual exclusion:
-different components of the same program need exclusive access to a data structure to ensure consistent values
Occurs in many situations:
-access to shared variables
-access to a printer
A solution: a single instruction (Test&Set) that
-tests whether somebody else is accessing the variable
-if so, continues testing (busy waiting)
-if not, indicates that the variable is now being accessed

Shared variables (4/4)
Both tasks guard the critical section with a lock variable LOCK in shared memory. Task_1 (and likewise Task_2):
crit: T&S  LOCK,crit      /* spin until the lock is free */
      STW  R2, SUM(0)
      .....
      CLR  LOCK           /* release the lock */
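A spinlock built on Test&Set can be sketched in Python. One caveat: Python has no atomic test-and-set instruction, so the sketch models the hardware instruction's atomicity with a tiny internal `threading.Lock`; everything else (the busy-wait loop, the CLR-style release) follows the slide:

```python
import threading

class TestAndSetLock:
    """Spinlock sketch: lock() busy-waits on a test-and-set primitive,
    unlock() clears the flag (the CLR LOCK of the slide)."""
    def __init__(self):
        self._atomic = threading.Lock()   # stands in for hw atomicity
        self.flag = 0

    def test_and_set(self):
        with self._atomic:                # atomically: read old, set to 1
            old, self.flag = self.flag, 1
            return old

    def lock(self):
        while self.test_and_set() == 1:
            pass                          # busy waiting

    def unlock(self):
        self.flag = 0

SUM = 0
lk = TestAndSetLock()

def task():
    global SUM
    for _ in range(10_000):
        lk.lock()
        SUM += 1          # the critical section (STW R2, SUM(0))
        lk.unlock()

threads = [threading.Thread(target=task) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(SUM)  # 20000 -- no updates are lost
```

With the lock in place both increments of every iteration survive; removing the lock()/unlock() pair would reintroduce the lost updates of the previous slide.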


Example program
Compute the dot product of two vectors with:
1. a sequential program
2. two tasks with shared memory
3. two tasks with distributed memory, using messages
Primitives in parallel programs:
-create_thread() (create a (sub)process)
-mypid() (who am I?)

Sequential program
integer array a[1..N], b[1..N]
integer dot_product

dot_product := 0;
do_dot(a,b)
print dot_product

procedure do_dot(integer array x[1..N], integer array y[1..N])
  for k := 1 to N
    dot_product := dot_product + x[k]*y[k]
end

Shared memory program 1 (1/2)
shared integer array a[1..N], b[1..N]
shared integer dot_product
shared lock dot_product_lock
shared barrier done

dot_product := 0;
create_thread(do_dot,a,b)    /* second thread, id=1 */
do_dot(a,b)                  /* this thread, id=0 */
print dot_product

Shared memory program 1 (2/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id
  id := mypid();    /* who am I? */
  for k := (id*N/2)+1 to (id+1)*N/2    /* k in [1..N/2] or [N/2+1..N] */
    lock(dot_product_lock)
    dot_product := dot_product + x[k]*y[k]    /* critical section */
    unlock(dot_product_lock)
  end
  barrier(done)
end

Shared memory program 2 (1/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id, local_dot_product
  id := mypid();
  local_dot_product := 0;
  for k := (id*N/2)+1 to (id+1)*N/2
    local_dot_product := local_dot_product + x[k]*y[k]   /* local computation (can execute in parallel) */
  end
  lock(dot_product_lock)
  dot_product := dot_product + local_dot_product   /* access shared variable (mutex) */
  unlock(dot_product_lock)
  barrier(done)
end

Shared memory program 2 (2/2)
[Diagram: threads id=0 and id=1 each accumulate a private local_dot_product, then add it to the shared dot_product and synchronize at the barrier]
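Shared memory program 2 translates almost directly into Python threads, which share memory much like the UMA model earlier in the deck. A sketch (the `parallel_dot` wrapper is my own; `threading.Lock` plays the role of dot_product_lock, and `join` plays the role of the barrier):

```python
import threading

def parallel_dot(x, y, n_threads=2):
    """Each thread computes a local partial product over its own slice,
    then adds it to the shared total inside a short critical section."""
    total = 0
    total_lock = threading.Lock()

    def do_dot(tid):
        nonlocal total
        n = len(x)
        lo, hi = tid * n // n_threads, (tid + 1) * n // n_threads
        local = sum(x[k] * y[k] for k in range(lo, hi))  # parallel part
        with total_lock:               # one lock round-trip per thread
            total += local

    threads = [threading.Thread(target=do_dot, args=(t,))
               for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()         # acts as the barrier
    return total

print(parallel_dot(list(range(1, 9)), [1] * 8))  # 36
```

Compared with program 1, the lock is taken once per thread instead of once per array element, which is the whole point of the local accumulation.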

Message passing program (1/3)
integer array a[1..N/2], temp_a[1..N/2], b[1..N/2], temp_b[1..N/2]
integer dot_product, id, temp

id := mypid();
if (id=0) then
  send(temp_a[1..N/2], 1);    /* send second halves */
  send(temp_b[1..N/2], 1);    /* of the two arrays to 1 */
else
  receive(a[1..N/2], 0);      /* receive second halves of */
  receive(b[1..N/2], 0);      /* the two arrays from proc. 0 */
end

Message passing program (2/3)
dot_product := 0;
do_dot(a,b)                   /* arrays of length N/2 */
if (id=1)
  send(dot_product, 0)
else
  receive(temp, 1)
  dot_product := dot_product + temp
  print dot_product
end

Message passing program (3/3)
[Diagram: process 0 sends the data (temp_a/b) to process 1; each computes a local_dot_product; process 1 sends its result back to process 0, which forms the final dot_product]
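The same exchange can be sketched with two workers that share no result variable at all: only messages travel between them. In this illustration queues stand in for the interconnect's send/receive primitives (it is a sketch of the pattern, not real MPI):

```python
import threading, queue

def message_passing_dot(a, b):
    """Process 0 sends the second halves of a and b to process 1;
    each computes a local dot product; process 1 sends its result
    back and process 0 combines the two."""
    to_p1, to_p0 = queue.Queue(), queue.Queue()   # the two channels
    result = {}

    def p0():
        n = len(a)
        to_p1.put((a[n // 2:], b[n // 2:]))       # send second halves to 1
        local = sum(x * y for x, y in zip(a[:n // 2], b[:n // 2]))
        result["dot"] = local + to_p0.get()       # receive partial result

    def p1():
        xs, ys = to_p1.get()                      # receive data from 0
        to_p0.put(sum(x * y for x, y in zip(xs, ys)))

    t0, t1 = threading.Thread(target=p0), threading.Thread(target=p1)
    t0.start(); t1.start(); t0.join(); t1.join()
    return result["dot"]

print(message_passing_dot([1, 2, 3, 4], [4, 3, 2, 1]))  # 20
```

Because the blocking `get` calls order the communication, no lock is needed anywhere: synchronization is implicit in the messages, which is the essential difference from the shared-memory versions.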


Speedup
Let T_P be the time needed to execute a program on P processors
Speedup: S_P = T_1 / T_P
Ideal: S_P = P (linear speedup)
Usually: sublinear speedup due to communication, the algorithm, etc.
Sometimes: superlinear speedup

Amdahl's law
Suppose a program has:
-a parallelizable fraction f
-and so a sequential fraction 1-f
Then (Amdahl's law):
S_P = T_1 / T_P = 1 / ((1-f) + f/P) = P / (P - f(P-1))
Consequence:
-If 1-f is significant, T_P cannot approach 0 even when P -> infinity
Example, for f = 0.95 (95%):
S_16 = 16 / (16 - 0.95 x 15) = 9.14
S_64 = 64 / (64 - 0.95 x 63) = 15.42
S_1k ~ 19.6, S_1M ~ 20, and even S_100M < 20 = 1/(1-f)
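The law and the f = 0.95 example can be checked directly (a small sketch; the function name is mine):

```python
def amdahl_speedup(f, P):
    """Amdahl's law: speedup on P processors for a program whose
    fraction f is parallelizable (the remaining 1-f stays sequential)."""
    return 1.0 / ((1.0 - f) + f / P)

for P in (16, 64, 1000, 10 ** 6):
    print(f"S_{P} = {amdahl_speedup(0.95, P):.2f}")
# S_16 = 9.14 and S_64 = 15.42, approaching the asymptote
# 1 / (1 - f) = 20 as P grows: beyond a point, extra processors
# barely help a 95%-parallel program.
```

The flattening curve is the practical message: quadrupling the machine from 16 to 64 processors does not even double the speedup.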