1 Advanced computer systems (Chapter 12)

Source: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_12.ppt

TU-Delft TI1400/11-PDS 2 Large-Scale Computer Systems Today
Low-energy defibrillation
-Saves lives
-Affects >2M people/year
-Studies involving both laboratory experiments and computational simulation
Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002

TU-Delft TI1400/11-PDS 3 Large-Scale Computer Systems Today
Genome sequencing
-May save lives
-The $1,000 barrier
-Large-scale molecular dynamics simulations
Tectonic plate movement
-May save lives
-Adaptive fine mesh simulations
-Using 200,000 processors
Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002

TU-Delft TI1400/11-PDS 4 Large-Scale Computer Systems Today
Public Content Generation
-Wikipedia
-Affects how we think about collaborations
“The distribution of effort has increasingly become more uneven, unequal” (Sorin Adam Matei, Purdue University)
Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002

TU-Delft TI1400/11-PDS 5 Large-Scale Computer Systems Today
Online Gaming
-World of Warcraft, Zynga
-Affects >250M people
“As an organization, World of Warcraft utilizes 20,000 computer systems, 1.3 petabytes of storage, and more than 4600 people.”
-75,000 cores
-Upkeep: >135,000$/day (?)
Source: http://www.gamasutra.com/php-bin/news_index.php?story=25307 and http://spectrum.ieee.org/consumer-electronics/gaming/engineering-everquest/0 and http://35yards.wordpress.com/2011/03/01/world-of-warcraft-by-the-numbers/

TU-Delft TI1400/11-PDS 6 Why parallelism (1/4)
Fundamental laws of nature:
-example: channel widths are becoming so small that quantum properties are going to determine device behaviour
-signal propagation time increases when channel widths shrink

TU-Delft TI1400/11-PDS 7 Why parallelism (2/4)
Engineering constraints:
-Phase transition time of a component is a good measure for the maximum obtainable computing speed
-example: optical or superconducting devices can switch in 10^-12 seconds
-optimistic suggestion: 1 TIPS (Tera Instructions Per Second, 10^12) is possible
-However, we must calculate something
-assume we need 10 phase transitions per instruction: 0.1 TIPS

TU-Delft TI1400/11-PDS 8 Why parallelism (3/4)
But what about memory?
[Figure: a distance of 0.5 cm]
-It takes light approximately 16 picoseconds to cross 0.5 cm, yielding a possible execution rate of 60 GIPS
-However, in silicon, signals travel about 10 times slower, resulting in 6 GIPS
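
The two rates above follow from simple division. The sketch below redoes the arithmetic; the rounded speed of light and the assumption that one operation needs exactly one 0.5 cm crossing are simplifications made here, not statements from the slides.

```python
# Back-of-the-envelope limits; all constants are rounded assumptions.
C_VACUUM_M_PER_S = 3.0e8   # speed of light in vacuum, rounded
DISTANCE_M = 0.5e-2        # 0.5 cm, the distance used on the slide
SILICON_SLOWDOWN = 10      # slide's estimate: about 10 times slower in silicon


def crossing_time_s(distance_m: float, speed_m_per_s: float) -> float:
    """Time for a signal to cover distance_m at speed_m_per_s."""
    return distance_m / speed_m_per_s


def rate_per_s(time_s: float) -> float:
    """Operations per second if each operation needs one crossing."""
    return 1.0 / time_s


t_light = crossing_time_s(DISTANCE_M, C_VACUUM_M_PER_S)
t_silicon = t_light * SILICON_SLOWDOWN

print(f"light:   {t_light * 1e12:.1f} ps, {rate_per_s(t_light) / 1e9:.0f} GIPS")
print(f"silicon: {t_silicon * 1e12:.1f} ps, {rate_per_s(t_silicon) / 1e9:.0f} GIPS")
```

With these inputs the crossing time comes out slightly under 17 ps, which the slide rounds to 16 ps.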

TU-Delft TI1400/11-PDS 9 Why parallelism (4/4)
Speed of sequential computers is limited to a few GIPS
Improvements by using parallelism:
-multiple functional units (instruction-level parallelism)
-multiple CPUs (parallel processing)

TU-Delft TI1400/11-PDS 10 Quantum Computing?
“Qubits are quantum bits that can be in an “on”, “off”, or “both” state due to fuzzy physics at the atomic level.”
Does surrounding noise matter?
-Wim van Dam, Nature Physics 2007
May 25, 2011
-Lockheed Martin (10M$)
-D-Wave One 128 qubit
Source: http://www.engadget.com/2011/05/29/d-wave-sells-first-commercial-quantum-computer-to-lockheed-marti/

TU-Delft TI1400/11-PDS 11 Agenda
1.Introduction
2.The Flynn Classification of Computers
3.Types of Multi-Processors
4.Interconnection Networks
5.Memory Organization in Multi-Processors
6.Program Parallelism and Shared Variables
7.Multi-Computers
8.A Programmer’s View
9.Performance Considerations

TU-Delft TI1400/11-PDS 12 Classification of computers (Flynn Taxonomy)
Single Instruction, Single Data (SISD)
-conventional system
Single Instruction, Multiple Data (SIMD)
-one instruction on multiple data objects
Multiple Instruction, Multiple Data (MIMD)
-multiple instruction streams on multiple data streams
Multiple Instruction, Single Data (MISD)
-?????

TU-Delft TI1400/11-PDS 14 SIMD (Array) Processors
[Figure: an Instruction Issuing Unit sends one instruction (e.g., INCR) to an array of PEs; PE = Processing Element]
Examples: CM-5 ('91), CM-2 ('87)
Peak: 28 GFLOPS
Sustainable: 5-10%
Sources: http://cs.adelaide.edu.au/~sacpc/hardware.html#cm5 and http://www.paulos.net/other/cm2.html and http://boards.straightdope.com/sdmb/archive/index.php/t-515675.html (about the blinking LEDs)

TU-Delft TI1400/11-PDS 15 MIMD Uniform Memory Access (UMA) architecture
[Figure: processors P_1, P_2, ..., P_m reach memory modules M_1, M_2, ..., M_k through an interconnection network; addresses 0, 1, 2, ..., N]
Uniform Memory Access (UMA) computer
Any processor can access directly any memory.

TU-Delft TI1400/11-PDS 16 MIMD NUMA architecture
[Figure: processors P_1, P_2, ..., P_m, memory modules M_1, M_2, ..., M_m, and an interconnection network; addresses 0, 1, 2, ..., N]
Non-Uniform Memory Access (NUMA) computer
-realization in hardware or in software (distributed shared memory)
Any processor can access directly any memory, but the access time depends on where that memory is located.

TU-Delft TI1400/11-PDS 17 MIMD Distributed memory architecture
[Figure: processors P_1, P_2, ..., P_m with memories M_1, M_2, ..., M_m, each memory with its own addresses 0, 1, 2, ..., connected by an interconnection network]
Any processor can access any memory, but sometimes through another processor (via messages).

TU-Delft TI1400/11-PDS 18 Example 1: Graphics Processing Units (GPUs)
CPU versus GPU
-CPU: Much cache and control logic
-GPU: Much compute logic

TU-Delft TI1400/11-PDS 19 GPU Architecture
-Multiple SIMD units
-SIMD pipelining
-Simple processors
-High branch penalty
Efficient operation on
-parallel data
-regular streaming
SIMD architecture

TU-Delft TI1400/11-PDS 20 Example 2: Cell B.E.
-Distributed memory architecture
-PowerPC core plus 8 identical cores

TU-Delft TI1400/11-PDS 21 Example 3: Intel Quad-core Shared Memory MIMD

TU-Delft TI1400/11-PDS 22 Example 4: Large MIMD Clusters BlueGene/L

TU-Delft TI1400/11-PDS 23 Supercomputers Over Time
Source: http://www.top500.org

TU-Delft TI1400/11-PDS 24 Agenda
1.Introduction
2.The Flynn Classification of Computers
3.Types of Multi-Processors
4.Interconnection Networks (I/O)
5.Memory Organization in Multi-Processors
6.Program Parallelism and Shared Variables
7.Multi-Computers
8.A Programmer’s View
9.Performance Considerations

TU-Delft TI1400/11-PDS 25 Interconnection networks (I/O between processors)
Difficulty in building systems with many processors: the interconnections
Important parameters:
1.Diameter:
-Maximal distance between any two processors
2.Degree:
-Maximal number of connections per processor
3.Total number of connections (Cost)
4.Bisection width
-Minimal number of connections that must be cut to split the network into two equal halves; it bounds the number of simultaneous messages between the halves

TU-Delft TI1400/11-PDS 26 Multiple bus
[Figure: (multiple) bus structures, with Bus 1 and Bus 2]

TU-Delft TI1400/11-PDS 27 Cross bar
[Figure: cross-bar interconnection network of size N, with N^2 switches]
Example: Sun E10000
Source: http://www.cray-cyber.org/systems/E10k_detail.php

TU-Delft TI1400/11-PDS 28 Multi-stage networks (1/4)
[Figure: three-stage network (stage 1, stage 2, stage 3) connecting 8 modules P_0 ... P_7 with 3-bit ids; the path from P_5 to P_3 is shown]

TU-Delft TI1400/11-PDS 29 Multi-stage networks (2/4)
Shuffle Network
[Figure: three-stage shuffle network (stage 1, stage 2, stage 3); connections P_4-P_0 and P_5-P_3 both use the same connection marked in the figure]
“Shuffle”: 2 x ½ deck, interleave

TU-Delft TI1400/11-PDS 30 Multi-stage network (3/4)
Multistage networks: multiple steps
Example: Shuffle or Omega network
-Every processor identified by three-bit number (in general, n-bit number)
-Message from one processor to another contains identifier of destination
Routing algorithm: In every stage,
-inspect one bit of destination
-if 0: use upper output
-if 1: use lower output
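
The routing rule above can be traced in a few lines. This is a minimal sketch under two assumptions that the slide does not spell out: a perfect shuffle (rotate the line number left by one bit) precedes every stage, and destination bits are inspected most significant bit first. The function name `omega_route` is made up for illustration, and the wiring in the lecture's figure may differ in detail.

```python
def omega_route(source: int, dest: int, n_bits: int) -> list[tuple[int, str, int]]:
    """Trace a message through an n-stage shuffle (Omega) network.

    Returns one (stage, output, line) tuple per stage, where output is
    "upper" or "lower" and line is the line the message leaves the stage on.
    """
    mask = (1 << n_bits) - 1
    line = source
    hops = []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the n-bit line number left by one position.
        line = ((line << 1) | (line >> (n_bits - 1))) & mask
        # Inspect one destination bit, most significant bit first (assumption).
        bit = (dest >> (n_bits - 1 - stage)) & 1
        # The 2x2 switch forwards to its upper (0) or lower (1) output.
        line = (line & ~1) | bit
        hops.append((stage + 1, "lower" if bit else "upper", line))
    return hops


for stage, output, line in omega_route(5, 3, 3):
    print(f"stage {stage}: {output} output, line {line:03b}")
```

After n stages every bit of the line number has been replaced by a destination bit, so the message arrives at the destination regardless of where it started.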

TU-Delft TI1400/11-PDS 31 Multi-stage network (4/4)
Properties:
-Let N = 2^n be the number of processing elements
-Number of stages n = log_2 N
-Number of switches per stage N/2
-Total number of (2x2) switches N(log_2 N)/2
Not every pair of connections can be simultaneously realized
-Blocking
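
To get a feeling for these counts, the sketch below evaluates them for a given N and puts the N^2 switches of a cross-bar next to the result. The function name and the dictionary keys are chosen here for illustration only.

```python
def omega_switch_counts(num_pes: int) -> dict[str, int]:
    """Stage and switch counts for a shuffle network with num_pes = 2**n PEs."""
    if num_pes < 2 or num_pes & (num_pes - 1):
        raise ValueError("num_pes must be a power of two and at least 2")
    stages = num_pes.bit_length() - 1      # n = log_2 N
    per_stage = num_pes // 2               # N/2 switches of size 2x2
    return {
        "stages": stages,
        "switches_per_stage": per_stage,
        "total_2x2_switches": stages * per_stage,   # N(log_2 N)/2
        "crossbar_switches": num_pes ** 2,          # N^2, for comparison
    }


for n_pes in (8, 1024):
    print(n_pes, omega_switch_counts(n_pes))
```

For N = 8 this gives 12 switches against 64 for a cross-bar; the price of the smaller network is blocking.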

TU-Delft TI1400/11-PDS 32 Hypercubes (1/3)
[Figure: hypercubes for n = 2 (nodes 00, 01, 10, 11) and n = 3 (nodes 000, 001, 010, 011, 100, 101, 110, 111)]
-n x 2^(n-1) connections
-maximum distance n hops
-Connected PEs differ by 1 bit
Routing:
-scan bits from right to left
-if a bit differs from the destination's, send to the neighbor that differs in that bit
-repeat until end
Example: 000 -> 111
Non-uniform delay, so for NUMA architectures.
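
The routing rule can be sketched as follows, assuming node ids are plain integers and bit 0 is the rightmost bit. The function name `hypercube_route` is hypothetical.

```python
def hypercube_route(source: int, dest: int, n_bits: int) -> list[int]:
    """Return the nodes visited when routing from source to dest.

    Bits are scanned from right to left; whenever the current node and the
    destination differ in a bit, the message hops to the neighbor that
    differs from the current node in exactly that bit.
    """
    path = [source]
    current = source
    for bit in range(n_bits):              # bit 0 is the rightmost bit
        if (current ^ dest) & (1 << bit):
            current ^= 1 << bit            # hop across dimension `bit`
            path.append(current)
    return path


print([f"{node:03b}" for node in hypercube_route(0b000, 0b111, 3)])
# prints ['000', '001', '011', '111']
```

The number of hops equals the number of bits in which source and destination differ, which is why the maximum distance is n.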

TU-Delft TI1400/11-PDS 33 Hypercubes (2/3) Question: what is the average distance between two nodes in a hypercube?

TU-Delft TI1400/11-PDS 34 Hypercubes (3/3)
Answer:
-take a specific node (situation is the same for all of them)
-number of nodes at distance 0: 1
-number of nodes at distance n: 1
-average of these two groups: n/2
-number of nodes at distance 1 (one bit difference): n
-number of nodes at distance n-1 (all but one bit difference): n
-again average n/2
-similar for distances k and n-k
-so overall average distance is n/2
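
The pairing argument can also be written as one sum. The sketch below assumes the average is taken over all 2^n nodes, including the starting node itself (distance 0), and uses the fact that exactly \binom{n}{k} nodes differ from the starting node in k bits.

```latex
\[
\bar{d}
  \;=\; \frac{1}{2^{n}} \sum_{k=0}^{n} k \binom{n}{k}
  \;=\; \frac{1}{2^{n}} \cdot \frac{1}{2} \sum_{k=0}^{n}
        \Bigl[ k \binom{n}{k} + (n-k) \binom{n}{n-k} \Bigr]
  \;=\; \frac{1}{2^{n}} \cdot \frac{n}{2} \sum_{k=0}^{n} \binom{n}{k}
  \;=\; \frac{n}{2}
\]
```

The middle step pairs distance k with distance n-k, using \binom{n}{k} = \binom{n}{n-k}; the last step uses \sum_k \binom{n}{k} = 2^n. If the starting node is excluded, the average becomes n 2^{n-1} / (2^n - 1), slightly above n/2.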

TU-Delft TI1400/11-PDS 35 Mesh Constant number of connections per node

TU-Delft TI1400/11-PDS 36 Torus mesh with wrap-around connections

TU-Delft TI1400/11-PDS 37 Tree

TU-Delft TI1400/11-PDS 38 Fat tree Nodes have multiple parents …

TU-Delft TI1400/11-PDS 39 Local networks
Ethernet
-based on collision detection
-upon collision, back off and randomly try later
-speed up to 100 Gb/s (Terabit Ethernet?)
Token ring
-based on token circulation on ring
-possession of token allows putting message on the ring

TU-Delft TI1400/11-PDS 41 Memory organization (1/2)
[Figure: Processor, Secondary Cache, and Network Interface, attached to the network]
UMA architectures.

TU-Delft TI1400/11-PDS 42 Memory organization (2/2)
[Figure: Processor, Secondary Cache, Local Memory, and Network Interface, attached to the network]
NUMA architectures.

TU-Delft TI1400/11-PDS 43 Cache coherence
Problem: caches in multiprocessors may have copies of the same variable
-Copies must be kept identical
Cache coherence: all copies of a shared variable have the same value
Solutions:
-write through to shared memory and all caches
-invalidate cache entries in all other caches
Snoopy caches:
-Processing elements sense writes and either update their cached copy or invalidate it
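
The invalidate variant can be illustrated with a toy model. This is a sketch only: the classes `Bus` and `Cache` and their methods are invented here, memory is a dictionary keyed by variable name, and every write goes through to shared memory. Real protocols track per-line states and are considerably more involved.

```python
class Bus:
    """Shared bus: owns main memory and lets caches snoop on writes."""

    def __init__(self) -> None:
        self.memory: dict[str, int] = {}
        self.caches: list["Cache"] = []

    def write(self, writer: "Cache", name: str, value: int) -> None:
        self.memory[name] = value              # write through to shared memory
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(name)   # other caches drop stale copies


class Cache:
    """Private cache of one processing element."""

    def __init__(self, bus: Bus) -> None:
        self.bus = bus
        self.lines: dict[str, int] = {}
        bus.caches.append(self)

    def read(self, name: str) -> int:
        if name not in self.lines:             # miss: fetch from shared memory
            self.lines[name] = self.bus.memory[name]
        return self.lines[name]

    def write(self, name: str, value: int) -> None:
        self.lines[name] = value
        self.bus.write(self, name, value)

    def snoop_invalidate(self, name: str) -> None:
        self.lines.pop(name, None)


bus = Bus()
bus.memory["x"] = 1
c1, c2 = Cache(bus), Cache(bus)
c1.read("x"); c2.read("x")      # both caches now hold a copy of x
c1.write("x", 2)                # c2's copy is invalidated
print(c2.read("x"))             # c2 misses and re-fetches the new value
```

Without the invalidation step, the final read would return the stale value 1 from c2's cache.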

TU-Delft TI1400/11-PDS 45 Parallelism
Language construct:
PARBEGIN
  task_1;
  task_2;
  ....
  task_n;
PAREND
[Figure: between PARBEGIN and PAREND, task 1 ... task n execute in parallel]

TU-Delft TI1400/11-PDS 46 Shared variables (1/4)
[Figure: tasks T1 and T2 both access SUM in shared memory]
Task_1: ..... STW R2, SUM(0) .....
Task_2: ..... STW R2, SUM(0) .....

TU-Delft TI1400/11-PDS 47 Shared variables (2/4)
Suppose processors 1 and 2 both execute:
  LW A,R0 /* A is variable in main memory */
  ADD R1,R0
  STW R0,A
Initially:
-A=100
-R1 in processor 1 is 20
-R1 in processor 2 is 40
What is the final value of A? 120, 140, 160?
Now imagine that the final value of A is your bank account balance.
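
All three answers are possible, depending on how the two instruction streams interleave. The sketch below enumerates every interleaving of the two three-instruction sequences, assuming each instruction executes atomically and instructions of one processor stay in program order. The function names are made up for illustration.

```python
from itertools import combinations


def run(schedule: tuple[int, ...], r1: dict[int, int], a_initial: int) -> int:
    """Execute LW / ADD / STW on each processor in the given global order."""
    a = a_initial
    r0 = {p: 0 for p in r1}
    step = {p: 0 for p in r1}          # index of the next instruction per processor
    for p in schedule:
        if step[p] == 0:
            r0[p] = a                  # LW A,R0
        elif step[p] == 1:
            r0[p] += r1[p]             # ADD R1,R0
        else:
            a = r0[p]                  # STW R0,A
        step[p] += 1
    return a


def all_final_values(r1: dict[int, int], a_initial: int = 100) -> set[int]:
    """Final values of A over all interleavings of two 3-instruction streams."""
    results = set()
    for slots in combinations(range(6), 3):    # time slots used by processor 1
        schedule = tuple(1 if i in slots else 2 for i in range(6))
        results.add(run(schedule, r1, a_initial))
    return results


print(sorted(all_final_values({1: 20, 2: 40})))
# prints [120, 140, 160]
```

Only the interleavings in which one processor stores before the other loads give 160; in all others one update is lost.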

TU-Delft TI1400/11-PDS 48 Shared variables (3/4)
So there is a need for mutual exclusion:
-different components of the same program need exclusive access to a data structure to ensure consistent values
Occurs in many situations:
-access to shared variables
-access to a printer
A solution: a single instruction (Test&Set) that
-tests whether somebody else accesses the variable
-if so, continue testing (busy waiting)
-if not, indicates that the variable is being accessed

TU-Delft TI1400/11-PDS 49 Shared variables (4/4)
[Figure: tasks T1 and T2 access SUM in shared memory, guarded by LOCK]
Task_1:
  crit: T&S LOCK,crit
        ......
        STW R2, SUM(0)
        .....
        CLR LOCK
Task_2:
  crit: T&S LOCK,crit
        ......
        STW R2, SUM(0)
        .....
        CLR LOCK
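
The same pattern can be mimicked in a high-level language. Python has no Test&Set instruction, so the sketch below assumes that a non-blocking `acquire` on a `threading.Lock` may stand in for the atomic test-and-set step; the class `SpinLock` and the function `run_tasks` are hypothetical names introduced for illustration.

```python
import threading
import time


class SpinLock:
    """Busy-waiting lock built on an atomic test-and-set step."""

    def __init__(self) -> None:
        # Assumption: non-blocking acquire plays the role of the T&S instruction.
        self._flag = threading.Lock()

    def test_and_set(self) -> bool:
        """Atomically set the flag; return True if it was already set."""
        return not self._flag.acquire(blocking=False)

    def lock(self) -> None:
        while self.test_and_set():     # crit: T&S LOCK,crit
            time.sleep(0)              # give the lock holder a chance to run

    def clear(self) -> None:
        self._flag.release()           # CLR LOCK


def run_tasks(num_tasks: int = 2, increments: int = 1000) -> int:
    """Let num_tasks threads each add 1 to a shared SUM `increments` times."""
    lock = SpinLock()
    shared = {"SUM": 0}

    def task() -> None:
        for _ in range(increments):
            lock.lock()
            shared["SUM"] += 1         # critical section
            lock.clear()

    threads = [threading.Thread(target=task) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["SUM"]


print(run_tasks())
```

Busy waiting keeps a processor occupied while it waits, which is acceptable only when critical sections are short.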

TU-Delft TI1400/11-PDS 50 Agenda
1.Introduction
2.The Flynn Classification of Computers
3.Types of Multi-Processors
4.Interconnection Networks
5.Memory Organization in Multi-Processors
6.Program Parallelism and Shared Variables
7.Multi-Computers [earlier, see Token Ring et al.]
8.A Programmer’s View
9.Performance Considerations

TU-Delft TI1400/11-PDS 51 Example program
Compute dot product of two vectors with a
1.sequential program
2.two tasks with shared memory
3.two tasks with distributed memory using messages
Primitives in parallel programs:
-create_thread() (create a (sub)process)
-mypid() (who am I?)

TU-Delft TI1400/11-PDS 52 Sequential program
integer array a[1..N], b[1..N]
integer dot_product
dot_product :=0;
do_dot(a,b)
print dot_product
procedure do_dot(integer array x[1..N], integer array y[1..N])
  for k:=1 to N
    dot_product := dot_product + x[k]*y[k]
  end
end

TU-Delft TI1400/11-PDS 53 Shared memory program 1 (1/2)
shared integer array a[1..N], b[1..N]
shared integer dot_product
shared lock dot_product_lock
shared barrier done
dot_product :=0;
create_thread(do_dot,a,b)
do_dot(a,b)
print dot_product
[Figure: threads id=0 and id=1 both update dot_product and then meet at the barrier]

TU-Delft TI1400/11-PDS 54 Shared memory program 1 (2/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id
  id := mypid(); /* who am I? */
  for k:=(id*N/2)+1 to (id+1)*N/2 /* k in [1..N/2] or [N/2+1..N] */
    lock(dot_product_lock)
    dot_product := dot_product + x[k]*y[k] /* critical section */
    unlock(dot_product_lock)
  end
  barrier(done)
end

TU-Delft TI1400/11-PDS 55 Shared memory program 2 (1/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id, local_dot_product
  id := mypid();
  local_dot_product :=0;
  /* local computation (can exec in parallel) */
  for k:=(id*N/2)+1 to (id+1)*N/2
    local_dot_product := local_dot_product + x[k]*y[k]
  end
  /* access shared variable (mutex) */
  lock(dot_product_lock)
  dot_product := dot_product + local_dot_product
  unlock(dot_product_lock)
  barrier(done)
end

TU-Delft TI1400/11-PDS 56 Shared memory program 2 (2/2)
[Figure: threads id=0 and id=1 each compute a local_dot_product, add it to the shared dot_product, and then meet at the barrier]
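
For comparison, here is a minimal sketch of shared memory program 2 using Python threads. Several things are assumptions of this sketch rather than part of the lecture: the thread id is passed as an argument instead of being obtained from mypid(), indices start at 0, and `threading.Lock` and `threading.Barrier` stand in for the lock and barrier primitives. In CPython the global interpreter lock prevents a real speedup, so the sketch shows the structure only.

```python
import threading


def parallel_dot(a: list[int], b: list[int], num_threads: int = 2) -> int:
    """Dot product with per-thread partial sums and one locked update each."""
    n = len(a)
    shared = {"dot_product": 0}
    dot_product_lock = threading.Lock()
    done = threading.Barrier(num_threads)

    def do_dot(thread_id: int) -> None:
        lo = thread_id * n // num_threads
        hi = (thread_id + 1) * n // num_threads
        local_dot_product = 0
        for k in range(lo, hi):                 # local computation, no locking
            local_dot_product += a[k] * b[k]
        with dot_product_lock:                  # one short critical section
            shared["dot_product"] += local_dot_product
        done.wait()                             # barrier(done)

    workers = [threading.Thread(target=do_dot, args=(i,))
               for i in range(1, num_threads)]
    for w in workers:
        w.start()
    do_dot(0)                                   # the creating thread acts as id 0
    for w in workers:
        w.join()
    return shared["dot_product"]


print(parallel_dot([1, 2, 3, 4], [5, 6, 7, 8]))
# prints 70
```

Compared with program 1, the lock is taken once per thread instead of once per element, so the threads rarely wait for each other.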

TU-Delft TI1400/11-PDS 57 Message passing program (1/3)
integer array a[1..N/2], temp_a[1..N/2], b[1..N/2], temp_b[1..N/2]
integer dot_product, id, temp
id := mypid();
if (id=0) then
  send(temp_a[1..N/2], 1); /* send second halves */
  send(temp_b[1..N/2], 1); /* of the two arrays to 1 */
else
  receive(a[1..N/2], 0); /* receive second halves of */
  receive(b[1..N/2], 0); /* the two arrays from proc. 0 */
end
[Figure: P_0 sends to P_1]

TU-Delft TI1400/11-PDS 58 Message passing program (2/3)
dot_product :=0;
do_dot(a,b) /* arrays of length N/2 */
if (id=1) then
  send(dot_product, 0)
else
  receive(temp, 1)
  dot_product := dot_product + temp
  print dot_product
end

TU-Delft TI1400/11-PDS 59 Message passing program (3/3)
[Figure: id=0 sends data (temp_a/b) to id=1; both compute a local_dot_product; id=1 sends its result back to id=0, which adds it to dot_product]
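
The message passing program can be imitated with two threads that exchange data only through queues. This is a sketch under loud assumptions: threads stand in for the two distributed-memory processes, one `queue.Queue` per receiver stands in for send/receive, and list slices are copied so that no array is actually shared. The function names are hypothetical.

```python
import queue
import threading


def do_dot(x: list[int], y: list[int]) -> int:
    """Sequential dot product of two equally long vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def message_passing_dot(a: list[int], b: list[int]) -> int:
    """Process 0 keeps the first halves and sends the second halves to process 1."""
    half = len(a) // 2
    mailbox = {0: queue.Queue(), 1: queue.Queue()}   # one mailbox per receiver

    def process_1() -> None:
        part_a = mailbox[1].get()                # receive(a[1..N/2], 0)
        part_b = mailbox[1].get()                # receive(b[1..N/2], 0)
        mailbox[0].put(do_dot(part_a, part_b))   # send(dot_product, 0)

    worker = threading.Thread(target=process_1)
    worker.start()

    # Process 0 (run by the calling thread).
    mailbox[1].put(a[half:])                     # send second halves (copies)
    mailbox[1].put(b[half:])
    dot_product = do_dot(a[:half], b[:half])
    temp = mailbox[0].get()                      # receive(temp, 1)
    dot_product += temp
    worker.join()
    return dot_product


print(message_passing_dot([1, 2, 3, 4], [5, 6, 7, 8]))
# prints 70
```

No lock is needed here: the only interaction between the two processes is the ordered exchange of messages.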

TU-Delft TI1400/11-PDS 60 Agenda
1.Introduction
2.The Flynn Classification of Computers
3.Types of Multi-Processors
4.Interconnection Networks
5.Memory Organization in Multi-Processors
6.Program Parallelism and Shared Variables
7.Multi-Computers [Book]
8.A Programmer’s View
9.Performance Considerations (Amdahl’s Law)

TU-Delft TI1400/11-PDS 61 Speedup
-Let T_P be the time needed to execute a program on P processors
-Speedup: S_P = T_1/T_P
-Ideal: S_P = P (linear speedup)
-Usually: sublinear speedup due to communication, algorithm, etc.
-Sometimes: superlinear speedup

TU-Delft TI1400/11-PDS 62 Amdahl’s law
Suppose a program has:
-a parallelizable fraction f
-and so a sequential fraction 1-f
Then (Amdahl’s law):
-S_P = T_1/T_P = 1 / (1-f + f/P) = P / (P - f(P-1))
Consequence:
-If 1-f is significant, T_P cannot approach 0 even when P → ∞
[Figure: time line with a sequential part of length 1-f followed by a parallel part of length f/P on P processors]
If f = 0.95 (95%), then:
S_16 = 16 / (16 - 0.95 x 15) = 9.14
S_64 = 64 / (64 - 0.95 x 63) = 15.42
S_256 = 256 / (256 - 0.95 x 255) = 18.62
S_1k ~ S_1M ~ S_100M < 20
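
The numbers above are easy to reproduce. A minimal sketch, with the function name `amdahl_speedup` chosen here for illustration:

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Speedup on p processors when a fraction f of the work is parallelizable."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / ((1.0 - f) + f / p)


for p in (16, 64, 1024, 10**6, 10**8):
    print(f"P = {p:>9}: S_P = {amdahl_speedup(0.95, p):.2f}")
```

For f = 0.95 the speedup stays below 1 / (1 - f) = 20 no matter how many processors are added, which is the point of the last line on the slide.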
