
1 Advanced computer systems (Chapter 12)
Alexandru Iosup (lecturer)
Parallel and Distributed Systems
Course website:

2 Large-Scale Computer Systems Today
Low-energy defibrillation
Saves lives; affects >2M people/year
Studies involving both laboratory experiments and computational simulation
Source: TeraGrid science highlights 2010

3 Large-Scale Computer Systems Today
Genome sequencing: may save lives; the $1,000 barrier; large-scale molecular dynamics simulations
Tectonic plate movement: adaptive fine-mesh simulations using 200,000 processors
Source: TeraGrid science highlights 2010

4 Large-Scale Computer Systems Today
Public Content Generation: Wikipedia
Affects how we think about collaborations
"The distribution of effort has increasingly become more uneven, unequal" (Sorin Adam Matei, Purdue University)
Source: TeraGrid science highlights 2010

5 Large-Scale Computer Systems Today
Online Gaming: World of Warcraft, Zynga
Affects >250M people
"As an organization, World of Warcraft utilizes 20,000 computer systems, 1.3 petabytes of storage, and more than 4600 people."
75,000 cores; upkeep: >$135,000/day (?)

6 Why parallelism (1/4)
Fundamental laws of nature. Examples:
channel widths are becoming so small that quantum properties are going to determine device behaviour
signal propagation time increases when channel widths shrink

7 Why parallelism (2/4)
Engineering constraints:
The phase transition time of a component is a good measure of the maximum obtainable computing speed.
Example: optical or superconducting devices can switch in about 10^-12 seconds.
Optimistic suggestion: 1 TIPS (Tera Instructions Per Second, 10^12 instructions/s) is possible.
However, we must calculate something: assume an instruction needs 10 phase transitions, giving 0.1 TIPS.

8 Why parallelism (3/4)
But what about memory?
It takes light approximately 16 picoseconds to cross 0.5 cm, yielding a possible execution rate of 60 GIPS.
However, in silicon, speed is about 10 times slower, resulting in 6 GIPS.

9 Why parallelism (4/4)
Speed of sequential computers is limited to a few GIPS.
Improvements by using parallelism:
multiple functional units (instruction-level parallelism)
multiple CPUs (parallel processing)

10 Quantum Computing?
"Qubits are quantum bits that can be in an 'on', 'off', or 'both' state due to fuzzy physics at the atomic level."
Does surrounding noise matter? (Wim van Dam, Nature Physics, 2007)
May 25, 2011: Lockheed Martin buys a D-Wave One (128 qubits) for $10M.

11 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

12 Classification of computers (Flynn taxonomy)
Single Instruction, Single Data (SISD): conventional system
Single Instruction, Multiple Data (SIMD): one instruction operates on multiple data objects
Multiple Instruction, Multiple Data (MIMD): multiple instruction streams on multiple data streams
Multiple Instruction, Single Data (MISD): ?????
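Of these, SIMD is the easiest to show in code. Below is a minimal C sketch using the GCC/Clang vector extension (my choice for illustration; the slide names no particular mechanism): the scalar loop is SISD-style, one data element per instruction, while the vector add applies one instruction to four data elements at once.

#include <stdio.h>

/* Four 32-bit ints packed into one 128-bit vector (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4];

    /* SISD style: one instruction stream, one data element at a time. */
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];

    /* SIMD style: one add applied to four data elements at once. */
    v4si va = {1, 2, 3, 4}, vb = {5, 6, 7, 8};
    v4si vc = va + vb;

    for (int i = 0; i < 4; i++)
        printf("%d %d\n", c[i], vc[i]);   /* identical results */
    return 0;
}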

13 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

14 SIMD (Array) Processors
[Figure: an Instruction Issuing Unit broadcasts each instruction (e.g., INCR) to an array of Processing Elements (PE)]
Examples: CM-2 ('87) and CM-5 ('91)
Peak: 28 GFLOPS; sustainable: 5-10% of peak

15 MIMD: Uniform Memory Access (UMA) architecture
Any processor can access directly any memory.
[Figure: processors P1..Pm and memory modules M1..Mk connected through an interconnection network: a UMA computer]

16 MIMD: NUMA architecture
Any processor can access directly any memory, but access times differ.
[Figure: processors P1..Pm, each with a nearby memory M1..Mm, connected through an interconnection network: a Non-Uniform Memory Access (NUMA) computer]
Realization in hardware or in software (distributed shared memory)

17 MIMD: Distributed memory architecture
Any processor can access any memory, but sometimes through another processor (via messages).
[Figure: processor-memory pairs (P1,M1)..(Pm,Mm) connected through an interconnection network]

18 Example 1: Graphics Processing Units (GPUs)
CPU versus GPU:
CPU: much cache and control logic
GPU: much compute logic

19 GPU Architecture
SIMD architecture: multiple SIMD units, SIMD pipelining
Simple processors; high branch penalty
Efficient operation on parallel data (regular streaming)

20 Example 2: Cell B.E.
Distributed memory architecture: 8 identical cores plus a PowerPC core

21 Example 3: Intel Quad-core
Shared Memory MIMD

22 Example 4: Large MIMD Clusters
BlueGene/L

23 Supercomputers Over Time

24 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks (I/O)
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

25 Interconnection networks (I/O between processors)
Difficulty in building systems with many processors: the interconnections.
Important parameters:
Diameter: maximal distance between any two processors
Degree: maximal number of connections per processor
Total number of connections (cost)
Bisection width: the minimum number of links that must be cut to split the network into two equal halves; it bounds the largest number of simultaneous messages between the halves

26 Multiple bus
[Figure: (multiple) bus structures, with processors and memories attached to Bus 1 and Bus 2]

27 Cross bar
An N x N cross-bar interconnection network requires N^2 switches.
Example: Sun E10000

28 Multi-stage networks (1/4)
[Figure: 8 modules with 3-bit ids (P0..P7); a path from P5 to P3 through the network]

29 Multi-stage networks (2/4)
[Figure: three-stage shuffle network (stage 1, stage 2, stage 3); the connections P4-P0 and P5-P3 both need the same link, so they block each other]
"Shuffle": like splitting a card deck into two halves and interleaving them.

30 Multi-stage network (3/4)
Multi-stage networks route in multiple steps. Example: the shuffle (omega) network.
Every processor is identified by a three-bit number (in general, an n-bit number).
A message from one processor to another contains the identifier of the destination.
Routing algorithm: in every stage, inspect one bit of the destination:
if 0: use the upper output
if 1: use the lower output
(A code sketch of this self-routing rule follows below.)
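To make the rule concrete, here is a small C sketch (my own illustration, not from the slides) that traces a message through an omega network with N = 2^n inputs: each stage first applies the perfect shuffle (a one-position left rotation of the n-bit address) and then uses the next destination bit, from most to least significant, to pick the upper (0) or lower (1) switch output.

#include <stdio.h>

/* Trace a message through an n-stage omega (shuffle) network.
   Returns the final position, which equals the destination. */
unsigned omega_route(unsigned src, unsigned dst, int n) {
    unsigned mask = (1u << n) - 1;
    unsigned pos = src;
    for (int stage = 0; stage < n; stage++) {
        /* Perfect shuffle: rotate the n-bit address one position left. */
        pos = ((pos << 1) | (pos >> (n - 1))) & mask;
        /* Inspect the next destination bit (MSB first):
           0 -> upper switch output, 1 -> lower switch output. */
        unsigned bit = (dst >> (n - 1 - stage)) & 1u;
        pos = (pos & ~1u) | bit;
        printf("after stage %d: at %u\n", stage + 1, pos);
    }
    return pos;
}

int main(void) {
    omega_route(5, 3, 3);   /* the slide's P5 -> P3 example: 5 -> 2 -> 5 -> 3 */
    return 0;
}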

31 Multi-stage network (4/4)
Properties. Let N = 2^n be the number of processing elements:
Number of stages: n = log2 N
Number of switches per stage: N/2
Total number of (2x2) switches: N(log2 N)/2
Not every pair of connections can be realized simultaneously: blocking

32 Hypercubes (1/3)
Non-uniform delay, so suited to NUMA architectures.
An n-dimensional hypercube connects 2^n PEs with n*2^(n-1) connections; the maximum distance is n hops; connected PEs differ in exactly 1 bit.
Routing:
- scan the destination bits from right to left
- if a bit differs, send to the neighbor that differs in exactly that bit
- repeat until the end
Example (n = 3): 000 -> 001 -> 011 -> 111.
[Figure: hypercubes for n = 2 (nodes 00, 01, 10, 11) and n = 3 (nodes 000..111)]
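The right-to-left rule translates directly into code. A minimal C sketch (my illustration; node ids are plain unsigned integers):

#include <stdio.h>

/* Route a message in an n-dimensional hypercube: scan the bits in which
   the current node and the destination differ, right to left, and hop to
   the neighbor that fixes each differing bit. At most n hops. */
void hypercube_route(unsigned src, unsigned dst, int n) {
    unsigned cur = src;
    printf("start at node %u\n", cur);
    for (int b = 0; b < n; b++) {
        if (((cur ^ dst) >> b) & 1u) {  /* does bit b differ? */
            cur ^= 1u << b;             /* hop to the neighbor differing in bit b */
            printf("hop to node %u\n", cur);
        }
    }
}

int main(void) {
    hypercube_route(0u, 7u, 3);   /* the slide's example: 000 -> 001 -> 011 -> 111 */
    return 0;
}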

33 Hypercubes (2/3)
Question: what is the average distance between two nodes in a hypercube?
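One way to answer (added here; the slide leaves it as an exercise): pick the two nodes independently and uniformly at random. Each of the n address bits then differs with probability 1/2, and the distance is the number of differing bits, so

E[d] \;=\; \sum_{k=0}^{n} k \binom{n}{k} 2^{-n} \;=\; \frac{n}{2}

(If the two nodes must be distinct, the average becomes (n/2) * 2^n / (2^n - 1), slightly above n/2.)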

34 Mesh Constant number of connections per node

35 Torus mesh with wrap-around connections

36 Tree

37 Fat tree Nodes have multiple parents

38 Local networks
Ethernet: based on collision detection
upon collision, back off and randomly try later
speed up to 100 Gb/s (Terabit Ethernet?)
Token ring: based on token circulation on a ring
possession of the token allows putting a message on the ring

39 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

40 Memory organization (1/2)
UMA architectures.
[Figure: each node contains a Processor, Secondary Cache, and Network Interface, connected to the network]

41 Memory organization (2/2)
NUMA architectures.
[Figure: each node contains a Processor, Secondary Cache, Local Memory, and Network Interface, connected to the network]

42 Cache coherence
Problem: caches in multiprocessors may hold copies of the same variable, and the copies must be kept identical.
Cache coherence: all copies of a shared variable have the same value.
Solutions:
write through to shared memory and to all caches
invalidate the cache entries in all other caches
Snoopy caches: processing elements observe (snoop) writes on the bus and update or invalidate their own copies.

43 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

44 Parallelism
Language construct for specifying parallel tasks:

PARBEGIN
  task_1;
  task_2;
  ....
  task_n;
PAREND

[Figure: control forks at PARBEGIN into task 1 .. task n and joins again at PAREND]
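PARBEGIN/PAREND is a language construct; in C the same fork-join pattern is typically expressed with pthreads. A minimal sketch (the task bodies are placeholders):

#include <pthread.h>
#include <stdio.h>

/* Placeholder task bodies standing in for task_1 .. task_n. */
void *task_1(void *arg) { printf("task 1\n"); return NULL; }
void *task_2(void *arg) { printf("task 2\n"); return NULL; }

int main(void) {
    pthread_t t1, t2;
    /* PARBEGIN: fork the tasks so they may run in parallel. */
    pthread_create(&t1, NULL, task_1, NULL);
    pthread_create(&t2, NULL, task_2, NULL);
    /* PAREND: wait until all tasks have completed. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}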

45 Shared variables (1/4)
[Figure: tasks T1 (Task_1) and T2 (Task_2) both execute STW R2, SUM(0), storing to the shared variable SUM in shared memory]

46 Shared variables (2/4)
Suppose processors 1 and 2 both execute:

LW A,R0    /* A is a variable in main memory */
ADD R1,R0
STW R0,A

Initially: A = 100; R1 in processor 1 is 20; R1 in processor 2 is 40.
What is the final value of A? 120, 140, 160?
Now consider that the final value of A is your bank account balance.
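The lost update can be reproduced in C. This is a deliberately exaggerated sketch of the LW/ADD/STW sequence (real hardware interleaves less predictably; sched_yield() just widens the race window):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int A = 100;   /* the shared "bank account balance" */

void *deposit(void *arg) {
    int amount = *(int *)arg;
    int r0 = A;        /* LW  A,R0  */
    sched_yield();     /* invite an interleaving here */
    r0 += amount;      /* ADD R1,R0 */
    A = r0;            /* STW R0,A  */
    return NULL;
}

int main(void) {
    int r1_p1 = 20, r1_p2 = 40;
    pthread_t p1, p2;
    pthread_create(&p1, NULL, deposit, &r1_p1);
    pthread_create(&p2, NULL, deposit, &r1_p2);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("A = %d\n", A);   /* may print 120, 140, or 160 */
    return 0;
}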

47 Shared variables (3/4)
So there is a need for mutual exclusion: different components of the same program need exclusive access to a data structure to ensure consistent values.
This occurs in many situations:
access to shared variables
access to a printer
A solution: a single instruction (Test&Set) that tests whether somebody else is accessing the variable:
if so, continue testing (busy waiting)
if not, mark the variable as in use and proceed

48 Shared variables (4/4)
Task_1 and Task_2 both bracket their update of the shared variable SUM with the LOCK variable:

crit: T&S LOCK,crit
      ......
      STW R2, SUM(0)
      .....
      CLR LOCK

[Figure: T1 and T2 run this code; SUM and LOCK live in shared memory]
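C11 exposes a test-and-set primitive directly as atomic_flag, so the busy-waiting lock from the slide can be sketched portably:

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* the LOCK variable, initially clear */

void acquire(void) {
    /* T&S LOCK,crit: test and set in one atomic step; spin while it was already set. */
    while (atomic_flag_test_and_set(&lock_flag))
        ;   /* busy waiting */
}

void release(void) {
    atomic_flag_clear(&lock_flag);   /* CLR LOCK */
}

Code placed between acquire() and release() then corresponds to the critical section around STW R2, SUM(0).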

49 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers [earlier, see Token Ring et al.]
A Programmer's View
Performance Considerations

50 Example program
Compute the dot product of two vectors with:
a sequential program
two tasks with shared memory
two tasks with distributed memory, using messages
Primitives in parallel programs:
create_thread() (create a (sub)process)
mypid() (who am I?)

51 Sequential program
integer array a[1..N], b[1..N]
integer dot_product

dot_product := 0
do_dot(a,b)
print dot_product

procedure do_dot(integer array x[1..N], integer array y[1..N])
  for k := 1 to N
    dot_product := dot_product + x[k]*y[k]
  end

52 Shared memory program 1 (1/2)
shared integer array a[1..N], b[1..N]
shared integer dot_product
shared lock dot_product_lock
shared barrier done

dot_product := 0;
create_thread(do_dot, a, b)
do_dot(a,b)
print dot_product

[Figure: threads id=0 and id=1 both update dot_product, then meet at the barrier]

53 Shared memory program 1 (2/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id
  id := mypid();                      /* who am I? */
  for k := (id*N/2)+1 to (id+1)*N/2   /* k in [1..N/2] or [N/2+1..N] */
    lock(dot_product_lock)            /* critical section */
    dot_product := dot_product + x[k]*y[k]
    unlock(dot_product_lock)
  end
  barrier(done)

54 Shared memory program 2 (1/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id, local_dot_product
  id := mypid();
  local_dot_product := 0;
  for k := (id*N/2)+1 to (id+1)*N/2
    local_dot_product := local_dot_product + x[k]*y[k]   /* local computation (can execute in parallel) */
  end
  lock(dot_product_lock)
  dot_product := dot_product + local_dot_product         /* access shared variable (mutex) */
  unlock(dot_product_lock)
  barrier(done)

(A runnable C translation follows below.)
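Here is a C/pthreads rendering of program 2; a sketch under my own assumptions (N divisible by the number of threads, arrays filled with test values, and pthread_join standing in for the barrier):

#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 2

static int a[N + 1], b[N + 1];          /* 1-indexed, as in the pseudocode */
static long dot_product = 0;
static pthread_mutex_t dot_product_lock = PTHREAD_MUTEX_INITIALIZER;

static void *do_dot(void *arg) {
    int id = *(int *)arg;
    long local_dot_product = 0;
    /* Local computation: can execute in parallel, no locking needed. */
    for (int k = id * N / NTHREADS + 1; k <= (id + 1) * N / NTHREADS; k++)
        local_dot_product += (long)a[k] * b[k];
    /* Access the shared variable once, under mutual exclusion. */
    pthread_mutex_lock(&dot_product_lock);
    dot_product += local_dot_product;
    pthread_mutex_unlock(&dot_product_lock);
    return NULL;
}

int main(void) {
    for (int k = 1; k <= N; k++) { a[k] = 1; b[k] = k; }
    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, do_dot, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);       /* plays the role of barrier(done) */
    printf("dot_product = %ld\n", dot_product);   /* 1+2+...+N = 500500 */
    return 0;
}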

55 Shared memory program 2 (2/2)
[Figure: threads id=0 and id=1 each accumulate a private local_dot_product, then combine them into the shared dot_product and meet at the barrier]

56 Message passing program (1/3)
integer array a[1..N/2], temp_a[1..N/2], b[1..N/2], temp_b[1..N/2]
integer dot_product, id, temp

id := mypid();
if (id = 0) then
  send(temp_a[1..N/2], 1);    /* send second halves */
  send(temp_b[1..N/2], 1);    /* of the two arrays to 1 */
else
  receive(a[1..N/2], 0);      /* receive second halves of */
  receive(b[1..N/2], 0);      /* the two arrays from proc. 0 */
end

57 Message passing program (2/3)
dot_product := 0;
do_dot(a,b)                   /* arrays of length N/2 */
if (id = 1)
  send(dot_product, 0)
else
  receive(temp, 1)
  dot_product := dot_product + temp
  print dot_product
end

58 Message passing program (3/3)
[Figure: process id=0 sends the data (temp_a/b) to id=1; each process computes a local_dot_product; id=1 sends its result back and id=0 combines it into the final dot_product]
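For comparison, a hedged MPI sketch of the message-passing version (my assumptions: exactly 2 processes, N even, and each process initializing its own half of the data with test values instead of receiving it from process 0):

#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    int id, temp, dot_product = 0;
    int a[N / 2], b[N / 2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* run with exactly 2 processes */

    /* Each process holds one half of the vectors (filled with test data). */
    for (int k = 0; k < N / 2; k++) {
        a[k] = 1;
        b[k] = id * (N / 2) + k + 1;
    }

    /* Local dot product over this process's half. */
    for (int k = 0; k < N / 2; k++)
        dot_product += a[k] * b[k];

    if (id == 1) {
        MPI_Send(&dot_product, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&temp, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("dot_product = %d\n", dot_product + temp);   /* 500500 */
    }
    MPI_Finalize();
    return 0;
}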

59 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers [Book]
A Programmer's View
Performance Considerations (Amdahl's Law)

60 Speedup
Let T_P be the time needed to execute a program on P processors.
Speedup: S_P = T_1 / T_P
Ideal: S_P = P (linear speedup)
Usually: sublinear speedup, due to communication, the algorithm, etc.
Sometimes: superlinear speedup (e.g., when each processor's share of the data starts to fit in its cache)

61 Amdahl's law
Suppose a program has a parallelizable fraction f, and hence a sequential fraction 1-f.
Then (Amdahl's law):
S_P = T_1 / T_P = 1 / ((1-f) + f/P) = P / (P - f(P-1))
If f = 0.95 (95%), then:
S_16 = 16 / (16 - 0.95 x 15) ≈ 9.1
S_64 = 64 / (64 - 0.95 x 63) ≈ 15.4
S_1k ≈ 19.6, S_1M ≈ 20.0, S_100M < 20
Consequence: if 1-f is significant, T_P cannot approach 0 even as P → ∞.
[Figure: execution time split into a sequential part (1-f) and a parallel part (f/P)]
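The numbers above follow directly from the formula; a few lines of C reproduce them:

#include <stdio.h>

/* Amdahl's law: S_P = P / (P - f*(P-1)). */
double speedup(double f, double p) {
    return p / (p - f * (p - 1.0));
}

int main(void) {
    double f = 0.95;
    double ps[] = { 16, 64, 1e3, 1e6, 1e8 };
    for (int i = 0; i < 5; i++)
        printf("P = %.0e  S = %.2f\n", ps[i], speedup(f, ps[i]));
    /* Prints ~9.14, 15.42, 19.63, 20.00, 20.00: bounded by 1/(1-f) = 20. */
    return 0;
}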

