
1 Advanced computer systems (Chapter 12)
Alexandru Iosup (lecturer)
Parallel and Distributed Systems
Course website:

2 Large-Scale Computer Systems Today
Low-energy defibrillation
Saves lives; affects >2M people/year
Studies involving both laboratory experiments and computational simulation
Source: TeraGrid science highlights 2010

3 Large-Scale Computer Systems Today
Genome sequencing: may save lives; the $1,000 barrier; large-scale molecular dynamics simulations
Tectonic plate movement: adaptive fine-mesh simulations using 200,000 processors
Source: TeraGrid science highlights 2010

4 Large-Scale Computer Systems Today
Public Content Generation: Wikipedia
Affects how we think about collaborations
"The distribution of effort has increasingly become more uneven, unequal" (Sorin Adam Matei, Purdue University)
Source: TeraGrid science highlights 2010

5 Large-Scale Computer Systems Today
Online Gaming: World of Warcraft, Zynga
Affects >250M people
"As an organization, World of Warcraft utilizes 20,000 computer systems, 1.3 petabytes of storage, and more than 4600 people."
75,000 cores; upkeep: >$135,000/day (?)

6 Why parallelism (1/4)
Fundamental laws of nature. Examples:
channel widths are becoming so small that quantum properties are going to determine device behaviour
signal propagation time increases when channel widths shrink

7 Why parallelism (2/4)
Engineering constraints:
The phase transition time of a component is a good measure of the maximum obtainable computing speed.
Example: optical or superconducting devices can switch in about 10^-12 seconds.
Optimistic suggestion: 1 TIPS (Tera Instructions Per Second, 10^12 instructions/s) is possible.
However, we must calculate something: assume an instruction needs 10 phase transitions, giving 0.1 TIPS.

8 Why parallelism (3/4)
But what about memory?
It takes light approximately 16 picoseconds to cross 0.5 cm, yielding a possible execution rate of 60 GIPS.
However, in silicon, speed is about 10 times slower, resulting in 6 GIPS.

9 Why parallelism (4/4)
Speed of sequential computers is limited to a few GIPS.
Improvements by using parallelism:
multiple functional units (instruction-level parallelism)
multiple CPUs (parallel processing)

10 Quantum Computing?
"Qubits are quantum bits that can be in an 'on', 'off', or 'both' state due to fuzzy physics at the atomic level."
Does surrounding noise matter? (Wim van Dam, Nature Physics, 2007)
May 25, 2011: Lockheed Martin buys a D-Wave One (128 qubits) for $10M.

11 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

12 Classification of computers (Flynn taxonomy)
Single Instruction, Single Data (SISD): conventional system
Single Instruction, Multiple Data (SIMD): one instruction operates on multiple data objects
Multiple Instruction, Multiple Data (MIMD): multiple instruction streams on multiple data streams
Multiple Instruction, Single Data (MISD): ?????
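Of these, SIMD is the easiest to show in code. Below is a minimal C sketch using the GCC/Clang vector extension (my choice for illustration; the slide names no particular mechanism): the scalar loop is SISD-style, one data element per instruction, while the vector add applies one instruction to four data elements at once.

#include <stdio.h>

/* Four 32-bit ints packed into one 128-bit vector (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4];

    /* SISD style: one instruction stream, one data element at a time. */
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];

    /* SIMD style: one add applied to four data elements at once. */
    v4si va = {1, 2, 3, 4}, vb = {5, 6, 7, 8};
    v4si vc = va + vb;

    for (int i = 0; i < 4; i++)
        printf("%d %d\n", c[i], vc[i]);   /* identical results */
    return 0;
}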

13 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

14 SIMD (Array) Processors
[Figure: an Instruction Issuing Unit broadcasts each instruction (e.g., INCR) to an array of Processing Elements (PE)]
Examples: CM-2 ('87) and CM-5 ('91)
Peak: 28 GFLOPS; sustainable: 5-10% of peak

15 MIMD: Uniform Memory Access (UMA) architecture
Any processor can access directly any memory.
[Figure: processors P1..Pm and memory modules M1..Mk connected through an interconnection network: a UMA computer]

16 MIMD: NUMA architecture
Any processor can access directly any memory, but access times differ.
[Figure: processors P1..Pm, each with a nearby memory M1..Mm, connected through an interconnection network: a Non-Uniform Memory Access (NUMA) computer]
Realization in hardware or in software (distributed shared memory)

17 MIMD: Distributed memory architecture
Any processor can access any memory, but sometimes through another processor (via messages).
[Figure: processor-memory pairs (P1,M1)..(Pm,Mm) connected through an interconnection network]

18 Example 1: Graphics Processing Units (GPUs)
CPU versus GPU:
CPU: much cache and control logic
GPU: much compute logic

19 GPU Architecture
SIMD architecture: multiple SIMD units, SIMD pipelining
Simple processors; high branch penalty
Efficient operation on parallel data (regular streaming)

20 Example 2: Cell B.E.
Distributed memory architecture: 8 identical cores plus a PowerPC core

21 Example 3: Intel Quad-core
Shared Memory MIMD

22 Example 4: Large MIMD Clusters
BlueGene/L

23 Supercomputers Over Time

24 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks (I/O)
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

25 Interconnection networks (I/O between processors)
Difficulty in building systems with many processors: the interconnections.
Important parameters:
Diameter: maximal distance between any two processors
Degree: maximal number of connections per processor
Total number of connections (cost)
Bisection width: the minimum number of links that must be cut to split the network into two equal halves; it bounds the largest number of simultaneous messages between the halves

26 Multiple bus
[Figure: (multiple) bus structures, with processors and memories attached to Bus 1 and Bus 2]

27 Cross bar
An N x N cross-bar interconnection network requires N^2 switches.
Example: Sun E10000

28 Multi-stage networks (1/4)
[Figure: 8 modules with 3-bit ids (P0..P7); a path from P5 to P3 through the network]

29 Multi-stage networks (2/4)
[Figure: three-stage shuffle network (stage 1, stage 2, stage 3); the connections P4-P0 and P5-P3 both need the same link, so they block each other]
"Shuffle": like splitting a card deck into two halves and interleaving them.

30 Multi-stage network (3/4)
Multi-stage networks route in multiple steps. Example: the shuffle (omega) network.
Every processor is identified by a three-bit number (in general, an n-bit number).
A message from one processor to another contains the identifier of the destination.
Routing algorithm: in every stage, inspect one bit of the destination:
if 0: use the upper output
if 1: use the lower output
(A code sketch of this self-routing rule follows below.)
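To make the rule concrete, here is a small C sketch (my own illustration, not from the slides) that traces a message through an omega network with N = 2^n inputs: each stage first applies the perfect shuffle (a one-position left rotation of the n-bit address) and then uses the next destination bit, from most to least significant, to pick the upper (0) or lower (1) switch output.

#include <stdio.h>

/* Trace a message through an n-stage omega (shuffle) network.
   Returns the final position, which equals the destination. */
unsigned omega_route(unsigned src, unsigned dst, int n) {
    unsigned mask = (1u << n) - 1;
    unsigned pos = src;
    for (int stage = 0; stage < n; stage++) {
        /* Perfect shuffle: rotate the n-bit address one position left. */
        pos = ((pos << 1) | (pos >> (n - 1))) & mask;
        /* Inspect the next destination bit (MSB first):
           0 -> upper switch output, 1 -> lower switch output. */
        unsigned bit = (dst >> (n - 1 - stage)) & 1u;
        pos = (pos & ~1u) | bit;
        printf("after stage %d: at %u\n", stage + 1, pos);
    }
    return pos;
}

int main(void) {
    omega_route(5, 3, 3);   /* the slide's P5 -> P3 example: 5 -> 2 -> 5 -> 3 */
    return 0;
}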

31 Multi-stage network (4/4)
Properties. Let N = 2^n be the number of processing elements:
Number of stages: n = log2 N
Number of switches per stage: N/2
Total number of (2x2) switches: N(log2 N)/2
Not every pair of connections can be realized simultaneously: blocking

32 Hypercubes (1/3)
Non-uniform delay, so suited to NUMA architectures.
An n-dimensional hypercube connects 2^n PEs with n*2^(n-1) connections; the maximum distance is n hops; connected PEs differ in exactly 1 bit.
Routing:
- scan the destination bits from right to left
- if a bit differs, send to the neighbor that differs in exactly that bit
- repeat until the end
Example (n = 3): 000 -> 001 -> 011 -> 111.
[Figure: hypercubes for n = 2 (nodes 00, 01, 10, 11) and n = 3 (nodes 000..111)]
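The right-to-left rule translates directly into code. A minimal C sketch (my illustration; node ids are plain unsigned integers):

#include <stdio.h>

/* Route a message in an n-dimensional hypercube: scan the bits in which
   the current node and the destination differ, right to left, and hop to
   the neighbor that fixes each differing bit. At most n hops. */
void hypercube_route(unsigned src, unsigned dst, int n) {
    unsigned cur = src;
    printf("start at node %u\n", cur);
    for (int b = 0; b < n; b++) {
        if (((cur ^ dst) >> b) & 1u) {  /* does bit b differ? */
            cur ^= 1u << b;             /* hop to the neighbor differing in bit b */
            printf("hop to node %u\n", cur);
        }
    }
}

int main(void) {
    hypercube_route(0u, 7u, 3);   /* the slide's example: 000 -> 001 -> 011 -> 111 */
    return 0;
}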

33 Hypercubes (2/3)
Question: what is the average distance between two nodes in a hypercube?
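One way to answer (added here; the slide leaves it as an exercise): pick the two nodes independently and uniformly at random. Each of the n address bits then differs with probability 1/2, and the distance is the number of differing bits, so

E[d] \;=\; \sum_{k=0}^{n} k \binom{n}{k} 2^{-n} \;=\; \frac{n}{2}

(If the two nodes must be distinct, the average becomes (n/2) * 2^n / (2^n - 1), slightly above n/2.)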

34 Mesh Constant number of connections per node

35 Torus mesh with wrap-around connections

36 Tree

37 Fat tree Nodes have multiple parents

38 Local networks
Ethernet: based on collision detection
upon collision, back off and randomly try later
speed up to 100 Gb/s (Terabit Ethernet?)
Token ring: based on token circulation on a ring
possession of the token allows putting a message on the ring

39 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

40 Memory organization (1/2)
UMA architectures.
[Figure: each node contains a Processor, Secondary Cache, and Network Interface, connected to the network]

41 Memory organization (2/2)
NUMA architectures.
[Figure: each node contains a Processor, Secondary Cache, Local Memory, and Network Interface, connected to the network]

42 Cache coherence
Problem: caches in multiprocessors may hold copies of the same variable, and the copies must be kept identical.
Cache coherence: all copies of a shared variable have the same value.
Solutions:
write through to shared memory and to all caches
invalidate the cache entries in all other caches
Snoopy caches: processing elements observe (snoop) writes on the bus and update or invalidate their own copies.

43 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers
A Programmer's View
Performance Considerations

44 Parallelism
Language construct for specifying parallel tasks:

PARBEGIN
  task_1;
  task_2;
  ....
  task_n;
PAREND

[Figure: control forks at PARBEGIN into task 1 .. task n and joins again at PAREND]
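PARBEGIN/PAREND is a language construct; in C the same fork-join pattern is typically expressed with pthreads. A minimal sketch (the task bodies are placeholders):

#include <pthread.h>
#include <stdio.h>

/* Placeholder task bodies standing in for task_1 .. task_n. */
void *task_1(void *arg) { printf("task 1\n"); return NULL; }
void *task_2(void *arg) { printf("task 2\n"); return NULL; }

int main(void) {
    pthread_t t1, t2;
    /* PARBEGIN: fork the tasks so they may run in parallel. */
    pthread_create(&t1, NULL, task_1, NULL);
    pthread_create(&t2, NULL, task_2, NULL);
    /* PAREND: wait until all tasks have completed. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}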

45 Shared variables (1/4)
[Figure: tasks T1 (Task_1) and T2 (Task_2) both execute STW R2, SUM(0), storing to the shared variable SUM in shared memory]

46 Shared variables (2/4)
Suppose processors 1 and 2 both execute:

LW A,R0    /* A is a variable in main memory */
ADD R1,R0
STW R0,A

Initially: A = 100; R1 in processor 1 is 20; R1 in processor 2 is 40.
What is the final value of A? 120, 140, 160?
Now consider that the final value of A is your bank account balance.
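The lost update can be reproduced in C. This is a deliberately exaggerated sketch of the LW/ADD/STW sequence (real hardware interleaves less predictably; sched_yield() just widens the race window):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int A = 100;   /* the shared "bank account balance" */

void *deposit(void *arg) {
    int amount = *(int *)arg;
    int r0 = A;        /* LW  A,R0  */
    sched_yield();     /* invite an interleaving here */
    r0 += amount;      /* ADD R1,R0 */
    A = r0;            /* STW R0,A  */
    return NULL;
}

int main(void) {
    int r1_p1 = 20, r1_p2 = 40;
    pthread_t p1, p2;
    pthread_create(&p1, NULL, deposit, &r1_p1);
    pthread_create(&p2, NULL, deposit, &r1_p2);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("A = %d\n", A);   /* may print 120, 140, or 160 */
    return 0;
}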

47 Shared variables (3/4)
So there is a need for mutual exclusion: different components of the same program need exclusive access to a data structure to ensure consistent values.
This occurs in many situations:
access to shared variables
access to a printer
A solution: a single instruction (Test&Set) that tests whether somebody else is accessing the variable:
if so, continue testing (busy waiting)
if not, mark the variable as in use and proceed

48 Shared variables (4/4)
Task_1 and Task_2 both bracket their update of the shared variable SUM with the LOCK variable:

crit: T&S LOCK,crit
      ......
      STW R2, SUM(0)
      .....
      CLR LOCK

[Figure: T1 and T2 run this code; SUM and LOCK live in shared memory]
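C11 exposes a test-and-set primitive directly as atomic_flag, so the busy-waiting lock from the slide can be sketched portably:

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* the LOCK variable, initially clear */

void acquire(void) {
    /* T&S LOCK,crit: test and set in one atomic step; spin while it was already set. */
    while (atomic_flag_test_and_set(&lock_flag))
        ;   /* busy waiting */
}

void release(void) {
    atomic_flag_clear(&lock_flag);   /* CLR LOCK */
}

Code placed between acquire() and release() then corresponds to the critical section around STW R2, SUM(0).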

49 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers [earlier, see Token Ring et al.]
A Programmer's View
Performance Considerations

50 Example program
Compute the dot product of two vectors with:
a sequential program
two tasks with shared memory
two tasks with distributed memory, using messages
Primitives in parallel programs:
create_thread() (create a (sub)process)
mypid() (who am I?)

51 Sequential program
integer array a[1..N], b[1..N]
integer dot_product

dot_product := 0
do_dot(a,b)
print dot_product

procedure do_dot(integer array x[1..N], integer array y[1..N])
  for k := 1 to N
    dot_product := dot_product + x[k]*y[k]
  end

52 Shared memory program 1 (1/2)
shared integer array a[1..N], b[1..N]
shared integer dot_product
shared lock dot_product_lock
shared barrier done

dot_product := 0;
create_thread(do_dot, a, b)
do_dot(a,b)
print dot_product

[Figure: threads id=0 and id=1 both update dot_product, then meet at the barrier]

53 Shared memory program 1 (2/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id
  id := mypid();                      /* who am I? */
  for k := (id*N/2)+1 to (id+1)*N/2   /* k in [1..N/2] or [N/2+1..N] */
    lock(dot_product_lock)            /* critical section */
    dot_product := dot_product + x[k]*y[k]
    unlock(dot_product_lock)
  end
  barrier(done)

54 Shared memory program 2 (1/2)
procedure do_dot(integer array x[1..N], integer array y[1..N])
  private integer id, local_dot_product
  id := mypid();
  local_dot_product := 0;
  for k := (id*N/2)+1 to (id+1)*N/2
    local_dot_product := local_dot_product + x[k]*y[k]   /* local computation (can execute in parallel) */
  end
  lock(dot_product_lock)
  dot_product := dot_product + local_dot_product         /* access shared variable (mutex) */
  unlock(dot_product_lock)
  barrier(done)

(A runnable C translation follows below.)
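Here is a C/pthreads rendering of program 2; a sketch under my own assumptions (N divisible by the number of threads, arrays filled with test values, and pthread_join standing in for the barrier):

#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 2

static int a[N + 1], b[N + 1];          /* 1-indexed, as in the pseudocode */
static long dot_product = 0;
static pthread_mutex_t dot_product_lock = PTHREAD_MUTEX_INITIALIZER;

static void *do_dot(void *arg) {
    int id = *(int *)arg;
    long local_dot_product = 0;
    /* Local computation: can execute in parallel, no locking needed. */
    for (int k = id * N / NTHREADS + 1; k <= (id + 1) * N / NTHREADS; k++)
        local_dot_product += (long)a[k] * b[k];
    /* Access the shared variable once, under mutual exclusion. */
    pthread_mutex_lock(&dot_product_lock);
    dot_product += local_dot_product;
    pthread_mutex_unlock(&dot_product_lock);
    return NULL;
}

int main(void) {
    for (int k = 1; k <= N; k++) { a[k] = 1; b[k] = k; }
    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, do_dot, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);       /* plays the role of barrier(done) */
    printf("dot_product = %ld\n", dot_product);   /* 1+2+...+N = 500500 */
    return 0;
}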

55 Shared memory program 2 (2/2)
[Figure: threads id=0 and id=1 each accumulate a private local_dot_product, then combine them into the shared dot_product and meet at the barrier]

56 Message passing program (1/3)
integer array a[1..N/2], temp_a[1..N/2], b[1..N/2], temp_b[1..N/2]
integer dot_product, id, temp

id := mypid();
if (id = 0) then
  send(temp_a[1..N/2], 1);    /* send second halves */
  send(temp_b[1..N/2], 1);    /* of the two arrays to 1 */
else
  receive(a[1..N/2], 0);      /* receive second halves of */
  receive(b[1..N/2], 0);      /* the two arrays from proc. 0 */
end

57 Message passing program (2/3)
dot_product := 0;
do_dot(a,b)                   /* arrays of length N/2 */
if (id = 1)
  send(dot_product, 0)
else
  receive(temp, 1)
  dot_product := dot_product + temp
  print dot_product
end

58 Message passing program (3/3)
[Figure: process id=0 sends the data (temp_a/b) to id=1; each process computes a local_dot_product; id=1 sends its result back and id=0 combines it into the final dot_product]
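For comparison, a hedged MPI sketch of the message-passing version (my assumptions: exactly 2 processes, N even, and each process initializing its own half of the data with test values instead of receiving it from process 0):

#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    int id, temp, dot_product = 0;
    int a[N / 2], b[N / 2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* run with exactly 2 processes */

    /* Each process holds one half of the vectors (filled with test data). */
    for (int k = 0; k < N / 2; k++) {
        a[k] = 1;
        b[k] = id * (N / 2) + k + 1;
    }

    /* Local dot product over this process's half. */
    for (int k = 0; k < N / 2; k++)
        dot_product += a[k] * b[k];

    if (id == 1) {
        MPI_Send(&dot_product, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&temp, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("dot_product = %d\n", dot_product + temp);   /* 500500 */
    }
    MPI_Finalize();
    return 0;
}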

59 Agenda
Introduction
The Flynn Classification of Computers
Types of Multi-Processors
Interconnection Networks
Memory Organization in Multi-Processors
Program Parallelism and Shared Variables
Multi-Computers [Book]
A Programmer's View
Performance Considerations (Amdahl's Law)

60 Speedup
Let T_P be the time needed to execute a program on P processors.
Speedup: S_P = T_1 / T_P
Ideal: S_P = P (linear speedup)
Usually: sublinear speedup, due to communication, the algorithm, etc.
Sometimes: superlinear speedup (e.g., when each processor's share of the data starts to fit in its cache)

61 Amdahl's law
Suppose a program has a parallelizable fraction f, and hence a sequential fraction 1-f.
Then (Amdahl's law):
S_P = T_1 / T_P = 1 / ((1-f) + f/P) = P / (P - f(P-1))
If f = 0.95 (95%), then:
S_16 = 16 / (16 - 0.95 x 15) ≈ 9.1
S_64 = 64 / (64 - 0.95 x 63) ≈ 15.4
S_1k ≈ 19.6, S_1M ≈ 20.0, S_100M < 20
Consequence: if 1-f is significant, T_P cannot approach 0 even as P → ∞.
[Figure: execution time split into a sequential part (1-f) and a parallel part (f/P)]
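The numbers above follow directly from the formula; a few lines of C reproduce them:

#include <stdio.h>

/* Amdahl's law: S_P = P / (P - f*(P-1)). */
double speedup(double f, double p) {
    return p / (p - f * (p - 1.0));
}

int main(void) {
    double f = 0.95;
    double ps[] = { 16, 64, 1e3, 1e6, 1e8 };
    for (int i = 0; i < 5; i++)
        printf("P = %.0e  S = %.2f\n", ps[i], speedup(f, ps[i]));
    /* Prints ~9.14, 15.42, 19.63, 20.00, 20.00: bounded by 1/(1-f) = 20. */
    return 0;
}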

