Higher Level Parallelism


1 Higher Level Parallelism
The PRAM Model
Vector Processors
Flynn Classification
Connection Machine CM-2 (SIMD)
Communication Networks
Memory Architectures
Synchronization

2 Amdahl's Law
The performance gain from speeding up some operations is limited by the fraction of the time these (faster) operations are used.
Speedup = Original T / Improved T
Speedup = Improved Performance / Original Performance
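A quick illustrative check (the numbers are mine, not the slide's): if a fraction f of the original time is improved by a factor s, the overall speedup is 1 / ((1 - f) + f/s). A minimal C sketch:

    #include <stdio.h>

    /* Amdahl's law: overall speedup when a fraction f of the
       original time is improved by a factor s. */
    double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        printf("%.2f\n", amdahl(0.8, 10.0));  /* 80% sped up 10x: ~3.57 */
        return 0;
    }

Even a 10x improvement of 80% of the time yields less than 4x overall; the untouched 20% dominates.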

3 PRAM MODEL
All processors share the same memory space.
CRCW: concurrent read, concurrent write; a resolution function is applied on write collision (first/or/largest/error).
CREW: concurrent read, exclusive write.
EREW: exclusive read, exclusive write.

4 PRAM Algorithm
Same program/algorithm in all processors.
Each processor also has local memory/registers.
Example: search for one value in an array of size m, using p processors, with p = m.
Search for the value 2 in the array: 3 2 5 7 2 5 1 6

5 Search CRCW, p = m
Step 1: concurrent read of A. The same memory cell (holding the search value 2) is accessed by all processors, so every Pi gets A = 2.
Step 2: read B. Each processor reads a different memory address, so P1..P8 get B = 3 2 5 7 2 5 1 6 respectively.

6 Search CRCW, p = m
Step 3: concurrent write. Each Pi writes 1 if A = B, else 0, all into the same result cell, using the "or" resolution (1: value found, 0: value not found).
Complexity: all operations are performed in constant time; we count only the cost of the communication steps. Here the number of steps is independent of m (given enough processors), so the search is done in constant time: O(1) for CRCW with p = m. A sequential sketch of the three steps follows.
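A minimal sequential C simulation of the three CRCW steps (the loop body plays the role of one processor Pi; names are mine):

    #include <stdio.h>

    int crcw_search(const int *B, int m, int key) {
        int found = 0;                /* the single shared result cell */
        for (int i = 0; i < m; i++) { /* one iteration per processor Pi */
            int A = key;              /* step 1: concurrent read of the key */
            int b = B[i];             /* step 2: read own element B[i] */
            found |= (A == b);        /* step 3: concurrent write, "or" resolution */
        }
        return found;                 /* 1: value found, 0: not found */
    }

    int main(void) {
        int B[8] = {3, 2, 5, 7, 2, 5, 1, 6};
        printf("%d\n", crcw_search(B, 8, 2));  /* prints 1 */
        return 0;
    }

On a real CRCW PRAM all m iterations happen in the same time step.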

7 Search CREW, p = m
Step 3: each Pi computes 1 if A = B, else 0 (exclusive write: one result cell per processor).
The same processors can be reused in the next step!
Step 4.1: read A (one partial result). Step 4.2: read B (another partial result). Step 4.3: compute A or B.
Each round halves the number of partial results, so we need log m steps to "collect" the result.
Complexity: the operations themselves are done in constant time, giving O(log m) complexity. A sketch of the collect phase follows.
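A sequential C sketch of the collect phase (C[i] stands for processor Pi's result cell; the simulation is mine):

    /* After step 3 each Pi holds C[i] = (A == B[i]). Halve the number
       of active processors each round: Pi ORs in the cell of P(i+half).
       Every cell is read and written by at most one processor per round,
       so writes stay exclusive. Takes log2(m) rounds. */
    void crew_collect(int *C, int m) {
        for (int half = m / 2; half >= 1; half /= 2)
            for (int i = 0; i < half; i++)
                C[i] = C[i] | C[i + half];
        /* C[0] now holds 1 if the value was found, else 0 */
    }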

8 Search EREW, p = m
With exclusive reads the search value cannot be read by everyone at once: it takes log m steps to distribute the value 2 to all processors, doubling the number of copies each step (see the sketch below).
More complex? NO, the algorithm is still O(log m); only the constant differs.
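A sequential C sketch of the distribution phase (again, the inner loop body is one processor's work per round):

    /* EREW broadcast: the value starts in one cell; each round doubles
       the number of copies, and every cell is read or written by at
       most one processor, so reads and writes stay exclusive.
       Takes log2(m) rounds. */
    void erew_distribute(int *A, int m, int value) {
        A[0] = value;
        for (int have = 1; have < m; have *= 2)
            for (int i = 0; i < have && i + have < m; i++)
                A[i + have] = A[i];
    }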

9 PRAM, a Theoretical Model
CRCW: very elegant, but not of much practical use (too hard to implement).
CREW: this model can be used to develop algorithms for parallel computers, e.g. our search example:
p = 1 (a single processor): checking all elements gives O(m).
p = m (m processors): complexity O(log m), not O(1).
From our example we conclude that even in theory we do not get an m-times "speedup" using m processors.
THAT IS ONE BIG PROBLEM WITH PARALLEL COMPUTERS.

10 Parallelism so far
By pipelining, several instructions (at different stages) are executed simultaneously; pipeline depth is limited by hazards.
SuperScalar designs provide parallel execution units, limited by the available instruction- and machine-level parallelism.
VLIW might improve over hardware instruction issuing.
All are limited by the instruction fetch mechanism, called the FLYNN BOTTLENECK: only a very limited number of instructions can be fetched each cycle, which makes vector operations expressed as ordinary instruction loops ineffective.

11 Vector Processors
Take pipelining to its limits for vector operations; sometimes referred to as a SuperPipeline.
The same operation is performed on a vector of data, with no data dependencies within the vector (e.g., adding two vectors).
Solves the FLYNN BOTTLENECK problem: a loop over a vector can be issued by a single instruction (see the sketch below).
Proven to be very effective for scientific calculations: CRAY-1, CRAY-2, CRAY-XMP, CRAY-YMP.
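For concreteness, the kind of loop a vector processor collapses (a C sketch; the loop form is mine):

    /* On a scalar machine this loop costs a handful of instructions per
       iteration, each re-fetched every cycle (the Flynn bottleneck).
       A vector processor issues roughly: two vector loads, one vector
       add, one vector store, each covering a whole register's worth of
       elements. */
    void vadd(double *C, const double *A, const double *B, int m) {
        for (int i = 0; i < m; i++)
            C[i] = A[i] + B[i];   /* no dependence between iterations */
    }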

12 Vector Processor (CRAY-1 like)
(Block diagram: main memory feeds a set of vector registers and a set of scalar registers (like the MIPS register file); superpipelined arithmetic units for FP add/subtract, FP multiply, FP divide, integer, and logical operations; a vector load/store unit moves data between the vector registers and main memory.)

13 Vector Operations
Fully pipelined: CPI = 1; we produce one result each cycle once the pipe is full.
Pipeline latency: startup cost = pipeline depth. Vector add: 6 cycles. Vector multiplication: 6 cycles. Vector divide: 20 cycles. Vector load: 12 cycles (depends on the memory hierarchy).
Sustained rate: time per element for a collection of related vector operations.
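An illustrative calculation with the latencies above: a 64-element vector add costs roughly 6 (startup) + 64 = 70 cycles, about 1.1 cycles per element; the longer the vector, the closer the sustained rate gets to the ideal CPI of 1.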

14 Vector Processor Design
Vector length control: the VLR register (up to the Maximum Vector Length, MVL); strip mining in software (a vector longer than MVL causes a loop; see the sketch below).
Stride: how to lay out vectors and matrices in memory so that the memory banks can be accessed without collision.
Vector chaining: forwarding between vector registers (minimizes latency).
Vector mask register (Boolean valued): conditional writeback (if 0, no writeback); used for sparse matrices and conditional execution.
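A C sketch of strip mining (MVL = 64 is an assumed value; on a real machine the hardware MVL and the VLR register are used):

    #define MVL 64   /* assumed maximum vector length */

    /* Process n elements in strips of at most MVL; each inner strip
       corresponds to one set of vector instructions with VLR = len.
       The last strip handles the leftover n % MVL elements. */
    void vadd_strip(double *C, const double *A, const double *B, int n) {
        for (int low = 0; low < n; ) {
            int len = (n - low < MVL) ? n - low : MVL;
            for (int i = low; i < low + len; i++)
                C[i] = A[i] + B[i];
            low += len;
        }
    }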

15 Programming
By use of language constructs, the compiler is able to utilize the vector functions.
FORTRAN is widely used for scientific calculations and has built-in matrix and vector functions/commands.
LINPACK: a library of optimized linear algebra functions, often used as a benchmark (but does it tell the whole truth?).
Some more (implicit) vectorization is possible with advanced compilers (see the sketch below).
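A sketch of what the compiler looks for (illustrative C; FORTRAN compilers apply the same dependence analysis):

    /* Vectorizable: iteration i touches only its own elements. */
    void saxpy(float a, const float *x, float *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* independent iterations */
    }

    /* Not vectorizable as written: iteration i needs y[i-1]. */
    void prefix(const float *x, float *y, int n) {
        for (int i = 1; i < n; i++)
            y[i] = y[i - 1] + x[i];   /* loop-carried dependence */
    }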

16 Flynn Classification
SISD (Single Instruction, Single Data): the MIPS, and even the vector processor.
SIMD (Single Instruction, Multiple Data): each instruction activates several execution units in parallel.
MISD (Multiple Instruction, Single Data): the VLIW architecture might be considered, but... MISD is a seldom-used classification.
MIMD (Multiple Instruction, Multiple Data): multiprocessor architectures; multicomputers (communicating over a LAN) are sometimes treated as a separate class of architectures.

17 Communication
Bus: total bandwidth = link bandwidth; bisection bandwidth = link bandwidth.
Ring: total bandwidth = P * link bandwidth; bisection bandwidth = 2 * link bandwidth.
Fully connected: total bandwidth = (P * (P - 1) / 2) * link bandwidth; bisection bandwidth = (P/2)^2 * link bandwidth.
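A quick check with P = 8 (in units of one link bandwidth): bus: total 1, bisection 1; ring: total 8, bisection 2; fully connected: total 8 * 7 / 2 = 28, bisection (8/2)^2 = 16.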

18 MultiStage Networks
Omega network: log2(P) stages of 2x2 switches (P/2 switches per stage); cheaper than a crossbar but blocking, e.g. P1 to P6 can be routed, but P2 to P8 is not possible at the same time.
Crossbar switch: any processor can be connected to any other, and all one-to-one connections can be routed simultaneously.
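A sketch of the standard destination-tag routing rule for an omega network (at each stage the packet leaves through the upper or lower switch output according to the next bit of the destination address, most significant bit first):

    #include <stdio.h>

    /* Print the switch output taken at each of the log2(P) stages on
       the way to processor 'dest'. For P = 8, stages = 3. */
    void omega_route(unsigned dest, int stages) {
        for (int s = stages - 1; s >= 0; s--)
            printf("stage %d: %s\n", stages - s,
                   ((dest >> s) & 1u) ? "lower" : "upper");
    }

Two packets block each other exactly when their tag bits demand the same output of the same switch, which is why connections such as P1 to P6 and P2 to P8 can collide.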

19 Connection Machine CM-2 (SIMD)
16 fully connected 1-bit CPUs on each chip; each CPU has 3 one-bit registers and 64 kbit of memory.
The chips communicate over a hypercube (cf. a 3-cube); the CM-2 uses a 12-cube for communication between the chips.
(Diagram: four sections, each with 1024 chips = 16k 1-bit CPUs plus 512 FPAs (floating-point accelerators); a front-end SISD computer drives the sequencer; a Data Vault (disk array) provides I/O.)

20 SIMD Programming, Parallel sum
    for (i = 0; i < 65536; i = i + 1)     /* loop over the 64k elements */
        sum = sum + A[Pn, i];             /* Pn is the processor number */

    limit = 8192;                         /* collect sum from 8192 processors */
    half = limit;
    repeat
        half = half / 2;                  /* split into senders/receivers */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);         /* upper half sends its sum */
        if (Pn < half)
            sum = sum + receive();        /* lower half receives and adds */
        limit = half;
    until (half == 1);                    /* P0 holds the final sum */

(Trace for four processors: with half = 2, P3 does send(1, sum) and P2 does send(0, sum); with half = 1, P1 does send(0, sum); after each send the receiver does sum = sum + R, leaving the final sum in P0.)

21 SIMD vs MIMD
SIMD: single instruction stream (one PC); all processors perform the same work (synchronized). Conditional execution (case/if etc.): each processor holds an enable bit (see the sketch below).
MIMD: each processor has its own PC, so it is possible to run different programs. BUT all may run the same program (SPMD, Single Program Multiple Data).
Use MIMD-style programming for conditional execution; use SIMD-style programming for synchronized actions.
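A sequential C sketch of enable-bit (masked) execution (the names and the simulation are mine):

    /* How a SIMD machine runs: if (A[i] > 0) B[i] = 1; else B[i] = -1;
       All PEs step through BOTH arms in lockstep; the per-PE enable bit
       masks the writes. Here each loop iteration models one PE. */
    void simd_conditional(const int *A, int *B, int n) {
        for (int i = 0; i < n; i++) {
            int enable = (A[i] > 0);   /* set enable bit from condition */
            if (enable)  B[i] = 1;     /* arm 1: enabled PEs write */
            if (!enable) B[i] = -1;    /* arm 2: disabled PEs write */
        }
    }

Note that both arms always cost time on a SIMD machine, even if only a few PEs are enabled.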

22 Memory Architectures for MIMD
Centralized: a single bus serves all of main memory; uniform memory access (after passing the local cache).
Distributed: the sought address might be hosted by another processor; non-uniform memory access (dynamic "find" time). The extreme: a cache-only memory.
Shared: all processors share the same address space; memory can be used for communication.
Private: each processor has a unique address space; communication must be done by "message passing".

23 Shared Bus MIMD
Usually 2-32 processors, each with a cache and a snoop tag, connected by a single bus to memory and I/O.
Cache coherency protocol:
Write invalidate: the first write to address A causes all other cached copies of A to be invalidated.
Write update: on a write to address A, all cached copies of A are updated (high bus activity).
On a cache read miss when using write-back caches, either the cache holding the valid data writes it to memory, or the cache holding the valid data writes it directly to the cache requiring the data.
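A concrete (illustrative) sequence under write invalidate: P1 and P2 both read A and cache it; P1 writes A, and the snooping caches see the bus transaction and invalidate their copies; P2's next read of A misses, and the valid data is supplied as described above (written back to memory, or sent cache-to-cache).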

24 Synchronization
When using shared data we must ensure that only one processor can access the data while updating it. We need an atomic operation such as TEST&SET. Both processors run the same code:

    loop: TEST&SET A.lock      ; atomically read the lock and set it
          beq A.go, loop       ; spin back if the lock was already taken
          update A             ; critical section: update the shared data
          clear A.lock         ; release the lock

Processor 1 gets the lock (A.go), updates the shared data, and finally clears the lock (A.lock). Processor 2 spin-waits until the lock is released, then updates the shared data and releases the lock.
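The same spin lock in modern terms, as a minimal C11 sketch (atomic_flag_test_and_set is the TEST&SET primitive; the variable names are mine):

    #include <stdatomic.h>

    static atomic_flag A_lock = ATOMIC_FLAG_INIT;
    static int A;                       /* the shared data */

    void update_A(int value) {
        while (atomic_flag_test_and_set(&A_lock))
            ;                           /* spin-wait: lock already taken */
        A = value;                      /* critical section */
        atomic_flag_clear(&A_lock);     /* release the lock */
    }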

