Slide 1: Lecture 19: Beyond Low-level Parallelism
© Wen-mei Hwu and S. J. Patel, 2002, ECE 412, University of Illinois

Slide 2: Outline
– Models for exploiting large-grained parallelism
– Issues in parallelization: communication, synchronization, Amdahl's Law
– Basic single-bus shared memory machines

Slide 3: Dimension 1: Granularity of Concurrency
– Intra-instruction (pipelining)
– Parallel instructions (wide issue)
– Loop-level parallelism (blocks, loops)
– Algorithmic thread parallelism (loops, plus more)
– Concurrent programs (multiple programs)

Slide 4: Dimension 2: Flynn's Classification
– Single Instruction (stream), Single Data (stream) (SISD): simple CPU (1-wide pipeline)
– Single Instruction, Multiple Data (SIMD): vector computer, multimedia extensions
– Multiple Instruction, Single Data (MISD): ???
– Multiple Instruction, Multiple Data (MIMD): multiprocessor (now also, multithreaded CPU)

Slide 5: Dimension 3: Parallel Programming Models
Data Parallel Model:
– The same computation is applied to a large data set. All such applications of the computation can be done in parallel.
– E.g. matrix multiplication: all elements of the product array can be generated in parallel.
Function Parallel Model:
– A program is partitioned into parallel modules.
– E.g. a compiler: phases of a compiler can be made into parallel modules: parser, semantic analyzer, optimizer 1, optimizer 2, and code generator. They can be organized into a high-level pipeline.
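To make the data parallel model concrete, here is a minimal C++ sketch (not from the slides; the matrix type, names, and row-blocking choice are assumptions): every element of the product A = B*C is independent, so blocks of rows are handed to parallel threads.

    #include <algorithm>
    #include <thread>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // Compute rows [row_begin, row_end) of A = B*C.
    void matmul_rows(const Matrix& B, const Matrix& C, Matrix& A,
                     int row_begin, int row_end) {
      const int n = static_cast<int>(A.size());
      for (int i = row_begin; i < row_end; ++i)
        for (int j = 0; j < n; ++j) {
          A[i][j] = 0.0;
          for (int k = 0; k < n; ++k)
            A[i][j] += B[i][k] * C[k][j];
        }
    }

    // Each thread produces its own block of rows; no result element is
    // touched by more than one thread, so no synchronization is needed
    // until the final join.
    void parallel_matmul(const Matrix& B, const Matrix& C, Matrix& A,
                         int num_threads) {
      const int n = static_cast<int>(A.size());
      const int chunk = (n + num_threads - 1) / num_threads;
      std::vector<std::thread> workers;
      for (int t = 0; t < num_threads; ++t) {
        const int begin = t * chunk;
        const int end = std::min(n, begin + chunk);
        if (begin < end)
          workers.emplace_back([&, begin, end] { matmul_rows(B, C, A, begin, end); });
      }
      for (auto& w : workers) w.join();   // wait for every row block
    }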

Slide 6: Dimension 4: Memory Models
How processors view each other's memory space.
[Diagram: several CPU/memory nodes connected by an interconnect, contrasting the Shared Memory Model (amenable to a single address space) with the Message Passing Model (amenable to multiple address spaces).]

Slide 7: Dimension 4: Memory Models
Shared memory:
– All processors have common access to a shared memory space. When a program accesses data, it just gives the memory system an address for the data.

    DO I = 1, N
      DO J = 1, N
        A(I,J) = 0
        DO K = 1, N
          A(I,J) = A(I,J) + B(I,K)*C(K,J)
        END DO
      END DO
    END DO

Slide 8: Memory Architecture (cont.)
Message passing (or distributed memory):
– Each processor has a local memory. For a processor to access a piece of data not present in its own local memory, the processor has to exchange messages with the processor whose local memory contains the data. For each data access, the programmer must determine whether the thread is executed by the processor whose local memory contains the data.
– In general, the programmer has to partition the data and assign them explicitly to the processor local memories.
– The programmer then parallelizes the program and assigns the threads to processors. Using the assignment decisions, each access is then determined to be either a local memory access or a message exchange.
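As one illustration of the message passing model, here is a short sketch using MPI (one concrete message-passing interface; the slides do not name one, and the data partitioning shown is assumed). Each process works only on its local slice, and the partial results are combined at processor 0 through an explicit collective message exchange.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank = 0, nprocs = 1;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      // Each process owns only its local partition of the data (assumed to be
      // filled in elsewhere); remote data is reachable only by messages.
      std::vector<double> local(100000, 1.0);
      double partial = 0.0;
      for (double x : local) partial += x;

      // Explicit communication: the partial sums travel to process 0.
      double total = 0.0;
      MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0) std::printf("global sum = %f\n", total);

      MPI_Finalize();
      return 0;
    }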

Slide 9: A Simple Problem
Program to find the minimum of a large set of input data:

    min = D[0];
    for (i = 0; i < n; i++) {
      if (D[i] < min) min = D[i];
    }
    cout << min;

Slide 10: Simple Problem, Data Parallelized

    // Each processor p# scans its own chunk of 100000 elements.
    pstart = p# * 100000;
    min[p#] = D[0];                  // seed with an element of the data set
    for (i = pstart; i < pstart+100000; i++) {
      if (D[i] < min[p#]) min[p#] = D[i];
    }
    barrier();                       // wait until every processor has its local min
    if (p# == 0) {                   // processor 0 combines the per-processor minima
      real_min = min[0];
      for (i = 1; i < pMax; i++) {
        if (min[i] < real_min) real_min = min[i];
      }
      cout << real_min;
    }

Slide 11: Loop Level Parallelization
DOALL: there are no dependences between loop iterations, so all iterations can be executed in parallel.

    DOALL 30 J = 1, J1
      X(II1+J) = X(II1+J) * SC1
      Y(II1+J) = Y(II1+J) * SC1
      Z(II1+J) = Z(II1+J) * SC1
    END DO
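On a shared-memory machine a DOALL loop like the one above is commonly written with a parallel-loop directive. Below is a hedged C++/OpenMP rendering (OpenMP, the function name, and the variable names are assumptions, not something the slide prescribes):

    #include <vector>

    // The DOALL loop above, expressed as an OpenMP parallel loop.
    void scale_points(std::vector<double>& x, std::vector<double>& y,
                      std::vector<double>& z, int ii1, int j1, double sc1) {
      #pragma omp parallel for
      for (int j = 1; j <= j1; ++j) {
        x[ii1 + j] *= sc1;   // no iteration reads what another writes,
        y[ii1 + j] *= sc1;   // so iterations may run on any processor
        z[ii1 + j] *= sc1;
      }
    }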

Slide 12: Loop Level Parallelization
DOACROSS: there are dependences between loop iterations. The dependences are enforced by two synchronization constructs:
– advance(synch_pt): signals that the current iteration has passed the synchronization point identified by synch_pt.
– await(synch_pt, depend_distance): forces the execution of the current iteration to wait for a previous iteration to pass the synchronization point identified by synch_pt. That iteration is the current iteration number minus depend_distance.

Slide 13: DOACROSS Example

    DOACROSS 40 I = 4, IL
      AWAIT(1, 3)
      X(I) = Y(I) + X(I-3)
      ADVANCE(1)
    END DO

The current iteration depends on synchronization point 1 of the iteration three iterations earlier.
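One way to picture how AWAIT/ADVANCE could be realized on a shared-memory machine is with an atomic completion flag per iteration. The sketch below is an assumption for illustration (the slides treat the constructs abstractly); the loop bound and dependence distance are taken from the example, and iterations are assumed to be handed to threads elsewhere.

    #include <atomic>
    #include <thread>
    #include <vector>

    constexpr int LO   = 4;   // loop lower bound from the example (I = 4, IL)
    constexpr int DIST = 3;   // dependence distance from AWAIT(1, 3)

    // One completion flag per iteration of synchronization point 1.
    struct SyncPoint {
      explicit SyncPoint(int n) : done(n) {
        for (auto& f : done) f.store(0);
      }
      void await(int i, int dist) {
        const int producer = i - dist;
        if (producer < LO) return;                        // value predates the loop
        while (done[producer].load(std::memory_order_acquire) == 0)
          std::this_thread::yield();                      // wait for iteration i - dist
      }
      void advance(int i) { done[i].store(1, std::memory_order_release); }
      std::vector<std::atomic<int>> done;
    };

    // Body of one DOACROSS iteration.
    void iteration(SyncPoint& sp, std::vector<double>& X,
                   const std::vector<double>& Y, int i) {
      sp.await(i, DIST);          // AWAIT(1, 3)
      X[i] = Y[i] + X[i - DIST];
      sp.advance(i);              // ADVANCE(1)
    }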

Slide 14: Issue 1: Communication
The objective is to write parallel programs that can use as many processors as are available.
Information needs to be exchanged between processors:
– The min[] values need to be communicated to processor 0 in the example.
Parallelization is limited by communication:
– in terms of latency,
– in terms of the amount of communication;
– at some point, the communication overhead will outweigh the benefit of parallelism.

Slide 15: Issue 2: Synchronization
Often, parallel tasks need to coordinate and wait for each other.
– Example: the barrier in the parallel min algorithm.
The art of parallel programming involves minimizing synchronization.

Slide 16: Mutual Exclusion
Atomic update to data structures, so that the actions from different processors are serialized.
– This is especially important for data structures that need to be updated in multiple steps. Each processor should be able to lock out other processors, update the data structure, leave the data structure in a consistent state, and unlock.
– This is usually realized by using atomic read-modify-write instructions.
(Parking space example, file editing example.)
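For illustration, here is a minimal lock built from an atomic read-modify-write operation (compare-and-exchange); the class and variable names are made up for this sketch and are not from the slides.

    #include <atomic>

    class SpinLock {
      std::atomic<int> locked{0};          // 0 = free, 1 = held
    public:
      void lock() {
        int expected = 0;
        // Atomically change 0 -> 1; retry while another processor holds it.
        while (!locked.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire)) {
          expected = 0;                    // compare_exchange overwrites expected
        }
      }
      void unlock() {
        locked.store(0, std::memory_order_release);   // leave a consistent state
      }
    };

    // Usage: guard a multi-step update so other processors never observe
    // a half-finished state.
    SpinLock stats_lock;
    int shared_count = 0;
    double shared_sum = 0.0;

    void add_sample(double x) {
      stats_lock.lock();
      shared_count += 1;                   // the two updates appear as one
      shared_sum   += x;                   // atomic step to other processors
      stats_lock.unlock();
    }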

Slide 17: Issue 3: Serial Component
No matter how much you parallelize, performance is ultimately limited by the serial component. This is better known as Amdahl's Law.
– Jet service to Chicago: jets and propeller planes have very different flying times but similar taxi times; the taxi time is the serial section here.
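Stated as a formula (the slide gives the law only in words; here f denotes the fraction of execution time that can be parallelized and N the number of processors):

    \[
      \text{Speedup}(N) = \frac{1}{(1 - f) + \frac{f}{N}},
      \qquad
      \lim_{N \to \infty} \text{Speedup}(N) = \frac{1}{1 - f}.
    \]

Even with f = 0.95, the speedup can never exceed 1/0.05 = 20, no matter how many processors are added.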

Slide 18: Example Speedup Graph
[Graph: speedup over a single processor (y-axis) versus number of processors (x-axis), showing an ideal speedup line and an observed speedup curve.]

Slide 19: Message Passing Machines
Network characteristics are of primary importance:
– latency, bisection bandwidth, node bandwidth, occupancy of communication.
Latency = sender overhead + transport latency + receiver overhead

Slide 20: Synchronization Support for Shared Memory Machines
– Example: two processors trying to increment a shared variable.
– Requires an atomic memory update instruction, for example test-and-set or compare-and-exchange.
– Using cmpxchg, implement barrier().
– Scalability issues with traditional synchronization.
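A possible sketch of barrier() built on an atomic compare-exchange (the C++ analogue of looping on cmpxchg), using a sense-reversing counting barrier. This is one answer among several, not the implementation the slide has in mind; NPROC and the variable names are assumptions.

    #include <atomic>
    #include <thread>

    constexpr int NPROC = 4;          // assumed number of processors

    std::atomic<int> count{0};        // processors that have arrived so far
    std::atomic<int> generation{0};   // incremented each time a barrier completes

    void barrier() {
      const int my_gen = generation.load(std::memory_order_acquire);
      // Arrive: atomically increment the count with a compare-exchange loop
      // (the software analogue of retrying cmpxchg until it succeeds).
      int c = count.load(std::memory_order_relaxed);
      while (!count.compare_exchange_weak(c, c + 1, std::memory_order_acq_rel))
        ;                             // c is reloaded with the current value on failure
      if (c == NPROC - 1) {
        // Last arrival: reset the count and release everyone else.
        count.store(0, std::memory_order_relaxed);
        generation.fetch_add(1, std::memory_order_release);
      } else {
        // Wait until the last arrival bumps the generation number.
        while (generation.load(std::memory_order_acquire) == my_gen)
          std::this_thread::yield();
      }
    }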

Slide 21: Coherence
[Diagram: CPUs and memory attached to a shared bus, shown without and with per-CPU caches.]
– No caches: easy hardware, bad performance.
– With caches: better performance, but... what about keeping the data valid?

Slide 22: Memory Coherence, continued
[Diagram: CPU 0 and CPU 1 each hold value A from memory location X in their caches.]
Essential question: what happens when CPU 0 writes value B to memory location X?

Slide 23: One Solution: Snooping Caches
[Diagram: each CPU's cache on the bus is paired with a duplicate set of snoop tags.]
– Snoop logic watches bus transactions and either invalidates or updates matching cache lines.
– Requires an extra set of cache tags.
– Generally, write-back caches are used. Why?
– What about the L1 cache?

Slide 24: Cache Coherence Protocol
Basic premise: writes must update other copies or invalidate other copies.
How do we make this work while not choking the bus with transactions, flooding memory, or bringing each CPU to a crawl?
One solution: the Illinois Protocol.

Slide 25: The Illinois Protocol
[Diagram: a cache block's tag extended with MESI state bits.]
The tag for each cache block now contains 2 bits that specify whether the block is Modified, Exclusively owned, Shared, or Invalid.

Slide 26: Illinois Protocol from the CPU's Perspective
[State-transition diagram over the states Invalid (or no tag match), Exclusive, Shared, and Modified, with edges labeled by processor actions:]
– Invalid, read supplied by memory: go to Exclusive
– Invalid, read supplied by another CPU: go to Shared
– Invalid or Shared, write (invalidate others): go to Modified
– Exclusive, write: go to Modified
– Exclusive or Shared, read: stay in the same state
– Modified, read or write: stay in Modified
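The diagram can also be read as a next-state function. The sketch below encodes the processor-side transitions listed above; the function name and the shared_elsewhere flag (whether another cache already holds the block on a read miss) are illustrative assumptions, not code from the course.

    enum class State  { Invalid, Exclusive, Shared, Modified };
    enum class Access { Read, Write };

    // Processor-side next-state function for the Illinois (MESI) protocol.
    State cpu_next_state(State s, Access a, bool shared_elsewhere) {
      switch (s) {
        case State::Invalid:
          if (a == Access::Read)
            return shared_elsewhere ? State::Shared      // supplied by another CPU
                                    : State::Exclusive;  // supplied by memory
          return State::Modified;                        // write: invalidate others
        case State::Shared:
          return (a == Access::Write) ? State::Modified  // write: invalidate others
                                      : State::Shared;   // read hit
        case State::Exclusive:
          return (a == Access::Write) ? State::Modified  // silent upgrade, no bus traffic
                                      : State::Exclusive;
        case State::Modified:
          return State::Modified;                        // read or write hit
      }
      return s;                                          // unreachable
    }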

Slide 27: Illinois Protocol from the Bus's Perspective
[State-transition diagram over the same states, with edges labeled by snooped bus transactions:]
– Modified, bus read: supply the data and go to Shared
– Modified, bus write (miss): supply the data first, then go to Invalid
– Exclusive, bus read: go to Shared
– Shared or Exclusive, bus write or invalidation: go to Invalid

Slide 28: Issues with a Single Bus
– Keep adding CPUs and eventually the bus saturates.
– The coherence protocol adds extra traffic.
– Buses are slower than point-to-point interconnects.

