QUESTION #3 MIDTERM a) A four-processor shared-memory system implements the MESI protocol for the cache coherence. For the following sequence of memory.


1 QUESTION #3 MIDTERM
a) A four-processor shared-memory system implements the MESI protocol for cache coherence. For the following sequence of memory references, show the state of the line containing the variable a in each processor’s cache after each reference is resolved. Each processor starts out with the line containing a invalid in its cache.

Reference     P0’s cache  P1’s cache  P2’s cache  P3’s cache
P0 reads a
P1 reads a
P2 reads a
P3 writes a
P0 reads a

Figure: MESI cache coherence protocol

2
Reference     P0’s cache  P1’s cache  P2’s cache  P3’s cache
P0 reads a    E           I           I           I
P1 reads a
P2 reads a
P3 writes a
P0 reads a

Initial assumption: a invalid in all caches
Figure: P0 reads a

3
Reference     P0’s cache  P1’s cache  P2’s cache  P3’s cache
P0 reads a    E           I           I           I
P1 reads a    S           S           I           I
P2 reads a    S           S           S           I
P3 writes a
P0 reads a

Figures: P1 reads a; P2 reads a

4
Reference     P0’s cache  P1’s cache  P2’s cache  P3’s cache
P0 reads a    E           I           I           I
P1 reads a    S           S           I           I
P2 reads a    S           S           S           I
P3 writes a   I           I           I           M
P0 reads a

Figures: P3 writes a; then, for the final reference, step 1. P0 reads a

5
Reference     P0’s cache  P1’s cache  P2’s cache  P3’s cache
P0 reads a    E           I           I           I
P1 reads a    S           S           I           I
P2 reads a    S           S           S           I
P3 writes a   I           I           I           M
P0 reads a    S           I           I           S

Figure steps for the final reference: 2. P3 writes back a’; 3. P0 reads a’
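The table above can be reproduced with a small state-transition sketch. This is a minimal illustration, not the course's reference solution: the function names `mesi_step` and `trace` are made up for this example, only read and write events on a single line are modeled, and bus signals, write-backs, and replacements are reduced to their effect on the four states.

```python
def mesi_step(states, proc, op):
    """Return the per-cache MESI states after processor `proc` performs `op`.

    `states` is a list with one letter (M, E, S, or I) per cache.
    """
    new = list(states)
    others = [p for p in range(len(states)) if p != proc]
    if op == "read":
        if states[proc] == "I":  # read miss
            holders = [p for p in others if states[p] != "I"]
            for p in holders:
                # A Modified copy is written back first; every holder ends in Shared.
                new[p] = "S"
            new[proc] = "S" if holders else "E"
        # A read hit (M, E, or S) changes nothing.
    elif op == "write":
        for p in others:
            new[p] = "I"  # all other copies are invalidated
        new[proc] = "M"
    else:
        raise ValueError(f"unknown operation: {op}")
    return new


def trace(num_procs, refs):
    """Apply each (processor, operation) pair in order, starting from all Invalid."""
    states = ["I"] * num_procs
    rows = []
    for proc, op in refs:
        states = mesi_step(states, proc, op)
        rows.append("".join(states))
    return rows


refs = [(0, "read"), (1, "read"), (2, "read"), (3, "write"), (0, "read")]
for (proc, op), row in zip(refs, trace(4, refs)):
    print(f"P{proc} {op}s a: {' '.join(row)}")
```

Each printed row lists the states of P0 through P3 in order, so the rows can be compared directly with the table.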

6 QUESTION #3 MIDTERM
b) Consider a multiprocessing system with 8 processors, each with a local cache, connected to the main memory.

If a Full Map Directory cache coherence protocol is implemented, what is the number of bits per directory entry? Why?
One presence bit is used per processor, so the number of bits is 8.

If a Limited Directory cache coherence protocol with only two pointers is implemented, what is the number of bits per directory entry? Why?
log2(8) = 3, so each pointer needs 3 bits and the total number of bits is 2 × 3 = 6.

Assume that a Full Map Directory cache coherence protocol with Centralized Directory Invalidate is implemented. Assume that the directory entry for address X contains all 0s at the beginning. Fill in the following table for the following sequence of operations:

Time instant  Operation                 Content of the directory for X
1             Processor 0 reads X
2             Processor 5 reads X
3             Processor 0 writes to X
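The bit counts and one plausible way to fill in the table can be sketched as follows. The slide does not give the filled-in table, so the bit layout here is an assumption: the leftmost bit stands for processor 0, there is no separate dirty bit, and a write invalidates every other copy so that only the writer's presence bit stays set. The function names are hypothetical.

```python
def full_map_bits(num_procs):
    """Full-map directory: one presence bit per processor."""
    return num_procs


def limited_directory_bits(num_procs, num_pointers):
    """Limited directory: each pointer holds a processor number."""
    bits_per_pointer = (num_procs - 1).bit_length()  # ceil(log2(num_procs))
    return num_pointers * bits_per_pointer


def directory_trace(num_procs, ops):
    """Presence bits after each (processor, operation) pair, leftmost bit = P0."""
    presence = [0] * num_procs
    rows = []
    for proc, op in ops:
        if op == "write":
            # Centralized invalidate: clear every sharer before the write proceeds.
            presence = [0] * num_procs
        presence[proc] = 1
        rows.append("".join(str(bit) for bit in presence))
    return rows


print(full_map_bits(8))              # bits per entry, full map
print(limited_directory_bits(8, 2))  # bits per entry, two pointers
for row in directory_trace(8, [(0, "read"), (5, "read"), (0, "write")]):
    print(row)
```

Under these assumptions the directory entry for X goes through 10000000, 10000100, and 10000000.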

7 ASSIGNMENT #3
1) Count the number of transactions on the bus for the following sequence of activities involving shared data. Assume that both processors use write-back write-update cache coherency, and a block size of one word. Assume that all the words in both caches are clean.

Step  Processor  Memory activity  Memory address
1     P1         write            100
2     P2         write            104
3     P1         read             100
4     P2         read             104
5     P1         read             104
6     P2         read             100

Initial assumption: there are multiple cache copies shared

8 Write-update / write-back states

State                     Description
Valid Exclusive [VAL-X]   This is the only cache copy and is consistent with global memory.
Shared Clean [SH-CLN]     There are multiple shared cache copies.
Shared Dirty [SH-DRT]     There are multiple shared cache copies. This is the last one being updated (Ownership).
Dirty [DIRTY]             This copy is not shared by other caches and has been updated. It is not consistent with global memory (Ownership).

9 [Figure: contents of the words at addresses 100 and 104]

10
Step  Processor  Memory activity  Memory address  Action            Comment
1     P1         write            100             P1 writes to 100  One bus transfer to move the word at 100 from P1's cache to P2's cache
2     P2         write            104             P2 writes to 104  One bus transfer to move the word at 104 from P2's cache to P1's cache
3     P1         read             100             P1 reads 100      No bus transfer; word read from P1 cache
4     P2         read             104             P2 reads 104      No bus transfer; word read from P2 cache
5     P1         read             104             P1 reads 104      No bus transfer; word read from P1 cache
6     P2         read             100             P2 reads 100      No bus transfer; word read from P2 cache

Total: two bus transactions (steps 1 and 2).
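The count in this table can be checked with a short sketch. It assumes a deliberately simplified model of write-update: a write costs one bus transaction when another cache holds a copy (or on a miss), a read costs one only on a miss, and every block touched stays cached afterward. The function name and the tuple layout of the steps are choices made for this illustration.

```python
def count_bus_transactions(steps, initially_cached):
    """Count bus transactions under a simplified write-update model.

    steps: list of (processor, "read" or "write", address).
    initially_cached: dict mapping processor -> set of cached addresses.
    Returns (total, per-step list of booleans saying whether the bus was used).
    """
    cached = {p: set(blocks) for p, blocks in initially_cached.items()}
    used_bus = []
    for proc, op, addr in steps:
        miss = addr not in cached[proc]
        shared = any(addr in blocks for p, blocks in cached.items() if p != proc)
        if op == "write":
            # The new word is broadcast so that every other copy is updated.
            used_bus.append(miss or shared)
        else:
            # A read hit is served locally; only a miss goes to the bus.
            used_bus.append(miss)
        cached[proc].add(addr)
    return sum(used_bus), used_bus


steps = [
    ("P1", "write", 100), ("P2", "write", 104),
    ("P1", "read", 100), ("P2", "read", 104),
    ("P1", "read", 104), ("P2", "read", 100),
]
# Initial assumption from the assignment: both caches hold shared copies.
both_cached = {"P1": {100, 104}, "P2": {100, 104}}
total, per_step = count_bus_transactions(steps, both_cached)
print(total, per_step)
```

Because the updates in steps 1 and 2 keep both copies current, all four reads hit in the local cache.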

11 ASSIGNMENT #3
2) Two processors require access to the same line of data from data memory. Each processor has a cache and uses the MESI protocol. Initially both caches are empty. The figure below depicts the consequence of a read of line x by processor P1. If this is the start of a sequence of accesses, draw the subsequent figures for the following sequence:
1. P2 reads x
2. P1 writes to x
3. P1 writes to x
4. P2 reads x

Figure: P1 reads x

12 Figures: P2 reads x; P1 writes to x
Figure steps for the final read: 1. P2 reads x; 2. P1 writes back x; 3. P2 reads x

13 ASSIGNMENT #3 3a)
This is the simplest possible cache coherence protocol. It requires that all processors use a write-through policy. If a write is made to a location cached in remote caches, then the copies of the line in the remote caches are invalidated.
Result: easy to implement, but it requires more bus and memory traffic because of the write-through policy.

Figure: Write-Through Cache State Transitions
Legend: R = Read, W = Write, Z = Replace; i = local processor, j = other processor
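The transition diagram itself did not survive in the transcript, so the sketch below is an assumption about what a two-state (Valid/Invalid) write-through invalidate machine typically looks like, using the legend's R, W, Z events and the i/j distinction. It also assumes write-allocate, so a local write to an Invalid line makes it Valid. The function name is hypothetical.

```python
def next_state(state, event, who):
    """Next state of one cache line in a write-through invalidate protocol.

    state: "Valid" or "Invalid"
    event: "R" (read), "W" (write), or "Z" (replace)
    who:   "i" for the local processor, "j" for any other processor
    """
    if state not in ("Valid", "Invalid"):
        raise ValueError(f"unknown state: {state}")
    if who == "i":
        if event in ("R", "W"):
            return "Valid"    # local access loads or keeps the line
        if event == "Z":
            return "Invalid"  # local replacement evicts the line
    elif who == "j":
        if event == "W":
            return "Invalid"  # remote write invalidates this copy
        if event in ("R", "Z"):
            return state      # remote reads and replacements do not affect it
    raise ValueError(f"unknown event: {event}({who})")


state = "Invalid"
for event, who in [("R", "i"), ("R", "j"), ("W", "j"), ("W", "i"), ("Z", "i")]:
    state = next_state(state, event, who)
    print(f"{event}({who}) -> {state}")
```

Every local write also goes to memory over the bus, which is where the extra traffic mentioned above comes from; the state machine only tracks validity.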

14 ASSIGNMENT #3 3b)
This protocol makes a distinction between shared and exclusive states. When a cache first loads a line, it puts it in the shared state. If the line is already in the modified state in another cache, that cache must block the read until the line is written back to main memory, similar to the MESI protocol. The difference between the two is that MESI splits the shared state into shared and exclusive states, which reduces the number of write-invalidate operations on the bus.

15 QUIZ 3 QUESTION #2
What is the diameter of
A hypercube with 256 processors? 8
A 2D mesh with 64 processors? 14
A linear array with 32 processors? 31
A star network with 17 processors (1 in the middle and 16 leaf processors)? 2
A 2D torus with p processors (assume that routing is bidirectional)? 2*

What is the bisection width of
A hypercube with 256 processors? 128
A 2D mesh with 64 processors? 8
A linear array with 32 processors? 1
A star network with 17 processors (1 in the middle and 16 leaf processors)? 8
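The numeric answers above follow from closed-form expressions, sketched here as small helper functions (the names are made up for this example). Several assumptions are baked in: hypercube sizes are powers of two, meshes and tori are square with side √p, and the star's bisection width counts the leaf links cut when half of the leaves are separated from the hub, which is the convention that yields the quiz answer of 8. The torus expression on the slide is cut off, so the formula used here is the usual one for a √p × √p torus with wraparound links and is supplied as an assumption.

```python
import math


def hypercube(p):
    """(diameter, bisection width) of a hypercube with p = 2**d processors."""
    d = p.bit_length() - 1  # log2(p) when p is a power of two
    return d, p // 2


def mesh2d(p):
    """(diameter, bisection width) of a square 2D mesh without wraparound."""
    side = math.isqrt(p)
    return 2 * (side - 1), side


def linear_array(p):
    """(diameter, bisection width) of a linear array."""
    return p - 1, 1


def star(p):
    """(diameter, bisection width) of a star with one hub and p - 1 leaves."""
    return 2, (p - 1) // 2  # assumed convention: cut the links of half the leaves


def torus2d_diameter(p):
    """Diameter of a square 2D torus with bidirectional wraparound links."""
    side = math.isqrt(p)
    return 2 * (side // 2)


print(hypercube(256), mesh2d(64), linear_array(32), star(17), torus2d_diameter(64))
```

For example, an 8 × 8 torus has diameter 8, compared with 14 for the 8 × 8 mesh, because the wraparound links halve the worst-case distance in each dimension.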

16 QUESTION #3
b) Modify this program in order to compute cumulative sums C. The cumulative sum C is an array of n elements computed as C(i) = Z(1) + Z(2) + … + Z(i). Write a program for parallel computation of cumulative sums on M processors. The input array is Z and it has n elements. The cumulative sums C(1), …, C(n) are printed by processor 0.

17
INITIALIZE; //assign proc_num and M, where M is the number of processors
read_array(Z, n); //read the array and array size n from file (n is assumed to be a multiple of M)
BARRIER(M); //waits for M processors to get to this point in the program
size_to_sum = n/M;
lower_ind = size_to_sum * proc_num;
upper_ind = size_to_sum * (proc_num + 1);
//local cumulative sum over this processor's own block
C[lower_ind] = Z[lower_ind];
for (i = lower_ind + 1; i < upper_ind; i++) {
    C[i] = C[i-1] + Z[i];
}
BARRIER(M); //waits until every local cumulative sum is complete
//from the last block down, add the local total of block j-1 to every block at or above j
for (j = M-1; j >= 1; j--) {
    if (proc_num >= j) {
        offset = C[size_to_sum * j - 1]; //last element of block j-1, still a local total at this step
        for (i = lower_ind; i < upper_ind; i++) {
            C[i] = C[i] + offset;
        }
    }
    BARRIER(M); //reached by every processor, including those that skipped the update
}
if (proc_num == 0)
    for (i = 0; i < n; i++)
        printf("C[%d] = %d\n", i, C[i]);
END;
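The same block-wise scheme can be tried out with Python threads standing in for the M processors. This is only a sketch under stated assumptions: `parallel_cumsum` is a name invented here, `threading.Barrier` plays the role of BARRIER(M), n must be a positive multiple of M, and at step j each worker reads the last element of block j-1 (index size * j - 1), which still holds that block's local total because the loop runs from the last block downward.

```python
import threading


def parallel_cumsum(Z, M):
    """Cumulative sums of Z computed block-wise by M worker threads."""
    n = len(Z)
    if n == 0 or n % M != 0:
        raise ValueError("n must be a positive multiple of M in this sketch")
    size = n // M
    C = [0] * n
    barrier = threading.Barrier(M)

    def worker(proc_num):
        lo, hi = size * proc_num, size * (proc_num + 1)
        # Phase 1: local cumulative sum over this worker's own block.
        C[lo] = Z[lo]
        for i in range(lo + 1, hi):
            C[i] = C[i - 1] + Z[i]
        barrier.wait()
        # Phase 2: from the last block down, add block j-1's local total
        # to every block at or above j.
        for j in range(M - 1, 0, -1):
            if proc_num >= j:
                offset = C[size * j - 1]
                for i in range(lo, hi):
                    C[i] += offset
            barrier.wait()  # every worker waits, even if it skipped the update

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(M)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C


print(parallel_cumsum([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

The second phase takes M - 1 barrier-separated steps, so it is simple rather than fast; a tree-based scan would cut the number of steps to about log2(M) at the cost of more bookkeeping.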

