Quiz 3: solutions QUESTION #2 Consider a multiprocessor system with two processors (P1 and P2) and each processor has a cache. Initially, there is no copy.


1 Quiz 3: solutions QUESTION #2
Consider a multiprocessor system with two processors (P1 and P2), each with its own cache. Initially, there is no copy of variable X in any of the caches and X=10. For each of the following protocols, show the state of variable X in the caches and in memory after each of the following statements is executed.

(a) Two-state write-through write-invalidate protocol
R = Read, W = Write, Z = Replace; i = local processor, j = other processor
States: V = valid, I = invalid

Statement | State of P1's cache | Content of X in P1's cache | State of P2's cache | Content of X in P2's cache | Content of memory location X
1. P1 reads X | V | 10 | I | - | 10
2. P2 reads X | V | 10 | V | 10 | 10
3. P2 performs X=X+2 | I | 10 | V | 12 | 12
4. P1 performs X=X*2 | V | 24 | I | 12 | 24
5. P2 reads X | V | 24 | V | 24 | 24
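The write-through table can be checked with a small simulation. The following is a minimal sketch, not the quiz's reference solution: the class name `WriteThroughCaches` and its methods are hypothetical, it assumes a write allocates a valid local copy (as row 4 of the table implies), and it assumes the last statement is a read by P2, which is what the final row's states correspond to.

```python
class WriteThroughCaches:
    """Two-state (V/I) write-through, write-invalidate caches for one variable."""

    def __init__(self, num_procs, initial_value):
        self.memory = initial_value
        self.state = ["I"] * num_procs    # "V" = valid, "I" = invalid
        self.value = [None] * num_procs   # None = no copy has been loaded yet

    def read(self, p):
        if self.state[p] == "I":          # read miss: fetch the block from memory
            self.value[p] = self.memory
            self.state[p] = "V"
        return self.value[p]

    def write(self, p, new_value):
        for q in range(len(self.state)):  # invalidate every other copy
            if q != p:
                self.state[q] = "I"
        self.value[p] = new_value         # assumption: the writer keeps a valid copy
        self.state[p] = "V"
        self.memory = new_value           # write-through: memory is updated at once

    def snapshot(self):
        row = []
        for s, v in zip(self.state, self.value):
            row += [s, v]
        return tuple(row + [self.memory])


P1, P2 = 0, 1
caches = WriteThroughCaches(num_procs=2, initial_value=10)
rows = []

caches.read(P1)                          # 1. P1 reads X
rows.append(caches.snapshot())
caches.read(P2)                          # 2. P2 reads X
rows.append(caches.snapshot())
caches.write(P2, caches.read(P2) + 2)    # 3. P2 performs X = X + 2
rows.append(caches.snapshot())
caches.write(P1, caches.read(P1) * 2)    # 4. P1 performs X = X * 2
rows.append(caches.snapshot())
caches.read(P2)                          # 5. P2 reads X (assumed, see above)
rows.append(caches.snapshot())

for number, row in enumerate(rows, start=1):
    print(number, row)
```

Each printed tuple is (P1 state, P1 value, P2 state, P2 value, memory). An invalidated cache keeps its stale value, which is why P1 still shows 10 after step 3.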

2 Quiz 3: solutions QUESTION #2
(b) Basic MSI write-back invalidation protocol
States: RO = read-only, RW = read-write, INV = invalid

Statement | State of P1's cache | Content of X in P1's cache | State of P2's cache | Content of X in P2's cache | Content of memory location X
1. P1 reads X | RO | 10 | INV | - | 10
2. P2 reads X | RO | 10 | RO | 10 | 10
3. P2 performs X=X+2 | INV | 10 | RW | 12 | 10
4. P1 performs X=X*2 | RW | 24 | INV | 12 | 10
5. P2 reads X | RO | 24 | RO | 24 | 24

3
1. P2 reads X
2. P1 writes back X’
3. P2 reads X’

4 Quiz 3: solutions QUESTION #3(a)
The following MPI program is given. What is the order of printing? Why?

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char** argv)
    {
        int my_PE_num;  /* rank of this process */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);
        printf("Hello from %d.\n", my_PE_num);
        MPI_Finalize();
        return 0;
    }

MPI_Init: initiates the computation.
MPI_Comm_rank: determines the integer identifier assigned to the current process (processes in a process group are identified with unique, contiguous integers numbered from 0).
MPI_COMM_WORLD: default value which identifies all processes involved in a computation.
MPI_Finalize: terminates the computation.

Answer: There is no defined order of printing. MPI_Comm_rank only returns the rank of each process; it does not define the order in which the processes execute the printf call. One possible output:

    Hello from 3.
    Hello from 1.
    Hello from 0.
    Hello from 2.

5 Quiz 4: QUESTION #1
4. Explain how scheduling in-forest / out-forest task graphs works:
First, determine the level of each node, which is the maximum number of nodes (including itself) on any path from the given node to a terminal node. The level of each node is used as that node's priority.
Whenever a processor becomes available, assign it the unexecuted ready task with the highest priority.
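The two steps can be sketched in a few lines. This is an illustration under stated assumptions, not part of the quiz: the in-tree `tree`, the task names `t1` to `t7`, the helper names `levels` and `schedule`, unit execution times, and alphabetical tie-breaking are all hypothetical choices made for the example.

```python
from functools import lru_cache


def levels(successors):
    """Level = maximum number of nodes (including itself) on a path to a terminal node."""
    @lru_cache(maxsize=None)
    def level(node):
        return 1 + max((level(s) for s in successors[node]), default=0)
    return {node: level(node) for node in successors}


def schedule(successors, num_procs):
    """List scheduling with unit-time tasks: highest level first at every time step."""
    lvl = levels(successors)
    preds = {node: set() for node in successors}
    for node, succs in successors.items():
        for s in succs:
            preds[s].add(node)

    done, steps = set(), []
    while len(done) < len(successors):
        # A task is ready when all of its predecessors have been executed.
        ready = [n for n in successors if n not in done and preds[n] <= done]
        ready.sort(key=lambda n: (-lvl[n], n))   # priority = level, ties by name
        chosen = ready[:num_procs]               # one task per available processor
        steps.append(chosen)
        done.update(chosen)
    return steps


# Hypothetical in-tree: t1,t2 -> t5; t3,t4 -> t6; t5,t6 -> t7 (terminal node).
tree = {
    "t1": ["t5"], "t2": ["t5"], "t3": ["t6"], "t4": ["t6"],
    "t5": ["t7"], "t6": ["t7"], "t7": [],
}

tree_levels = levels(tree)
plan = schedule(tree, num_procs=2)
print(tree_levels)
print(plan)
```

With two processors the leaves (level 3) are executed first, then the level-2 nodes, and the terminal node last, so the example finishes in four time steps.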

6 Quiz 4: QUESTION #2
The task graph is shown below together with the execution and communication times:
a. Draw a Gantt chart with communication when this program is executed on two processors. Schedule the program on these processors so that the overall time is minimized. What is the total time needed?

Task graph: a -> b (arc y), a -> c (arc x)

Task | Execution time
a | 10
b | 15
c | 15

Arc | Communication
(a,b) | y=5
(a,c) | x=10

Gantt chart:
P1: a (0 to 10), c (10 to 25)
P2: idle, then b (ends at 30)

Answer: total time is 30.

7 Quiz 4: QUESTION #2
The task graph is shown below together with the execution and communication times:
b. Which technique will help eliminate communication time? What is the total time needed?

Task graph: a -> b (arc y=5), a -> c (arc x=10); execution times a=10, b=15, c=15.

Answer: NODE DUPLICATION; total time is 25.

Gantt chart with node a duplicated:
P1: a (0 to 10), c (10 to 25)
P2: a (0 to 10), b (10 to 25)

Gantt chart without duplication, for comparison:
P1: a (0 to 10), c (10 to 25)
P2: idle, then b (ends at 30)
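Both totals can be reproduced with a small schedule evaluator. This is a sketch under assumptions: the execution times of b and c are read from a partly garbled slide (both taken as 15), the helper name `makespan` is hypothetical, and communication is charged only when a task reads its input from a copy of the predecessor on a different processor.

```python
EXEC = {"a": 10, "b": 15, "c": 15}            # assumed execution times
COMM = {("a", "b"): 5, ("a", "c"): 10}        # arc communication times y and x


def makespan(assignment):
    """assignment maps a processor to its ordered task list; a task listed on
    several processors is duplicated there."""
    preds = {}
    for (u, v) in COMM:
        preds.setdefault(v, []).append(u)

    finish = {}                                # (processor, task) -> finish time
    free = {p: 0 for p in assignment}          # when each processor becomes free
    pending = {p: list(tasks) for p, tasks in assignment.items()}

    while any(pending.values()):
        progressed = False
        for p, tasks in pending.items():
            if not tasks:
                continue
            task, ready, runnable = tasks[0], 0, True
            for u in preds.get(task, []):
                # Data arrives from the cheapest finished copy of predecessor u;
                # a copy on the same processor costs no communication.
                arrivals = [f + (0 if q == p else COMM[(u, task)])
                            for (q, t), f in finish.items() if t == u]
                if not arrivals:
                    runnable = False
                    break
                ready = max(ready, min(arrivals))
            if not runnable:
                continue
            start = max(free[p], ready)
            free[p] = finish[(p, task)] = start + EXEC[task]
            tasks.pop(0)
            progressed = True
        if not progressed:
            raise ValueError("a task is waiting for a predecessor that is never scheduled")
    return max(free.values())


without_duplication = makespan({"P1": ["a", "c"], "P2": ["b"]})
with_duplication = makespan({"P1": ["a", "c"], "P2": ["a", "b"]})
print(without_duplication, with_duplication)
```

Running a on both processors removes the 5-unit transfer on arc (a,b), so b can start at time 10 instead of 15.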

8 Quiz 4: QUESTION #1
1. Which of the following statements is false?
a) Node duplication reduces the overall number of computational operations in the system
b) Node duplication reduces communication delays
c) Node duplication is used to reduce the idle time

9 Vector Processing: Architectures that have high-level operations that work on linear arrays of numbers, or “vectors”. Some typical vector-based instructions:

10 Convoy: a set of vector instructions that could potentially begin execution together in one clock period.

11 Enhancing vector performance: Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available.

12 Quiz 4: QUESTION #1
3. If we compare a program that deals with arrays written for a vector processor with the same program written for a scalar processor, we can see that the vector program has a smaller number of instructions and also executes a smaller number of operations. Why?
The number of instructions is reduced because whole loops can be replaced with one (or a few) vector instructions. The number of operations is reduced as well because the operations needed to handle the loop, such as incrementing indexes, do not need to be executed in software.
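A rough count makes the difference concrete. The numbers below are an illustrative assumption, not taken from the quiz: a hypothetical scalar loop body of seven instructions per element for y = a*x + y on 64 elements, against a hypothetical five-instruction vector program for the same computation.

```python
N = 64  # vector length (assumed for the example)

# Hypothetical scalar loop body, executed once per element.
SCALAR_LOOP_BODY = [
    "load x[i]", "load y[i]", "multiply", "add", "store y[i]",
    "increment index", "compare and branch",
]
LOOP_OVERHEAD = {"increment index", "compare and branch"}

# Hypothetical vector program, executed once for the whole array.
VECTOR_PROGRAM = [
    "load vector X", "load vector Y", "vector-scalar multiply",
    "vector add", "store vector Y",
]

scalar_instructions = N * len(SCALAR_LOOP_BODY)
overhead_operations = N * sum(1 for op in SCALAR_LOOP_BODY if op in LOOP_OVERHEAD)
vector_instructions = len(VECTOR_PROGRAM)

print(scalar_instructions, overhead_operations, vector_instructions)
```

Under these assumptions the scalar version issues 448 instructions, 128 of which only manage the loop, while the vector version issues 5.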

13 Quiz 4: QUESTION #3 (a, b, 17 points each, total 34 points)
Consider the vector program given below for Y=X*Z+Y. All vectors have a length of 64. Suppose that the hardware has 2 load/store units capable of performing 2 loads, or 2 stores, or 1 load and 1 store vector operation at the same time, one pipelined vector multiplier and one pipelined vector adder. Suppose that chaining is not allowed and that the start-up times are 12 for LV and SV, 7 for MULV and 6 for ADDV.
a. How many convoys do we have?
b. What is the total execution time?

    LV    V5,Rz      ;load vector Z
    LV    V1,Rx      ;load vector X
    MULV  V2,V1,V5   ;vector multiply
    LV    V3,Ry      ;load vector Y
    ADDV  V4,V2,V3   ;vector add
    SV    Ry,V4      ;store the result

Answer:
a. 4 convoys:
1. LV, LV
2. MULV, LV
3. ADDV
4. SV
b. 4 x 64 + 12 + 12 + 6 + 12 = 298
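The arithmetic for part b can be written out as a short calculation. This is a sketch of the simplified timing model the answer uses, assuming each convoy costs the vector length plus the largest start-up time among its instructions; the function name `unchained_time` is hypothetical.

```python
STARTUP = {"LV": 12, "SV": 12, "MULV": 7, "ADDV": 6}   # start-up times from the question
CONVOYS = [["LV", "LV"], ["MULV", "LV"], ["ADDV"], ["SV"]]
VECTOR_LENGTH = 64


def unchained_time(convoys, startup, n):
    """Without chaining, convoys run one after another; each takes n cycles
    plus the start-up time of its slowest instruction."""
    return sum(n + max(startup[op] for op in convoy) for convoy in convoys)


total = unchained_time(CONVOYS, STARTUP, VECTOR_LENGTH)
print(len(CONVOYS), total)
```

The second convoy is charged 12 rather than 7 because its LV has the longer start-up time.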

14 Final: QUESTION #5.1-2
Consider the following code implemented on a vector processor used to multiply a 64-element vector, Y = a × X:

    L.D      F0,a       ; load scalar a
    LV       V1,Rx      ; load vector X
    MULVS.D  V2,V1,F0   ; vector-scalar multiply
    SV       Ry,V2      ; store the result

Start-up delay: load and store unit 12 clock cycles, multiply unit 7 clock cycles.
Compute the total execution time of the vector instructions if the instructions are chained. Assume that:
a) There is only 1 load/store unit

Answer (LV, MULVS.D, SV): 12 + 64 + 12 + 64 = 152
With a single load/store unit, SV cannot start until LV has finished, so the multiply start-up is hidden behind the load.

15 Final: QUESTION #5.1-2
Consider the following code implemented on a vector processor used to multiply a 64-element vector, Y = a × X:

    L.D      F0,a       ; load scalar a
    LV       V1,Rx      ; load vector X
    MULVS.D  V2,V1,F0   ; vector-scalar multiply
    SV       Ry,V2      ; store the result

Start-up delay: load and store unit 12 clock cycles, multiply unit 7 clock cycles.
Compute the total execution time of the vector instructions if the instructions are chained. Assume that:
b) There is one load unit and one store unit

Answer (LV, MULVS.D, SV): 12 + 7 + 12 + 64 = 95
With separate load and store units, all three instructions are chained, so the start-up times add up once and the 64 elements then stream through.
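The two chained cases can be compared with a short calculation. This is a sketch of the simplified timing model behind the answers, assuming one element per clock cycle after start-up; the function names are hypothetical.

```python
N = 64            # vector length
LS_STARTUP = 12   # load and store unit start-up
MUL_STARTUP = 7   # multiply unit start-up


def one_load_store_unit():
    """LV and SV share one unit, so SV waits until LV has finished."""
    lv_done = LS_STARTUP + N                    # load occupies the unit until here
    mul_done = LS_STARTUP + MUL_STARTUP + N     # MULVS.D is chained to LV
    sv_first_store = lv_done + LS_STARTUP       # SV starts only after LV is done
    assert mul_done <= sv_first_store           # products are ready before storing begins
    return sv_first_store + N


def separate_load_and_store_units():
    """LV, MULVS.D and SV are all chained: start-ups add, then N elements stream."""
    return LS_STARTUP + MUL_STARTUP + LS_STARTUP + N


time_a = one_load_store_unit()
time_b = separate_load_and_store_units()
print(time_a, time_b)
```

The internal assertion confirms that in case a) the multiplier finishes (cycle 83) before the first store can happen (cycle 88), which is why the 7-cycle multiply start-up does not appear in the total.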

