1 E0-243: Computer Architecture, L2 – Parallel Architecture

2 Overview  Parallel architecture  Cache coherence problem  Memory consistency

3 Trends  Ever-increasing transistor density enables multiple processors (multiple cores) on a single chip (CMP)  Beyond instruction-level parallelism: thread-level parallelism  Speculative execution  Speculative multithreaded execution

4 Recall: Amdahl's Law  For a program in which a fraction x of the execution is sequential, speedup is limited to 1/x  Speedup = (Exec. time on uniprocessor) / (Exec. time on N processors)  Efficiency = (Speedup on N processors) / N
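As a quick sanity check on these formulas, the following small C program (not from the slides; the 10% sequential fraction and the processor counts are made-up values for illustration) evaluates Speedup = 1 / (x + (1 - x)/N) and the corresponding efficiency:

```c
#include <stdio.h>

/* Amdahl's Law: speedup on N processors when a fraction x of the
 * execution is inherently sequential. */
static double amdahl_speedup(double x, int n) {
    return 1.0 / (x + (1.0 - x) / n);
}

int main(void) {
    const double x = 0.10;              /* assumed 10% sequential part */
    const int procs[] = {2, 8, 64, 1024};

    for (int i = 0; i < 4; i++) {
        int n = procs[i];
        double s = amdahl_speedup(x, n);
        /* Efficiency = speedup / number of processors */
        printf("N = %4d  speedup = %6.2f  efficiency = %.2f\n", n, s, s / n);
    }
    printf("Limit as N grows: 1/x = %.1f\n", 1.0 / x);
    return 0;
}
```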

5 Space of Parallel Computing  Programming models: what the programmer uses when coding applications; they specify synchronization and communication  Shared address space, e.g., OpenMP  Message passing, e.g., MPI  Parallel architecture  Shared memory: centralized shared memory (UMA) or distributed shared memory (NUMA)  Distributed memory: a.k.a. message passing, e.g., clusters
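To make the shared-address-space model concrete, here is a minimal OpenMP sketch in C (my own illustration, not taken from the slides): all threads read and write the same array through ordinary loads and stores, and the runtime handles thread creation. Compile with -fopenmp.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Shared address space: every thread accesses a[] directly;
     * communication happens implicitly through loads and stores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* The reduction clause handles the synchronization on sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```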

6 Shared Memory Architectures  Shared, global address space, hence called shared address space  Any processor can directly reference any memory location  Communication occurs implicitly as a result of loads and stores  Centralized: latencies to memory are uniform, but uniformly large  Distributed: Non-Uniform Memory Access (NUMA)

7 Shared Memory Architecture  [Figure] Centralized shared memory: processors, each with a private cache ($), connected through a network to shared memory modules (M)  [Figure] Distributed shared memory: each node contains a processor (P), cache ($), and local memory (M), with nodes connected by a network

8 Distributed Memory Architecture (Message Passing Architecture)  [Figure] Processing nodes, each with a processor (P), cache ($), and memory (M), connected by a network  Memory is private to each node  Processes communicate by messages
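As a counterpart to the OpenMP sketch above, here is a minimal message-passing example in C using MPI (again my own illustration, not from the slides): each process owns its data privately, and communication is explicit through send/receive calls. Run with, e.g., mpirun -np 2.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* data private to rank 0 */
        /* Explicit communication: send the value to rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 cannot read rank 0's memory; it must receive a message. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```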

9 Caches and Cache Coherence  Caches play a key role in all cases  Reduce average data access time  Reduce bandwidth demands placed on the shared interconnect  Private processor caches create a problem  Copies of a variable can be present in multiple caches  A write by processor P may not be visible to processor P'!  P' will keep accessing the stale value from its cache!  This is the cache coherence problem

10 Cache Coherence Problem: Example  [Figure] P1, P2, and P3, each with a private cache, share memory and I/O devices over a bus; memory initially holds u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u, (5) P2 reads u  Processors see different values for u after event 3  With write-back caches, the value written back to memory depends on which cache flushes or writes back its value

11 Cache Coherence Problem  Multiple processors with private caches  Potential data consistency problem: the cache coherence problem  Processes shouldn't read 'stale' data  Intuitively, reading an address should return the last value written to that address  Solutions  Hardware: cache coherence mechanisms  Invalidation-based vs. update-based  Snoopy vs. directory  Software: compiler-assisted cache coherence

12 Example: Snoopy Bus Protocols  Assumption: a shared bus interconnect where all cache controllers monitor all bus activity  Called snooping  Only one operation can be on the bus at a time, so cache controllers can be built to take corrective action and enforce coherence in the caches  Corrective action could involve updating or invalidating a cache block

13 Snoopy Invalidate Protocol  [Figure] Same example as slide 10: when P3 writes u = 7, the other caches snoop the bus transaction and invalidate their copies of u, so the later reads by P1 and P2 miss and obtain the new value

14 Invalidate vs. Update  Basic question of program behavior: is a block written by one processor later read by others before it is overwritten?  Invalidate  readers will take a miss  multiple writes without additional traffic  clears out copies that are not used again  Update  avoids misses on later references  multiple useless updates

15 MSI Invalidation Protocol  Cache block states (encoded in 2 bits and updated by the protocol)  I: Invalid  S: Shared (one or more cached copies)  M: Modified or Dirty (only copy)  Processor events  PrRd (read)  PrWr (write)  Bus transactions  BusRd: asks for a copy with no intent to modify  BusRdX: asks for a copy with intent to modify  Flush: write back (updates main memory)

16 MSI: State Transition Diagram  [Figure: state transition diagram; edges are labeled event/bus action, with "-" meaning no bus action]
From I: PrRd/BusRd goes to S; PrWr/BusRdX goes to M
From S: PrRd/- stays in S; PrWr/BusRdX goes to M; observed BusRd/- stays in S; observed BusRdX/- goes to I
From M: PrRd/- and PrWr/- stay in M; observed BusRd/Flush goes to S; observed BusRdX/Flush goes to I
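The diagram can also be read as a lookup table. Below is a small, self-contained C sketch of the per-block controller logic, written only for illustration (the enum and function names are my own, not from the slides); one function handles the cache's own processor events and returns the bus action to issue, the other handles snooped bus events.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { PR_RD, PR_WR } proc_event_t;
typedef enum { BUS_RD, BUS_RDX } bus_event_t;
typedef enum { NONE, ISSUE_BUSRD, ISSUE_BUSRDX, FLUSH } action_t;

/* Transition for this cache's own processor events. */
static msi_state_t on_proc_event(msi_state_t s, proc_event_t e, action_t *act) {
    *act = NONE;
    switch (s) {
    case INVALID:
        if (e == PR_RD)  { *act = ISSUE_BUSRD;  return SHARED;   }
        else             { *act = ISSUE_BUSRDX; return MODIFIED; }
    case SHARED:
        if (e == PR_WR)  { *act = ISSUE_BUSRDX; return MODIFIED; }
        return SHARED;                     /* PrRd hits, no bus action */
    case MODIFIED:
        return MODIFIED;                   /* both PrRd and PrWr hit   */
    }
    return s;
}

/* Transition for bus events snooped from other caches. */
static msi_state_t on_bus_event(msi_state_t s, bus_event_t e, action_t *act) {
    *act = NONE;
    switch (s) {
    case MODIFIED:
        *act = FLUSH;                      /* supply the dirty copy    */
        return (e == BUS_RD) ? SHARED : INVALID;
    case SHARED:
        return (e == BUS_RDX) ? INVALID : SHARED;
    case INVALID:
        return INVALID;
    }
    return s;
}

int main(void) {
    action_t a;
    msi_state_t s = INVALID;
    s = on_proc_event(s, PR_RD, &a);   /* I -> S, issues BusRd  */
    s = on_proc_event(s, PR_WR, &a);   /* S -> M, issues BusRdX */
    s = on_bus_event(s, BUS_RD, &a);   /* M -> S, flushes block */
    printf("final state = %d, last action = %d\n", s, a);
    return 0;
}
```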

17 MESI (4-State) Invalidation Protocol  Problem with the MSI protocol: reading and then modifying data takes 2 bus transactions, even if no one else is sharing  BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M)  Add an exclusive state so the block can be written locally without a bus transaction  Memory is up to date, so the cache is not necessarily the owner  States  Invalid  Exclusive, or exclusive-clean (only this cache has a copy, and it is not modified)  Shared (two or more caches may have copies)  Modified (dirty)

18 MESI: State Transition Diagram  [Figure: state transition diagram; edges are labeled event/bus action]
From I: on PrRd, issue BusRd; go to S if another cache asserts the shared signal (BusRd(S)), otherwise go to E; PrWr/BusRdX goes to M
From E: PrRd stays in E; PrWr goes to M with no bus transaction; observed BusRd moves to S; observed BusRdX/Flush goes to I
From S: PrRd stays in S; PrWr/BusRdX goes to M; observed BusRd stays in S; observed BusRdX goes to I
From M: PrRd and PrWr stay in M; observed BusRd/Flush goes to S; observed BusRdX/Flush goes to I
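The key payoff of the E state is that a read followed by a write by the same processor needs only one bus transaction. The standalone C sketch below is a purely illustrative extension of the MSI sketch above (the names and the other_sharers flag are my own, not from the slides):

```c
#include <stdio.h>

typedef enum { M_INVALID, M_SHARED, M_EXCLUSIVE, M_MODIFIED } mesi_state_t;

/* other_sharers: the shared signal sampled during the BusRd transaction. */
static mesi_state_t mesi_on_proc_read(mesi_state_t s, int other_sharers) {
    if (s == M_INVALID)                      /* miss: issue BusRd           */
        return other_sharers ? M_SHARED : M_EXCLUSIVE;
    return s;                                /* E, S, M: read hits locally  */
}

static mesi_state_t mesi_on_proc_write(mesi_state_t s) {
    if (s == M_EXCLUSIVE || s == M_MODIFIED)
        return M_MODIFIED;                   /* silent upgrade: no bus xact */
    return M_MODIFIED;                       /* from I or S: issue BusRdX   */
}

int main(void) {
    mesi_state_t s = mesi_on_proc_read(M_INVALID, 0);  /* I -> E, one BusRd */
    s = mesi_on_proc_write(s);                          /* E -> M, no xact  */
    printf("state after read+write with no sharers: %d\n", s);
    return 0;
}
```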

19 Scalability Issues of Snoopy Protocols  Snoopy caches are ideally suited to a bus-based interconnection network (IN)  A shared-bus IN saturates as the number of processors grows (beyond roughly 8 processors)  On a non-bus-based IN, coherence messages must be broadcast, which is expensive  Only a few processors may have a copy of the shared data  It may be more efficient to maintain a directory of the caches that have a copy of each cache block

20 Directory-Based Coherence  Memory (or cache) maintains a list (directory) of the processors that have a copy of each block  On a write, the memory controller sends an Invalidate (or Update) signal only to the processors that have a copy  Memory also knows the current owner (in the case of dirty blocks)  The memory controller requests the updated copy from the owner

21 Generic Solution: Directories  [Figure] Nodes, each containing a processor, cache, memory with its directory, and a communication assist, connected by a scalable interconnection network; each directory entry holds one presence bit per node and a dirty bit
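To illustrate the write handling of slide 20, here is a hedged C sketch of a directory entry and the invalidation step on a write. It is only a model of the idea: the node count, the bit layout, and the send_invalidate()/request_writeback() helpers are hypothetical stand-ins for network messages, not part of any real protocol implementation.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 32

typedef struct {
    uint32_t presence;    /* one bit per node that caches the block           */
    int      dirty;       /* set when exactly one node holds a modified copy  */
    int      owner;       /* valid only when dirty is set                     */
} dir_entry_t;

/* Hypothetical helpers standing in for network messages. */
static void send_invalidate(int node)   { printf("invalidate -> node %d\n", node); }
static void request_writeback(int node) { printf("writeback  <- node %d\n", node); }

/* Handle a write request from 'requester' for the block described by 'e'. */
static void handle_write(dir_entry_t *e, int requester) {
    if (e->dirty)
        request_writeback(e->owner);      /* fetch the up-to-date copy first */

    /* Invalidate every other sharer recorded in the presence bits. */
    for (int n = 0; n < NUM_NODES; n++)
        if ((e->presence & (1u << n)) && n != requester)
            send_invalidate(n);

    e->presence = 1u << requester;        /* requester now holds the only copy */
    e->dirty = 1;
    e->owner = requester;
}

int main(void) {
    dir_entry_t e = { .presence = (1u << 2) | (1u << 5), .dirty = 0, .owner = -1 };
    handle_write(&e, 2);                  /* node 2 writes: node 5 is invalidated */
    return 0;
}
```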

22 Memory Consistency Model  A memory consistency model specifies the order in which memory operations appear to execute  What value can a read return?  It is a contract between application software and the system  It affects both ease of programming and performance

23 Understanding Program Order: Example
Initially A = B = 0;
  P1:  A = 1;
  P2:  while (A == 0);  B = 1;
  P3:  while (B == 0);  print A;
 What value of A will be printed by process P3?  Role of program order in ensuring P3 reads the value of A as 1
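The same three-process test can be written with C11 threads and atomics, which makes the question executable. This is my own sketch, not the slides' code; with the default memory_order_seq_cst the language guarantees the sequentially consistent answer (P3 prints 1), mirroring the SC reasoning on the following slides.

```c
#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

atomic_int A = 0, B = 0;

int p1(void *arg) { (void)arg; atomic_store(&A, 1); return 0; }

int p2(void *arg) {
    (void)arg;
    while (atomic_load(&A) == 0) ;           /* wait for P1's write */
    atomic_store(&B, 1);
    return 0;
}

int p3(void *arg) {
    (void)arg;
    while (atomic_load(&B) == 0) ;           /* wait for P2's write */
    printf("A = %d\n", atomic_load(&A));     /* prints 1 under SC-like semantics */
    return 0;
}

int main(void) {
    thrd_t t1, t2, t3;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_create(&t3, p3, NULL);
    thrd_join(t1, NULL); thrd_join(t2, NULL); thrd_join(t3, NULL);
    return 0;
}
```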

24 Example 2: Software Implementation of Mutex
  P1:  A = 0;  ...  A = 1;  if (B == 0)  critical section
  P2:  B = 0;  ...  B = 1;  if (A == 0)  critical section
 Can both P1 and P2 enter the critical section, i.e., both evaluate the "if" condition as true?
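Written with C11 seq_cst atomics (again my own sketch, not the slides' code), the interleaving argument of sequential consistency applies: in every total order at least one of the stores precedes the other thread's load, so at most one thread can see 0 and enter its critical section.

```c
#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

atomic_int A = 0, B = 0;
atomic_int in_cs = 0;                     /* counts threads that "entered" */

int p1(void *arg) {
    (void)arg;
    atomic_store(&A, 1);                  /* write, then read the other flag */
    if (atomic_load(&B) == 0)
        atomic_fetch_add(&in_cs, 1);
    return 0;
}

int p2(void *arg) {
    (void)arg;
    atomic_store(&B, 1);
    if (atomic_load(&A) == 0)
        atomic_fetch_add(&in_cs, 1);
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    /* With seq_cst (an SC-like model) this prints 0 or 1, never 2. */
    printf("threads in critical section: %d\n", atomic_load(&in_cs));
    return 0;
}
```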

25 Sequential Consistency: Definition  A system is sequentially consistent if  Operations within a processor follow program order  Operations of all processors were executed in some (interleaved) sequential order  All processors see the same sequential order

26 Implicit Memory Model  Sequential consistency (SC) [Lamport]  The result of an execution appears as if  operations from different processors executed in some sequential (interleaved) order  memory operations of each process appear in program order  [Figure] Processors P1, P2, P3, ..., Pn issuing operations, one at a time, to a single shared memory

27 Sequential Consistency: Definition  A system is sequentially consistent if  Operations within a processor follow program order  Operations of all processors were executed in some (interleaved) sequential order  All processors see the same sequential order
Initially A = B = 0;
  P1:  A = 1;
  P2:  while (A == 0);  B = 1;
  P3:  while (B == 0);  print A;

28 Under SC, can P3 print A as 0?
Initially A = B = 0;
  P1:  (w1) A = 1;
  P2:  (r2) while (A == 0);  (w2) B = 1;
  P3:  (r3) while (B == 0);  (r3') print A;
[Figure] Ordering chain w1 -> r2 -> w2 -> r3 -> r3': P2 leaves its loop only after r2 sees w1, w2 follows r2 in program order, P3 leaves its loop only after r3 sees w2, and r3' follows r3 in program order; so under SC r3' must return the value written by w1, and P3 cannot print 0

29 Sequential Consistency  SC enforces all memory orders:  Write → Read  Write → Write  Read → Read  Read → Write  SC treats all memory operations the same way!

30 Sequential Consistency: Conditions  Before a load is allowed to perform with respect to any other processor, all previous load accesses must be globally performed and all previous store accesses must be performed  Before a store is allowed to perform with respect to any other processor, all previous loads must be globally performed and all previous stores must be performed  This means the read → read, read → write, write → read, and write → write orders are all maintained!

31 Processor Consistency: Definition  A system is processor consistent if  Writes issued by a processor are seen in program order  read → read, read → write, and write → write order are enforced  but write → read order is not  Operations of all processors were executed in some (interleaved) sequential order  All processors need not see the same order of writes from different processors  [The slide revisits the examples of slides 23 and 24 under this model]

32 Example 2
  P1:  A = 0;  ...  A = 1;  if (B == 0)  critical section
  P2:  B = 0;  ...  B = 1;  if (A == 0)  critical section
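Under processor consistency the write → read order is not enforced, so each processor's read of the other flag may complete before its own write becomes visible, and both threads can observe 0. The C11 sketch below (my illustration, not the slides' code) uses relaxed atomics, a model weak enough to permit that outcome; contrast it with the seq_cst version after slide 24, where the outcome is forbidden.

```c
#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

atomic_int A = 0, B = 0;
atomic_int in_cs = 0;

int p1(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    /* Nothing orders this load after the store above. */
    if (atomic_load_explicit(&B, memory_order_relaxed) == 0)
        atomic_fetch_add(&in_cs, 1);
    return 0;
}

int p2(void *arg) {
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    if (atomic_load_explicit(&A, memory_order_relaxed) == 0)
        atomic_fetch_add(&in_cs, 1);
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    /* 2 is now a legal result: both threads may enter the critical section. */
    printf("threads in critical section: %d\n", atomic_load(&in_cs));
    return 0;
}
```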

33 Weak Consistency  Distinguishes between ordinary memory operations and synchronization operations (e.g., lock acquire/release)  A system is weakly consistent if  Before an ordinary load/store is allowed to perform, all previous synchronization accesses must be performed  Before a synchronization operation is performed, all previous loads/stores must be performed  Synchronization accesses are sequentially consistent

34 Weak Consistency  Weak ordering  Divide memory operations into data operations and synchronization operations  Synchronization operations act like a fence:  all data operations before the synch in program order must complete before the synch is executed  all data operations after the synch in program order must wait for the synch to complete  synchs are performed in program order

35 Weak Consistency  Weak ordering  Implementation of the fence: the processor keeps a counter that is incremented when a data operation is issued and decremented when it completes  Example: the PowerPC SYNC instruction
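At the programming-language level the same fence idea is exposed through explicit fences. The C11 sketch below is an analogy of my own, not the slides' example (C11 fences order the program's accesses around atomics rather than counting outstanding hardware operations): the data writes before the producer's fence become visible before the flag, and the consumer's fence keeps its data reads from moving ahead of the flag check.

```c
#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

int data[2];                 /* ordinary data operations      */
atomic_int flag = 0;         /* synchronization variable      */

int producer(void *arg) {
    (void)arg;
    data[0] = 10;
    data[1] = 20;
    atomic_thread_fence(memory_order_release);          /* fence after data ops  */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return 0;
}

int consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0) ;
    atomic_thread_fence(memory_order_acquire);          /* fence before data ops */
    printf("data = %d, %d\n", data[0], data[1]);        /* guaranteed 10, 20     */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, producer, NULL);
    thrd_create(&t2, consumer, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}
```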

36 An Example  [Figure] The same sequence of loads and stores shown under Sequential Consistency (each operation must complete before the next one in program order) and under Processor Consistency (a later load need not wait for an earlier store)

37 Example: Weak Consistency  [Figure] Loads and stores grouped between synchronization operations (Sync(Acq), Sync(Rel)); every synch acts as a fence, but there is no ordering among the loads/stores between two synchs!

38 Another Model: Release Consistency  Synchronization accesses are divided into  Acquires: operations like lock  Releases: operations like unlock  Semantics of acquire  the acquire must complete before all following memory accesses  Semantics of release  all memory operations before the release must complete  but accesses after the release in program order do not have to wait for the release  operations which follow the release and which need to wait must be protected by an acquire

39 Release Consistency  Further distinguishes between lock-acquire and lock-release synchronization operations  A system is release consistent if  Before an ordinary load/store is allowed to perform, all previous acquire accesses must be performed  Before a release operation is performed, all previous loads/stores must be performed  Synchronization accesses are processor consistent
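These acquire/release semantics map directly onto the memory_order_acquire and memory_order_release arguments of C11 atomics. Below is a minimal spinlock sketch of my own for illustration: the acquire on lock keeps the critical-section accesses from moving above it, and the release on unlock keeps them from moving below it.

```c
#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

atomic_int lock_word = 0;
int shared_counter = 0;                 /* protected by the lock */

static void lock(void) {
    /* Acquire: later accesses cannot be reordered before this. */
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) == 1)
        ;                               /* spin until the lock is obtained */
}

static void unlock(void) {
    /* Release: earlier accesses must complete before this store. */
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}

int worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock();
        shared_counter++;               /* ordinary data operation */
        unlock();
    }
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, worker, NULL);
    thrd_create(&t2, worker, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    printf("counter = %d\n", shared_counter);   /* 200000 */
    return 0;
}
```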

40 Example: Release Consistency  [Figure] The same code region under Weak Consistency (every synch is a full fence in both directions) and under Release Consistency (an acquire only orders the accesses that follow it; a release only waits for the accesses that precede it)  Acquire is treated like a READ/LOAD; Release is treated like a WRITE/STORE

41 Ordering in Consistency Models (orderings enforced by each model; R = ordinary read, W = ordinary write, SA = synch acquire, SR = synch release)
 SC: R → R, R → W, W → R, W → W
 PC: R → R, R → W, W → W (but not W → R)
 WC: S → R, S → W, R → S, W → S for every synchronization operation S, plus SA → SA, SA → SR, SR → SR, SR → SA
 RC: SA → R, SA → W, R → SR, W → SR, plus SA → SA, SA → SR, SR → SR

42 Reading Material  S. V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," WRL Research Report 95/7. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf  K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," ISCA 1990.

43 Term Project Steps
[Step 0] Choose a team of 2 (you can identify your partner)
[Step 1] Choose an area of interest
[Step 2] Read some (recent) papers and form a hypothesis
[Step 3] Check the literature to see whether it has already been studied; if yes, go back to Step 1
[Step 4] Is the study feasible in 2 months? If not, go back to Step 1
[Step 5] Do some initial study (experimentation)
[Step 6] Analyse, report, relief! :-)


45 Term Project Expectations  A non-trivial project  It should have an element of surprise  Be ambitious, but realistic too!  You must learn/get something new!  You can iterate on your ideas with me during the next two weeks  Look beyond what we have discussed so far in class!  Try to choose some recent papers/topics!

46 Term Project Schedule  Proposal due: Sept. 20 (revised to Sept. 30)  First review: Oct. 18 (revised to Oct. 29)  Report and Demo/Presentation: Nov. 29

