Presentation on theme: "Parallel Processing Problems Cache Coherence False Sharing Synchronization."— Presentation transcript:

1 Parallel Processing Problems Cache Coherence False Sharing Synchronization

2 Cache Coherence
P1 and P2 are write-back caches sharing DRAM. Track the current value of a in (P1$, P2$, DRAM), initially (*, *, 7):
1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a

3 Whatever are we to do?
–Write-Invalidate
–Write-Update

4 Write Invalidate
P1 and P2 are write-back caches. Current value of a in (P1$, P2$, DRAM), initially (*, *, 7):
1. P2: Rd a → *, 7, 7
2. P2: Wr a, 5 → *, 5, 7
3. P1: Rd a → 5, 5, 7 (P2 supplies its dirty copy)
4. P2: Wr a, 3 → *, 3, 7 (P1's copy is invalidated)
5. P1: Rd a → 3, 3, 7 (P2 supplies its dirty copy again)

5 Write Update
P1 and P2 are write-back caches. Current value of a in (P1$, P2$, DRAM), initially (*, *, 7):
1. P2: Rd a → *, 7, 7
2. P2: Wr a, 5 → *, 5, 7
3. P1: Rd a → 5, 5, 7
4. P2: Wr a, 3 → 3, 3, 7 (the write is broadcast, updating P1's copy)
5. P1: Rd a → 3, 3, 7 (hit in P1's cache)

6 Performance Considerations
Invalidate:
–Writing makes data exclusive
–Receiving changed data slower
Update:
–Once shared, always shared
–Once shared, writes always on bus
–Get changed data very quickly

7 Cache Coherence False Sharing
P1 and P2 are write-back caches with a cacheline size of 4 words. Current contents in P1$ and P2$, initially (*, *):
1. P2: Rd A[0]
2. P1: Rd A[1]
3. P2: Wr A[0], 5
4. P1: Wr A[1], 3

8 Look closely at the example
P1 and P2 do not access the same element. But A[0] and A[1] are in the same cache block, so wherever that block is cached, both elements are present together, and a write to either one affects the whole block.

9 False Sharing
–Different processors access different items that lie in the same cache block
–Leads to ___________ misses

10 Cache Performance
// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs
For(i=0;i

11 Which is worse? Both access the same number of elements No processors access the same elements as each other

12 Synchronization
Sum += A[i];
Two processors, one with i = 0 and one with i = 50. Before the action:
–Sum = 5
–A[0] = 10
–A[50] = 33
What is the proper result?

13 Synchronization
Sum = Sum + A[i];
Assembly for this statement, assuming:
–A[i] is already in $t0
–&Sum is already in $s0

14 Synchronization Ordering #1

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

P1 inst | Effect    | P2 inst | Effect
Given   | $t0 = 10  | Given   | $t0 = 33
lw      | $t1 =     | lw      | $t1 =
add     | $t1 =     | add     | $t1 =
sw      | Sum =     | sw      | Sum =

15 Synchronization Ordering #2

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

P1 inst | Effect    | P2 inst | Effect
Given   | $t0 = 10  | Given   | $t0 = 33
lw      | $t1 =     | lw      | $t1 =
add     | $t1 =     | add     | $t1 =
sw      | Sum =     | sw      | Sum =

16 Does Cache Coherence solve it?
Did the load bring in an old value?
Sum += A[i] is ___________
–Atomic – the operation occurs in one unit, and nothing may interrupt it.

17 Synchronization Problem
Reading and writing memory is a non-atomic operation
–You cannot read and then write a memory location in a single operation
We need __________________ that allow us to read and write without interruption

18 Solution
Software
–“lock” –
–“unlock” –
Hardware
–Provide primitives that read & write in order to implement lock and unlock

19 Software Using lock and unlock Sum += A[i]

20 Hardware
Implementing lock & unlock:
swap $1, 100($2)
–Swap the contents of $1 and M[$2+100]

21 Hardware: Implementing lock & unlock with swap

Lock:   li   $t0, 1
loop:   swap $t0, 0($a0)
        bne  $t0, $0, loop
Unlock: sw   $0, 0($a0)

If lock has 0, it is free
If lock has 1, it is held

22 Summary
–Cache coherence must be implemented for shared memory to work
–False sharing causes bad cache performance
–Hardware primitives are necessary for synchronizing shared data

