
1 Memory Consistency


3
- Reads and writes of shared memory face a consistency problem
- Need to achieve controlled consistency in memory events
- Shared memory behavior is determined by:
  - Program order
  - Memory access order
- Challenges:
  - Modern processors reorder operations
  - Compiler optimizations (scalar replacement, instruction rescheduling)

4 Basic Concept
- On a multiprocessor:
  - Concurrent instruction streams (threads) run on different processors
  - Memory events performed by one process may create data to be used by another
    - Events: reads and writes
- A memory consistency model specifies how the memory events initiated by one process should be observed by other processes
- Event ordering
  - Declares which memory accesses may proceed and which process must wait for a later access when processes compete

5 Uniprocessor vs. Multiprocessor Model

6 Understanding Program Order

Initially X = 2

    P1                      P2
    r0 = Read(X)            r1 = Read(X)
    r0 = r0 + 1             r1 = r1 + 1
    Write(r0, X)            Write(r1, X)

Possible execution sequences:

    Sequence A (X = 3)      Sequence B (X = 4)
    P1: r0 = Read(X)        P2: r1 = Read(X)
    P2: r1 = Read(X)        P2: r1 = r1 + 1
    P1: r0 = r0 + 1         P2: Write(r1, X)
    P1: Write(r0, X)        P1: r0 = Read(X)
    P2: r1 = r1 + 1         P1: r0 = r0 + 1
    P2: Write(r1, X)        P1: Write(r0, X)
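The same two-column example can be run on a real machine. The sketch below is a hypothetical test program (not part of the slides): the operating system picks the interleaving, so repeated runs can print X = 3 or X = 4, and the unsynchronized accesses are, strictly speaking, a data race.

    /* Minimal sketch: two threads each read, increment, and write back a
     * shared variable without synchronization. */
    #include <pthread.h>
    #include <stdio.h>

    static int X = 2;

    static void *increment(void *arg) {
        int r = X;      /* Read(X)  */
        r = r + 1;      /* add 1    */
        X = r;          /* Write(X) */
        return NULL;
    }

    int main(void) {
        pthread_t p1, p2;
        pthread_create(&p1, NULL, increment, NULL);
        pthread_create(&p2, NULL, increment, NULL);
        pthread_join(p1, NULL);
        pthread_join(p2, NULL);
        printf("X = %d\n", X);   /* 3 or 4, depending on the interleaving */
        return 0;
    }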

7 Interleaving
- Program orders of individual instruction streams may need to be modified because of interaction among them
- Finding the optimum global memory order is an NP-hard problem

    P1: a. A = 1;  b. Print B, C;
    P2: c. B = 1;  d. Print A, C;
    P3: e. C = 1;  f. Print A, B;

    A, B, C are shared variables (initially 0), accessed through a shared-memory switch

8 Example
- Concatenate the program orders in P1, P2, and P3
  - The printed values form 6-tuple binary strings (64 output combinations)
  - (a,b,c,d,e,f) => 001011 (in-order execution)
  - (a,c,e,b,d,f) => 111111 (in-order execution)
  - (b,d,f,e,a,c) => 000000 (out-of-order execution)
  - 6! = 720 possible permutations

    P1: a. A = 1;  b. Print B, C;
    P2: c. B = 1;  d. Print A, C;
    P3: e. C = 1;  f. Print A, B;

    A, B, C are shared variables (initially 0)

9 Mutual exclusion problem
- The mutual exclusion problem in concurrent programming:
  - Allow two threads to share a single-use resource without conflict, using only shared memory for communication
  - Avoid the strict alternation of a naive turn-taking algorithm (sketched below)
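For contrast, here is a minimal sketch of the naive turn-taking algorithm the slide refers to (the variable name is illustrative, not from the slides): each process spins until a shared turn variable selects it, so the two processes are forced to alternate even when one of them has no interest in entering the critical section.

    // shared
    volatile int turn = 0;

    // Process 0                        // Process 1
    while (turn != 0) { /* spin */ }    while (turn != 1) { /* spin */ }
    // critical section                 // critical section
    turn = 1;                           turn = 0;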

10 Definition
- If two processes attempt to enter a critical section at the same time, allow only one process in, based on whose turn it is
- If one process is already in the critical section, the other process waits for the first process to exit
- How would you implement this so that it guarantees:
  - mutual exclusion,
  - freedom from deadlock, and
  - freedom from starvation?

11 Solution: Dekker's Algorithm
- Uses two flags, f0 and f1, which indicate an intention to enter the critical section, and a turn variable which indicates who has priority between the two processes

12
    // shared variables
    flag[0] := false
    flag[1] := false
    turn := 0          // or 1

    // P0
    flag[0] := true
    while flag[1] = true {
        if turn ≠ 0 {
            flag[0] := false
            while turn ≠ 0 { }
            flag[0] := true
        }
    }
    // critical section ...
    turn := 1
    flag[0] := false
    // remainder section

    // P1
    flag[1] := true
    while flag[0] = true {
        if turn ≠ 1 {
            flag[1] := false
            while turn ≠ 1 { }
            flag[1] := true
        }
    }
    // critical section ...
    turn := 0
    flag[1] := false
    // remainder section

13 Disadvantages
- Limited to two processes
- Uses busy waiting instead of process suspension
- Modern CPUs execute their instructions out of order
  - Even memory accesses can be reordered, which can break the algorithm unless memory fences are used

14 Peterson's Algorithm
    // shared variables
    flag[0] = 0;
    flag[1] = 0;
    int turn;

    // P0
    flag[0] = 1;
    turn = 1;
    while (flag[1] == 1 && turn == 1) {
        // busy wait
    }
    // critical section ...
    // end of critical section
    flag[0] = 0;

    // P1
    flag[1] = 1;
    turn = 0;
    while (flag[0] == 1 && turn == 0) {
        // busy wait
    }
    // critical section ...
    // end of critical section
    flag[1] = 0;
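Because of the reordering problem noted on slide 13, a direct C translation of the code above is unreliable on modern hardware. A minimal sketch using C11 sequentially consistent atomics (the lock/unlock function names are illustrative, not from the slides):

    #include <stdatomic.h>

    static atomic_int flag[2];    /* intent to enter, initially 0 */
    static atomic_int turn;       /* which thread yields          */

    /* Peterson's algorithm for two threads (self = 0 or 1). The default
     * seq_cst ordering keeps the store to flag[self] from being reordered
     * past the loads of flag[other] and turn. */
    void lock(int self) {
        int other = 1 - self;
        atomic_store(&flag[self], 1);
        atomic_store(&turn, other);
        while (atomic_load(&flag[other]) && atomic_load(&turn) == other) {
            /* busy wait */
        }
    }

    void unlock(int self) {
        atomic_store(&flag[self], 0);
    }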

15 Lamport's bakery algorithm
- Works like a bakery with a numbering machine
  - The 'customers' are threads, identified by the letter i, obtained from a global variable
  - More than one thread might get the same number; ties are broken by thread index

    // declaration and initial values of global variables
    Entering: array [1..NUM_THREADS] of bool = {false};
    Number:   array [1..NUM_THREADS] of integer = {0};

    lock(integer i) {
        Entering[i] = true;
        Number[i] = 1 + max(Number[1], ..., Number[NUM_THREADS]);
        Entering[i] = false;
        for (j = 1; j <= NUM_THREADS; j++) {
            // Wait until thread j receives its number:
            while (Entering[j]) { /* nothing */ }
            // Wait until all threads with smaller numbers, or with the same
            // number but higher priority, finish their work:
            while ((Number[j] != 0) && ((Number[j], j) < (Number[i], i))) {
                /* nothing */
            }
        }
    }

    unlock(integer i) {
        Number[i] = 0;
    }

    Thread(integer i) {
        while (true) {
            lock(i);
            // critical section ...
            unlock(i);
            // non-critical section ...
        }
    }

16 Models
- Strict consistency: a read always returns the most recent write to the same address
- Sequential consistency: the result of any execution appears as some interleaving of the individual programs, each in sequential program order
- Processor consistency: writes issued by each processor are observed in program order, but writes from different processors can be observed out of order (Goodman)
- Weak consistency: the programmer uses synchronization operations to enforce sequential consistency (Dubois)
  - Reads by each processor are not restricted
  - More opportunities for pipelining
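The contrast between the stronger and weaker models can be sketched with C11 atomics (my example, not from the slides): under a weak model the programmer must attach ordering to the synchronization operation, here a release/acquire pair on a flag, to get the guarantee that sequential consistency would give for free.

    #include <stdatomic.h>

    int data;                 /* ordinary shared data          */
    atomic_int ready = 0;     /* flag used to publish the data */

    /* Producer: on a weakly ordered machine the two stores could become
     * visible out of order unless the flag store carries release (or
     * sequentially consistent) semantics. */
    void producer(void) {
        data = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    /* Consumer: the acquire load pairs with the release store, so once
     * ready == 1 is observed, data == 42 is guaranteed to be visible. */
    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire)) {
            /* spin */
        }
        return data;
    }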

17 Relationship to the Cache Coherence Protocol
- The cache coherence protocol must observe the constraints imposed by the memory consistency model
  - Example: a read hit in a cache
    - Reading without waiting for the completion of a previous write may violate sequential consistency
- The cache coherence protocol provides a mechanism to propagate a newly written value
- The memory consistency model places an additional constraint on when that value can be propagated to a given processor

18 Latency Tolerance
- Scalable systems
  - Distributed shared memory architecture
  - Access to remote memory incurs long latency
  - Processor speed outpaces the memory and interconnect
- Need for latency reduction, avoidance, and hiding

19 Latency Avoidance
- Organize user applications at the architectural, compiler, or application level to achieve program/data locality
- Possible when applications exhibit temporal or spatial locality
- How do you enhance locality?

20 Locality Enhancement
- Architectural support:
  - Cache coherence protocols, memory consistency models, fast message passing, etc.
- User support:
  - High Performance Fortran: the program instructs the compiler how to allocate the data (example?)
- Software support:
  - The compiler performs certain transformations (example?)

21 Latency Reduction
- What if locality is limited?
- What if data access patterns change dynamically?
  - For example: sorting algorithms
- We need latency reduction mechanisms
  - Target the communication subsystem:
    - Interconnect
    - Network interface
    - Fast communication software (in clusters: TCP, UDP, etc.)

22 Latency Hiding
- Hide communication latency within computation
  - Overlapping techniques:
    - Prefetching: hide read latency
    - Distributed coherent caches: reduce cache misses and shorten the time to retrieve a clean copy
    - Multiple-context processors: switch from one context to another when a long-latency operation is encountered (hardware-supported multithreading)

23 Memory Delays
- SMP
  - High due to added contention for shared resources such as a shared bus and memory modules
- Distributed
  - Even more pronounced in distributed-memory multiprocessors, where memory requests may need to be satisfied across an interconnection network
- By masking some or all of these significant memory latencies, prefetching can be an effective means of speeding up multiprocessor applications

24 Data Prefetching
- Overlap computation with memory accesses
  - Rather than waiting for a cache miss to trigger a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference

25 Cache Hierarchy
- A popular latency-reducing technique
- But it is still common for scientific programs to spend more than half their run time stalled on memory requests
  - Partially a result of the "on demand" fetch policy
    - Data are fetched into the cache from main memory only after the processor has requested a word and found it absent from the cache

26 Why do scientific applications exhibit poor cache utilization?
- Is something wrong with the principle of locality?
- The traversal of large data arrays is often at the heart of the problem
- Temporal locality in array computations
  - Once an element has been used to compute a result, it is often not referenced again before it is displaced from the cache to make room for additional array elements
- Sequential array access patterns exhibit a high degree of spatial locality, but many other array access patterns do not
  - For example, in a language that stores matrices in row-major order, a column-wise traversal of a matrix results in consecutively referenced elements being widely separated in memory. Such strided reference patterns yield low spatial locality if the stride is greater than the cache block size: only one word per cache block is actually used, while the remainder of the block remains untouched even though cache space has been allocated for it (see the sketch below).
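A small illustration of the point above (my example, assuming C's row-major layout):

    #define N 1024
    double m[N][N];

    double sum_colwise(void) {
        double sum = 0.0;
        /* Column-wise traversal of a row-major matrix: consecutive references
         * m[0][j], m[1][j], m[2][j], ... are N doubles apart, so each access
         * touches a different cache block and only one word per fetched block
         * is used. Swapping the two loops (row-wise traversal) makes the
         * accesses sequential and restores spatial locality. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        return sum;
    }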

27 [Timing diagram: execution time divided between computation, memory references satisfied within the cache hierarchy, and main memory access time; memory references r1, r2, and r3 are not in the cache]

28 Challenges
- Cache pollution
  - Even when data arrive early enough to hide all of the memory latency, they must be held in the processor cache for some period of time before the processor uses them
  - During this time, the prefetched data are exposed to the cache replacement policy and may be evicted before use
  - Moreover, the prefetched data may displace data that the processor is currently using
- Memory bandwidth
  - Back to the timing diagram on slide 27:
    - No prefetch: the three memory requests occur within the first 31 time units after program startup
    - With prefetch: the requests are compressed into a period of 19 time units
  - By removing processor stall cycles, prefetching effectively increases the frequency of memory requests issued by the processor
  - The memory system must be designed to match this higher bandwidth, or it becomes saturated and nullifies the benefits of prefetching

29 Spatial Locality
- Block transfer is a form of prefetching (1960s)
- Software prefetching came later (1980s)

30 Binding Prefetch
- Non-blocking load instructions
  - Issued in advance of the actual use to take advantage of the parallelism between the processor and the memory subsystem
  - Rather than loading data into the cache, however, the specified word is placed directly into a processor register
- The value of the prefetched variable is bound to a named location (a register) at the time the prefetch is issued

31 Software-Initiated Data Prefetching
- Some form of fetch instruction
  - Can be as simple as a load into a processor register
- Fetches are non-blocking memory operations
  - They allow prefetches to bypass other outstanding memory operations in the cache
- Fetch instructions cannot cause exceptions
- The hardware required to implement software-initiated prefetching is modest
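On current compilers the fetch instruction is typically exposed as an intrinsic. A minimal sketch using GCC/Clang's __builtin_prefetch (my example; the distance of 16 elements is an arbitrary illustrative choice):

    /* Issue a non-blocking, non-faulting prefetch for a[i + 16] while
     * processing a[i]. Prefetching past the end of the array is harmless
     * because the prefetch cannot cause an exception. */
    void scale(double *a, long n, double k) {
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 16], 0, 3);  /* read, keep in cache */
            a[i] *= k;
        }
    }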

32 Prefetch Challenges
- Prefetch scheduling
  - The judicious placement of fetch instructions within the target application
  - It is not possible to precisely predict when to schedule a prefetch so that data arrive in the cache at the moment the processor requests them
  - Uncertainties that are not predictable at compile time require careful consideration when statically scheduling prefetch instructions
  - Fetch instructions may be added by the programmer or by the compiler during an optimization pass
    - Programming effort?

33 Suitable spots for "Fetch"
- Most often used within loops responsible for large array calculations, which are
  - common in scientific codes,
  - exhibit poor cache utilization, and
  - have predictable array referencing patterns

34 Example: assume a four-word cache block
- Issues with a straightforward prefetching loop:
  - Cache misses during the first iteration (no prefetch has been issued yet)
  - Unnecessary prefetches in the last iteration of the unrolled loop
- How to solve these two issues? Software pipelining (see the sketch below)
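A sketch of what the software-pipelined, unrolled loop might look like, assuming one array element per word, a four-word cache block, GCC's __builtin_prefetch, and an array length that is a positive multiple of 4 (all assumptions mine, not from the slides):

    /* Prologue prefetches the first block, so the first iteration does not
     * miss; the main loop prefetches the block needed one iteration ahead;
     * the epilogue handles the last block without issuing any prefetch, so
     * no prefetch is issued past the end of the array. */
    void sum4(const double *a, double *result, long n) {
        double s = 0.0;
        long i;

        __builtin_prefetch(&a[0], 0, 0);            /* prologue: first block */

        for (i = 0; i < n - 4; i += 4) {            /* main loop, unrolled x4 */
            __builtin_prefetch(&a[i + 4], 0, 0);    /* block for next iteration */
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }

        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3]; /* epilogue: no prefetch */
        *result = s;
    }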

35 Assumptions
- Implicit assumption: prefetching one iteration ahead of the data's actual use is sufficient to hide the latency
- What if the loop contains only a small computational body?
  - Define the prefetch distance d: initiate prefetches d iterations before the data are referenced
  - How do you determine d? Let
    - l be the average cache miss latency, measured in processor cycles, and
    - s be the estimated cycle time of the shortest possible execution path through one loop iteration, including the prefetch overhead
  - Then d = ceil(l / s)

36 Revisiting the example
- Assume an average miss latency of 100 processor cycles and a loop iteration time of 45 cycles
  - d = ceil(100 / 45) = ceil(2.22) = 3, so prefetches are issued three iterations ahead

37 Case Study
- Given a distributed-shared-memory multiprocessor
- Define a remote access cache (RAC)
  - Assume the RAC is located at the network interface of each node
  - Motivation: prefetched remote data can be accessed at a speed comparable to that of local memory, while the processor cache hierarchy is reserved for demand-fetched data
- Which is better: having a RAC, or prefetching data directly into the processor cache hierarchy?
  - Despite significantly increasing cache contention and reducing overall cache space,
  - the latter approach results in higher cache hit rates,
  - which is the dominant performance factor

38 Case Study
- Transferring individual cache blocks across the interconnection network of a multiprocessor yields low network efficiency
  - What if we transfer prefetched data in larger units?
- Method: the compiler schedules a single prefetch command before the loop is entered, rather than software-pipelining prefetches within the loop
  - Large blocks of remote memory used within the loop body are transferred
  - Data are prefetched into local memory to prevent excessive cache pollution
- Issues:
  - This is a binding prefetch, since data stored in a processor's local memory are not exposed to any coherence policy
  - It therefore imposes constraints on the use of prefetched data which, in turn, limit the amount of remote data that can be prefetched

39 What about besides the "loops"?
- Prefetching is normally restricted to loops
  - Array accesses whose indices are linear functions of the loop indices
  - The compiler must be able to predict memory access patterns when scheduling prefetches
  - Such loops are relatively common in scientific codes but far less so in general applications
- Irregular data structures
  - It is difficult to reliably predict when a particular datum will be accessed
  - Once a cache block has been accessed, there is less chance that several successive cache blocks will also be requested when data structures such as graphs and linked lists are used
  - Comparatively high temporal locality can already yield high cache utilization, diminishing the benefit of prefetching

40 What is the overhead of fetch instructions?
- They require extra execution cycles
- Fetch source addresses must be calculated and stored in the processor
  - to avoid recalculation for the matching load or store instruction
  - How: register space
  - Problem: the compiler has less register space to allocate to other active variables
    - Fetch instructions increase register pressure
    - It gets worse when the prefetch distance is greater than one or there are multiple prefetch addresses
- Code expansion
  - May degrade instruction cache performance
- Software-initiated prefetching is done statically
  - It cannot detect when a prefetched block has been prematurely evicted and needs to be re-fetched

41 Hardware-Initiated Data Prefetching
- Prefetching capability without programmer or compiler intervention
- No changes to existing executables
  - Instruction overhead is completely eliminated
- Can take advantage of run-time information to potentially make prefetching more effective

42 Cache Blocks
- Typically, data are fetched from main memory into the processor cache in units of cache blocks
  - Multi-word cache blocks are themselves a form of data prefetching
  - Large cache blocks: effective prefetching vs. cache pollution
  - What is the complication for SMPs with private caches?
    - False sharing: two or more processors access different words within the same cache block, and at least one of the accesses is a store
    - Cache coherence traffic is generated to ensure that the changes made to a block by a store operation are seen by all processors caching the block
      - Unnecessary traffic
      - Increasing the cache block size increases the likelihood of such occurrences
- How do we take advantage of spatial locality without introducing the problems associated with large cache blocks?
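A small program that exhibits the false sharing described above (my example; the 64-byte block size and the structure names are assumptions):

    #include <pthread.h>

    /* The two counters below share one (assumed 64-byte) cache block, so every
     * store by one thread invalidates the other thread's copy of the block and
     * generates coherence traffic even though the threads never touch the same
     * word. Switching the threads to padded_counters, where each counter sits
     * in its own block, removes the false sharing. */
    struct { long a, b; } shared_counters;

    struct {
        long a; char pad_a[64 - sizeof(long)];
        long b; char pad_b[64 - sizeof(long)];
    } padded_counters;

    static void *bump_a(void *arg) {
        for (long i = 0; i < 100000000L; i++) shared_counters.a++;
        return NULL;
    }

    static void *bump_b(void *arg) {
        for (long i = 0; i < 100000000L; i++) shared_counters.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }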

43 Sequential prefetching
- One-block-lookahead (OBL) approach
  - Initiates a prefetch for block b+1 when block b is accessed
- How is this different from simply doubling the block size?
  - Prefetched blocks are treated separately with regard to the cache replacement and coherence policies

44 OBL: Case Study
- Assume a large block contains one word that is frequently referenced and several other words that are not in use
- Assume an LRU replacement policy is used
- What is the implication?
  - The entire block is retained even though only a portion of the block's data is actually in use
- How do we solve this?
  - Replace the large block with two smaller blocks
    - One of them can be evicted to make room for more active data
    - The use of smaller cache blocks also reduces the probability of false sharing

45 OBL implementations
- Classified by what type of access to block b initiates the prefetch of b+1:
  - Prefetch-on-miss
    - Initiates a prefetch for block b+1 whenever an access to block b results in a cache miss
    - If b+1 is already cached, no memory access is initiated
  - Tagged prefetch
    - Associates a tag bit with every memory block
    - The bit is used to detect when a block is demand-fetched, or when a prefetched block is referenced for the first time
    - In either case, the next sequential block is fetched
- Which one is better in terms of reducing the miss rate: prefetch-on-miss or tagged prefetch? (See the sketch below.)
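A minimal sketch of the two policies as they might appear in a cache simulator (the structure, array sizes, and function names are mine, not from the slides):

    #include <stdbool.h>

    #define NUM_BLOCKS 4096

    static bool cached[NUM_BLOCKS];   /* block currently in the cache        */
    static bool tag[NUM_BLOCKS];      /* tagged prefetch: set on a prefetch, */
                                      /* cleared on the first demand use     */

    static void fetch(int b, bool prefetched) {
        if (b >= NUM_BLOCKS) return;
        cached[b] = true;
        tag[b] = prefetched;          /* remember how the block arrived */
    }

    /* Prefetch-on-miss: prefetch b+1 only when the access to b misses. */
    void access_prefetch_on_miss(int b) {
        if (!cached[b]) {
            fetch(b, false);          /* demand fetch of b */
            fetch(b + 1, true);       /* prefetch of b+1   */
        }
    }

    /* Tagged prefetch: prefetch b+1 on a demand miss for b, and also when a
     * previously prefetched block b is referenced for the first time. */
    void access_tagged(int b) {
        if (!cached[b]) {
            fetch(b, false);
            fetch(b + 1, true);
        } else if (tag[b]) {          /* first reference to a prefetched block */
            tag[b] = false;
            fetch(b + 1, true);
        }
    }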

46 Prefetch on miss vs. tagged prefetch
[Figure: accessing three contiguous blocks under a strictly sequential access pattern — prefetch-on-miss stalls on every other block, while tagged prefetch misses only on the first block]

47 Shortcoming of the OBL
- The prefetch may not be initiated far enough in advance of the actual use to avoid a processor memory stall
  - A sequential access stream resulting from a tight loop, for example, may not allow sufficient time between the uses of blocks b and b+1 to completely hide the memory latency

48 How do you solve this shortcoming?
- Increase the number of blocks prefetched after a demand fetch from one to d
  - As each prefetched block b is accessed for the first time, the cache is interrogated to check whether blocks b+1, ..., b+d are present in the cache
- What if d = 1? What kind of prefetching is this?
  - Tagged prefetch

49 Another technique with d-prefetch
- The d prefetched blocks are brought into a FIFO stream buffer before being brought into the cache
  - As each buffer entry is referenced, it is brought into the cache, the remaining blocks are moved up in the queue, and a new block is prefetched into the tail position
  - If a miss occurs in the cache and the desired block is also not found at the head of the stream buffer, the buffer is flushed
- Advantage:
  - Prefetched data are not placed directly into the cache, which avoids cache pollution
- Disadvantage:
  - Prefetched blocks must be accessed in strictly sequential order to take advantage of the stream buffer

50 Tradeoffs of d-prefetching?
- Good: increasing the degree of prefetching reduces miss rates in sections of code that show a high degree of spatial locality
- Bad: additional traffic and cache pollution are generated by sequential prefetching during program phases that show little spatial locality
- What if we were able to vary d?

51 Adaptive sequential prefetching
- d is matched to the degree of spatial locality exhibited by the program at a particular point in time
- A prefetch efficiency metric is periodically calculated
  - Prefetch efficiency: the ratio of useful prefetches to total prefetches
    - A useful prefetch occurs whenever a prefetched block results in a cache hit
- d is initialized to one, then
  - incremented whenever the efficiency exceeds a predetermined upper threshold,
  - decremented whenever the efficiency drops below a lower threshold,
  - and if d reaches 0, prefetching is turned off
- Which is better: adaptive or tagged prefetching?
  - Adaptive prefetching improves the miss ratio, but at the cost of more memory traffic and contention
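A sketch of the adjustment logic (the counter names, thresholds, and maximum degree are illustrative assumptions, not values from the slides):

    /* Periodically recompute prefetch efficiency and adjust the degree d. */
    #define D_MAX        8
    #define UPPER_THRESH 0.75
    #define LOWER_THRESH 0.25

    static int  d = 1;                /* current prefetch degree        */
    static long prefetches_issued;    /* total prefetches this interval */
    static long prefetches_useful;    /* prefetched blocks that hit     */

    void adjust_degree(void) {
        if (prefetches_issued == 0) return;
        double efficiency = (double)prefetches_useful / (double)prefetches_issued;

        if (efficiency > UPPER_THRESH && d < D_MAX)
            d++;                      /* strong spatial locality: prefetch deeper */
        else if (efficiency < LOWER_THRESH && d > 0)
            d--;                      /* weak spatial locality: back off;         */
                                      /* d == 0 disables prefetching              */
        prefetches_issued = prefetches_useful = 0;   /* start a new interval */
    }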

52 Sequential prefetching summary
- Does sequential prefetching require changes to existing executables?
- What about the hardware complexity?
- Which scheme offers both simplicity and performance?
  - Tagged prefetching
- Compared to software-initiated prefetching, what might be the problem?
  - Sequential hardware prefetching tends to generate more unnecessary prefetches
  - Non-sequential access patterns do not work well
    - For example, scalar references or array accesses with large strides result in unnecessary prefetch requests, because they do not exhibit the spatial locality upon which sequential prefetching is based
- To enable prefetching of strided and other irregular data access patterns, several more elaborate hardware prefetching techniques have been proposed

53 Prefetching with arbitrary strides
- Uses a Reference Prediction Table (RPT)
- Each entry has a state: initial, transient, or steady

54 RPT Entries State Transition
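A sketch of how an RPT entry might be updated on each load, using the three states named on slide 53 (the field names, the simplified state transitions, and the update rule are my assumptions, not taken from the slides; allocation of a new entry with stride 0 and state INITIAL is assumed to happen elsewhere):

    #include <stdint.h>
    #include <stdbool.h>

    enum rpt_state { INITIAL, TRANSIENT, STEADY };

    struct rpt_entry {
        uint64_t pc;          /* load instruction that owns this entry */
        uint64_t prev_addr;   /* last effective address it produced    */
        int64_t  stride;      /* predicted stride                      */
        enum rpt_state state;
    };

    /* Called on every load by the entry's instruction. Returns true (and sets
     * *prefetch_addr) when the entry is steady and a prefetch should be issued
     * one stride ahead. */
    bool rpt_update(struct rpt_entry *e, uint64_t addr, uint64_t *prefetch_addr) {
        bool correct = (addr == e->prev_addr + e->stride);

        switch (e->state) {
        case INITIAL:   e->state = correct ? STEADY : TRANSIENT; break;
        case TRANSIENT: e->state = correct ? STEADY : INITIAL;   break;
        case STEADY:    if (!correct) e->state = INITIAL;        break;
        }
        if (!correct)
            e->stride = (int64_t)(addr - e->prev_addr);   /* learn new stride */
        e->prev_addr = addr;

        if (e->state == STEADY) {
            *prefetch_addr = addr + e->stride;   /* prefetch distance of one */
            return true;
        }
        return false;
    }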

55 Matrix Multiplication
- Assume starting addresses a = 10000, b = 20000, c = 30000, and a one-word cache block
- [Table: RPT contents after the first iteration of the inner loop]
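The loop being traced is not reproduced in the transcript; the triple-nested form below is an assumption on my part about its shape. With this loop, the a reference repeats the same address within the inner loop, the b reference advances by one word per iteration, and the c reference advances by one row per iteration.

    /* Assumed loop structure for the RPT trace (not shown in the transcript). */
    #define N 100
    int a[N][N], b[N][N], c[N][N];

    void matmul(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    /* per inner-loop iteration:
                     *   a[i][j] stays at the same address (stride 0)
                     *   b[i][k] advances by one word      (stride 1)
                     *   c[k][j] advances by one row       (stride N words) */
                    a[i][j] += b[i][k] * c[k][j];
    }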

56 Matrix Multiplication
- [Table: RPT contents after the second iteration of the inner loop]
- Hits/misses?

57 Matrix Multiplication
- [Table: RPT contents after the third iteration of the inner loop]
- The b and c references hit, provided that a prefetch distance of one is enough

58 RPT Limitations
- The prefetch distance is fixed at one loop iteration
  - Loop entrance: misses
  - Loop exit: unnecessary prefetches
- How can we solve this?
  - Use a longer distance
    - Prefetch address = effective address + (stride × distance)
  - Use a lookahead program counter (LA-PC)

59 Summary
- Prefetches should be timely, useful, and introduce little overhead
- Prefetching reduces secondary effects in the memory system
- Prefetching strategies are diverse, and no single strategy provides optimal performance

60 Summary
- Prefetching schemes are diverse
- To help categorize a particular approach, it is useful to answer three basic questions about the prefetching mechanism:
  1) When are prefetches initiated?
  2) Where are prefetched data placed?
  3) What is the unit of prefetch?

61 Software vs. Hardware Prefetching
- Software: prefetch instructions increase the amount of work done by the processor
- Hardware-based prefetching techniques do not require explicit fetch instructions
  - The hardware monitors the processor in an attempt to infer prefetching opportunities
  - No instruction overhead
  - But they generate more unnecessary prefetches than software-initiated schemes
    - They must speculate on future memory accesses without the benefit of compile-time information
      - Cache pollution
      - Consumed memory bandwidth

62 Conclusions
- Prefetches can be initiated either by
  - an explicit fetch operation within a program (software-initiated), or
  - logic that monitors the processor's referencing pattern (hardware-initiated)
- Prefetches must be timely
  - Issued too early: the prefetched data may displace other useful data, or be displaced itself, before use
  - Issued too late: the data may not arrive before the actual memory reference, introducing stalls
- Prefetches must be precise
  - The software approach issues prefetches only for data that are likely to be used
  - Hardware schemes tend to fetch more data unnecessarily

63 Conclusions
- The decision of where to place prefetched data in the memory hierarchy
  - The data must land at a higher level of the memory hierarchy than where they already reside to provide a performance benefit
- The majority of schemes place prefetched data in some type of cache memory
- Prefetching data into processor registers is a binding prefetch, and additional constraints must be imposed on the use of the data
- Finally, multiprocessor systems can introduce additional levels into the memory hierarchy which must be taken into consideration

64 Conclusions
- Data can be prefetched in units of single words, cache blocks, or larger blocks of memory
  - The unit is determined by the organization of the underlying cache and memory system
- Uniprocessors and SMPs
  - Cache blocks are appropriate
- Distributed-memory multiprocessors
  - Larger memory blocks are appropriate, to amortize the cost of initiating a data transfer across an interconnection network

