
1 1 Parallel Scientific Computing: Algorithms and Tools Lecture #2 APMA 2821A, Spring 2008 Instructors: George Em Karniadakis Leopold Grinberg

2 2 Memory  Bits: 0, 1; Bytes: 8 bits  Memory size units: KB – 10^3 bytes; MB – 10^6 bytes; GB – 10^9 bytes; TB – 10^12 bytes; PB – 10^15 bytes  Memory performance measures:  Access time (also response time or latency): the interval between the time a memory request is issued and the time the request is satisfied.  Cycle time: the minimum time between two successive memory requests.  [Timing diagram] A request issued at t0 is satisfied at t1; the memory can accept a new request at t2. Access time = t1 - t0; cycle time = t2 - t0. If another request arrives at time t with t0 < t < t2, the memory is busy (DRAM only).
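A quick worked example with illustrative numbers (not from the slides): suppose the access time is 60 ns and the cycle time is 100 ns. A single load returns its data after 60 ns, but a second load issued immediately afterwards must wait until 100 ns have elapsed since the first one, so back-to-back requests complete one every 100 ns rather than one every 60 ns. Interleaved memory (slide 22) is one way to hide this gap.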

3 3 Memory Hierarchy  Memory can be fast (costly) or slow (cheaper).  To increase overall performance, exploit locality of reference:  faster (and smaller) memory sits closer to the CPU;  slower (and larger) memory sits farther from the CPU.  Keep often-used data in fast memory; leave less-often-used data in slow memory.  Key: when a lower level of the hierarchy sends the value at location x to a higher level, it also sends the contents at x+1, x+2, etc., i.e. it sends a block of data  a cache line.

4 4 Memory Hierarchy  Performance of different levels can be very different  e.g. access time for L1 cache can be 1 cycle, L2 can be 5 or 6 cycles, while main memory can be dozens of cycles and secondary memory can be orders of magnitude slower.  [Hierarchy diagram] Registers → Level-1 cache → Level-2 cache → Main memory → Secondary memory (hard disk) → Network storage. Moving toward the registers: increasing speed, increasing cost, decreasing size; moving away: decreasing speed, decreasing cost, increasing size. (A cache is a piece of fast, expensive memory.)

5 5 How Memory Hierarchy Works  (RISC processor) The CPU works only on data in registers.  If data is not in a register, it requests the data from memory and loads it into a register …  Data in registers come only from, and go only to, the L1 cache.  When the CPU requests data from memory, the L1 cache takes over;  If the data is in the L1 cache (cache hit), return the data to the CPU immediately; end memory access;  If the data is not in the L1 cache (cache miss) …

6 6 How Memory Hierarchy Works  If the data is not in the L1 cache, the L1 cache forwards the memory request down to the L2 cache.  If the L2 cache has the data (cache hit), it returns the data to the L1 cache, which in turn returns the data to the CPU; end memory access;  If the L2 cache does not have the data (cache miss) …  If the data is not in the L2 cache, the L2 cache forwards the memory request down to main memory.  If the data is in main memory, main memory passes the data to the L2 cache, which then passes it to the L1 cache, which then passes it to the CPU.  If the data is not in memory …  Then the request is passed to the OS to read the data from secondary storage (disk); the data is then passed to main memory, the L2 cache, the L1 cache, and a register.

7 7 Cache Line  A cache line is the smallest unit of data that can be transferred to or from memory (and the L2 cache).  Usually between 32 and 128 bytes;  may contain several data items.  When the L2 cache passes data to the L1 cache, or when main memory passes data to the L2 cache, a whole cache line, instead of a single piece of data, is transferred.  When the data in variable X is requested from memory, the cache line containing X (and adjacent data) is transferred to cache.  [Figure] Assume a 32-byte cache line and that X[11] is requested by the CPU. Result: the line holding X[11], here X[10] – X[13], is brought into cache from memory; the neighbors X[9] and X[14] lie in adjacent lines.
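A minimal sketch of the same idea in code (not from the slides): given the slide's 32-byte lines and 8-byte doubles, we can compute which elements of an array share a cache line with a requested element. Forcing 32-byte alignment of X is our assumption so the arithmetic is well defined; with that alignment the answer is X[8]..X[11], whereas the slide's figure assumes the line boundary happens to fall at X[10].

#include <stdio.h>
#include <stdint.h>

enum { LINE_BYTES = 32 };                 /* cache line size assumed on the slide      */

int main(void) {
    _Alignas(32) double X[16];            /* force the array onto a line boundary      */
    int requested = 11;                   /* the CPU asks for X[11]                    */
    uintptr_t addr = (uintptr_t)&X[requested];
    uintptr_t base = addr & ~(uintptr_t)(LINE_BYTES - 1);     /* start of its line     */
    int first    = (int)((base - (uintptr_t)X) / sizeof(double));
    int per_line = (int)(LINE_BYTES / sizeof(double));         /* 4 doubles per line    */
    printf("X[%d] is fetched together with X[%d]..X[%d]\n",
           requested, first, first + per_line - 1);            /* X[8]..X[11] here      */
    return 0;
}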

8 8 Cache Effect on Performance  Cache miss  degrading performance  When there is a cache miss, the CPU is idle waiting for another cache line to be brought from a lower level of the memory hierarchy  Increasing cache hit rate  higher performance  Efficiency directly related to reuse of data in cache  To increase the cache hit rate, access memory sequentially; avoid strides, random access, and indirect addressing in programming.
sequential access:    for(i=0;i<100;i++) y[i] = 2*x[i];
strided access:       for(i=0;i<100;i=i+4) y[i] = 2*x[i];
indirect addressing:  for(i=0;i<100;i++) y[i] = 2*x[index[i]];
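A hedged micro-benchmark sketch of the point above (the array size, stride values, and use of clock() are illustrative choices, not something the slides prescribe): both sweeps below perform the same number of additions, but the strided one touches memory in a cache-unfriendly order and typically runs noticeably slower.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1UL << 23)                       /* 8M doubles = 64 MB, larger than any cache */

/* Sum every element of x exactly once, visiting them with the given stride. */
static double sweep(const double *x, size_t n, size_t stride) {
    double sum = 0.0;
    for (size_t s = 0; s < stride; s++)
        for (size_t i = s; i < n; i += stride)
            sum += x[i];
    return sum;
}

int main(void) {
    double *x = malloc(N * sizeof *x);
    if (!x) return 1;
    for (size_t i = 0; i < N; i++) x[i] = 1.0;

    size_t strides[] = {1, 4, 16};          /* sequential vs. increasingly strided access */
    for (int k = 0; k < 3; k++) {
        clock_t t0 = clock();
        double s  = sweep(x, N, strides[k]);
        clock_t t1 = clock();
        printf("stride %2zu: %.3f s (checksum %.0f)\n",
               strides[k], (double)(t1 - t0) / CLOCKS_PER_SEC, s);
    }
    free(x);
    return 0;
}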

9 9 Where in Cache to Put Data from Memory  Cache is organized into cache lines.  Memory is also logically organized into cache lines.  [Figure] With 32-byte cache lines, a 1 MB cache holds 32,768 cache lines, while a 2 GB main memory holds 67,108,864 of them.  Memory size >> cache size, so the number of cache lines in memory >> the number of cache lines in cache: many cache lines in memory correspond to one cache line in cache.

10 10 Cache Classification  Direct-mapped cache  Given a memory cache line, it is always placed in one specific cache line in cache.  Fully associative cache  Given a memory cache line, it can be placed in any of the cache lines in cache.  N-way set associative cache  Given a memory cache line, it can be placed in any of a set of N cache lines in cache.

11 11 Direct-Mapped Cache  A set of memory cache lines always corresponds to exactly the same cache line in cache:  cache-line-index = mod(memory-cache-line-index, total-cache-lines-in-cache).  Cheap to implement in hardware;  may cause cache thrashing: repeatedly displacing and loading cache lines.  [Figure] e.g. for an 8 KB direct-mapped cache, the memory lines starting at addresses 0, 8K, 16K, …, all the way up to 2G map to the very same cache line.
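The placement rule as a small, self-contained sketch (the cache and line sizes are the slide's example values; the function name is ours):

#include <stdio.h>

#define LINE_BYTES   32
#define CACHE_BYTES  (8 * 1024)
#define CACHE_LINES  (CACHE_BYTES / LINE_BYTES)    /* 256 cache lines in the cache      */

/* Cache line that the memory line containing byte address addr maps to. */
static unsigned cache_line_index(unsigned long addr) {
    unsigned long mem_line = addr / LINE_BYTES;    /* memory cache-line index            */
    return (unsigned)(mem_line % CACHE_LINES);     /* mod(mem-line-index, lines-in-cache) */
}

int main(void) {
    /* Addresses that differ by a multiple of the cache size collide:
       loading one displaces the other (the root of cache thrashing). */
    printf("%u %u %u\n",
           cache_line_index(0),
           cache_line_index(8 * 1024),
           cache_line_index(16 * 1024));           /* prints: 0 0 0 */
    return 0;
}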

12 12 Cache Thrashing: Example  Assumptions:  direct-mapped cache;  cache size: 1 MB;  cache line: 32 bytes.
double X[131072], Y[131072];
long i, j;
// initialization of X, Y
…
for(i=0;i<131072;i++)
    Y[i] = X[i] + Y[i];
…
 1 double value = 8 bytes, so 131072 double values = 1 MB;  1 cache line = 32 bytes = 4 double values;  X[131072] occupies 1 MB of memory, and Y[131072] occupies 1 MB of memory.

13 13 Cache Thrashing: Example  [Figure] The cache is 1 MB = 32,768 lines; X[131072] and Y[131072] each occupy 1 MB of memory, so the lines holding X[i] and Y[i] map to the same cache line.
for(i=0;i<131072;i++) Y[i] = X[i] + Y[i];
i=0: load line X[0]-X[3] into cache; load X[0] from cache to register; load line Y[0]-Y[3] into cache, displacing line X[0]-X[3]; load Y[0] from cache into register; add, update Y[0] in cache.
i=1: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing line Y[0]-Y[3] back to memory; load X[1] from cache to register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[1] from cache to register; add, update Y[1] in cache.
i=2: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing line Y[0]-Y[3] back to memory; load X[2] from cache to register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[2] from cache to register; add, update Y[2] in cache.
i=3: …
No cache reuse! Poor performance! Avoid cache thrashing!
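One common fix, shown here only as a hedged sketch (the pad size and the use of a struct to pin the layout are our illustrative choices, not the slides'): insert padding between X and Y so that X[i] and Y[i] no longer map to the same line of the 1 MB direct-mapped cache; each loaded line is then reused for 4 consecutive iterations before being displaced.

/* Padding between the two 1 MB arrays breaks the exact 1 MB spacing between
   X[i] and Y[i]; a struct makes the relative layout explicit. */
struct data {
    double X[131072];
    double pad[4];        /* one 32-byte cache line of padding (illustrative) */
    double Y[131072];
};
static struct data d;     /* static: ~2 MB, too large for the stack */

void add_arrays(void) {
    for (long i = 0; i < 131072; i++)   /* X[i] and Y[i] now land in different  */
        d.Y[i] = d.X[i] + d.Y[i];       /* cache lines, so loaded lines are reused */
}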

14 14 Fully Associative Cache  A cache line from memory can be placed anywhere in cache;  no cache thrashing, but costly.  Direct-mapped cache sits at one extreme of the spectrum; fully associative cache sits at the other extreme.  Disadvantage: the entire cache must be searched to determine whether a specific cache line is present.

15 15 N-Way Set Associative Cache  A compromise between direct-mapped and fully associative caches.  The cache lines in cache are divided into a number of sets; each set contains N cache lines.  Given a cache line from memory, the index of the set it belongs to is computed first; then it is placed in one of the N cache lines of that set.  [Figure] 2-way set associative example: a 1 MB cache with 32-byte lines has 32,768 cache lines grouped into 16,384 sets of 2 lines each, serving a 2 GB main memory (67,108,864 memory cache lines).  Less likely to cause cache thrashing; less costly than fully associative.  A direct-mapped cache is a 1-way set associative cache; a fully associative cache is an N_c-way set associative cache, where N_c is the total number of cache lines in the cache.
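For the slide's 2-way example, the set-selection step can be sketched like this (the sizes are the slide's example values; the function name is ours):

#include <stdio.h>

#define LINE_BYTES 32
#define NUM_SETS   16384                 /* 32,768 cache lines / 2 ways */

/* Set that the memory line containing byte address addr belongs to;
   within the set it may occupy either of the 2 ways. */
static unsigned set_index(unsigned long addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

int main(void) {
    /* Two addresses 512 KB apart fall in the same set but can coexist
       (one per way); only a third conflicting line forces an eviction. */
    printf("%u %u\n", set_index(0), set_index(512 * 1024));   /* prints: 0 0 */
    return 0;
}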

16 16 Instruction/Data Cache  CPU may have separate instruction cache and data cache (split cache).  CPU may have a single cache, for both instructions and data from memory (unified cache).

17 17 Remember …  Efficiency is directly related to cache reuse  Cache thrashing can be removed by padding arrays (leading array dimensions should not be large powers of 2, so that successively accessed elements do not all map to the same cache lines – see the sketch below)  To improve cache reuse:  access memory sequentially as much as possible;  avoid strides, random access, and indirect addressing;  avoid cache thrashing.
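A hedged illustration of the padding advice (the sizes and the pad of one extra element are our choices, not the slides'): walking down a column of a 1024×1024 double array jumps 8 KB per access, a power-of-two stride that keeps hitting the same cache lines and sets; padding each row to 1025 doubles breaks that alignment.

#define N 1024
static double A[N][N + 1];        /* pad each row by one extra double */

/* Column walk: without padding the stride is 8192 bytes (a power of 2);
   with the pad it becomes 8200 bytes, so successive elements spread
   across different cache lines and sets instead of colliding. */
double column_sum(int j) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += A[i][j];
    return s;
}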

18 18 Example  Large stride in the memory access pattern results in not only cache misses/poor reuse, but also TLB misses.
double X[1024][1024], Y[1024][1024];
int i,j;
…
for(j=0;j<1024;j++)
    for(i=0;i<1024;i++)
        X[i][j] = Y[i][j];
 [Figure] In memory the arrays are laid out row by row: X[0][0], X[0][1], …, X[0][1023], X[1][0], X[1][1], …, X[1023][1023] (and likewise for Y). The inner loop over i therefore jumps 1024 doubles between consecutive accesses  a stride of 1024, or 8 KB.
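A minimal fix, sketched here (not shown on the slide): interchange the loops so the innermost index walks contiguous memory; each cache line and each TLB page is then fully used before moving on.

double X[1024][1024], Y[1024][1024];

void copy_arrays(void) {
    /* Loop interchange: the inner loop over j now has unit stride (8 bytes),
       instead of the 8 KB stride of the original loop nest. */
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            X[i][j] = Y[i][j];
}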

19 19 Virtual Memory, Memory Paging  Modern computers use virtual memory;  the memory address seen in a program (virtual address) is not the actual address in physical memory;  memory is divided into pages (e.g. 4 KB);  a memory page in the program's address space corresponds to a page in physical memory;  to access memory, the program's virtual address must be translated to the actual address in physical memory; this is done using a page table.  [Figure] Program #1 and Program #2 each see their own 0 – 2 GB virtual address space split into 4 KB pages (0, 4KB, 8KB, …); those pages map to pages scattered through the 4 GB physical memory (e.g. at 1024KB, 1028KB, 1032KB, 1036KB, 1040KB, 1044KB, 1048KB, …).
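The translation step itself is simple arithmetic plus a table lookup; here is a hedged toy sketch (the page-table contents and the address are made up for illustration):

#include <stdio.h>

#define PAGE_BYTES 4096UL                          /* 4 KB pages, as on the slide */

int main(void) {
    unsigned long vaddr = 0x2345;                  /* some virtual address        */
    unsigned long vpage = vaddr / PAGE_BYTES;      /* virtual page number: 2      */
    unsigned long off   = vaddr % PAGE_BYTES;      /* offset within page: 0x345   */

    /* Toy page table: virtual page -> physical page (illustrative values only). */
    unsigned long page_table[] = { 7, 3, 260, 9 };

    unsigned long paddr = page_table[vpage] * PAGE_BYTES + off;
    printf("virtual 0x%lx -> physical 0x%lx\n", vaddr, paddr);
    return 0;
}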

20 20 Translation Look-aside Buffer (TLB)  TLB is a special cache for the page tables  Faster access to TLB for virtual-physical translation.  When program accesses a memory location, the translation between virtual and physical pages is loaded into TLB (if it is not already there);  If program exhibits locality of references, entries in TLB can be reused  TLB hit  better performance  Otherwise  TLB miss  performance degrades.  Large stride in memory access pattern  TLB miss (and cache miss).

21 21 Remedies  Use a large memory page size  On some systems, the memory page size can be modified by user programs, e.g. IBM SP, HP machines  Avoid large strides in memory access; access memory sequentially as much as possible.

22 22 Interleaved Memory  Memory interleaving alleviates the impact of memory cycle time.  Total memory is divided into a set of memory banks;  contiguous memory addresses reside on different banks.  When accessing memory sequentially, the effect of memory cycle time is minimized  while the current bank is busy, the next bank is idle and can be accessed immediately.  Strided memory access is not favorable  it may hit the same bank repeatedly and must wait out the cycle time  poor performance.  [Figure] Example: 2 GB of total memory divided into 4 banks of 512 MB each, interleaved at cache-line (32-byte) granularity: bytes 0-31 go to Bank 1, 32-63 to Bank 2, 64-95 to Bank 3, 96-127 to Bank 4, then 128-159 back to Bank 1, 160-191 to Bank 2, 192-223 to Bank 3, 224-255 to Bank 4, and so on.
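A small sketch of that bank layout (the bank count and line size are the slide's example values; the function name is ours):

#include <stdio.h>

#define LINE_BYTES 32
#define NUM_BANKS  4

/* Bank (numbered 0..3) that serves the cache line containing byte address addr. */
static int bank_of(unsigned long addr) {
    return (int)((addr / LINE_BYTES) % NUM_BANKS);
}

int main(void) {
    /* Sequential cache lines rotate through the banks ...                */
    printf("%d %d %d %d %d\n",
           bank_of(0), bank_of(32), bank_of(64), bank_of(96), bank_of(128));   /* 0 1 2 3 0 */
    /* ... but a 128-byte stride keeps hitting the same bank and must
       repeatedly wait out the memory cycle time.                         */
    printf("%d %d %d\n", bank_of(0), bank_of(128), bank_of(256));              /* 0 0 0 */
    return 0;
}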

