1 Parallel Scientific Computing: Algorithms and Tools. Lecture #2. APMA 2821A, Spring 2008. Instructors: George Em Karniadakis, Leopold Grinberg.
2 Memory. Bits: 0, 1. Bytes: 8 bits. Memory size: PB = 10^15 bytes; TB = 10^12 bytes; GB = 10^9 bytes; MB = 10^6 bytes; KB = 10^3 bytes. Memory performance measures: Access time (also called response time or latency) is the interval between the time a memory request is issued and the time the request is satisfied. Cycle time is the minimum time between two successive memory requests. [Timeline figure: a request is issued at t0 and satisfied at t1; the next request can be accepted at t2. Access time: t1 - t0. Cycle time: t2 - t0. If another request arrives at t0 < t < t2, the memory is busy (DRAM only).]
3 Memory Hierarchy. Memory can be fast (costly) or slow (cheaper). To increase overall performance, use locality of reference: place faster (and smaller) memory closer to the CPU and slower (and larger) memory farther away from it. Keep often-used data in fast memory; leave less-often-used data in slow memory. Key: when a lower level of the hierarchy sends the value at location x to a higher level, it also sends the contents at x+1, x+2, etc.; that is, it sends a block of data (a cache line).
4 Memory Hierarchy. Performance of different levels can be very different: e.g., access time for L1 cache can be 1 cycle, L2 can be 5 or 6 cycles, main memory can be dozens of cycles, and secondary memory can be orders of magnitude slower. Levels, from fastest to slowest: registers; level-1 cache; level-2 cache; main memory; secondary memory (hard disk); network storage; … Moving toward the registers: increasing speed, increasing cost, decreasing size. Moving toward network storage: decreasing speed, decreasing cost, increasing size. Cache: a piece of fast memory. Expensive (CA$H?).
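The cycle counts on this slide can be combined into a rough average access time. The sketch below is an illustration rather than part of the lecture: the function name `avg_access_cycles` and the miss-rate parameters are assumptions, and the model covers only L1, L2, and main memory.

```c
/* Average cycles per memory access for two cache levels in front of
 * main memory. Every access pays the L1 time; L1 misses also pay the
 * L2 time; L2 misses also pay the main-memory time.
 * All inputs are illustrative assumptions, not measurements. */
double avg_access_cycles(double l1_cycles, double l1_miss_rate,
                         double l2_cycles, double l2_miss_rate,
                         double mem_cycles) {
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * mem_cycles);
}
```

With assumed values of 1, 6, and 50 cycles and miss rates of 5% (L1) and 20% (L2), this gives about 1.8 cycles per access, which shows how strongly the result depends on the hit rate of the fastest level.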
5 How Memory Hierarchy Works (RISC processor). The CPU works only on data in registers. If the data is not in a register, the CPU requests it from memory and loads it into a register. Data in registers comes only from, and goes only to, the L1 cache. When the CPU requests data from memory, the L1 cache takes over: if the data is in L1 cache (cache hit), it is returned to the CPU immediately and the memory access ends; if the data is not in L1 cache (cache miss) …
6 How Memory Hierarchy Works. If the data is not in L1 cache, L1 forwards the memory request down to L2 cache. If L2 has the data (cache hit), it returns the data to L1, which in turn returns it to the CPU, and the memory access ends. If L2 does not have the data (cache miss), L2 forwards the request down to main memory. If the data is in main memory, main memory passes it to L2 cache, which passes it to L1 cache, which passes it to the CPU. If the data is not in main memory, the request is passed to the OS, which reads the data from secondary storage (disk); the data is then passed to main memory, L2 cache, L1 cache, and register.
7 Cache Line. A cache line is the smallest unit of data that can be transferred to or from memory (and L2 cache). It is usually between 32 and 128 bytes and may contain several data items. When L2 cache passes data to L1 cache, or when main memory passes data to L2 cache, a whole cache line is transferred instead of a single piece of data. When the data in variable X is requested from memory, the cache line containing X (and adjacent data) is transferred to cache. [Figure: a row of X elements with one cache line marked.] Assume a 32-byte cache line and that one element of X is requested by the CPU. Result: the entire cache line containing that element, and hence its neighbouring elements of X, is brought into cache from memory.
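As a small illustration of which bytes travel together, the sketch below computes the start of the cache line that contains a given byte address. The helper names and the 32-byte `LINE_SIZE` are assumptions chosen to match the slide's example, not values taken from any particular processor.

```c
#include <stdint.h>

/* Assumed line size, matching the 32-byte example on the slide. */
#define LINE_SIZE 32u

/* First byte address of the cache line that contains addr. */
uint64_t line_base(uint64_t addr) {
    return addr - (addr % LINE_SIZE);
}

/* Number of 8-byte doubles that fit in one cache line. */
unsigned doubles_per_line(void) {
    return LINE_SIZE / 8u;
}
```

Under these assumptions a request for the double at byte offset 40 brings in bytes 32 to 63, that is, four consecutive doubles.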
8 Cache Effect on Performance. A cache miss degrades performance: when there is a cache miss, the CPU is idle, waiting for another cache line to be brought from a lower level of the memory hierarchy. Increasing the cache hit rate gives higher performance. Efficiency is directly related to reuse of data in cache. To increase the cache hit rate, access memory sequentially; avoid strides, random access, and indirect addressing in programming.
    Sequential access:   for (i = 0; i < 100; i++) y[i] = 2*x[i];
    Strided access:      for (i = 0; i < 100; i = i+4) y[i] = 2*x[i];
    Indirect addressing: for (i = 0; i < 100; i++) y[i] = 2*x[index[i]];
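To make the difference between these access patterns concrete, the following sketch counts how often consecutive accesses to `x` land on a different cache line. It is a crude stand-in for a miss count, as if the cache held only one line of `x`; the function name `line_changes` and the figure of 4 doubles per line (32-byte lines) are assumptions.

```c
#include <stddef.h>

#define DOUBLES_PER_LINE 4  /* 32-byte line / 8-byte double (assumed) */

/* Count accesses that fall on a different cache line than the previous
 * access. idx[k] is the element index of x used by the k-th access. */
size_t line_changes(const size_t *idx, size_t n) {
    size_t changes = 0;
    size_t prev_line = (size_t)-1;   /* sentinel: no line loaded yet */
    for (size_t k = 0; k < n; k++) {
        size_t line = idx[k] / DOUBLES_PER_LINE;
        if (line != prev_line) {
            changes++;
            prev_line = line;
        }
    }
    return changes;
}
```

Fed with the indices 0, 1, …, 99 it reports 25 line changes for 100 accesses; fed with the stride-4 indices 0, 4, …, 96 it reports 25 line changes for only 25 accesses, so every strided access needs a new line. For indirect addressing the count depends entirely on the contents of `index`.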
9 Where in Cache to Put Data from Memory. Cache is organized into cache lines. Memory is also logically organized into cache lines. Example with a 32-byte cache line: a 1 MB cache holds 32,768 cache lines; a 2 GB main memory holds 67,108,864 cache lines. Memory size >> cache size, so the number of cache lines in memory >> the number of cache lines in cache. Many cache lines in memory correspond to one cache line in cache.
10 Cache Classification. Direct-mapped cache: a given memory cache line is always placed in one specific cache line in cache. Fully associative cache: a given memory cache line can be placed in any of the cache lines in cache. N-way set associative cache: a given memory cache line can be placed in any of a set of N cache lines in cache.
11 Direct-Mapped Cache. A set of memory cache lines always corresponds to exactly the same cache line in cache. Cheap to implement in hardware, but may cause cache thrashing: repeatedly displacing and loading cache lines. [Figure: an 8 KB cache beside main memory marked at 0, 8K, 16K, …, 2G; memory lines at these offsets map to the same cache line.] Line-Index = Mod(mem-cache-line-index, tot-cache-lines-in-cache)
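The mapping formula can be written out directly. In this sketch the function name and its parameters are illustrative; the 8 KB cache and 32-byte line used in the note below are taken from this slide and the earlier cache-line example.

```c
#include <stdint.h>

/* Cache line index for a byte address in a direct-mapped cache:
 * Line-Index = Mod(mem-cache-line-index, tot-cache-lines-in-cache). */
uint64_t direct_mapped_index(uint64_t addr, uint64_t cache_bytes,
                             uint64_t line_bytes) {
    uint64_t mem_line = addr / line_bytes;            /* line index in memory */
    uint64_t lines_in_cache = cache_bytes / line_bytes;
    return mem_line % lines_in_cache;
}
```

For an 8 KB cache with 32-byte lines, the addresses 0, 8192, and 16384 all yield index 0, which is exactly the situation that makes thrashing possible.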
12 Cache Thrashing: Example. Assumptions: direct-mapped cache; cache size: 1 MB; cache line: 32 bytes.
    double X[131072], Y[131072];
    long i, j;
    // initialization of X, Y
    …
    for (i = 0; i < 131072; i++) Y[i] = X[i] + Y[i];
    …
1 double value = 8 bytes; 131072 double values = 1 MB; 1 cache line = 32 bytes = 4 double values. X occupies 1 MB of memory; Y occupies 1 MB of memory.
13 Cache Thrashing: Example. [Figure: a 1 MB cache (32768 lines) beside memory, in which X occupies 1 MB (32768 lines) followed by Y; corresponding lines of X and Y map to the same cache line.]
i=0: load line X[0] to X[3] into cache; load X[0] from cache to register; load line Y[0] to Y[3] into cache, displacing line X[0] to X[3]; load Y[0] from cache into register; add, update Y[0] in cache.
i=1: load line X[0] to X[3] into cache, displacing line Y[0] to Y[3], which is written back to memory; load X[1] from cache to register; load line Y[0] to Y[3] into cache, displacing line X[0] to X[3]; load Y[1] from cache to register; add, update Y[1] in cache.
i=2: load line X[0] to X[3] into cache, displacing line Y[0] to Y[3], which is written back to memory; load X[2] from cache to register; load line Y[0] to Y[3] into cache, displacing line X[0] to X[3]; load Y[2] from cache to register; add, update Y[2] in cache.
i=3: …
No cache reuse! Poor performance! Avoid cache thrashing!
    double X[131072], Y[131072];
    long i, j;
    // initialization of X, Y
    …
    for (i = 0; i < 131072; i++) Y[i] = X[i] + Y[i];
    …
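The walk-through above can be checked with a toy simulation. The sketch below models only the tags of a direct-mapped 1 MB cache with 32-byte lines and counts misses for the loads of X[i] and Y[i]; the store to Y[i] always hits the line just loaded, and write-backs are not modelled. The function names and the base addresses passed in are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_BYTES (1u << 20)                  /* 1 MB cache */
#define LINE_BYTES  32u                         /* 32-byte cache line */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)  /* 32768 lines */

static uint64_t tags[NUM_LINES];        /* memory line held by each slot */
static unsigned char valid[NUM_LINES];  /* 1 if the slot holds a line */

/* Simulate one access; returns 1 on a miss, 0 on a hit. */
static int access_addr(uint64_t addr) {
    uint64_t mem_line = addr / LINE_BYTES;
    uint64_t slot = mem_line % NUM_LINES;   /* direct-mapped placement */
    if (valid[slot] && tags[slot] == mem_line)
        return 0;
    valid[slot] = 1;                        /* displace whatever was there */
    tags[slot] = mem_line;
    return 1;
}

/* Misses for: for (i = 0; i < n; i++) Y[i] = X[i] + Y[i];
 * x_base and y_base are the byte addresses of X[0] and Y[0]. */
long count_misses(uint64_t x_base, uint64_t y_base, long n) {
    long misses = 0;
    memset(valid, 0, sizeof valid);         /* start with an empty cache */
    for (long i = 0; i < n; i++) {
        misses += access_addr(x_base + 8u * (uint64_t)i);  /* load X[i] */
        misses += access_addr(y_base + 8u * (uint64_t)i);  /* load Y[i] */
    }
    return misses;
}
```

With X at address 0 and Y exactly 1 MB later, `count_misses(0, 1u << 20, 131072)` reports 262144 misses, one for every load. Moving Y by a single cache line, `count_misses(0, (1u << 20) + 32, 131072)`, reduces this to 65536, one miss per cache line of each array.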
14 Fully Associative Cache. A cache line from memory can be placed anywhere in cache. No cache thrashing, but costly. Disadvantage: the entire cache must be searched to determine whether a specific cache line is present. Direct-mapped cache is at one extreme of the spectrum; fully associative cache is at the other.
15 N-Way Set Associative Cache. A compromise between direct-mapped cache and fully associative cache. The cache lines in cache are divided into a number of sets; each set contains N cache lines. Given a cache line from memory, the index of the set it belongs to is calculated first; the line is then placed in one of the N cache lines in this set. Example (2-way set associative cache): main memory of 2 GB (67,108,864 cache lines); cache of 1 MB with 32,768 cache lines in 16,384 sets, each set holding 2 lines. Less likely to cause cache thrashing than a direct-mapped cache; less costly than a fully associative cache. Direct-mapped cache is a 1-way set associative cache; fully associative cache is an N_c-way set associative cache, where N_c is the total number of cache lines in cache.
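A minimal sketch of the 2-way case, applied to the same loop `Y[i] = X[i] + Y[i]` used in the thrashing example. The set index is taken as the memory line index modulo the number of sets, and the victim within a set is chosen by least-recently-used replacement; the replacement policy and all names here are assumptions, since the slide does not specify them.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u
#define NUM_SETS   16384u   /* 1 MB cache / 32-byte lines / 2 ways */
#define WAYS       2

static uint64_t tags[NUM_SETS][WAYS];
static unsigned char valid[NUM_SETS][WAYS];
static unsigned char lru[NUM_SETS];   /* way to replace next in each set */

/* Simulate one access; returns 1 on a miss, 0 on a hit. */
static int access_addr(uint64_t addr) {
    uint64_t mem_line = addr / LINE_BYTES;
    uint64_t set = mem_line % NUM_SETS;        /* set index comes first */
    for (int w = 0; w < WAYS; w++) {
        if (valid[set][w] && tags[set][w] == mem_line) {
            lru[set] = (unsigned char)(1 - w); /* other way is now oldest */
            return 0;
        }
    }
    int victim = lru[set];                     /* replace the oldest way */
    valid[set][victim] = 1;
    tags[set][victim] = mem_line;
    lru[set] = (unsigned char)(1 - victim);
    return 1;
}

/* Misses for the loads of X[i] and Y[i], i = 0..n-1 (8-byte doubles). */
long count_misses_2way(uint64_t x_base, uint64_t y_base, long n) {
    long misses = 0;
    memset(valid, 0, sizeof valid);
    memset(lru, 0, sizeof lru);
    for (long i = 0; i < n; i++) {
        misses += access_addr(x_base + 8u * (uint64_t)i);
        misses += access_addr(y_base + 8u * (uint64_t)i);
    }
    return misses;
}
```

Even with Y placed exactly 1 MB after X, `count_misses_2way(0, 1u << 20, 131072)` gives 65536 misses: the conflicting X and Y lines fall in the same set but occupy its two different ways, so each line is loaded only once.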
16 Instruction/Data Cache. A CPU may have separate instruction and data caches (split cache), or a single cache for both instructions and data from memory (unified cache).
17 Remember … Efficiency is directly related to cache reuse. Cache thrashing can be avoided by padding arrays (array dimensions should not be a multiple of the cache size; avoid powers of 2). To improve cache reuse: access memory sequentially as much as possible; avoid strides, random access, and indirect addressing; avoid cache thrashing.
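One possible way to pad, sketched under the assumptions of the thrashing example (131072 doubles per array, 32-byte lines, 1 MB direct-mapped cache): place an unused gap of one cache line between the two arrays so that their starting addresses no longer differ by a multiple of the cache size. The struct layout and names are hypothetical.

```c
#include <stddef.h>

#define N   131072   /* 1 MB of doubles per array */
#define PAD 4        /* 4 doubles = one 32-byte cache line (assumed) */

/* X and Y laid out back to back with one cache line of padding. */
struct padded_arrays {
    double x[N];
    double pad[PAD];   /* never read or written; only shifts y */
    double y[N];
};

/* Distance in bytes from x[0] to y[0]. */
size_t xy_offset_bytes(void) {
    return offsetof(struct padded_arrays, y)
         - offsetof(struct padded_arrays, x);
}
```

The offset becomes 1 MB plus 32 bytes, so `x[i]` and `y[i]` map to neighbouring cache lines instead of the same one. An object of this type is about 2 MB and is better allocated statically or on the heap than on the stack.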
18 Example. A large stride in the memory access pattern results not only in cache misses and poor reuse, but also in TLB misses.
    double X[1024][1024], Y[1024][1024];
    int i, j;
    …
    for (j = 0; j < 1024; j++)
        for (i = 0; i < 1024; i++)
            X[i][j] = Y[i][j];
[Figure: successive accesses to X, and likewise to Y, are separated by a stride of 1024 elements, i.e. 8 KB.]
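A common remedy for this pattern is to interchange the loops so that the last index varies fastest, which is the contiguous direction for C's row-major arrays. The sketch below uses flat arrays and hypothetical function names; it also computes the stride of the original loop order for comparison.

```c
#include <stddef.h>

/* Byte distance between a[i][j] and a[i+1][j] in a row-major array of
 * doubles with ncols columns: the stride of the inner loop above. */
size_t column_walk_stride_bytes(size_t ncols) {
    return ncols * sizeof(double);
}

/* Same copy with the loops interchanged: the inner loop runs over the
 * last index, so consecutive accesses touch adjacent doubles.
 * x and y each point to n*n doubles stored row by row. */
void copy_rowwise(size_t n, double *x, const double *y) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            x[i * n + j] = y[i * n + j];
}
```

With 8-byte doubles and 1024 columns, the original order jumps 8 KB per access, while the interchanged order moves 8 bytes per access and uses every element of each cache line it loads.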
19 Virtual Memory, Memory Paging. [Figure: Program #1 and Program #2 each see an address space from 0 to 2 GB divided into 4 KB pages (0, 4KB, 8KB, …, 2GB); their pages map to pages of a 4 GB physical memory (0, …, 1024KB, 1028KB, 1032KB, 1036KB, 1040KB, 1044KB, 1048KB, …, 4GB).] Modern computers use virtual memory. The memory address seen in a program (virtual address) is not the actual address in physical memory. Memory is divided into pages (e.g. 4 KB). A memory page in the program's address space corresponds to a page in physical memory. To access memory, the program's virtual address must be translated to the actual address in physical memory; this is done using a page table.
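The translation step can be sketched with a toy, single-level page table. Real page tables are multi-level structures managed by the OS and hardware; the flat array, the function names, and the 4 KB page size used here are simplifying assumptions for illustration only.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u   /* 4 KB pages, as in the slide's example */

/* Virtual page number and offset within the page. */
uint64_t page_number(uint64_t vaddr) { return vaddr / PAGE_SIZE; }
uint64_t page_offset(uint64_t vaddr) { return vaddr % PAGE_SIZE; }

/* Translate a virtual address using a flat table in which entry v holds
 * the physical page number for virtual page v (hypothetical toy table). */
uint64_t translate(const uint64_t *page_table, uint64_t vaddr) {
    return page_table[page_number(vaddr)] * PAGE_SIZE + page_offset(vaddr);
}
```

For instance, if virtual page 1 is mapped to physical page 258, the virtual address 4096 + 12 translates to 258 * 4096 + 12: the offset within the page is unchanged and only the page number is replaced.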
20 Translation Look-aside Buffer (TLB). The TLB is a special cache for the page tables, giving faster virtual-to-physical translation. When a program accesses a memory location, the translation between virtual and physical pages is loaded into the TLB (if it is not already there). If the program exhibits locality of reference, entries in the TLB can be reused: TLB hit, better performance. Otherwise: TLB miss, performance degrades. A large stride in the memory access pattern leads to TLB misses (and cache misses).
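The effect of stride on the TLB can be illustrated with a toy model. The sketch below assumes a direct-mapped TLB with 64 entries and 4 KB pages; both numbers and the function name are assumptions, and real TLBs differ in size, associativity, and replacement policy.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096u
#define TLB_ENTRIES 64u    /* assumed size of the toy TLB */

/* Count TLB misses for n_accesses accesses starting at address 0 and
 * separated by stride_bytes. Each entry caches one page translation. */
long tlb_misses(uint64_t stride_bytes, long n_accesses) {
    uint64_t entry[TLB_ENTRIES] = {0};   /* page number held by each slot */
    unsigned char valid[TLB_ENTRIES];
    long misses = 0;
    memset(valid, 0, sizeof valid);
    for (long k = 0; k < n_accesses; k++) {
        uint64_t page = ((uint64_t)k * stride_bytes) / PAGE_SIZE;
        uint64_t slot = page % TLB_ENTRIES;
        if (!valid[slot] || entry[slot] != page) {
            valid[slot] = 1;             /* load the translation */
            entry[slot] = page;
            misses++;
        }
    }
    return misses;
}
```

In this model 4096 accesses with an 8-byte stride touch only 8 pages and miss 8 times, whereas 4096 accesses with an 8 KB stride land on a new page every time and miss 4096 times.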
21 Remedies. Use a large memory page size; on some systems the memory page size can be modified by user programs, e.g. IBM SP, HP machines. Avoid large strides in memory access; access memory sequentially as much as possible.
22 Interleaved Memory. Memory interleaving alleviates the impact of memory cycle time. Total memory is divided into a set of memory banks; contiguous memory addresses reside on different banks. When memory is accessed sequentially, the effect of memory cycle time is minimized: while the current bank is busy, the next bank is idle and can be accessed immediately. Strided memory access is not favorable: it may access the same bank repeatedly and must wait out the cycle time, giving poor performance. Example: 2 GB total memory divided into 4 memory banks of 512 MB each, with an assumed cache line of 32 bytes; each entry below is 1 cache line (32 bytes).
    Bank 1: bytes 0-31, 128-159, …
    Bank 2: bytes 32-63, 160-191, …
    Bank 3: bytes 64-95, 192-223, …
    Bank 4: bytes 96-127, 224-255, …
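The bank layout in this example follows a simple rule: consecutive cache lines go to consecutive banks, wrapping around after the last one. A minimal sketch, with the function name as an assumption and the 32-byte line and 4 banks taken from the slide:

```c
#include <stdint.h>

#define LINE_BYTES 32u   /* assumed cache line size from the slide */
#define NUM_BANKS  4u    /* four banks, numbered 1 to 4 as on the slide */

/* Bank (1..4) that holds the cache line containing byte address addr. */
unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_BANKS) + 1u;
}
```

Sequential addresses 0, 32, 64, 96 visit banks 1, 2, 3, 4 in turn, while a stride of 128 bytes (addresses 0, 128, 256, …) hits bank 1 every time and therefore runs into the cycle time on each access.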